The Sim Book
A guide to building and testing distributed systems with moonpool, a deterministic simulation framework for Rust.
Moonpool brings together ideas from FoundationDB, TigerBeetle, and Antithesis into a single library-level framework. Write your system once with provider traits, then run it against a simulated world that is deliberately worse than production. Same code, different wiring. No #[cfg(test)]. No mocks.
This book covers the philosophy, the architecture, and the practical details of building simulation-tested distributed systems.
Index
- Quick Start Routes
- Part I: Why Simulation Testing
- Part II: Foundations
- Part III: Building Simulations
- Part IV: Simulating Existing Applications
- Part V: Networking and RPC
- Part VI: Building on Top
- Appendix
A sitemap of every chapter in the Moonpool book. Each entry links to a chapter with a summary of what it covers.
Quick Start Routes
- “What is Moonpool?” — The Case for Simulation, then Why Moonpool Exists
- “How do I write my first simulation?” — Your First Simulation and its sub-chapters
- “How do providers work?” — The Provider Pattern
- “How do I add chaos/faults?” — Chaos in Moonpool
- “How do I use assertions?” — Assertions: Finding Bugs
- “How does networking/RPC work?” — Simulating the Network
- “How do I test an existing app (e.g. axum)?” — Using moonpool-sim Standalone
- “How does multiverse exploration work?” — Multiverse Exploration
- “What assertions are available?” — Assertion Reference
- “What configuration options exist?” — Configuration Reference
Part I: Why Simulation Testing
- The Case for Simulation — Why distributed systems need simulation; the gap between localhost and production; failure statistics
- Prevention vs Discovery — Two testing philosophies: regression (prevention) vs generative (discovery)
- From Mocks to Simulation — Why mocks break at scale; the #[cfg(test)] trap; maintenance cost
- A Brief History — FoundationDB simulator origins, TigerBeetle storage faults, Antithesis assertions
- Why Moonpool Exists — Synthesizing ideas from FDB, TigerBeetle, and Antithesis into one framework
Part II: Foundations
- Determinism as a Foundation — Three non-determinism sources: threads, I/O, randomness; why reproducibility matters
- The Single-Core Constraint — Single-threaded execution guarantees one legal ordering; tokio local runtime
- Seed-Driven Reproducibility — One u64 seed controls entire simulation; ChaCha8Rng; cross-platform determinism
- The Provider Pattern — Five traits (Time, Network, Task, Random, Storage) abstract all I/O; swap real vs simulated
- Quick Start: Swapping Implementations — Practical example: generic function running against TokioProviders or SimProviders
- Deep Dive: Why Providers Exist — Problems with #[cfg(test)] and mocks; providers eliminate both
- The Five Providers — TimeProvider, NetworkProvider, TaskProvider, RandomProvider, StorageProvider details
- System Under Test vs Test Driver — Process (server code) vs Workload (test driver); two distinct roles
- Process: Your Server — Process trait: name(), run(); recreated fresh on every boot from factory
- Workload: Your Test Driver — Workload trait: setup(), run(), check(); survives reboots; drives and validates
Part III: Building Simulations
- Your First Simulation — End-to-end walkthrough: KV server process, workload, assertions, builder
- Defining a Process — KvServer implementing Process trait; handling TCP; respecting shutdown
- Writing a Workload — KvWorkload tracking state; sending requests; validating responses
- Configuring the SimulationBuilder — Builder pattern: .workload(), .processes(), chaos config, iterations
- Running and Observing — cargo xtask sim run; reading reports; simulation binary structure
- Chaos Testing vs Simulation — Chaos engineering (production, reactive) vs simulation (deterministic, proactive)
- Chaos in Moonpool — Four fault dimensions: buggify, attrition, network faults, storage faults
- Buggify: Fault Injection — Two-phase activation; testing error paths; FoundationDB-inspired
- Attrition: Process Reboots — Graceful, crash, wipe reboot types; randomized kills; recovery delay
- Network Faults — Connection-level: latency, partition, drops, reordering, clogging
- Storage Faults — TigerBeetle-inspired: corruption, misdirected I/O, phantom writes, sync failures; per-process storage config and crash/wipe scoped by IP
- Assertions: Finding Bugs — Record and continue (Antithesis principle); cascade discovery
- Invariants vs Discovery vs Guidance — Three assertion categories: invariants, sometimes, numeric
- Always and Sometimes — assert_always! (must hold) vs assert_sometimes! (exploration guidance)
- Numeric Assertions — assert_always_less_than!; watermark tracking; explorer optimizes bounds
- Compound Assertions — assert_sometimes_all! for simultaneous sub-goals; frontier tracking
- System Invariants — Invariant trait runs after every event; cross-system properties; conservation laws
- Event Timelines — Append-only typed timelines for temporal invariants; fault timeline auto-emitted by simulator
- Designing Workloads That Find Bugs — Targeted adversarial design vs white noise; strategy matters
- Debugging a Failing Seed — Five-step workflow: reproduce, isolate, understand, fix, verify
- Reproducing with FixedCount — Pin seed with set_debug_seeds() + set_iterations(1); exact replay
- Reading the Event Trace — Event queue ordering; RUST_LOG=trace; causal chain reconstruction
- Common Pitfalls — Don’t stop().await in workloads (deadlock); use drop() instead
- Discovering Properties — Systematic property discovery using attention focuses; finding where to place assertions and buggify
Part IV: Simulating Existing Applications
- Using moonpool-sim Standalone — Standalone simulation engine for existing code (axum, Postgres, etc.)
- Where to Draw the Line — Fakes vs test containers; binary failure limitations
- Wiring a Web Service — Worked example: axum service in simulation with Store trait fake, chaos, assertions
- What You’re Testing (and What You’re Not) — Tests handler logic and HTTP under chaos; doesn’t test TLS, proxies, startup code
Part V: Networking and RPC
- Simulating the Network — TCP-level simulation; connection-level faults; FlowTransport architecture
- Peers and Connections — Logical connection resilience; reconnection on drop; message draining
- Backoff and Reconnection — Exponential backoff (FDB pattern); prevents storms; 100ms initial, 30s max
- Wire Format — Packet layout: length, checksum, token, payload; CRC32 validation
- RPC with #[service] — Proc macro: write trait, get client/server/endpoints generated
- Defining a Service — #[service(id = ...)] trait, request/response types, serialization
- Server, Client, and Endpoints — Server setup, client connection, endpoint routing, RequestStream, ReplyPromise
- Delivery Modes — Four guarantees: send, try_get_reply, get_reply, get_reply_unless_failed_for
- Failure Monitor — Address-level and endpoint-level reachability tracking
- Load Balancing and Fan-Out — load_balance() with QueueModel, plus four fan-out shapes (all/quorum/race/partial)
- Designing Simulation-Friendly RPC — Idempotent design, versioning, bounded retries, deduplication, causality
Part VI: Building on Top
- Multiverse Exploration — Checkpoint-and-branch with fork(); timeline tree; exponential trial reduction
- The Exploration Problem — Sequential Luck Problem: N unlikely events need exponential trials without branching
- Fork at Discovery — Unix fork() copies process; reseed with FNV-1a; tree of timelines
- Coverage and Energy Budgets — Fixed-count splitting; global energy cap; prevents exponential blowup
- Adaptive Forking — Batch-based exploration; productive marks earn more; barren marks cut early
- Multi-Seed Exploration — Coverage-preserving seed transitions; selective reset; explored map carries forward
Appendix
- Assertion Reference — Complete table of 15 assertion macros with behavior and parameters
- Crate Map — 8-crate workspace diagram and dependency hierarchy
- Configuration Reference — SimulationBuilder methods, ChaosConfiguration, AttritionConfiguration, exploration
- Fault Reference — Every fault by category with config fields and defaults
- Glossary — Alphabetical definitions: adaptive forking, always assertion, attrition, buggify, coverage bitmap, etc.
State of Moonpool
Moonpool is a hobby project under active development. It is not production-ready, and the APIs will change. This book documents the framework as it exists today, with honest markers for what works and what remains experimental.
What works
The simulation engine is the most mature piece. Single-threaded deterministic execution, seed-driven reproducibility, and simulated time advancement all function as described in this book and exercised by the simulation binaries shipped with the repository. A failing seed gives you a reproducible local debugging session, every time.
Chaos testing covers network faults (partitions, latency, connection drops, reordering) and storage faults (corruption, torn writes, misdirected I/O) with per-process scoping so each node experiences independent failures. The BUGGIFY-style injection system runs at configurable probability, and the Hurst exponent manipulation produces correlated, cascading failures that mirror real datacenter behavior.
The assertion suite implements the full Antithesis-inspired taxonomy: always, sometimes, reachable, unreachable, numeric, and sometimes-all assertions. These live in shared memory and survive fork boundaries for multiverse exploration.
Transport and RPC provide a trait-based networking layer with peer connections, wire format, and service definitions via proc-macro. The same code runs against real TCP or the simulated network.
Fork-based multiverse exploration is operational: coverage-guided forking, adaptive energy budgets, and multi-seed exploration with coverage preservation across seeds.
What is experimental
Parallel exploration (multi-process exploration across CPU cores) is not yet implemented.
How to read this book
Part I stands alone as philosophy. You can read it without ever touching moonpool code, and the ideas apply to any simulation framework.
Parts II through V are practical and reflect current APIs. Code examples compile against the latest version, but expect them to evolve. When an API is experimental, the text says so.
Part VI on multiverse exploration describes the most novel piece of moonpool. If you are evaluating whether fork-based exploration matters for your use case, start there.
The Case for Simulation
You deploy on a Friday afternoon. The tests are green. Code review was thorough. You even ran the integration suite twice. You go home.
At 2am, your phone lights up. A network partition isolated two nodes for eleven seconds. During recovery, a message arrived out of order. A retry collided with a timeout. The system entered a state that nobody on your team imagined was reachable. Data was lost.
The code was correct for the world your tests described. It was not correct for the world production delivered.
The gap
Development environments are clean. The network is localhost. Disks never fail. Clocks agree. Messages arrive in order, exactly once. We know this is fictional, yet our test environments faithfully reproduce the fiction. We test against a world that does not exist, then express surprise when the real world finds the bugs we missed.
Production is a different animal. A study of 198 failures in Cassandra, HBase, HDFS, MapReduce, and Redis found that 74% were deterministic, most could be reproduced on 3 or fewer nodes, and 77% could have been caught by a unit test asserting against the correct error condition. The bugs were not exotic. They were ordinary mistakes in error handling code that nobody thought to test. Research on network partition failures in cloud services showed that 80% of catastrophic failures in distributed systems are caused by partition-related bugs, and 27% of those result in data loss. These are not exotic edge cases. They are Tuesday.
The gap between development and production is not a minor oversight we can close by writing more careful tests. It is structural.
The combinatorial problem
Consider a modest distributed system: three nodes, a leader election protocol, and replicated state. Now list the things that can go wrong. Any node can crash and restart. The network between any pair of nodes can partition. Messages can be delayed, reordered, or duplicated. Disk writes can be torn or lost. Clocks can drift.
A single test scenario might be: “Node B crashes during a leader election while Node A has an in-flight write and the network between A and C drops for two seconds.” That is one scenario. How many are there?
Even with coarse-grained modeling, a three-node cluster with five failure types and ten time steps produces thousands of distinct failure histories. A five-node cluster with realistic failure granularity produces millions. And that is before considering application-level state: what data was in flight, which transactions were uncommitted, which clients were retrying.
Consider a simple e-commerce API as an example. Six variable dimensions (user types, payment methods, delivery options, promotions, inventory status, currencies) require 648 unique test combinations for basic coverage. Adding one option to each dimension pushes it past 4,000. A real system has hundreds of dimensions. In one real project, a 300-line feature required a 10,000-line test PR to maintain combinatorial coverage.
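To make the arithmetic concrete (one plausible breakdown, not from the original study): dimension sizes of 2 × 2 × 3 × 3 × 3 × 6 multiply to 648 combinations, and growing each dimension by one option gives 3 × 3 × 4 × 4 × 4 × 7 = 4,032.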
This is not a tooling problem. No test framework makes writing 10,000 tests sustainable. The combinatorial space of a distributed system grows faster than any team can write tests for it.
Why coverage metrics lie
You might look at your coverage report and feel reassured. 85% line coverage. 70% branch coverage. These numbers measure how much code your tests execute. They say nothing about how much state space your tests explore.
A distributed system can execute the same lines of code in thousands of different orderings with thousands of different timing relationships. Line coverage treats all of those as identical. Branch coverage is slightly better but still blind to interleaving. You can have 100% branch coverage and never once test what happens when a leader election overlaps with a network partition during a compaction.
Will Wilson put it precisely: the very reason tests are needed (humans cannot enumerate all control flow paths) is exactly what makes it impossible for humans to write comprehensive tests. This is not a failure of discipline. It is a logical impossibility. Manual tests verify what developers imagined. They cannot verify what developers did not imagine. And bugs, by definition, live in the places nobody imagined.
The structural impossibility
Here is the argument in its sharpest form.
Distributed systems fail in ways that depend on the ordering of concurrent events. The number of possible orderings grows combinatorially with system size. Humans write tests based on scenarios they can imagine. The scenarios that cause bugs are, almost by definition, the ones nobody imagined. Therefore, comprehensive manual testing of a distributed system is not merely difficult. It is structurally impossible.
This does not mean we should give up on testing. It means we need a fundamentally different approach. Instead of writing individual tests by hand, we need to generate tests automatically. Instead of testing against a clean, predictable environment, we need to test against one that is worse than production. Instead of hoping we covered the important cases, we need infrastructure that systematically explores the space of all possible failures.
That is what simulation gives us. Not better tests. A different kind of testing entirely.
Prevention vs Discovery
- Prevention: guarding what you know
- Discovery: finding what you do not know
- Three principles that enable discovery
- The feedback loop
- The spectrum, not the binary
Most teams think about testing as one thing. Write tests, run tests, check the results. But there are actually two fundamentally different activities hiding under that single word, and conflating them is one of the most consequential mistakes in software engineering.
Prevention: guarding what you know
The first kind of testing is prevention. You ship a feature. A user reports a bug. You fix the bug, then write a test that would have caught it. Next time someone modifies that code, the test fails before the bug can return.
This is regression testing. It is valuable, well-understood, and nearly universal. CI pipelines run thousands of these tests on every commit. The mental model is defensive: we are building a wall around known-good behavior, brick by brick. Each bug report adds a brick.
Prevention testing answers one question: did we break what already worked?
Discovery: finding what you do not know
The second kind of testing is discovery. Rather than guarding known behavior, discovery testing actively searches for unknown failure modes. It asks: what else is broken that we have not found yet?
Prevention tests encode specific scenarios a developer imagined. Discovery testing generates scenarios no developer imagined. Prevention is a wall. Discovery is a search party sent into unmapped territory.
Most teams have a testing portfolio that is 100% prevention and 0% discovery. Every test was written in response to a specific requirement or a specific bug. No test was written to find bugs the team does not yet know about. This is not a criticism of those teams. Until recently, discovery testing at scale was only available to a handful of organizations willing to invest years of infrastructure work. But the tools are catching up.
Three principles that enable discovery
What does it take to shift from prevention to discovery? Three enabling principles, each building on the last.
Deterministic simulation. Discovery testing must generate enormous numbers of scenarios and be able to reproduce any that fail. If a bug depends on a specific ordering of network events, we need to replay that exact ordering. Determinism makes every execution reproducible from a single seed value. A failing seed is a bug report that anyone on the team can replay, inspect, and fix. Without determinism, discovery testing produces irreproducible ghosts.
Controlled fault injection. Discovery testing must exercise failure paths that production encounters but development environments do not. Network partitions, disk corruption, message reordering, clock skew, process crashes during recovery. These are not exotic scenarios. They are the normal operating conditions of any distributed system at scale. Controlled injection means we decide when and how faults occur, driven by a seeded random number generator so they are reproducible and tunable.
Sometimes assertions. This is the subtle one. Prevention tests use assertions that say “this must always be true.” Discovery testing needs a second kind: assertions that say “this should sometimes be true.” A sometimes assertion on a timeout retry path does not check that retries work every time. It checks that our simulation actually exercised the retry path at all. Without sometimes assertions, you can run a million simulations and never know whether they explored the interesting parts of the state space or just repeated the happy path a million times.
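As a sketch of the two kinds side by side (the conditions and messages are illustrative; exact macro signatures are covered in the Assertion Reference):

// Invariant: must hold on every execution, in every timeline.
assert_always!(committed <= accepted, "commits never exceed accepts");

// Exploration guidance: this path should be reached at least once across
// the run set, or the simulation never left the happy path.
assert_sometimes!(retry_count > 0, "retry path exercised");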
The feedback loop
Put the three principles together and something powerful emerges. Simulation generates scenarios. Fault injection pushes those scenarios into failure territory. Always assertions catch violations. Sometimes assertions tell us whether we explored deeply enough.
When a sometimes assertion never fires, it means our simulation is not reaching some region of the state space. That is a signal to adjust: inject different faults, change the workload, add more concurrent operations. When all sometimes assertions are firing and all always assertions hold, we have evidence (not proof, but strong evidence) that our system handles the failure modes we care about.
This is the discovery feedback loop. It runs continuously. It finds bugs that no developer anticipated. And critically, it does not require writing new tests for each new scenario. The infrastructure does the exploration. The developer’s job shifts from writing tests to defining correctness properties and interpreting results.
Lawrie Green from Antithesis captured this duality perfectly: assertions are memos to two audiences. They tell the computer what to check during exploration. And they tell the developer what they believe about their system. When an assertion fails on correct code, the developer’s mental model was wrong. That is itself a critical finding.
The spectrum, not the binary
Prevention and discovery are not competing approaches. Every team needs prevention testing. But adding even a small amount of discovery testing changes the game fundamentally.
You do not need to go from zero to full deterministic simulation overnight. Injecting random delays into your existing integration tests is discovery testing. Property-based tests with randomized inputs are discovery testing. Adding sometimes assertions to your simulation suite is discovery testing.
The question is not whether you should do discovery testing. The question is how far along the spectrum you want to go. Every step reveals bugs that prevention testing, by construction, will never find.
From Mocks to Simulation
- Why mocks fail at scale
- The #[cfg(test)] trap
- The alternative: trait-based simulation
- The fidelity spectrum
- Error injection over expectations
- Moonpool’s provider pattern
Every experienced developer has a mock story. You spend a day writing mocks for a database client, carefully specifying which methods return what, in which order. The tests pass. Then someone refactors the internal call sequence without changing any external behavior, and every mock breaks. The mocks were not testing your system’s correctness. They were testing its implementation details.
This is not a failure of any particular mocking library. It is a structural problem with how mocks work.
Why mocks fail at scale
Mocks operate by replacing a dependency with a fake that returns pre-programmed responses. To program those responses, you need to know exactly which methods your code will call, in which order, with which arguments. This means the test author must carry a mental model of the entire internal call stack between the code under test and the mocked dependency.
For a unit test of a single function, this is manageable. For an integration test of a distributed protocol with concurrent operations, retries, timeouts, and failure handling, it becomes a maintenance nightmare. Every internal refactor risks breaking mocks that were testing behavior, not implementation. Every new failure path requires manually programming new mock responses. The mock setup code grows until it rivals the complexity of the system it is supposed to test.
And mocks jump abstraction layers. Your production code talks to a TCP socket. Your mock replaces the database client. Between those two layers live connection pooling, serialization, retry logic, timeout handling, and error translation. None of that code runs during mock-based tests. You are testing a different system than the one you deploy.
The #[cfg(test)] trap
Rust developers often reach for conditional compilation: #[cfg(test)] to swap in test-specific implementations. This is tempting because it incurs no runtime cost and requires no trait indirection. But it means the binary you test is literally different from the binary you deploy. Different code paths, different struct fields, different behavior.
If a bug lives in the interaction between your retry logic and your connection pool, and your test build replaces the connection pool with an in-memory stub via #[cfg(test)], that bug is invisible to your test suite. You have not tested the system. You have tested a system-shaped thing that happens to share some code.
The alternative: trait-based simulation
There is a different approach. Instead of replacing entire subsystems with hand-programmed fakes, define a trait that describes the interface your code needs. Implement it once for production (real TCP, real disk, real clock). Implement it once for simulation (simulated network, simulated disk, simulated clock). Your application code depends on the trait, not the implementation. The trait implementation runs real logic, not pre-programmed responses, so every execution path that production exercises, simulation exercises too.
#[async_trait(?Send)]
pub trait TimeProvider: Clone {
    async fn sleep(&self, duration: Duration) -> Result<(), TimeError>;
    fn now(&self) -> Duration;
}
In production, sleep calls tokio::time::sleep. In simulation, sleep registers a timer with the simulated event loop and advances simulated time. The application code is identical in both cases. No #[cfg(test)]. No conditional compilation. The exact same binary logic runs in production and in simulation.
This is the provider pattern. It is the same architectural decision FoundationDB made with their INetwork interface: one trait, two implementations, zero conditional logic in the application.
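A minimal sketch of what the calling side looks like (heartbeat and send_ping are hypothetical; only TimeProvider comes from the framework):

// Generic over the trait: the same loop runs against tokio time in
// production and simulated time in testing.
async fn heartbeat<T: TimeProvider>(time: T) {
    loop {
        send_ping(); // hypothetical application call
        let _ = time.sleep(Duration::from_secs(1)).await;
    }
}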
The fidelity spectrum
Not every simulated implementation needs full fidelity. There is a spectrum.
No-op: the simplest fake. sleep returns immediately, send discards the message. Useful for testing pure logic that happens to call I/O functions.
In-memory: messages go into a queue, disk writes go into a HashMap. Fast, deterministic, but does not model timing, failures, or ordering.
Controlled simulation: messages are delayed by randomized amounts, connections drop according to a fault schedule, disk writes can be torn or corrupted. This is where bugs hide, because the system must handle not just the happy path but all the ways the real world deviates from it. Critically, these trait-based fakes scale across the entire codebase. You write each implementation once, and every component that uses the trait gets simulation for free. No per-test mock setup. No maintenance burden that grows with the test suite.
Full simulation: an entire cluster of processes with simulated network topology, coordinated fault injection, and time advancement. FoundationDB runs hundreds of simulated processes in a single thread, compressing hours of cluster behavior into seconds.
The right level depends on what you are testing. A serialization function needs no-op I/O. A retry loop needs controlled failure injection. A consensus protocol needs full cluster simulation.
Error injection over expectations
The deepest difference between mocks and simulation is the direction of control.
Mocks specify outputs: “when this method is called with these arguments, return this value.” The test author must predict every call. If the code takes a different path, the mock panics.
Simulation injects conditions: “connections drop with 5% probability, messages are delayed 1-100ms, disk writes fail 1 in 1000.” The simulation does not care which methods are called or in which order. It cares whether the system recovers correctly regardless of which failures occur.
This is a fundamental shift. Mock-based tests verify that your code follows a specific execution path. Simulation-based tests verify that your code produces correct results across all execution paths the simulation explores. One tests implementation. The other tests behavior.
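A sketch of what that injection looks like inside a simulated provider, using the framework-internal RNG helpers introduced later in this book (the real fault wiring belongs to the simulation engine, not your code):

// Inject conditions, not expectations: no method-call choreography.
if sim_random_f64() < 0.05 {
    return Err(std::io::ErrorKind::ConnectionReset.into()); // ~5% drop
}
let delay_ms: u64 = sim_random_range(1..100); // deliver after 1-100ms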
Moonpool’s provider pattern
This is exactly what moonpool implements. Every interaction with the outside world goes through a provider trait: TimeProvider for clocks and timers, TaskProvider for spawning concurrent work, NetworkProvider for connections and messages, StorageProvider for disk I/O, RandomProvider for randomness.
Your application code calls time.sleep() instead of tokio::time::sleep(). It calls task_provider.spawn_task() instead of tokio::spawn(). In production, these call through to the real runtime. In simulation, they feed into a deterministic event loop where every timer, every message, every disk operation is controlled, reproducible, and subject to fault injection.
No mocks to maintain. No #[cfg(test)] to diverge your test binary from your production binary. The same code, running in a simulated world that is deliberately worse than reality.
A Brief History
- FoundationDB’s radical bet (2009-2015)
- Antithesis generalizes it (2020-present)
- The adoption wave (2020-present)
- The unifying philosophy
Simulation-driven development did not appear from nowhere. It evolved over fifteen years across three distinct eras, each expanding who could use it and how.
FoundationDB’s radical bet (2009-2015)
In 2009, a small team set out to build a distributed transactional database. Before writing a single line of database code, they spent roughly two years building a simulator. No storage engine, no query layer, no client protocol. Just the simulation.
The idea was radical and, to most observers, insane. But the team had a thesis: distributed systems fail in ways that depend on the ordering of concurrent events, and the only way to test those orderings systematically is to control them completely.
They built Flow, a custom extension to C++ that provided actor-model concurrency compiled down to single-threaded callbacks. Every network connection, every disk operation, every timer went through an abstract interface (INetwork) with two implementations: Net2 for production (real TCP via Boost.ASIO) and Sim2 for simulation (in-memory buffers with deterministic delays). A single seeded random number generator controlled everything. Same seed, same execution. Every time.
Two techniques made the simulation radically effective. BUGGIFY injected faults at strategic points throughout the codebase: sending packets out of order, truncating disk writes, skipping timeouts, shrinking buffer sizes. Active at 25% probability during simulation, BUGGIFY broadened the effective contract being tested far beyond what any specification documented. It tested not what the system was supposed to handle, but what it could handle.
Hurst exponent manipulation produced correlated, cascading hardware failures. Real datacenters exhibit failure correlation: a hard drive failing in a rack makes nearby drives more likely to fail. Naive testing models failures as independent events, making cascading failures astronomically rare. FoundationDB’s simulator cranked the correlation up, producing failure patterns that would take years to encounter in production but that simulation could generate in milliseconds.
The scale was staggering. Tens of thousands of simulations every night, each simulating minutes to hours of cluster behavior under extreme fault injection, amounting to roughly one trillion CPU-hours of simulated testing. A physical validation cluster called Sinkhole, real server motherboards wired to programmable power switches and toggled continuously, served as the ultimate reality check. Sinkhole never found a single database bug that simulation had missed. It only found bugs in other software and hardware.
The cost was equally staggering. The approach required a custom language, simulation-first architecture, and two years of pure infrastructure investment before building the actual product. Only an elite team with unusual patience and conviction could pull it off. For over a decade, FoundationDB’s simulation remained a benchmark that others admired but could not replicate.
Antithesis generalizes it (2020-present)
Will Wilson, one of FoundationDB’s original engineers, co-founded Antithesis with a different question: what if you could apply the same techniques to any software, without rewriting it?
The answer was a deterministic hypervisor. Instead of requiring developers to build simulation into their system from day one, Antithesis intercepts nondeterminism at the virtual machine level. Any program, in any language, running on any framework, becomes deterministically reproducible. The hypervisor turns arbitrary software into a pure function from an input byte stream to a sequence of states. Save, restore, fork, replay. No Flow. No custom language. No architectural prerequisites.
On top of the hypervisor, Antithesis built a guided exploration engine. Their SDK provides declarative assertions that tell the platform what to explore without specifying how: Sometimes (ensure this condition fires at least once), Always (ensure this invariant holds), SometimesAll (drive exploration across a frontier of sub-goals). The platform uses these signals, combined with coverage-guided forking and adaptive search, to systematically explore the state space.
The results validated the approach. Customers with mature, well-tested systems found new bugs within 2 to 3 weeks of onboarding. Not shallow bugs. Deep concurrency issues, subtle data corruption, failure recovery defects that had survived years of conventional testing. The exploration engine required minimal domain knowledge to be effective. In one demonstration, Antithesis beat the entire game of Gradius by tracking just 3 bytes of game memory and maximizing time alive. No game-specific strategy. Just depth and coverage.
The adoption wave (2020-present)
The third era is happening now. Simulation-driven development is spreading beyond the teams that invented it.
TigerBeetle, building a financial transactions database, adopted deterministic simulation from day one, adding storage fault patterns (misdirected reads, phantom writes, uninitialized memory) that go beyond network-level testing. Dropbox used simulation to validate their sync engine rewrite. WarpStream applied deterministic simulation testing across their entire SaaS platform.
Clever Cloud uses simulation-driven development to build Materia, a distributed multi-model multi-tenant database. 30 minutes of simulation covers roughly 24 hours of equivalent chaos testing.
What took an elite team two years of custom infrastructure in 2009 is becoming accessible to small teams building on existing frameworks and languages. The investment required is shrinking. The bugs found per engineering-hour are increasing.
The unifying philosophy
Across all three eras, the core philosophy is the same. Make all execution deterministic so bugs are reproducible. Inject faults that are worse than production so surviving systems are overqualified for the real world. Explore systematically so bugs are found by infrastructure, not by imagination.
The tools change. The languages change. The accessibility changes. The philosophy does not.
Why Moonpool Exists
After ten years of operating distributed systems, one pattern keeps repeating: the bugs that hurt most are the ones nobody imagined.
You build a system. You test the scenarios you can think of. You deploy. Then a network partition overlaps with a retry storm during a rolling upgrade, and the system enters a state no developer on the team anticipated. The code was not wrong. It just was not tested against that particular combination of events.
The combinatorial problem makes this inevitable. A simple e-commerce API with six variable dimensions (user types, payment methods, delivery options, promotions, inventory status, currencies) requires 648 unique test combinations for basic coverage. Adding one option to each dimension pushes it past 4,000. A distributed system with network partitions, node crashes, disk faults, and clock drift has orders of magnitude more dimensions. You cannot test what you do not know. Manual tests ensure regression coverage, not absence of bugs.
That realization is what led to moonpool: the testing toolbox I wish I had when I started building distributed systems.
What I found
The answer already existed, scattered across several projects and decades of work.
FoundationDB had shown that simulating a distributed system’s network inside a single-threaded process, with deterministic fault injection and seed-driven reproducibility, could find more bugs in a night than production found in a year. Their BUGGIFY technique and Hurst exponent manipulation made the simulated world worse than reality. Their Sinkhole cluster never found a database bug that simulation missed.
TigerBeetle extended the simulation philosophy to storage. Where FoundationDB focused on network faults (partitions, latency, connection drops), TigerBeetle modeled disk-level faults drawn from real hardware failure modes: torn writes, misdirected reads, phantom writes, read corruption, sync failures, uninitialized memory. A financial transactions database cannot afford to trust that disks behave according to spec.
Antithesis generalized FoundationDB’s assertion system into a declarative SDK. Instead of hand-coding what to explore, you declare properties: Sometimes (this condition should fire), Always (this invariant must hold), SometimesAll (drive exploration along a frontier). The platform figures out how to reach those states. Their work on NES games demonstrated that coverage-guided forking and adaptive exploration could beat complex state spaces with minimal domain knowledge.
What moonpool synthesizes
Moonpool brings these ideas together in a single Rust framework.
From FoundationDB: the simulation engine. Single-threaded deterministic execution. A seeded PRNG controlling all network timing, fault injection, and process scheduling. Simulated time that jumps forward when all tasks are blocked, compressing hours of cluster behavior into seconds. The same code runs against real networking or the simulated network, swapped at the provider level.
From TigerBeetle: storage fault injection. Simulated disk operations that can corrupt reads, tear writes, fail syncs, and misdirect I/O. Not just network chaos but disk chaos, because real systems fail at both layers.
From Antithesis: the assertion suite and fork-based exploration. Always, sometimes, reachable, unreachable, numeric, and frontier assertions that live in shared memory. Coverage-guided forking that branches the simulation at interesting points, exploring multiple futures from a single state. Adaptive energy budgets that allocate exploration effort where coverage is still improving.
Where moonpool sits
Moonpool is a library-level simulation framework. There is no hypervisor, no custom virtual machine, no special runtime. You add a dependency to your Rust project, implement a few provider traits, and your system becomes simulatable.
This is a deliberate tradeoff. Antithesis’s hypervisor can simulate any software without modification. Moonpool requires that your code use provider traits for I/O, which means designing for simulation from the start or refactoring existing code. In exchange, you get zero-cost abstractions in production, full control over fault injection, and the ability to run simulations as ordinary cargo test invocations. No infrastructure to deploy. No service to call. Everything runs in your CI pipeline.
The provider pattern means your production binary and your simulation binary share the same application logic. No #[cfg(test)]. No conditional compilation. The same code, tested in a world that is deliberately worse than production.
What makes it different
Many simulation frameworks stop at network-level fault injection. Moonpool adds two things.
First, storage simulation at the same fidelity as network simulation. Disk faults drawn from the TigerBeetle and FoundationDB fault models, deterministically controlled by the same seed that controls the network.
Second, fork-based multiverse exploration. When the simulation reaches an interesting state (a new assertion fires, a coverage bit flips, a numeric watermark improves), moonpool forks the process. The parent continues its original trajectory. The child explores from the interesting state with a different seed. This turns a single simulation run into a tree of timelines, each branching at points of maximum discovery potential.
This is not a gimmick. The sequential luck problem (finding a bug that requires getting lucky multiple times in sequence, where the probability is p^n) is the central bottleneck in simulation-based testing. Fork-based exploration attacks it directly: instead of hoping a single timeline stumbles through all the right doors, we branch at each door and explore both sides.
Part VI covers multiverse exploration in depth. For now, the key point is that moonpool is not just a simulation engine. It is a simulation engine with a built-in search strategy for the state spaces that matter most: the ones where bugs require sequences of unlikely events to surface.
Determinism as a Foundation
A distributed system dropped writes for 47 seconds. Logs show three nodes disagreed about leadership, but the window where it happened has already passed. The cluster self-healed. You cannot reproduce it locally. You cannot reproduce it in staging. You spend three days staring at logs, form a theory, ship a fix you are 60% sure about, and wait. Until it happens again.
This is what debugging looks like without determinism. The bug is real, but the conditions that triggered it are gone. You cannot replay the execution. You cannot step through it. You cannot even confirm your fix addresses the right cause.
The Three Enemies
Rust programs that use async networking have three sources of non-determinism: thread scheduling, real I/O, and random number generation.
Thread scheduling means the OS decides when threads run. Two threads racing on a shared counter might increment it correctly 999,999 times, then corrupt it once. The interleaving that triggers the bug depends on CPU load, thermal throttling, how many browser tabs you have open. A multi-threaded tokio runtime executes tasks across a thread pool. The order tasks resume after .await is up to the scheduler. Run the same program twice, get two different execution orders.
Real I/O means the outside world injects randomness. A TCP packet arrives 2ms late. A DNS lookup takes 800ms instead of 5ms. A disk write returns ENOSPC because another process filled the partition. Network calls complete in unpredictable order. Timeouts race against responses. Your code is a deterministic function, but its inputs are chaos.
Random number generation introduces a subtler form of non-determinism. Calling rand::rng() pulls entropy from the OS. Two runs with identical inputs can make different random choices, leading to different retry delays, different leader elections, different shard assignments.
Together, these forces make distributed systems bugs the hardest kind to find. The bug only appears under a specific thread interleaving, during a specific network delay pattern, with a specific disk latency. Reproducing it means reproducing all three. Which is effectively impossible.
Moonpool’s Answer
Moonpool eliminates all three sources. Completely.
Single-core execution removes thread scheduling from the equation. One thread. One execution order. No races, no interleavings, no “works on my machine but fails in CI.” We cover this in detail in the next chapter.
Provider abstraction replaces real I/O and real randomness with simulated equivalents. Every system call your code makes (network, disk, time, random numbers) goes through a trait. In production, the trait calls tokio. In simulation, the trait calls moonpool’s deterministic runtime. Same code, different wiring. No #[cfg(test)] branching. The production code path is the tested code path.
With all three sources eliminated, the simulation becomes a pure function of its seed. A single u64 value determines everything: which connections fail, when timeouts fire, what order messages arrive, whether disk writes corrupt. Same seed, same execution, same bugs. Every time.
What This Gives You
A failing seed turns a production incident into a local debugging session. Instead of “it happened once between nodes 7 and 12 under load,” you get:
FAILED seed=8839214571 — leadership invariant violated at sim_time=12.4s
You paste that seed into your test, run it, hit a breakpoint. The bug is right there. You fix it. You run 10,000 more seeds. They pass. You ship with confidence.
This is not aspirational. FoundationDB ran 5 to 10 million simulation iterations per night, each with a different seed. Their physical validation cluster (real servers with programmable power switches, toggled continuously) never found a single database bug that simulation missed. Determinism made that possible.
Everything in moonpool builds on this foundation. Chaos testing, assertion coverage, multiverse exploration, all of it requires one guarantee: given the same seed, the simulation produces the same result. The rest of this part explains how we achieve that guarantee.
The Single-Core Constraint
Moonpool runs every simulation on a single thread. This is not a limitation we reluctantly accept. It is the first design decision we made, and every other decision follows from it.
One Thread, One Order
A multi-threaded tokio runtime uses a work-stealing thread pool. When a task yields at an .await, any thread in the pool might pick it up. Two tasks that resume “at the same time” can execute in either order, and that order changes between runs. This is fine for production throughput. It is fatal for deterministic simulation.
With a single thread, there is exactly one legal execution order for any given set of ready tasks. Task A resumes before Task B, or Task B before Task A, and the choice is controlled by our scheduler, not the OS. The RNG picks the order. The seed determines the RNG. Done.
How We Get There
Moonpool uses tokio’s single-threaded, local runtime:
tokio::runtime::Builder::new_current_thread()
    .enable_io()
    .enable_time()
    .build_local(Default::default())
This is not a LocalSet on top of a multi-threaded runtime. build_local creates a true single-threaded runtime where spawn_local is the only spawn mechanism. No Send bounds. No Sync bounds. No possibility of cross-thread data races.
This means every type in simulation can use Rc, RefCell, raw pointers, thread-local storage, whatever is needed. No Arc<Mutex<>> overhead. No Send bounds propagating through your entire type hierarchy.
The ?Send Constraint
All networking traits in moonpool use #[async_trait(?Send)]:
#[async_trait(?Send)]
pub trait TimeProvider: Clone {
    async fn sleep(&self, duration: Duration) -> Result<(), TimeError>;
    fn now(&self) -> Duration;
    // ...
}
The ?Send bound tells Rust that futures produced by these traits do not need to be Send. This is correct because we never move them between threads. It also means your async code can hold Rc<RefCell<T>> across .await points, which is impossible with Send-requiring runtimes.
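A small example of what this permits (a sketch; the shared counter is illustrative):

use std::{cell::RefCell, rc::Rc};

// Rc<RefCell<..>> held across an .await point: legal with ?Send futures,
// a compile error under Send-requiring runtimes.
async fn tick<T: TimeProvider>(time: &T, counter: Rc<RefCell<u64>>) {
    *counter.borrow_mut() += 1;
    let _ = time.sleep(Duration::from_millis(10)).await;
    *counter.borrow_mut() += 1;
}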
The Trade-Off
Single-core simulation cannot exploit CPU parallelism within a simulation run. A simulation of 20 nodes runs on one core, not 20.
In practice, this barely matters. Simulation time compression means a single core simulates hundreds of seconds of cluster time in a few seconds of wall-clock time. And you can run many simulations in parallel across cores, each with a different seed. A 16-core machine runs 16 independent simulations concurrently, each fully deterministic on its own thread.
Your production code is unaffected. The same code that runs single-threaded in simulation can run on a multi-threaded tokio runtime in production. The provider pattern (covered next) makes this swap transparent.
Seed-Driven Reproducibility
One number. A single u64. That is all moonpool needs to fully determine a simulation: which connections fail, when packets arrive, whether BUGGIFY triggers, how long disk writes take, what order events process. Same seed, same execution, same bugs. This is the property that makes everything else work.
The RNG Core
Moonpool uses ChaCha8Rng from the rand_chacha crate, seeded from the u64 value. ChaCha8 is fast, produces high-quality randomness, and is deterministic across platforms. The RNG lives in thread-local storage, which is correct because each simulation runs on a single thread.
// At the start of every simulation run
set_sim_seed(seed);

// Inside the simulation engine, every decision uses the thread-local RNG:
let latency: u64 = sim_random_range(1..50);      // network delay in ms
let should_fail: bool = sim_random_f64() < 0.25; // 25% fault probability
let value: f64 = sim_random();                   // general-purpose random value
These are framework-internal functions used by the simulation engine itself. Your application code should use providers.random().random() and providers.random().random_range() instead, which route through the same underlying RNG but go through the provider abstraction.
The functions sim_random(), sim_random_range(), and sim_random_f64() all draw from the same thread-local RNG. This means the order of calls matters. Adding a new sim_random() call anywhere in the simulation shifts every subsequent random value. This is intentional. It means small code changes produce different simulation trajectories, naturally exploring new parts of the state space.
Call Count Tracking
Every RNG call increments a thread-local counter. You can read it with get_rng_call_count(). This sounds mundane, but it is one of the most useful debugging tools in the framework.
When a seed produces a bug, you can narrow down exactly where the execution diverges from expected behavior by watching the call count. “The bug triggers after RNG call 847” tells you precisely which decision in the simulation started the chain of events leading to failure. Combined with breakpoints and logging, this turns a mysterious distributed failure into a step-through debugging session.
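A sketch of the technique (the bracketed operation is hypothetical):

// Bracket a suspect operation with RNG call counts to localize divergence.
let before = get_rng_call_count();
run_leader_election().await; // hypothetical operation under suspicion
let after = get_rng_call_count();
println!("election consumed RNG calls {before}..{after}");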
The explorer framework takes this further: it records call counts at fork points, creating a “recipe” of count@seed transitions that can replay an exact exploration path.
Multi-Seed Testing
A single seed tests one execution path. To build confidence, you need many paths. Moonpool’s builder supports two modes:
FixedCount runs a set number of iterations, each with a different random seed:
SimulationBuilder::new()
    .set_iterations(100) // 100 different seeds
    // ... workloads ...
    .run();
TimeLimit runs for a wall-clock duration, burning through as many seeds as time allows:
SimulationBuilder::new()
    .set_time_limit(Duration::from_secs(300)) // 5 minutes of exploration
    // ... workloads ...
    .run();
The power of seed-driven testing compounds over time. Run 1,000 seeds in CI on every commit. Run 100,000 overnight. Each seed explores a different combination of timing, faults, and ordering. Bugs that require three independent unlikely events to coincide will surface within a few thousand seeds because the simulation amplifies failure probability through BUGGIFY.
Debugging a Failing Seed
When CI reports a failure, the output includes the seed:
FAILED seed=17429853261 — connection timeout during leader election
Pin that seed and run it locally with logging enabled:
SimulationBuilder::new()
    .set_iterations(1)
    .set_debug_seeds(vec![17429853261])
    // ... same workloads ...
    .run();
The simulation replays the exact same execution. Set a breakpoint. Step through. The bug is deterministic now.
The Provider Pattern
Every distributed system does five things: it talks over the network, it reads the clock, it spawns concurrent tasks, it generates random values, and it reads and writes files. That is the entire surface area where non-determinism leaks in. Moonpool’s provider pattern seals all five.
The Core Idea
Define a trait for each category of I/O. Implement the trait twice: once backed by real tokio calls for production, once backed by a deterministic simulation engine for testing. Your application code is generic over the trait. It never knows which implementation it is running against.
            Application Code
     (generic over Providers trait)
                   |
       +-----------+-----------+
       |                       |
TokioProviders            SimProviders
(real TCP, real          (simulated TCP,
clock, real disk)         logical clock,
                          fault-injected disk)
This is interface swapping, the same technique FoundationDB used with their INetwork interface (production Net2 vs. simulation Sim2). The difference is that Rust’s type system enforces it at compile time. If your code compiles with P: Providers, it works with both implementations. No runtime surprises.
The Providers Bundle
Carrying five separate type parameters through every function signature would be painful:
// Nobody wants to write this
fn run_server<N, T, K, R, S>(net: N, time: T, task: K, rand: R, storage: S)
where
    N: NetworkProvider, T: TimeProvider, K: TaskProvider,
    R: RandomProvider, S: StorageProvider,
{ /* ... */ }
Moonpool solves this with a single bundle trait called Providers:
pub trait Providers: Clone + 'static {
    type Network: NetworkProvider + Clone + 'static;
    type Time: TimeProvider + Clone + 'static;
    type Task: TaskProvider + Clone + 'static;
    type Random: RandomProvider + Clone + 'static;
    type Storage: StorageProvider + Clone + 'static;

    fn network(&self) -> &Self::Network;
    fn time(&self) -> &Self::Time;
    fn task(&self) -> &Self::Task;
    fn random(&self) -> &Self::Random;
    fn storage(&self) -> &Self::Storage;
}
Now your code carries one type parameter:
fn run_server<P: Providers>(providers: P) {
    let time = providers.time().clone();
    let net = providers.network().clone();
    // Use them naturally
}
Two implementations exist: TokioProviders for production, and SimProviders (in moonpool-sim) for simulation. SimProviders::new(sim, seed, ip) takes an IP address so its storage provider is scoped to the correct process. Your application code never imports either one directly. It only sees P: Providers.
One Line Changes Everything
The swap between “testing a real distributed system” and “testing inside a deterministic simulation” happens at the call site, not inside your application logic:
// Production
let providers = TokioProviders::new();
run_server(providers);

// Simulation (inside a workload, the builder provides SimProviders via SimContext)
let providers = ctx.providers().clone(); // SimProviders
run_server(providers);
Same run_server. Same code path. Same binary. The only difference is which Providers implementation gets plugged in. This is the architectural foundation that makes everything else in moonpool possible: chaos testing, assertion coverage, multiverse exploration, all of it rests on the guarantee that your production code runs unmodified inside the simulator.
Quick Start: Swapping Implementations
Here is what using providers looks like in practice. We will write a function that uses time and network providers, then show it running in both production and simulation contexts.
A Function Generic Over Providers
use moonpool_core::{Providers, TimeProvider, NetworkProvider};
use std::time::Duration;

/// Connect to a peer and retry with exponential backoff.
async fn connect_with_retry<P: Providers>(
    providers: &P,
    addr: &str,
    max_retries: u32,
) -> std::io::Result<<P::Network as NetworkProvider>::TcpStream> {
    let mut delay = Duration::from_millis(100);
    for attempt in 0..max_retries {
        match providers.network().connect(addr).await {
            Ok(stream) => return Ok(stream),
            Err(_) if attempt + 1 < max_retries => {
                // Backoff before retrying — uses provider, not tokio directly
                providers.time().sleep(delay).await.ok();
                delay *= 2;
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!()
}
Notice what is not in this code: no tokio::time::sleep(). No tokio::net::TcpStream::connect(). The function uses providers.time().sleep() and providers.network().connect(). That is the entire discipline.
The Forbidden List
These direct tokio calls break determinism. Never use them in application code:
| Forbidden | Use instead |
|---|---|
| tokio::time::sleep() | providers.time().sleep() |
| tokio::time::timeout() | providers.time().timeout() |
| tokio::spawn() | providers.task().spawn_task() |
| tokio::net::TcpStream::connect() | providers.network().connect() |
| tokio::net::TcpListener::bind() | providers.network().bind() |
| tokio::fs::* | providers.storage().open() / exists() / etc. |
| rand::rng() | providers.random().random() |
Any direct tokio call in your application code is a hole in the simulation. The call will use real I/O, real time, and the simulation has no control over it. The result is non-determinism: different behavior between runs with the same seed.
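Side by side, the hole and the seal (a sketch):

// Hole: real wall-clock sleep, invisible to the simulator.
tokio::time::sleep(Duration::from_millis(50)).await;

// Sealed: routed through the provider, advanced by the deterministic engine.
let _ = providers.time().sleep(Duration::from_millis(50)).await;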
Running in Production
use moonpool_core::TokioProviders;

let providers = TokioProviders::new();

// Real TCP connection, real exponential backoff with wall-clock delays
let stream = connect_with_retry(&providers, "10.0.1.1:9000", 5).await?;
TokioProviders bundles TokioTimeProvider, TokioNetworkProvider, TokioTaskProvider, TokioRandomProvider, and TokioStorageProvider. Each one delegates to the real tokio equivalent.
Running in Simulation
Inside a simulation workload, the builder gives you SimProviders:
#![allow(unused)]
fn main() {
// The simulation provides SimProviders to your workload
// SimProviders bundles simulated time, network, tasks, random, and storage
let providers = ctx.providers().clone();
let stream = connect_with_retry(&providers, "10.0.1.1:9000", 5).await?;
}
Same function call. But now sleep() advances simulation time instead of wall-clock time. connect() goes through the simulated network where connections can be delayed, dropped, or partitioned. The retry loop exercises the exact same code path, but under controlled, deterministic conditions.
That is the entire provider workflow: write your code generic over P: Providers, use provider methods instead of raw tokio, and the framework handles the rest.
Deep Dive: Why Providers Exist
Providers are not the first idea anyone reaches for when testing distributed systems. Most teams start with #[cfg(test)] or mock frameworks. Both approaches have fundamental problems that providers solve.
Why Not #[cfg(test)]
Conditional compilation is tempting. Wrap the real network call in production code, swap in a fake during tests:
#![allow(unused)]
fn main() {
#[cfg(not(test))]
async fn connect(addr: &str) -> io::Result<TcpStream> {
TcpStream::connect(addr).await
}
#[cfg(test)]
async fn connect(addr: &str) -> io::Result<TcpStream> {
Ok(FakeStream::new()) // returns immediately, no real network
}
}
The problem: you are no longer testing your production code. The test binary compiles a different function. Every #[cfg(test)] block is a fork in your codebase where production and test behavior can silently diverge. As Oxide Computer’s engineering team documented across their five major repositories, #[cfg(test)] blocks “prevent testing real code paths.” The tested code is not the shipped code.
Providers eliminate this entirely. There is one connect() implementation in your application. It calls providers.network().connect(). In production, that dispatches to tokio. In simulation, that dispatches to the simulator. The application code is identical in both cases. Same binary, same compiler output, same code path.
Why Not Mocks
Mock frameworks like mockall record expectations: “this method should be called with these arguments and return this value.” They test that your code calls the right methods in the right order. They do not test what happens when the network does something unexpected.
A mock for TCP might say: “when connect("10.0.1.1:9000") is called, return Ok(stream).” This tells you nothing about what happens when the connection takes 3 seconds, or when it succeeds but the first read returns ConnectionReset, or when the connection succeeds on the third retry but the remote peer has rebooted and lost state.
Providers are full implementations, not recorded expectations. The simulation network provider maintains connection state, buffers packets, injects delays, simulates TCP half-close, drops messages under partition. It is a complete in-memory TCP implementation. When your code connects through the simulation provider, it gets a real (simulated) TCP session with real (simulated) failure modes.
This is the fidelity spectrum that Oxide’s codebases demonstrate: from no-op fakes (methods return Ok(())), through in-memory implementations (real data operations without real I/O), all the way to full protocol simulation. Providers operate at the highest fidelity level because simulation needs to exercise the same edge cases that production encounters.
The Pattern
The recipe is simple. Oxide’s engineering teams converged on it independently across five repositories with near-zero mock framework usage:
- Define a trait for the external dependency
- Implement it for production (real tokio calls)
- Implement it for simulation (deterministic, controllable)
- Inject via generics (compile-time dispatch, zero runtime overhead)
#![allow(unused)]
fn main() {
// Step 1: The trait
#[async_trait(?Send)]
pub trait TimeProvider: Clone {
async fn sleep(&self, duration: Duration) -> Result<(), TimeError>;
fn now(&self) -> Duration;
}
// Step 2: Production implementation
#[async_trait(?Send)]
impl TimeProvider for TokioTimeProvider {
async fn sleep(&self, duration: Duration) -> Result<(), TimeError> {
tokio::time::sleep(duration).await; // real wall-clock delay
Ok(())
}
}
// Step 3: Simulation implementation (in moonpool-sim)
// sleep() schedules a timer event, the simulator advances logical time
// Step 4: Inject via generics
async fn heartbeat_loop<T: TimeProvider>(time: T) {
loop {
send_heartbeat().await;
time.sleep(Duration::from_secs(5)).await.ok();
}
}
}
The compiler guarantees correctness: if heartbeat_loop compiles with T: TimeProvider, it works with both TokioTimeProvider and the simulation time provider. No runtime dispatch. No dynamic casts. No “forgot to wire up the mock” surprises.
What You Get
The provider pattern gives you three things that mocks and #[cfg(test)] cannot:
Same code path. Your production code is your test code. No divergence, no conditional compilation, no “it works in test but fails in production” surprises.
Full behavior, not just call verification. Providers simulate the actual behavior of the subsystem, not just whether your code calls the right methods. A simulated network can partition, delay, reorder, and corrupt. A mock just records calls.
Compile-time safety. Generic bounds guarantee that any code compiling against the provider traits works with both the production and simulation implementations. The one gap is discipline: if you bypass the providers and call tokio directly, the code still compiles and runs, but it is no longer deterministic. Moonpool’s code conventions (enforced in review) close that gap: no direct tokio::time::sleep(), no tokio::spawn(), no tokio::net::*. Always go through the provider.
The Five Providers
Moonpool abstracts every interaction between your code and the outside world into five provider traits. Each trait covers one category of I/O. Together, they form a complete boundary around your application, giving the simulator full control over every source of non-determinism.
TimeProvider
Time is the most pervasive dependency in distributed systems. Every timeout, backoff, heartbeat, and lease check goes through TimeProvider.
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
pub trait TimeProvider: Clone {
/// Sleep for the specified duration.
async fn sleep(&self, duration: Duration) -> Result<(), TimeError>;
/// Get exact current time.
fn now(&self) -> Duration;
/// Get drifted timer time (simulates clock drift between nodes).
fn timer(&self) -> Duration;
/// Run a future with a timeout.
async fn timeout<F, T>(&self, duration: Duration, future: F) -> Result<T, TimeError>
where
F: std::future::Future<Output = T>;
}
}
The distinction between now() and timer() is borrowed from FoundationDB’s sim2. In production, both return the same value. In simulation, timer() can drift up to 100ms ahead of now(), testing how your code handles clock skew between processes. Use now() for event scheduling. Use timer() for application-level time checks like lease expiry and heartbeat deadlines.
Production: TokioTimeProvider delegates sleep to tokio::time::sleep, timeout to tokio::time::timeout, and now to std::time::Instant::elapsed.
Simulation: Sleep schedules an event on the simulation event queue. When all tasks are blocked, the simulator performs “time travel,” jumping forward to the next scheduled event. This compresses hours of simulated cluster time into seconds of wall-clock time.
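A small sketch of the two-clock discipline in practice (lease_expired is a hypothetical helper, not a moonpool API): schedule work with now(), judge application deadlines with timer(), and the simulator’s injected drift exercises your skew handling for free.
#![allow(unused)]
fn main() {
use std::time::Duration;
use moonpool_core::TimeProvider;
/// Hedged sketch: lease checks read timer(), which may drift ahead of
/// now() in simulation, so clock-skew handling is tested automatically.
fn lease_expired<T: TimeProvider>(time: &T, granted_at: Duration, ttl: Duration) -> bool {
    time.timer() >= granted_at + ttl
}
}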
NetworkProvider
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
pub trait NetworkProvider: Clone {
type TcpStream: AsyncRead + AsyncWrite + Unpin + 'static;
type TcpListener: TcpListenerTrait<TcpStream = Self::TcpStream> + 'static;
/// Create a TCP listener bound to the given address.
async fn bind(&self, addr: &str) -> io::Result<Self::TcpListener>;
/// Connect to a remote address.
async fn connect(&self, addr: &str) -> io::Result<Self::TcpStream>;
}
#[async_trait(?Send)]
pub trait TcpListenerTrait {
type TcpStream: AsyncRead + AsyncWrite + Unpin + 'static;
/// Accept a single incoming connection.
async fn accept(&self) -> io::Result<(Self::TcpStream, String)>;
/// Get the local address this listener is bound to.
fn local_addr(&self) -> io::Result<String>;
}
}
The associated types TcpStream and TcpListener let each implementation provide its own concrete types. Production gives you tokio::net::TcpStream. Simulation gives you an in-memory stream backed by buffers with controllable latency, reordering, and connection failures.
The API deliberately matches what you would expect from tokio networking. bind, connect, accept behave like their tokio counterparts. The streams implement AsyncRead + AsyncWrite, so they work with any tokio-compatible codec or framing layer.
Production: TokioNetworkProvider wraps tokio::net.
Simulation: Connections are in-memory buffer pairs with deterministic delivery delays, TCP half-close simulation, and fault injection (connection drops, partitions, delayed delivery).
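A minimal sketch of that symmetry (ping is a hypothetical helper): one request/response round trip written against the trait runs unchanged over real sockets and over the simulated network.
#![allow(unused)]
fn main() {
use moonpool_core::NetworkProvider;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
/// Hedged sketch: one round trip over any NetworkProvider. In simulation,
/// the connect and both transfers flow through the in-memory network.
async fn ping<N: NetworkProvider>(net: &N, addr: &str) -> std::io::Result<Vec<u8>> {
    let mut stream = net.connect(addr).await?;
    stream.write_all(b"ping").await?;
    let mut buf = vec![0u8; 64];
    let n = stream.read(&mut buf).await?;
    buf.truncate(n);
    Ok(buf)
}
}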
TaskProvider
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
pub trait TaskProvider: Clone {
/// Spawn a named task that runs on the current thread.
fn spawn_task<F>(&self, name: &str, future: F) -> tokio::task::JoinHandle<()>
where
F: Future<Output = ()> + 'static;
/// Yield control to allow other tasks to run.
async fn yield_now(&self);
}
}
Tasks are always local (no Send bound on F). The name parameter is used for tracing and debugging. In simulation, it shows up in event logs so you can trace which task generated which event.
Production: TokioTaskProvider uses tokio::task::spawn_local.
Simulation: The simulator controls task scheduling order, making it deterministic and seed-dependent.
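A short sketch of spawning through the provider (start_heartbeat and its task body are illustrative); the task name is what shows up in the simulator’s event log.
#![allow(unused)]
fn main() {
use std::time::Duration;
use moonpool_core::{Providers, TaskProvider, TimeProvider};
/// Hedged sketch: a named background task. The future is !Send and
/// 'static, so it captures clones of whatever providers it needs.
fn start_heartbeat<P: Providers>(providers: &P) {
    let time = providers.time().clone();
    let _handle = providers.task().spawn_task("heartbeat", async move {
        loop {
            // send_heartbeat().await would go here (application logic elided)
            if time.sleep(Duration::from_secs(5)).await.is_err() {
                break; // assumption: a sleep error signals shutdown
            }
        }
    });
}
}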
RandomProvider
#![allow(unused)]
fn main() {
pub trait RandomProvider: Clone {
/// Generate a random value of type T.
fn random<T>(&self) -> T
where
StandardUniform: Distribution<T>;
/// Generate a random value within a specified range (start..end).
fn random_range<T>(&self, range: Range<T>) -> T
where
T: SampleUniform + PartialOrd;
/// Generate a random f64 between 0.0 and 1.0.
fn random_ratio(&self) -> f64;
/// Generate a random bool with the given probability of being true.
fn random_bool(&self, probability: f64) -> bool;
}
}
RandomProvider is the only provider without #[async_trait(?Send)] because random number generation is synchronous.
Production: TokioRandomProvider uses rand::rng() (thread-local, non-deterministic).
Simulation: Uses the seeded ChaCha8Rng from the simulation’s RNG system. Every call draws from the same deterministic stream, maintaining reproducibility.
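A sketch of seed-stable randomness (jittered_backoff is a hypothetical helper): every draw comes from the seeded stream, so the same seed reproduces the same jitter and the same rare stalls.
#![allow(unused)]
fn main() {
use std::time::Duration;
use moonpool_core::RandomProvider;
/// Hedged sketch: deterministic jitter. A rare branch taken on one seed
/// is taken again on every replay of that seed.
fn jittered_backoff<R: RandomProvider>(random: &R, base_ms: u64) -> Duration {
    let jitter = random.random_range(0..base_ms / 2 + 1);
    if random.random_bool(0.01) {
        Duration::from_millis(base_ms * 10) // rare long stall
    } else {
        Duration::from_millis(base_ms + jitter)
    }
}
}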
StorageProvider
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
pub trait StorageProvider: Clone {
type File: StorageFile + 'static;
async fn open(&self, path: &str, options: OpenOptions) -> io::Result<Self::File>;
async fn exists(&self, path: &str) -> io::Result<bool>;
async fn delete(&self, path: &str) -> io::Result<()>;
async fn rename(&self, from: &str, to: &str) -> io::Result<()>;
}
#[async_trait(?Send)]
pub trait StorageFile: AsyncRead + AsyncWrite + AsyncSeek + Unpin {
async fn sync_all(&self) -> io::Result<()>;
async fn sync_data(&self) -> io::Result<()>;
async fn size(&self) -> io::Result<u64>;
async fn set_len(&self, size: u64) -> io::Result<()>;
}
}
Storage is the newest provider, and the one with the richest fault model. OpenOptions mirrors std::fs::OpenOptions with read, write, create, truncate, and append flags.
Production: TokioStorageProvider wraps tokio::fs.
Simulation: In-memory filesystem with fault injection inspired by TigerBeetle and FoundationDB patterns: read/write corruption, crash and torn writes, misdirected reads/writes, sync failures, and IOPS/bandwidth timing simulation. Each SimStorageProvider is scoped to a process IP (SimStorageProvider::new(sim, ip)), and files are tagged with owner_ip so fault injection uses the correct per-process configuration.
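A sketch of the write-then-sync discipline this fault model rewards (the wal.log name is illustrative, and the OpenOptions construction assumes the std-like builder described above): data counts as durable only after the sync returns.
#![allow(unused)]
fn main() {
use moonpool_core::{OpenOptions, StorageFile, StorageProvider};
use tokio::io::AsyncWriteExt;
/// Hedged sketch: append a record, then sync. Under simulation this is
/// exactly where torn writes and sync failures get injected.
async fn append_record<S: StorageProvider>(storage: &S, rec: &[u8]) -> std::io::Result<()> {
    let mut opts = OpenOptions::new();
    opts.write(true).create(true).append(true);
    let mut file = storage.open("wal.log", opts).await?;
    file.write_all(rec).await?;
    file.sync_data().await // durable only once this returns Ok
}
}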
The Providers Bundle
All five come together in the Providers trait:
#![allow(unused)]
fn main() {
pub trait Providers: Clone + 'static {
type Network: NetworkProvider + Clone + 'static;
type Time: TimeProvider + Clone + 'static;
type Task: TaskProvider + Clone + 'static;
type Random: RandomProvider + Clone + 'static;
type Storage: StorageProvider + Clone + 'static;
fn network(&self) -> &Self::Network;
fn time(&self) -> &Self::Time;
fn task(&self) -> &Self::Task;
fn random(&self) -> &Self::Random;
fn storage(&self) -> &Self::Storage;
}
}
TokioProviders bundles all five production implementations. SimProviders bundles all five simulation implementations and requires an IP address at construction (SimProviders::new(sim, seed, ip)) so that the storage provider is scoped to the correct process. Your application code sees P: Providers and nothing else.
System Under Test vs Test Driver
- Two Perspectives on the Same System
- Process: The Thing You Are Building
- Workload: The Thing That Tests
- Different Lifecycles
- Why Not One Trait?
- What Comes Next
Every simulation in moonpool has two distinct roles. Understanding this separation is the single most important concept before you write your first line of simulation code.
Two Perspectives on the Same System
Think about how you test a web server in production. You have the server itself, handling requests, managing connections, storing data. And you have something else: a client, a load generator, a test harness that sends requests and checks responses.
These two things have fundamentally different jobs. The server is the system. The test driver exercises the system.
Moonpool formalizes this split into two traits: Process and Workload.
Process: The Thing You Are Building
A Process is your server, your node, your distributed system participant. It is the system under test. The code inside a Process is the code you ship to production.
Processes have a difficult life in simulation:
- They crash. The simulation kills them without warning.
- They reboot. After a crash, the simulation creates a fresh instance from scratch.
- They lose state. All in-memory fields vanish on reboot. Only data written to storage survives.
- They get partitioned. Network connections break, packets get delayed, peers become unreachable.
This is by design. The whole point of simulation testing is to throw chaos at your server code and see what breaks.
In FoundationDB’s simulation, the equivalent is the fdbd process. Each simulated FDB node runs the same code as production, but inside a controlled environment where the simulator decides when clocks advance, when packets arrive, and when machines die.
Workload: The Thing That Tests
A Workload is your test driver. It lives outside the system under test. It sends requests, observes responses, and checks that the system behaved correctly.
Workloads have a much easier life:
- They never crash. The simulation does not reboot workloads.
- They survive everything. Processes come and go, but workloads keep running.
- They see the whole picture. Workloads know about all processes, all IPs, the full topology.
- They judge correctness. After the simulation ends, workloads run their check() method to validate final state.
In FoundationDB’s simulation, the equivalent is tester.actor.cpp. The tester knows the cluster topology, drives operations against it, and validates results.
Different Lifecycles
This distinction matters because their lifecycles are completely different.
A Process lifecycle looks like this: created from factory, runs, gets killed, factory creates a fresh one, runs again, gets killed again. Each incarnation starts with empty in-memory state. The factory produces a blank slate every time.
A Workload lifecycle is linear: setup() runs once at the start, run() executes the test logic, check() validates at the end. One continuous life from start to finish.
Here is the key insight: a Process does not know when it will die, and a Workload does not care when Processes die. The Process just tries to do its job correctly. The Workload just keeps sending requests and tracking what happened.
Why Not One Trait?
You might wonder why we need this separation at all. Why not just have “participants” in the simulation?
Because mixing the two concerns creates a mess. If your server code also has to track test state, you cannot reboot it cleanly. If your test driver also runs server logic, it cannot survive process crashes.
The separation also matches real production architecture. You deploy servers. You run integration tests against them. Different code, different lifecycles, different concerns.
What Comes Next
The next two chapters cover each trait in detail. We will look at the Process trait with its factory pattern and reboot semantics, then the Workload trait with its three-phase lifecycle. After that, we will build a complete simulation from scratch.
Process: Your Server
- The Process Trait
- The Factory Pattern
- State and Reboots
- IP Addressing
- Tags for Role Assignment
- Graceful vs Crash Reboots
- A Concrete Example
A Process represents the system under test. It is the code you would ship to production, running inside the simulation where the framework controls time, network, and failure.
The Process Trait
The trait is minimal by design:
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
pub trait Process: 'static {
fn name(&self) -> &str;
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()>;
}
}
Two methods. name() identifies this process type for reporting. run() is where your server logic lives. The ?Send bound exists because moonpool runs on a single thread, so nothing needs to be Send.
When run() returns Ok(()), the process has exited voluntarily. When the simulation kills the process, the future is cancelled and run() never returns at all.
The Factory Pattern
You never construct a Process once. You give the builder a factory that can produce fresh instances:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.processes(3, || Box::new(MyServer::new()))
}
Why a factory? Because of reboots. When the simulation kills a process and restarts it, the framework calls your factory to get a brand new instance. This guarantees that each boot starts with a clean slate, just like restarting a real server.
The factory is called once per process per boot. Three processes with two reboots each means the factory runs nine times total.
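The factory is an ordinary closure, so it can capture shared, immutable configuration while still producing a blank instance per boot. A hedged sketch (ServerConfig and MyServer::with_config are hypothetical):
#![allow(unused)]
fn main() {
#[derive(Clone)]
struct ServerConfig {
    max_connections: usize, // hypothetical config type
}
let config = ServerConfig { max_connections: 64 };
SimulationBuilder::new()
    // Each boot gets a brand new MyServer; only the captured config is shared.
    .processes(3, move || Box::new(MyServer::with_config(config.clone())))
}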
State and Reboots
This is the rule you must internalize: all in-memory state is lost on reboot.
If your Process has a HashMap<String, Vec<u8>> field tracking client sessions, that map is gone after a reboot. The new instance from the factory starts empty. Only data written to the simulated storage layer survives.
This matches reality. When a server process crashes and restarts, it does not magically recover its heap. It reads persistent state from disk and rebuilds from there.
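A hedged sketch of that recovery pattern (the snapshot file name and raw-bytes format are illustrative; OpenOptions follows the std-like builder described earlier): on boot, read persistent state back before serving.
#![allow(unused)]
fn main() {
use moonpool_core::{OpenOptions, StorageProvider};
use tokio::io::AsyncReadExt;
/// Hedged sketch: rebuild in-memory state from storage on boot. On a first
/// boot, or after a storage wipe, the file is absent and we start empty.
async fn load_snapshot<S: StorageProvider>(storage: &S) -> std::io::Result<Vec<u8>> {
    if !storage.exists("state.snapshot").await? {
        return Ok(Vec::new());
    }
    let mut opts = OpenOptions::new();
    opts.read(true);
    let mut file = storage.open("state.snapshot", opts).await?;
    let mut bytes = Vec::new();
    file.read_to_end(&mut bytes).await?;
    Ok(bytes)
}
}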
IP Addressing
Each process instance gets its own IP address in the 10.0.1.0/24 range:
Process 0 → 10.0.1.1
Process 1 → 10.0.1.2
Process 2 → 10.0.1.3
Workloads get IPs in the 10.0.0.0/24 range. This clean separation makes it easy to identify what is a server and what is a test driver when reading logs.
Your process accesses its IP through ctx.my_ip(). Other process IPs are available through ctx.topology().all_process_ips().
Tags for Role Assignment
Many distributed systems need nodes with different roles: leader and follower, primary and secondary, different data centers. Tags handle this:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.processes(5, || Box::new(MyNode::new()))
.tags(&[
("role", &["leader", "follower"]),
("dc", &["east", "west", "eu"]),
])
}
Tags distribute round-robin. With 5 processes, 2 role values, and 3 dc values, the assignment looks like:
| Process | role | dc |
|---|---|---|
| 0 | leader | east |
| 1 | follower | west |
| 2 | leader | eu |
| 3 | follower | east |
| 4 | leader | west |
Inside your process, read tags from the context:
#![allow(unused)]
fn main() {
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()> {
let role = ctx.topology().my_tags().get("role");
match role.as_deref() {
Some("leader") => run_leader(ctx).await,
Some("follower") => run_follower(ctx).await,
_ => Ok(()),
}
}
}
Graceful vs Crash Reboots
Not all deaths are equal. Moonpool supports three reboot kinds:
Graceful: The simulation signals the shutdown token. Your process has a grace period to finish in-flight work, flush buffers, and close connections cleanly. If it does not exit in time, it gets force-killed anyway.
Crash: Instant death. The process task is cancelled immediately. All connections abort. No cleanup, no buffer drain. Peers see connection reset errors.
CrashAndWipe: Same as Crash but also wipes all persistent storage owned by that process. Simulates total disk failure or a fresh node joining the cluster. The wipe is immediate and scoped to the process’s IP address, so other processes’ storage is unaffected.
To handle graceful shutdown, check the cancellation token in your main loop:
#![allow(unused)]
fn main() {
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()> {
let listener = ctx.network().bind(ctx.my_ip()).await?;
loop {
if ctx.shutdown().is_cancelled() {
break;
}
// Accept connections, handle requests...
}
Ok(())
}
}
You do not need to handle crash reboots. There is nothing to handle. The simulation cancels your future and moves on.
A Concrete Example
Here is a simple echo server as a Process:
#![allow(unused)]
fn main() {
struct EchoServer;
#[async_trait(?Send)]
impl Process for EchoServer {
fn name(&self) -> &str {
"echo"
}
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()> {
let listener = ctx.network().bind(ctx.my_ip()).await?;
loop {
if ctx.shutdown().is_cancelled() {
break;
}
match ctx.time().timeout(
Duration::from_millis(100),
listener.accept()
).await {
Ok(Ok((mut stream, _addr))) => {
let mut buf = vec![0u8; 4096];
while let Ok(n) = stream.read(&mut buf).await {
if n == 0 { break; }
let _ = stream.write_all(&buf[..n]).await;
}
}
_ => continue,
}
}
Ok(())
}
}
}
Notice the patterns: use ctx.network() for connections, ctx.time() for timeouts, ctx.shutdown() for graceful termination. Never call tokio directly.
Workload: Your Test Driver
- The Workload Trait
- The Three Phases
- Workloads Survive Everything
- The SimContext
- The Operation Alphabet
- Multiple Workload Instances
- What a Good Workload Looks Like
A Workload exercises the system under test and validates its correctness. While Processes are the code you ship, Workloads are the code that finds your bugs.
The Workload Trait
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
pub trait Workload: 'static {
fn name(&self) -> &str;
async fn setup(&mut self, _ctx: &SimContext) -> SimulationResult<()> {
Ok(())
}
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()>;
async fn check(&mut self, _ctx: &SimContext) -> SimulationResult<()> {
Ok(())
}
}
}
Four methods, two with defaults. The lifecycle follows a strict order: setup(), then run(), then check(). Each phase has different rules and a different purpose.
The Three Phases
Setup: Prepare the Ground
setup() runs sequentially across all workloads. Workload A’s setup completes before workload B’s setup starts. No concurrency, no surprises.
Use setup to establish connections, initialize state, prepare data structures. By the time run() starts, every workload should be ready to go.
#![allow(unused)]
fn main() {
async fn setup(&mut self, ctx: &SimContext) -> SimulationResult<()> {
let server_ip = ctx.topology().all_process_ips()
.first()
.ok_or(SimulationError::InvalidState("no servers".into()))?;
self.connection = Some(ctx.network().connect(server_ip).await?);
Ok(())
}
}
Run: Drive the System
run() is where the action happens. All workloads run concurrently. Multiple clients hammering the same servers, competing for resources, triggering race conditions.
This is where you send requests, observe responses, track expected state, and use assertions to flag anomalies. The run phase continues until all workloads return or the simulation shuts them down.
#![allow(unused)]
fn main() {
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()> {
for _ in 0..self.num_operations {
if ctx.shutdown().is_cancelled() {
break;
}
let op = random_op(ctx.random(), &self.accounts);
match op {
Op::Write { key, value } => {
self.send_write(&key, &value).await?;
self.model.write(&key, &value);
}
Op::Read { key } => {
let result = self.send_read(&key).await?;
let expected = self.model.read(&key);
assert_always!(
result == expected,
format!("read mismatch for key '{}'", key)
);
}
}
}
Ok(())
}
}
Check: Validate the Outcome
check() runs sequentially after all workloads finish and all pending events drain. The system is quiescent. No more messages in flight, no more timeouts pending.
Use check for final state validation. Did the conservation law hold? Are all balances non-negative? Did every committed write survive?
#![allow(unused)]
fn main() {
async fn check(&mut self, _ctx: &SimContext) -> SimulationResult<()> {
let total = self.model.total_balance();
let expected = self.model.total_deposited - self.model.total_withdrawn;
assert_always!(
total == expected,
format!("conservation law violated: {} != {}", total, expected)
);
Ok(())
}
}
Workloads Survive Everything
This is the fundamental difference from Processes. When the simulation kills a server, your workload keeps running. When a connection breaks, your workload can reconnect. When the network partitions, your workload observes the failures and adapts.
Your workload tracks what happened across the entire simulation lifetime, including across process reboots. This is how you verify that a server recovers correctly after a crash: the workload sent a write before the crash, the server rebooted, and the workload reads the value back to check that it survived.
The SimContext
Every lifecycle method receives a SimContext that gives workloads access to everything they need:
- ctx.my_ip() returns this workload’s IP address
- ctx.topology() has process IPs, peer info, and tag registries
- ctx.network() provides simulated TCP connections
- ctx.time() provides simulated clocks and timeouts
- ctx.random() provides deterministic random numbers
- ctx.state() provides cross-workload shared state for invariants
- ctx.shutdown() provides the cancellation token
Use ctx.topology().all_process_ips() to find your servers. Use ctx.peer("server") if you know the name. Use ctx.topology().ips_tagged("role", "leader") if you need role-specific targeting.
The Operation Alphabet
Strong workloads define an “operation alphabet”: the set of actions they can perform. Deposits, withdrawals, reads, writes, delays. Each operation has a weight controlling how often it fires.
#![allow(unused)]
fn main() {
pub fn random_op(random: &SimRandomProvider, accounts: &[String]) -> Op {
let roll = random.random_range(0..100);
match roll {
0..30 => Op::Deposit { ... },
30..50 => Op::Withdraw { ... },
50..70 => Op::Read { ... },
70..90 => Op::Transfer { ... },
_ => Op::SmallDelay,
}
}
}
The weights matter. Too many reads and you never test write conflicts. Too many writes and you never test read-after-crash consistency. The alphabet should cover normal operations, adversarial inputs, and small delays that let background work complete.
Multiple Workload Instances
Sometimes one client is not enough. Use workloads() to create multiple instances:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.workloads(WorkloadCount::Fixed(3), |i| {
Box::new(ClientWorkload::new(i))
})
}
Each instance gets its own client_id (accessible via ctx.client_id()) and its own IP. Multiple clients hitting the same server concurrently is where the interesting bugs hide.
For variable topology testing, use WorkloadCount::Random(1..6) to spawn a different number of clients each iteration. The count is determined by the simulation RNG, so it stays deterministic per seed.
What a Good Workload Looks Like
The best workloads follow a pattern:
- Define an operation alphabet with weighted random selection
- Track a reference model (expected state computed locally)
- Assert on every response using assert_always! for invariants
- Use assert_sometimes! for coverage of interesting paths
- Validate final state in check() using the reference model
- Publish state via ctx.state() for cross-workload invariant checking
The next part walks through building a complete simulation from scratch, demonstrating all of these patterns.
Your First Simulation
We have covered the theory. Deterministic execution, providers, the split between Process and Workload. Now we build something real.
What We Are Building
Over the next four chapters, we will create a complete simulation test for a key-value server. By the end, you will have:
- A Process that accepts TCP connections and responds to get/set requests
- A Workload that drives random operations and tracks expected state
- Assertions that verify correctness during and after the simulation
- A SimulationBuilder configuration that runs hundreds of iterations with different seeds
The key-value server is deliberately simple. The interesting part is not the server logic but how the simulation wraps around it, finding bugs you would never catch with unit tests.
Prerequisites
You need Nix for the development environment. All cargo commands run inside nix develop:
nix develop --command cargo build
The simulation binary will live alongside moonpool’s existing simulation binaries, managed by xtask:
cargo xtask sim list # see what exists
cargo xtask sim run kv # run our simulation (once we build it)
The Plan
Chapter: Defining a Process covers implementing the Process trait. We will set up a TCP listener, parse incoming requests, and handle graceful shutdown. The process is a simple in-memory key-value store that loses all state on reboot.
Chapter: Writing a Workload covers implementing the Workload trait. We will define an operation alphabet (get, set, delete), build a reference model that tracks expected state, and use assertions to catch bugs.
Chapter: Configuring the SimulationBuilder covers the builder’s fluent API. We will configure the number of processes, set iteration control, add invariants, and enable chaos phases with attrition.
Chapter: Running and Observing covers execution and output. We will run the simulation, read the report, understand what success and failure look like, and debug a failing seed.
Why This Order
We build bottom-up. The Process comes first because it is the system under test, the thing everything else depends on. The Workload comes second because it exercises the Process. The builder ties them together. Running is last because you need all the pieces in place before you can execute.
Each chapter produces code that builds on the previous one. By the end of the fourth chapter, you will have a working simulation you can extend with your own chaos experiments.
Defining a Process
- The Process Struct
- Implementing the Trait
- Handling Requests
- Registering with the Builder
- What the Shutdown Token Gives You
- Key Takeaways
Our key-value server is a Process. It listens for connections, handles get/set requests, and respects shutdown signals. Everything it does goes through providers, never raw tokio calls.
The Process Struct
Start with the struct. A Process is recreated from a factory on every boot, so the struct starts empty:
#![allow(unused)]
fn main() {
use async_trait::async_trait;
use moonpool_sim::{Process, SimContext, SimulationResult};
struct KvServer;
}
No fields. Each time the simulation boots this process, the factory returns a fresh KvServer. Any data it accumulates lives only until the next crash.
Implementing the Trait
The trait has two methods: name() for identification and run() for the main logic.
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
impl Process for KvServer {
fn name(&self) -> &str {
"kv"
}
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()> {
let listener = ctx.network().bind(ctx.my_ip()).await?;
let mut store: HashMap<String, Vec<u8>> = HashMap::new();
loop {
if ctx.shutdown().is_cancelled() {
break;
}
let accept_result = ctx
.time()
.timeout(Duration::from_millis(100), listener.accept())
.await;
match accept_result {
Ok(Ok((stream, _addr))) => {
handle_connection(stream, &mut store).await;
}
Ok(Err(e)) => {
tracing::warn!("accept error: {}", e);
}
Err(_) => {
// Timeout, loop back and check shutdown
continue;
}
}
}
Ok(())
}
}
}
Walk through this line by line.
Binding the listener: ctx.network().bind(ctx.my_ip()) creates a TCP listener on the process’s assigned IP. In the simulated world, this registers the IP for incoming connections. No real ports are opened.
The store: A plain HashMap held in a local variable. When this process crashes, the map vanishes with the rest of the process’s memory. When the factory creates a new instance, run() starts over with an empty map. This is the “all in-memory state is lost on reboot” principle in action.
The main loop: We loop forever, checking the shutdown token each iteration. ctx.shutdown().is_cancelled() returns true during graceful reboots, giving us a chance to break cleanly. For crash reboots, the framework cancels the entire future, so we never reach this check.
Timeout on accept: We use ctx.time().timeout() instead of tokio::time::timeout(). The simulated timer means the framework controls when the timeout fires, keeping everything deterministic.
Handling Requests
The connection handler parses a simple wire protocol. For a real system, you would use moonpool-transport’s RPC layer, but a raw protocol shows the fundamentals:
#![allow(unused)]
fn main() {
async fn handle_connection(
mut stream: SimTcpStream,
store: &mut HashMap<String, Vec<u8>>,
) {
let mut buf = vec![0u8; 4096];
loop {
let n = match stream.read(&mut buf).await {
Ok(0) => break, // Connection closed
Ok(n) => n,
Err(_) => break, // Connection error
};
// Parse and handle the request
let request = &buf[..n];
let response = match request[0] {
b'G' => { // GET
let key = String::from_utf8_lossy(&request[1..]);
store.get(key.as_ref())
.cloned()
.unwrap_or_default()
}
b'S' => { // SET
// Format: S<key_len:u8><key><value>
let key_len = request[1] as usize;
let key = String::from_utf8_lossy(
&request[2..2 + key_len]
).to_string();
let value = request[2 + key_len..].to_vec();
store.insert(key, value.clone());
value
}
_ => vec![],
};
let _ = stream.write_all(&response).await;
}
}
}
Registering with the Builder
The Process is registered through .processes() on the builder. The first argument is how many instances to run, the second is the factory:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.processes(3, || Box::new(KvServer))
}
This creates 3 KvServer instances at IPs 10.0.1.1, 10.0.1.2, and 10.0.1.3. Each one runs independently. Each one can be killed and restarted independently.
For variable cluster sizes, pass a range:
#![allow(unused)]
fn main() {
.processes(3..=7, || Box::new(KvServer))
}
Now each iteration randomly picks between 3 and 7 servers, deterministically based on the seed.
What the Shutdown Token Gives You
Checking ctx.shutdown() is optional but valuable. During graceful reboots, the simulation cancels the token and gives a grace period. Your process can:
- Finish in-flight requests
- Flush write buffers
- Close connections cleanly so peers see EOF instead of reset errors
If you do not check the token, graceful reboots still work. The framework just force-cancels your future after the grace period expires. But checking gives your process a chance to exit cleanly, which tests a different code path than a hard crash.
Key Takeaways
The pattern for every Process is the same:
- Bind a listener using ctx.network()
- Accept connections in a loop
- Check ctx.shutdown() for graceful termination
- Use ctx.time() for timeouts, never raw tokio
- Keep your state in-memory, expect to lose it
The factory produces a blank instance. The simulation manages the lifecycle. Your job is to write the server logic and let the framework handle chaos.
Writing a Workload
- The Workload Struct
- Setup: Finding the Servers
- Run: The Operation Loop
- Check: Final Validation
- Publishing State for Invariants
- Patterns That Find Bugs
The workload drives our key-value server and checks that it behaves correctly. It sends requests, tracks expected state in a reference model, and uses assertions to catch bugs as they happen.
The Workload Struct
When registered via .workload(), a single workload instance is reused across all iterations, letting it accumulate state. (Workloads registered via .workloads() with a factory are recreated each iteration.) The struct holds everything the workload needs to track:
#![allow(unused)]
fn main() {
use async_trait::async_trait;
use moonpool_sim::{
SimContext, SimulationResult, Workload,
assert_always, assert_sometimes,
};
struct KvWorkload {
/// Number of operations per run
num_ops: usize,
/// Reference model: what we expect the server to contain
model: HashMap<String, Vec<u8>>,
/// Keys we use for operations
keys: Vec<String>,
}
}
The model is the most important field. It mirrors what the server should contain. Every time we write to the server, we write the same value to the model. Every time we read from the server, we compare the result against the model.
Setup: Finding the Servers
The setup() method runs before any workload’s run() starts. Use it to locate processes and prepare connections:
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
impl Workload for KvWorkload {
fn name(&self) -> &str {
"kv-client"
}
async fn setup(&mut self, ctx: &SimContext) -> SimulationResult<()> {
// Verify we have servers to talk to
let process_ips = ctx.topology().all_process_ips();
if process_ips.is_empty() {
return Err(moonpool_sim::SimulationError::InvalidState(
"no server processes available".into(),
));
}
self.model.clear();
Ok(())
}
}
The key discovery mechanism is ctx.topology(). It knows about every participant in the simulation: which IPs are processes, which are workloads, what tags each process has.
Common patterns for finding servers:
#![allow(unused)]
fn main() {
// All server IPs
let all_servers = ctx.topology().all_process_ips();
// A specific peer by name
let server_ip = ctx.peer("server").expect("server exists");
// Servers with a particular role tag
let leaders = ctx.topology().ips_tagged("role", "leader");
}
Run: The Operation Loop
The run() method is where bugs get found. We generate random operations, execute them against the server, and verify each response:
#![allow(unused)]
fn main() {
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()> {
let server_ips = ctx.topology().all_process_ips().to_vec();
for i in 0..self.num_ops {
if ctx.shutdown().is_cancelled() {
break;
}
// Pick a random server
let server_idx = ctx.random().random_range(0..server_ips.len());
let server_ip = &server_ips[server_idx];
// Generate a random operation
let roll = ctx.random().random_range(0..100);
match roll {
0..40 => {
// SET operation
let key = &self.keys[
ctx.random().random_range(0..self.keys.len())
];
let value = format!("v{}", i).into_bytes();
match self.send_set(ctx, server_ip, key, &value).await {
Ok(()) => {
self.model.insert(key.clone(), value);
assert_sometimes!(true, "set_succeeded");
}
Err(e) => {
tracing::warn!("set failed: {}", e);
assert_sometimes!(true, "set_failed_network");
}
}
}
40..80 => {
// GET operation
let key = &self.keys[
ctx.random().random_range(0..self.keys.len())
];
match self.send_get(ctx, server_ip, key).await {
Ok(value) => {
let expected = self.model.get(key)
.cloned()
.unwrap_or_default();
assert_always!(
value == expected,
format!(
"read mismatch for '{}': got {} bytes, expected {}",
key, value.len(), expected.len()
)
);
}
Err(e) => {
tracing::warn!("get failed: {}", e);
}
}
}
_ => {
// Small delay to let simulation events process
let _ = ctx.time().sleep(Duration::from_millis(10)).await;
}
}
}
Ok(())
}
}
Notice the two assertion types working together:
assert_always! guards invariants that must never be violated. If a GET returns data that does not match our model, something is broken. An always-assertion failure is a definite bug.
assert_sometimes! marks paths that should fire at least once across all iterations. If "set_succeeded" never triggers across hundreds of seeds, something is wrong with our test setup. If "set_failed_network" never triggers, we might not have enough chaos.
Check: Final Validation
After all workloads finish and pending events drain, check() runs for final state validation:
#![allow(unused)]
fn main() {
async fn check(&mut self, _ctx: &SimContext) -> SimulationResult<()> {
// Verify model consistency
let total_keys = self.model.len();
assert_always!(
total_keys <= self.keys.len(),
format!(
"model has more keys than expected: {} > {}",
total_keys, self.keys.len()
)
);
Ok(())
}
}
}
The check phase is your last chance to validate. The system is quiet. No messages in flight, no operations pending. What the model says should match what the server actually contains.
Publishing State for Invariants
Workloads can publish state that invariant functions read after every simulation event. This enables cross-workload validation:
#![allow(unused)]
fn main() {
// In run(), after each operation:
ctx.state().publish("kv_model", self.model.clone());
}
An invariant function (registered on the builder) can then read this state:
#![allow(unused)]
fn main() {
fn check_model_size(state: &StateHandle, _sim_time_ms: u64) {
if let Some(model) = state.get::<HashMap<String, Vec<u8>>>("kv_model") {
assert_always!(
model.len() <= 100,
"model grew beyond expected bounds"
);
}
}
}
This is the publish-and-check pattern: the workload publishes its reference model, and invariants validate cross-workload properties after every event. The invariants chapter covers this in depth.
Patterns That Find Bugs
The strongest workloads combine several techniques:
Reference model: Track expected state locally. Compare against actual server responses. Any divergence is a bug.
Weighted operation alphabet: Mix writes, reads, and delays. Control the distribution. Too predictable means you only test happy paths.
Both assertion types: assert_always! for correctness properties that must hold on every single call. assert_sometimes! for coverage goals that should fire at least once across the full run.
Handle failures gracefully: Network errors during chaos are expected. Log them, maybe track them in the model, but do not treat them as test failures. The bug is when the server returns the wrong answer, not when it returns an error.
Configuring the SimulationBuilder
- The Minimal Builder
- Adding Processes
- Iteration Control
- Seed Control
- Tags for Role Distribution
- Invariants
- Chaos and Attrition
- Randomized Network
- Putting It All Together
The SimulationBuilder is the glue. It takes your Process, your Workload, your invariants, and your chaos configuration, and wires them into a runnable simulation.
The Minimal Builder
The simplest possible simulation has one workload and runs once:
#![allow(unused)]
fn main() {
let report = SimulationBuilder::new()
.workload(KvWorkload::new(100, keys))
.run()
.await;
}
This creates a single workload at IP 10.0.0.1, runs it with a random seed, and produces a SimulationReport. No processes, no chaos, no multiple iterations. Useful for smoke testing, but not for finding bugs.
Adding Processes
To test a client-server system, add processes alongside the workload:
#![allow(unused)]
fn main() {
let report = SimulationBuilder::new()
.processes(3, || Box::new(KvServer))
.workload(KvWorkload::new(100, keys))
.run()
.await;
}
The builder creates 3 server processes at 10.0.1.1 through 10.0.1.3 and one workload at 10.0.0.1. The workload finds server IPs through ctx.topology().all_process_ips().
Iteration Control
One iteration is not enough. Different seeds produce different scheduling orders, different random choices, different failure patterns. You need hundreds or thousands of iterations to find bugs hiding in rare interleavings.
Fixed count runs a specific number of iterations:
#![allow(unused)]
fn main() {
.set_iterations(100)
// or equivalently:
.set_iteration_control(IterationControl::FixedCount(100))
}
Time limit runs until a wall-clock deadline:
#![allow(unused)]
fn main() {
.set_time_limit(Duration::from_secs(60))
}
Each iteration gets a different seed, producing a different execution. The seeds are deterministic and derived from the iteration manager, so the same configuration always explores the same seeds.
Seed Control
When a simulation fails on a specific seed, you need to reproduce it. Use set_debug_seeds() to run exactly those seeds:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.processes(3, || Box::new(KvServer))
.workload(KvWorkload::new(100, keys))
.set_debug_seeds(vec![42, 7891])
.run()
.await;
}
This runs exactly 2 iterations with seeds 42 and 7891. Combined with RUST_LOG=error, this is the primary debugging workflow: find the failing seed in the report, reproduce it in isolation, add logging, find the bug.
Tags for Role Distribution
When your distributed system has roles, tags assign them to processes:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.processes(5, || Box::new(ConsensusNode))
.tags(&[
("role", &["leader", "follower"]),
("dc", &["east", "west", "eu"]),
])
}
Tags distribute round-robin. Process 0 gets role=leader, dc=east. Process 1 gets role=follower, dc=west. Process 2 gets role=leader, dc=eu. And so on, wrapping around.
Inside a Process, read tags via ctx.topology().my_tags().get("role"). Inside a Workload, query the tag registry: ctx.topology().ips_tagged("role", "leader") returns the IPs of all leader processes.
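A small sketch of role-targeted traffic from a workload, combining the tag registry with the other context calls shown in this chapter:
#![allow(unused)]
fn main() {
// Hedged sketch: direct the next request at a randomly chosen leader.
let leaders = ctx.topology().ips_tagged("role", "leader");
if !leaders.is_empty() {
    let target = &leaders[ctx.random().random_range(0..leaders.len())];
    let stream = ctx.network().connect(target).await?;
    // ... write the request over `stream` ...
}
}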
Invariants
Invariants run after every simulation event. They check cross-workload properties that must hold at all times, not just at the end.
Trait-based invariant:
#![allow(unused)]
fn main() {
struct AgreementInvariant;
impl Invariant for AgreementInvariant {
fn name(&self) -> &str { "agreement" }
fn check(&self, state: &StateHandle, _sim_time_ms: u64) {
if let Some(model) = state.get::<ConsensusModel>("consensus_model") {
for (slot, values) in &model.committed_values {
let unique: HashSet<_> = values.iter().collect();
assert_always!(unique.len() <= 1, "agreement violated");
}
}
}
}
// Register on builder:
.invariant(AgreementInvariant)
}
Closure-based invariant for simpler cases:
#![allow(unused)]
fn main() {
.invariant_fn("key_count_bounded", |state, _time| {
if let Some(model) = state.get::<KvModel>("kv_model") {
assert_always!(model.len() <= 1000, "too many keys");
}
})
}
Invariants read from the StateHandle, which workloads write to via ctx.state().publish(). This is how the test driver communicates its reference model to the invariant checker.
Chaos and Attrition
Real distributed systems do not just run cleanly. Servers crash, networks partition, and then things have to recover. The builder models this with chaos_duration:
#![allow(unused)]
fn main() {
use moonpool_sim::Attrition;
SimulationBuilder::new()
.processes(3, || Box::new(KvServer))
.workload(KvWorkload::new(200, keys))
.chaos_duration(Duration::from_secs(30))
.attrition(Attrition {
max_dead: 1,
prob_graceful: 0.3,
prob_crash: 0.5,
prob_wipe: 0.2,
recovery_delay_ms: Some(1000..5000),
grace_period_ms: Some(2000..4000),
})
.set_iterations(100)
.run()
.await;
}
The simulation lifecycle:
- Chaos phase (30 simulated seconds): Workloads run concurrently with fault injectors. Attrition randomly kills and restarts processes, respecting max_dead to avoid killing everything at once.
- Workload completion: After chaos ends, faults stop and the system continues until all workloads finish. Workloads should be finite (do N operations, or sleep for a sim-time duration, then return).
- Settle: The orchestrator drains remaining events. If the system does not settle within 30 seconds (sim time), the test fails with diagnostics, surfacing cleanup bugs like leaked tasks or unclosed connections.
Check: The
check()methods run inside the event loop, so network RPCs work normally.
max_dead: 1 means at most one process is down at any time. The probability weights control the mix of graceful shutdowns (shutdown token fired, grace period) versus instant crashes (no warning, connections abort).
Randomized Network
For additional chaos, enable randomized network configuration:
#![allow(unused)]
fn main() {
.random_network()
}
This varies latency, packet delay distributions, and other network parameters per iteration, based on the seed. Without this flag, the network uses default configuration (consistent, low-latency).
Putting It All Together
A production-grade simulation configuration looks like this:
#![allow(unused)]
fn main() {
let report = SimulationBuilder::new()
.processes(3, || Box::new(KvServer))
.tags(&[("role", &["primary", "replica"])])
.workload(KvWorkload::new(500, keys.clone()))
.invariant(ConservationLaw)
.chaos_duration(Duration::from_secs(30))
.attrition(Attrition {
max_dead: 1,
prob_graceful: 0.3,
prob_crash: 0.5,
prob_wipe: 0.2,
recovery_delay_ms: None,
grace_period_ms: None,
})
.random_network()
.set_iterations(100)
.run()
.await;
}
The builder takes care of the rest: creating the simulated world, assigning IPs, seeding the RNG, running the orchestration loop, collecting metrics, and producing the report.
Running and Observing
- Running with xtask
- Reading the SimulationReport
- What Success Means
- When Things Fail
- Debugging a Failing Seed
- cargo nextest vs cargo xtask sim
- Exit Codes
- The Feedback Loop
You have a Process, a Workload, and a builder configuration. Now we run the simulation and make sense of the output.
Running with xtask
The primary way to run simulation binaries is through xtask:
cargo xtask sim list # List all simulation binaries
cargo xtask sim run kv # Run binaries matching "kv"
cargo xtask sim run-all # Run everything
The run subcommand matches against binary names. cargo xtask sim run transport would run both sim-transport-e2e and sim-transport-messaging.
Each simulation binary is a standalone Rust binary that constructs a SimulationBuilder, calls .run().await, and prints the report. A typical main function:
fn main() {
let _ = tracing_subscriber::fmt()
.with_max_level(tracing::Level::WARN)
.try_init();
let local_runtime = tokio::runtime::Builder::new_current_thread()
.enable_io()
.enable_time()
.build_local(Default::default())
.expect("Failed to build local runtime");
let report = local_runtime.block_on(async move {
SimulationBuilder::new()
.processes(3, || Box::new(KvServer))
.workload(KvWorkload::new(200, keys))
.set_iterations(100)
.random_network()
.run()
.await
});
report.eprint();
if !report.seeds_failing.is_empty() || !report.assertion_violations.is_empty() {
std::process::exit(1);
}
}
Notice build_local(), not build(). Moonpool requires a single-threaded local runtime. Using build() will produce runtime errors because the simulation uses !Send types.
Reading the SimulationReport
The report prints to stderr with .eprint(). Here is what a healthy report looks like:
=== Simulation Report ===
Iterations: 100 | Passed: 100 | Failed: 0 | Rate: 100.0%
Avg Wall Time: 12ms Total: 1.20s
Avg Sim Time: 45.23s
Avg Events: 8,432
--- Assertions (4) ---
PASS [always ] "read matches model" 12,847 pass 0 fail
PASS [always ] "conservation law" 8,200 pass 0 fail
PASS [sometimes ] "set_succeeded" 6,102 / 12,847 (47.5%)
PASS [sometimes ] "set_failed_network" 412 / 12,847 (3.2%)
The critical lines:
- Rate: 100.0% means no iteration panicked or returned an error
- 0 fail on always-assertions means no invariant violations
- PASS on sometimes-assertions means every coverage goal was hit at least once
What Success Means
A simulation succeeds when two conditions hold simultaneously:
- No always-assertion violations: Every assert_always! passed on every evaluation across all iterations
- All sometimes-assertions fired: Every assert_sometimes! evaluated to true at least once across all iterations
Both matter. A simulation that never violates invariants but also never exercises error paths is not testing enough. A simulation that hits every code path but tolerates wrong answers is not checking enough.
When Things Fail
A failing report shows faulty seeds and violations:
=== Simulation Report ===
Iterations: 100 | Passed: 98 | Failed: 2 | Rate: 98.0%
Faulty seeds: [7891, 42033]
--- Assertions (4) ---
FAIL [always ] "read matches model" 12,800 pass 47 fail
PASS [always ] "conservation law" 8,200 pass 0 fail
PASS [sometimes ] "set_succeeded" 6,102 / 12,847 (47.5%)
PASS [sometimes ] "set_failed_network" 412 / 12,847 (3.2%)
--- Assertion Violations ---
- Always "read matches model": 47 failures out of 12,847 evaluations
The report tells you:
- Which seeds failed: 7891 and 42033
- Which assertion broke: “read matches model” had 47 failures
- How often: 47 out of 12,847 evaluations, so it is a rare condition
Debugging a Failing Seed
Take the failing seed and isolate it:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.processes(3, || Box::new(KvServer))
.workload(KvWorkload::new(200, keys))
.set_debug_seeds(vec![7891])
.run()
.await;
}
Now increase logging. Set the environment variable:
RUST_LOG=debug cargo xtask sim run kv
Because the simulation is deterministic, seed 7891 reproduces the exact same scheduling, the exact same random choices, the exact same failure. You can add tracing::debug! statements, rerun with the same seed, and see exactly what happened.
The debugging workflow:
- Find the seed in the report’s faulty seeds list
- Isolate it with set_debug_seeds(vec![seed])
- Add logging in the Process and Workload code
- Rerun and trace the execution
- Fix the root cause in the Process code
- Verify by running the full iteration suite again
cargo nextest vs cargo xtask sim
Moonpool has two ways to run tests:
cargo nextest run runs unit tests and integration tests. These are fast, focused tests for specific modules. Use nextest during development for quick feedback.
cargo xtask sim run runs simulation binaries. These are comprehensive, multi-iteration chaos tests that take longer but find deeper bugs. Use xtask for validation before merging.
Both should pass before work is complete. The typical workflow: write code, run nextest for fast iteration, then run xtask sim for thorough validation.
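In practice that loop is two commands, run in this order (both are shown earlier in this chapter):
cargo nextest run # fast unit and integration feedback
cargo xtask sim run kv # thorough multi-iteration chaos validation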
Exit Codes
Simulation binaries should exit with code 0 on success and code 1 on failure. The standard pattern:
#![allow(unused)]
fn main() {
if !report.seeds_failing.is_empty()
|| !report.assertion_violations.is_empty()
{
std::process::exit(1);
}
}
Check both seeds_failing (iterations that panicked or errored) and assertion_violations (always-type assertions that failed). Coverage violations (coverage_violations) indicate sometimes-assertions that never fired, which may or may not be worth failing the build over depending on your testing philosophy.
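If your philosophy says missed coverage should fail the build too, the same pattern extends; a hedged sketch (the SIM_STRICT variable is illustrative):
#![allow(unused)]
fn main() {
// Hedged sketch: opt-in strict mode that also fails on missed coverage goals.
let strict = std::env::var("SIM_STRICT").is_ok();
if strict && !report.coverage_violations.is_empty() {
    std::process::exit(1);
}
}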
The Feedback Loop
Simulation testing is iterative. You write a workload, run it, find a bug, fix the bug, add assertions to prevent regression, run again. Each round makes the system more robust.
The report’s assertion table is your scoreboard. When you see all PASS with high hit counts, you know your test is both thorough and correct. When you see MISS on a sometimes-assertion, you know there is a code path your chaos is not reaching. When you see FAIL on an always-assertion, you know there is a real bug to fix.
This is the rhythm of simulation-driven development: build, test, observe, improve.
Chaos Testing vs Simulation
- Two Schools of Breaking Things
- Chaos Engineering: Learning from Production
- Simulation: Breaking Things Before They Exist
- Simulation Subsumes Chaos
- Complementary, Not Competing
- Moonpool’s Approach
Two Schools of Breaking Things
There are two dominant approaches to testing distributed systems under failure. Chaos engineering takes a running production system and injects faults into it: kill a VM, drop a network link, fill a disk. Simulation testing builds a fake world and runs the system inside it, injecting faults by rewriting the laws of physics. Both break things on purpose. But they break things in fundamentally different ways, and understanding those differences shapes how we use them.
Chaos Engineering: Learning from Production
Chaos engineering, popularized by Netflix’s Chaos Monkey, treats production (or a production-like staging environment) as the laboratory. You pick an experiment (“what happens if we kill this database node?”), form a hypothesis (“traffic reroutes within 5 seconds”), run the experiment against real infrastructure, and observe what happens.
This approach has real strengths. It tests the actual system with real configurations, real dependencies, and real network stacks. It catches problems that no amount of pre-production testing would find: misconfigured load balancers, stale DNS caches, monitoring gaps. When chaos engineering finds something, you know it matters because it happened in the real world.
But chaos engineering has structural limitations:
- Non-deterministic. You cannot reproduce the exact sequence of events that caused a failure. The bug happened once, under conditions you cannot fully reconstruct.
- Reactive. You find problems after they exist in production, not before.
- Slow. Each experiment takes real time against real infrastructure. Running one scenario might take minutes. Running a million is not practical.
- Blast radius. Real users can be affected. Even with careful scoping, an experiment that goes wrong can cause actual outages.
Chaos engineering finds symptoms. The database went down and something bad happened. But why? Was it a race condition? A missing retry? A lock held too long during recovery? Getting from symptom to root cause requires forensic debugging after the fact, often without a way to reproduce the exact failure.
Simulation: Breaking Things Before They Exist
Simulation testing takes the opposite approach. Instead of injecting faults into a real system, you build a simulated world where faults are a first-class feature. The network drops packets because you told it to. Disks corrupt writes because the configuration says they should. Clocks drift because the simulator makes them drift.
Everything runs in a single process, single-threaded, driven by a seeded pseudorandom number generator. The same seed produces the same execution, every time. A failing test is not a flaky signal. It is a reproducible bug with an exact replay.
The advantages follow directly:
- Deterministic. Every failure is reproducible. A failing seed is a permanent regression test.
- Proactive. You find bugs before code reaches production, before it reaches staging, often before it leaves a developer’s machine.
- Fast. No real I/O, no real time. Moonpool simulates hundreds of seconds of cluster behavior in single-digit real seconds. FoundationDB ran 5 to 10 million simulation runs per night.
- Exhaustive. Different seeds explore different fault combinations. Run enough seeds and you cover failure scenarios no human would think to write tests for.
Simulation Subsumes Chaos
Here is the key insight: simulation subsumes chaos engineering. Everything chaos engineering does, simulation does too, but with reproducibility, speed, and exhaustiveness added on top.
Chaos engineering injects a random partition? Simulation injects a random partition, and you can replay the exact moment it happened with the exact timing. Chaos engineering kills a process? Simulation kills a process, restarts it, and verifies the recovery path, all in milliseconds of real time.
Will Wilson made this point concrete with FoundationDB’s Sinkhole: a rack of real servers wired to programmable power switches, toggled continuously to validate the real system. Sinkhole never found a single database bug that simulation had missed. It only found bugs in other software and in hardware. The simulation was a stricter, more adversarial environment than reality.
Complementary, Not Competing
This does not mean chaos engineering is useless. The two approaches are complementary:
Simulation is for development. It runs on every commit, explores millions of fault scenarios, and catches bugs before they ship. It tests the logic of your system against the worst possible world.
Chaos engineering is for production validation. It verifies that your real deployment, with its real configuration and real dependencies, behaves as expected. It catches the things simulation cannot model: actual kernel behavior, real cloud provider quirks, misconfigured infrastructure.
Think of simulation as the wind tunnel and chaos engineering as the flight test. You would never skip the wind tunnel just because you plan to fly the plane eventually. And you would never skip the flight test just because the wind tunnel looked good.
Moonpool’s Approach
In moonpool, chaos injection is a tool within the simulation, not an alternative to it. Network faults, storage corruption, process reboots, code path perturbation: all of these are chaos techniques, but they run inside a deterministic simulator where every fault is reproducible and every failure is debuggable.
The next chapters cover four dimensions of chaos that moonpool provides: buggify for code-level fault injection, attrition for process lifecycle chaos, network faults, and storage faults. Each one amplifies failure probabilities so that rare bugs become common, while keeping everything controlled by a single seed.
Chaos in Moonpool
- The Philosophy: Make Rare Bugs Common
- Four Dimensions of Chaos
- Determinism Is Non-Negotiable
- The Next Four Chapters
The Philosophy: Make Rare Bugs Common
Real distributed systems fail in ways that are individually rare but collectively inevitable. A network partition happens once a month. A disk corrupts a write once a year. A process crashes during recovery once in a thousand deployments. Individually, each event is unlikely. But multiply enough low-probability events across enough nodes and enough time, and the question is not if but when.
The core philosophy of chaos in moonpool is simple: amplify failure probabilities so that rare bugs become common. If a network partition happens once a month in production, make it happen every few seconds in simulation. If a disk write corrupts once a year, corrupt one every hundred writes. If a process crashes at the worst possible moment once in a thousand runs, make it crash at the worst possible moment every run.
This is not reckless. It is strategic. By making failures frequent, we force the code to handle them every time, not just when a developer remembers to write a test for them. And because everything runs inside a deterministic simulation, every failure is reproducible. A bug found at high fault probability is the same bug that would have appeared at low probability in production, just found a thousand times faster.
Four Dimensions of Chaos
Moonpool provides chaos injection along four distinct dimensions, each targeting a different layer of the system:
Code Paths: Buggify
Inspired by FoundationDB’s BUGGIFY macro, buggify!() scatters fault injection points throughout your application code. Each point randomly activates once per simulation run, then fires probabilistically on each call. This forces your code down error paths, timeout branches, and recovery logic that would otherwise require precise failure timing to exercise.
Infrastructure: Attrition
Processes crash. Processes restart. Sometimes they restart cleanly, sometimes they lose all their data. Attrition automatically cycles your server processes through graceful shutdowns, crashes, and data-wiping restarts during the chaos phase, while respecting a max_dead constraint that keeps enough processes alive for the system to remain operational.
Network Faults
TCP connections fail in subtle ways. Moonpool simulates connection drops, latency spikes, packet corruption, clock drift, partial writes, network partitions, and half-open connections. The simulation operates at the TCP connection level, not the packet level, because connection-level faults are what distributed systems actually need to handle.
Storage Faults
Disks lie. Following TigerBeetle’s fault model, moonpool injects read corruption, write corruption, torn writes, misdirected I/O, phantom writes, and sync failures. These are the faults that data-integrity code must survive, and they are nearly impossible to test without simulation.
Determinism Is Non-Negotiable
Every chaos mechanism in moonpool is controlled by the simulation seed. The same seed produces the same faults in the same order at the same times. This means:
- A failing seed is a permanent, reproducible bug report
- You can debug a failure by replaying it with logging turned up
- Fixing a bug and re-running the seed verifies the fix
- Different seeds explore different fault combinations automatically
This is what separates simulation chaos from production chaos. In production, a network partition happened and something went wrong, but you cannot reproduce the exact sequence of events. In simulation, you hand someone a seed number and they see exactly what you saw.
The Next Four Chapters
Each dimension of chaos has its own chapter with configuration details, code examples, and the design reasoning behind the choices moonpool makes:
- Buggify: Fault Injection covers the `buggify!()` macro and code-level fault injection patterns
- Attrition: Process Reboots covers automatic process lifecycle chaos
- Network Faults covers TCP-level network fault simulation
- Storage Faults covers TigerBeetle-inspired storage fault patterns
Buggify: Fault Injection
- The Idea
- Two-Phase Activation
- Five Injection Patterns
- Probability Calibration
- Anti-Patterns
- Production Safety
The Idea
Most distributed system bugs do not live in the happy path. They hide in error handlers, timeout branches, retry logic, and recovery code. These paths are exercised only when something goes wrong, which means they are the least tested code in the system and the most critical when failures happen.
FoundationDB solved this with a technique called BUGGIFY: scatter conditional fault injection points throughout the codebase, activated only during simulation. Each point perturbs code behavior in a small way: return an error instead of success, add an artificial delay, randomize a buffer size. Run enough seeds and these perturbations force the code through every error path, every retry loop, every recovery sequence.
Moonpool implements this as the buggify!() macro.
Two-Phase Activation
A naive approach would fire every buggify point on every call. That produces chaos but not useful chaos. If every operation fails, the system never makes progress and you never test the interesting interactions between partial failures.
Moonpool uses FoundationDB’s two-phase activation model:
Phase 1: Activation. The first time a buggify location is encountered during a simulation run, it is randomly activated or deactivated. This decision is fixed for the entire run. A location that is deactivated will never fire, no matter how many times it is reached.
Phase 2: Firing. Each time an activated location is reached, it fires with a fixed probability (25% by default). This means an active buggify point fires roughly one in four times, creating a mix of successful and failed operations.
#![allow(unused)]
fn main() {
// Each call site is a unique location (identified by file:line).
// First encounter: 50% chance of activation (configurable).
// Subsequent encounters at an active site: 25% chance of firing.
if buggify!() {
return Err(Error::timeout("simulated timeout"));
}
}
The activation probability is configurable per simulation. The firing probability can be customized per call site:
#![allow(unused)]
fn main() {
// Fire at 50% probability instead of the default 25%
if buggify_with_prob!(0.5) {
buffer_size = 1; // Force single-byte reads
}
}
This two-phase design means each seed tests a different combination of active fault injection points. Seed 42 might activate the timeout injection in your RPC layer but deactivate the one in your storage engine. Seed 43 might do the reverse. Run enough seeds and you cover the combinatorial space of fault interactions.
Five Injection Patterns
FoundationDB’s codebase uses BUGGIFY in five recurring patterns, all of which translate directly to moonpool:
1. Error Injection on Success Paths
The most common pattern. After a successful operation, sometimes return an error anyway:
#![allow(unused)]
fn main() {
let result = connection.send(message).await;
if result.is_ok() && buggify!() {
// Force the caller through its error handling path
return Err(Error::io("buggified send failure"));
}
}
2. Artificial Delays
Inject delays to expose race conditions and timing-dependent bugs:
#![allow(unused)]
fn main() {
if buggify!() {
// Slow down this operation to widen race windows
time.sleep(Duration::from_millis(100)).await?;
}
}
3. Parameter Randomization
Vary sizes, timeouts, and limits to test edge cases:
#![allow(unused)]
fn main() {
let batch_size = if buggify!() {
// Test with tiny batches to exercise boundary conditions
random.random_range(1..3)
} else {
DEFAULT_BATCH_SIZE
};
}
4. Alternative Code Paths
Force the system down paths it rarely takes:
#![allow(unused)]
fn main() {
let should_compact = needs_compaction() || buggify_with_prob!(0.1);
if should_compact {
// Exercise compaction logic more frequently
compact_storage().await?;
}
}
5. Process Restarts
Trigger restarts at specific points to test crash recovery:
#![allow(unused)]
fn main() {
if buggify_with_prob!(0.01) {
// Simulate a crash right after writing but before syncing
return Err(Error::crash("buggified crash after write"));
}
}
Probability Calibration
Not all buggify points should fire at the same rate. FoundationDB uses a three-tier calibration:
| Tier | Probability | Use Case |
|---|---|---|
| High | 5-10% | Common edge cases: buffer boundaries, retry paths |
| Medium | 1% | Standard scenarios: timeout handling, connection resets |
| Low | 0.1-0.01% | Rare critical failures: data corruption, crash during sync |
The default 25% firing probability works well for most injection points. Use buggify_with_prob!() when you need more control. High-probability points are useful for paths you want exercised frequently. Low-probability points model events that are individually rare but must be handled correctly.
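A schematic sketch of tiered calibration, with illustrative probabilities and the Error constructors borrowed from the earlier examples:

// High tier: common edge case, exercised frequently
if buggify_with_prob!(0.05) {
    buffer_size = 1; // force boundary-condition handling
}

// Medium tier: standard failure scenario
if buggify_with_prob!(0.01) {
    return Err(Error::io("simulated connection reset"));
}

// Low tier: rare critical failure
if buggify_with_prob!(0.0001) {
    return Err(Error::crash("simulated crash during sync"));
}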
Anti-Patterns
Do not use buggify for business logic. Buggify is for simulating infrastructure failures, not for feature flags or A/B testing. If the buggified branch changes application semantics rather than injecting a fault, it belongs in your application logic, not in a buggify block.
Do not use non-deterministic random. Buggify uses sim_random() internally, which is controlled by the simulation seed. Never mix in rand::random() or other non-deterministic entropy. That breaks reproducibility.
Do not use excessive probabilities without reason. A buggify_with_prob!(0.9) that fails 90% of attempts means the system almost never succeeds. That tests error handling but misses the interesting interactions between partial success and partial failure.
Production Safety
Buggify is gated behind simulation state. When the simulation is not running, buggify!() always returns false. There is no runtime cost in production: the check is a thread-local boolean read. You can leave buggify calls in your production code without worrying about them firing outside simulation.
This is the same guarantee FoundationDB provides: BUGGIFY is gated behind g_network->isSimulated(), ensuring zero production impact regardless of how aggressively chaos is injected during testing.
Attrition: Process Reboots
- Why Processes Need to Die
- Three Kinds of Reboot
- The Attrition Configuration
- Using Attrition
- The max_dead Constraint
- Custom Fault Injection
Why Processes Need to Die
A distributed system that only works when all nodes are healthy is not a distributed system. It is a single point of failure with extra network hops. Real clusters lose nodes constantly: rolling deployments restart processes, kernel panics crash them, power failures wipe their storage. Your system must handle all of these, and it must handle them while continuing to serve requests.
Attrition is moonpool’s built-in mechanism for automatically killing and restarting server processes during simulation. It runs during the chaos phase, picks random processes, kills them in various ways, waits a recovery delay, and restarts them. The goal is to continuously verify that your system can tolerate node failures without manual intervention.
Three Kinds of Reboot
Moonpool provides three reboot types, each modeling a different real-world failure:
Graceful
The controlled shutdown. A cancellation token fires, giving the process a grace period to drain buffers, close connections cleanly, and flush pending writes. If the process does not exit within the grace period, it gets force-killed. After shutdown, connections deliver remaining buffered data (FIN semantics), then the process restarts with fresh state.
This models rolling deployments, planned maintenance, and well-behaved process managers. The process has a chance to clean up, but that chance is time-bounded.
Crash
The sudden death. The process task is immediately cancelled. All connections abort with no buffer drain. Peers see connection reset errors. Any in-memory state is lost. The process restarts after a recovery delay.
This models kernel panics, OOM kills, and hardware failures. There is no warning and no cleanup. Code that assumes a graceful shutdown will always happen gets a rude surprise.
CrashAndWipe
The worst case. Same as Crash, but all persistent storage for the process is also deleted. The process restarts as if it were a brand new node joining the cluster for the first time.
This models total disk failures, accidental data deletion, or replacing a failed machine with a fresh one. The wipe is scoped to the crashed process’s IP address, so other processes’ storage is unaffected. Systems that rely on durable state for recovery must handle the case where that state is gone.
The Attrition Configuration
Attrition is configured through the Attrition struct:
#![allow(unused)]
fn main() {
Attrition {
max_dead: 1,
prob_graceful: 0.3,
prob_crash: 0.5,
prob_wipe: 0.2,
recovery_delay_ms: Some(1000..10000),
grace_period_ms: Some(2000..5000),
}
}
max_dead is the most important field. It caps the number of simultaneously dead processes. If you have a 3-node cluster with max_dead: 1, attrition will never kill a second node before the first has restarted. This ensures the system always has enough live nodes to remain operational (assuming your replication factor matches).
prob_graceful, prob_crash, prob_wipe are weights, not probabilities. They do not need to sum to 1.0. The attrition injector normalizes them internally and picks a reboot kind by weighted random selection. Setting prob_wipe: 0.0 disables wipe reboots entirely.
recovery_delay_ms controls how long a dead process stays dead before restarting. The actual delay is drawn randomly from this range, so different seeds test different recovery timings. The default is 1 to 10 seconds of simulated time.
grace_period_ms controls how long a graceful shutdown has to complete. Again, drawn randomly from the range. The default is 2 to 5 seconds.
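Because the three prob_ fields are weights, the configuration below is valid even though the values sum to 3.0. It yields one third graceful shutdowns, two thirds crashes, and no wipes:

Attrition {
    max_dead: 1,
    prob_graceful: 1.0,
    prob_crash: 2.0,
    prob_wipe: 0.0,          // wipe reboots disabled entirely
    recovery_delay_ms: None, // use the 1-10s default
    grace_period_ms: None,   // use the 2-5s default
}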
Using Attrition
Attrition is configured on the simulation builder and requires a chaos duration:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.processes(3, || Box::new(MyProcess::new()))
.attrition(Attrition {
max_dead: 1,
prob_graceful: 0.3,
prob_crash: 0.5,
prob_wipe: 0.2,
recovery_delay_ms: None, // use defaults
grace_period_ms: None, // use defaults
})
.chaos_duration(Duration::from_secs(60))
.workload(MyWorkload::new())
.run()
.await;
}
The .chaos_duration() call is required because attrition runs only during the chaos phase. After the chaos duration elapses, fault injectors stop and the system continues until all workloads complete. A settle phase then drains remaining events before checks run, surfacing cleanup bugs rather than hiding them behind an arbitrary timer.
The max_dead Constraint
max_dead deserves special attention because it is the bridge between chaos and correctness. Without it, attrition could kill all your nodes simultaneously, which is technically chaos but not useful chaos. No distributed system survives simultaneous failure of all replicas.
Set max_dead to match your system’s fault tolerance. A system with replication factor 3 can tolerate 1 failure, so max_dead: 1. A system that needs 3 of 5 nodes alive should use max_dead: 2. This ensures attrition tests failures your system should survive, not failures that are inherently unrecoverable.
Custom Fault Injection
Attrition covers the common case of random process reboots. For more targeted fault injection, implement the FaultInjector trait:
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
impl FaultInjector for RollingRestart {
fn name(&self) -> &str { "rolling_restart" }
async fn inject(&mut self, ctx: &FaultContext) -> SimulationResult<()> {
// Restart each process in order, waiting for recovery between each
for ip in ctx.process_ips().to_vec() {
ctx.reboot(&ip, RebootKind::Graceful)?;
ctx.time().sleep(Duration::from_secs(15)).await
.map_err(|e| SimulationError::InvalidState(e.to_string()))?;
if ctx.chaos_shutdown().is_cancelled() {
break;
}
}
Ok(())
}
}
}
The FaultContext provides access to process reboots, network partitions, and tag-based targeting. You can combine built-in attrition with custom fault injectors by registering both on the builder.
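A hedged sketch of combining the two. The `.fault_injector(...)` registration call is a hypothetical name used for illustration; check the builder API for the actual method:

SimulationBuilder::new()
    .processes(5, || Box::new(MyProcess::new()))
    .attrition(Attrition {
        max_dead: 2,
        prob_graceful: 0.3,
        prob_crash: 0.5,
        prob_wipe: 0.2,
        recovery_delay_ms: None,
        grace_period_ms: None,
    })
    // Hypothetical registration call for a custom FaultInjector
    .fault_injector(Box::new(RollingRestart))
    .chaos_duration(Duration::from_secs(60))
    .workload(MyWorkload::new())
    .run()
    .await;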
Network Faults
- TCP, Not Packets
- The Fault Catalog
- Graceful vs Abort Disconnect
- The Swizzling Insight
- Configuration in Practice
TCP, Not Packets
Moonpool simulates network faults at the TCP connection level, not the individual packet level. This is a deliberate design choice, inherited from FoundationDB. In practice, distributed systems rarely deal with individual packets. They deal with connections: connections that drop, connections that stall, connections that report success on one side and failure on the other. These are the faults that matter for application correctness.
Packet-level simulation (what TigerBeetle does) is useful for testing network stacks themselves. But for application-level distributed systems, connection-level faults exercise the code paths that actually fail in production: reconnection logic, request retries, leader election on disconnect, and state reconciliation after a partition.
The Fault Catalog
Moonpool’s ChaosConfiguration controls a wide range of network faults. Each fault is independently configurable and randomized per seed when using NetworkConfiguration::random_for_seed().
Latency Injection
Every network operation (bind, accept, connect, read, write) has a configurable latency range. The simulator picks a random duration from the range for each operation. This models the basic reality that network operations take time, and that time varies.
For tail latency testing, moonpool supports bimodal latency distribution, following FoundationDB’s halfLatency() pattern. In bimodal mode, 99.9% of operations use normal latency, but 0.1% experience latencies multiplied by 5x to 20x. This is how real networks behave: most requests are fast, but a small fraction hit GC pauses, cross-datacenter hops, or congestion.
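A rough sketch of the bimodal pick, using the rand crate's Rng trait for illustration (names and thresholds are illustrative; inside the simulator the draw would come from the seeded RNG, not ad-hoc entropy):

fn bimodal_latency(rng: &mut impl rand::Rng, normal: Duration) -> Duration {
    if rng.gen_bool(0.001) {
        // Tail case: 0.1% of operations take 5x to 20x the normal latency
        let multiplier: u32 = rng.gen_range(5..=20);
        normal * multiplier
    } else {
        normal
    }
}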
Connection Drops
Random close injects spontaneous connection failures during I/O operations, at a configurable probability (default 0.001%). When triggered, 30% of closes are explicit (the caller gets an error) and 70% are silent (the connection just stops working). This ratio, taken from FoundationDB, tests both error-handling paths and timeout-based failure detection.
A cooldown period prevents cascading closes from overwhelming the system. The goal is to test recovery, not to make the system completely inoperable.
Clogging
Write clogging stalls data delivery on a connection for a random duration (100-300ms by default). This simulates network congestion, TCP backpressure, and flow control contention. Code that assumes writes complete promptly will fail under clogging.
Partial Writes
Writes are truncated to a random length (0 to 1000 bytes by default), following FoundationDB’s approach. This tests TCP fragmentation handling and message framing logic. If your wire protocol assumes that a single write delivers a complete message, partial writes will break that assumption immediately.
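One defensive pattern, sketched here with tokio's I/O traits for illustration: length-prefix every message and loop until the full frame is transferred, so a truncated write can never split a frame.

use tokio::io::{AsyncReadExt, AsyncWriteExt};

async fn send_frame<W: AsyncWriteExt + Unpin>(w: &mut W, payload: &[u8]) -> std::io::Result<()> {
    // write_all retries internally until every byte is accepted
    w.write_all(&(payload.len() as u32).to_be_bytes()).await?;
    w.write_all(payload).await
}

async fn recv_frame<R: AsyncReadExt + Unpin>(r: &mut R) -> std::io::Result<Vec<u8>> {
    // read_exact loops across fragmented reads until the frame is complete
    let mut len = [0u8; 4];
    r.read_exact(&mut len).await?;
    let mut payload = vec![0u8; u32::from_be_bytes(len) as usize];
    r.read_exact(&mut payload).await?;
    Ok(payload)
}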
Bit Flips
Packet data is corrupted with random bit flips at low probability (0.01% by default). The number of flipped bits follows a power-law distribution between 1 and 32. This tests checksum validation and corruption detection. Without bit-flip injection, corruption bugs only surface in production when cosmic rays or faulty NICs flip bits for you.
Clock Drift
Simulated clocks can drift by up to 100ms (configurable) between nodes. This tests anything that depends on time agreement: lease expiration, distributed consensus, TTL handling, and cache invalidation. Clock drift is subtle because the code often works correctly with small drift and fails catastrophically when drift exceeds a threshold.
Network Partitions
Moonpool supports three partition strategies:
| Strategy | Behavior | Tests |
|---|---|---|
| Random | Random IP pairs partitioned | General chaos |
| UniformSize | Partition of random size (1 to n-1 nodes) | Various quorum scenarios |
| IsolateSingle | One node isolated from all others | Common production failure |
Partitions have configurable probability and duration. They can be programmatic (via FaultContext::partition) or automatic (via partition_probability in the chaos config).
Connect Failures
Connection establishment can fail in two modes, following FoundationDB’s SIM_CONNECT_ERROR_MODE:
- AlwaysFail: Every buggified connect attempt returns `ConnectionRefused`
- Probabilistic: 50% fail with `ConnectionRefused`, 50% hang forever (never complete)
The hanging mode is particularly nasty. Code that does not implement connect timeouts will block forever, which is exactly the kind of bug simulation should find.
Graceful vs Abort Disconnect
When a connection closes, moonpool models two distinct TCP behaviors:
Graceful close implements TCP half-close semantics. The closing side marks its send direction as closed and schedules a FinDelivery event that arrives after all in-flight data has been delivered. The remote side continues reading buffered data normally and sees EOF only after the FIN arrives. This models a clean shutdown(SHUT_WR) followed by close().
Abort close immediately terminates both directions. No FIN, no buffer drain. The remote side gets a connection reset error on its next read or write. This models a crashed process or a force-killed connection.
The distinction matters because many protocols depend on reading remaining data after the peer signals shutdown. HTTP relies on this for responses delimited by connection close. gRPC uses it for trailing metadata. If your simulation only models abort closes, you will miss bugs in graceful shutdown handling.
The Swizzling Insight
One finding from FoundationDB’s simulation work deserves special mention: restoring network connections in reverse order of disconnection finds more bugs than restoring in forward order. This is called swizzling. As Will Wilson described it: “for reasons that we totally don’t understand, this is better at finding bugs than normal clogging.”
Why does this work? Forward restoration tests the easy case: the first connection dropped is the first restored, so recovery happens in the order the system expects. Reverse restoration forces the system to handle partial recovery where the most recently dropped connection comes back first. This creates asymmetric states that exercise recovery logic in ways no developer would think to test manually.
This is the kind of insight that only falls out of running thousands of simulations. No one sat down and reasoned that reverse-order restoration would find more bugs. The simulator tried both and the data spoke for itself.
Configuration in Practice
For maximum chaos testing, use NetworkConfiguration::random_for_seed(). This randomizes all parameters based on the simulation seed, so different seeds test different network conditions:
#![allow(unused)]
fn main() {
let network_config = NetworkConfiguration::random_for_seed();
}
For fast unit tests where network chaos would just slow things down, use NetworkConfiguration::fast_local():
#![allow(unused)]
fn main() {
let network_config = NetworkConfiguration::fast_local();
// Minimal latencies, all chaos disabled
}
For targeted testing of specific fault types, start with defaults and override:
#![allow(unused)]
fn main() {
let mut config = NetworkConfiguration::default();
config.chaos.partition_probability = 0.05;
config.chaos.partition_strategy = PartitionStrategy::IsolateSingle;
// Everything else at defaults
}
Storage Faults
- Disks Lie
- The Fault Taxonomy
- Performance Simulation
- The Step Loop Pattern
- Per-Process Storage Configuration
- Crash and Wipe Operations
- Configuration in Practice
Disks Lie
Every database developer eventually learns this lesson. write() returns success, but the data never reaches the platter. fsync() completes, but the drive’s firmware lied about flushing its cache. A cosmic ray flips a bit in DRAM between computing a checksum and writing to disk. A firmware bug directs a write to the wrong sector.
These are not hypothetical failures. TigerBeetle’s documentation catalogs them with references to real incidents: LSE studies showing 8.5% of SATA drives developing silent corruption, firmware bugs causing misdirected writes across drives in a RAID array, and enterprise SSDs that acknowledge fsync without actually flushing.
Moonpool’s storage fault injection is modeled on TigerBeetle’s fault taxonomy. The goal is to test that your data integrity code actually works, not by hoping these faults happen in production, but by making them happen deterministically in simulation.
The Fault Taxonomy
Moonpool’s StorageConfiguration controls seven types of storage faults:
Read Corruption
A read operation returns wrong data. The file contains correct bytes, but the value returned to the application has been corrupted. This models ECC failures, DRAM bit flips, and controller firmware bugs.
What it tests: Checksum validation on reads. If your system trusts data without verifying checksums, read corruption will silently propagate bad data through the system.
Write Corruption
A write operation stores wrong data. The application writes correct bytes, but what lands on disk is different. This models controller bugs, bad sectors, and write buffer corruption.
What it tests: Read-after-write verification and end-to-end checksums. Systems that compute checksums before writing and verify after reading will detect write corruption. Systems that do not will store garbage.
Crash Faults (Torn Writes)
The system crashes mid-write. Some bytes are written, others are not. This models power failures, kernel panics, and OOM kills during I/O.
What it tests: Write-ahead logging, atomic write protocols, and crash recovery. Any system that performs multi-step writes without a journal or atomic commit is vulnerable to torn writes.
Misdirected Writes
A write lands at the wrong location. The application writes to offset A, but the data ends up at offset B. This models firmware bugs and controller errors that TigerBeetle specifically documents as real-world failures.
What it tests: Per-record addressing verification. Systems that embed the expected offset in each record’s header can detect misdirected writes. Systems that trust the filesystem to put data where it was told will read the wrong records.
Misdirected Reads
A read returns data from the wrong location. The application reads offset A, but gets the contents of offset B. Same root causes as misdirected writes, from the read side.
What it tests: Same as misdirected writes. Checksums that include the expected position catch this.
Phantom Writes
A write appears to succeed but does not persist. The write() call returns Ok(n) and even fsync() completes, but the data is gone after a restart. This models drive firmware that lies about durability.
What it tests: Durability verification after recovery. Systems that write, sync, crash, and restart must verify that their data survived. Phantom writes ensure this verification logic works.
Sync Failures
sync_all() returns an error. This models disk errors during flush, full disks, and I/O errors that only manifest at sync time.
What it tests: Error handling in durability-critical code paths. Many systems call fsync() but do not check the return value. In simulation, a sync failure is a loud signal that your error handling has a gap.
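A small sketch of the pattern this fault is designed to catch, assuming file is a provider file handle:

file.write_all(b"record").await?;
if let Err(e) = file.sync_all().await {
    // The write may not be durable; propagate instead of assuming success
    return Err(e);
}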
Performance Simulation
Beyond faults, moonpool simulates realistic storage performance characteristics:
| Parameter | Default | Description |
|---|---|---|
| IOPS | 25,000 | Operations per second (SATA SSD range) |
| Bandwidth | 150 MB/s | Maximum throughput |
| Read latency | 50-200us | Per-operation delay |
| Write latency | 100-500us | Per-operation delay |
| Sync latency | 1-5ms | Per-sync delay |
These parameters ensure that storage-heavy code paths experience realistic timing, which is important for testing timeout logic and concurrent I/O patterns.
The Step Loop Pattern
Storage operations in moonpool differ from network operations in one critical way: storage operations return Poll::Pending and require simulation stepping. Network operations buffer data and return Poll::Ready immediately. Storage operations need the simulation engine to advance time and process the I/O.
This means you cannot just await a storage operation in a test. You need the step loop pattern:
#![allow(unused)]
fn main() {
let handle = tokio::task::spawn_local(async move {
// This runs inside the simulation
let mut file = provider.open("test.txt", OpenOptions::create_write()).await?;
file.write_all(b"hello").await?;
file.sync_all().await
});
// Drive the simulation until the task completes
while !handle.is_finished() {
while sim.pending_event_count() > 0 {
sim.step(); // Process one simulation event
}
tokio::task::yield_now().await; // Let the spawned task make progress
}
handle.await.unwrap().unwrap();
}
The outer loop checks if the spawned task has finished. The inner loop processes all pending simulation events (which include storage I/O completions). The yield_now() gives the spawned task a chance to run after events have been processed.
This pattern is mechanical but important. Without it, storage operations will hang forever waiting for simulation events that never get processed.
Per-Process Storage Configuration
Storage fault injection is scoped per process. Each process is identified by its IP address, and you can assign different StorageConfiguration to different processes. This models real-world heterogeneous hardware: one node with a flaky SSD, another with a healthy disk.
The StorageState maintains a global configuration as the default, plus optional per-process overrides in per_process_configs: HashMap<IpAddr, StorageConfiguration>. When the simulation needs a config for a file operation, StorageState::config_for(ip) checks for a per-process override first, falling back to the global config.
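A sketch of that lookup, using the names from the text (the global-config field name and exact signatures are assumptions):

impl StorageState {
    fn config_for(&self, ip: IpAddr) -> &StorageConfiguration {
        // Per-process override wins; otherwise fall back to the global default
        self.per_process_configs.get(&ip).unwrap_or(&self.global_config)
    }
}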
Set per-process configuration through SimWorld:
#![allow(unused)]
fn main() {
// Give process 10.0.1.2 a degraded disk
let degraded = StorageConfiguration {
read_fault_probability: 0.01, // 1% read corruption
write_fault_probability: 0.005,
..StorageConfiguration::default()
};
sim.set_process_storage_config("10.0.1.2".parse().unwrap(), degraded);
}
Every file opened by a process is tagged with that process’s IP (StorageFileState::owner_ip). Fault injection decisions (corruption probabilities, latency ranges, sync failures) use the config resolved for the file’s owner, not a single global setting.
Crash and Wipe Operations
Two SimWorld methods handle storage lifecycle during process failures:
simulate_crash_for_process(ip, close_files) simulates a power loss for a specific process. Pending writes are subject to crash fault injection (torn writes), and open file handles are optionally closed. This replaces the older simulate_crash(), which operated globally.
wipe_storage_for_process(ip) deletes all persistent storage owned by the given process. This models total disk failure or replacing a machine. The CrashAndWipe reboot kind calls both: crash first, then wipe. The wipe happens immediately (not deferred).
Configuration in Practice
For chaos testing, use StorageConfiguration::random_for_seed(). This randomizes both performance parameters and fault probabilities based on the simulation seed:
#![allow(unused)]
fn main() {
let storage_config = StorageConfiguration::random_for_seed();
// Fault probabilities: 0.001% to 0.1% (low but present)
// IOPS: 10K to 100K
// Bandwidth: 50-500 MB/s
}
For fast unit tests, use StorageConfiguration::fast_local():
#![allow(unused)]
fn main() {
let storage_config = StorageConfiguration::fast_local();
// 1M IOPS, 1 GB/s, 1us latencies, zero faults
}
The fault probabilities in random_for_seed() are intentionally low (0.001% to 0.1%). Storage faults at higher rates would prevent the system from making progress. The goal is a steady trickle of faults that occasionally exercises corruption detection and recovery, not a deluge that makes every I/O fail.
Assertions: Finding Bugs
Imagine your simulation catches a process violating an invariant. You have two choices: crash immediately, or record the violation and keep going. In traditional testing, we crash. The logic seems obvious: something went wrong, stop everything, report the failure.
But that logic is wrong for simulation testing.
Why Assertions Never Crash
Here is the core insight: the first thing that goes wrong is rarely the worst thing that goes wrong. When a process violates a consistency invariant, that violation is a bug. But if we abort the simulation right there, we will never see what happens next. Maybe that corrupted state propagates to two other processes. Maybe the replication protocol silently accepts the bad data and commits it to all replicas. Maybe the system continues operating “normally” for a thousand more steps before the corruption surfaces in a way that would actually harm a user.
An early abort masks the cascade. And the cascade is where the real damage lives.
Moonpool follows the principle that Antithesis pioneered: assertions record violations and continue. When assert_always! detects a violation, it logs an error, increments a counter, and lets the simulation keep running. The simulation report at the end shows everything that went wrong, in what order, and how often.
This is not about being lenient. It is about being thorough. A single simulation run that records three violations across two different subsystems tells you far more than three separate runs that each crash on the first violation they find.
Dual Purpose
Assertions in moonpool serve two purposes that reinforce each other.
First, they verify correctness. An assert_always! is a property that must hold every time it is checked. If it fails even once across thousands of iterations, there is a bug. The system tracks pass and fail counts, giving you a precise success rate rather than a binary pass/fail.
Second, they guide exploration. When you enable multiverse mode (covered in Part V), certain assertions become active signals to the explorer. An assert_sometimes! that fires true for the first time tells the explorer “this is interesting, branch from here.” The explorer snapshots that moment and spawns new timelines from it. This is what turns a random walk through state space into a directed search.
The same assertion does both jobs. You write it once, thinking about correctness. The exploration framework uses it automatically to find more bugs.
The Reporting Pipeline
After each simulation iteration, the runner checks has_always_violations(). If any always-type assertion failed during the iteration, the runner marks that seed as a failure. But the simulation is not interrupted mid-run. The entire iteration completes, accumulating all violations.
When the simulation finishes, get_assertion_results() returns an AssertionStats for every tracked assertion:
#![allow(unused)]
fn main() {
pub struct AssertionStats {
/// Total number of times this assertion was evaluated
pub total_checks: usize,
/// Number of times the assertion condition was true
pub successes: usize,
}
}
You can ask for success_rate() to get a percentage. An always-assertion with a 99.7% success rate means it failed in 0.3% of checks. That is a bug, and the numbers tell you how hard it is to trigger.
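The derivation is straightforward; a sketch of what success_rate() plausibly computes (the actual implementation may differ):

impl AssertionStats {
    pub fn success_rate(&self) -> f64 {
        if self.total_checks == 0 {
            return 0.0; // never evaluated: no meaningful rate
        }
        self.successes as f64 / self.total_checks as f64 * 100.0
    }
}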
At the end of all iterations, validate_assertion_contracts() performs final validation. It returns two categories of violations: always violations (definite bugs where invariants were broken) and coverage violations (sometimes-assertions that were never satisfied, meaning the simulation did not exercise certain paths). The distinction matters because always violations indicate bugs regardless of iteration count, while coverage violations are only meaningful after enough iterations for statistical confidence.
What Comes Next
The next four chapters walk through the assertion types from simple to complex. We start with the conceptual taxonomy (invariants, discovery, guidance), then cover boolean assertions, numeric watermarks, and compound assertions. Each type has a different relationship with the simulation and the explorer, and understanding that relationship is the key to writing assertions that actually find bugs.
Invariants vs Discovery vs Guidance
- Invariants: What Must Always Hold
- Discovery: What Must Happen Eventually
- Guidance: What to Optimize Toward
- Why the Taxonomy Matters
Not all assertions ask the same kind of question. Some say “this must always be true.” Others say “this should happen at least once.” And a few say “try to make this better.” These are fundamentally different kinds of statements, and the simulation framework treats them differently.
Moonpool organizes assertions into three categories based on their purpose. Getting the category right matters because it determines how the assertion interacts with the runner, how it interacts with the explorer, and what a violation actually means.
Invariants: What Must Always Hold
Invariant assertions state properties that must be true every single time they are checked. If they fail even once, there is a bug.
#![allow(unused)]
fn main() {
assert_always!(committed_count <= total_count, "commits cannot exceed total");
}
This is the assertion you reach for most often. In Antithesis’s own Pangolin database, about 90% of assertions are always-type. That ratio matches what we see in practice: the vast majority of what we want to say about a system is “this property holds.”
Invariant assertions do not interact with the explorer. They do not trigger forks or snapshots. Their job is purely to catch violations. When one fails, the runner records the violation and continues. The value is in the recording, not in the stopping.
There is an important subtlety: assert_always! also fails if it is never reached. An assertion that is never evaluated gives false confidence. If you have an assertion guarding a recovery path but your simulation never triggers recovery, the assertion tells you nothing. Moonpool flags unreached always-assertions as violations so you know your coverage has gaps.
For optional code paths where non-reachability is acceptable, use assert_always_or_unreachable! instead. It validates the property when reached but passes silently if the code path is never exercised.
Discovery: What Must Happen Eventually
Discovery assertions state properties that must occur at least once across all simulation iterations. They do not need to hold every time, but they must fire true at some point.
#![allow(unused)]
fn main() {
assert_sometimes!(leader_elected, "a leader should eventually be elected");
assert_reachable!("recovery path exercised");
}
Where invariants validate, discovery assertions prove coverage. They answer questions like: Did our simulation actually trigger a failover? Did a leader election happen? Did the retry path execute? Without discovery assertions, a simulation could run thousands of iterations exercising only the happy path and report zero failures. Everything looks green, but nothing interesting was tested.
Discovery assertions become exploration amplifiers in multiverse mode. When assert_sometimes! fires true for the first time, the explorer snapshots the simulation state and branches from that point. Why? Because reaching that state was hard, and there are likely more interesting states reachable from it.
Consider a bug that requires a failover (probability 1/1000) followed by a specific timing condition during recovery (probability 1/1000). Without exploration amplification, finding this bug requires roughly 1,000,000 random iterations, because both rare events must occur in the same run. With a sometimes-assertion on the failover that triggers branching, the explorer takes a shortcut: find the failover once in ~1000 iterations, snapshot it, then find the timing condition in ~1000 more iterations from that checkpoint. The expected cost drops from multiplicative to additive, as the arithmetic below makes concrete.
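let p_failover = 1.0_f64 / 1000.0;
let p_timing = 1.0_f64 / 1000.0;

// Random search needs both rare events in the same run:
let random_iterations = 1.0 / (p_failover * p_timing); // ~1,000,000

// Branching at the failover checkpoint makes the costs add instead:
let guided_iterations = 1.0 / p_failover + 1.0 / p_timing; // ~2,000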
This is why discovery assertions have “superpowers” in moonpool. They are not just coverage markers. They are active participants in the search for bugs.
Guidance: What to Optimize Toward
Guidance assertions steer the explorer toward interesting regions of the state space. They express goals that the system should try to achieve, and the explorer actively works to satisfy them.
#![allow(unused)]
fn main() {
assert_sometimes_greater_than!(throughput, 1000, "should sometimes achieve high throughput");
assert_sometimes_all!("full_cluster_ready", [
("leader_elected", has_leader),
("all_replicas_synced", replicas_in_sync),
("clients_connected", clients_ready),
]);
}
Numeric guidance assertions track watermarks. The system remembers the best value it has observed and forks when it discovers a better one. This creates a ratchet effect: the explorer steadily pushes toward boundary conditions where bugs tend to hide.
Compound guidance assertions track frontiers. assert_sometimes_all! counts how many sub-goals are simultaneously true and forks when that count increases. The explorer is driven to satisfy more sub-goals at once, which naturally leads it toward complex system states that are hard to reach by random exploration alone.
Why the Taxonomy Matters
The three categories are not just organizational labels. They determine what the framework does with the assertion:
| Category | Runner behavior | Explorer behavior | Violation meaning |
|---|---|---|---|
| Invariant | Flags failure | Nothing | Definite bug |
| Discovery | Checks coverage | Forks on first success | Insufficient testing |
| Guidance | Reports progress | Forks on improvement | Exploration target |
When writing an assertion, ask yourself: Am I checking a property that must never be violated? Am I verifying that something interesting happened? Or am I telling the explorer where to look? The answer determines which macro to use, and using the right one makes the entire framework more effective.
Always and Sometimes
- assert_always!
- assert_always_or_unreachable!
- assert_sometimes!
- assert_reachable!
- assert_unreachable!
- Choosing the Right One
The boolean assertion macros are the foundation of everything else in moonpool’s assertion system. There are five of them. Each takes a message string that identifies the assertion across iterations and across forked timelines.
assert_always!
The workhorse. This states a property that must hold every time it is evaluated.
#![allow(unused)]
fn main() {
assert_always!(
replica_count >= min_replicas,
"replica count must meet minimum"
);
}
If the condition is false, moonpool records the violation at ERROR level with the current seed and increments the failure counter. It does not panic. The simulation continues, and subsequent assertions in the same iteration can still fire and record their own results.
There is one behavior that surprises people at first: assert_always! also fails if it is never reached. The reasoning is that an untested invariant provides false confidence. If you write an assertion guarding your recovery path but the simulation never exercises recovery, the post-run validation will flag it. This forces you to either fix your simulation to reach that path or use a different assertion type.
assert_always_or_unreachable!
Same semantics as assert_always!, except it passes silently when the code path is never executed.
#![allow(unused)]
fn main() {
assert_always_or_unreachable!(
balance >= 0,
"balance must be non-negative after withdrawal"
);
}
Use this for properties that guard optional or conditional code paths. If a particular fault injection scenario never triggers a withdrawal, this assertion will not flag a coverage gap. But if the path is reached and the balance goes negative, that is a violation.
The distinction between assert_always! and assert_always_or_unreachable! prevents a common trap: littering your code with always-assertions, then getting flooded with “never reached” violations because your simulation configuration does not exercise every path. Reserve assert_always! for the paths you must test. Use assert_always_or_unreachable! for the rest.
assert_sometimes!
This states a property that should be true at least once across all iterations. It does not need to hold every time.
#![allow(unused)]
fn main() {
assert_sometimes!(
connections_dropped > 0,
"chaos should sometimes drop connections"
);
}
If the condition is never true after all iterations complete, the post-run validation reports a coverage violation. But unlike always-violations, coverage violations are statistical. They become meaningful only after enough iterations.
The real power of assert_sometimes! shows up in multiverse mode. When the condition fires true for the first time in a timeline, the explorer snapshots that moment and branches. New timelines start from that interesting state. This is what transforms sometimes-assertions from passive coverage checks into active exploration amplifiers.
Think of it this way: assert_sometimes! is how you tell the explorer “this state is worth investigating.” The explorer does the rest.
Here is a practical pattern. You want to verify that your system handles leader re-election after a crash:
#![allow(unused)]
fn main() {
// In your workload
assert_sometimes!(
leader_changed_after_crash,
"leader should change after crash"
);
}
Without multiverse exploration, this assertion just checks that your simulation exercises the re-election path. With exploration enabled, it creates a checkpoint at the moment re-election succeeds. The explorer then branches from that checkpoint, increasing the chance of finding bugs in the post-election state.
assert_reachable!
A simplified form of sometimes-assertion for code paths that should be hit at least once. No condition needed.
#![allow(unused)]
fn main() {
fn handle_timeout(&mut self) {
assert_reachable!("timeout handler executed");
// ... handle the timeout
}
}
This is equivalent to assert_sometimes!(true, "...") but reads more clearly when you just want to confirm a path is exercised. Like assert_sometimes!, it triggers a fork on first reach in multiverse mode.
assert_unreachable!
The inverse: marks a code path that should never execute.
#![allow(unused)]
fn main() {
fn handle_message(&mut self, msg: Message) -> Result<(), Error> {
match msg.kind {
MessageKind::Request => { /* ... */ }
MessageKind::Response => { /* ... */ }
MessageKind::Unknown => {
assert_unreachable!("received unknown message kind");
return Err(Error::UnknownMessage);
}
}
Ok(())
}
}
If this code path is reached, moonpool records a violation (equivalent to an always-violation). Unlike assert_always!, there is no condition to check. Being reached at all is the violation.
Note that the code after the assertion still executes. The assertion records the problem but does not prevent the error path from running. This lets the simulation discover what happens when “impossible” conditions occur.
Choosing the Right One
The decision tree is straightforward:
- Must hold every time, must be tested: `assert_always!`
- Must hold every time, might not be reached: `assert_always_or_unreachable!`
- Must happen at least once: `assert_sometimes!`
- Path must be exercised: `assert_reachable!`
- Path must never execute: `assert_unreachable!`
In practice, the distribution follows a pattern similar to what Antithesis found with their Pangolin database: roughly 90% always-type, with sometimes/reachable filling in the coverage gaps. That ratio makes sense because most of what we want to say about a system is “this must hold,” not “this should eventually happen.”
Start with always-assertions on your core invariants. Add sometimes-assertions on the interesting error paths and recovery scenarios. The explorer will use those sometimes-assertions to find the bugs hiding behind the invariants.
Numeric Assertions
Boolean assertions answer yes-or-no questions. But many of the properties we care about in distributed systems are numeric: latency must stay below a threshold, throughput should reach a target, queue depth should not exceed a bound. Numeric assertions let us express these properties directly and give the explorer something extra to work with: a value to optimize.
Always: Numeric Bounds
The always-numeric macros state bounds that must hold on every check. They work like assert_always! but compare two numeric values:
#![allow(unused)]
fn main() {
assert_always_less_than!(queue_depth, max_queue_size, "queue must not overflow");
assert_always_greater_than_or_equal_to!(replica_count, 1, "must have at least one replica");
}
Four comparison operators are available: assert_always_greater_than!, assert_always_greater_than_or_equal_to!, assert_always_less_than!, and assert_always_less_than_or_equal_to!. Each takes a value, a threshold, and a message.
When the comparison fails, the behavior is the same as boolean always-assertions: the violation is recorded, an error is logged, and the simulation continues.
But something extra happens behind the scenes. The framework tracks a watermark for the value: the most extreme value observed across all checks. For assert_always_less_than!(x, 100, ...), the framework remembers the highest x it has seen. Even if x never reaches 100, the watermark tells you how close the system came to the boundary.
Why does this matter? Because assert_always_less_than!(x, 100) implicitly tells the explorer to maximize x. The explorer gravitates toward states where x is highest, naturally pushing toward the boundary condition. If there is a bug that only manifests when x reaches 99, the explorer will find it faster than random exploration would.
This boundary-seeking behavior is automatic. You write a bound, and the framework tries to find states that approach or violate it.
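A minimal sketch of the bookkeeping (names illustrative): for a less-than bound, each check simply ratchets the recorded maximum upward.

struct Watermark {
    highest_seen: f64,
}

impl Watermark {
    fn observe(&mut self, value: f64) {
        // Remember the most extreme value, even if the bound never trips
        self.highest_seen = self.highest_seen.max(value);
    }
}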
Sometimes: Numeric Goals
The sometimes-numeric macros state goals that should eventually be achieved. The comparison must hold at least once across all iterations:
#![allow(unused)]
fn main() {
assert_sometimes_greater_than!(throughput, 500, "should achieve >500 ops/sec");
assert_sometimes_less_than!(p99_latency, 100, "p99 should sometimes drop below 100ms");
}
Like their boolean counterparts, these are coverage assertions. If the goal is never achieved after all iterations, the validation reports a coverage violation.
The difference from boolean sometimes is in how they interact with the explorer. Sometimes-numeric assertions fork on watermark improvement. The framework tracks the best value observed, and when a new observation beats the previous best, the explorer snapshots and branches from that state.
This creates a ratchet. Suppose you write:
#![allow(unused)]
fn main() {
assert_sometimes_greater_than!(committed_transactions, 1000, "high commit throughput");
}
The first time committed_transactions reaches 50, the explorer branches. Then it finds a timeline where it reaches 200 and branches again. Then 450. Then 800. Each improvement creates a new branch point, and the explorer explores from progressively better states. The watermark only moves in one direction: toward the goal.
Watermark Mechanics
Every numeric assertion, whether always or sometimes, maintains a watermark in shared memory. The watermark is the best value of the left operand observed so far:
- For assertions that track the highest value (`maximize=true`): `assert_always_less_than!`, `assert_always_less_than_or_equal_to!`, `assert_sometimes_greater_than!`, `assert_sometimes_greater_than_or_equal_to!`. These seek the boundary from below (always) or ratchet upward (sometimes).
- For assertions that track the lowest value (`maximize=false`): `assert_always_greater_than!`, `assert_always_greater_than_or_equal_to!`, `assert_sometimes_less_than!`, `assert_sometimes_less_than_or_equal_to!`. These seek the boundary from above (always) or ratchet downward (sometimes).
The watermark persists across fork boundaries in multiverse mode. When a child timeline improves the watermark, the improvement is visible to subsequent timelines through shared memory. This means the explorer collectively pushes toward the boundary rather than each timeline searching independently.
For sometimes-numeric assertions, a second watermark tracks the value at the last fork point. A new fork only triggers when the value improves past the last fork watermark. This prevents the same assertion from triggering unlimited forks for tiny incremental improvements.
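A sketch of that gating logic (names illustrative):

fn maybe_fork(value: f64, last_fork_watermark: &mut f64, fork: impl FnOnce()) {
    // Only branch when the value beats the watermark from the last fork,
    // so tiny incremental improvements do not spawn unbounded timelines
    if value > *last_fork_watermark {
        *last_fork_watermark = value;
        fork();
    }
}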
Use Cases
Resource bounds:
#![allow(unused)]
fn main() {
assert_always_less_than!(
memory_usage_mb, max_memory_mb,
"memory must stay within budget"
);
}
The explorer pushes toward high memory states, testing the system under pressure.
Convergence targets:
#![allow(unused)]
fn main() {
assert_sometimes_less_than!(
convergence_time_ms, 500,
"cluster should converge within 500ms"
);
}
The explorer ratchets toward fast convergence, branching each time it finds a faster path.
Throughput validation:
#![allow(unused)]
fn main() {
assert_always_greater_than_or_equal_to!(
processed_count, expected_count,
"must process all submitted requests"
);
}
Catches dropped requests. The explorer seeks states where the gap between processed and expected is smallest (i.e., where the system is closest to dropping something).
The key insight with numeric assertions is that expressing a bound also expresses a direction for exploration. You get correctness checking and guided search from the same line of code.
Compound Assertions
- assert_sometimes_all!: Simultaneous Sub-Goals
- assert_sometimes_each!: Per-Value Coverage
- Quality Watermarks
- State Explosion and Practical Limits
- Choosing Between the Two
Boolean assertions check one condition. Numeric assertions track one value. But the most interesting system states involve multiple things being true at once: a leader is elected and all replicas are synced and clients are connected. Or we want every distinct configuration to be tested, not just the first one we stumble on. Compound assertions handle both of these cases.
assert_sometimes_all!: Simultaneous Sub-Goals
Some properties only matter when multiple conditions hold at the same time. A cluster is not truly healthy unless the leader is elected, replicas are caught up, and no partitions are active. Testing each condition individually does not tell you whether the system can achieve them all at once.
assert_sometimes_all! takes a message and a list of named boolean sub-goals:
#![allow(unused)]
fn main() {
assert_sometimes_all!("cluster_fully_operational", [
("leader_elected", has_leader),
("replicas_synced", all_replicas_in_sync),
("no_partitions", partition_count == 0),
("clients_connected", active_clients > 0),
]);
}
The assertion tracks a frontier: the maximum number of sub-goals that have been simultaneously true. It starts at zero. The first time any sub-goal is true, the frontier advances to 1. When two are true at once, it advances to 2. When all four are true simultaneously, the frontier reaches 4 and the assertion is fully satisfied.
Each time the frontier advances, the explorer forks. This creates a progression: the explorer first finds states where one sub-goal is met, then branches from there to find states where two are met, and so on. The exploration naturally follows the path of increasing difficulty.
This is powerful for multi-step objectives. Consider a distributed transaction system. The full operation requires: prepare all participants, get all votes, commit, and acknowledge to the client. As a sometimes-all assertion:
#![allow(unused)]
fn main() {
assert_sometimes_all!("distributed_commit_complete", [
("all_prepared", all_participants_prepared),
("all_voted_yes", all_votes_received),
("committed", transaction_committed),
("client_acked", client_received_ack),
]);
}
The explorer works through the stages. First it finds a state where all participants are prepared. It forks from there and pushes for votes. Then it finds committed states. Each stage is a stepping stone to the next.
Without this assertion, the explorer would need to stumble on the complete end-to-end scenario by chance. With it, the explorer gets intermediate checkpoints that guide it through the sequence.
assert_sometimes_each!: Per-Value Coverage
Sometimes you want to ensure that every distinct value of something is explored, not just any one value. If your system has 10 different error codes, you want the simulation to exercise all 10. If a consensus protocol has 5 possible states, you want coverage of each.
assert_sometimes_each! creates a separate assertion for each unique combination of identity keys:
#![allow(unused)]
fn main() {
// Ensure every node ID is tested
assert_sometimes_each!("node_tested", [("node_id", node_id)]);
// Ensure every (state, role) combination is explored
assert_sometimes_each!("state_coverage", [
("state", state as i64),
("role", role as i64),
]);
}
Each unique combination of key values gets its own bucket. The first time a new bucket is discovered, the explorer forks. This equalizes exploration across all discovered values, preventing the simulation from over-concentrating on values it finds easily while neglecting harder-to-reach ones.
The Antithesis team demonstrated this dramatically with The Legend of Zelda. They used SOMETIMES_EACH with screen coordinates to ensure the explorer visited all 128 overworld screens and 230 dungeon rooms. Without per-value bucketing, the explorer would revisit the starting area thousands of times while leaving distant rooms unexplored. With it, every room gets roughly equal attention.
For distributed systems, the same pattern applies. If you have a state machine with states {Follower, Candidate, Leader, Observer}, a sometimes-each assertion ensures the simulation exercises each state rather than getting stuck in the most common one:
#![allow(unused)]
fn main() {
assert_sometimes_each!("raft_state_exercised", [
("node", node_id),
("state", raft_state as i64),
]);
}
Quality Watermarks
Sometimes discovering a value is not enough. You also want to explore each value under good conditions. assert_sometimes_each! supports an optional second list of quality keys that track watermarks per bucket:
#![allow(unused)]
fn main() {
assert_sometimes_each!(
"dungeon_floor_explored",
[("floor", floor_number)], // identity: which floor
[("health", current_health)], // quality: explore each floor with good health
);
}
Each bucket remembers the best quality values observed. When a bucket is revisited with better quality, the explorer re-forks from that improved state. This prevents the “doomed state” problem described in Antithesis’s Castlevania analysis: the explorer reaches an area but in such bad shape that no useful exploration can happen from there. Quality watermarks ensure the representative state for each bucket is the best one found so far.
State Explosion and Practical Limits
Compound assertions can create a lot of buckets. If you use two identity keys with 100 distinct values each, that is 10,000 potential buckets. Each bucket consumes exploration energy. The explorer bounds energy per assertion to prevent any single assertion from monopolizing the search, but individual assertion effectiveness degrades when buckets proliferate.
Keep identity keys coarse enough to be useful. If you are tracking screen positions in a 256x256 grid, bucket them into 16x16 regions rather than tracking exact pixels. The goal is coverage of meaningful distinct states, not exhaustive enumeration of every possible value.
A good rule of thumb: if you would not manually write a separate test for each distinct value, it probably should not be its own bucket.
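For instance, a sketch of coarsening a position before asserting; the 16-cell region size is arbitrary:
#![allow(unused)]
fn main() {
// A 256x256 grid collapses into a 16x16 grid of regions:
// 256 buckets instead of 65,536.
let region_x = x / 16;
let region_y = y / 16;
assert_sometimes_each!("region_visited", [
    ("region_x", region_x as i64),
    ("region_y", region_y as i64),
]);
}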
Choosing Between the Two
The choice is straightforward:
- Multiple conditions that must hold simultaneously: assert_sometimes_all!
- One condition that must hold for each distinct value: assert_sometimes_each!
Use assert_sometimes_all! when the challenge is getting several things true at once. Use assert_sometimes_each! when the challenge is covering a space of distinct configurations or states.
Both are guidance assertions. Both trigger forks. Both make the explorer smarter about where to search. And both turn a single line of code into something that would otherwise require dozens of hand-crafted test scenarios.
System Invariants
- The Invariant Trait
- Sharing State with StateHandle
- A Real Example: The Agreement Invariant
- When to Use Invariants vs Assertions
- Performance
Assertions live inside workloads. They validate local properties: “this key should exist,” “this response matched what we sent.” But some correctness properties span the entire system. The conservation law in a consensus protocol is a good example: all committed values must agree across replicas, and no committed value can be lost. No single workload owns that property. We need something that watches the whole world.
That is what invariants are for.
The Invariant Trait
An invariant is a check that runs after every simulation event. The simulation engine calls it automatically. If the invariant panics, the simulation stops and reports the failing seed.
#![allow(unused)]
fn main() {
pub trait Invariant: 'static {
fn name(&self) -> &str;
fn check(&self, state: &StateHandle, sim_time_ms: u64);
}
}
Two inputs: a StateHandle containing shared state that workloads publish, and the current simulation time. The contract is simple: if the invariant holds, return normally. If it does not, panic with a descriptive message.
You register invariants on the builder:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.workload(ConsensusWorkload::new(3))
.invariant(AgreementInvariant)
.invariant(ValidityInvariant)
.run()
.await
}
For quick one-off checks, there is a closure shorthand:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.invariant_fn("single_leader", |state, _t| {
if let Some(model) = state.get::<ConsensusModel>("consensus_model") {
let leaders: Vec<_> = model.nodes.iter()
.filter(|(_, s)| s.role == Role::Leader)
.collect();
assert!(leaders.len() <= 1, "multiple leaders: {:?}", leaders);
}
})
}
Sharing State with StateHandle
Invariants need to see what workloads are doing. StateHandle is the bridge. It is a type-safe, Rc-based key-value store that workloads publish into and invariants read from.
Workloads publish their state after each operation:
#![allow(unused)]
fn main() {
// Inside the workload's run() method
self.model.record_commit(slot, value);
ctx.state().publish("consensus_model", self.model.clone());
}
Invariants read it back:
#![allow(unused)]
fn main() {
if let Some(model) = state.get::<ConsensusModel>("consensus_model") {
// validate...
}
}
The if let Some guard is important. Early in the simulation, before the workload has published anything, the key will not exist. Invariants should silently skip when their data is not yet available.
A Real Example: The Agreement Invariant
Consider a consensus protocol where multiple nodes must agree on committed values. The agreement invariant checks that no two nodes have committed different values for the same slot. This is the kind of property that catches subtle bugs: a leader change that replays a proposal, a vote that arrives after a new ballot, a crash during the accept phase.
#![allow(unused)]
fn main() {
pub struct AgreementInvariant;
impl Invariant for AgreementInvariant {
fn name(&self) -> &str {
"agreement"
}
fn check(&self, state: &StateHandle, _sim_time_ms: u64) {
if let Some(model) = state.get::<ConsensusModel>("consensus_model") {
for (slot, values) in &model.committed_values {
let unique: HashSet<_> = values.iter().collect();
assert_always!(
unique.len() <= 1,
format!(
"agreement violated at slot {}: nodes committed different values {:?}",
slot, values
)
);
}
}
}
}
}
This runs after every single simulation event. If a leader change causes two nodes to commit different values for the same slot, this invariant fires immediately.
When to Use Invariants vs Assertions
Assertions (assert_always!, assert_sometimes!) belong inside workloads. They validate local properties from the workload’s perspective: “this response has the right balance,” “this error path was exercised.”
Invariants validate global, cross-workload properties from an omniscient perspective. They see the full system state and check that it is consistent. Use them for:
- Conservation laws (messages, resources, committed values)
- No-phantom properties (never receive something that was not sent)
- Consistency across processes (leader election: at most one leader at any time)
- Monotonicity properties (ballot numbers only increase)
A useful rule of thumb: if the property involves state from more than one process or workload, it is an invariant. If it is about one workload’s local view, it is an assertion.
Performance
Invariants run after every simulation event. A typical simulation processes thousands of events per iteration, and you might run hundreds of iterations. Keep invariants fast.
Concretely: iterate a small collection, compare a few counters, check a simple predicate. Avoid expensive operations like sorting large datasets or doing string formatting on the happy path. The format! in the panic message is fine because it only runs when the invariant fails.
If you find yourself wanting a slow invariant (like replaying a log to verify consistency), consider running it only in the workload’s check() method at the end of the simulation rather than after every event.
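As a self-contained sketch of that idea, assuming a hypothetical Op log type: the replay below is linear in the log length, which is affordable once in check() but far too slow after every event:
#![allow(unused)]
fn main() {
use std::collections::BTreeMap;
// Hypothetical operation log entry for this sketch.
enum Op {
    Set { key: String, value: u64 },
    Delete { key: String },
}
// Derive the expected final state by replaying the whole history,
// then compare it against what the system actually contains.
fn replay(log: &[Op]) -> BTreeMap<String, u64> {
    let mut state = BTreeMap::new();
    for op in log {
        match op {
            Op::Set { key, value } => {
                state.insert(key.clone(), *value);
            }
            Op::Delete { key } => {
                state.remove(key);
            }
        }
    }
    state
}
}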
Event Timelines
The previous chapter introduced StateHandle with publish/get semantics. Workloads publish a snapshot, invariants read it. But snapshots only show the current value. The history is gone. If a balance was 100, then 50, then 100 again, a snapshot-based invariant sees 100 and declares everything fine. The dip to 50 is invisible.
Temporal properties need the full sequence. Monotonicity (a term number never decreased), causal ordering (the lock was acquired before the write), conservation over time (total money stayed constant through a sequence of transfers). These require an append-only log, not a mutable register.
That is what event timelines provide.
Emitting Events
Inside a workload or process, call ctx.emit() with a timeline name and an event value:
#![allow(unused)]
fn main() {
#[derive(Debug, Clone)]
struct TransferEvent {
from: String,
to: String,
amount: u64,
}
// Inside a workload's run() method
ctx.emit("transfers", TransferEvent {
from: "alice".into(),
to: "bob".into(),
amount: 50,
});
}
Each entry is automatically wrapped in a TimelineEntry<T>:
#![allow(unused)]
fn main() {
pub struct TimelineEntry<T> {
pub event: T, // your payload
pub time_ms: u64, // simulation time (auto-captured from ctx.time().now())
pub source: String, // emitter IP (auto-captured from ctx.my_ip())
pub seq: u64, // global monotonic sequence number
}
}
The seq field deserves attention. It is a single counter shared across all timelines in the StateHandle. If timeline A gets seq 0 and timeline B gets seq 1, you know A’s event was emitted first, even if both have the same time_ms. This gives you a total ordering over all events in the simulation.
Reading Timelines
Workloads read timelines through ctx.timeline(). Invariants read them through state.timeline(). Both return an Option<Timeline<T>> that is None if no events have been emitted to that key yet.
#![allow(unused)]
fn main() {
impl Invariant for TransferOrderInvariant {
fn name(&self) -> &str { "transfer_ordering" }
fn check(&self, state: &StateHandle, _sim_time_ms: u64) {
let Some(tl) = state.timeline::<TransferEvent>("transfers") else {
return; // no transfers yet
};
// Check monotonicity: transfer times never go backwards
let entries = tl.all(); // zero-copy borrow
for window in entries.windows(2) {
assert_always!(
window[1].time_ms >= window[0].time_ms,
format!("transfer time went backwards: {} -> {}",
window[0].time_ms, window[1].time_ms)
);
}
}
}
}
Timeline<T> has four read methods:
| Method | Returns | Use case |
|---|---|---|
| all() | Ref<Vec<TimelineEntry<T>>> | Full scan, zero-copy |
| since(index) | Vec<TimelineEntry<T>> | Incremental processing from a cursor |
| last() | Option<TimelineEntry<T>> | Most recent event |
| len() | usize | Count check |
For invariants that run after every event, since() avoids rescanning the entire history. Store the cursor between calls and only process new entries.
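A sketch of that cursor pattern, reusing the TransferEvent timeline from above. Since check() takes &self, the cursor needs interior mutability:
#![allow(unused)]
fn main() {
use std::cell::Cell;
pub struct IncrementalTransferInvariant {
    cursor: Cell<usize>,
}
impl Invariant for IncrementalTransferInvariant {
    fn name(&self) -> &str { "transfers_incremental" }
    fn check(&self, state: &StateHandle, _sim_time_ms: u64) {
        let Some(tl) = state.timeline::<TransferEvent>("transfers") else {
            return;
        };
        // Only process entries appended since the last call.
        for entry in tl.since(self.cursor.get()) {
            assert_always!(
                entry.event.amount > 0,
                "transfers must move a positive amount"
            );
        }
        self.cursor.set(tl.len());
    }
}
}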
The Fault Timeline
Every fault the simulator injects, from network partitions to storage corruption to process kills, is automatically emitted to a well-known timeline called "sim:faults". No workload code needed.
#![allow(unused)]
fn main() {
use moonpool_sim::{SimFaultEvent, SIM_FAULT_TIMELINE};
fn check(&self, state: &StateHandle, _sim_time_ms: u64) {
let Some(faults) = state.timeline::<SimFaultEvent>(SIM_FAULT_TIMELINE) else {
return;
};
// Count how many times each process was killed
let kill_count = faults.all().iter()
.filter(|e| matches!(&e.event, SimFaultEvent::ProcessForceKill { .. }))
.count();
assert_always!(
kill_count <= 10,
format!("too many kills in one iteration: {}", kill_count)
);
}
}
SimFaultEvent covers 18 fault variants across three categories:
- Process lifecycle: ProcessGracefulShutdown, ProcessForceKill, ProcessRestart
- Network: PartitionCreated, PartitionHealed, ConnectionCut, CutRestored, HalfOpenError, SendPartitionCreated, RecvPartitionCreated, RandomClose, PeerCrash, BitFlip
- Storage: StorageReadFault, StorageWriteFault, StorageSyncFault, StorageCrash, StorageWipe
The real power is correlation. When an application-level invariant fires, cross-reference the fault timeline to understand what the infrastructure was doing at that moment. A conservation law violation at t=5000 that coincides with a ProcessForceKill at t=4980 tells a very different story than one with no faults nearby.
Snapshots vs Timelines
Use both. They solve different problems.
publish()/get() stores the latest snapshot. Good for properties about the current state: “the sum of all balances equals total deposits minus total withdrawals.” The conservation law invariant from the previous chapter is a snapshot invariant.
emit()/timeline() stores append-only history. Good for properties about how the state changed over time: “no message was received before it was sent,” “the leader term never decreased,” “every debit has a matching credit.”
A well-designed simulation typically publishes a reference model for snapshot invariants and emits events for temporal invariants. The fault timeline adds infrastructure context for free.
Designing Workloads That Find Bugs
- Strategy vs Tactics
- The Operation Alphabet
- Invariant Patterns
- Assertions as Memos
- Generate Sufficient Workload
A workload that exercises only the happy path is a workload that finds nothing. The previous chapters covered how to write a workload, wire up assertions, and register invariants. This chapter is about what to put inside them. What operations to generate, what to measure, and how to think about the relationship between input design and bug discovery.
Strategy vs Tactics
Will Wilson offers a useful framework for thinking about simulation workloads. There are two independent dimensions: strategy and tactics.
Strategy is what you measure and optimize. Code coverage, grid cells visited, time the system stays alive, request throughput. Strategy determines how the framework judges whether a run was productive.
Tactics is how you generate inputs. Uniform random? Weighted toward writes? Biased toward boundary values? Tactics determines the raw material the simulator works with.
These dimensions are independent. A workload with brilliant tactics (generating exactly the right fault sequences) can get by with crude strategy (just measuring crashes). A workload with sophisticated strategy (coverage-guided exploration) can get by with simple tactics (pure random inputs). You do not need both to be perfect. You need at least one to be good.
But there is a trap in “pure random” that deserves attention.
The white noise paradox: random inputs are maximally random at the micro level but minimally random at the macro level. Consider random per-packet network faults. Each packet has a 5% chance of being dropped. Sounds adversarial. But the probability of a sustained 10-second partition is 0.05 raised to the power of however many packets cross that link in 10 seconds. Essentially zero. Random drops average out to a slightly degraded but fundamentally healthy network. No sustained partition ever forms.
The same applies to random user inputs. Random button presses in a video game average out to no sustained action. Random API calls average out to no sustained workflow. The system never enters the deep states where bugs hide.
This is why moonpool uses correlated fault injection rather than random per-event drops. “The probability that there’s a network partition happening at time t=1 is highly dependent on whether there’s a network partition happening at time t=0.” A new partition starting is rare. A partition continuing is nearly certain. Correlated distributions produce the sustained fault patterns that expose real bugs.
The Operation Alphabet
From practical experience building workloads, we have found that the operations a workload generates fall into three categories. We call these the Operation Alphabet.
Normal operations are production-like traffic. Reads, writes, queries, transactions. The distribution should match what real users do. If your system handles 80% reads and 20% writes, your workload should approximate that ratio.
Adversarial inputs are what users actually send, whether you planned for it or not. Empty strings. Boundary values (0, -1, u64::MAX). Unicode edge cases. Maximum-length fields. Keys that collide in your hash function. A workload that skips adversarial inputs is testing a polite fiction.
Nemesis operations deliberately break things. Kill a process mid-transaction. Trigger a compaction during peak write load. Send conflicting writes to the same key from multiple clients simultaneously. These push the system into states that normal operation reaches only after months of uptime.
Only normal operations? You test the happy path. Add adversarial inputs and you test the validation path. Add nemesis operations and you test the recovery path. Bugs overwhelmingly live in recovery paths. Your alphabet needs all three letters.
#![allow(unused)]
fn main() {
let roll = ctx.random().random_range(0..100);
match roll {
0..60 => {
// Normal: production-like read/write mix
self.do_normal_operation(ctx).await?;
}
60..80 => {
// Adversarial: boundary values, empty inputs, collisions
self.do_adversarial_operation(ctx).await?;
}
80..100 => {
// Nemesis: conflict storms, concurrent mutations, stress
self.do_nemesis_operation(ctx).await?;
}
}
}
The percentages are tunable. Start with a heavy normal-operation bias and adjust based on what your assert_sometimes! statements tell you about coverage.
Invariant Patterns
Generating interesting operations is half the problem. The other half is knowing whether the system behaved correctly under those operations. Four patterns have proven reliable across many simulation workloads.
Reference models maintain an in-memory expected state and compare after operations. We saw this in the workload chapter: a BTreeMap mirrors what the server should contain, updated on every write, compared on every read. Use BTreeMap, not HashMap. Deterministic iteration order makes failures reproducible.
Conservation laws check quantities that must remain constant. Total record count across shards. Messages sent minus messages received. Committed values in a consensus protocol must agree across replicas. The invariants chapter covers this pattern in depth.
Structural integrity validates data structures after chaos. Traverse a linked list and verify no cycles. Walk a B-tree and check balance at every level. Count children and verify parent pointers. These catch corruption that reference models miss because they operate at a different abstraction level.
Operation logging resolves commitment uncertainty. When a write fails with a network error, did it commit or not? Log the intent alongside the mutation. After recovery, read back the state and reconcile against the log. Essential for workloads that operate through unreliable networks, which is all simulation workloads.
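A sketch of the operation-logging pattern using the event timelines from the previous chapter; client_set and the error-classification helper are hypothetical:
#![allow(unused)]
fn main() {
#[derive(Debug, Clone)]
enum WriteOutcome {
    Committed, // server acknowledged: must be visible after recovery
    Failed,    // clean rejection: must not have applied
    Ambiguous, // network error: may legitimately be either
}
#[derive(Debug, Clone)]
struct WriteRecord {
    key: String,
    value: u64,
    outcome: WriteOutcome,
}
// Inside the workload's run() method: classify each write as it
// resolves and log it, then reconcile against the log after recovery.
let outcome = match self.client_set(ctx, &key, value).await {
    Ok(()) => WriteOutcome::Committed,
    Err(e) if e.is_network_error() => WriteOutcome::Ambiguous,
    Err(_) => WriteOutcome::Failed,
};
ctx.emit("writes", WriteRecord { key: key.clone(), value, outcome });
}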
Assertions as Memos
Lawrie Green offers a perspective on assertions that changes how you think about placing them. Assertions serve two audiences simultaneously.
They inform the computer about expected behavior. In moonpool, assert_sometimes! tells the explorer “this state is interesting, branch from here.” assert_always! tells the runner “if this fails, stop and report.” Assertions are active participants in the search.
They document the developer’s mental model. When you write assert_always!(balance >= 0, "balance should never go negative"), you are recording a belief about how the system works. The assertion is a memo to your future self and to every developer who reads the code after you.
When an assertion fails on correct code, the mental model was wrong. That is itself a critical finding. Maybe the behavior is fine and the assertion needs updating. Maybe it reveals a design assumption that will cause trouble later. Either way, the mismatch between belief and reality is worth knowing about.
This dual purpose means assertions belong everywhere a developer has an opinion about what should happen. On the error path. On the recovery path. On the “this should never happen” path that definitely will happen in simulation.
Generate Sufficient Workload
A single client sending sequential requests will never trigger the race conditions where distributed bugs hide. The system needs enough concurrent activity for interesting interactions to emerge.
Define all actions the system can take. Not just the obvious ones. Multi-step workflows: “begin transaction, write three keys, read one back, commit.” Interacting operations: “two clients write the same key simultaneously.” Failure-spanning operations: “write during a partition, read after recovery.”
Then run many instances concurrently:
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.processes(3, || Box::new(KvServer::new()))
.workloads(5, || Box::new(KvWorkload::new(200)))
.run()
.await
}
Five workloads running 200 operations each, against three servers, with chaos enabled. The combinatorial interactions between concurrent operations, across servers experiencing faults, produce the complex interleavings where bugs live.
There is a deeper principle here. Individual test cases scale multiplicatively with system complexity. A 300-line feature can require 10,000 lines of manual test code for combinatorial coverage. Test generators scale linearly. Adding one variant to your operation enum covers all new combinations with that operation. You write one line. The simulator explores thousands of new scenarios.
#![allow(unused)]
fn main() {
enum Operation {
Get { key: String },
Set { key: String, value: Vec<u8> },
Delete { key: String },
Scan { start: String, end: String },
// Adding this one variant automatically generates
// combinations with every other operation, every
// fault type, and every timing interleaving.
CompareAndSwap { key: String, expected: Vec<u8>, new: Vec<u8> },
}
}
This is the real payoff of simulation testing. Not that each individual test is better, but that the cost of coverage changes from exponential to linear. One well-designed workload, with a complete operation alphabet, strong invariants, and enough concurrency, finds more bugs than a thousand hand-written test cases.
Debugging a Failing Seed
A seed failed. The simulation report printed a number, maybe seed=17429853261, next to a red line. What now?
In production distributed systems, debugging a concurrency bug means staring at interleaved logs from multiple nodes, trying to reconstruct a sequence of events that you cannot replay. You form a theory, add logging, deploy, wait for the bug to happen again, and hope your new logs captured enough. It is slow, painful, and often inconclusive.
Deterministic simulation changes this completely. A failing seed is not a clue. It is a recording. Same seed, same execution, same bug, every time. Debugging becomes mechanical rather than archaeological.
The Workflow
The process has five steps:
- Reproduce the failure with the exact seed and FixedCount(1)
- Isolate by reading the event trace to find the triggering event
- Understand the causal chain that led to the violation
- Fix the root cause in your code
- Verify by re-running the original seed and then the full chaos suite
Each of these steps is straightforward because determinism gives us something rare in distributed systems: repeatability. We do not hunt ghosts. We replay recordings.
What Makes This Different
Traditional debugging tools assume non-determinism. You set a breakpoint and hope the thread schedule cooperates. You add print statements and hope the race condition still manifests. You run the test ten times and it passes nine.
With simulation, the RNG seed controls everything: which connections fail, when timers fire, what order events process, how long delays take. Pin the seed, and the entire execution is frozen in time. You can add logging, set breakpoints, restructure your investigation, and the bug will be there waiting, exactly where you left it.
The next three sections cover the practical details: how to pin a seed and reproduce, how to read the event trace, and what common mistakes look like so you can recognize them quickly.
A Note on Non-Determinism
If you reproduce with a failing seed and the bug disappears, you have a different problem: something in your system is non-deterministic. This is actually valuable information. Common sources include direct tokio calls that bypass providers, HashMap iteration order leaking into behavior, or system randomness sneaking in through a dependency. The reproducing chapter covers how to track these down.
Reproducing with FixedCount
The simulation report told you that seed 17429853261 failed. The first step is always the same: reproduce it.
Pinning a Seed
Take your simulation binary and add two things: the failing seed, and a single iteration.
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.workload(MyWorkload::new())
.set_iterations(1)
.set_debug_seeds(vec![17429853261])
.run()
.await
}
set_debug_seeds fixes the RNG seed. set_iterations(1) tells the runner to execute exactly one iteration with that seed instead of sweeping through random seeds. Together, they replay the exact execution that failed.
Run it. You should see the same failure, the same panic message, the same assertion violation. Every time.
Turning Up the Logging
By default, simulation runs are quiet. When debugging a specific seed, turn up logging via the RUST_LOG environment variable. Start with ERROR to see assertion violations and failure messages:
RUST_LOG=error cargo xtask sim run my_simulation
For deeper investigation, RUST_LOG=trace shows the full event trace: every event processed, every timer fired, every connection state change. This is the firehose, but with a pinned seed it is a reproducible firehose.
When the Bug Disappears
You pinned the seed, ran it, and the bug is gone. This means one thing: something in your code is non-deterministic.
The simulation engine controls time, networking, and randomness through providers. But if your code bypasses those providers, it introduces real-world non-determinism into the simulated world. Common culprits:
- Direct tokio calls: tokio::time::sleep() instead of time.sleep(), tokio::spawn() instead of task_provider.spawn_task(). These escape the simulation’s control.
- System randomness: Using rand::thread_rng() or std::collections::HashMap iteration order instead of the simulation’s RandomProvider.
- Wall-clock time: Calling std::time::Instant::now() or SystemTime::now() instead of using the simulated clock.
- External I/O: Reading files, making real network calls, or anything that touches the actual operating system.
If you suspect non-determinism, run the same seed twice with RUST_LOG=trace and diff the output. The first divergence point tells you exactly where determinism broke.
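Concretely, assuming the xtask runner shown earlier and capturing both runs to files:
RUST_LOG=trace cargo xtask sim run my_simulation > run_a.log 2>&1
RUST_LOG=trace cargo xtask sim run my_simulation > run_b.log 2>&1
diff run_a.log run_b.log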
From Reproduction to Investigation
Once you have a reliable reproduction, the next step is understanding what happened. The event trace (covered in the next section) shows you the causal chain: which events fired, in what order, and what state they produced. With a pinned seed, you can also set breakpoints in your IDE and step through the exact execution path that triggers the bug.
The key insight: you are not searching for a bug anymore. You are reading a recording of a bug. That is a fundamentally easier problem.
Reading the Event Trace
- The Event Queue
- Key Event Types
- Tracing the Causal Chain
- Using RNG Call Count
- Infrastructure vs Workload Events
- Practical Tips
You have a pinned seed reproducing the failure. Now you need to understand what happened. The event trace is your primary tool.
The Event Queue
Moonpool’s simulation engine is built around an event queue: a priority queue ordered by logical time. Every side effect in the simulation, from network delivery to timer expiration to storage I/O, is an event scheduled at a specific time. Events at the same time are ordered by a monotonic sequence number, making execution fully deterministic.
When you enable trace-level logging (RUST_LOG=trace), you see every event as it fires:
Processing event at t=1.234s seq=47: Network { connection_id: 3, DataDelivery { ... } }
Processing event at t=1.234s seq=48: Timer { task_id: 12 }
Processing event at t=2.500s seq=49: Connection { id: 5, PartitionRestore }
Each line tells you what happened, when, and in what order.
Key Event Types
The simulation has a small set of event types, and learning to recognize them makes traces much easier to read.
Timer events wake sleeping tasks. When your workload calls time.sleep(Duration::from_secs(1)), that schedules a Timer event one second in the future. These are the heartbeat of your simulation.
Network events move data between connections. DataDelivery puts bytes into a connection’s receive buffer. ProcessSendBuffer drains the send side. FinDelivery signals a graceful close after all data has been delivered.
Connection events change connection state. ConnectionReady means a new connection is established. PartitionRestore ends a network partition. ClogClear lifts a simulated delay on writes. HalfOpenError starts failing a connection that looks alive but is not.
Storage events handle simulated disk I/O. Reads and writes are scheduled with realistic latency and can be injected with faults like corruption or torn writes.
Process lifecycle events manage reboots. ProcessGracefulShutdown signals a process to clean up. ProcessForceKill aborts it after the grace period. ProcessRestart brings it back.
Shutdown wakes all tasks for orderly termination at the end of a simulation.
Tracing the Causal Chain
When an assertion fires, the question is: what caused this? The event trace gives you the answer, but you read it backwards.
Start at the failure. Look at the last few events before the panic. Usually one of them is the trigger: a DataDelivery that delivered a stale message, a Timer that expired causing a timeout, a ConnectionReady that reconnected during a partition. Then ask what scheduled that event. Follow the chain back through the trace.
For example, suppose your conservation law invariant fires after event #312. Look at event #312: it is a DataDelivery on connection 7. What was connection 7? The trace shows it was established at event #201 between the workload and a KV server process. What did the delivery contain? A withdraw response. But the model expected a deposit. Now you have a lead.
Using RNG Call Count
Every random decision in the simulation consumes one or more calls to the deterministic RNG. The total call count at any point in the execution is a precise fingerprint of “where we are.”
When comparing a working seed against a failing seed, the RNG call count tells you exactly where their executions diverge. If both seeds process events identically through RNG call 847, but diverge at call 848, the code executing at that point made a different random choice that led down the failing path.
This technique is especially useful for regression testing: if you fix a bug and the RNG call pattern changes, you know your fix altered the execution path (which is expected). If it does not change, your fix might not be reaching the right code.
Infrastructure vs Workload Events
Not every event in the trace matters to your investigation. The simulation marks some events as infrastructure: PartitionRestore, SendPartitionClear, RecvPartitionClear, CutRestore, and ProcessRestart. These maintain simulation state but do not represent application work.
The simulation uses this distinction internally to decide when to terminate. After all workloads finish, if only infrastructure events remain in the queue, the simulation can safely end. When reading traces, you can often skip over these events and focus on DataDelivery, Timer, and Storage events that directly affect your application logic.
Practical Tips
Start narrow. Use RUST_LOG=error first to see just the failure. Then widen to RUST_LOG=debug or RUST_LOG=trace only if you need more context.
Search for the event sequence number. The invariant failure happens after a specific sim.step() call. The event processed in that step has a sequence number. Search for it in the trace.
Count backwards. If the failure is at event #312, the cause is often in the 5-10 events before it, not 200 events earlier.
Compare two seeds. Run a passing seed and a failing seed side by side with trace output. Diff the two traces. The first divergence point is where the bug’s path begins.
Common Pitfalls
- Storage Needs the Step Loop
- Missing yield_now() Calls
- Using unwrap()
- Direct Tokio Calls
- Using LocalSet
- Missing #[async_trait(?Send)]
- Borrow Checker Fights in world.rs
- HashMap Iteration Non-Determinism
- Forgetting to Publish State for Invariants
A reference list of mistakes we have seen (and made) when building simulations with Moonpool.
Storage Needs the Step Loop
Network operations buffer data and return Poll::Ready immediately. Storage operations return Poll::Pending and wait for the simulation to process them. If you await a storage operation without stepping the simulation, your workload hangs forever.
Fix: Use the step loop pattern for storage tests:
#![allow(unused)]
fn main() {
let handle = tokio::task::spawn_local(async move {
let mut file = provider.open("test.txt", OpenOptions::create_write()).await?;
file.write_all(b"hello").await?;
file.sync_all().await
});
while !handle.is_finished() {
while sim.pending_event_count() > 0 {
sim.step();
}
tokio::task::yield_now().await;
}
}
Missing yield_now() Calls
Spawned tasks via spawn_local do not run until the current task yields. If your workload spawns a task and immediately checks its result without yielding, the task has never run.
Fix: Call tokio::task::yield_now().await after spawning, and in loops where you wait for spawned tasks to complete.
Using unwrap()
Moonpool follows a strict no-unwrap() policy. In simulation, a panic from unwrap() is not a clean error report. It is an uncontrolled crash that may mask the real failure and confuse the assertion system.
Fix: Use Result<T, E> with ? everywhere. Map errors with context when needed.
Direct Tokio Calls
Calling tokio::time::sleep(), tokio::time::timeout(), or tokio::spawn() bypasses the simulation’s control of time and task scheduling. Your code will use real wall-clock time instead of simulated time, and the simulation cannot inject faults.
Fix: Use provider traits: time.sleep(), time.timeout(), task_provider.spawn_task().
Using LocalSet
The tokio::task::LocalSet runtime conflicts with Moonpool’s simulation engine.
Fix: Use tokio::runtime::Builder::new_current_thread().build_local() only.
Missing #[async_trait(?Send)]
Moonpool runs on a single thread. All types are !Send. If you apply #[async_trait] without the (?Send) argument, the macro adds a Send bound to your futures, and the compiler will reject them.
Fix: Always use #[async_trait(?Send)] for networking traits.
Borrow Checker Fights in world.rs
When working on simulation internals, you may need to access a connection (inner.network.connections.get_mut()) and then schedule an event (inner.event_queue.schedule()). The borrow checker sees both as borrows of inner.
Fix: Extract values from the connection into local variables before calling functions that take &mut SimInner. NLL allows the borrow of conn to end before you borrow inner again, as long as you do not use conn after the second borrow begins.
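The shape of the fix, with illustrative field and function names rather than the real world.rs internals:
#![allow(unused)]
fn main() {
// Copy what the event needs out of the connection first. The borrow
// taken inside map() ends there (NLL), so a function that takes
// &mut SimInner can be called afterwards without conflict.
let extracted = inner
    .network
    .connections
    .get_mut(&conn_id)
    .map(|conn| (conn.peer_addr.clone(), conn.latency_ms));
if let Some((peer, delay_ms)) = extracted {
    schedule_data_delivery(inner, peer, delay_ms);
}
}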
HashMap Iteration Non-Determinism
std::collections::HashMap does not guarantee iteration order, and the order can vary between runs. If your workload iterates a HashMap and the iteration order affects behavior (choosing which account to process, which message to send), you have introduced non-determinism.
Fix: Use BTreeMap when iteration order matters, or collect into a Vec and sort before iterating.
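If a HashMap must stay for other reasons, sorting before iteration restores determinism:
#![allow(unused)]
fn main() {
use std::collections::HashMap;
let accounts: HashMap<String, u64> = HashMap::new();
// Collect and sort so iteration order is stable across runs.
let mut ordered: Vec<_> = accounts.iter().collect();
ordered.sort();
for (name, balance) in ordered {
    // process accounts in a deterministic order
}
}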
Forgetting to Publish State for Invariants
Invariants read from StateHandle. If your workload modifies its model but forgets to call ctx.state().publish(...), invariants see stale data and either miss bugs or report false violations.
Fix: Publish state after every mutation, not just at the end.
#![allow(unused)]
fn main() {
self.model.record_commit(slot, value);
ctx.state().publish("consensus_model", self.model.clone());
}
Discovering Properties
- The Attention Focus Pattern
- Eight Focuses for Moonpool Code
- Using the Focuses
- What Discovery Produces
- A Quick Example
You know the assertion macros. You know the buggify patterns. But when you sit down with a fresh Process and Workload, the hardest question is not how to assert — it is what to assert. Where should assert_always! go? Which code paths need assert_sometimes!? Where would buggify!() expose the most interesting failures?
A single unstructured pass through your code will find the obvious properties. The non-obvious ones — the properties that catch real bugs — require looking at the same code from multiple independent angles.
The Attention Focus Pattern
The idea comes from Antithesis’s property discovery methodology. Instead of reading through code once and noting assertions as they occur to you, you examine the same code eight times, each time through a different lens. Each lens is called an attention focus.
Why does this work? Because different failure modes hide in different mental models. A developer thinking about crash recovery notices different things than one thinking about concurrency. A focus on protocol contracts surfaces properties that a focus on resource boundaries would miss entirely. The structured repetition is the point.
Eight Focuses for Moonpool Code
State Integrity examines in-memory invariants and storage persistence. What state must survive reboots? What write ordering assumptions exist? If a process is killed between writing field A and field B, does recovery handle the inconsistency? Look for monotonicity properties (ballot numbers, sequence IDs), derived state that could diverge from its source, and reference model expectations that the process might violate.
Concurrency looks at races between workloads hitting the same process. Moonpool is single-threaded, but async interleaving across .await points creates real concurrency hazards. Check-then-act patterns where another task could mutate state between the check and the act. Multiple workloads accessing the same key. Rc<RefCell<>> state touched across yield points.
Crash Recovery asks what happens when a Process is killed and restarted from its factory. The factory returns a blank instance — all in-memory state is gone. Partially-written storage operations (write without sync_all()) may leave corrupt data. Recovery code that assumes clean state will miss torn writes. Operations interrupted between a storage write and a sync are the classic source of subtle bugs.
Network Faults examines behavior under connection drops, partitions, and reordering. RPC calls without timeouts. Retry logic that is not idempotent. Stale cached state after a partition heals (old leader references, expired peer info). Fire-and-forget sends where delivery actually matters.
Timing & Scheduling checks sensitivity to event ordering. Hardcoded timeouts that interact with other timeouts. Logic that assumes timers fire before network events arrive. Election timeouts that could overlap with heartbeat timeouts. Races between timer expiry and data delivery.
Resource Boundaries hunts for unbounded growth. Vec or VecDeque that grows without limit under sustained load. Missing backpressure on incoming requests. Operations that assume connections or file handles are always available. These bugs only surface under chaos, which is exactly when they matter most.
Protocol Contracts looks for guarantees that are claimed but not enforced. Doc comments that say “returns error if not found” but the code returns a default. Ordering assumptions between RPC calls (create before update) that nothing validates. Response types that mask partial failures as success.
Lifecycle Transitions examines startup, shutdown, and reboot sequences. Requests arriving before initialization completes. In-flight work silently dropped during graceful shutdown. State published to StateRegistry before it is valid. Shutdown ordering dependencies where closing connections before flushing storage loses data.
Using the Focuses
There are two ways to work through the focuses.
Ensemble mode runs all focuses in parallel. For each focus, a separate analysis pass examines the code through that single lens and produces candidate assertions and buggify points. The results are then synthesized: duplicates found by multiple focuses are high-confidence properties, while unique finds from a single focus are high-value catches that a single pass would have missed.
Sequential mode works through the focuses as a checklist. The key discipline is making an explicit pass for each focus. Do not skip a focus because an earlier pass “already covered” that area. The value is in the independent perspective — the same line of code looks different through the lens of crash recovery than through the lens of concurrency.
What Discovery Produces
Each discovered property maps to a concrete assertion or buggify placement:
- Location: exact file and line range
- Macro: which assertion type (assert_always!, assert_sometimes!, buggify!(), etc.)
- Message: unique, descriptive assertion message
- Rationale: what bug this catches and why it matters
- Provenance: which focus or focuses surfaced it
Properties found by multiple focuses independently are strong candidates. They represent invariants that sit at the intersection of multiple failure modes. Properties found by only one focus are equally important — they represent the blind spots that unstructured review misses.
A Quick Example
Consider a key-value Process that accepts writes over RPC and persists them to storage. A single review pass might add assert_always! for “response matches stored value” and call it done.
Running through the focuses reveals more:
- State Integrity: the process maintains a write counter. Is it monotonically increasing? assert_always!(new_count > old_count, "write counter regression")
- Crash Recovery: writes go to an in-memory map, then flush to storage on a timer. A crash between write and flush loses data. buggify!() on the flush path to test this. assert_sometimes! to verify the recovery path is actually exercised.
- Concurrency: two workloads writing the same key. The last-write-wins semantics should hold. assert_always! on read-after-write consistency from each workload’s perspective.
- Network Faults: the client retries on timeout, but the write is not idempotent. A retry after an ambiguous failure could double-apply. assert_always! on the write count matching expected count.
Four focuses, four distinct properties, one of which (the retry idempotency bug) is the kind of subtle issue that crashes production systems. A single pass would likely have caught only the first.
The /discover-properties skill automates this process. It examines your Process and Workload code through all eight focuses and produces a structured list of assertion and buggify placements, ready to implement.
Using moonpool-sim Standalone
- The Technical Foundation
- Proof: Real HTTP Over Simulated TCP
- The Send Constraint
- What This Means For You
moonpool, the crate, re-exports the full framework: transport, RPC, #[service] macros. Plenty of machinery for building distributed systems from scratch.
But moonpool-sim is a standalone simulation engine. Provider traits, chaos injection, assertions, fork-based exploration. All of it works without importing a single transport type. No Peer, no NetTransport, no #[service]. Just deterministic simulation of your existing code.
Why does this matter? Because most teams aren’t building distributed systems from scratch. They’re running axum services behind a load balancer, talking to Postgres and Redis, shipping features. The transport layer is irrelevant to them. The simulation engine is not.
The Technical Foundation
The key fact that makes this possible lives in the NetworkProvider trait:
#![allow(unused)]
fn main() {
pub trait NetworkProvider: Clone {
type TcpStream: AsyncRead + AsyncWrite + Unpin + 'static;
// ...
}
}
SimTcpStream implements tokio::io::AsyncRead + AsyncWrite + Unpin. That makes it a drop-in replacement for tokio::net::TcpStream anywhere the tokio ecosystem uses trait-based I/O. And the tokio ecosystem uses trait-based I/O everywhere that matters: hyper, tonic, tower, axum (via hyper), sqlx’s wire protocol, redis-rs.
This isn’t an accident. We designed the provider traits to match tokio’s interfaces exactly because we wanted existing libraries to work unchanged.
Proof: Real HTTP Over Simulated TCP
The hyper integration test in moonpool-sim/tests/hyper_http.rs demonstrates this concretely. Unmodified hyper HTTP/1.1 running over simulated TCP with chaos injection:
#![allow(unused)]
fn main() {
struct HyperServer;
#[async_trait(?Send)]
impl Process for HyperServer {
fn name(&self) -> &str { "server" }
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()> {
let listener = ctx.network().bind(ctx.my_ip()).await?;
let (stream, _addr) = tokio::select! {
result = listener.accept() => result?,
_ = ctx.shutdown().cancelled() => return Ok(()),
};
// TokioIo bridges SimTcpStream into hyper's type system
let io = TokioIo::new(stream);
hyper::server::conn::http1::Builder::new()
.serve_connection(io, service_fn(handle_request))
.await?;
Ok(())
}
}
}
The TokioIo adapter bridges any AsyncRead + AsyncWrite into hyper’s internal I/O type. Because SimTcpStream satisfies those bounds, hyper never knows it’s running over simulated networking. The HTTP parser, chunked encoding, keep-alive logic, content-length validation: all exercised for real.
The client side follows the same pattern. Connect via ctx.network().connect(), wrap in TokioIo, hand to hyper::client::conn::http1::handshake. Real HTTP/1.1 request-response cycles over a network that drops packets, injects latency, and kills connections.
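A sketch of that client side, assuming hyper 1.x and the http_body_util crate; server_ip stands in for however the workload learns the server address:
#![allow(unused)]
fn main() {
// Inside the workload's run() method
let stream = ctx.network().connect(server_ip).await?;
let io = TokioIo::new(stream);
let (mut sender, conn) = hyper::client::conn::http1::handshake(io).await?;
// The connection future owns the !Send stream, so drive it with
// spawn_local (see the next section on the Send constraint).
tokio::task::spawn_local(conn);
let request = http::Request::builder()
    .uri("/")
    .body(http_body_util::Empty::<bytes::Bytes>::new())?;
let response = sender.send_request(request).await?;
assert_always!(response.status().is_success(), "server should answer over simulated TCP");
}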
The Send Constraint
One constraint to understand: SimTcpStream is !Send. It lives inside the simulation’s single-threaded runtime. This means you cannot use tokio::spawn() for futures that hold a stream reference, because tokio::spawn requires Send.
The fix is straightforward: use tokio::task::spawn_local instead.
#![allow(unused)]
fn main() {
// Won't compile: spawn requires Send, SimTcpStream is !Send
// tokio::spawn(async move { serve_connection(io, service).await });
// Works: spawn_local runs on the current thread
tokio::task::spawn_local(async move {
hyper::server::conn::http1::Builder::new()
.serve_connection(io, service)
.await
});
}
Most web frameworks work fine under this constraint because the connection-level future holds the stream, and handlers are polled inline within that future. The handler functions themselves can be Send (axum requires this). The connection future that wraps them is !Send because it holds the stream. Both coexist because hyper polls handlers inline, never spawning them onto a separate task.
What This Means For You
If your application uses tokio::net::TcpStream through trait-based I/O (which hyper, axum, tonic, and most of the ecosystem do), you can simulate it. The process is:
- Define a Process that binds a listener and serves connections
- Define a Workload that connects and sends requests
- Wire them together with SimulationBuilder
- Run thousands of iterations with chaos injection
No actor system required. No RPC framework. No new programming model. Just your existing HTTP handlers running over a network that tries to break them.
Where to Draw the Line
- The Partial Failure Problem
- Fakes Give You Control
- The Fidelity Spectrum
- Per-Dependency Guidance
- Rules of Engagement
- The 80% Argument
The instinct when someone says “fake your dependencies” is to feel like you’re cutting corners. Real databases catch real bugs. Test containers give you the real thing. A BTreeMap pretending to be Postgres is a compromise.
That instinct is wrong. Fakes aren’t a compromise. They’re more powerful than real dependencies.
The Partial Failure Problem
Test containers give you binary failure: the whole service is up, or the whole service is down. Docker kills the container, your test sees a connection refused, you verify your retry logic works.
Production failures are never binary. Kafka loses partition 3 while partitions 1, 2, and 4 stay healthy. A Postgres replica lags 800ms behind the primary while the primary is fine. One Redis shard OOMs while five others serve traffic normally. Your S3 bucket returns 503 on 2% of PUTs while GETs succeed at full speed.
These partial failures are where bugs hide. The request that reads from the lagging replica and writes to the healthy primary. The consumer that rebalances partitions and loses its offset for exactly one topic. The cache lookup that fails for one key prefix while the rest of the keyspace works.
Test containers cannot produce these failures. A container is a black box. You can start it or stop it. You cannot reach inside and make partition 3 return errors while partition 4 succeeds.
Fakes Give You Control
A trait-based fake controls the failure surface at arbitrary granularity. Your MessageBroker trait fake can return Ok for partition 1 and Err for partition 3 in the same call. Your Database fake can inject 200ms latency on reads from replica 2 while replica 1 responds instantly. Your Cache fake can evict entries for keys matching a pattern while retaining everything else.
Combined with moonpool’s buggify!() macro, these fakes become probabilistic fault injectors. Every operation has a chance of failure, controlled by a deterministic seed. When a test fails, you replay the exact same seed and get the exact same sequence of partial failures.
#![allow(unused)]
fn main() {
impl Store for InMemoryStore {
fn create(&self, name: &str) -> Result<Item, StoreError> {
// 25% chance of write failure when buggify is enabled.
// A real Postgres can only be fully up or fully down.
// This fake can fail individual writes.
if buggify!() {
return Err(StoreError::WriteFailed("buggified".into()));
}
// ... normal implementation
}
}
}
The Fidelity Spectrum
Not every dependency needs full simulation. Think of fakes on a spectrum:
No-op: Returns Ok(()) for everything. Useful when you don’t care about a dependency’s behavior, just that calls don’t crash. Logging, metrics, tracing facades.
In-memory: BTreeMap-based storage, VecDeque-based queues. Correct behavior without persistence. Good for most unit and integration tests.
Fault-injectable: In-memory with buggify!() on operations. Correct behavior most of the time, controllable failures when chaos is enabled. This is where most simulation fakes live.
Full simulation via moonpool: The dependency runs as a Process in the simulation. Network traffic is simulated with latency, drops, and corruption. Reserved for components where the network interaction IS the interesting behavior (your own services, consensus protocols).
Per-Dependency Guidance
Network (HTTP, gRPC, TCP): Simulate via moonpool. This is the sweet spot. Real HTTP parsing, real serialization, simulated transport. Your handlers run unchanged.
Database: Trait fake with BTreeMap. Model the operations your code actually uses (CRUD, transactions, queries by index). Inject failures per-operation. Don’t try to simulate SQL parsing.
Message brokers: Trait fake with per-partition control. A VecDeque<Message> per partition, with injectable failures per-partition. Model consumer group rebalancing if your code depends on it. A sketch follows this list.
External HTTP APIs: Canned responses with injectable failures. Your Stripe client fake returns a known charge object, but buggify!() returns a 429 or network timeout 10% of the time.
Cache: Trait fake with partial cluster modeling if you need it, simple HashMap if you don’t. Inject evictions and connection failures.
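Returning to the message broker entry, here is a minimal sketch of per-partition control; the Message, BrokerError, and FakeBroker types are hypothetical, not moonpool APIs:
#![allow(unused)]
fn main() {
use std::collections::VecDeque;
#[derive(Debug, Clone)]
pub struct Message(pub Vec<u8>);
#[derive(Debug)]
pub enum BrokerError {
    PartitionUnavailable(usize),
}
// One queue per partition, so partition 3 can fail while 1, 2, and 4
// keep serving traffic: the partial failure a container cannot produce.
pub struct FakeBroker {
    partitions: Vec<VecDeque<Message>>,
}
impl FakeBroker {
    pub fn publish(&mut self, partition: usize, msg: Message) -> Result<(), BrokerError> {
        // Per-operation fault injection, driven by the deterministic seed.
        if buggify!() {
            return Err(BrokerError::PartitionUnavailable(partition));
        }
        self.partitions[partition].push_back(msg);
        Ok(())
    }
}
}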
Rules of Engagement
BTreeMap, not HashMap. Deterministic iteration order matters for reproducibility. A HashMap iterating in different order across runs makes your simulation non-deterministic.
Send + Sync for axum State. Axum requires State to be Send + Sync. Your fakes need Arc<RwLock<BTreeMap<...>>>, not Rc<RefCell<...>>. This is actually fine because the simulation runs single-threaded anyway. The lock is never contended.
The Oxide convention. Oxide’s omicron and crucible projects keep fakes in a fakes/ module alongside the real implementation. The trait and the fake ship together. When you change the trait, you update the fake in the same PR. This keeps fakes from drifting.
The 80% Argument
A fake covering 80% of a dependency’s behavior with determinism and fault injection is strictly better than a test container covering 100% with no control over failures.
Test containers are non-deterministic. They’re slow (seconds to start). They require Docker (not available in all CI environments). They break across versions (Postgres 15 vs 16 container images). They can’t produce partial failures. A failing test gives you a log and a prayer.
A fake compiles with your project. It runs in microseconds. It reproduces from a seed. It produces failures that containers physically cannot. The 80% you model covers the behavior your code actually depends on.
The remaining 20% you don’t model? That belongs in separate integration tests, run less often, against real infrastructure. Nightly CI with actual Postgres and Kafka containers. Those tests verify your fakes are faithful. The simulation tests, run on every commit, verify your code handles failure correctly.
Both kinds of tests improve each other. When an integration test reveals a failure mode your fake doesn’t model, you add it to the fake. When a simulation test finds a bug under partial failure, you verify the fix against real infrastructure. The two approaches compound.
Wiring a Web Service
- Step 1: The Store Trait
- Step 2: The InMemoryStore
- Step 3: The Axum Router
- Step 4: The Process
- Step 5: The Workload
- Step 6: Wire It Together
Theory is cheap. Here’s a complete worked example: an axum web service running inside moonpool-sim with chaos injection, fault-injectable storage, and assertion-based validation. The full source lives in moonpool-sim-examples/src/axum_web.rs.
Step 1: The Store Trait
Every dependency boundary starts with a trait. This one models item persistence:
#![allow(unused)]
fn main() {
pub trait Store: Send + Sync + 'static {
fn create(&self, name: &str) -> Result<Item, StoreError>;
fn get(&self, id: u64) -> Result<Option<Item>, StoreError>;
}
}
Send + Sync + 'static because axum requires State to be Send + Sync. In production, this trait is backed by Postgres or SQLite. In simulation, it’s backed by a BTreeMap.
Notice these are synchronous methods. The real database calls would be async, but for a fake that never does I/O, synchronous is simpler and equally correct. If your production trait has async methods, that works too.
Step 2: The InMemoryStore
BTreeMap for deterministic ordering. AtomicU64 for ID generation. RwLock because axum needs Send + Sync.
#![allow(unused)]
fn main() {
pub struct InMemoryStore {
items: RwLock<BTreeMap<u64, Item>>,
next_id: AtomicU64,
}
impl Store for InMemoryStore {
fn create(&self, name: &str) -> Result<Item, StoreError> {
// Fault injection: randomly fail writes.
// Models disk full, replication lag, constraint violations.
if buggify!() {
return Err(StoreError::WriteFailed("buggified".into()));
}
let id = self.next_id.fetch_add(1, Ordering::Relaxed);
let item = Item { id, name: name.to_string() };
self.items.write()
.map_err(|e| StoreError::WriteFailed(format!("{e}")))?
.insert(id, item.clone());
Ok(item)
}
fn get(&self, id: u64) -> Result<Option<Item>, StoreError> {
// Lower probability: reads fail less often than writes in practice.
if buggify_with_prob!(0.05) {
return Err(StoreError::ReadFailed("buggified".into()));
}
Ok(self.items.read()
.map_err(|e| StoreError::ReadFailed(format!("{e}")))?
.get(&id).cloned())
}
}
}
The buggify!() calls are the whole point. A Postgres container is either up or down. This fake can fail a write while the next read succeeds. It can fail creates at 25% while gets fail at 5%, modeling asymmetric failure that actually happens in production.
Step 3: The Axum Router
Standard axum. Nothing moonpool-specific here:
#![allow(unused)]
fn main() {
pub fn build_router(store: Arc<dyn Store>) -> axum::Router {
axum::Router::new()
.route("/health", get(health))
.route("/items", post(create_item))
.route("/items/{id}", get(get_item))
.with_state(store)
}
}
The handlers use State(store): State<Arc<dyn Store>> and return standard axum responses. create_item returns 201 on success, 500 when the store fails. get_item returns 200, 404, or 500. If you already have an axum app, your existing router works here.
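For concreteness, a minimal sketch of create_item under those conventions. The CreateItem body type is assumed here (the real handlers live in the example source), and Item is assumed to derive Serialize:
#![allow(unused)]
fn main() {
use axum::{extract::State, http::StatusCode, response::IntoResponse, Json};
use std::sync::Arc;
#[derive(serde::Deserialize)]
struct CreateItem { name: String }
async fn create_item(
    State(store): State<Arc<dyn Store>>,
    Json(body): Json<CreateItem>,
) -> impl IntoResponse {
    match store.create(&body.name) {
        // 201 with the created item on success...
        Ok(item) => (StatusCode::CREATED, Json(item)).into_response(),
        // ...500 when the (possibly buggified) store fails.
        Err(_) => StatusCode::INTERNAL_SERVER_ERROR.into_response(),
    }
}
}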
Step 4: The Process
This is where moonpool enters the picture. A Process is the system under test, running on a simulated server node:
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
impl Process for WebProcess {
fn name(&self) -> &str { "web" }
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()> {
// Arc because build_router takes Arc<dyn Store> for axum State.
let store: Arc<dyn Store> = Arc::new(InMemoryStore::new());
let app = build_router(store);
let listener = ctx.network().bind(ctx.my_ip()).await?;
loop {
let (stream, _addr) = tokio::select! {
result = listener.accept() => result?,
_ = ctx.shutdown().cancelled() => return Ok(()),
};
let io = TokioIo::new(stream);
// TowerToHyperService bridges axum's tower::Service to hyper's Service
let service = TowerToHyperService::new(app.clone());
// spawn_local, not spawn: the future holds !Send SimTcpStream.
// Axum handlers ARE Send (axum's requirement), but hyper polls
// them inline within the connection future. Both coexist correctly.
tokio::task::spawn_local(async move {
if let Err(e) = hyper::server::conn::http1::Builder::new()
.serve_connection(io, service)
.await
{
tracing::debug!("hyper error (expected under chaos): {e}");
}
});
}
}
}
}
Two things to note. First, we use hyper::server::conn::http1::serve_connection, not axum::serve(). axum::serve takes tokio::net::TcpListener directly, so it can’t accept our simulated listener. serve_connection takes any AsyncRead + AsyncWrite, which SimTcpStream satisfies through the TokioIo adapter.
Second, spawn_local instead of spawn. The future holds a SimTcpStream which is !Send. Axum handlers remain Send (axum enforces this at compile time). The two coexist because hyper polls handlers inline within the connection future. The handler never escapes to another thread. This is architecturally correct, not a workaround.
Step 5: The Workload
The workload is the test driver. It connects to the process, sends requests, and validates responses:
#![allow(unused)]
fn main() {
#[async_trait(?Send)]
impl Workload for WebWorkload {
fn name(&self) -> &str { "client" }
async fn run(&mut self, ctx: &SimContext) -> SimulationResult<()> {
let server_ip = ctx.peer("web").ok_or_else(|| {
SimulationError::InvalidState("web process not found".into())
})?;
for round in 0..5 {
match self.send_round(ctx, &server_ip, round).await {
Ok(()) => {}
Err(e) => {
// Under chaos, requests can fail. That's expected.
assert_sometimes!(true, "request_round_failed");
tracing::debug!("round {round} failed: {e}");
}
}
}
Ok(())
}
}
}
Inside send_round, the workload creates a hyper client connection, sends requests, and uses assertions to validate behavior:
- assert_always! for invariants: health returns 200, read-after-write returns the same data, nonexistent items return 404 or 500.
- assert_sometimes! for coverage: items sometimes created successfully, store reads sometimes fail, request rounds sometimes fail under chaos.
The assert_sometimes! calls are how moonpool knows it’s actually exercising error paths. If store_write_failed never triggers across thousands of iterations, something is wrong with the chaos configuration.
Step 6: Wire It Together
#![allow(unused)]
fn main() {
SimulationBuilder::new()
.processes(1, || Box::new(WebProcess))
.workload(WebWorkload)
.set_iterations(10)
.run();
}
One web server process, one workload driving requests, ten iterations with different seeds. Each iteration creates a fresh simulation: new network, new processes, new store state, new buggify activation decisions.
The default network configuration injects latency and connection faults. Combined with buggify!() in the store, your handlers face both network-level chaos (connection drops, latency spikes) and application-level chaos (write failures, stale reads) deterministically and reproducibly.
When a seed fails, you replay it with set_debug_seeds(vec![failing_seed]) and set_iterations(1) to reproduce the exact sequence of events that triggered the bug.
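In builder form, the replay looks like this, where failing_seed stands in for whatever seed the failing run reported:
#![allow(unused)]
fn main() {
// Same wiring as above, but pinned to the one seed that failed.
SimulationBuilder::new()
    .processes(1, || Box::new(WebProcess))
    .workload(WebWorkload)
    .set_debug_seeds(vec![failing_seed]) // the seed printed by the failing run
    .set_iterations(1)
    .run();
}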
What You’re Testing (and What You’re Not)
Running an axum service inside moonpool-sim exercises a specific slice of your application’s behavior. Understanding what falls inside and outside that slice prevents both false confidence and unnecessary skepticism.
What You Are Testing
Real handler logic. Your axum handlers run unchanged. JSON serialization and deserialization happen for real (serde, not mocked). Routing matches against actual paths. Middleware executes in order. Extractors parse real HTTP requests. If your handler has a bug in its JSON response structure, simulation catches it.
HTTP behavior under chaos. Requests arrive over simulated TCP with injected latency, connection drops, and data corruption. Hyper’s HTTP/1.1 parser processes real wire bytes through a network that actively tries to break things. Half-closed connections, incomplete messages, connection resets mid-response: all exercised.
Concurrent request handling. Multiple connections served simultaneously via spawn_local. Race conditions between concurrent handlers accessing shared state (your Store fake) surface under different scheduling orders across seeds.
Error paths. Connection failures, timeouts, process reboots mid-request, store failures via buggify!(). The workload validates that your service handles these gracefully: returns appropriate status codes, doesn’t panic, doesn’t corrupt state.
Recovery after crash. With attrition enabled, moonpool kills and restarts processes. Your Process::run method executes fresh after each reboot. If your service leaks state across restarts or fails to rebind its listener, simulation finds it.
What You Are Not Testing
Production startup code. axum::serve() binds a real tokio::net::TcpListener and manages the accept loop internally. In simulation, we use hyper::server::conn::http1::serve_connection with a manual accept loop. If your production startup has a bug (wrong bind address, missing middleware registration), simulation won’t catch it.
Real database execution. The Store fake is a BTreeMap, not Postgres. SQL query plans, transaction isolation levels, connection pool behavior, schema migrations: none of these are exercised. A query that works on BTreeMap but generates incorrect SQL won’t be caught.
TLS. Simulated TCP streams are plaintext. TLS handshake failures, certificate validation, protocol negotiation: not exercised. If your service has a TLS configuration bug, you need integration tests against real TLS.
Real TCP backpressure and congestion. moonpool-sim models connection-level faults (drops, latency, corruption) but not TCP flow control, window sizing, or congestion algorithms. If your service has a bug that only manifests under real TCP backpressure, simulation won’t find it.
OS resource limits. File descriptor exhaustion, memory pressure, CPU scheduling. The simulation runs in a single process with no OS-level resource constraints. A service that leaks file descriptors will appear healthy in simulation.
The Same Tradeoff Everyone Makes
This is the exact tradeoff every simulation system makes. AWS has run ShardStore for over 15 years with the same architecture: real application logic, simulated network, faked storage. FoundationDB simulates network and disk but not the OS kernel. TigerBeetle simulates I/O but not the filesystem. Antithesis runs real binaries but controls the kernel, not the hardware.
The common thread: simulate the interaction boundaries where interesting failures occur (network, storage I/O), use real implementations for computation (parsers, serializers, business logic), and accept that some classes of bugs require different testing approaches.
Incremental Adoption
You don’t have to simulate your entire application on day one. Start with one module. The trait boundary you create for simulation (Store, Cache, MessageBroker) also improves your production architecture. Dependency injection makes code testable with or without simulation. Trait-based boundaries make components swappable.
Each module you bring under simulation compounds coverage. Your first module finds bugs in its own error handling. Your second module, interacting with the first under chaos, finds bugs in the interaction. By the third module, the simulation is finding bugs you didn’t know to look for.
The entry cost is one trait and one fake per dependency. The ongoing cost is maintaining fakes when traits change (the Oxide convention of shipping fake alongside trait helps). The return is deterministic, reproducible, chaos-tested coverage of your actual application logic, running in milliseconds on every commit.
Simulating the Network
When we built the provider traits in Part 2, we abstracted network I/O behind NetworkProvider. Now we need to decide what to simulate. This choice shapes everything that follows.
TCP, Not Packets
Many network simulators model individual packets: MTU fragmentation, congestion windows, selective acknowledgments, out-of-order delivery at the segment level. That approach is powerful for protocol research, but it is the wrong tool for testing distributed systems.
Our systems speak TCP. They open connections, send messages, read responses, and handle disconnections. The bugs that kill production clusters happen at connection granularity, not packet granularity:
- A node loses its connection mid-write and the remote sees a partial message
- A connection hangs in half-open state where one side thinks it is alive and the other does not
- Three nodes reconnect simultaneously after a network partition and overwhelm each other
- A process reboots and all its peers attempt reconnection at the same moment
These failure modes do not require modeling individual packet routing. They require modeling connection lifecycle: establishment, latency, graceful close, abrupt close, and half-open states.
FoundationDB reached the same conclusion. Their FlowTransport layer simulates connections, not packets. Their sim2.actor.cpp provides TCP-like streams that can be delayed, corrupted, or severed. Individual packet routing, MTU sizes, and congestion control are left to the real kernel. The moonpool-transport crate is a Rust reimplementation of FlowTransport’s architecture. The key difference: moonpool’s transport is codec-agnostic. A pluggable MessageCodec trait lets you use any serialization format, where FoundationDB is locked to its own serializer.
What We Simulate
The simulation network provides:
- Connection establishment with configurable latency. A connect() call goes through the simulated network, introducing delays that the chaos engine can stretch or shorten.
- Reliable byte streams that behave like TCP. Once connected, reads and writes operate on a stream of bytes, not discrete packets. The wire format handles framing.
- Connection failures injected by the chaos engine. Connections can be dropped at any point, forcing reconnection logic to activate.
- Half-open connections where one side believes the connection is alive while the other has closed it. This requires the FIN delivery mechanism we built into the simulation.
- Latency injection on data delivery events. Messages are not delivered instantly between simulated processes. They sit in the event queue with configurable delays.
What We Do Not Simulate
We deliberately skip:
- Individual packet routing between simulated hosts
- MTU and fragmentation at the IP layer
- Congestion windows and flow control (TCP slow start, etc.)
- DNS resolution
- TLS handshakes (the simulation trusts all connections)
These are real concerns in production, but they are not where distributed system bugs hide. A consensus protocol that breaks under packet reordering has a design flaw, not a testing gap.
Same Code, Two Worlds
The key architectural property: the same application code runs in both environments. A server process that opens connections, sends RPC requests, and handles failures uses the Providers trait bundle. In production, that bundle contains TokioNetworkProvider backed by real TCP sockets. In simulation, it contains the simulated network where every connection goes through the SimWorld event queue.
Application Code
        │
        ▼
┌───────────────┐
│   Providers   │ ◄── trait bundle
├───────────────┤
│ NetworkProv.  │ ◄── connect(), bind(), accept()
└───────┬───────┘
        │
   ┌────┴────┐
   │         │
   ▼         ▼
Real TCP   SimWorld
 (tokio)   (event queue)
There is no #[cfg(test)] branching. No mock objects. The transport layer, the peer abstraction, the RPC system, and the wire format are all production code that happens to run inside a deterministic simulation.
This is what makes the approach work. We are not testing a simplified model of our networking. We are testing the actual networking code against a hostile simulated environment that is worse than production.
Peers and Connections
- The Peer Abstraction
- Message Queuing During Disconnection
- Connection Lifecycle
- Health Monitoring with Ping/Pong
- FailureMonitor Integration
- Under Simulation
A raw TCP connection is fragile. It can drop at any moment, and when it does, every in-flight message is lost. Our distributed system needs something more resilient: a logical connection that survives transient network failures. That is the Peer.
The Peer Abstraction
A Peer represents a logical connection to a remote endpoint. Behind the scenes it manages the actual TCP stream, but from the caller’s perspective it provides a simple interface: queue a message, and the peer will deliver it. If the connection drops, the peer reconnects automatically and drains any queued messages once the link is back.
This follows FoundationDB’s FlowTransport pattern (FlowTransport.actor.cpp:1016-1125). FDB’s Peer is an actor that owns a connection and handles reconnection internally. Ours works the same way, using a background task spawned via TaskProvider.
Creating a peer is straightforward:
#![allow(unused)]
fn main() {
let peer = Peer::new(
providers.clone(),
"10.0.1.2:4500".to_string(),
PeerConfig::default(),
);
}
The peer immediately spawns a connection_task that attempts to connect to the destination. Once connected, it enters a loop: read incoming packets, write outgoing packets, and handle failures.
Message Queuing During Disconnection
When the connection is down, outgoing messages do not vanish. The peer maintains two internal queues:
- Reliable queue: Messages that must survive disconnection. When the connection drops, these are preserved and sent first after reconnection.
- Unreliable queue: Messages that can be dropped on failure. These are drained after the reliable queue.
Both queues are bounded by PeerConfig::max_queue_size (default: 1000). If the queue fills up while disconnected, new messages are dropped and the caller gets a PeerError::QueueFull error. This prevents unbounded memory growth during long outages.
Connection Lifecycle
The peer’s background task follows a clear lifecycle:
┌──────────────┐    connect     ┌───────────┐
│ Disconnected ├───────────────►│ Connected │
│              │◄───────────────┤           │
└──────┬───────┘   conn. error  └─────┬─────┘
       │                              │
       │ backoff                      │ send/recv
       ▼                              ▼
┌──────────────┐                ┌───────────┐
│ Reconnecting │                │  Active   │
│              │                │    I/O    │
└──────────────┘                └───────────┘
On successful connection, the task enters the Active I/O phase. It reads incoming packets (dispatching them via a channel) and writes outgoing packets (draining the queues). If an I/O error occurs, it transitions to Disconnected, waits through backoff, then tries again.
For incoming connections (accepted by a server listener), the peer starts already connected. The constructor Peer::new_incoming() takes an existing TCP stream and skips the initial connection attempt. If this connection drops, the peer exits rather than reconnecting, because the remote side is responsible for initiating a new connection.
Health Monitoring with Ping/Pong
A TCP connection can appear alive while the remote process is actually unresponsive. To detect this, outbound peers run a ping/pong protocol modeled on FDB’s connectionMonitor (FlowTransport.actor.cpp:616-699).
The PingTracker state machine works like this:
- After each ping_interval (default: 1 second), send a PING packet
- Wait up to ping_timeout (default: 2 seconds) for a PONG reply
- If PONG arrives, record the RTT and return to idle
- If timeout but bytes were received since the ping, tolerate it (the connection is busy, not dead)
- If timeout and no bytes were received, or if max_tolerated_timeouts consecutive tolerations occur, tear down the connection
Ping and pong packets use special wire tokens (PING_TOKEN and PONG_TOKEN) that are intercepted by the connection task and never delivered to the application layer.
This monitoring runs only on outbound peers. Incoming peers passively respond to pings but never initiate them.
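As an illustrative model only (these names are ours, not the actual PingTracker API), the timeout decision reduces to:
#![allow(unused)]
fn main() {
enum Verdict { Tolerate, TearDown }
// Illustrative model of the toleration rule described above.
fn on_ping_timeout(
    bytes_received_since_ping: bool,
    consecutive_tolerations: u32,
    max_tolerated_timeouts: u32,
) -> Verdict {
    if bytes_received_since_ping && consecutive_tolerations < max_tolerated_timeouts {
        Verdict::Tolerate // the connection is busy, not dead
    } else {
        Verdict::TearDown // silent timeout, or patience exhausted
    }
}
}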
FailureMonitor Integration
The peer’s connection task feeds status updates to the FailureMonitor, a reactive failure tracking system that delivery modes depend on. When the TCP link connects successfully, the task calls set_status(address, Available). When the connection drops, it calls set_status(address, Failed) and notify_disconnect(address).
These signals drive the RPC layer. try_get_reply races the server’s response against the disconnect signal. get_reply_unless_failed_for starts a timeout countdown from the disconnect event. Without the peer feeding accurate status updates, the delivery modes cannot detect failures.
See Failure Monitor for the full consumer API and Delivery Modes for how each mode uses these signals.
Under Simulation
In simulation, peers experience chaos:
- Random connection closes (0.001% probability)
- Connection failures (50% probabilistic during buggify)
- Partial writes
- Half-open connection simulation
The peer does not know it is running in simulation. It sees the same connect() failures and read() errors that would occur with a flaky real network. Its reconnection logic, backoff timing, and queue management are all exercised against these faults using the same code paths that run in production.
Backoff and Reconnection
When a connection fails, the worst thing a peer can do is immediately retry. If ten peers all lose their connections at the same moment (say, after a network partition heals), and all of them retry instantly, they will overwhelm the destination with simultaneous connection attempts. This creates a reconnection storm that can be worse than the original failure.
Exponential Backoff
Moonpool peers use exponential backoff on reconnection, following FoundationDB’s pattern (FlowTransport.actor.cpp:892-897). The ReconnectState tracks the current delay and doubles it after each failure, up to a configured maximum:
#![allow(unused)]
fn main() {
let next_delay = std::cmp::min(
state.reconnect_state.current_delay * 2,
config.max_reconnect_delay,
);
state.reconnect_state.current_delay = next_delay;
}
The default PeerConfig starts with a 100ms initial delay and caps at 30 seconds:
#![allow(unused)]
fn main() {
PeerConfig {
initial_reconnect_delay: Duration::from_millis(100),
max_reconnect_delay: Duration::from_secs(30),
max_queue_size: 1000,
connection_timeout: Duration::from_secs(5),
max_connection_failures: None, // Unlimited retries
monitor: Some(MonitorConfig::default()),
}
}
On a successful connection, the backoff resets to the initial delay. The failure counter resets to zero. The peer is ready for the next disruption with a clean slate.
Why It Matters in Simulation
Without backoff, simulation tests that inject network failures produce degenerate behavior. The event queue fills with connection attempts that all fail, each failure spawns another immediate retry, and the simulation spends all its time processing reconnection events instead of making progress on actual workload logic.
With backoff, the chaos engine can sever connections freely. Peers back off, the event queue stays manageable, and when connections restore, peers reconnect in a staggered pattern that avoids thundering herd effects.
You can use assert_sometimes_each! to track backoff depth across simulation runs, ensuring you exercise multiple levels of the exponential curve:
#![allow(unused)]
fn main() {
// Example: track that different backoff depths are reached
assert_sometimes_each!(
"backoff_depth",
[("attempt", failure_count)]
);
}
Profile Presets
Different network environments need different backoff tuning. PeerConfig provides presets:
| Profile | Initial Delay | Max Delay | Queue Size | Timeout | Max Failures |
|---|---|---|---|---|---|
| Default | 100ms | 30s | 1000 | 5s | Unlimited |
| Local | 10ms | 1s | 100 | 500ms | 10 |
| WAN | 500ms | 60s | 5000 | 30s | Unlimited |
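Written out as a struct literal from the table row (whether PeerConfig also exposes named constructors for these profiles is not shown in this chapter), the WAN profile is:
#![allow(unused)]
fn main() {
let wan = PeerConfig {
    initial_reconnect_delay: Duration::from_millis(500),
    max_reconnect_delay: Duration::from_secs(60),
    max_queue_size: 5000,
    connection_timeout: Duration::from_secs(30),
    max_connection_failures: None, // Unlimited
    monitor: Some(MonitorConfig::default()), // assumed: same monitoring as Default
};
}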
For simulation tests, the default profile works well. The chaos engine can buggify the actual delays through the TimeProvider, stretching or shortening them to explore timing-sensitive code paths.
Wire Format
TCP gives us a reliable byte stream, but our application needs framed messages routed to specific endpoints. The wire format bridges this gap: it defines how we serialize a message with its destination, protect it with a checksum, and parse it back on the other side.
Packet Layout
Every packet on the wire follows this structure:
0 4 8
├───────────────────┼───────────────────┤
│ length (u32) │ checksum (u32) │
├───────────────────┴───────────────────┤
│ token (UID) │
│ first: u64 (8 bytes) │
│ second: u64 (8 bytes) │
├───────────────────────────────────────┤
│ payload (N bytes) │
└───────────────────────────────────────┘
Header: 24 bytes total
- length: 4 bytes, little-endian u32 (total packet size including header)
- checksum: 4 bytes, CRC32C of (token + payload)
- token: 16 bytes, two little-endian u64 values
The header is exactly 24 bytes (HEADER_SIZE). The payload can be up to 1MB (MAX_PAYLOAD_SIZE). Anything larger is rejected to prevent memory exhaustion.
CRC32C Checksums
Every packet carries a CRC32C checksum computed over the token and payload bytes:
#![allow(unused)]
fn main() {
fn compute_checksum(token: UID, payload: &[u8]) -> u32 {
let mut data = Vec::with_capacity(16 + payload.len());
data.extend_from_slice(&token.first.to_le_bytes());
data.extend_from_slice(&token.second.to_le_bytes());
data.extend_from_slice(payload);
crc32c::crc32c(&data)
}
}
The checksum covers the token because corrupted routing is just as dangerous as corrupted data. If a bit flip changes the destination token, the message would be delivered to the wrong endpoint silently. Including the token in the checksum catches this.
On deserialization, the receiver recomputes the checksum and compares. Any mismatch produces a WireError::ChecksumMismatch with both the expected and actual values for debugging.
UID-Based Token Routing
The 16-byte token field is a UID that identifies the destination endpoint. When a packet arrives, the transport layer looks up this token in the EndpointMap to find the right receiver.
UIDs come in two flavors:
- Well-known tokens have first == u64::MAX. The second field is an index into a fixed-size array for O(1) lookup. These are used for system endpoints like ping (WellKnownToken::Ping).
- Dynamic tokens use arbitrary values for both fields. These are looked up in a BTreeMap and are used for request-response correlation and service endpoints.
Service methods derive their individual endpoint tokens from a single service ID using UID::new(interface_id, method_index), where interface_id is the service’s base UID and method_index is the method’s position in the trait definition.
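For the Calculator service defined in the RPC chapters, that derivation works out to:
#![allow(unused)]
fn main() {
// Calculator (id = 0xCA1C_0000); indices follow declaration order, 0 is reserved.
let add_token = UID::new(0xCA1C_0000, 1); // first method, index 1
let sub_token = UID::new(0xCA1C_0000, 2); // second method, index 2
}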
Streaming Deserialization
Real TCP streams deliver data in chunks that do not align with packet boundaries. The try_deserialize_packet function handles this gracefully:
#![allow(unused)]
fn main() {
pub fn try_deserialize_packet(data: &[u8])
-> Result<Option<(UID, Vec<u8>, usize)>, WireError>
}
It returns:
- Ok(Some((token, payload, consumed))) when a complete packet is available
- Ok(None) when more data is needed (not enough bytes for the header or full packet)
- Err(...) when the data is malformed
The consumed count tells the caller how many bytes to advance in the read buffer, making it straightforward to process multiple packets from a single TCP read.
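A read loop that drains every complete packet from an accumulating buffer is only a few lines. This is a sketch; real connection tasks interleave it with the TCP reads that fill the buffer:
#![allow(unused)]
fn main() {
// Parse every complete packet out of an accumulating read buffer.
fn drain_packets(buf: &mut Vec<u8>) -> Result<Vec<(UID, Vec<u8>)>, WireError> {
    let mut packets = Vec::new();
    loop {
        let parsed = try_deserialize_packet(buf)?;
        match parsed {
            Some((token, payload, consumed)) => {
                packets.push((token, payload));
                buf.drain(..consumed); // advance past the parsed packet
            }
            // Leftover bytes are a partial packet; keep them for the next read.
            None => break,
        }
    }
    Ok(packets)
}
}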
Error Cases
The WireError enum covers four failure modes:
| Variant | Meaning |
|---|---|
| InsufficientData | Not enough bytes to parse header or full packet |
| ChecksumMismatch | Data was corrupted in transit |
| PacketTooLarge | Payload exceeds 1MB limit |
| InvalidLength | Length field is malformed (e.g., smaller than header size) |
In simulation, the chaos engine can inject data corruption that triggers ChecksumMismatch. This exercises the error handling paths in the connection task without needing to model individual bit errors on the wire.
RPC with #[service]
We have peers that manage connections, a wire format that frames messages, and an endpoint map that routes them. But writing the boilerplate for every RPC interface, manually serializing requests, registering endpoints, and correlating responses, gets tedious fast. The #[service] proc macro eliminates all of that.
Define a Trait, Get Everything
The idea is simple: write a Rust trait that describes your service interface, annotate it with #[service(id = ...)], and the macro generates all the networking plumbing.
#![allow(unused)]
fn main() {
#[service(id = 0xCA1C_0000)]
trait Calculator {
async fn add(&self, req: AddRequest) -> Result<AddResponse, RpcError>;
async fn sub(&self, req: SubRequest) -> Result<SubResponse, RpcError>;
}
}
From this single trait definition, the macro generates:
- CalculatorServer<C> with a RequestStream per method and an init() method that registers all endpoints with the transport
- CalculatorClient with ServiceEndpoint fields for each method, giving you full control over delivery mode at every call site
- The trait itself, wrapped with #[async_trait(?Send)]
The Service ID
Every service needs a unique id attribute:
#![allow(unused)]
fn main() {
#[service(id = 0xBA4E_4B00)]
}
This u64 value becomes the base for all endpoint tokens in the service. Method endpoints are derived using UID::new(interface_id, method_index), where method indices start at 1 (index 0 is reserved).
The hex convention makes it easy to identify services in wire captures and logs. 0xCA1C looks like “CALC”, 0xBA4E_4B00 looks like “BANKB00”. Choose values that are memorable and unique within your system.
What Gets Generated (RPC Mode)
For a two-method Calculator service, the macro produces roughly this structure:
Calculator (trait)
├── add(&self, AddRequest) -> Result<AddResponse, RpcError>
└── sub(&self, SubRequest) -> Result<SubResponse, RpcError>
CalculatorServer<C>
├── add: RequestStream<AddRequest, C> // endpoint at UID(0xCA1C_0000, 1)
├── sub: RequestStream<SubRequest, C> // endpoint at UID(0xCA1C_0000, 2)
├── init(transport, codec) -> Self
└── serve(transport, handler, providers) -> ServerHandle
CalculatorClient
├── new(address, codec) -> Self
├── add: ServiceEndpoint<AddRequest, AddResponse, C>
└── sub: ServiceEndpoint<SubRequest, SubResponse, C>
The serve() method is particularly useful: it consumes the server, spawns a background task per method that loops on recv_with_transport, and returns a ServerHandle that stops everything when dropped.
Defining a Service
The #[service] macro generates a complete RPC infrastructure from a trait definition. This chapter walks through defining a service from scratch: the trait definition, the request and response types, and how the generated code fits together.
The Trait Definition
A service starts as a Rust trait with #[service(id = ...)]:
#![allow(unused)]
fn main() {
use moonpool::{service, RpcError};
use serde::{Serialize, Deserialize};
#[derive(Debug, Clone, Serialize, Deserialize)]
struct AddRequest { a: i32, b: i32 }
#[derive(Debug, Clone, Serialize, Deserialize)]
struct AddResponse { result: i32 }
#[derive(Debug, Clone, Serialize, Deserialize)]
struct MulRequest { a: i32, b: i32 }
#[derive(Debug, Clone, Serialize, Deserialize)]
struct MulResponse { result: i32 }
#[service(id = 0xCA1C_0000)]
trait Calculator {
async fn add(&self, req: AddRequest) -> Result<AddResponse, RpcError>;
async fn multiply(&self, req: MulRequest) -> Result<MulResponse, RpcError>;
}
}
Every method must follow the same signature pattern: async fn name(&self, req: RequestType) -> Result<ResponseType, RpcError>. The macro parses the Result<T, RpcError> return type to extract the response type for code generation.
Request and response types need Serialize and Deserialize derives because they travel over the wire. They also need to be 'static and DeserializeOwned, which standard derives give you automatically.
Method Indexing
Methods are assigned indices starting at 1 in declaration order. Index 0 is reserved.
For our Calculator:
- add gets index 1, routed to UID(0xCA1C_0000, 1)
- multiply gets index 2, routed to UID(0xCA1C_0000, 2)
These indices are stable as long as you do not reorder methods. Adding new methods at the end is safe. Reordering or removing methods changes the wire protocol and breaks compatibility with existing clients.
Implementing the Server
To handle requests, implement the generated trait on a concrete type:
#![allow(unused)]
fn main() {
struct CalculatorImpl;
#[async_trait::async_trait(?Send)]
impl Calculator for CalculatorImpl {
async fn add(&self, req: AddRequest) -> Result<AddResponse, RpcError> {
Ok(AddResponse { result: req.a + req.b })
}
async fn multiply(&self, req: MulRequest) -> Result<MulResponse, RpcError> {
Ok(MulResponse { result: req.a * req.b })
}
}
}
Note the #[async_trait(?Send)] attribute. All moonpool services are single-threaded (!Send), matching our deterministic execution model. The macro adds this attribute to the generated trait automatically, and you must repeat it on the impl block.
Serialization
The #[service] macro is codec-agnostic. The generated server and client are generic over C: MessageCodec. In practice, JsonCodec is the standard choice, but you can provide any codec that implements the MessageCodec trait.
The generated CalculatorClient serializes cleanly because each ServiceEndpoint stores just the destination address and method UID. Adding methods at the end does not change the serialized representation of existing endpoints.
Server, Client, and Endpoints
- Starting a Server
- Connecting a Client
- EndpointMap: Token Routing
- WellKnownToken
- RequestStream
- ReplyPromise and ReplyFuture
- ReplyError
- Putting It Together
The #[service] macro generates the types. Now we need to understand how to wire them up: how the server registers with the transport, how the client connects, and how the endpoint routing system delivers messages to the right place.
Starting a Server
The generated Server type provides two patterns. The simple path uses serve(), which spawns background tasks and returns a handle:
#![allow(unused)]
fn main() {
let server = CalculatorServer::init(&transport, JsonCodec);
let handle = server.serve(transport.clone(), Rc::new(CalculatorImpl), &providers);
// Tasks run until handle is dropped or stop() is called
}
init() registers all method endpoints with the transport’s EndpointMap. Each method gets its own RequestStream, backed by a NetNotifiedQueue that receives incoming request envelopes.
serve() consumes the server and spawns one task per method. Each task loops on recv_with_transport, dispatches to the handler, and sends the response back through the ReplyPromise. The returned ServerHandle holds close functions for each stream. Dropping it or calling stop() closes the streams, which causes the tasks to exit cleanly.
For more control, you can skip serve() and process each RequestStream manually:
#![allow(unused)]
fn main() {
let server = CalculatorServer::init(&transport, JsonCodec);
// Handle the `add` stream yourself
while let Some((req, reply)) = server.add.recv_with_transport(&transport).await {
reply.send(AddResponse { result: req.a + req.b });
}
}
Connecting a Client
The client side is even simpler. Create a Client with the server’s address, then call methods on its ServiceEndpoint fields:
#![allow(unused)]
fn main() {
let calc = CalculatorClient::new(server_address, JsonCodec);
// Each field is a ServiceEndpoint — you choose the delivery mode at the call site
let resp = calc.add.get_reply(&transport, AddRequest { a: 1, b: 2 }).await?;
assert_eq!(resp.result, 3);
}
There is no bind() step and no bound client type. The ServiceEndpoint carries the destination address, method UID, and codec, and you pass only the transport at each call. This makes the delivery mode explicit: get_reply for at-least-once, try_get_reply for at-most-once, send for fire-and-forget. See Delivery Modes for the full set.
Under the hood, get_reply creates a temporary ReplyFuture registered at a unique endpoint, then the response arrives as a packet routed to that endpoint.
EndpointMap: Token Routing
The EndpointMap is the routing table at the heart of NetTransport. When a packet arrives with a token, the transport looks it up here to find the receiver.
It uses a hybrid lookup strategy:
- Well-known endpoints use O(1) array access. The first 64 token indices are reserved for system endpoints. WellKnownToken::Ping (index 1) is used for health monitoring; WellKnownToken::EndpointNotFound (index 0) handles unroutable messages.
- Dynamic endpoints use a BTreeMap<UID, Rc<dyn MessageReceiver>>. These are allocated at runtime for service methods and request-response correlation.
#![allow(unused)]
fn main() {
// Well-known: O(1) array lookup
map.insert_well_known(WellKnownToken::Ping, receiver)?;
// Dynamic: BTreeMap lookup
map.insert(UID::new(0xCA1C_0000, 1), receiver);
}
Well-known endpoints cannot be removed. Dynamic endpoints can be registered and deregistered as services come and go.
WellKnownToken
The WellKnownToken enum defines system-level endpoints:
| Token | Index | Purpose |
|---|---|---|
| EndpointNotFound | 0 | Handles messages to unknown endpoints |
| Ping | 1 | Connection health monitoring |
| UnauthorizedEndpoint | 2 | Authentication failures |
| FirstAvailable | 3 | First index available for user services |
A well-known UID has first == u64::MAX and second equal to the token index. The is_well_known() method checks this, letting the endpoint map take the fast array path.
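A sketch of that check, assuming public first/second fields as the wire-format chapter showed:
#![allow(unused)]
fn main() {
// Well-known UIDs are recognized by their sentinel `first` half;
// `second` then indexes the fixed-size well-known array directly.
fn is_well_known(uid: &UID) -> bool {
    uid.first == u64::MAX
}
}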
RequestStream
RequestStream<Req, C> is the server-side abstraction for receiving typed requests. Each stream wraps a NetNotifiedQueue that the transport pushes incoming packets into. When you call recv() or recv_with_transport(), it awaits the next RequestEnvelope<Req> from the queue and returns the deserialized request paired with a ReplyPromise.
The RequestEnvelope bundles the request payload with a reply_to endpoint, the address where the client is listening for the response:
#![allow(unused)]
fn main() {
struct RequestEnvelope<T> {
request: T,
reply_to: Endpoint,
}
}
ReplyPromise and ReplyFuture
These two types form the request-response correlation mechanism.
ReplyPromise<T, C> lives on the server side. When the server finishes processing a request, it calls reply.send(response) to serialize and deliver the response to the client’s reply_to endpoint. If the promise is dropped without being fulfilled, it automatically sends a ReplyError::BrokenPromise to the client so the client does not hang forever.
#![allow(unused)]
fn main() {
// Server side
let (req, reply) = stream.recv_with_transport(&transport).await?;
reply.send(AddResponse { result: req.a + req.b });
}
ReplyFuture<T, C> lives on the client side. It implements Future and resolves when the server’s response arrives at the temporary endpoint that send_request registered. The future polls a NetNotifiedQueue for the response. If the queue is closed (connection failure), it resolves with the appropriate ReplyError.
ReplyFuture implements Drop to close its queue when the future is cancelled or goes out of scope. This prevents leaked wakers and ensures the temporary endpoint is cleaned up even if the caller abandons the request. Without this, a killed process would leave orphaned reply queues that hang forever.
Both types are !Send because they contain Rc<RefCell<...>> internally. This is deliberate. Our entire execution model is single-threaded, and these types are designed to be efficient within that constraint rather than paying the cost of Arc<Mutex<...>> for thread safety we will never use.
ReplyError
The ReplyError enum covers every failure mode in the request-response lifecycle:
| Variant | Meaning |
|---|---|
| BrokenPromise | Server dropped the promise without responding |
| ConnectionFailed | Network connection failed during the request |
| Timeout | RPC timed out (default: 30 seconds) |
| Serialization | Encoding or decoding failed |
| EndpointNotFound | Destination endpoint is not registered |
| MaybeDelivered | Peer disconnected, delivery is uncertain |
MaybeDelivered is the most important variant. It maps directly to FoundationDB’s request_maybe_delivered (error 1030). Instead of hiding delivery ambiguity behind a generic timeout, it tells you explicitly: the connection failed and we do not know whether the server processed your request. See Delivery Modes for how each delivery function produces this error and Designing Simulation-Friendly RPC for strategies to handle it.
Putting It Together
Here is the complete flow for a single RPC call:
- Client calls calc.add.get_reply(&transport, req), which calls send_request
- send_request creates a ReplyFuture at a unique temporary endpoint and registers it in the EndpointMap
- The request is serialized as a RequestEnvelope with the temporary endpoint as reply_to, then sent to the server’s method endpoint (UID(0xCA1C_0000, 1))
- Transport routes the packet to the server’s RequestStream via the EndpointMap
- Server receives (AddRequest, ReplyPromise) from the stream
- Server calls reply.send(AddResponse { ... }), which serializes and sends to the reply_to endpoint
- Transport routes the response packet to the client’s temporary endpoint
- The ReplyFuture resolves with the deserialized AddResponse
- The temporary endpoint is deregistered from the EndpointMap
All of this happens over the same Peer connections and wire format we covered in previous chapters. In simulation, every step goes through the SimWorld event queue, making the entire RPC flow deterministic and subject to chaos injection.
Delivery Modes
- The Four Modes
- send: Fire-and-Forget
- try_get_reply: At-Most-Once
- get_reply: At-Least-Once
- get_reply_unless_failed_for: At-Least-Once with Timeout
- MaybeDelivered: Explicit Ambiguity
- Choosing a Delivery Mode
The #[service] macro gives you a clean RPC interface, but it hides an important question: what happens when the connection drops mid-request? The answer depends on which delivery mode you choose. Moonpool provides four, matching FoundationDB’s fdbrpc layer (fdbrpc.h:727-895).
The Four Modes
| Function | Guarantee | Transport | On Disconnect |
|---|---|---|---|
| send | Fire-and-forget | Unreliable | Silently lost |
| try_get_reply | At-most-once | Unreliable | MaybeDelivered |
| get_reply | At-least-once | Reliable | Retransmits on reconnect |
| get_reply_unless_failed_for | At-least-once + timeout | Reliable | MaybeDelivered after duration |
All four modes are available as methods on ServiceEndpoint, the type generated for each method in a #[service] client. You can also call the underlying functions in moonpool::delivery directly if you are working outside the #[service] macro. The difference between modes is in what guarantees they provide and how they handle failures.
send: Fire-and-Forget
The simplest mode. Send the request unreliably with no reply registered:
#![allow(unused)]
fn main() {
// Via ServiceEndpoint (generated client)
heartbeat.send(&transport, HeartbeatRequest { node_id })?;
// Via delivery module (manual endpoint)
delivery::send(&transport, &endpoint, HeartbeatRequest { node_id }, JsonCodec)?;
}
No ReplyFuture is created. No endpoint is registered for a response. If the connection is down, the message is silently dropped. If the server responds, the response is discarded.
Use this for heartbeats, notifications, and any message where losing one is harmless because the next one compensates.
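A sketch of that compensation pattern; the timer call is a stand-in (in moonpool code it would come from the TimeProvider so the simulation can stretch or shrink the interval):
#![allow(unused)]
fn main() {
// Heartbeats tolerate loss because the next tick compensates.
for tick in 0.. {
    if let Err(e) = heartbeat.send(&transport, HeartbeatRequest { node_id }) {
        // Fire-and-forget: a dropped heartbeat is harmless.
        tracing::debug!("heartbeat {tick} not sent: {e}");
    }
    wait_for_next_tick().await; // stand-in for a TimeProvider sleep
}
}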
try_get_reply: At-Most-Once
Send unreliably, then race the reply against a disconnect signal from the FailureMonitor:
#![allow(unused)]
fn main() {
// Via ServiceEndpoint
let response = balance.try_get_reply(&transport, GetBalanceRequest { account_id }).await?;
// Via delivery module
let response = delivery::try_get_reply::<_, BalanceResponse, _, _>(
&transport, &endpoint, GetBalanceRequest { account_id }, JsonCodec,
).await?;
}
Under the hood, this is a tokio::select!:
#![allow(unused)]
fn main() {
select! {
result = reply_future => result,
() = failure_monitor.on_disconnect_or_failure(&endpoint) => Err(MaybeDelivered),
}
}
There is a fast path: if the FailureMonitor already knows the endpoint is failed, it returns MaybeDelivered immediately without sending anything. This prevents wasting work on requests that cannot succeed.
The at-most-once guarantee means the server processes your request zero or one times. If you get MaybeDelivered, the request might have been processed, or it might not have. You must handle this ambiguity explicitly.
get_reply: At-Least-Once
Send reliably. If the connection drops and reconnects, the transport retransmits the request automatically:
#![allow(unused)]
fn main() {
// Via ServiceEndpoint
let response = join.get_reply(&transport, JoinRequest { node_id }).await?;
// Via delivery module (returns a ReplyFuture for manual control)
let reply_future = delivery::get_reply::<_, JoinResponse, _, _>(
&transport, &endpoint, JoinRequest { node_id }, JsonCodec,
)?;
let response = reply_future.await?;
}
The request sits in the peer’s reliable queue. If the TCP connection drops, the peer reconnects (with backoff), and the queued request is resent. The server may receive the same request multiple times. Your server must be prepared for duplicates.
This mode never gives up. If the remote process dies permanently, the ReplyFuture hangs until the 30-second RPC timeout fires, returning ReplyError::Timeout.
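One common shape for duplicate tolerance is to record replies by a client-chosen request id and replay them on retransmit. This is a sketch: the request_id field, the completed map, and apply_join are all illustrative, and Designing Simulation-Friendly RPC covers the full menu of strategies.
#![allow(unused)]
fn main() {
use std::cell::RefCell;
use std::collections::BTreeMap;
struct JoinServer {
    // Replies recorded by client-chosen request id (illustrative scheme).
    completed: RefCell<BTreeMap<u64, JoinResponse>>,
}
impl JoinServer {
    async fn join(&self, req: JoinRequest) -> Result<JoinResponse, RpcError> {
        if let Some(prev) = self.completed.borrow().get(&req.request_id) {
            // Duplicate delivery after a reconnect: replay the recorded reply.
            return Ok(prev.clone());
        }
        let resp = self.apply_join(&req)?; // hypothetical state mutation
        self.completed.borrow_mut().insert(req.request_id, resp.clone());
        Ok(resp)
    }
}
}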
get_reply_unless_failed_for: At-Least-Once with Timeout
Like get_reply, but gives up if the endpoint has been continuously failed for a specified duration:
#![allow(unused)]
fn main() {
// Via ServiceEndpoint
let response = register.get_reply_unless_failed_for(
&transport, RegisterRequest { /* ... */ }, Duration::from_secs(10),
).await?;
// Via delivery module
let response = delivery::get_reply_unless_failed_for::<_, RegisterResponse, _, _>(
&transport, &endpoint, RegisterRequest { /* ... */ }, JsonCodec,
Duration::from_secs(10),
).await?;
}
This combines reliable delivery with a failure timeout. First it waits for a disconnect signal from the FailureMonitor, then sleeps for the sustained failure duration. If the connection recovers during that sleep, the reliable retransmit resolves the reply future first and the timeout is cancelled.
Use this for singleton RPCs (registration, recruitment) where you want reliable delivery but cannot wait forever if the destination is permanently gone.
MaybeDelivered: Explicit Ambiguity
The ReplyError::MaybeDelivered variant is the most important design decision in the transport layer. Most RPC frameworks hide delivery ambiguity behind generic timeouts. Moonpool, following FDB, makes it explicit.
When you receive MaybeDelivered, you know exactly one thing: the connection failed while the request was in flight. The request may have been fully processed, partially processed, or never received. The framework refuses to guess.
#![allow(unused)]
fn main() {
match delivery::try_get_reply(&transport, &ep, req, codec).await {
Ok(response) => handle_success(response),
Err(ReplyError::MaybeDelivered) => {
// Read state to determine if the request was processed
// before deciding whether to retry
}
Err(e) => handle_error(e),
}
}
This forces correct error handling at the application level. Simulation testing with chaos injection will trigger MaybeDelivered frequently, revealing any code path that ignores delivery ambiguity.
Choosing a Delivery Mode
| Use case | Mode | Example |
|---|---|---|
| Notification, no reply needed | send | Heartbeat, trigger |
| Single server, accept failure | try_get_reply | Probe, status check |
| Single server, must deliver | get_reply | TLog rejoin, state sync |
| Single server, failure timeout | get_reply_unless_failed_for | Registration, recruitment |
The generated #[service] client exposes each method as a ServiceEndpoint field with all delivery modes available. You choose the mode at the call site, so the same client can use get_reply for critical writes and try_get_reply for best-effort reads. For raw ReplyFuture control (useful in select! blocks), use send_request.
For strategies on handling MaybeDelivered correctly, including idempotent design, generation numbers, and read-before-retry, see Designing Simulation-Friendly RPC.
Failure Monitor
Delivery modes need a way to detect when a remote endpoint is unreachable. Polling for liveness would be wasteful and nondeterministic. Instead, moonpool uses a reactive failure monitor that tracks connection state and wakes interested futures when something changes. This follows FoundationDB’s SimpleFailureMonitor (FailureMonitor.h:146).
Two Levels of Tracking
The FailureMonitor tracks failures at two granularities:
Address-level: Is this machine reachable? The connection task calls set_status(address, Available) on successful connect and notify_disconnect(address) when the TCP link drops. Unknown addresses default to Failed, a conservative assumption that prevents sending requests into the void.
Endpoint-level: Is this specific endpoint permanently gone? When a ReplyFuture receives BrokenPromise (the server dropped the promise without responding), the delivery mode calls endpoint_not_found(endpoint). This marks the endpoint as permanently failed. Well-known endpoints (system tokens like Ping) are exempt from permanent failure.
#![allow(unused)]
fn main() {
// Producer side (connection_task)
failure_monitor.set_status("10.0.1.2:4500", FailureStatus::Available);
// ... later, on TCP drop:
failure_monitor.set_status("10.0.1.2:4500", FailureStatus::Failed);
failure_monitor.notify_disconnect("10.0.1.2:4500");
}
Reactive, Not Polling
The failure monitor never probes. It reacts to signals from the connection layer and wakes registered consumers. The consumer API returns futures that resolve when state changes:
| Method | Resolves when | Used by |
|---|---|---|
| on_disconnect_or_failure(endpoint) | Address disconnects OR endpoint permanently fails | try_get_reply() |
| on_disconnect(address) | Address disconnects | Connection monitoring |
| on_state_changed(endpoint) | Any status change (never resolves if permanently failed) | get_reply() retry loop |
| state(endpoint) | Immediate check | Fast-path in try_get_reply() |
All of these use Waker-based registration internally. When the producer calls set_status or notify_disconnect, it drains the waker list for that address and wakes every registered consumer. No background tasks, no timers, no allocation beyond the waker vector.
How Delivery Modes Use It
The connection between delivery modes and the failure monitor is the core of the RPC stack.
try_get_reply races the reply against on_disconnect_or_failure. If the connection drops before the server responds, the future resolves with MaybeDelivered. Before even sending, it checks state(endpoint) for the fast path: if the endpoint is already failed, return MaybeDelivered immediately without wasting a network round-trip.
get_reply_unless_failed_for first waits on on_disconnect_or_failure, then sleeps for the sustained failure duration via the TimeProvider. If the connection recovers during the sleep window, the reliable retransmit resolves the reply future first, winning the select! race. If not, the caller gets MaybeDelivered.
get_reply does not directly use the failure monitor for cancellation. It relies on the reliable queue and the 30-second RPC timeout. But BrokenPromise responses trigger endpoint_not_found, feeding information back into the monitor for future requests.
Under Simulation
In simulation, the failure monitor works identically. The SimWorld triggers the same set_status and notify_disconnect calls through simulated connection events. This means delivery mode behavior under chaos is fully exercised: connection drops trigger MaybeDelivered, address recovery wakes pending futures, and permanently failed endpoints are tracked correctly.
The monitor caps permanently failed endpoints at 100,000 entries to prevent memory growth in long-running simulations with many ephemeral endpoints. When the cap is hit, the entire map is cleared and a warning is logged.
Load Balancing and Fan-Out
The delivery modes in the previous chapters all talk to a single peer. Real distributed systems need two patterns that the four-mode API does not directly express: picking one peer out of many (load balancing) and talking to many peers in parallel (fan-out). Moonpool ships both as thin layers over the existing ServiceEndpoint API, modeled directly on FoundationDB’s loadBalance() and the TLog commit fan-out.
Where the Two Patterns Differ
Both patterns start from a list of equivalent backends — N storage servers behind a key range, N TLogs in a replication set, N coordinators of the same cluster. The difference is what we want to do with that list.
| Pattern | Goal | Network calls per request |
|---|---|---|
| Load balance | Pick the best one. Retry on the next if it fails. | 1 (sometimes more on retry) |
| Fan-out | Talk to all of them in parallel. Combine the replies. | N |
A read against a replicated key is a load-balance: any replica can answer, we want the lowest-latency one, and we are happy to retry on a different replica if the first one is slow. A commit against a TLog set is a fan-out: every TLog needs to write the mutation, and we wait for all of them (or a quorum) before declaring the commit durable.
Load Balancing
load_balance() takes a group of equivalent endpoints, sends one request, and retries on a different alternative if the chosen one fails. It is the moonpool analog of fdbrpc/include/fdbrpc/LoadBalance.actor.h:823-1018 with hedging deferred to a follow-up.
#![allow(unused)]
fn main() {
use moonpool::rpc::{
Alternatives, AtMostOnce, Distance, LoadBalanceConfig, QueueModel, load_balance,
};
let alts = Alternatives::new(vec![
(replica_a.read.clone(), Distance::SameDc),
(replica_b.read.clone(), Distance::SameDc),
(replica_c.read.clone(), Distance::Remote),
]);
let model = QueueModel::new();
let config = LoadBalanceConfig::default();
let value = load_balance(
&transport,
&alts,
GetValueRequest { key: "user/42".into() },
AtMostOnce::False, // reads are idempotent
&model,
&config,
).await?;
}
Three things make this work.
Alternatives sorts by locality. Each entry carries a Distance tag (SameMachine, SameDc, Remote). The constructor sorts by distance ascending and computes count_best — the number of entries at the closest tier. Load balancing prefers the local-DC prefix and only falls back to remote entries when every local one is unavailable.
QueueModel tracks smoothed in-flight count and latency per endpoint. Internally it is a HashMap<u64, QueueData> keyed by the high half of the endpoint UID, exactly matching FDB’s getMeasurement(token.first()) convention. Each entry holds a Smoother (an exponential moving average with a one-second e-folding constant) for the outstanding count, plus the most recently observed latency. The selection step picks the alternative with the lowest smoothed outstanding count among the viable candidates.
AtMostOnce makes the idempotency contract explicit. With AtMostOnce::False the load balancer issues get_reply (reliable) and treats MaybeDelivered as a retry trigger. With AtMostOnce::True it issues try_get_reply (unreliable) and propagates MaybeDelivered immediately, so a side-effecting request is never retried on a different alternative when its outcome is ambiguous. This is the central design lever from FDB’s LoadBalance.actor.h:572-625: side effects must not silently double up.
The retry loop cycles through alternatives in best-first order, marking each one tried. After a full cycle without success it sleeps for an exponentially-growing backoff (backoff_start doubled each cycle up to backoff_max) and starts a fresh cycle, up to max_full_cycles cycles total before returning the most recent error. The defaults — two cycles, 50ms start, 1s cap, 2× multiplier — match FDB’s FLOW_KNOBS->LOAD_BALANCE_* shape, and every knob lives on LoadBalanceConfig so callers can tune per deployment. Production callers wrap the call in their own retry policy if they want indefinite retry.
When to use a QueueModel
A QueueModel is shared across many concurrent load_balance calls — typically wrap it in an Rc and hand out references. It uses RefCell internally so the borrow happens for the duration of one method call only, never across an await. A fresh model treats every endpoint as zero-outstanding, so the very first request to an unseen alternative looks maximally attractive — exactly what you want for cold-start fairness.
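A minimal sketch of that sharing pattern, reusing the load_balance call shape from the example above:

use std::rc::Rc;

// One model per client: Rc clones share the same smoothed statistics,
// so every concurrent request sees the same in-flight picture.
let model = Rc::new(QueueModel::new());
let model_for_reads = Rc::clone(&model);
// Inside a request path (deref coercion turns &Rc<QueueModel> into &QueueModel):
// let value = load_balance(&transport, &alts, req, AtMostOnce::False, &model_for_reads, &config).await?;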
Fan-Out: Four Completion Semantics
Fan-out helpers send the same request to every endpoint in a slice and combine the replies. Moonpool ships four variants because real systems need different completion conditions.
use moonpool::rpc::{
    fan_out_all, fan_out_quorum, fan_out_race, fan_out_all_partial,
};
fan_out_all — All Must Succeed
The TLog-style “every peer or nothing” pattern. The request is cloned to each endpoint, every reply is awaited, and the first failure aborts the rest. This mirrors FDB’s resolver fan-out (CommitProxyServer.actor.cpp:1127-1179), where any single resolver failure aborts the commit.
let resolutions = fan_out_all(
    &transport,
    &resolver_endpoints,
    ResolveTransactionBatchRequest { /* ... */ },
).await?;
// resolutions[i] is the reply from resolver i, in input order
The returned Vec<Resp> is in input order, so the caller can correlate replies with senders by index.
fan_out_quorum — K of N
The TLog commit pattern from TagPartitionedLogSystem.actor.cpp:619-687. The function waits until required peers have replied successfully, then drops the rest. If enough peers fail that the threshold can no longer be met, it returns QuorumNotMet immediately rather than waiting on doomed futures.
let acks = fan_out_quorum(
    &transport,
    &tlog_endpoints,
    TLogCommitRequest { version, mutations },
    /* required = */ tlog_endpoints.len() - anti_quorum,
).await?;
The Vec<Resp> returned on success is in completion order, not input order — the caller is using a quorum vote, not per-peer correlation. MaybeDelivered and BrokenPromise errors count as ordinary failed peers; fan-out never retries (every peer was already addressed in parallel), so the AtMostOnce flag would be a no-op and is intentionally absent from the signature.
fan_out_race — First Success Wins
Send to all, return the first Ok, drop the rest. Useful for hedged reads against equivalent replicas when you do not want the bookkeeping cost of a full QueueModel.
let value = fan_out_race(&transport, &replica_endpoints, GetValueRequest { key }).await?;
If every peer errors, the function returns AllFailed { errors } with one error per peer.
fan_out_all_partial — Wait for All, Return Per-Peer Results
Sometimes you want every peer’s outcome — successes and failures together — without aborting. This variant never short-circuits.
let outcomes: Vec<Result<HealthCheck, RpcError>> =
    fan_out_all_partial(&transport, &all_peers, HealthCheckRequest).await?;
let healthy = outcomes.iter().filter(|r| r.is_ok()).count();
The result vector has exactly endpoints.len() entries, in input order. The fan-out itself only fails if the input slice is empty.
Composing the Two
Real systems use both. A read path looks like load_balance(reads) to pick one storage server with retry. A write path looks like fan_out_quorum(commits, durability_quorum) to make sure enough TLogs persisted the mutation. The #[service] macro generates per-method ServiceEndpoint clones, so building an Alternatives or a Vec<ServiceEndpoint> from a slice of generated clients is a one-liner — no separate routing layer required.
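For instance, building the read-path Alternatives from a slice of generated clients might look like this (a sketch: replicas and the .read endpoint field follow the earlier load-balance example):

let alts = Alternatives::new(
    replicas
        .iter()
        .map(|r| (r.read.clone(), Distance::SameDc))
        .collect(),
);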
Both patterns avoid spawning tasks. They compose futures via try_join_all, FuturesUnordered, and tokio::select!-style polling on the current task, which keeps moonpool’s single-threaded determinism contract intact.
Designing Simulation-Friendly RPC
Choosing a delivery mode is only half the problem. The harder question is: what does your application do when delivery is ambiguous? This chapter presents six strategies for handling RPC failures, drawn from FoundationDB’s production experience (fdbrpc.h, NativeAPI.actor.cpp, ClusterController.actor.cpp). Simulation testing with chaos injection is what proves you picked the right strategy.
The Core Insight
Most RPC frameworks treat network failures as exceptions. The connection drops, you get a timeout, and then what? Retry? But the server might have already processed the request. Skip it? But the server might not have received it.
FoundationDB’s answer, and moonpool’s, is to make this ambiguity a first-class error. MaybeDelivered tells you exactly what happened: the connection failed, and you do not know whether the request was processed. The application must decide what to do, and simulation testing will verify that decision under thousands of failure scenarios.
The Six Strategies
Strategy 1: Idempotent by Design
The simplest and most powerful approach. Design your request to describe the desired end state, not a delta. Re-delivery is harmless because applying the same state twice produces the same result.
// BAD: delta-based — duplicate delivery doubles the effect
TransferRequest { from: "A", to: "B", amount: 100 }

// GOOD: state-based — duplicate delivery is a no-op
SetBalanceRequest { account: "A", balance: 900, version: 42 }
Examples: worker registration (“I am node X with capabilities Y”), configuration updates (“set parameter X to value Y”), membership heartbeats.
Use get_reply freely with this strategy. The server can safely process duplicates. This is the default choice when you can reformulate the operation as a state assertion.
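The server side of the GOOD variant is what makes duplicates harmless. A minimal sketch, with an illustrative versioned map rather than any moonpool API:

use std::collections::HashMap;

struct SetBalanceRequest { account: String, balance: i64, version: u64 }

// Maps account -> (version, balance). Applying the same request twice
// leaves the map in the same state.
fn apply(state: &mut HashMap<String, (u64, i64)>, req: &SetBalanceRequest) {
    let entry = state.entry(req.account.clone()).or_insert((0, 0));
    if req.version >= entry.0 {
        *entry = (req.version, req.balance);
    }
}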
Strategy 2: Generation Numbers
Tag each request with a monotonic sequence number. The server tracks the last-seen number per client and ignores old duplicates:
struct RegisterRequest {
    node_id: NodeId,
    generation: u64, // monotonically increasing per client
    capabilities: Vec<Capability>,
}

// Server side (defaulting to 0 for clients never seen before):
let last_seen = last_seen_generation.get(&req.node_id).copied().unwrap_or(0);
if req.generation <= last_seen {
    return Ok(stale_response); // already processed
}
last_seen_generation.insert(req.node_id, req.generation);
Use get_reply with this strategy. The reliable transport retransmits, and the server deduplicates via the generation check.
Strategy 3: Fire-and-Forget
Use send for messages where losing one is tolerable. The next message compensates: heartbeats, advisory notifications, metric reports.
The key test: if you send the message twice, is that worse than sending it zero times? If neither matters much, fire-and-forget is the right choice.
Strategy 4: Read-Before-Retry
On MaybeDelivered, read the server’s state to determine whether the previous request succeeded before deciding to retry:
match delivery::try_get_reply(&transport, &ep, commit_req.clone(), codec).await {
    Ok(response) => Ok(response),
    Err(ReplyError::MaybeDelivered) => {
        // Query the server: did my commit go through?
        let status = check_commit_status(&transport, &ep, commit_id).await?;
        match status {
            CommitStatus::Committed(version) => Ok(committed(version)),
            CommitStatus::NotFound => {
                // Safe to retry — the original was never processed
                delivery::try_get_reply(&transport, &ep, commit_req, codec).await
            }
        }
    }
    Err(e) => Err(e),
}
This is FoundationDB’s approach for client commits (NativeAPI.actor.cpp:6829-6866). It requires the server to support a status query, but it gives you exactly-once semantics without requiring true distributed transactions.
Use try_get_reply with this strategy. The at-most-once guarantee means you know the server processed it at most once, and the read-before-retry resolves the ambiguity.
Strategy 5: Well-Known Endpoint Retry
For endpoints that survive process restarts (coordinators, cluster controllers), catch BrokenPromise and retry with backoff:
loop {
    match delivery::get_reply(&transport, &coordinator_ep, req.clone(), codec) {
        Ok(future) => match future.await {
            Ok(response) => return Ok(response),
            Err(ReplyError::BrokenPromise) => {
                // Coordinator restarted — same endpoint, retry
                time.sleep(jittered_delay).await;
                continue;
            }
            Err(e) => return Err(e),
        },
        Err(e) => return Err(e.into()),
    }
}
This only works for well-known tokens that are registered at the same endpoint across restarts. Ephemeral endpoints (dynamically allocated UIDs) cannot use this pattern because the new process instance has different endpoints.
Strategy 6: AtMostOnce Flag
When multiple equivalent servers can handle the same request, the question becomes whether to retry on an alternative after failure:
- Idempotent requests (reads): retry freely on the next server
- Non-idempotent requests (commits): propagate MaybeDelivered to the caller
This is FoundationDB’s load balancer pattern, and it is exactly the lever that moonpool’s load_balance() exposes as the AtMostOnce parameter (see the load balancing chapter). The same rule applies whenever you maintain your own list of alternative endpoints for the same logical service.
The Decision Flowchart
Can you lose the message entirely?
YES --> Strategy 3: send (fire-and-forget)
NO
|
Can you reformulate as "set state = X"?
YES --> Strategy 1: idempotent-by-design + get_reply
NO
|
Can the server track per-client sequence numbers?
YES --> Strategy 2: generation dedup + get_reply
NO
|
Can you read the state after failure to check?
YES --> Strategy 4: try_get_reply + read-before-retry
NO
|
Is the endpoint well-known and survives restarts?
YES --> Strategy 5: retry on BrokenPromise
NO --> Strategy 2 (add server-side tracking)
Simulation Proves Your Strategy
The reason these strategies matter in moonpool is that simulation testing will find the bugs if you pick the wrong one.
A process that uses get_reply for a non-idempotent request will see duplicate processing when the chaos engine severs and restores connections. A process that uses try_get_reply without handling MaybeDelivered will silently drop operations when the chaos engine triggers disconnects. A fire-and-forget heartbeat that should have been reliable will cause false failure detection when the chaos engine delays messages.
The simulation does not know which strategy is “correct” for your use case. But it generates the failure patterns that expose incorrect choices. Run with UntilAllSometimesReached(1000) and let the chaos engine prove that your RPC strategy handles every failure mode your system will encounter in production.
Multiverse Exploration
Throughout this book, we have built a simulation framework that runs deterministic tests, injects chaos, and validates correctness with assertions. We can run thousands of seeds, each exploring a different corner of the state space. For many bugs, that is enough.
But some bugs are not found by running more seeds. They require a sequence of unlikely events, and no single seed happens to produce that exact sequence. We teased this in Part I when we described the vision of simulation-driven development. Now we deliver the capstone feature: multiverse exploration.
The Core Idea
When a simulation reaches an interesting state for the first time (a retry fires, a timeout triggers, a leader election completes), we snapshot the entire universe and explore variations from that point. Instead of starting over with a new seed, we fork the process and continue from the discovery point with different randomness.
The insight is simple but powerful: if a seed managed to reach an interesting state, running forward from that state with different random choices is far more likely to find a second interesting event than starting from scratch. We are investing our exploration budget where it has already proven productive.
This is what Antithesis calls checkpoint-and-branch. In moonpool, we call it multiverse exploration because it creates a tree of alternate timelines branching from key discovery points.
Vocabulary
Before we go further, let us establish the terms we will use throughout this section.
Seed is a u64 that completely determines a simulation’s randomness. Same seed means same coin flips means same execution, every time. This is the foundation from Part II.
Timeline is one complete simulation run. A root seed plus a sequence of reseeding points uniquely identifies a timeline.
Splitpoint is a moment where the explorer decides to branch. It happens when a sometimes assertion succeeds for the first time, a numeric watermark improves, or a frontier advances. The splitpoint is identified by the RNG call count at the moment of discovery.
Multiverse is the tree of all timelines explored from one root seed. Each splitpoint creates new children with different seeds, and those children can encounter their own splitpoints, creating a recursive tree.
Recipe is the path from the root timeline to a specific descendant. It is a sequence of (rng_call_count, child_seed) pairs that tells you exactly which forks to take. If a bug is found ten levels deep in the multiverse tree, the recipe tells you how to get back there:
151@8837201 -> 80@1293847 -> 42@9918273
That reads: “at RNG call #151, reseed to 8837201. At call #80 in the new timeline, reseed to 1293847. At call #42, reseed to 9918273.” Follow those instructions and you arrive at the exact same bug, deterministically.
Mark is an assertion site that can trigger splitpoints. Each mark has a name, a shared-memory slot, and in adaptive mode its own energy allowance.
What This Section Covers
Over the next five chapters, we will build up the complete exploration system piece by piece:
- The Exploration Problem explains why random simulation is not enough for hard bugs, and frames exploration as a resource allocation problem using NES game analogies from Antithesis.
- Fork at Discovery describes the mechanism: OS-level fork(), shared memory, coverage bitmaps, and bug recipes.
- Coverage and Energy Budgets introduces energy as a finite exploration resource that prevents unbounded forking.
- Adaptive Forking replaces fixed-count splitting with batch-based forking that automatically invests more in productive splitpoints.
- Multi-Seed Exploration shows how running multiple root seeds with moderate energy finds more bugs than a single seed with massive energy.
Each chapter builds on the previous one. The concepts are layered: first the problem, then the basic mechanism, then increasingly sophisticated resource management. By the end, you will understand how moonpool turns a single simulation seed into a tree of thousands of timelines that systematically hunt for bugs hiding behind sequences of unlikely events.
The moonpool-explorer crate that implements all of this is a leaf dependency with exactly one external dependency: libc. It has zero knowledge of processes, networks, or storage. It communicates with the simulation through two function pointers: one to read the RNG call count, and one to reseed the RNG. That minimal coupling is deliberate. The exploration engine is a general-purpose tool that works with any deterministic simulation, not just moonpool’s.
Let us start with the problem it solves.
The Exploration Problem
Running thousands of random seeds is a powerful technique. It finds race conditions, timing bugs, and failure-handling errors that hand-written tests never would. But there is a class of bugs it struggles with, and understanding why reveals the fundamental challenge of state-space exploration.
The Sequential Luck Problem
Consider a distributed lock service with a subtle bug: the lock can be granted to two nodes simultaneously, but only when a specific failover event happens during a specific rollback window. The failover has a 1/1000 chance per simulation step. The rollback has a 1/1000 chance per step. For the bug to manifest, both must happen in the same run.
With independent random seeds, the chance that any single run lines both events up is roughly 1/1,000,000. We need about a million simulation runs to have a decent chance of seeing the bug. Each run takes a few seconds. That is weeks of compute for one bug.
This is the Sequential Luck Problem: a bug requiring N unlikely events in sequence has probability that decreases multiplicatively. Two events at 1/1000 each need ~10^6 trials. Three events need ~10^9. The search space grows exponentially with the depth of the required sequence.
Now consider what happens with checkpoint-and-branch. We run seeds until one of them triggers the failover (about 1000 runs). At that moment, we snapshot the simulation and spawn new timelines from the failover point with different randomness. Each of those new timelines has a 1/1000 chance of hitting the rollback window. We need about 1000 branch timelines, plus the ~1000 root seeds to reach the failover. Total: roughly 2000 timelines instead of 1,000,000.
The improvement is not incremental: it changes the cost from multiplicative (about 1/(p1 × p2) runs) to additive (about 1/p1 + 1/p2 runs). For three-event bugs, the difference is even more dramatic: ~3000 timelines instead of ~10^9.
State Space Shapes
Not all exploration problems are alike. Antithesis, through their work running NES games with autonomous testing, discovered that state spaces have shapes that require fundamentally different approaches.
Breadth-Dominated: The Zelda Problem
The Legend of Zelda (1986) is an open-world game with 128 overworld screens, 230+ dungeon rooms, 25 prerequisite items, and multiple weapons. The state space is wide: there are many things to discover, and they do not need to happen in a strict sequence. The challenge is not getting past a single barrier but covering a large surface area.
Antithesis solved Zelda with SOMETIMES_EACH, which equalizes exploration across all discovered states. Visit screen #47? Good. Now visit screen #48 with the same frequency. The exploration engine spreads its budget across the breadth of the space, ensuring no region is neglected.
In moonpool terms, this maps to assert_sometimes_each! assertions that track coverage across identity values. A distributed database might use this to ensure all partition ranges see similar test traffic, or all node roles get exercised equally.
Depth-Dominated: The Gradius Insight
Gradius (1985) is a side-scrolling shooter that auto-scrolls. The player cannot go back. Progress is measured almost entirely by survival time. Antithesis tried a complex strategy tracking powerups, weapons, ship position, and score. Then they deleted all of it and ran with a minimal strategy: maximize time since power-on, tracking only 3 bytes of game memory (single-player mode, pause state, death animation).
The minimal strategy beat the entire game.
This revealed that for depth-dominated problems, the platform infrastructure (save/restore, checkpoint-and-branch) matters more than domain-specific strategy. The agent’s gameplay was alien: it burned powerups on speed, clipped through walls, hid from bosses until they timed out, and hammered the pause button. But it worked because the platform kept branching from the deepest point reached.
For simulation testing, this means that sometimes the right approach is a minimal liveness assertion (“the system is still making progress”) combined with aggressive checkpoint-and-branch, rather than elaborate state tracking.
Barriers: The Castlevania Trap
Castlevania’s Stage 6 has stompers: pillars that descend and crush the player. The exploration engine tracked Simon’s position on a 32-pixel grid and tried to equalize coverage across grid cells. It cruised through the first five stages. Then it got stuck.
The problem was doomed states. The best-known exemplar for the critical grid cells was Simon standing under a descending stomper. That state is irrecoverable: no sequence of future inputs can save him. The explorer kept restarting from this doomed state and making zero progress.
Two fixes unlocked progress. First, refining the grid from 32-pixel to 16-pixel cells created safe zones between the stompers where non-doomed exemplars could be established. Second, adding the stomper position to the state tuple distinguished “under a high stomper” from “under a low stomper.”
The lesson for simulation testing: when exploration gets stuck, the fix is either better input distribution (how chaos events are generated) or better output interpretation (what signals guide exploration). Heatmaps, assertion coverage reports, and sometimes-assertion hit rates are the diagnostic tools that tell you which one to adjust.
Continuous Optimization: The Metroid Economy
Metroid (1986) presented the hardest challenge. The exploration engine could reach everywhere accessible without missiles. But red doors require 5 missiles to open, and the engine kept spending missiles on enemies (because it helped short-term exploration) rather than hoarding them for progression.
The naive solutions failed. Adding missile count to the state tuple caused state explosion (position times missile count is too many combinations). Requiring 5+ missiles was too restrictive (some areas need you to spend them).
The solution was continuous optimization: decouple the optimization objective from the exploration objective. Explore all positions, but prefer states with more missiles and health. When a better-resourced path to a state is found, propagate the improvement through all downstream states.
This generalizes beyond games. In distributed systems testing, the analogues are memory usage (prefer states with lower memory to catch leaks), queue depth (prefer states with deeper queues to find overflow bugs), and latency (prefer states with higher latency to find timeout bugs).
Exploration as Resource Allocation
The NES examples reveal a deeper truth: exploration is fundamentally a resource allocation problem, not just a randomness problem.
We have a finite compute budget. We can spend it on breadth (more root seeds, wider coverage) or depth (more branches from interesting states, deeper sequences). We can spread it evenly across all splitpoints or concentrate it on the most productive ones. We can run one seed with massive energy or many seeds with moderate energy.
Every one of these decisions affects what bugs we find. A breadth-heavy approach misses deep sequential bugs. A depth-heavy approach misses bugs in unexplored regions. Even allocation wastes budget on barren splitpoints. Concentrated allocation might miss productive ones entirely.
The next four chapters describe how moonpool makes these allocation decisions automatically. We start with the basic mechanism (fork at discovery), add resource limits (energy budgets), make the allocation adaptive (batch-based forking with early termination), and finally show how multiple seeds explore genuinely different regions of the state space.
The goal is to find bugs that random testing cannot reach, using the same compute budget. And as we will see, the results are dramatic.
Fork at Discovery
- The Fork
- Why fork() Is Perfect for This
- Shared Memory: The Communication Channel
- The Coverage Bitmap
- Bug Recipes
- What Triggers a Splitpoint
- The Process Model
Now we know why checkpoint-and-branch matters. Let us see how it works in moonpool. The mechanism is surprisingly simple: Unix fork(), shared memory, and a coverage bitmap. No serialization. No checkpointing to disk. No custom snapshot format. Just the operating system doing what it already does well.
The Fork
When a sometimes assertion succeeds for the first time (or a numeric watermark improves, or a frontier advances), the explorer calls fork(). The operating system creates a child process that is an exact copy of the parent: same memory, same simulation state, same processes, same network buffers, same pending timers. Everything.
The child then reseeds its RNG with a deterministic new seed derived from the parent seed, the assertion name, and the child index:
child_seed = FNV-1a(parent_seed + mark_name + child_index)
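A sketch of that derivation; the exact byte layout moonpool feeds into the hash is an assumption, but the FNV-1a constants are the standard 64-bit ones:

fn child_seed(parent_seed: u64, mark_name: &str, child_index: u32) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    for &byte in parent_seed
        .to_le_bytes()
        .iter()
        .chain(mark_name.as_bytes())
        .chain(&child_index.to_le_bytes())
    {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV-1a prime
    }
    hash
}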
From this point forward, the child’s simulation diverges. Every random decision (network latency, failure injection, timer jitter) follows a different path. But the starting state is identical to the parent’s at the moment of discovery.
The parent waits for the child to finish, then spawns the next one. Children can themselves encounter new splitpoints and fork again, building a tree of timelines:
Root (seed 42) ──────────┬── splitpoint at RNG #200
RNG #1..#200 │
├── Timeline A (new seed) ──────────── done
├── Timeline B (new seed) ──┬──────── done
│ │
│ nested splitpoint!
│ │
│ ├── B1 ── done
│ └── B2 ── BUG FOUND!
│
└── Timeline C (new seed) ──────────── done
Timeline B2 found the bug. It required two successive unlikely events: the one that triggered the first splitpoint in the root, and the one that triggered the nested splitpoint in B. No single random seed would have produced both.
Why fork() Is Perfect for This
Using fork() for simulation snapshots has properties that are hard to beat:
Zero-cost snapshots. The child gets the parent’s entire address space through copy-on-write (COW). No data is actually copied until one side writes to a page. For a simulation that mostly reads its state during the remaining execution, the overhead is minimal.
No serialization. We do not need to serialize and deserialize simulation state. The child inherits everything: Rc pointers, VecDeques, HashMaps, trait objects. This would be impossible with a checkpoint-to-disk approach for a Rust simulation with complex ownership graphs.
Isolation by default. The child cannot corrupt the parent’s state (or vice versa) because they have separate address spaces. The only shared state is explicitly allocated in shared memory.
Deterministic child seeds. The child seed is computed from the parent seed, assertion name, and child index using FNV-1a hashing. This means the entire multiverse tree is deterministic: same root seed always produces the same tree.
Shared Memory: The Communication Channel
Parent and child processes have separate address spaces, but they need to share some state: coverage data, assertion counters, energy budgets, bug recipes. All of this lives in memory allocated with mmap(MAP_SHARED | MAP_ANONYMOUS):
Parent process memory:
+----------------------------------------------------+
| Simulation state (processes, network, timers) | <- COW pages
| RNG state (will be reseeded in child) | <- COW pages
+----------------------------------------------------+
| MAP_SHARED memory: | <- truly shared
| - Assertion table (128 slots) |
| - Coverage bitmap (1024 bytes) |
| - Explored map (1024 bytes) |
| - Energy budget |
| - Fork stats + bug recipe |
+----------------------------------------------------+
The MAP_SHARED memory is the only communication channel between parent and child. This is why the moonpool-explorer crate depends only on libc. No channels, no sockets, no IPC. Just memory that both processes can read and write.
All shared state uses atomic operations. Assertion slots use compare-and-swap for first-time discovery flags and fetch_add for counters. Energy budgets use fetch_sub with rollback on failure. The bug recipe uses a CAS to ensure only the first bug’s recipe is recorded.
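A minimal sketch of that allocation with the libc crate; moonpool-explorer’s real layout and error handling differ:

use std::ptr;

/// Allocate `len` bytes that stay visible to every forked child.
unsafe fn alloc_shared(len: usize) -> *mut u8 {
    let addr = libc::mmap(
        ptr::null_mut(),
        len,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_SHARED | libc::MAP_ANONYMOUS, // anonymous: no backing file
        -1,
        0,
    );
    assert_ne!(addr, libc::MAP_FAILED, "mmap failed");
    addr.cast::<u8>()
}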
The Coverage Bitmap
Each timeline gets a small bitmap: 1024 bytes, which is 8192 bit positions. When an assertion fires, it sets the bit at position hash(assertion_name) % 8192:
Timeline A's bitmap:
byte 0 byte 1
[0 0 1 0 0 0 0 0] [0 0 0 0 0 1 0 0]
^ ^
bit 2 bit 13
(retry_fired) (timeout_hit)
The explored map (also called the virgin map, borrowing from AFL fuzzer terminology) is the union of all bitmaps across all timelines. It lives in MAP_SHARED memory. After each child finishes, the parent merges the child’s bitmap into the explored map with a bitwise OR:
After Timeline A: explored = 00100000 00000100 ...
After Timeline B: explored = 00100000 00000110 ...
^
Timeline B found bit 14!
(that's new coverage)
The critical question after each child finishes is: did this child find anything new? The has_new_bits() check answers this with a single pass over both bitmaps:
child = 00000110 (bits 1, 2 set)
explored = 00000100 (bit 2 already known)
result = child & !explored
= 00000010 <- bit 1 is NEW!
If the result is nonzero, the child discovered at least one assertion path that no previous timeline had reached. This information drives the adaptive forking system we will see in a later chapter.
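Both checks are a single pass over two 1024-byte buffers. A sketch of the logic (the real versions operate on shared memory):

fn has_new_bits(child: &[u8; 1024], explored: &[u8; 1024]) -> bool {
    // A bit set in the child but absent from the explored map is new coverage.
    child.iter().zip(explored.iter()).any(|(c, e)| (c & !e) != 0)
}

fn merge_from(explored: &mut [u8; 1024], child: &[u8; 1024]) {
    for (e, c) in explored.iter_mut().zip(child.iter()) {
        *e |= *c; // bitwise OR: union of all timelines' coverage
    }
}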
Bug Recipes
When a child process detects an assertion violation, it exits with code 42 (the special “bug found” exit code). The parent detects this via waitpid(), records the bug in shared statistics, and saves the recipe: the complete sequence of splitpoints that led to the buggy timeline.
A recipe is a list of (rng_call_count, child_seed) pairs:
[(151, 8837201), (80, 1293847), (42, 9918273)]
Formatted as a human-readable timeline string:
151@8837201 -> 80@1293847 -> 42@9918273
To replay: start with the root seed, run until RNG call #151, reseed to 8837201. Run until call #80, reseed to 1293847. Run until call #42, reseed to 9918273. The simulation follows the exact same path to the bug.
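A conceptual sketch of a replay driver; the types are illustrative, and it assumes the call counter is read per timeline as described above:

struct Replay {
    recipe: Vec<(u64, u64)>, // (rng_call_count, child_seed) pairs
    next: usize,
}

impl Replay {
    /// Invoked after every RNG call; reseeds when the next fork point is hit.
    fn maybe_reseed(&mut self, rng_call_count: u64, reseed: &mut impl FnMut(u64)) {
        if let Some(&(at_call, seed)) = self.recipe.get(self.next) {
            if rng_call_count == at_call {
                reseed(seed);
                self.next += 1;
            }
        }
    }
}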
This is the key benefit of deterministic simulation combined with fork-based exploration: bugs that required a tree of thousands of timelines to discover can be replayed as a single, straight-line execution guided by a recipe.
What Triggers a Splitpoint
Not every assertion causes the explorer to fork. Only discovery assertions trigger splitpoints, and only when they discover something genuinely new:
| Assertion kind | Triggers a splitpoint when… |
|---|---|
| assert_sometimes! | The condition is true for the first time |
| assert_reachable! | The code path is reached for the first time |
| assert_sometimes_gt! | The observed value beats the previous watermark |
| assert_sometimes_all! | More conditions are simultaneously true than ever |
| assert_sometimes_each! | A new identity-key is seen, or quality improves |
| assert_always! | Never (invariant, not a discovery) |
| assert_unreachable! | Never (safety check, not a discovery) |
The “first time” guard uses a compare-and-swap on a split_triggered flag in shared memory. Once a mark has triggered, it will not trigger again for boolean assertions. Numeric watermarks and frontiers can trigger multiple times as they improve.
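A sketch of that guard, assuming an atomic flag per mark in shared memory:

use std::sync::atomic::{AtomicBool, Ordering};

/// Returns true exactly once across every process sharing this slot.
fn try_claim_first_trigger(split_triggered: &AtomicBool) -> bool {
    split_triggered
        .compare_exchange(false, true, Ordering::SeqCst, Ordering::SeqCst)
        .is_ok()
}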
The Process Model
Putting it all together, here is what happens when a simulation runs with exploration enabled:
- The builder allocates shared memory and initializes the explorer context
- The simulation runs with its root seed
- When an assertion fires for the first time, the explorer:
  - Records the RNG call count (the splitpoint position)
  - Checks if it has energy budget remaining
  - Saves the parent’s coverage bitmap
  - For each child timeline: clears the child bitmap, computes a deterministic child seed, and calls fork()
- The child reseeds, updates its depth and recipe, and returns from the split function to continue the simulation with new randomness
- The parent waits, merges coverage, and checks for bugs
- After all children finish, the parent restores its bitmap and continues its own simulation
- At the end, the builder collects statistics and bug recipes from shared memory
The child process does not know it was forked. From its perspective, the assertion fired, the simulation continued, and it eventually completed (or found a bug). The forking is invisible to the simulation code. This transparency is what makes exploration composable with the rest of the framework: no changes to processes, workloads, or transport code.
In the next chapter, we will see why the “checks if it has energy budget remaining” step is critical. Without it, a single productive splitpoint could fork an exponentially growing tree that consumes all available memory and CPU.
Coverage and Energy Budgets
- The Simple Model: Fixed-Count Splitting
- The Three-Level Energy Budget
- The Coverage Bitmap and Saturation
- Configuration
The fork-at-discovery mechanism is powerful, but it has an obvious failure mode: exponential blowup. A splitpoint at depth 0 forks 4 children. Each of those children might hit a different splitpoint and fork 4 more. Two levels deep, we have 16 timelines. Three levels, 64. A simulation with many assertion sites and deep nesting could spawn millions of processes, exhausting memory and CPU.
We need a budget. Something that caps the total work and distributes it intelligently. In moonpool, that budget is called energy.
The Simple Model: Fixed-Count Splitting
The simplest exploration mode allocates a flat global energy pool. Every timeline spawned costs one unit. When energy hits zero, no more timelines are created, regardless of how many splitpoints remain untriggered.
ExplorationConfig {
    max_depth: 2,
    timelines_per_split: 3,
    global_energy: 50,
    adaptive: None,
    parallelism: None,
}
This says: fork up to 3 timelines at each splitpoint, allow nesting up to depth 2, and spend at most 50 total timelines across the entire exploration. If 17 splitpoints trigger, the first 16 get their 3 timelines each (48 total), and the 17th gets 2 before energy runs out.
Fixed-count mode is easy to reason about. The maximum wall-clock time is bounded: 50 timelines, each running a full simulation. For quick smoke tests and CI, this is often all you need.
But it has a problem. Every splitpoint gets the same budget (3 timelines), regardless of whether it is productive (finding new code paths) or barren (repeating paths already seen). A splitpoint guarding a dead-end code path gets the same investment as one guarding a rich, unexplored subtree. That is wasteful.
The Three-Level Energy Budget
Adaptive mode replaces the flat pool with a three-level energy system. Each level serves a different purpose.
Level 1: Global Energy
The hard cap. Every timeline, regardless of which splitpoint created it, consumes one unit of global energy. When global energy reaches zero, all exploration stops. This is the circuit breaker that prevents runaway forking.
Level 2: Per-Mark Energy
Each assertion mark (splitpoint) gets its own initial budget. When a mark first triggers, it is allocated per_mark_energy units. The mark can only spawn timelines if it has per-mark energy remaining. This ensures that no single mark can consume the entire global budget.
Level 3: The Reallocation Pool
Here is where it gets interesting. When a mark is declared barren (it has spawned several batches of timelines and none of them found new coverage bits), its remaining per-mark energy is returned to a shared reallocation pool. Productive marks that exhaust their initial per-mark budget can draw from this pool to keep exploring.
Energy flows downhill: from barren marks, through the reallocation pool, to productive marks.
+------------------------------------------------------------+
| GLOBAL ENERGY (100K) |
| Every timeline consumes 1 from here first. |
| When this hits 0, ALL exploration stops. |
+------------------------------------------------------------+
| |
| +-----------------+ +-----------------+ +------------+ |
| | Mark A: 1000 | | Mark B: 1000 | | Mark C: | |
| | (productive) | | (barren) | | 1000 (new) | |
| | Used: 1000 | | Used: 60 | | Used: 0 | |
| | Needs more! ---+--+--- Returns 940 -+--+ | |
| +--------+--------+ +-----------------+ +------------+ |
| | | |
| v v |
| +----------------------------------------------------+ |
| | REALLOCATION POOL: 940 | |
| | Energy returned by barren marks. | |
| | Productive marks draw from here when their | |
| | per-mark budget runs out. | |
| +----------------------------------------------------+ |
+------------------------------------------------------------+
The Decrement Sequence
When the explorer wants to spawn one timeline for a mark, it follows a strict sequence:
- Decrement global energy by 1. If global is at 0, stop. The hard cap is absolute.
- Decrement per-mark energy by 1. If the mark still has budget, proceed.
- If per-mark is exhausted, try the reallocation pool. Decrement the pool by 1. If the pool has energy, proceed.
- If neither per-mark nor pool has energy, undo the global decrement and stop for this mark. The mark is out of resources, but global energy is preserved for other marks.
All of these decrements use atomic fetch_sub operations in shared memory, with rollback on failure. This ensures consistency even when nested children (at different fork depths) are competing for the same budget.
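A sketch of the sequence over atomic counters; illustrative, but it follows the fetch_sub-with-rollback shape described above:

use std::sync::atomic::{AtomicU64, Ordering};

/// Decrement if nonzero; roll back and report failure on underflow.
fn decrement(counter: &AtomicU64) -> bool {
    if counter.fetch_sub(1, Ordering::SeqCst) == 0 {
        counter.fetch_add(1, Ordering::SeqCst); // rollback
        return false;
    }
    true
}

/// Try to pay for one timeline on behalf of one mark.
fn try_spend(global: &AtomicU64, per_mark: &AtomicU64, pool: &AtomicU64) -> bool {
    if !decrement(global) {
        return false; // the hard cap is absolute
    }
    if decrement(per_mark) || decrement(pool) {
        return true; // own budget first, then the reallocation pool
    }
    global.fetch_add(1, Ordering::SeqCst); // undo the global decrement
    false
}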
The Coverage Bitmap and Saturation
The 8192-bit coverage bitmap is what the explorer uses to decide whether a mark is productive or barren. After each child finishes, the parent checks: did this child set any bits that the explored map did not already have?
But 8192 bits is a small space. With many assertion sites, hash collisions are inevitable. And as exploration progresses, more bits get set. Eventually, the bitmap saturates: most or all bits are set, and every new child appears to contribute “nothing new” even if it explored genuinely different behavior.
This is a real issue. A simulation with 200 assertion sites, each hashing to a different bit, fills 200 of 8192 positions. That is only 2.4% saturation, which is fine. But assert_sometimes_each! assertions with many identity values can hash hundreds of distinct values into the bitmap. A simulation tracking 1000 partition keys sets 1000 bits, reaching 12% saturation. At higher counts, collisions make the has_new_bits() check less discriminating.
The practical consequence: simulations with many unique assertion paths need higher min_timelines (the minimum exploration per mark before declaring it barren) because the coverage bitmap loses resolution. We will see how the adaptive forking system handles this in the next chapter.
For simulations with moderate assertion counts (under a few hundred unique paths), the 8192-bit bitmap is more than sufficient. The signal is clear, barren marks are detected quickly, and energy flows efficiently from unproductive to productive splitpoints.
Configuration
ExplorationConfig {
    max_depth: 3,
    timelines_per_split: 4, // ignored in adaptive mode
    global_energy: 200_000,
    adaptive: Some(AdaptiveConfig {
        batch_size: 4,
        min_timelines: 4,
        max_timelines: 200,
        per_mark_energy: 1000,
        warm_min_timelines: None,
    }),
    parallelism: Some(Parallelism::MaxCores),
}
The max_depth parameter limits how deep the fork tree can grow. A depth of 3 means the root can fork children (depth 1), those children can fork grandchildren (depth 2), and grandchildren can fork great-grandchildren (depth 3). Beyond that, splitpoints are ignored. This prevents combinatorial explosion in fork tree depth while still allowing multi-step bugs to be discovered.
global_energy is the total number of timelines across the entire exploration. per_mark_energy is the initial budget per splitpoint. max_timelines is a hard cap per mark, even if it is still productive. The relationship between these values controls the exploration strategy:
- High global, low per-mark: many marks get explored, but none deeply. Good for breadth-dominated problems (the Zelda shape).
- High global, high per-mark: fewer marks, but each explored thoroughly. Good for depth-dominated problems (the Gradius shape).
- Large reallocation pool (from many barren marks): self-correcting. Budget flows to where it is productive.
The energy system is the bridge between “fork whenever something interesting happens” and “do not melt the computer.” It makes exploration a disciplined, bounded operation that can run in CI with predictable resource consumption. In the next chapter, we will see the adaptive algorithm that decides when a mark is barren and how to redistribute its energy.
Adaptive Forking
- The Batch Loop
- The min_timelines Floor
- Barren Mark Detection
- Energy Flow in Practice
- Parallel Adaptive Forking
- Configuration Guide
Fixed-count splitting gives every splitpoint the same number of timelines. The three-level energy budget caps total exploration and redistributes unused energy. But there is still a missing piece: how do we decide when a mark is productive and when it is barren?
This is the job of the adaptive forking algorithm. Instead of spawning all timelines at once, it works in batches, checking coverage yield after each batch. Productive marks earn more batches. Barren marks are cut off early and their remaining energy flows back to the pool.
The Batch Loop
When an assertion triggers a splitpoint, the adaptive explorer does not immediately spawn max_timelines children. It spawns a batch of batch_size children (say, 4), waits for them to finish, and asks: did any of those children discover something new?
The answer comes from the coverage bitmap. Before merging each child’s bitmap into the explored map, the parent calls has_new_bits(). If at least one child in the batch set a bit that was not in the explored map, the batch is productive. If no child set any new bit, the batch is barren.
The critical detail: the parent checks has_new_bits() before calling merge_from(). If it merged first, the second child’s “new” bits would be masked by the first child’s already-merged bits. Checking before merging ensures we accurately detect whether any child in the batch found genuinely new coverage.
Mark "retry_path":
Batch 1: spawn 4 timelines -> 2 found new bits (productive)
Batch 2: spawn 4 timelines -> 1 found new bits (productive)
Batch 3: spawn 4 timelines -> 0 found new bits (barren!)
-> Return remaining per-mark energy to pool.
-> Stop exploring this mark.
Mark "partition_heal":
Batch 1: spawn 4 timelines -> 3 found new bits (productive)
Batch 2: spawn 4 timelines -> 2 found new bits (productive)
Batch 3: spawn 4 timelines -> 2 found new bits (productive)
...continues until per-mark budget exhausted...
...draws from reallocation pool for more...
Batch 8: spawn 4 timelines -> hits max_timelines -> stop
The algorithm is greedy and practical: keep investing in a mark as long as it finds new paths. Stop when it stops finding them. Redistribute the savings.
The min_timelines Floor
There is a subtlety. A mark might appear barren after just one batch, but actually guard a rich code path that requires a few attempts to enter. If we cut it off after 4 timelines, we might miss bugs that the 8th or 12th timeline would have found.
The min_timelines parameter sets a floor: even if a mark looks barren from the start, the explorer will run at least min_timelines timelines before giving up. This prevents premature abandonment of marks that are slow starters.
How high should min_timelines be? It depends on the coverage bitmap saturation. In simulations with few assertion paths, the bitmap is sparse and has_new_bits() is a reliable signal. A min_timelines of one batch (4-8) is sufficient. In simulations with many paths (hundreds or thousands of distinct assert_sometimes_each! values), the bitmap starts to saturate. The signal becomes noisy. You need a higher floor (60-100+) to give marks a fair chance.
The concrete numbers from moonpool’s own simulation tests:
- Maze simulation (moderate assertion count): min_timelines: 100, max_timelines: 200, per_mark_energy: 1000, global energy 50K. About 5,700 timelines, 103 bugs found, 4 seconds.
- Dungeon simulation (many assertion paths, bitmap saturates quickly): min_timelines: 800, max_timelines: 2000, per_mark_energy: 20000, global energy 2M. About 268K timelines, 943 bugs found. The high min_timelines is necessary because the 8192-bit coverage bitmap saturates with many unique bucket hashes.
Barren Mark Detection
A mark is declared barren when an entire batch of children produces zero new coverage bits and the mark has already run at least min_timelines total timelines. When this happens:
- The mark’s remaining per-mark energy is atomically swapped to 0
- That energy is added to the reallocation pool via fetch_add
- The explorer stops spawning children for this mark
This is not a permanent decision. If the same assertion fires again at a deeper fork depth (in a child timeline), it gets a fresh per-mark budget. The barren classification applies to one invocation of the split loop, not to the assertion itself.
Energy Flow in Practice
Let us trace the energy flow through a concrete example. A simulation has 5 splitpoints (marks A through E) with per_mark_energy: 100 and global_energy: 1000.
Initial state:
Global: 1000
Mark A: 100 Mark B: 100 Mark C: 100 Mark D: 100 Mark E: 100
Pool: 0
Mark A triggers. Productive. Exhausts budget.
Global: 900 Mark A: 0 ... Pool: 0
Mark B triggers. Barren after 20 timelines. Returns 80.
Global: 880 Mark B: 0 ... Pool: 80
Mark C triggers. Productive. Exhausts budget. Draws 80 from pool.
Global: 700 Mark C: 0 ... Pool: 0
Mark D triggers. Barren after 10 timelines. Returns 90.
Global: 690 Mark D: 0 ... Pool: 90
Mark E triggers. Productive. Exhausts budget. Draws 90 from pool.
Global: 500 Mark E: 0 ... Pool: 0
Marks C and E got 180 and 190 timelines respectively (their 100 per-mark budget plus the 80 or 90 drawn from the pool), while marks B and D got only 20 and 10. The energy system automatically invested more in productive areas and less in unproductive ones, with zero manual tuning.
Parallel Adaptive Forking
When parallelism is configured, the adaptive loop uses a sliding window of concurrent children instead of spawning one child at a time. The parent maintains a pool of bitmap slots (one per CPU core), and children write their coverage to their assigned slot. When a slot’s child finishes (detected via waitpid(-1)), the parent merges that slot’s bitmap, recycles the slot, and spawns the next child.
The batch yield check still happens at batch boundaries. The parent drains all active children before deciding whether to continue. This means parallelism speeds up each batch but does not change the stop/continue decision logic.
Sequential (4 children):
fork-wait-merge fork-wait-merge fork-wait-merge fork-wait-merge
[----batch----] check yield
Parallel (4 children, 4 cores):
fork fork fork fork
wait-any merge wait-any merge wait-any merge wait-any merge
[------------------batch------------------] check yield
The parallel path achieves near-linear speedup for CPU-bound simulations. A 16-core machine processes 16 children concurrently, reducing a 1000-timeline exploration from minutes to seconds.
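A simplified sketch of the sliding window, assuming the libc crate; the real loop also assigns bitmap slots and records which child finished:

/// Run one batch of `batch` children, at most `max_parallel` at a time.
/// `spawn_child` is expected to fork() and reseed inside the child.
fn run_batch(batch: u32, max_parallel: u32, mut spawn_child: impl FnMut()) {
    let (mut active, mut spawned) = (0u32, 0u32);
    while spawned < batch || active > 0 {
        while active < max_parallel && spawned < batch {
            spawn_child();
            active += 1;
            spawned += 1;
        }
        // Reap whichever child finishes first, then recycle its slot.
        let mut status = 0;
        unsafe { libc::waitpid(-1, &mut status, 0) };
        active -= 1;
    }
}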
Configuration Guide
AdaptiveConfig {
    batch_size: 4,            // children per batch before checking yield
    min_timelines: 60,        // minimum before declaring barren
    max_timelines: 200,       // hard cap even for productive marks
    per_mark_energy: 1000,    // initial budget per mark
    warm_min_timelines: None, // for multi-seed (next chapter)
}
Rules of thumb:
- batch_size: 4-8 is a good default. Smaller batches detect barren marks faster. Larger batches reduce the overhead of yield checks.
- min_timelines: Start low (batch_size). Increase if you see productive marks being cut off prematurely. If your simulation has many assert_sometimes_each! values, go higher (100+).
- max_timelines: Cap based on your time budget. Each timeline runs a full simulation.
- per_mark_energy: Start at 5-10x min_timelines. The surplus feeds the reallocation pool.
The adaptive system turns exploration from a “spray and pray” strategy into an intelligent resource allocator that automatically concentrates compute where it produces results. But even the best single-seed adaptive exploration eventually hits diminishing returns. In the next chapter, we will see how running multiple seeds breaks through that ceiling.
Multi-Seed Exploration
- Coverage-Preserving Seed Transitions
- Warm Starts
- Real Numbers
- The Builder Loop
- When to Use Multi-Seed
- The Complete Picture
Even with adaptive forking and energy budgets, running a single root seed with a massive energy budget eventually hits diminishing returns. The coverage bitmap saturates. Every mark looks barren because every new child’s bits overlap with the already-dense explored map. Productive marks get cut off too early, and energy piles up in the reallocation pool with nowhere useful to go.
The fundamental issue is that a single seed reaches a particular region of the state space. No amount of branching can explore regions that the root timeline’s initial execution path never touches. If the first 100 RNG calls establish a specific network topology and failure pattern, all child timelines share that foundation. They explore variations on a theme, not genuinely different themes.
The solution: run multiple root seeds, each with moderate energy, preserving coverage information across seeds.
Coverage-Preserving Seed Transitions
The naive approach to multi-seed exploration would be to reset everything between seeds: zero the explored map, reset all assertion state, start fresh. But that throws away valuable information. If seed 1 discovered that retry_path is productive, seed 2 should know that.
Moonpool’s prepare_next_seed() function performs a selective reset that preserves cumulative knowledge while clearing per-seed transient state.
Preserved across seeds:
- The explored map (coverage bitmap union). Bits set by prior seeds stay set. The explored map is the collective memory of the multiverse.
- Assertion watermarks and frontiers. If seed 1 achieved a numeric watermark of 42, seed 2’s assertions know that 42 is the bar to beat. Only genuine improvements trigger new splitpoints.
- Best scores for bucketed assertions. Prior quality context carries forward.
Reset between seeds:
- Split triggers. Every mark can trigger fresh splitpoints with the new seed. A mark that was barren under seed 1 might be productive under seed 2 because the new seed reaches it through a different execution path.
- Pass/fail counters. Per-seed statistics start fresh.
- Energy budget. Each seed gets its own full energy allocation.
- Coverage bitmap (per-timeline). Cleared so each new root timeline starts with a clean slate.
- Bug recipe. Each seed can capture its own bug.
The preserved explored map is the key. It means that the coverage check has_new_bits() considers bits from all prior seeds. A child in seed 3 is only considered productive if it finds something that neither seed 1, seed 2, nor any of their children have seen. The bar rises progressively, focusing each seed’s energy on genuinely unexplored territory.
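A conceptual sketch of the selective reset; field names are illustrative, not the moonpool-explorer API:

struct Mark { split_triggered: bool, pass_count: u64, fail_count: u64 }

struct ExplorerState {
    explored_map: [u8; 1024], // union of coverage across ALL seeds
    marks: Vec<Mark>,
    global_energy: u64,
    timeline_bitmap: [u8; 1024],
    bug_recipe: Option<Vec<(u64, u64)>>,
}

fn prepare_next_seed(state: &mut ExplorerState, energy: u64) {
    // explored_map, watermarks, and best scores are deliberately untouched.
    for mark in &mut state.marks {
        mark.split_triggered = false; // every mark can trigger fresh splitpoints
        mark.pass_count = 0;
        mark.fail_count = 0;
    }
    state.global_energy = energy;        // each seed gets a full allocation
    state.timeline_bitmap = [0u8; 1024]; // clean slate for the new root
    state.bug_recipe = None;             // each seed can capture its own bug
}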
Warm Starts
There is a subtlety with preserved explored maps. When seed 2 starts, the explored map already has bits set from seed 1’s exploration. Marks that were productive under seed 1 (and filled in many bits) will appear to seed 2 as already-explored. The adaptive loop might classify them as barren after just min_timelines attempts, even though seed 2 has a genuinely different execution path that could find new things.
The warm_min_timelines parameter addresses this. On warm starts (seeds after the first), marks that appear barren exit after warm_min_timelines instead of the full min_timelines. This is typically set lower than min_timelines because the explored map has prior context. There is less need for a long ramp-up.
AdaptiveConfig {
    batch_size: 20,
    min_timelines: 400,
    max_timelines: 2000,
    per_mark_energy: 10_000,
    warm_min_timelines: Some(30), // warm seeds: 30 instead of 400
}
On seed 1, each mark gets at least 400 timelines before being declared barren. On seeds 2 and 3, marks that tread already-explored ground (which is most of them, since the explored map is dense) are cut off after 30 timelines. But marks that find genuinely new paths on the new seed ramp up to the full budget. Energy automatically flows to the marks where the new seed has something unique to contribute.
Real Numbers
Here are actual results from moonpool’s simulation tests:
Dungeon simulation:
- Single seed, 2M energy: ~35 seconds, found 943 bugs
- 3 seeds x 400K energy (1.2M total): ~24 seconds, found comparable bugs
Less total energy, faster wall-clock time, comparable bug discovery. The multi-seed run is faster because each seed explores a different region efficiently. The single massive seed spends most of its late-stage energy on timelines that find nothing new.
Maze simulation:
- Single seed, 50K energy: ~4 seconds
- 2 seeds x 20K energy (40K total): ~0.75 seconds
Five times faster with 20% less total energy. The two seeds happen to cover the state space from different angles, and each seed’s warm start avoids re-exploring the other’s territory.
These speedups come from the coverage-preserving reset. Without it, multi-seed exploration would be N independent runs with no shared knowledge, which is what plain UntilAllSometimesReached already does. The explored map is the innovation that makes multi-seed better than the sum of its parts.
The Builder Loop
The simulation builder orchestrates multi-seed exploration as a loop:
Seed 1 (cold start):
- init() with full energy
- run simulation
- collect stats
- cleanup exploration state
Seed 2 (warm start):
- prepare_next_seed(energy) // selective reset
- skip_next_assertion_reset() // don't zero watermarks
- run simulation
- accumulate stats
- cleanup exploration state
Seed 3 (warm start):
- prepare_next_seed(energy)
- skip_next_assertion_reset()
- run simulation
- accumulate stats
- final cleanup
Each iteration after the first calls prepare_next_seed() instead of a full init()/cleanup() cycle. The skip_next_assertion_reset() function tells the simulation world not to zero the assertion table when it creates a new simulation, preserving the watermarks and frontiers that prepare_next_seed() carefully kept.
Statistics are accumulated across seeds. The final report shows the totals: total timelines, total bugs, total coverage bits, across all seeds.
When to Use Multi-Seed
Multi-seed exploration is most valuable when:
- The simulation has many assertion paths. More assertions means faster bitmap saturation, which means each seed hits diminishing returns sooner. Multiple seeds keep the signal fresh.
- You have a time budget. 3 seeds at 400K energy with parallel forking finishes faster than 1 seed at 2M energy because warm seeds cut off re-explored territory quickly.
- You want diversity. Different root seeds produce different initial topologies, failure patterns, and timing. Each seed explores a region that the others might never reach.
Multi-seed is less important when:
- The simulation has few assertion paths. With a sparse bitmap, the adaptive loop’s barren detection works well and a single seed with enough energy explores efficiently.
- You want maximum depth. A single seed with massive energy explores the deepest possible fork trees. Multi-seed sacrifices per-seed depth for breadth across seeds.
The Complete Picture
Let us step back and see how all the exploration pieces fit together.
The problem: bugs that require sequences of unlikely events are exponentially hard to find with random seeds.
Fork at discovery: when something interesting happens, snapshot the universe and explore variations. Uses fork(), shared memory, and coverage bitmaps. Reduces sequential-luck probability from multiplicative to additive.
Energy budgets: three-level system (global, per-mark, reallocation pool) that caps total work and redistributes resources from barren to productive marks.
Adaptive forking: batch-based exploration that detects barren marks via coverage yield and stops early. Productive marks get more investment automatically.
Multi-seed exploration: multiple root seeds with moderate energy, preserving coverage knowledge across seeds. Explores genuinely different state-space regions while avoiding redundant work.
Together, these form a system that automatically allocates exploration budget where it produces results, across multiple seeds, across many splitpoints, down to the individual batch level. The developer’s job is to write good assertions (the “what to explore” specification) and set reasonable energy budgets (the “how much to spend” constraint). The exploration engine handles the rest.
The moonpool-explorer crate that implements all of this is 14 source files with a single external dependency (libc). It has zero knowledge of processes, networks, or storage. It communicates with the simulation through two function pointers. And it can turn a single-seed simulation run into a multiverse of thousands of timelines that systematically hunt for the bugs hiding behind layers of sequential luck.
Assertion Reference
- Boolean Assertions
- Numeric Assertions
- Compound Assertions
- Validation: validate_assertion_contracts()
- Related Functions
Moonpool provides 15 assertion macros for testing distributed system properties. All macros follow the Antithesis principle: assertions never crash your program. Violations are recorded in shared memory and reported after the simulation completes, allowing the system to continue running and discover cascading failures.
Every macro is tracked in shared memory via moonpool-explorer, enabling cross-process visibility across forked exploration timelines.
Boolean Assertions
These macros test boolean conditions. Always-type assertions record violations but do not panic. Sometimes-type assertions guide exploration by triggering forks on discovery.
| Macro | Category | Description | Panics? | Forks in exploration? |
|---|---|---|---|---|
assert_always! | Always | Condition must be true every time it is evaluated | No | No |
assert_always_or_unreachable! | Always | Condition must be true when reached, but the code path need not be reached | No | No |
assert_sometimes! | Sometimes | Condition should be true at least once across all iterations | No | Yes, on first success |
assert_reachable! | Reachable | Code path should be reached at least once | No | Yes, on first reach |
assert_unreachable! | Unreachable | Code path should never be reached | No | No |
assert_always!(condition, message)
Records a violation if condition is false. The simulation continues; violations are collected and reported at the end. Validated by validate_assertion_contracts(), which flags the assertion if fail_count > 0, or if must_hit is set and the assertion was never reached.
```rust
assert_always!(
    granted_count <= 1,
    "lock never granted to two nodes simultaneously"
);
```
assert_always_or_unreachable!(condition, message)
Like assert_always!, but does not require the code path to be reached. If the code is never executed, the assertion passes silently. Useful for guarding optional error-handling paths.
```rust
assert_always_or_unreachable!(
    retry_count < max_retries,
    "retry count within bounds when retry path taken"
);
```
assert_sometimes!(condition, message)
Records pass/fail statistics. On the first time the condition is true, triggers a fork in exploration mode to explore alternate timelines from that point. Validated by checking that pass_count > 0 after enough iterations.
```rust
assert_sometimes!(
    saw_leader_election,
    "leader election triggered at least once"
);
```
assert_reachable!(message)
Marks a code path as “should be reached.” Always passes true as the condition. On first reach, triggers a fork. Validated by checking that pass_count > 0.
```rust
if connection_failed {
    assert_reachable!("retry path exercised");
    retry().await;
}
```
assert_unreachable!(message)
Marks a code path that should never execute. If reached, records a violation (but does not panic). Validated by checking that pass_count == 0.
```rust
match state {
    State::Valid => { /* ok */ }
    State::Invalid => {
        assert_unreachable!("invalid state should never occur");
    }
}
```
Numeric Assertions
These macros compare a value against a threshold. Always-type numeric assertions record violations on failure. Sometimes-type numeric assertions track watermarks (best observed value) and trigger forks when the watermark improves.
All values are cast to i64 internally.
| Macro | Category | Comparison | Panics? | Forks in exploration? |
|---|---|---|---|---|
assert_always_greater_than! | NumericAlways | val > threshold | No | No |
assert_always_greater_than_or_equal_to! | NumericAlways | val >= threshold | No | No |
assert_always_less_than! | NumericAlways | val < threshold | No | No |
assert_always_less_than_or_equal_to! | NumericAlways | val <= threshold | No | No |
assert_sometimes_greater_than! | NumericSometimes | val > threshold | No | Yes, on watermark improvement |
assert_sometimes_greater_than_or_equal_to! | NumericSometimes | val >= threshold | No | Yes, on watermark improvement |
assert_sometimes_less_than! | NumericSometimes | val < threshold | No | Yes, on watermark improvement |
assert_sometimes_less_than_or_equal_to! | NumericSometimes | val <= threshold | No | Yes, on watermark improvement |
Always numeric example
```rust
assert_always_greater_than!(
    queue.len(),
    0,
    "queue never empty during processing"
);
assert_always_less_than_or_equal_to!(
    latency_ms,
    timeout_ms,
    "response latency within timeout"
);
```
Sometimes numeric example
```rust
assert_sometimes_greater_than!(
    concurrent_connections,
    5,
    "achieved high concurrency"
);
assert_sometimes_less_than!(
    retry_delay_ms,
    100,
    "fast retry path exercised"
);
```
Watermark tracking: For sometimes numeric assertions, the explorer tracks the best value observed so far. When a new evaluation improves the watermark (higher for gt/ge, lower for lt/le), a fork is triggered to explore timelines that might push the metric even further.
Compound Assertions
These macros track multi-dimensional properties across multiple conditions or identity keys.
| Macro | Category | Description | Panics? | Forks in exploration? |
|---|---|---|---|---|
assert_sometimes_all! | BooleanSometimesAll | All named booleans should sometimes be true simultaneously | No | Yes, on frontier advance |
assert_sometimes_each! | EachBucket | Per-identity bucketed assertion with optional quality watermarks | No | Yes, on new bucket or quality improvement |
assert_sometimes_all!(message, [(name, bool), ...])
Tracks a frontier: the maximum number of conditions that have been simultaneously true. When the frontier advances (more conditions true at once than ever before), a fork is triggered.
```rust
assert_sometimes_all!("all_nodes_healthy", [
    ("node_a", node_a_healthy),
    ("node_b", node_b_healthy),
    ("node_c", node_c_healthy),
]);
```
If previously at most 2 of the 3 conditions were true at once, and now all 3 are true, the frontier advances from 2 to 3 and a fork is triggered.
assert_sometimes_each!(message, [(key, value), ...]) / assert_sometimes_each!(message, [(key, value), ...], [(quality_key, quality_value), ...])
Each unique combination of identity keys gets its own bucket. On first discovery of a new bucket, a fork is triggered. If quality keys are provided, the explorer also tracks quality watermarks per bucket and re-forks when quality improves.
```rust
// Identity keys only -- fork on first discovery of each (lock, depth) combo
assert_sometimes_each!("gate", [("lock", lock_id), ("depth", depth)]);

// With quality watermarks -- also fork when health improves for a known bucket
assert_sometimes_each!("descended", [("to_floor", floor)], [("health", hp)]);
```
Validation: validate_assertion_contracts()
After the simulation completes, validate_assertion_contracts() reads all assertion slots from shared memory and checks each one against its kind-specific contract. It returns two vectors:
Always violations (definite bugs)
These indicate real bugs and are safe to check regardless of iteration count.
| Kind | Violation condition |
|---|---|
Always | fail_count > 0 (condition was false at least once) |
Always | must_hit && total == 0 (assertion was never reached) |
AlwaysOrUnreachable | fail_count > 0 (condition was false when reached) |
Unreachable | pass_count > 0 (code path was reached) |
NumericAlways | fail_count > 0 (comparison failed at least once) |
Coverage violations (statistical)
These are only meaningful with enough iterations for statistical coverage. A single iteration may not trigger every path.
| Kind | Violation condition |
|---|---|
Sometimes | total > 0 && pass_count == 0 (condition was never true) |
Reachable | pass_count == 0 (code path was never reached) |
NumericSometimes | total > 0 && pass_count == 0 (comparison never succeeded) |
BooleanSometimesAll | No simple violation (frontier tracking is the guidance mechanism) |
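As a sketch of how these contracts might be consumed in a test harness, the snippet below destructures the two vectors described above. This is a minimal sketch: the tuple return shape, the printable element type, and the completed_iterations variable are assumptions, not confirmed API.

```rust
// Hedged sketch: assumes validate_assertion_contracts() returns the two
// vectors described above as a tuple of printable entries.
let (always_violations, coverage_violations) = validate_assertion_contracts();

// Always violations are definite bugs at any iteration count.
assert!(
    always_violations.is_empty(),
    "definite bugs: {always_violations:?}"
);

// Coverage violations are statistical; only enforce them after enough
// iterations (the threshold and counter here are illustrative).
if completed_iterations >= 1_000 {
    assert!(
        coverage_violations.is_empty(),
        "assertion coverage gaps: {coverage_violations:?}"
    );
}
```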
Related Functions
| Function | Purpose |
|---|---|
record_always_violation() | Increment the thread-local violation counter (called by always-type macros) |
reset_always_violations() | Reset the violation counter (called at the start of each iteration) |
has_always_violations() -> bool | Check if any always-type violation occurred this iteration |
get_assertion_results() -> HashMap<String, AssertionStats> | Read all assertion statistics from shared memory |
reset_assertion_results() | Zero the shared memory assertion table (between iterations) |
skip_next_assertion_reset() | Prevent the next reset (used by multi-seed exploration) |
panic_on_assertion_violations(report) | Panic if the report contains any assertion violations |
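Stitched together, these functions suggest a per-iteration harness along the following lines; run_one_iteration and the literal seed list are hypothetical stand-ins, and only the reset/check functions come from the table above.

```rust
// Hypothetical harness sketch built from the functions in the table above;
// the driver function and seeds are stand-ins, not real API.
for seed in [42u64, 43, 44] {
    reset_always_violations(); // start-of-iteration reset
    run_one_iteration(seed);   // hypothetical: run one simulation pass
    if has_always_violations() {
        eprintln!("always-type violation under seed {seed}");
    }
    reset_assertion_results(); // zero the shared-memory table between iterations
}
```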
Crate Map
Moonpool is organized as a workspace of eight crates. The dependency graph is deliberately layered: lower crates know nothing about higher ones, and the leaf crate (moonpool-explorer) has no moonpool dependencies at all.
Dependency Diagram
```text
moonpool (facade crate)
├── moonpool-transport (peer, wire, RPC)
│   ├── moonpool-transport-derive (proc macros)
│   ├── moonpool-sim
│   └── moonpool-core
├── moonpool-sim (simulation)
│   ├── moonpool-core (provider traits)
│   └── moonpool-explorer (fork-based exploration)
│       └── libc
└── moonpool-core (provider traits)
```

moonpool-sim-examples (example simulation binaries) depends only on moonpool-sim. xtask (cargo automation) is invoked via cargo xtask and is not a library dependency.
Crate Details
moonpool
Role: Facade crate. Re-exports everything from the lower crates so users only need one dependency.
Dependencies: moonpool-core, moonpool-sim, moonpool-transport
Key types: Re-exports all types from moonpool-core, moonpool-sim, and moonpool-transport.
moonpool-core
Role: Provider traits and core type definitions. Defines the abstraction boundary between real and simulated runtimes.
Dependencies: async-trait, rand, serde, serde_json, thiserror, tokio, tracing
Key traits:
- TimeProvider – sleep(), timeout(), clock access
- TaskProvider – spawn_task() for local task spawning
- NetworkProvider – TCP listener and stream creation
- RandomProvider – deterministic random number generation
- StorageProvider – file I/O with simulation support
Key types:
- Endpoint – (IpAddr, Token) pair identifying a connection endpoint
- UID – unique identifier type
- NetworkAddress – parsed network address
- WellKnownToken – reserved token namespace for framework services
- Providers – bundle of all provider traits
- SimulationError / SimulationResult – error types
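To make the abstraction concrete, here is a minimal sketch of code written against TimeProvider rather than tokio directly. The exact sleep() signature is an assumption based on the method list above, and send_ping is a hypothetical helper.

```rust
use std::time::Duration;

// Generic over the provider, so the same function runs on real tokio time
// or simulated virtual time. Signature details are assumptions.
async fn heartbeat<T: TimeProvider>(time: &T) {
    loop {
        send_ping(); // hypothetical helper
        time.sleep(Duration::from_millis(500)).await;
    }
}
```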
moonpool-sim
Role: Simulation runtime, chaos testing, buggify system, and assertion macros. The core simulation engine that drives deterministic testing.
Dependencies: moonpool-core, moonpool-explorer, async-trait, crc32c, futures, rand, rand_chacha, serde, serde_json, thiserror, tokio, tokio-util, tracing
Key types:
- SimWorld – the simulated world containing network, time, storage, and event queue
- SimulationBuilder – builder pattern for configuring experiments
- SimContext – per-workload context providing access to providers and topology
- NetworkConfiguration / ChaosConfiguration – network chaos parameters
- Process – trait for system-under-test server processes
- Workload – trait for test driver workloads
- Attrition – automatic process reboot configuration
- FaultInjector / FaultContext – custom fault injection during chaos phase
- IterationControl – how many iterations to run
- SimulationReport – results, metrics, and assertion data
- Invariant – trait for cross-system property validation
Assertion macros (15 total): assert_always!, assert_sometimes!, assert_reachable!, assert_unreachable!, assert_always_greater_than!, assert_sometimes_each!, and more. See the Assertion Reference for the complete list.
moonpool-transport
Role: Peer connections, wire format, FlowTransport-style networking, and RPC. Modeled after FoundationDB’s FlowTransport.
Dependencies: moonpool-core, moonpool-sim, moonpool-transport-derive, async-trait, crc32c, futures, serde, serde_json, thiserror, tokio, tokio-util, tracing
Key types:
- Peer – manages a connection to a remote endpoint with automatic reconnection
- PeerConfig – reconnection delays, queue size, connection timeout
- MonitorConfig – ping-based connection health monitoring
- NetTransport – central coordinator managing peers and packet dispatch
- EndpointMap – hybrid token routing (O(1) well-known, O(log n) dynamic)
- FailureMonitor / FailureStatus – reactive address/endpoint failure tracking
- ReplyPromise – server-side response promise (auto-sends BrokenPromise on Drop)
- ReplyFuture – client-side response future (auto-closes queue on Drop)
- ReplyError – error enum including MaybeDelivered, Timeout, BrokenPromise
- RequestStream – server-side typed request receiver with recv_with_transport()
- RequestEnvelope – request + reply_to endpoint for bidirectional RPC
- MessagingError – transport-level error type
- Delivery modes: send, try_get_reply, get_reply, get_reply_unless_failed_for
- Load balancing: Alternatives, Distance, AtMostOnce, QueueModel, ModelHolder, Smoother, load_balance()
- Fan-out: fan_out_all, fan_out_quorum, fan_out_race, fan_out_all_partial, FanOutError
Proc macros (from moonpool-transport-derive):
- #[service] – generates service trait, server, client, and bound client types
moonpool-transport-derive
Role: Procedural macros for RPC service definitions.
Dependencies: proc-macro2, quote, syn (compile-time only, no runtime deps)
Provides:
- #[service] – attribute macro for RPC service definitions
This is a proc-macro crate and cannot export regular types or functions.
moonpool-explorer
Role: Fork-based multiverse exploration, coverage tracking, and energy budgets. A leaf crate with zero moonpool knowledge – communicates with the simulation only through RNG function pointers.
Dependencies: libc (only dependency)
Key types:
- ExplorationConfig – max depth, energy, timelines per split, adaptive config
- AdaptiveConfig – batch size, min/max timelines, per-mark energy
- Parallelism – multi-core exploration variants (MaxCores, HalfCores, Cores, MaxCoresMinus)
- AssertionSlot / AssertionSlotSnapshot – shared-memory assertion tracking (128 slots)
- AssertKind – Always, AlwaysOrUnreachable, Sometimes, Reachable, Unreachable, NumericAlways, NumericSometimes, BooleanSometimesAll
- AssertCmp – Gt, Ge, Lt, Le
- EachBucket – per-value bucketed assertion tracking (256 buckets)
- CoverageBitmap / ExploredMap – 8192-bit coverage bitmaps
- EnergyBudget – 3-level energy system (global + per-mark + reallocation pool)
- SharedStats / SharedRecipe – cross-process counters and bug replay data
- ExplorationStats – snapshot of exploration progress
Key functions:
- init() / cleanup() – lifecycle management
- init_assertions() / cleanup_assertions() – assertion-only lifecycle
- set_rng_hooks() – connect to simulation's RNG
- assertion_bool(), assertion_numeric(), assertion_sometimes_all(), assertion_sometimes_each() – assertion recording
- split_on_discovery() – fork the process at a splitpoint
- exit_child() – terminate a forked child process
- prepare_next_seed() – selective reset for multi-seed exploration
- sancov_edges_covered(), sancov_edge_count(), sancov_is_available() – sanitizer coverage integration
moonpool-sim-examples
Role: Example simulation binaries demonstrating exploration features.
Dependencies: moonpool-sim
Binaries:
- sim-maze-explore – adaptive exploration on maze workload
- sim-dungeon-explore – adaptive exploration on dungeon workload
Not a library dependency – contains only binary targets for demonstration and testing of the exploration subsystem.
xtask
Role: Cargo xtask automation for running simulation binaries.
Not a library dependency – invoked via cargo xtask.
Commands:
- cargo xtask sim list – list all simulation binaries
- cargo xtask sim run <filter> – run simulation binaries matching a filter
- cargo xtask sim run-all – run all simulation binaries
Configuration Reference
- SimulationBuilder
- IterationControl
- ProcessCount
- WorkloadCount
- ClientId
- Attrition
- NetworkConfiguration
- ChaosConfiguration
- PeerConfig
- MonitorConfig
- ExplorationConfig
- AdaptiveConfig
This chapter documents every configuration type in Moonpool with its fields, types, and default values. All values are sourced directly from the codebase.
SimulationBuilder
The builder pattern for configuring and running simulation experiments. Created via SimulationBuilder::new().
| Method | Parameters | Description |
|---|---|---|
workload(w) | impl Workload | Add a single workload instance, reused across iterations |
workload_with_client_id(cid, w) | ClientId, impl Workload | Single workload with custom client ID strategy |
workloads(count, factory) | WorkloadCount, Fn(usize) -> Box<dyn Workload> | Add factory-created workload instances |
workloads_with_client_id(count, cid, factory) | WorkloadCount, ClientId, factory | Factory workloads with custom client IDs |
processes(count, factory) | impl Into<ProcessCount>, Fn() -> Box<dyn Process> | Add server processes (system under test) |
tags(dimensions) | &[(&str, &[&str])] | Attach round-robin tag distribution to processes |
attrition(config) | Attrition | Enable automatic process reboots during chaos phase |
invariant(i) | impl Invariant | Add an invariant checked after every simulation event |
invariant_fn(name, f) | String, closure | Add a closure-based invariant |
fault(f) | impl FaultInjector | Add a custom fault injector for the chaos phase |
chaos_duration(dur) | Duration | Set the chaos phase duration (faults run concurrently with workloads) |
set_iterations(n) | usize | Run exactly N iterations (default: 1) |
set_iteration_control(ctrl) | IterationControl | Set the iteration control strategy |
set_time_limit(dur) | Duration | Run for a wall-clock time duration |
set_debug_seeds(seeds) | Vec<u64> | Use specific seeds for deterministic debugging |
random_network() | – | Enable randomized NetworkConfiguration per iteration |
enable_exploration(config) | ExplorationConfig | Enable fork-based multiverse exploration |
replay_recipe(recipe) | BugRecipe | Replay a specific bug recipe |
run() | – | Execute the simulation, returns SimulationReport |
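Assembled from the methods in the table above, a representative experiment might be configured as follows. KvServer and KvWorkload are hypothetical stand-ins, and the exact run() signature is not specified in this reference.

```rust
use std::time::Duration;

// Representative builder usage; every method comes from the table above,
// but the process/workload types are hypothetical.
let report = SimulationBuilder::new()
    .processes(3, || Box::new(KvServer::default()) as Box<dyn Process>)
    .workload(KvWorkload::default())
    .chaos_duration(Duration::from_secs(30)) // enable the chaos phase
    .set_iterations(100)
    .random_network()                        // randomize NetworkConfiguration per iteration
    .run();
```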
Default state
A freshly created SimulationBuilder::new() has:
- iteration_control: IterationControl::FixedCount(1)
- use_random_config: false (uses NetworkConfiguration::default())
- exploration: disabled
- seeds: empty (auto-generated)
- No workloads, processes, invariants, or fault injectors
IterationControl
Controls how many iterations a simulation runs.
| Variant | Type | Description |
|---|---|---|
FixedCount(n) | usize | Run exactly n iterations |
TimeLimit(duration) | Duration | Run for a wall-clock time duration |
Note: The UntilAllSometimesReached(N) pattern mentioned in CLAUDE.md is implemented at the test level by checking assertion coverage, not as a variant of IterationControl.
ProcessCount
Controls how many process instances to spawn per iteration.
| Variant | Type | Description |
|---|---|---|
Fixed(n) | usize | Spawn exactly n processes every iteration |
Range(range) | RangeInclusive<usize> | Spawn a seeded random count from the inclusive range |
Accepts usize or RangeInclusive<usize> via Into<ProcessCount>.
WorkloadCount
Controls how many workload instances to spawn per iteration.
| Variant | Type | Description |
|---|---|---|
Fixed(n) | usize | Spawn exactly n instances |
Random(range) | Range<usize> | Spawn a seeded random count from the half-open range |
ClientId
Strategy for assigning client IDs to workload instances.
| Variant | Type | Description |
|---|---|---|
Fixed(base) | usize | Sequential IDs starting from base: instance 0 gets base, instance 1 gets base + 1, etc. |
RandomRange(range) | Range<usize> | Random ID drawn from [start..end) per instance (not guaranteed unique) |
Default: Fixed(0) (sequential starting from 0, matching FoundationDB’s WorkloadContext.clientId).
Attrition
Built-in configuration for automatic process reboots during the chaos phase. Requires .chaos_duration() to be set.
| Field | Type | Default | Description |
|---|---|---|---|
max_dead | usize | – | Maximum number of simultaneously dead processes |
prob_graceful | f64 | – | Weight for graceful reboots (signal + grace period) |
prob_crash | f64 | – | Weight for crash reboots (immediate kill) |
prob_wipe | f64 | – | Weight for crash + storage wipe reboots |
recovery_delay_ms | Option<Range<usize>> | 1000..10000 | Delay before restarting a killed process (ms) |
grace_period_ms | Option<Range<usize>> | 2000..5000 | Time allowed for graceful shutdown before force-kill (ms) |
The prob_* fields are weights, not probabilities. They are normalized internally and do not need to sum to 1.0.
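As an illustration, a configuration weighted toward graceful reboots might look like the following; the field names follow the table above, but struct-literal construction is an assumption about the API.

```rust
// Hypothetical construction: one process down at a time, weighted toward
// graceful reboots, with a custom recovery delay.
let attrition = Attrition {
    max_dead: 1,
    prob_graceful: 0.5, // weights, normalized internally
    prob_crash: 0.3,
    prob_wipe: 0.2,
    recovery_delay_ms: Some(500..2_000),
    grace_period_ms: None, // fall back to the 2000..5000 default
};
```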
RebootKind
The type of reboot chosen based on attrition probabilities:
| Variant | Behavior |
|---|---|
Graceful | Signal shutdown token, wait grace period, drain send buffers, then restart |
Crash | Immediate task cancel, all connections abort, no buffer drain |
CrashAndWipe | Same as Crash plus immediate storage wipe for the process (scoped by IP) |
NetworkConfiguration
Top-level network simulation parameters.
| Field | Type | Default |
|---|---|---|
bind_latency | Range<Duration> | 50us..150us |
accept_latency | Range<Duration> | 1ms..6ms |
connect_latency | Range<Duration> | 1ms..11ms |
read_latency | Range<Duration> | 10us..60us |
write_latency | Range<Duration> | 100us..600us |
chaos | ChaosConfiguration | See below |
Constructor variants
| Constructor | Description |
|---|---|
NetworkConfiguration::default() | Standard defaults with chaos enabled |
NetworkConfiguration::random_for_seed() | Randomized per seed for chaos testing |
NetworkConfiguration::fast_local() | Minimal latencies, all chaos disabled |
ChaosConfiguration
All fault injection settings for the simulated network. Part of NetworkConfiguration.
Clogging
| Field | Type | Default |
|---|---|---|
clog_probability | f64 | 0.0 |
clog_duration | Range<Duration> | 100ms..300ms |
Network Partitions
| Field | Type | Default |
|---|---|---|
partition_probability | f64 | 0.0 |
partition_duration | Range<Duration> | 200ms..2s |
partition_strategy | PartitionStrategy | Random |
PartitionStrategy variants: Random, UniformSize, IsolateSingle.
Bit Flips
| Field | Type | Default |
|---|---|---|
bit_flip_probability | f64 | 0.0001 (0.01%) |
bit_flip_min_bits | u32 | 1 |
bit_flip_max_bits | u32 | 32 |
bit_flip_cooldown | Duration | 0 |
Partial Writes
| Field | Type | Default |
|---|---|---|
partial_write_max_bytes | usize | 1000 |
Random Connection Close
| Field | Type | Default |
|---|---|---|
random_close_probability | f64 | 0.00001 (0.001%) |
random_close_cooldown | Duration | 5s |
random_close_explicit_ratio | f64 | 0.3 (30% explicit) |
Clock Drift
| Field | Type | Default |
|---|---|---|
clock_drift_enabled | bool | true |
clock_drift_max | Duration | 100ms |
Buggified Delay
| Field | Type | Default |
|---|---|---|
buggified_delay_enabled | bool | true |
buggified_delay_max | Duration | 100ms |
buggified_delay_probability | f64 | 0.25 (25%) |
Connection Failures
| Field | Type | Default |
|---|---|---|
connect_failure_mode | ConnectFailureMode | Probabilistic |
connect_failure_probability | f64 | 0.5 (50%) |
ConnectFailureMode variants: Disabled, AlwaysFail, Probabilistic (50% refused, 50% hang).
Latency Distribution
| Field | Type | Default |
|---|---|---|
latency_distribution | LatencyDistribution | Uniform |
slow_latency_probability | f64 | 0.001 (0.1%) |
slow_latency_multiplier | f64 | 10.0 |
LatencyDistribution variants: Uniform, Bimodal (99.9% fast, 0.1% slow).
Handshake Delay
| Field | Type | Default |
|---|---|---|
handshake_delay_enabled | bool | true |
handshake_delay_max | Duration | 10ms |
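Taken together, a targeted chaos profile might be built by mutating a default configuration. The field paths follow the tables above, but public-field mutation is an assumption about the API.

```rust
use std::time::Duration;

// Sketch: enable partitions, isolate single nodes, and turn off connect
// failures, leaving every other default in place.
let mut net = NetworkConfiguration::default();
net.chaos.partition_probability = 0.01; // defaults to 0.0
net.chaos.partition_duration = Duration::from_millis(200)..Duration::from_secs(2);
net.chaos.partition_strategy = PartitionStrategy::IsolateSingle;
net.chaos.connect_failure_mode = ConnectFailureMode::Disabled;
```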
PeerConfig
Configuration for peer behavior and automatic reconnection. Part of moonpool-transport.
| Field | Type | Default |
|---|---|---|
initial_reconnect_delay | Duration | 100ms |
max_reconnect_delay | Duration | 30s |
max_queue_size | usize | 1000 |
connection_timeout | Duration | 5s |
max_connection_failures | Option<u32> | None (unlimited) |
monitor | Option<MonitorConfig> | Some(MonitorConfig::default()) |
Constructor variants
| Constructor | initial_reconnect_delay | max_reconnect_delay | max_queue_size | connection_timeout | max_connection_failures |
|---|---|---|---|---|---|
PeerConfig::default() | 100ms | 30s | 1000 | 5s | None |
PeerConfig::local_network() | 10ms | 1s | 100 | 500ms | Some(10) |
PeerConfig::wan_network() | 500ms | 60s | 5000 | 30s | None |
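A preset can presumably be customized field by field; the struct-update syntax here is an assumption.

```rust
use std::time::Duration;

// Start from the WAN preset and tighten just the connection timeout.
let peer_cfg = PeerConfig {
    connection_timeout: Duration::from_secs(10),
    ..PeerConfig::wan_network()
};
```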
MonitorConfig
Ping-based connection health monitoring for peers. Follows FoundationDB’s connectionMonitor pattern.
| Field | Type | Default |
|---|---|---|
ping_interval | Duration | 1s |
ping_timeout | Duration | 2s |
max_tolerated_timeouts | u32 | 3 |
Constructor variants
| Constructor | ping_interval | ping_timeout | max_tolerated_timeouts |
|---|---|---|---|
MonitorConfig::default() | 1s | 2s | 3 |
MonitorConfig::local_network() | 500ms | 1s | 2 |
MonitorConfig::wan_network() | 5s | 10s | 5 |
ExplorationConfig
Configuration for fork-based multiverse exploration. Passed to SimulationBuilder::enable_exploration().
| Field | Type | Description |
|---|---|---|
max_depth | u32 | Maximum fork depth (0 = no forking) |
timelines_per_split | u32 | Children per splitpoint in fixed-count mode |
global_energy | i64 | Total number of fork operations allowed |
adaptive | Option<AdaptiveConfig> | Adaptive forking config; None = fixed-count mode |
parallelism | Option<Parallelism> | Multi-core exploration; None = sequential |
Parallelism
Controls how many forked children run concurrently.
| Variant | Slot count |
|---|---|
MaxCores | All available CPU cores |
HalfCores | Half of available cores (integer division, min 1) |
Cores(n) | Exactly n concurrent children |
MaxCoresMinus(n) | All cores minus n (min 1) |
AdaptiveConfig
Configuration for coverage-yield-driven batch forking. Used when ExplorationConfig::adaptive is Some.
| Field | Type | Description |
|---|---|---|
batch_size | u32 | Children to fork per batch before checking coverage yield |
min_timelines | u32 | Minimum total forks per mark (even if barren after first batch) |
max_timelines | u32 | Hard cap on total forks per mark |
per_mark_energy | i64 | Initial energy budget per assertion mark |
warm_min_timelines | Option<u32> | Minimum timelines for warm starts (multi-seed); defaults to batch_size if None |
How the 3-level energy system works
- Global energy (global_energy): hard cap on total timelines across all marks. When this hits 0, all exploration stops.
- Per-mark energy (per_mark_energy): initial budget for each assertion mark. When exhausted, the mark draws from the reallocation pool.
- Reallocation pool: energy returned by barren marks (marks that stopped producing new coverage). Productive marks can draw from this pool to continue exploring.
A mark is considered barren when a batch of children produces no new coverage bits and the mark has already spawned at least min_timelines (or warm_min_timelines during a warm start). Barren marks return their remaining per-mark energy to the reallocation pool.
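Putting the two tables together, an adaptive exploration setup might look like the following; struct-literal construction and the specific numbers are illustrative assumptions, not recommendations.

```rust
// Hedged sketch combining the ExplorationConfig and AdaptiveConfig fields
// documented above.
let exploration = ExplorationConfig {
    max_depth: 4,
    timelines_per_split: 8, // used only in fixed-count mode
    global_energy: 400_000,
    adaptive: Some(AdaptiveConfig {
        batch_size: 4,
        min_timelines: 8,
        max_timelines: 64,
        per_mark_energy: 1_000,
        warm_min_timelines: None, // defaults to batch_size
    }),
    parallelism: Some(Parallelism::HalfCores),
};
```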
Fault Reference
Consolidated quick-reference of every fault moonpool-sim can inject, organized by category. For detailed explanations and examples, see Network Faults, Storage Faults, and Attrition: Process Reboots.
Every fault listed below is automatically emitted to the "sim:faults" event timeline as a SimFaultEvent. Invariants can read these to correlate application behavior with infrastructure faults.
All defaults below refer to the values in ChaosConfiguration::default() and StorageConfiguration::default(). When using random_for_seed(), these values are randomized per seed within documented ranges.
Network Faults
Configured via ChaosConfiguration (nested under NetworkConfiguration::chaos).
Connection Failures
| Fault | Config Field | Default | Real-World Scenario |
|---|---|---|---|
| Random connection close | random_close_probability | 0.001% | Reconnection logic, message redelivery, connection pooling |
| Asymmetric close | random_close_explicit_ratio | 30% explicit (FIN), 70% silent (RST) | Half-closed sockets, FIN vs RST handling |
| Close cooldown | random_close_cooldown | 5s | Prevents cascading failures after a close event |
| Connect failure | connect_failure_mode | Probabilistic (50% refused, 50% hang) | Connection establishment retries, timeout handling |
| Connect failure probability | connect_failure_probability | 50% | Ratio of failed vs hanging connections |
Latency and Congestion
| Fault | Config Field | Default | Real-World Scenario |
|---|---|---|---|
| Latency distribution | latency_distribution | Uniform | P99/P99.9 tail latency testing |
| Slow latency spike | slow_latency_probability | 0.1% (bimodal mode only) | GC pauses, cross-datacenter hops |
| Slow latency multiplier | slow_latency_multiplier | 10x normal | Magnitude of tail latency spikes |
| Write clogging | clog_probability / clog_duration | 0%, 100-300ms | Backpressure handling, flow control |
| Clock drift | clock_drift_enabled / clock_drift_max | enabled, 100ms | Lease expiration, distributed consensus, TTL handling |
| Buggified delay | buggified_delay_probability / buggified_delay_max | 25%, 100ms | Race conditions, timing-dependent bugs |
| Handshake delay | handshake_delay_enabled / handshake_delay_max | enabled, 10ms | TLS negotiation, connection startup overhead |
Network Partitions
| Fault | Config Field | Default | Real-World Scenario |
|---|---|---|---|
| Random partition | partition_probability | 0% | Split-brain, quorum loss, leader election |
| Partition duration | partition_duration | 200ms-2s | Recovery time after network heal |
| Partition strategy | partition_strategy | Random | Random / UniformSize / IsolateSingle patterns |
Manual partition methods are also available on SimWorld: partition_pair(), partition_send_from(), partition_recv_to().
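A rough sketch of those manual methods follows; only the method names come from the text above, and the parameter types (node IPs here) and the sim handle are assumptions.

```rust
// Hedged: argument types are guesses; only the method names are documented.
sim.partition_pair(node_a, node_b); // drop traffic in both directions
sim.partition_send_from(node_a);    // node_a's outbound messages dropped
sim.partition_recv_to(node_b);      // messages destined for node_b dropped
```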
Data Integrity
| Fault | Config Field | Default | Real-World Scenario |
|---|---|---|---|
| Bit flips | bit_flip_probability | 0.01% | CRC/checksum validation, data corruption detection |
| Flip range | bit_flip_min_bits / bit_flip_max_bits | 1-32 bits | Power-law distribution of corruption severity |
| Flip cooldown | bit_flip_cooldown | 0 (no cooldown) | Rate-limiting corruption events |
| Partial writes | partial_write_max_bytes | 1000 bytes | TCP fragmentation, message framing |
Half-Open Connections
| Fault | Method | Real-World Scenario |
|---|---|---|
| Peer crash simulation | simulate_peer_crash() | TCP keepalive, heartbeat detection, silent failures |
| Half-open error detection | should_half_open_error() | Timeout-based failure detection |
| Stable connection exemption | mark_connection_stable() | Exempt supervision channels from chaos |
Storage Faults
Configured via StorageConfiguration. All fault probabilities default to 0% and must be enabled explicitly or via random_for_seed(). Storage faults are scoped per process: StorageState holds a global config plus optional per-process overrides in per_process_configs. Use SimWorld::set_process_storage_config(ip, config) to assign different fault profiles to individual processes.
| Fault | Config Field | Default | Real-World Scenario |
|---|---|---|---|
| Read corruption | read_fault_probability | 0% | ECC failures, DRAM bit flips, media degradation |
| Write corruption | write_fault_probability | 0% | Bad sectors, controller bugs, disk full |
| Crash fault (torn writes) | crash_fault_probability | 0% | Power loss mid-I/O, crash consistency |
| Misdirected write | misdirect_write_probability | 0% | Firmware bugs, wrong block written |
| Misdirected read | misdirect_read_probability | 0% | Controller errors, wrong block read |
| Phantom write | phantom_write_probability | 0% | Drive lies about durability |
| Sync failure | sync_failure_probability | 0% | fsync fails, disk full |
Per-Process Storage Operations
| Method | Parameters | Description |
|---|---|---|
SimWorld::set_process_storage_config(ip, config) | IpAddr, StorageConfiguration | Set per-process fault config (overrides global) |
SimWorld::simulate_crash_for_process(ip, close_files) | IpAddr, bool | Simulate power loss: torn writes, optional file close |
SimWorld::wipe_storage_for_process(ip) | IpAddr | Delete all storage owned by the process |
SimWorld::storage_provider(ip) | IpAddr | Create a SimStorageProvider scoped to this process |
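For example, one process can be given a fault-heavy profile while the rest keep the global config. Field mutation on StorageConfiguration is an assumption; the field names come from the storage fault table, and node_ip is a hypothetical stand-in.

```rust
// Sketch: per-process override using the methods in the table above.
let mut faulty = StorageConfiguration::default();
faulty.write_fault_probability = 0.001;  // bad sectors on this node only
faulty.crash_fault_probability = 0.0005; // torn writes on power loss
sim.set_process_storage_config(node_ip, faulty);
```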
Storage Performance Simulation
Storage also simulates realistic performance characteristics independent of fault injection.
| Parameter | Config Field | Default | Description |
|---|---|---|---|
| IOPS | iops | 25,000 | I/O operations per second limit |
| Bandwidth | bandwidth | 150 MB/s | Maximum throughput |
| Read latency | read_latency | 50-200us | Per-read operation delay |
| Write latency | write_latency | 100-500us | Per-write operation delay |
| Sync latency | sync_latency | 1-5ms | Per-sync/flush delay |
Process Lifecycle Faults
Configured via Attrition (built-in) or custom FaultInjector implementations.
| Fault | Mechanism | Behavior |
|---|---|---|
| Graceful reboot | RebootKind::Graceful | Signal shutdown token, wait grace period (default 2-5s), force kill, restart after recovery delay (default 1-10s) |
| Crash reboot | RebootKind::Crash | Immediate task abort, all connections reset, restart after recovery delay |
| Crash + wipe | RebootKind::CrashAndWipe | Crash behavior + immediate wipe of all persistent storage owned by the process (scoped by IP) |
| Continuous attrition | Attrition config | Random reboots during chaos phase with weighted prob_graceful/prob_crash/prob_wipe and max_dead limit |
Configuration Presets
| Preset | Description |
|---|---|
NetworkConfiguration::random_for_seed() | All chaos parameters randomized per seed for comprehensive testing |
NetworkConfiguration::fast_local() | 1-10us latencies, all chaos disabled |
ChaosConfiguration::disabled() | Zero probability for every fault category |
StorageConfiguration::random_for_seed() | Randomized faults (0.001%-0.1%), varied IOPS (10K-100K), varied bandwidth (50-500 MB/s) |
StorageConfiguration::fast_local() | 1M IOPS, 1 GB/s bandwidth, 1us latencies, all faults disabled |
See Configuration Reference for the complete builder API and all configuration types.
Glossary
Terms are listed alphabetically. Cross-references are shown in bold.
Adaptive forking – An exploration strategy where the number of timelines spawned at each splitpoint varies based on coverage yield. Productive marks that discover new coverage get more budget; barren marks return their energy to the reallocation pool. Configured via AdaptiveConfig.
Always assertion – An assertion that must hold every time it is evaluated. Violations are recorded but do not panic, following the Antithesis principle. Checked by validate_assertion_contracts() after the simulation completes. See assert_always! and assert_always_or_unreachable!.
Antithesis principle – The design philosophy that assertions should never crash the program. Violations are recorded and reported, allowing the simulation to continue and discover cascading failures. All 15 Moonpool assertion macros follow this principle.
Attrition – Built-in chaos mechanism that randomly kills and restarts server processes during the chaos phase. Configured via the Attrition struct with probability weights for graceful, crash, and wipe reboots. Respects max_dead to limit simultaneous deaths.
Barren mark – An assertion mark whose recent batch of timelines produced no new coverage bitmap bits. In adaptive forking, barren marks stop early and return their remaining energy to the reallocation pool.
Buggify – Deterministic fault injection system inspired by FoundationDB’s BUGGIFY macro. When enabled (50% activation rate, 25% firing rate per seed), buggified code paths randomly fire to test error handling. Decisions are deterministic given the seed, so bugs are reproducible.
Chaos injection – The practice of deliberately introducing faults during simulation to test system resilience. Includes network partitions, connection failures, bit flips, clock drift, buggified delays, clogging, and process attrition. Configured via ChaosConfiguration.
Coverage bitmap – A 1024-byte (8192-bit) bitfield that records which assertion paths a timeline touched. When an assertion fires, it sets a bit at position hash(name) % 8192. The explored map is the union of all coverage bitmaps across all timelines.
Determinism – The property that given the same seed, the simulation produces exactly the same execution. All randomness flows through the seeded RNG, and all I/O is simulated. This makes bugs reproducible: same seed, same bug, every time.
Endpoint – A (IpAddr, Token) pair that uniquely identifies a connection endpoint in the simulated network. The IP address identifies the node; the token identifies the specific listener or connection on that node.
Event timeline – An append-only typed log attached to StateHandle. Workloads emit events via ctx.emit(key, event); invariants read them via state.timeline::<T>(key). Each entry carries time_ms, source (IP), and a global seq number for cross-timeline ordering. Distinct from timeline (a simulation run in the explorer).
Energy budget – A finite pool that limits how many timelines the explorer can spawn, preventing exponential blowup. In fixed-count mode, a single global counter. In adaptive mode, a 3-level system: global budget, per-mark budget, and reallocation pool.
Explored map – The union (bitwise OR) of all coverage bitmaps across all timelines. Lives in MAP_SHARED memory so all forked processes can see it. Used to determine whether a new timeline discovered anything its siblings did not. Preserved across seeds in multi-seed exploration.
Explorer – The multiverse exploration framework (moonpool-explorer crate). Uses fork() to create timeline branches at splitpoints, exploring alternate executions with different randomness. Has zero knowledge of Moonpool internals – communicates only through RNG function pointers.
Fault timeline – The well-known event timeline at key "sim:faults" (SIM_FAULT_TIMELINE). Automatically populated by the simulator with SimFaultEvent entries covering network, storage, and process lifecycle faults. Invariants use it to correlate application behavior with infrastructure events.
Fork – An OS-level fork() call that creates a child process sharing the parent’s memory via copy-on-write. Each child continues the simulation with a new seed, creating an alternate timeline. Forks are triggered at splitpoints.
Frontier – For assert_sometimes_all!: the maximum number of named conditions that have been simultaneously true. When the frontier advances (more conditions true at once than ever before), a splitpoint is triggered. The frontier value is preserved across seeds in multi-seed exploration.
Invariant – A property that must hold across the entire simulated system, checked after every simulation event. Invariants validate cross-process properties via a StateRegistry; invariant functions read state and panic on violation.
Mark – An assertion site that can trigger splitpoints in the explorer. Each mark has a name, a shared-memory slot index, and (in adaptive mode) its own energy allowance. Marks are the unit of exploration budget management.
Multiverse – The tree of all timelines explored from one root seed. Each splitpoint creates new children with different seeds. The multiverse is fully deterministic: given the same root seed and configuration, the same tree is produced.
Process – The system under test. A server node that can be killed and restarted (rebooted). Each process gets fresh in-memory state on every boot; persistence is only through storage. Created by a factory function registered via SimulationBuilder::processes(). Analogous to FoundationDB’s fdbd.
Provider – A trait abstraction over runtime services (time, tasks, network, random, storage). Real implementations (TokioTimeProvider, etc.) delegate to tokio; simulation implementations intercept calls for deterministic control. Code uses providers instead of calling tokio directly.
Reachable – An assertion kind (assert_reachable!) that marks a code path as “should be reached at least once.” On first reach, triggers a fork. A coverage violation is reported if the path is never reached after enough iterations.
Reallocation pool – In adaptive forking, a shared energy reserve fed by barren marks that return their unused per-mark budget. Productive marks can draw from this pool when their own budget runs out, enabling automatic resource redistribution.
Recipe – The sequence of splitpoints that leads to a specific timeline. Encoded as a list of (rng_call_count, child_seed) pairs. If a bug is found, the recipe enables exact replay via SimulationBuilder::replay_recipe(). Formatted as "151@seed -> 80@seed".
Seed – A u64 value that completely determines a simulation’s randomness. Same seed = same RNG sequence = same execution. Seeds can be set explicitly via set_debug_seeds() or generated automatically. The seed is the fundamental unit of reproducibility.
SimStorageProvider – The simulation implementation of StorageProvider. Constructed with an IP address (SimStorageProvider::new(sim, ip)) so all file operations are tagged with the owning process. Fault injection uses the per-process StorageConfiguration resolved by StorageState::config_for(ip).
Sometimes assertion – An assertion that should hold at least once across all iterations. Does not panic if false; instead, records statistics. On first success, triggers a fork in exploration mode. A coverage violation is reported if the condition is never true. See assert_sometimes!.
Splitpoint – A moment where the explorer decides to branch the multiverse. Occurs when a sometimes assertion succeeds for the first time, a numeric watermark improves, or a frontier advances. The RNG call count at the splitpoint, combined with the seed, identifies the exact program state.
Timeline – One complete simulation run. A seed plus a sequence of splitpoints uniquely identifies a timeline. The root timeline runs from the original seed; child timelines branch off at splitpoints with derived seeds.
Token – A u64 identifier for a specific listener or connection on a node. Combined with an IP address to form an endpoint. See also well-known token.
Virgin map – An alternative name for the explored map (borrowed from AFL/fuzzing terminology). Records which coverage bits have been seen across all timelines and all seeds. “Virgin” bits are those never set by any timeline.
Warm start – In multi-seed exploration, when a new seed begins with the explored map already containing coverage from previous seeds. Warm starts use a lower warm_min_timelines threshold for barren mark detection, since much coverage is already known. Enabled automatically by prepare_next_seed().
Watermark – For numeric sometimes assertions: the best value ever observed. For gt/ge, the watermark tracks the maximum; for lt/le, the minimum. When a new evaluation improves the watermark, a splitpoint is triggered to explore timelines that might push the metric further.
Well-known token – A reserved token in the range 0..WELL_KNOWN_RESERVED_COUNT used for framework services. Well-known tokens provide stable endpoints for services like RPC registries without requiring dynamic discovery.
Wire format – The on-the-wire message encoding used by moonpool-transport. Each WireMessage includes a WireHeader with endpoint routing, a unique ID, message type, and payload size, followed by the serialized payload. CRC32C checksums protect against bit flip corruption.
Workload – The test driver. A workload survives process reboots and drives requests against the system under test. It validates correctness by making assertions about observed behavior. Analogous to FoundationDB’s tester.actor.cpp.