Unit tests check that swords are sharp; sticking your system in a simulator that fires large amounts of production-like data at it checks that your code survives in glorious battle.
Coding agents make writing simulators for your systems achievable in terms of cost and time, and outside-the-box simulation is possibly the best way we have of shipping AI-generated systems with confidence.
Verification is hard
I've written before about how the existence of tests probably isn't enough to sign off on agent-generated code (or human-generated code, for that matter). Verifying that an entire system authored by coding agents works correctly is even harder.
The short version is that writing really good tests requires much more imagination than implementing a spec. To write good tests you need to be able to picture all of the non-obvious ways that something can break... while agents can do basic TDD loops, I don't think they can currently muster the same level of imaginative cynicism as a weary senior engineer who's been woken up at 3am one too many times.
The other issue is that agents go fast. Usually we ship systems slice by slice, with some time between slices to test that things work. If we compress development and ship entire systems at once, then that verification time disappears too.
But instead of looking at what's hard or difficult, let's look at what agents make possible...
The simulator already (mostly) exists in the negative space of your specs
If you already have a spec (or a schema, or other documentation) for a system, call it S, and you can feed it into an agent A such that A(S) = system, then you also have most of the spec for a system that tests your system (let's call it a simulator). Roughly, S + d = simulator spec, where d is the sort of thing you probably want to be considering when doing software engineering anyway: interfaces, protocols, response times, orders of magnitude of traffic. Does it work? = A(S + d)(A(S)).
If you're using gen-ai to generate your system, why not also use it to generate a complete simulator to test that your code works for X reps of production input, where X is a very big number?
The simulator doesn't need to know anything about the actual implementation of the system. In fact it's good that it doesn't: this means our simulator is isolated from any assumptions made by any particular implementation of S (and, interestingly, works for all implementations of S, should we ever want to rewrite our system). It also means we can generate our simulator in parallel with our system.
Agents aren't people so I feel okay with a slightly adversarial framing here. Agent 1 writes the system. Agent 2 writes the system that tries to break the system, let's see which one wins.
Properties of a good system + simulator pair
For the simulator to do a good job of telling us whether our code works, the system and the simulator each need certain properties.
The simulator needs to -
- Be deterministic: we provide a seed, and all runs should be identical for that seed. When we detect a problem we can replay the run both to understand what happened and to check that it has been fixed
- Be fast enough to throw a meaningful amount of production-like input at our system - languages like Zig or Rust seem like a good fit here
- Talk the same language as our system will in production: if our system is a web server, the simulator needs to make HTTP requests, etc
- Produce production-like data across a range of scenarios
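The determinism property is easy to get right if every bit of randomness flows from a single seeded generator. Here's a minimal sketch in Python (the operation names are hypothetical, and a real simulator might well be in Zig or Rust as suggested below; Python is used purely for brevity):

```python
import random

def generate_run(seed: int, steps: int = 1000):
    """Generate a reproducible stream of simulated requests.

    All randomness flows from one seeded RNG instance, so replaying
    with the same seed reproduces the exact same sequence of events.
    """
    rng = random.Random(seed)
    ops = []
    for _ in range(steps):
        # Hypothetical domain operations for illustration only.
        op = rng.choice(["deposit", "withdraw", "transfer"])
        amount = rng.randint(1, 10_000)
        ops.append((op, amount))
    return ops

# Replaying a failing seed yields an identical run:
assert generate_run(seed=42) == generate_run(seed=42)
```

The key design choice is never touching the global RNG (or wall-clock time, or any other ambient source of nondeterminism): everything downstream of the seed is a pure function of it, which is what makes "replay the run" possible at all.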
Our system needs to -
- Fail hard when it hits illegal states* for the domain. If a balance should never be zero and our system encounters a zero balance, then we want it to blow up and exit: no try/catch, no recovery, do the minimum essential tidy-up and fail with a non-zero exit code**
- Have a solid understanding of its interfaces from day one (they can and should evolve but we should at least know what they are when they change)
- Externalise state - if it can crash hard then process memory isn't the right place for important data
- Tend toward being stateless - our system takes an input (production or simulated) and does something useful with it; it also has a mechanism we can use to tell whether something useful happened.
- Be restartable - start quickly, pick up where previous work left off, etc
*An illegal state is not a validation error, it's a situation that should never happen according to the rules of your domain.
**As an aside, this is often better default behaviour for most systems. There's a whole host of bugs that get created and perpetuated when systems attempt to recover from impossible or illegal states under the false pretence that "good engineering means code never crashes".
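The fail-hard rule can be captured in a tiny helper. A hypothetical sketch (the `assert_legal` helper and the balance rule are illustrative, not from any real system):

```python
import sys

def assert_legal(condition: bool, message: str) -> None:
    """Crash hard on illegal states: no recovery, minimal tidy-up,
    and a non-zero exit code so the simulator (or a supervisor)
    can see the failure."""
    if not condition:
        print(f"ILLEGAL STATE: {message}", file=sys.stderr)
        sys.exit(1)

def apply_withdrawal(balance: int, amount: int) -> int:
    new_balance = balance - amount
    # Hypothetical domain rule: balances must stay positive.
    # Hitting zero here is not a validation error to report back to
    # the caller - it is a state the domain says cannot exist.
    assert_legal(new_balance > 0, f"balance hit {new_balance}")
    return new_balance
```

Because the process exits non-zero rather than swallowing the error, the simulator can treat any crashed run's seed as a reproducible bug report.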
Most of these ideas aren't new. TigerBeetle famously simulates 30-odd years of production traffic across different scenarios to test resilience. Erlang follows a "let it crash" philosophy thanks to its process model. NASA uses assertions in code to build safety-critical systems... the key point is that agents lower the cost of entry to high-rigour engineering while at the same time giving us tools for making systems more robust.
Avoiding upfront design
One argument against "sim-driven development" (if only from myself) might be that it feels like big upfront design, which is bad(TM).
The difference between big upfront design and this approach is in the half-life of your assumptions.
Sim-driven development is about capturing the assumptions about your system (either in the spec, system or simulator) and then validating them quickly by actually running your code in production like conditions.
If something blows up then rejoice: we learned something without ruining a real user's day, and we can update the spec, system and/or simulator accordingly and get in more reps.
Crucially the simulator is a living artefact that exists alongside your code. If you're building a rocket no one questions the value of a gantry. If you're building a race car no one questions the value of a test track. The simulator is part of the critical infrastructure and possibly ends up living longer than the system itself.
Conclusion
If agents can give us extra engineering capacity (and I think they can) then we can use that capacity to find cool new ways of building better systems.
In a "traditional" engineering project, the idea of building a complete simulation of production to test your system would typically get shut down due to budget and time constraints, unless you were building some sort of safety- or security-critical system.
Coding agents make these approaches more tractable for "every day" engineering problems.
Personally I think "does this system survive X years of production traffic without encountering an illegal state?" is a much better quality gate than "did we catch all the issues via code-review roulette?".
Ultimately engineering is about systems and leverage. We no longer write assembly, the compiler does it. We no longer manually manage memory (most of the time)... but we still build systems. If the tooling changes the work becomes finding the new leverage. I think agents and gen-ai will change how software gets delivered in more ways than we have currently seen but there will still be room for human input in the process.