
Tests are not enough (for gen-ai code)

Automated tests correlate with code quality but they don't cause code quality. Just because generated code has a lot of tests doesn't mean it works or is any good.

The challenge with automated tests has always been:

  • Knowing if the engineer has written enough tests to capture every edge case
  • Verifying that those tests are constructed well enough to test what they claim to be testing

I've run into a few situations where Claude/Copilot will say "I'll just make these tests pass for now, fixing them is out of scope".

A diligent operator might catch that happening and fix the issue. But if they don't, it's tempting to just look at the generated files and say "wow, look how many tests this code has, it must be good".

The same issue exists in reviews. Senior engineers, already burnt out from reviewing everyone's giant ai-assisted PRs, can't realistically read production and test code line by line and spot micro-errors - people aren't compilers.

This problem isn't new, it exists in plenty of human-written codebases. I've deleted entire implementations from production code only for the tests to keep passing (turns out the whole thing was just testing mocks).
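
As a rough illustration of how that happens (the PaymentService idea and the charge call are invented for the example), a test like this keeps passing no matter what the production code does:

```python
# Hypothetical example of a "test" that only exercises its own mocks.
from unittest.mock import MagicMock

def test_charge_succeeds():
    service = MagicMock()                      # stands in for the real PaymentService
    service.charge.return_value = {"ok": True}

    result = service.charge(amount_cents=500)  # never touches production code

    assert result == {"ok": True}              # asserts the mock's own canned value
    service.charge.assert_called_once()        # still passes if the real charge() is deleted
```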

My guess right now (assuming we want the productivity gains everyone is promising gen-ai will deliver) is that we need to move to some sort of systems-testing-systems approach. Maybe even with adversarial goals...

  • Agent A's goal is to build a sub-system to a certain spec with a fixed interface
  • Agent B's goal is to build a test harness that drives the interface and tries to break the sub-system. Bonus points if it spits out genuinely useful information about how it works

I like this approach because it also speeds up reviews and can be added to CI pipelines. It moves the sign-off from "looks good to me" to "does your code survive the 10,000 reps in the torment-nexus / desert world of Arrakis".
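
To make that a bit more concrete, here's a minimal sketch of the kind of harness Agent B might spit out - a model-based fuzzer that drives a fixed interface with random operations and checks invariants on every step. The Queue interface and its invariants are made up for illustration; a real harness would be driven by the spec.

```python
# Sketch of an adversarial test harness: hammer a fixed interface with
# randomised operations and compare against a trivially-correct model.
import random

class Queue:
    """Toy stand-in for Agent A's sub-system (fixed interface: push/pop/size)."""
    def __init__(self):
        self._items = []
    def push(self, x):
        self._items.append(x)
    def pop(self):
        return self._items.pop(0)
    def size(self):
        return len(self._items)

def torment(reps=10_000, seed=0):
    rng = random.Random(seed)
    for rep in range(reps):
        q, model = Queue(), []            # model = reference implementation
        for _ in range(rng.randint(1, 50)):
            if model and rng.random() < 0.4:
                assert q.pop() == model.pop(0), f"pop mismatch on rep {rep}"
            else:
                x = rng.randint(0, 9)
                q.push(x)
                model.append(x)
            assert q.size() == len(model), f"size mismatch on rep {rep}"
    print(f"survived {reps} reps")

if __name__ == "__main__":
    torment()
```

The point isn't this particular harness; it's that the check is built independently of the implementation and runs thousands of randomised reps instead of a handful of hand-written cases.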

Automated tests are still useful but they aren't sufficient.