Tests Are the New Prompt: Making AI Write Verifiable Code
A prompt is a lossy spec; a test is one the machine can evaluate. Here's why AI-written code needs a verifier, the failure modes tests catch, and a concrete workflow that lets an agent iterate to correct code without you in the loop.

A natural-language prompt is a lossy spec. You ask for "a function that validates email addresses," and the model produces something plausible — that quietly accepts user@.com, or rejects valid plus-addressing, or imports a package that does not exist. The code looks done. "Looks done," as Anthropic's own Claude Code documentation observes, "is the only signal available" unless you give the agent something better. A test is that something better: an executable, unambiguous statement of what "correct" means, which the agent can run, read and iterate against without you in the loop.
This is the shift behind the phrase "tests are the new prompt." You stop describing the behaviour you want in prose and start encoding it in checks the machine can evaluate. Here is why that works, the specific failure modes it catches, and a concrete workflow for making AI write verifiable code.
Why AI code needs a verifier at all
The headline failure mode is invention. A peer-reviewed study, "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs" (Spracklen et al., arXiv:2406.10279), tested 16 models across 576,000 generated code samples and found an overall package-hallucination rate of 19.7% — nearly one in five recommended packages does not exist. The split is stark: "at least 5.2% for commercial models and 21.7% for open-source models." Across the study, the researchers observed 205,474 unique hallucinated package names.
What makes this dangerous rather than merely annoying is repeatability. As Socket's analysis of the same work reports, when triggering prompts were rerun ten times each, 43% of hallucinated packages were repeated every time and 58% appeared more than once. A predictable hallucination is an attack surface: an adversary registers the fake name and publishes malware under it (the technique is called slopsquatting, a term credited to the Python Software Foundation's Seth Larson and popularised in April 2025), and the next developer who blindly installs it runs attacker code. A plausible-looking import statement is not evidence the dependency exists. Only an execution check (install, import, run) catches it.
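The import half of that execution check can be sketched in a few lines of Python. This is a minimal sketch, assuming the dependencies have already been installed into the current environment. The function name `missing_modules` and the placeholder package name are hypothetical. It only shows that a name resolves locally; it says nothing about whether a package on the registry is trustworthy.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package is absent or the name is malformed.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "json" ships with Python; the second name is a made-up placeholder.
    print(missing_modules(["json", "surely_not_a_real_package_xyz"]))
```

Run against the imports an agent proposes, an empty result means every name resolved. A non-empty result is a pass/fail signal the agent can read back before anything ships.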
The mechanism: a test is an executable spec
Anthropic's Claude Code best-practices documentation states the principle directly: "Give Claude a check it can run: tests, a build, a screenshot to compare. It's the difference between a session you watch and one you walk away from." And on why: "Claude stops when the work looks done. Without a check it can run, 'looks done' is the only signal available... Give Claude something that produces a pass or fail, and the loop closes on its own. Claude does the work, runs the check, reads the result, and iterates until the check passes."
That closing-of-the-loop is the whole game. A failing test encodes the exact expected behaviour as something the machine can evaluate, so the model cannot fill ambiguity with a guess. Anthropic contrasts a weak prompt ("implement a function that validates email addresses") with a strong one: "write a validateEmail function. example test cases: user@example.com is true, invalid is false, user@.com is false. run the tests after implementing." The second prompt is a test. The check does not have to be a unit test, either — Anthropic lists "a test suite, a build exit code, a linter, a script that diffs output against a fixture, or a browser screenshot compared against a design" as valid pass/fail signals.
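Written out as code, the strong prompt's example cases look something like the sketch below. The three quoted cases come straight from the prompt above. The plus-addressing case is an addition based on the failure mentioned in the introduction. The Python name `validate_email` and the regular expression are illustrative assumptions, not a complete email validator.

```python
import re

# Illustrative pattern only: local part, "@", then at least two dot-separated
# domain labels. Real-world email validation has many more rules.
_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")

# (address, expected result) pairs: the executable form of the spec.
CASES = [
    ("user@example.com", True),
    ("invalid", False),
    ("user@.com", False),
    ("user+tag@example.com", True),  # added case: plus-addressing is valid
]


def validate_email(address: str) -> bool:
    """Return True if the whole string matches the illustrative pattern."""
    return _EMAIL_RE.fullmatch(address) is not None


def failing_cases(validator, cases):
    """Return the addresses for which validator disagrees with the spec."""
    return [addr for addr, expected in cases if validator(addr) != expected]


if __name__ == "__main__":
    failures = failing_cases(validate_email, CASES)
    print("PASS" if not failures else f"FAIL: {failures}")
```

The cases, not the implementation, are the part worth keeping fixed. Any candidate `validate_email` the agent produces can be passed through `failing_cases` and judged the same way.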
The documented failure this prevents is what Anthropic calls the "trust-then-verify gap": "Claude produces a plausible-looking implementation that doesn't handle edge cases. Fix: Always provide verification (tests, scripts, screenshots). If you can't verify it, don't ship it."
What each kind of check catches
Different checks catch different LLM failure modes. Match the check to the mistake:
| Check | What it catches in AI output |
|---|---|
| Install + import smoke test | Hallucinated / slopsquatted packages (~19.7% of recommended packages don't exist) |
| Unit tests with explicit example cases | Wrong logic, missed edge cases (the "plausible but wrong" gap) |
| Build exit code / typecheck | API misuse, signature drift, nonexistent symbols |
| Linter | Style and convention drift, dead code, obvious smells |
| Output-vs-fixture diff script | Behavioural regressions against a known-good baseline |
| Screenshot vs. design | UI rendering errors invisible to text-only checks |
| Full repository test suite | Whether a change actually resolves the real problem |
| Adversarial reviewer (fresh context) | Gaps the implementing agent rationalised away; test tampering |
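As an illustration of one row, the output-vs-fixture check can be a few lines built on Python's standard `difflib`. This is a sketch under assumptions: the function names are hypothetical, and in practice the expected text would be read from a committed fixture file rather than passed in as a string.

```python
import difflib


def diff_against_fixture(actual: str, expected: str) -> list[str]:
    """Return unified-diff lines; an empty list means the output matches."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="fixture",
            tofile="actual",
            lineterm="",
        )
    )


def check_passes(actual: str, expected: str) -> bool:
    """Pass/fail form of the same comparison."""
    return not diff_against_fixture(actual, expected)
```

The diff lines are as useful as the verdict. On failure they tell the agent exactly which lines drifted from the baseline, which gives it something concrete to iterate against.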
The how-to: a verification-first workflow
Encode the spec as checks before the agent writes the implementation, then let it iterate against a fixed target.
| Stage | Action | Why it anchors the agent |
|---|---|---|
| 1. Spec | Write acceptance criteria first — what "done" means, with concrete examples | Intent becomes the source of truth, not the prose prompt |
| 2. Red | Write tests encoding that spec; run them; confirm they fail | A failing test is an executable target and proves the test exercises the behaviour |
| 3. Commit | Commit the failing tests as a checkpoint | Any later edit to a test shows in the diff and is revertable |
| 4. Green | Let the agent implement, run the suite, read failures, iterate — without editing the tests | Closes the loop unattended against a fixed target |
| 5. Gate | Enforce with a Stop hook or goal condition: the check must pass before the turn ends | Deterministic; the agent can't declare "done" while red |
| 6. Review | Fresh-context reviewer checks the diff against the spec | The author isn't the grader; catches scope creep and gaming |
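The gate in stage 5 reduces to a script that runs the check and reports a pass or fail through its exit code. The sketch below is generic and the names `run_check` and `gate` are hypothetical. How a particular agent tool wires such a script into a hook, and how it interprets the exit code, is tool-specific and not shown here.

```python
import subprocess
import sys


def run_check(command: list[str]) -> tuple[bool, str]:
    """Run a check command; return (passed, combined stdout and stderr)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def gate(command: list[str]) -> int:
    """Return 0 if the check passed, 1 otherwise, echoing failures to stderr."""
    passed, output = run_check(command)
    if not passed:
        # Surface the failure output so whoever reads the gate can act on it.
        print(output, file=sys.stderr)
    return 0 if passed else 1
```

A typical entry point would be `sys.exit(gate([sys.executable, "-m", "pytest", "-q"]))`, assuming pytest is the project's test runner. Any command with a meaningful exit code works the same way.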
Stop the agent from cheating
The obvious exploit is the agent editing the test, or weakening an assertion, to turn red into green. Anthropic documents three escalating gates against this: ask it to run the check and iterate in one prompt; attach a goal condition that a separate evaluator re-checks after every turn; and add a Stop hook that runs the check as a deterministic script and blocks the turn from ending until it passes. They also recommend running an adversarial review subagent in fresh context, "so the agent doing the work isn't the one grading it," and having Claude "show evidence rather than asserting success." Committing the tests first (stage 3) is the cheap backstop: if the agent touches a test, the change is right there in the diff, and you revert it.
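The version-control diff is the primary record of test edits. Where a scripted check is wanted as well, one option is to fingerprint the test files before the agent runs and compare afterwards. The sketch below is a swapped-in technique, content hashing rather than a git diff, and the names `snapshot` and `tampered` are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(test_dir: str) -> dict[str, str]:
    """Map each file under test_dir to the SHA-256 of its contents."""
    root = Path(test_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def tampered(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified between snapshots."""
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

Take one snapshot when the failing tests are committed and another when the agent reports it is finished. A non-empty `tampered` result is a reason to inspect the diff before trusting a green run.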
The benchmark version of the same idea
This discipline scales. SWE-bench Verified is a set of 500 human-validated, real GitHub issues, released by OpenAI in collaboration with the SWE-bench authors. A model receives the issue and the full repository, must output a diff, and succeeds only if the repository's hidden test suite passes after its patch is applied (executed in Docker). Tests are the grader: exactly the local TDD-with-agent loop, run at benchmark scale. Frontier coding agents now resolve a large majority of SWE-bench Verified tasks (reported figures for the strongest 2026 models sit in the high-80s percent), though exact leaderboard standing shifts week to week, so treat any single number as a snapshot.
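The grading rule can be stated in a few lines. SWE-bench's task format separates tests the patch must turn from failing to passing and tests that must keep passing. The sketch below is a simplified illustration of that rule, not the benchmark's actual harness, and the function and parameter names are assumptions.

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Return True only if every required test passed after the patch.

    results maps test names to pass/fail after applying the patch.
    A required test that is missing from results counts as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(test, False) for test in required)
```

A patch that fixes the target behaviour but breaks an existing test fails under this rule, the same standard the local red-to-green loop applies.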
Spec-driven development takes it further
If tests are the spec, why not make the spec a first-class artefact? That is the premise of GitHub Spec Kit, an open-source toolkit GitHub released on 2 September 2025 to bring spec-driven development to AI agents, with a four-phase workflow — Specify → Plan → Tasks → Implement — and support for agents including GitHub Copilot, Claude Code and Gemini CLI. Its framing: "We're moving from 'code is the source of truth' to 'intent is the source of truth.'" Tests and executable specs are two expressions of the same move — pin the intent down precisely enough that the agent's job becomes satisfying it, not guessing it.
FAQ
Doesn't writing tests first slow me down compared to just prompting? It moves the work earlier rather than adding to it. The alternative is shipping plausible-but-wrong code and finding the bugs in production. The tests also let the agent self-correct in a loop instead of handing you broken output to debug manually.
Can the AI write the tests too? It can draft them, but review them yourself or with a fresh-context reviewer — the same model that writes a wrong implementation can write a test that "agrees" with it. Concrete example cases you supply (the validateEmail-style cases) are the most reliable anchor.
How do I stop the agent from editing tests to make them pass? Commit the tests first so any change shows in the diff, gate the agent's "done" on a deterministic script (a Stop hook), and use a separate reviewer so the author isn't the grader. Anthropic documents all three.
Do screenshots and linters really count as "tests"? For an agent's purposes, yes — anything that produces a deterministic pass/fail it can read back qualifies. Anthropic explicitly lists build exit codes, linters, fixture diffs and screenshot comparisons alongside unit tests.
Bottom line
AI writes code that looks right far better than it writes code that is right — and the gap is measurable, from one-in-five hallucinated packages to edge cases that survive a review skim. A test closes that gap by turning "correct" into something the machine can evaluate, so the agent iterates to green on its own. Write the checks first, commit them, gate the agent's "done" on them passing, and have someone other than the author grade the result. The prompt got you a draft; the test is what makes it verifiable.
Sources and further reading
- Anthropic: Best practices for Claude Code https://code.claude.com/docs/en/best-practices
- Spracklen et al.: We Have a Package for You! (Package Hallucinations), arXiv:2406.10279 https://arxiv.org/abs/2406.10279
- GitHub Blog: Spec-driven development with AI — a new open source toolkit https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/
- Socket: Slopsquatting — How AI Hallucinations Are Fueling a New Class of Supply Chain Attacks https://socket.dev/blog/slopsquatting-how-ai-hallucinations-are-fueling-a-new-class-of-supply-chain-attacks


