Tests Are the New Prompt: Making AI Write Verifiable Code

A prompt is a lossy spec; a test is one the machine can evaluate. Here's why AI-written code needs a verifier, the failure modes tests catch, and a concrete workflow that lets an agent iterate to correct code without you in the loop.

Alex Rivera · Jun 21, 2026 · updated Jun 18, 2026


Table of contents

Why AI code needs a verifier at all
The mechanism: a test is an executable spec
What each kind of check catches
The how-to: a verification-first workflow
The benchmark version of the same idea
Spec-driven development takes it further
FAQ
Bottom line
Sources and further reading

A natural-language prompt is a lossy spec. You ask for "a function that validates email addresses," and the model produces something plausible that quietly accepts user@.com, rejects valid plus-addressing, or imports a package that does not exist. The code looks done. "Looks done," as Anthropic's own Claude Code documentation observes, "is the only signal available" unless you give the agent something better. A test is that something better: an executable, unambiguous statement of what "correct" means, which the agent can run, read and iterate against without you in the loop.

This is the shift behind the phrase "tests are the new prompt." You stop describing the behaviour you want in prose and start encoding it in checks the machine can evaluate. Here is why that works, the specific failure modes it catches, and a concrete workflow for making AI write verifiable code.

Why AI code needs a verifier at all

The headline failure mode is invention. A peer-reviewed study, "We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs" (Spracklen et al., arXiv:2406.10279), tested 16 models across 576,000 generated code samples and found an overall package-hallucination rate of 19.7% — nearly one in five recommended packages does not exist. The split is stark: "at least 5.2% for commercial models and 21.7% for open-source models." Across the study, the researchers observed 205,474 unique hallucinated package names.

What makes this dangerous rather than merely annoying is repeatability. As Socket's analysis of the same work reports, when triggering prompts were rerun ten times each, 43% of hallucinated packages were repeated every time and 58% appeared more than once. A predictable hallucination is an attack surface: an adversary registers the fake name with malware (the technique is called slopsquatting, coined by the Python Software Foundation's Seth Larson in April 2025), and the next developer who blindly installs it runs attacker code. A plausible-looking import statement is not evidence the dependency exists. Only an execution check — install, import, run — catches it.
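The import half of that check can be very small. The following is a minimal sketch in Python, assuming the dependencies have already been installed into the current environment. The helper name unresolved_modules and the example module names are hypothetical. A name that resolves locally only shows that something is installed under that name, not that it is the package the author intended, so this complements a review of the dependency list rather than replacing it.

```python
import importlib.util


def unresolved_modules(module_names):
    """Return the names that cannot be resolved in the current environment."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec raises when the parent package of a dotted name is missing.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# "json" ships with Python; the second name is a made-up placeholder.
print(unresolved_modules(["json", "surely_not_a_real_package_xyz"]))
```

A non-empty result is a fail signal the agent can read back before any of the generated code is trusted.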

The mechanism: a test is an executable spec

Anthropic's Claude Code best-practices documentation states the principle directly: "Give Claude a check it can run: tests, a build, a screenshot to compare. It's the difference between a session you watch and one you walk away from." And on why: "Claude stops when the work looks done. Without a check it can run, 'looks done' is the only signal available... Give Claude something that produces a pass or fail, and the loop closes on its own. Claude does the work, runs the check, reads the result, and iterates until the check passes."

That closing-of-the-loop is the whole game. A failing test encodes the exact expected behaviour as something the machine can evaluate, so the model cannot fill ambiguity with a guess. Anthropic contrasts a weak prompt ("implement a function that validates email addresses") with a strong one: "write a validateEmail function. example test cases: user@example.com is true, invalid is false, user@.com is false. run the tests after implementing." The second prompt is a test. The check does not have to be a unit test, either — Anthropic lists "a test suite, a build exit code, a linter, a script that diffs output against a fixture, or a browser screenshot compared against a design" as valid pass/fail signals.
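Written out as code, the strong prompt's example cases might look like the sketch below. It is in Python, so validate_email stands in for the validateEmail of the quoted prompt. The regular expression is a deliberately simple illustration, not a complete address validator, and the plus-addressing case is an extra example taken from the failure modes mentioned in the introduction.

```python
import re
import unittest

# Simple illustrative pattern: non-empty local part, then a dotted domain
# whose labels are all non-empty. Not a full RFC-grade validator.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def validate_email(address: str) -> bool:
    """Return True if the whole string matches the illustrative pattern."""
    return _EMAIL.fullmatch(address) is not None


class ValidateEmailTest(unittest.TestCase):
    def test_cases_from_the_prompt(self):
        self.assertTrue(validate_email("user@example.com"))
        self.assertFalse(validate_email("invalid"))
        self.assertFalse(validate_email("user@.com"))

    def test_plus_addressing_is_accepted(self):
        self.assertTrue(validate_email("user+tag@example.com"))
```

Saved to a file and run with python -m unittest, the test class gives the agent the pass or fail signal; the implementation above is just one version that happens to satisfy it.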

The documented failure this prevents is what Anthropic calls the "trust-then-verify gap": "Claude produces a plausible-looking implementation that doesn't handle edge cases. Fix: Always provide verification (tests, scripts, screenshots). If you can't verify it, don't ship it."

What each kind of check catches

Different checks catch different LLM failure modes. Match the check to the mistake:

Check	What it catches in AI output
Install + import smoke test	Hallucinated / slopsquatted packages (~19.7% of recommended packages don't exist)
Unit tests with explicit example cases	Wrong logic, missed edge cases (the "plausible but wrong" gap)
Build exit code / typecheck	API misuse, signature drift, nonexistent symbols
Linter	Style and convention drift, dead code, obvious smells
Output-vs-fixture diff script	Behavioural regressions against a known-good baseline
Screenshot vs. design	UI rendering errors invisible to text-only checks
Full repository test suite	Whether a change actually resolves the real problem
Adversarial reviewer (fresh context)	Gaps the implementing agent rationalised away; test tampering
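The output-vs-fixture row is one of the easiest checks to script. This is a minimal sketch using Python's standard difflib module; the function name diff_against_fixture is hypothetical, and a real script would read the fixture from a file and exit non-zero whenever the diff is non-empty.

```python
import difflib


def diff_against_fixture(actual: str, expected: str) -> list[str]:
    """Return unified-diff lines; an empty list means the output matches."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="fixture",
            tofile="actual",
            lineterm="",
        )
    )


# Identical text produces no diff lines, which is the "pass" case.
print(diff_against_fixture("total: 3\n", "total: 3\n"))
```

Returning the diff lines rather than a bare boolean gives the agent something specific to read when the check fails.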

The how-to: a verification-first workflow

Encode the spec as checks before the agent writes the implementation, then let it iterate against a fixed target.

Stage	Action	Why it anchors the agent
1. Spec	Write acceptance criteria first — what "done" means, with concrete examples	Intent becomes the source of truth, not the prose prompt
2. Red	Write tests encoding that spec; run them; confirm they fail	A failing test is an executable target and proves the test exercises the behaviour
3. Commit	Commit the failing tests as a checkpoint	Any later edit to a test shows in the diff and is revertable
4. Green	Let the agent implement, run the suite, read failures, iterate — without editing the tests	Closes the loop unattended against a fixed target
5. Gate	Enforce with a Stop hook or goal condition: the check must pass before the turn ends	Deterministic; the agent can't declare "done" while red
6. Review	Fresh-context reviewer checks the diff against the spec	The author isn't the grader; catches scope creep and gaming
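Stage 5 amounts to a script whose exit code the agent cannot argue with. The following is a rough sketch, not Anthropic's implementation: the function name gate is hypothetical, and the blocking exit code of 2 is an assumption to confirm against your agent's hook documentation before relying on it.

```python
import subprocess
import sys

# Assumption: the exit code the hook runner treats as "block the turn".
BLOCK_EXIT_CODE = 2


def gate(check_command: list[str]) -> int:
    """Run the check; return 0 if it passed, BLOCK_EXIT_CODE otherwise."""
    result = subprocess.run(check_command, capture_output=True, text=True)
    if result.returncode == 0:
        return 0
    # Surface the failure output so the agent can read it and iterate.
    sys.stderr.write(result.stdout + result.stderr)
    return BLOCK_EXIT_CODE
```

A wrapper script would end with something like sys.exit(gate([sys.executable, "-m", "unittest", "discover"])), so the pass or fail decision comes from the test run itself and not from the agent's summary of it.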

Stop the agent from cheating

The obvious exploit is the agent editing the test, or weakening an assertion, to turn red into green. Anthropic documents three escalating gates against this: ask it to run the check and iterate in one prompt; attach a goal condition that a separate evaluator re-checks after every turn; and add a Stop hook that runs the check as a deterministic script and blocks the turn from ending until it passes. They also recommend running an adversarial review subagent in fresh context, "so the agent doing the work isn't the one grading it," and advise you to "have Claude show evidence rather than asserting success." Committing the tests first (stage 3) is the cheap backstop: if the agent touches a test, the change is right there in the diff, and you revert it.
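Tamper detection can also be automated. The sketch below swaps in content hashes for git so that it stays self-contained: it records a SHA-256 digest for every file in the test directory before the agent starts, then reports anything added, removed or modified afterwards. The names snapshot and changed_tests are hypothetical. With the tests committed, a git diff of the test directory against the checkpoint commit gives the same signal.

```python
import hashlib
from pathlib import Path


def snapshot(test_dir: str) -> dict[str, str]:
    """Map each file under test_dir to the SHA-256 digest of its contents."""
    root = Path(test_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_tests(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return files that were added, removed, or modified between snapshots."""
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

Take one snapshot when the failing tests are committed and another when the agent reports success; a non-empty changed_tests result means the target moved and the green run should not be trusted yet.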

The benchmark version of the same idea

This discipline scales. SWE-bench Verified is a set of 500 human-validated, real GitHub issues, built with OpenAI, where a model receives the issue and the full repository, must output a diff, and succeeds only if the repository's hidden test suite passes after its patch (executed in Docker). Tests are the grader — exactly the local TDD-with-agent loop, run at benchmark scale. Frontier coding agents now resolve a large majority of SWE-bench Verified tasks (reported figures for the strongest 2026 models sit in the high-80s percent), though exact leaderboard standing shifts week to week, so treat any single number as a snapshot.
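The grading rule itself is simple enough to sketch. The SWE-bench dataset records, for each task, the tests that must flip from failing to passing (FAIL_TO_PASS) and the tests that must keep passing (PASS_TO_PASS). The function below illustrates that rule only; it is not the benchmark's harness, and the name is_resolved and the test identifiers are made up for the example.

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """A patch counts only if every target test passes and nothing regresses."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test that is missing from the results is treated as a failure.
    return all(results.get(test, False) for test in required)


# Hypothetical run: the target test now passes and the existing one still does.
print(is_resolved({"test_fix": True, "test_existing": True},
                  ["test_fix"], ["test_existing"]))
```

The same two lists map onto the local workflow: the tests written at the Red stage are the ones that must flip, and the rest of the suite is what must not regress.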

Spec-driven development takes it further

If tests are the spec, why not make the spec a first-class artefact? That is the premise of GitHub Spec Kit, an open-source toolkit GitHub released on 2 September 2025 to bring spec-driven development to AI agents, with a four-phase workflow — Specify → Plan → Tasks → Implement — and support for agents including GitHub Copilot, Claude Code and Gemini CLI. Its framing: "We're moving from 'code is the source of truth' to 'intent is the source of truth.'" Tests and executable specs are two expressions of the same move — pin the intent down precisely enough that the agent's job becomes satisfying it, not guessing it.

FAQ

Doesn't writing tests first slow me down compared to just prompting? It moves the work earlier rather than adding to it. The alternative is shipping plausible-but-wrong code and finding the bugs in production. The tests also let the agent self-correct in a loop instead of handing you broken output to debug manually.

Can the AI write the tests too? It can draft them, but review them yourself or with a fresh-context reviewer — the same model that writes a wrong implementation can write a test that "agrees" with it. Concrete example cases you supply (the validateEmail-style cases) are the most reliable anchor.

How do I stop the agent from editing tests to make them pass? Commit the tests first so any change shows in the diff, gate the agent's "done" on a deterministic script (a Stop hook), and use a separate reviewer so the author isn't the grader. Anthropic documents all three.

Do screenshots and linters really count as "tests"? For an agent's purposes, yes — anything that produces a deterministic pass/fail it can read back qualifies. Anthropic explicitly lists build exit codes, linters, fixture diffs and screenshot comparisons alongside unit tests.

Bottom line

AI writes code that looks right far better than it writes code that is right — and the gap is measurable, from one-in-five hallucinated packages to edge cases that survive a review skim. A test closes that gap by turning "correct" into something the machine can evaluate, so the agent iterates to green on its own. Write the checks first, commit them, gate the agent's "done" on them passing, and have someone other than the author grade the result. The prompt got you a draft; the test is what makes it verifiable.

Sources and further reading

Anthropic: Best practices for Claude Code https://code.claude.com/docs/en/best-practices
Spracklen et al.: We Have a Package for You! (Package Hallucinations), arXiv:2406.10279 https://arxiv.org/abs/2406.10279
GitHub Blog: Spec-driven development with AI — a new open source toolkit https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/
Socket: Slopsquatting — How AI Hallucinations Are Fueling a New Class of Supply Chain Attacks https://socket.dev/blog/slopsquatting-how-ai-hallucinations-are-fueling-a-new-class-of-supply-chain-attacks