
TDD with AI agents

Write the tests yourself. Let the agent make them pass. Ship with confidence.

The hardest part of working with AI agents isn't getting them to write code. It's knowing if the code is correct.

You can't review every line. You can't catch every edge case by reading diffs. You need an objective signal: does this work or not?

The answer is TDD. Write the tests yourself. Let the agent make them pass.

Why TDD works with agents

TDD gives you three things that matter when you're not writing the implementation:

1. Tests are the spec. When you write a test, you're encoding exactly what "correct" means. Not vague acceptance criteria—executable assertions.

2. Verification is automatic. You don't have to review code hoping you catch bugs. Tests pass or they don't. Red means wrong, green means done.

3. Your expertise goes where it matters. You know the edge cases, the business logic, the things that break in production. Put that knowledge in tests, not in implementation details the agent handles better anyway.

Why you write the tests, not the agent

It's tempting to let the agent write tests too. Don't.

When AI writes both tests and implementation, it creates a self-reinforcing failure loop: the tests verify the implementation, the implementation satisfies the tests, but both are wrong. The AI confirms its own work rather than validating against actual requirements.

Research on AI-generated tests shows they frequently verify buggy behavior—the AI creates tests that confirm how the code currently behaves rather than how it should behave. You end up with high coverage and false confidence.

The fix is simple: you write the tests, the agent writes the code. Your tests encode your understanding of requirements. The agent's job is to satisfy that understanding, not define it.

The workflow

I use spec-driven development with TDD at the core:

1. Spec (/sdd-spec)

I use custom commands to structure this. /sdd-spec creates a spec document with:

  • Problem — What are we solving?
  • Scope — What's in, what's out?
  • Acceptance criteria — Testable conditions labeled AC-1, AC-2, etc.
  • Risks — What could go wrong?

The spec forces clarity before any code is written. If I can't articulate the acceptance criteria, I'm not ready to build.

2. Tests first (/sdd-tasks generates test-first tasks)

Write tests that encode the acceptance criteria. This is where your engineering judgment matters most.

describe('BookmarkSearch', () => {
  it('returns results ranked by relevance', async () => {
    // Your expertise: what "relevance" means for this feature
  });

  it('handles empty query gracefully', async () => {
    // Your expertise: edge cases that matter
  });

  it('respects rate limits', async () => {
    // Your expertise: production concerns
  });
});

You're not writing implementation—you're defining the contract.
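Filled in, those contracts might look something like this. The BookmarkSearch API, the result shape, and the rate-limit behavior are illustrative assumptions, not a prescribed design:

import { describe, it, expect } from 'vitest';
// Hypothetical module and API; the real shape is whatever your spec calls for.
import { BookmarkSearch } from './bookmark-search';

describe('BookmarkSearch', () => {
  const search = new BookmarkSearch();

  it('returns results ranked by relevance', async () => {
    const results = await search.search('typescript testing');
    const scores = results.map((r) => r.score);
    // The contract: best match first, scores descending.
    expect(scores).toEqual([...scores].sort((a, b) => b - a));
  });

  it('handles empty query gracefully', async () => {
    // The contract: no results, no thrown error.
    await expect(search.search('')).resolves.toEqual([]);
  });

  it('respects rate limits', async () => {
    // The contract: a burst past the limit is rejected. Assumes a limit of
    // 10 requests and an error message that mentions the rate limit.
    const burst = Array.from({ length: 11 }, () => search.search('query'));
    await expect(Promise.all(burst)).rejects.toThrow(/rate limit/i);
  });
});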

3. Implementation (/sdd-implement)

Tell the agent: "Make these tests pass. Follow the patterns in the existing codebase."

The agent writes code. You watch tests go from red to green. If a test fails, the agent fixes it. Your job is watching the test output, not reading every line of code.

4. Verification (/sdd-verify)

Tests passing isn't the only check:

  • Run the linter
  • Check types
  • Review the diff for anything obviously wrong
  • Run the full test suite to catch regressions

But the core question—does this work?—is answered by the tests you wrote.

5. Ship

Tests pass, lints pass, types check. Ship it.

What this looks like in practice

Here's how I built a feature last week:

Task: Add semantic search to bookmarks API.

/sdd-spec semantic-search — 5 minutes. Defined the problem, scope, and acceptance criteria:

  • AC-1: Search returns results ordered by cosine similarity
  • AC-2: Empty results return empty array, not error
  • AC-3: Query over 500 chars is rejected
  • AC-4: Results include relevance score

/sdd-tasks — Generated test-first tasks from the spec. Each AC becomes a pair: write a failing test, then implement to pass.

Tests: 15 minutes writing the actual test code for AC-1 through AC-4.
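The test code for those four ACs might look roughly like this sketch (the searchBookmarks import, result shape, and error message are placeholders, not the real implementation):

import { describe, it, expect } from 'vitest';
// Hypothetical module under test; adjust the import to your actual code.
import { searchBookmarks } from './semantic-search';

describe('semantic bookmark search', () => {
  it('AC-1: orders results by cosine similarity', async () => {
    const results = await searchBookmarks('static typing in javascript');
    for (let i = 1; i < results.length; i++) {
      expect(results[i - 1].score).toBeGreaterThanOrEqual(results[i].score);
    }
  });

  it('AC-2: returns an empty array for no matches, not an error', async () => {
    // Assumes the implementation filters out results below a similarity threshold.
    await expect(searchBookmarks('zzqx nothing matches this')).resolves.toEqual([]);
  });

  it('AC-3: rejects queries over 500 characters', async () => {
    await expect(searchBookmarks('a'.repeat(501))).rejects.toThrow(/query too long/i);
  });

  it('AC-4: includes a relevance score on every result', async () => {
    const results = await searchBookmarks('typescript');
    for (const result of results) {
      expect(result.score).toBeTypeOf('number');
    }
  });
});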

/sdd-implement — "Make these tests pass. Use pgvector for similarity search—it's already in the schema."

Agent wrote the code. First run: 2 tests failed. Agent fixed them. Second run: all green.
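For context, a pgvector-backed implementation tends to look something like this sketch. It assumes a node-postgres Pool, a bookmarks table with a vector embedding column, and a hypothetical embed() helper that produces the query vector:

import { Pool } from 'pg';
// Hypothetical helper that turns a query string into an embedding vector (number[]).
import { embed } from './embeddings';

const pool = new Pool();

export async function searchBookmarks(query: string) {
  if (query.length > 500) throw new Error('query too long');

  const vector = await embed(query);
  // pgvector's <=> operator is cosine distance, so similarity is 1 - distance.
  const { rows } = await pool.query(
    `SELECT id, url, title, 1 - (embedding <=> $1::vector) AS score
       FROM bookmarks
      ORDER BY embedding <=> $1::vector
      LIMIT 20`,
    [JSON.stringify(vector)],
  );
  return rows;
}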

/sdd-verify — Generated verification report showing all ACs passed with evidence.

Total time: 30 minutes for a feature that would have taken 2 hours to write by hand.

The key: I spent my time on spec and tests, not implementation. My expertise went into defining what "correct" means. The agent handled the typing.

Unit tests vs integration tests

Not all tests serve the same purpose with AI agents.

Unit tests are your core TDD loop. They're fast, isolated, and the contract is clear. Write a unit test, agent makes it pass, repeat. This is where TDD shines—tight feedback, rapid iteration.

But unit tests have a blind spot: they can't catch system-level assumptions. AI-generated code often passes unit tests while failing in production because the agent assumed infrastructure that doesn't exist, APIs that behave differently, or async flows that don't match reality.

Integration tests catch what unit tests miss. When your feature touches databases, external APIs, or coordinates multiple services, write integration tests that validate those boundaries. These are slower, but they catch the architectural violations that unit tests can't see.

A practical split:

  • Unit tests: core logic, edge cases, error handling. You specify them; the agent can write the setup.
  • Integration tests: service boundaries, data flows, external systems. You specify the interactions.
  • Contract tests: API boundaries between services. You define the contract.
  • E2E tests: critical user journeys. You specify the scenarios.

The research is clear: studies of AI-assisted development found teams spend less than 5% of testing effort on integration and coordination—the most fragile parts of the system. Don't make that mistake.

Start with unit tests for the core logic. Add integration tests for every boundary the agent's code crosses.
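A good first integration test simply checks that the infrastructure the agent's code assumes actually exists. Here's a sketch, assuming a disposable test database reachable via a TEST_DATABASE_URL environment variable:

import { describe, it, expect, afterAll } from 'vitest';
import { Pool } from 'pg';

// Assumes a disposable test database (migrations applied) reachable via
// TEST_DATABASE_URL; table and column names are illustrative.
const pool = new Pool({ connectionString: process.env.TEST_DATABASE_URL });

afterAll(() => pool.end());

describe('infrastructure the search code assumes', () => {
  it('has the pgvector extension installed', async () => {
    const res = await pool.query("SELECT 1 FROM pg_extension WHERE extname = 'vector'");
    expect(res.rowCount).toBe(1);
  });

  it('has an embedding column on the bookmarks table', async () => {
    const res = await pool.query(
      `SELECT 1 FROM information_schema.columns
        WHERE table_name = 'bookmarks' AND column_name = 'embedding'`,
    );
    expect(res.rowCount).toBe(1);
  });
});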

When to write more tests

The agent will sometimes produce code that passes your tests but has issues you didn't anticipate. When that happens:

  1. Write a new test that catches the issue
  2. Let the agent fix it
  3. Now that case is covered forever

Your test suite grows with your understanding. Every bug becomes a test. Every edge case gets encoded.
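For example, suppose the agent treats a whitespace-only query as a real query instead of an empty one. One new test in the suite from earlier (same hypothetical searchBookmarks helper) locks in the fix:

it('treats a whitespace-only query as an empty query', async () => {
  // Added after the bug surfaced; now the case is covered permanently.
  await expect(searchBookmarks('   ')).resolves.toEqual([]);
});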

If you've never written tests before

Start simple. You don't need to learn a framework—just describe what "correct" means.

Step 1: Write assertions in plain language.

Before any code, write down what should happen:

  • Given an empty cart, total should be 0
  • Adding an item should increase the count by 1
  • Removing the last item should make the cart empty again

These are your tests. The agent can turn them into actual test code.

Step 2: Ask the agent to scaffold the test file.

"Create a test file for a shopping cart. Here are my test cases: [paste your assertions]. Use Vitest."

The agent writes the test structure. You review that it matches your intent.
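The scaffold might come back looking something like this; the ShoppingCart API it assumes is one plausible reading of the assertions, not a fixed design:

import { describe, it, expect } from 'vitest';
// Hypothetical module; the agent creates it alongside the tests.
import { ShoppingCart } from './shopping-cart';

describe('ShoppingCart', () => {
  it('has a total of 0 when empty', () => {
    const cart = new ShoppingCart();
    expect(cart.total()).toBe(0);
  });

  it('increases the count by 1 when an item is added', () => {
    const cart = new ShoppingCart();
    cart.add({ name: 'Coffee', price: 4 });
    expect(cart.count()).toBe(1);
  });

  it('is empty again after removing the last item', () => {
    const cart = new ShoppingCart();
    cart.add({ name: 'Coffee', price: 4 });
    cart.remove('Coffee');
    expect(cart.count()).toBe(0);
    expect(cart.total()).toBe(0);
  });
});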

Step 3: Run the tests, watch them fail.

Red means the tests are working—they're checking for behavior that doesn't exist yet.

Step 4: Tell the agent to make them pass.

"Implement the ShoppingCart class to make these tests pass."

Now you're doing TDD. You specified what "correct" means. The agent wrote the code.
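The implementation the agent hands back can be as small as this sketch, matching the hypothetical API from the scaffold above:

type Item = { name: string; price: number };

export class ShoppingCart {
  private items: Item[] = [];

  add(item: Item): void {
    this.items.push(item);
  }

  remove(name: string): void {
    const index = this.items.findIndex((item) => item.name === name);
    if (index !== -1) this.items.splice(index, 1);
  }

  count(): number {
    return this.items.length;
  }

  total(): number {
    return this.items.reduce((sum, item) => sum + item.price, 0);
  }
}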

The key insight: you don't need to know testing syntax to do TDD. You need to know what your code should do. The agent handles the rest.

How this compares to GitHub SpecKit

If this workflow sounds familiar, it's because it's similar to GitHub SpecKit—the open-source toolkit GitHub released for spec-driven development.

SpecKit uses a similar structure:

  • /speckit.constitution → AGENTS.md (project principles and standards)
  • /speckit.specify → /sdd-spec (requirements and acceptance criteria)
  • /speckit.plan → /sdd-plan (technical architecture)
  • /speckit.tasks → /sdd-tasks (task decomposition)
  • /speckit.implement → /sdd-implement (code generation)

The key difference: I made TDD explicit in the workflow.

SpecKit's task phase generates implementation tasks. My /sdd-tasks generates test-first tasks—each acceptance criterion becomes a pair: write a failing test, then implement to pass. The verification isn't a separate phase at the end; it's built into every task.

Both approaches share the same insight: specs before code, structured phases, and human control over "what" while AI handles "how." The difference is where testing fits. SpecKit treats testing as part of implementation. I treat tests as the primary artifact that drives implementation.

If you're starting from scratch, SpecKit is a great starting point. If you want TDD baked into the workflow, adapt it like I did.

When this doesn't work

TDD with agents isn't magic. It fails when:

  • You don't understand the domain. If you can't write good tests, the agent can't write good code. Garbage in, garbage out.
  • The feature is exploratory. Sometimes you need to prototype before you know what to test. That's fine—spike first, then write tests, then rewrite with the agent.
  • Tests are too slow. If your test suite takes 10 minutes, the feedback loop breaks. Keep tests fast.

The shift

Before AI, I wrote tests after implementation (if at all). TDD felt slow—why write tests for code that doesn't exist?

With agents, TDD is faster. Writing tests is the hard part. Implementation is the easy part. So front-load the hard part, then let the agent handle the easy part.

The engineers shipping the most with AI aren't the ones who prompt best. They're the ones who test best. They know what "correct" means before any code is written.

Write the tests. Let the agent make them pass. Ship with confidence.