The Testing Pyramid Still Matters (Even When AI Wrote Your Tests)
I’ve been spending a lot of time in agentic AI workflows lately. Describe a feature and the agent writes the code, the tests, and the PR. It’s genuinely impressive. It’s also how I ended up with test suites that take fifteen minutes to run, full of integration tests that didn’t need to be integration tests.
The testing pyramid didn’t become irrelevant just because AI wrote the tests.
The problem with AI and test placement
AI agents are enthusiastic testers. Ask one to add coverage for a new feature and it’ll write tests, lots of them. The problem isn’t that they default to any particular layer. It’s that they’re trained to optimise for quality and coverage, not for the cost of achieving it. Nobody’s system prompt says “think about how long this test suite will take to run.” So they go overboard. Spinning up a database, creating records via factories, hitting an HTTP endpoint, asserting on the response. It works. The test passes. But it’s expensive.
One of those tests takes maybe 300ms to run. Fine. Write fifty of them and that’s fifteen seconds; write a few hundred and you’re into multiple minutes before you’ve even touched E2E. Keep going at that rate and your CI pipeline becomes a queue-occupying bottleneck that frustrates everyone on the team.
The agent didn’t write bad tests exactly. It wrote tests that are too high up the pyramid for what they’re actually validating.
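To make that concrete, here’s a hypothetical sketch (in framework-neutral Python rather than the Pest the post is really about): the agent validated a pure calculation by spinning up a database and hitting an endpoint, when the calculation itself needs neither. Extracted, it’s a millisecond unit test.

```python
# Hypothetical example: pure business logic that an agent might have
# tested through an HTTP request and a database, but doesn't need either.

def order_total(items: list[dict], discount_rate: float) -> float:
    """Sum the line items, then apply the discount."""
    subtotal = sum(item["price"] * item["qty"] for item in items)
    return round(subtotal * (1 - discount_rate), 2)

# Unit test: no database, no HTTP, runs in microseconds.
def test_order_total_applies_discount():
    items = [{"price": 10.0, "qty": 2}, {"price": 5.0, "qty": 1}]
    assert order_total(items, discount_rate=0.1) == 22.5
```

Same coverage of the same logic, at a fraction of the runtime cost.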
Why this still matters: humans are still in the loop
I’ve written before about why humans need to stay in the loop with AI-assisted development. The short version is that AI doesn’t remove the need for code review, it just changes who triggers it.
When someone raises a pull request, the code review can’t happen until the tests finish. That’s the human waiting on the machine. And if your test suite takes forever because every piece of logic is validated through an HTTP request and a database hit, you’ve created a bottleneck that affects every developer on the team, not just the one who wrote the slow tests.
It doesn’t matter who wrote the tests. If they’re slow, they’re slow. The bottleneck is real regardless of whether it was a human or an agent that placed a unit-testable calculation behind a full integration test.
The pyramid is a cost model
The test pyramid (Mike Cohn’s model, later written up on Martin Fowler’s site) isn’t just a preference. It’s a cost model. Each layer up the pyramid costs more to run, more to maintain, and more to debug when it breaks. Push tests down the pyramid as far as they can go without losing meaningful coverage.
Unit tests are cheap. Test your business logic, your calculations, your model accessors - mock your dependencies, stub your data layer, keep real infrastructure out of it. These tests run in milliseconds and give you tight, fast feedback.
Integration tests are where you bring real infrastructure in. Use them for what they’re actually good at: HTTP request/response contracts, authorisation enforcement, middleware behaviour, database interactions where the database is genuinely the thing being tested.
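Here’s a rough sketch of what that layer is for, again in framework-neutral Python (in Laravel/Pest this would be an HTTP test hitting a real route). The app below is a stand-in WSGI endpoint; the tests assert on the contract - status codes and authorisation - which is exactly what integration tests earn their cost doing.

```python
import io

# A minimal WSGI app standing in for a real framework endpoint
# (hypothetical example, not any particular framework's API).
def app(environ, start_response):
    if environ.get("HTTP_AUTHORIZATION") != "Bearer valid-token":
        start_response("401 Unauthorized", [("Content-Type", "text/plain")])
        return [b"unauthorized"]
    start_response("200 OK", [("Content-Type", "application/json")])
    return [b'{"status": "ok"}']

def call(app, headers=None):
    """Drive the WSGI app directly; no server or network involved."""
    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/reports",
               "wsgi.input": io.BytesIO(), **(headers or {})}
    captured = {}
    def start_response(status, response_headers):
        captured["status"] = status
    body = b"".join(app(environ, start_response))
    return captured["status"], body

def test_reports_requires_auth():
    status, _ = call(app)
    assert status == "401 Unauthorized"

def test_reports_returns_ok_with_token():
    status, body = call(app, {"HTTP_AUTHORIZATION": "Bearer valid-token"})
    assert status == "200 OK" and b"ok" in body
```

The point is what’s being asserted: the request/response contract and the auth behaviour, not business logic that belongs a layer down.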
E2E tests are expensive. Reserve them for real browser journeys, things that genuinely require JavaScript, client-side interactivity, accessibility checks, or cross-page navigation that depends on browser state.
A healthy suite has many unit tests, a moderate number of integration tests, and very few E2E tests. The ratio matters. An inverted pyramid - lots of slow integration tests and barely any unit tests - means you’re paying maximum cost for minimum feedback speed.
You still need to guide the agent
If you’re using agentic AI development, you need to tell the agent how to test, not just that tests should exist. Left without guidance, it defaults to whatever pattern it sees most in your existing test suite. If your existing tests are already hitting real databases for things that could be mocked, the agent will replicate that pattern and make it worse.
The answer I’ve been working on is a testing skill file, a markdown document that lives in your repo alongside your other Copilot skill files. It tells the agent what belongs at unit vs. integration vs. E2E level, which anti-patterns to avoid (testing framework internals, asserting exact HTML, testing config values), and gives it a decision flowchart to ask before it places a test.
I’ve put the one I’ve been using as a GitHub Gist here: testing/SKILL.md
It’s written for Laravel/Pest but the principles apply anywhere. The key sections are the anti-pattern examples, concrete code showing what not to write and why, and the decision flowchart that gives the agent a simple set of questions to ask before it places a test.
What good looks like
A service that hits the database doesn’t have to be tested with a real database. Mock the data layer, stub the dependencies, test the logic in isolation. Fast, no infrastructure needed. The integration test - the one that proves it all wires together correctly - is still worth writing, but you only need a few of those. You don’t need one for every code path.
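As a sketch of that mocking move (hypothetical names, Python standing in for Pest), here the repository dependency is stubbed so the service’s logic is tested with no database at all:

```python
from unittest.mock import Mock

# Hypothetical service: the repository is injected, so the test can
# stub it out and keep real infrastructure entirely out of the picture.
class InvoiceService:
    def __init__(self, repo):
        self.repo = repo

    def outstanding_balance(self, customer_id: int) -> float:
        invoices = self.repo.unpaid_for(customer_id)
        return sum(inv["amount"] for inv in invoices)

def test_outstanding_balance_sums_unpaid_invoices():
    repo = Mock()
    repo.unpaid_for.return_value = [{"amount": 120.0}, {"amount": 30.5}]
    service = InvoiceService(repo)
    assert service.outstanding_balance(42) == 150.5
    repo.unpaid_for.assert_called_once_with(42)
```

The handful of integration tests then only have to prove the real repository wires in correctly; the logic itself is already covered here.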
When the agent understands those distinctions, the tests it writes are faster, the suite is cheaper to run, and your CI pipeline stops being the thing everyone complains about at standup.
The testing pyramid isn’t old advice that AI has made obsolete. It’s advice that matters more now, because agents can generate bad test placement at scale, and slow test suites block the human review step that keeps the whole thing from going off the rails.
Start with the skill file. Your reviewers will thank you.