MICHAEL CZEISZPERGER

Multi-LLM Spec-Driven Software Development

Over the past year I have shipped around twenty major revisions of a 25-year-old, 600,000+ line codebase using a workflow that seems obvious to me (and a growing number of professional software engineers) and foreign to many working programmers I talk to. Either they are still in the Cursor-style inner loop of small prompts, small edits, and small completions, or they are trying to one-shot complex instructions. Either way, they are not running an agentic system at full capability, and they are paying for it in slow, flaky, one-model output. It’s impossible to quantify, but my unscientific impression is that I’ve made more changes to this product in the past 12 months than previous teams of four could have achieved over several years.

This approach isn’t novel. Anyone who has spent the last year full-time searching for a reliable AI software development workflow will land somewhere similar.

The workflow has three pillars: spec-driven planning, multi-LLM review, and hand-curated tests with golden datasets. Each pillar catches a class of failure the other two can’t. Everything that follows is a defense of why you need all three.

Three independent pillars catch three different classes of failure: spec-driven planning catches wrong intent, multi-LLM review catches wrong implementation, and hand-curated tests with golden datasets catch wrong output. The usual operational gates (human PR review, lint, CI, merge) run beneath all three. Three independent guarantees. Skip one and the other two can still let the wrong answer through.

Spec-driven planning

This pillar is old news in a new context. For major projects, agentic coding flips your job. You’re less a coder and more a business analyst with an architect’s instincts and a testing manager’s paranoia, writing down exactly what the system should do, how it should do it, and how it should be tested. You spend much less time sweating low-level details like the internals of a sort routine. With agile entrenched almost everywhere, writing a specification up front sounds quaint or worse. Isn’t waterfall dead?

Not for this problem. If the requirements are squishy and likely to mutate mid-stream, agile fits. If the target is defined (an RFC protocol), the spec goes up front. What’s new is that AI makes it cheap to iterate until you get to a defined spec. You can vibe-code a working mockup of the app, its workflow, its interaction patterns, and pass that around to stakeholders before anyone commits to a single line of backend. By the time code gets written, it’s building what everyone has already agreed to.

The biggest mistake you can make is under-specifying. The first problem people hit on an AI coding project, every time, is that the spec wasn’t specific enough. GitHub’s Spec Kit walks you through the three layers:

  1. Specification. What the system does.
  2. Plan. How it does it.
  3. Tasks. The individual work items that carry out the plan.

There are lots of other planning tools, and personal preference decides which one you pick. The important thing is that by the time a model is generating code, the hard problems have already been decided. Each task is small, self-contained, and doesn’t need the entire codebase in context to complete. Failure risk collapses because the scope of any one completion is a function or a class, not an app.

Multi-LLM review

A single LLM cannot be trusted to produce the specification, the plan, the tasks, or code by itself. Look at benchmark submissions: every frontier model solves roughly the same set of easy problems, but on the hard ones, each model gets a different subset right. The overlap in failures is low.

That’s not a hunch. Organizations have to publish per-task results when they upload benchmark scores, which means you can compute the oracle: the score you would get if you always picked the best model’s answer for each problem. Across 39 models on SWE-Bench Verified, the oracle lifts the best individual model from 76.8% to 90.2%. On the harder SWE-Bench Pro, across nine models, it goes from 43.8% to 64.8%. The gains concentrate where they matter most. On the hardest SWE-Bench Pro problems, where the best individual model again scores 43.8%, a three-model oracle lifts that to 56.9%. (Full analysis in my forthcoming paper, How Much Could Multi-Model Teams Improve AI Coding Agents?)

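The oracle is straightforward to compute once you have per-task pass/fail results for each model. A minimal sketch, using toy data rather than the published benchmark results, to show why low failure overlap drives the gap:

```python
# Oracle score: for each task, credit success if ANY model solved it.
# Per-model results are dicts of task_id -> bool (solved or not).

def oracle_score(results_by_model):
    """Fraction of tasks solved by at least one model."""
    tasks = set()
    for per_task in results_by_model.values():
        tasks.update(per_task)
    solved = sum(
        any(per_task.get(t, False) for per_task in results_by_model.values())
        for t in tasks
    )
    return solved / len(tasks)

def best_individual(results_by_model):
    """Best single-model score, for comparison."""
    return max(
        sum(per_task.values()) / len(per_task)
        for per_task in results_by_model.values()
    )

# Toy example: three models, four tasks, low overlap in failures.
results = {
    "model_a": {"t1": True, "t2": True, "t3": False, "t4": False},
    "model_b": {"t1": True, "t2": False, "t3": True, "t4": False},
    "model_c": {"t1": True, "t2": False, "t3": False, "t4": True},
}
print(best_individual(results))  # 0.5: each model alone solves half
print(oracle_score(results))     # 1.0: together they cover every task
```

The gap between the two numbers is entirely a function of how uncorrelated the failures are, which is why harder benchmarks, where models diverge more, show larger oracle gains.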

You can’t hit the oracle in practice, because you don’t know ahead of time which model has the right answer, but you can get close by having multiple models review each other’s work. I run Claude Code as the driver and route every step through OpenAI, Gemini, and DeepSeek for review. The third slot rotates depending on what’s hot that month. In debate mode, each model rips apart whatever it is reviewing, spec or plan or code, and argues with the others about what actually matters. When all three agree something is wrong, it gets redesigned. Fixing the spec and the plan up front is enormously cheaper than fixing code later, same as it has always been.
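The consensus rule at the heart of the debate round can be sketched in a few lines. The reviewer names and issue labels below are illustrative, and the actual API calls to each provider are omitted; the point is the aggregation logic, not the plumbing:

```python
# Consensus over multi-model review verdicts. Each reviewer returns a
# set of issue labels it found in the artifact (spec, plan, or code).
# In practice each entry would come from a separate provider API call.

def consensus_issues(reviews, quorum=None):
    """Issues flagged by at least `quorum` reviewers (default: all of them)."""
    quorum = quorum or len(reviews)
    counts = {}
    for issues in reviews.values():
        for issue in issues:
            counts[issue] = counts.get(issue, 0) + 1
    return {issue for issue, n in counts.items() if n >= quorum}

# Hypothetical debate-round output from three reviewers.
reviews = {
    "openai":   {"race condition in cache", "missing error path"},
    "gemini":   {"race condition in cache", "naming"},
    "deepseek": {"race condition in cache", "missing error path"},
}

# Unanimous findings trigger a redesign; majority findings get a second look.
print(consensus_issues(reviews))            # {'race condition in cache'}
print(consensus_issues(reviews, quorum=2))  # adds 'missing error path'
```

The quorum knob is the whole policy: unanimity for expensive actions like redesigning a spec, a simple majority for cheaper ones like requesting a revision.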

The online discourse about model quality is almost all noise. A model gives someone a hallucination or fails on a simple problem, and that becomes a thread declaring the model ruined. Any model will fail. They all fail at different times. A one-off bad result against your particular problem isn’t a performance signal. It’s a sample of one. Any coding workflow that depends on a frontier coding model never making a mistake and never hallucinating is fundamentally flawed.

That isn’t to say three reviewers is a hard and fast rule. Even one outside model reviewing every stage cuts the chance a bug ships. Three cuts it further.

Hand-curated tests with golden datasets

The first two pillars work in the semantic domain. Spec review checks whether you wrote down the right intent. Multi-LLM code review checks whether the implementation matches that intent. Neither pillar runs the code. Neither compares actual output against a correct answer on inputs a human deliberately chose to be nasty.

That gap is where LLM-generated code hides its bugs. A human engineer writes code that looks wrong when it’s wrong, through weird structure, nervous comments, sloppy naming. An LLM writes code that looks right even when it’s wrong, because it has been trained to produce code that pattern-matches working code. Three models in a debate round can all agree the code is fine and all three can be fooled by the same plausible-looking mistake. The only discipline that reliably catches this class of failure is running the code against ground-truth output that a human curated by hand.

The hand-curated part is the whole point. Tests written by an AI against its own output are circular; they assert that the code does what the code does, not what the code was supposed to do. The point of a golden dataset is that a human sits down, picks the inputs most likely to trip the system, decides what the correct answer should be, and encodes both. The test harness then runs the code and checks. This is the one artifact in the pipeline that cannot be delegated to the model, because it is the definition of correctness, not a derivative of it.
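A golden-dataset harness is small. In this sketch, `normalize_phone` stands in for whatever function the model produced (a hypothetical example, not code from my project); the golden pairs are the human-authored artifact:

```python
# Golden-dataset harness: inputs and expected outputs are hand-written,
# never derived from the code under test. `normalize_phone` is a stand-in
# for model-generated code.

def normalize_phone(raw):
    digits = "".join(c for c in raw if c.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip US country code
    if len(digits) != 10:
        return None          # reject rather than guess
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

# The golden set: nasty inputs a human chose, with human-decided answers.
GOLDEN = [
    ("555-867-5309",     "(555) 867-5309"),
    ("1 (555) 867-5309", "(555) 867-5309"),
    ("867-5309",         None),   # too short: must refuse, not pad
    ("",                 None),
]

failures = [(raw, expected, normalize_phone(raw))
            for raw, expected in GOLDEN
            if normalize_phone(raw) != expected]
assert not failures, f"golden mismatches: {failures}"
print(f"{len(GOLDEN)} golden cases passed")
```

Notice that the interesting rows are the rejections: an LLM writing its own tests will rarely assert that the correct answer is to refuse.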

Skip this pillar and the first two can still let wrong answers through. Keep it, and each pillar catches what the others miss: intent, implementation, and output.

Quality processes still apply

Even with all three pillars in place, you still build in the assumption that any individual model will confidently generate something terrible. This is not different from managing a team of engineers, where any of them will have a bad day eventually and write something that looks fine until it isn’t. The answer is not a better prompt. The answer is the operational layer the industry settled on decades ago: human code review, lint, CI/CD, rollbacks. The three pillars above don’t retire any of that. They make it more important.

One-shotting an app looks like magic because you press a button and something comes out the other end. But that’s the output you would get from a human engineer told not to run tests, not to exercise the app, and to ship as soon as it compiles. Of course it’s crap. If you use the same professional workflow you would use with humans, the hallucinations and bad-confidence moments get absorbed the same way a weak PR from a tired engineer gets absorbed: somebody catches it and it doesn’t ship.

Isn’t this slow and expensive?

Yes, it slows things down up front, but the real answer depends on what you’re measuring. If you measure the time until code first appears, this workflow is slow. If you measure the time to working code that is close to shippable, it’s much faster. I spend most of the up-front budget on architecture, the data model, and how to comprehensively test the output before any code is generated. But what gets generated then tends to run and match the spec on the first pass. It may not be what I stick with, but it works. Without this structure you end up with a buggy mess, the AI trying to fix bug after bug while you head to Reddit to post that your favorite model was “nerfed.” Model quality will always fluctuate. With this workflow, it doesn’t matter.

As for cost, reviewing plans and code is cheap. I’m on API key plans for everything but Claude Code, and those extra LLMs typically run under $100 a month combined. For a professional programmer, that’s one billable hour to save untold hours of work.

Isn’t this overkill for small problems?

Of course. As any software project manager will tell you, you pick the level of process appropriate to the size of the task. I divide my work into three tiers:

  1. A few hours of hand work: just tell the AI to do it.
  2. A day or two: Claude Code’s built-in planning.
  3. Three or more days, or a major refactor: the full spec-driven, multi-LLM process.

The overhead of the full workflow only pays off when the work is substantial enough that catching mistakes early matters more than shipping fast. For small jobs, one-shotting is fine as long as you have well-designed unit and integration test suites. Just don’t mistake a big job for a small one.
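The tiering reduces to a rule of thumb. The thresholds below are the ones from my own tiers; treat them as a starting point, not a policy:

```python
# Pick the process tier from estimated hand-effort. Thresholds mirror
# the three tiers above and are a rule of thumb, not a hard rule.

def workflow_tier(effort_days, major_refactor=False):
    if major_refactor or effort_days >= 3:
        return "full spec-driven, multi-LLM process"
    if effort_days >= 1:
        return "built-in planning mode"
    return "one-shot prompt"

print(workflow_tier(0.2))                        # one-shot prompt
print(workflow_tier(1.5))                        # built-in planning mode
print(workflow_tier(0.5, major_refactor=True))   # full process anyway
```

The `major_refactor` escape hatch matters: a refactor can look like a day of typing while touching enough of the codebase to deserve the full process.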

The bottom line

Can spec-driven, multi-LLM development fail? Yes, the same as any development process. Once I worked through the initial gotchas and folded the recovery patterns into my workflow, it has been twenty major revisions without a meaningful miss. Each one of those would have taken a month or two the old way. They go out in a few days now. The pace is, frankly, astounding.
