MICHAEL CZEISZPERGER

Multi-LLM Spec-Driven Software Development

Over the past year I have shipped around twenty major revisions of a 25-year-old, 600,000+ line codebase using a workflow that seems obvious to me (and a growing number of professional software engineers) and foreign to many working programmers I talk to. Either they are still in the Cursor-style inner loop of small prompts, small edits, and small completions, or they are trying to one-shot complex instructions. Either way, they are not running an agentic system at full capability, and they are paying for it in slow, flaky, one-model output. It’s impossible to quantify, but my unscientific impression is that I’ve made more changes to this product in the past 12 months than previous teams of four could have achieved over several years.

This approach isn’t novel. Anyone who has spent the last year full-time searching for a reliable AI software development workflow will land somewhere similar.

The workflow has three pillars: spec-driven planning, multi-LLM review, and hand-curated tests with golden datasets. Each pillar catches a class of failure the other two can’t. Everything that follows is a defense of why you need all three.

Three independent pillars catch three different classes of failure: spec-driven planning catches wrong intent, multi-LLM review catches wrong implementation, and hand-curated tests with golden datasets catch wrong output. The usual operational gates (human PR review, lint, CI, merge) run beneath all three. Three independent guarantees. Skip one and the other two can still let the wrong answer through.

Spec-driven planning

This pillar is old news in a new context. For major projects, agentic coding flips your job. You’re less a coder and more a business analyst with an architect’s instincts and a testing manager’s paranoia, writing down exactly what the system should do, how it should do it, and how it should be tested. You spend much less time sweating low-level details like the internals of a sort routine. With agile entrenched almost everywhere, writing a specification up front sounds quaint or worse. Isn’t waterfall dead?

Not for this problem. If the requirements are squishy and likely to mutate mid-stream, agile fits. If the target is defined (an RFC protocol), the spec goes up front. What’s new is that AI makes it cheap to iterate until you get to a defined spec. You can vibe-code a working mockup of the app, its workflow, its interaction patterns, and pass that around to stakeholders before anyone commits to a single line of backend. By the time code gets written, it’s building what everyone has already agreed to.

The biggest mistake you can make is under-specifying. The first problem people hit on an AI coding project, every time, is that the spec wasn’t specific enough. GitHub’s Spec Kit walks you through the three layers:

  1. Specification. What the system does.
  2. Plan. How it does it.
  3. Tasks. The individual work items that carry out the plan.

There are lots of other planning tools, and personal preference decides which one you pick. The important thing is that by the time a model is generating code, the hard problems have already been decided. Each task is small, self-contained, and doesn’t need the entire codebase in context to complete. Failure risk collapses because the scope of any one completion is a function or a class, not an app.

Multi-LLM review

A single LLM cannot be trusted to produce the specification, the plan, the tasks, or code by itself. Look at benchmark submissions: every frontier model solves roughly the same set of easy problems, but on the hard ones, each model gets a different subset right. The overlap in failures is low.

That’s not a hunch. Organizations have to publish per-task results when they upload benchmark scores, which means you can compute the oracle: the score you would get if you always picked the best model’s answer for each problem. Across 39 models on SWE-Bench Verified, the oracle lifts the best individual model from 76.8% to 90.2%. On the harder SWE-Bench Pro, across nine models, it goes from 43.8% to 64.8%. The gains concentrate where they matter most. On the hardest SWE-Bench Pro problems, where the best individual model again scores 43.8%, a three-model oracle lifts that to 56.9%. (Full analysis in my forthcoming paper, How Much Could Multi-Model Teams Improve AI Coding Agents?)

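The oracle is straightforward to compute once you have per-task pass/fail results for each model. A minimal sketch, using toy data rather than the published benchmark results, to show why low failure overlap drives the gap:

```python
# Oracle score: for each task, credit success if ANY model solved it.
# Per-model results are dicts of task_id -> bool (solved or not).

def oracle_score(results_by_model):
    """Fraction of tasks solved by at least one model."""
    tasks = set()
    for per_task in results_by_model.values():
        tasks.update(per_task)
    solved = sum(
        any(per_task.get(t, False) for per_task in results_by_model.values())
        for t in tasks
    )
    return solved / len(tasks)

def best_individual(results_by_model):
    """Best single-model score, for comparison."""
    return max(
        sum(per_task.values()) / len(per_task)
        for per_task in results_by_model.values()
    )

# Toy example: three models, four tasks, low overlap in failures.
results = {
    "model_a": {"t1": True, "t2": True, "t3": False, "t4": False},
    "model_b": {"t1": True, "t2": False, "t3": True, "t4": False},
    "model_c": {"t1": True, "t2": False, "t3": False, "t4": True},
}
print(best_individual(results))  # 0.5: each model alone solves half
print(oracle_score(results))     # 1.0: together they cover every task
```

The gap between the two numbers is entirely a function of how uncorrelated the failures are, which is why harder benchmarks, where models diverge more, show larger oracle gains.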

You can’t hit the oracle in practice, because you don’t know ahead of time which model has the right answer, but you can get close by having multiple models review each other’s work. I run Claude Code as the driver and route every step through OpenAI, Gemini, and DeepSeek for review. The third slot rotates depending on what’s hot that month. In debate mode, each model rips apart whatever it is reviewing, spec or plan or code, and argues with the others about what actually matters. When all three agree something is wrong, it gets redesigned. Fixing the spec and the plan up front is enormously cheaper than fixing code later, same as it has always been.
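The consensus rule at the heart of the debate round can be sketched in a few lines. The reviewer names and issue labels below are illustrative, and the actual API calls to each provider are omitted; the point is the aggregation logic, not the plumbing:

```python
# Consensus over multi-model review verdicts. Each reviewer returns a
# set of issue labels it found in the artifact (spec, plan, or code).
# In practice each entry would come from a separate provider API call.

def consensus_issues(reviews, quorum=None):
    """Issues flagged by at least `quorum` reviewers (default: all of them)."""
    quorum = quorum or len(reviews)
    counts = {}
    for issues in reviews.values():
        for issue in issues:
            counts[issue] = counts.get(issue, 0) + 1
    return {issue for issue, n in counts.items() if n >= quorum}

# Hypothetical debate-round output from three reviewers.
reviews = {
    "openai":   {"race condition in cache", "missing error path"},
    "gemini":   {"race condition in cache", "naming"},
    "deepseek": {"race condition in cache", "missing error path"},
}

# Unanimous findings trigger a redesign; majority findings get a second look.
print(consensus_issues(reviews))            # {'race condition in cache'}
print(consensus_issues(reviews, quorum=2))  # adds 'missing error path'
```

The quorum knob is the whole policy: unanimity for expensive actions like redesigning a spec, a simple majority for cheaper ones like requesting a revision.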

The online discourse about model quality is almost all noise. A model gives someone a hallucination or fails on a simple problem, and that becomes a thread declaring the model ruined. Any model will fail. They all fail at different times. A one-off bad result against your particular problem isn’t a performance signal. It’s a sample of one. Any coding workflow that depends on a frontier coding model never making a mistake and never hallucinating is fundamentally flawed.

That isn’t to say three reviewers is a hard and fast rule. Even one outside model reviewing every stage cuts the chance a bug ships. Three cuts it further.

Hand-curated tests with golden datasets

The first two pillars work in the semantic domain. Spec review checks whether you wrote down the right intent. Multi-LLM code review checks whether the implementation matches that intent. Neither pillar runs the code. Neither compares actual output against a correct answer on inputs a human deliberately chose to be nasty.

That gap is where LLM-generated code hides its bugs. A human engineer writes code that looks wrong when it’s wrong, through weird structure, nervous comments, sloppy naming. An LLM writes code that looks right even when it’s wrong, because it has been trained to produce code that pattern-matches working code. Three models in a debate round can all agree the code is fine and all three can be fooled by the same plausible-looking mistake. The only discipline that reliably catches this class of failure is running the code against ground-truth output that a human curated by hand.

The hand-curated part is the whole point. Tests written by an AI against its own output are circular; they assert that the code does what the code does, not what the code was supposed to do. The point of a golden dataset is that a human sits down, picks the inputs most likely to trip the system, decides what the correct answer should be, and encodes both. The test harness then runs the code and checks. This is the one artifact in the pipeline that cannot be delegated to the model, because it is the definition of correctness, not a derivative of it.
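A golden-dataset harness is small. In this sketch, `normalize_phone` stands in for whatever function the model produced (a hypothetical example, not code from my project); the golden pairs are the human-authored artifact:

```python
# Golden-dataset harness: inputs and expected outputs are hand-written,
# never derived from the code under test. `normalize_phone` is a stand-in
# for model-generated code.

def normalize_phone(raw):
    digits = "".join(c for c in raw if c.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip US country code
    if len(digits) != 10:
        return None          # reject rather than guess
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

# The golden set: nasty inputs a human chose, with human-decided answers.
GOLDEN = [
    ("555-867-5309",     "(555) 867-5309"),
    ("1 (555) 867-5309", "(555) 867-5309"),
    ("867-5309",         None),   # too short: must refuse, not pad
    ("",                 None),
]

failures = [(raw, expected, normalize_phone(raw))
            for raw, expected in GOLDEN
            if normalize_phone(raw) != expected]
assert not failures, f"golden mismatches: {failures}"
print(f"{len(GOLDEN)} golden cases passed")
```

Notice that the interesting rows are the rejections: an LLM writing its own tests will rarely assert that the correct answer is to refuse.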

Skip this pillar and the first two can still let wrong answers through. Keep it, and each pillar catches what the others miss: intent, implementation, and output.

Quality processes still apply

Even with all three pillars in place, you still build in the assumption that any individual model will confidently generate something terrible. This is not different from managing a team of engineers, where any of them will have a bad day eventually and write something that looks fine until it isn’t. The answer is not a better prompt. The answer is the operational layer the industry settled on decades ago: human code review, lint, CI/CD, rollbacks. The three pillars above don’t retire any of that. They make it more important.

One-shotting an app looks like magic because you press a button and something comes out the other end. But that’s the output you would get from a human engineer told not to run tests, not to exercise the app, and to ship as soon as it compiles. Of course it’s crap. If you use the same professional workflow you would use with humans, the hallucinations and bad-confidence moments get absorbed the same way a weak PR from a tired engineer gets absorbed: somebody catches it and it doesn’t ship.

Isn’t this slow and expensive?

Yes, it slows things down up front, but the real answer depends on what you’re measuring. If you measure the time until code first appears, this workflow is slow. If you measure the time to working code that is close to shippable, it’s much faster. I spend most of the up-front budget on architecture, the data model, and how to comprehensively test the output before any code is generated. But what gets generated then tends to run and match the spec on the first pass. It may not be what I stick with, but it works. Without this structure you end up with a buggy mess, the AI trying to fix bug after bug while you head to Reddit to post that your favorite model was “nerfed.” Model quality will always fluctuate. With this workflow, it doesn’t matter.

As for cost, reviewing plans and code is cheap. I’m on API key plans for everything but Claude Code, and those extra LLMs typically run under $100 a month combined. For a professional programmer, that’s one billable hour to save untold hours of work.

Isn’t this overkill for small problems?

Of course. As any software project manager will tell you, you pick the level of process appropriate to the size of the task. I divide my work into three tiers:

  1. A few hours of hand work: just tell the AI to do it.
  2. A day or two: Claude Code’s built-in planning.
  3. Three or more days, or a major refactor: the full spec-driven, multi-LLM process.

The overhead of the full workflow only pays off when the work is substantial enough that catching mistakes early matters more than shipping fast. For small jobs, one-shotting is fine as long as you have well-designed unit and integration test suites. Just don’t mistake a big job for a small one.
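The tiering reduces to a rule of thumb. The thresholds below are the ones from my own tiers; treat them as a starting point, not a policy:

```python
# Pick the process tier from estimated hand-effort. Thresholds mirror
# the three tiers above and are a rule of thumb, not a hard rule.

def workflow_tier(effort_days, major_refactor=False):
    if major_refactor or effort_days >= 3:
        return "full spec-driven, multi-LLM process"
    if effort_days >= 1:
        return "built-in planning mode"
    return "one-shot prompt"

print(workflow_tier(0.2))                        # one-shot prompt
print(workflow_tier(1.5))                        # built-in planning mode
print(workflow_tier(0.5, major_refactor=True))   # full process anyway
```

The `major_refactor` escape hatch matters: a refactor can look like a day of typing while touching enough of the codebase to deserve the full process.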

The bottom line

Can spec-driven, multi-LLM development fail? Yes, the same as any development process. Once I worked through the initial gotchas and folded the recovery patterns into my workflow, it has been twenty major revisions without a meaningful miss. Each one of those would have taken a month or two the old way. They go out in a few days now. The pace is, frankly, astounding.
