AI assistants reduce the time needed to produce a plausible candidate. That is useful, but a candidate is not yet an engineering decision. Someone still has to establish what changed, which repository facts support it, what can fail, and which checks ran.
1. The bottleneck moves
In a conventional workflow, implementation capacity limits how many changes reach review. When generation becomes cheap, more candidates arrive and each may be wider than a human-written change produced under the same time constraint. Review capacity does not increase automatically.
For a simple queue, weekly review load is:
Review load = candidates per week × average review time
If a team moves from 20 to 40 candidates per week while review remains 30 minutes each, weekly review load doubles from 10 to 20 hours. If wider candidates take 37.5 minutes each, the load becomes 25 hours. These are illustrative values, not a market estimate, but the queueing mechanism is direct.
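The arithmetic can be checked with a few lines of Python. This is a minimal sketch of the formula above using the same illustrative numbers; the function name `weekly_review_hours` is an assumption made for this example, not part of any tool.

```python
def weekly_review_hours(candidates_per_week: int, minutes_per_review: float) -> float:
    """Review load = candidates per week x average review time, in hours."""
    return candidates_per_week * minutes_per_review / 60


# Illustrative values from the text, not measurements.
baseline = weekly_review_hours(20, 30)   # 20 candidates, 30 minutes each
doubled = weekly_review_hours(40, 30)    # volume doubles, review time unchanged
wider = weekly_review_hours(40, 37.5)    # volume doubles and each review takes longer

print(baseline, doubled, wider)  # 10.0 20.0 25.0
```

The model has no term for reviewer headcount, which is the point: unless capacity is added or review time per candidate falls, every extra candidate adds directly to the queue.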
Our longer quantitative analysis shows why reducing the model bill alone may have little effect when reviewer time dominates total decision cost: model cost and engineering cost are different quantities.
2. What review actually has to do
Review is not a binary scan for syntax errors. A reviewer reconstructs several things:
- Intent: what problem the change claims to solve.
- Context: which call paths, contracts, and constraints govern the change.
- Evidence: which files, tests, logs, or specifications support each important claim.
- Counterexamples: what concurrent, failure, or compatibility path breaks the proposal.
- Residual risk: what remains unverified after automated checks finish.
Bacchelli and Bird's study of modern code review found outcomes beyond defect discovery, including code improvement, understanding, and knowledge transfer. That matters for AI-generated changes: a polished diff can still impose substantial comprehension work on the reviewer.
3. Common failure modes of fast generation
The implementation is locally plausible but repository-wrong
The model follows a familiar framework pattern while missing a local invariant, compatibility layer, generated file, or indirect caller. The code can look idiomatic and still violate the system around it.
The explanation overstates what was checked
A response may say "tests pass" when only one targeted test ran, or say "all callers were updated" without a complete search. Reviewers then spend time separating observed facts from fluent summary.
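One way to keep a summary honest is to derive it from what actually ran rather than from the model's own wording. The sketch below is illustrative only: `summarize_test_run` and the test names are hypothetical, and a real system would read these sets from its test runner's output.

```python
def summarize_test_run(suite: set[str], ran: set[str], failed: set[str]) -> str:
    """Describe a test run without claiming more coverage than was observed."""
    if not ran:
        return "no tests ran"
    if failed:
        return f"{len(failed)} of {len(ran)} executed tests failed"
    if ran == suite:
        return f"all {len(suite)} tests ran and passed"
    not_run = len(suite - ran)
    return f"{len(ran)} of {len(suite)} tests ran and passed; {not_run} not run"


# Hypothetical suite in which only one targeted test was executed.
suite = {"test_parse", "test_retry", "test_timeout"}
print(summarize_test_run(suite, ran={"test_retry"}, failed=set()))
# 1 of 3 tests ran and passed; 2 not run
```

A summary built this way cannot collapse "one targeted test passed" into "tests pass", because the unexecuted count is part of the sentence.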
Large candidates hide uncertainty
When generation is cheap, speculative cleanup and unrelated refactoring can enter the same patch. That increases the surface that must be understood and weakens the link between evidence and decision.
Another model agrees without adding evidence
Running several agents can expose disagreements, but agreement is not proof. If every participant reasons from the same missing context, consensus only repeats the same unsupported premise.
4. Replace the polished answer with a review contract
A useful AI review artifact should reduce reconstruction work without pretending to guarantee correctness. At minimum, it should separate:
| Field | Question it answers |
|---|---|
| Decision | What action is recommended? |
| Evidence | Which observed repository facts support it? |
| Assumptions | What had to be taken as true? |
| Rejected hypotheses | Which plausible alternatives were examined? |
| Open checks | What still needs human or automated validation? |
| Trust verdict | How far can the candidate be relied on now? |
The key is provenance, not volume. A short statement linked to a concrete observation is more reviewable than a long narrative that mixes facts, inference, and recommendation.
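The fields in the table can be carried as a small structured record rather than free text. The following is a sketch under stated assumptions: the class names, the string values for the trust verdict, and the file paths in the example are all hypothetical and chosen only to illustrate the shape.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    claim: str   # the statement being supported
    source: str  # where it was observed, e.g. a file location or test name


@dataclass
class ReviewContract:
    decision: str
    evidence: list[Evidence] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    rejected_hypotheses: list[str] = field(default_factory=list)
    open_checks: list[str] = field(default_factory=list)
    trust_verdict: str = "insufficient_evidence"  # hypothetical default value

    def empty_fields(self) -> list[str]:
        """Return the names of list fields that were left empty."""
        names = ["evidence", "assumptions", "rejected_hypotheses", "open_checks"]
        return [name for name in names if not getattr(self, name)]


# Hypothetical example; the paths and claims are invented for illustration.
contract = ReviewContract(
    decision="revise",
    evidence=[Evidence("retry limit is read from config", "src/client.py:42")],
    assumptions=["no other caller overrides the retry limit"],
    open_checks=["run the full integration suite"],
    trust_verdict="needs_review",
)
print(contract.empty_fields())  # ['rejected_hypotheses']
```

An empty field is itself information for the reviewer: here it shows that no alternative explanation was recorded as examined.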
5. Measure the merge decision, not output speed
Useful evaluation questions include:
- How long does it take a reviewer to reach apply, revise, or reject?
- How many material claims have inspectable evidence anchors?
- How often does the artifact correctly say that evidence is insufficient?
- How many review comments identify missing context that the system should have fetched?
- Do escaped defects and rollback rates change as candidate volume grows?
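Two of these questions reduce to simple calculations once the data is logged. The sketch below assumes a hypothetical record format (a dict with an optional `anchor` key per claim, and a list of minutes per review); the sample values are invented, not measured.

```python
from statistics import median


def anchor_coverage(claims: list[dict]) -> float:
    """Fraction of material claims that carry an inspectable evidence anchor."""
    if not claims:
        return 0.0
    anchored = sum(1 for claim in claims if claim.get("anchor"))
    return anchored / len(claims)


# Hypothetical claims from one review artifact.
claims = [
    {"text": "handler is only called from the scheduler", "anchor": "src/scheduler.py:88"},
    {"text": "timeout default is unchanged", "anchor": "config/defaults.yaml:12"},
    {"text": "new branch is covered by a test", "anchor": "tests/test_handler.py"},
    {"text": "no external callers exist", "anchor": None},
]

# Hypothetical minutes taken to reach apply, revise, or reject.
minutes_to_decision = [12, 30, 45]

print(anchor_coverage(claims))      # 0.75
print(median(minutes_to_decision))  # 30
```

The median is used here rather than the mean so that one unusually long review does not dominate the figure; that choice is an assumption of this sketch, not a recommendation from the analysis above.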
Faster code generation is valuable only when the system around it keeps review load, rework, and false confidence under control.
See the AI code review use case for the product-facing workflow, or read how a staged trust artifact is constructed.