Statements such as "the retry loop is safe," "all callers use the new API," and "the test covers the race" are claims, not the repository observations that would make them true. Treating prose as evidence is the central failure mode in AI-assisted engineering review.
1. Separate four kinds of claim
| Type | Meaning | Example |
|---|---|---|
| Observed fact | Directly inspected or executed | A named test failed with a captured assertion. |
| Inference | Conclusion derived from facts | The lock boundary likely permits duplicate sessions. |
| Assumption | Required premise not yet verified | The callback handler is the only writer. |
| Recommendation | Proposed action | Move the lock around exchange and acquisition. |
Problems begin when an inference is formatted as an observed fact, or when a recommendation is presented as if implementation and tests already existed. A useful artifact preserves the boundary.
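One way to preserve that boundary is to make the claim type part of the data rather than part of the phrasing. The following is a minimal sketch, not a prescribed schema: the names `ClaimType`, `Claim`, and `render` are hypothetical, and the example sentences are taken from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class ClaimType(Enum):
    """The four kinds of claim from the table above."""
    OBSERVED_FACT = "observed fact"
    INFERENCE = "inference"
    ASSUMPTION = "assumption"
    RECOMMENDATION = "recommendation"


@dataclass(frozen=True)
class Claim:
    text: str
    kind: ClaimType


def render(claim: Claim) -> str:
    """Prefix the claim with its type so the label travels with the text."""
    return f"[{claim.kind.value}] {claim.text}"


if __name__ == "__main__":
    claims = [
        Claim("A named test failed with a captured assertion.", ClaimType.OBSERVED_FACT),
        Claim("The lock boundary likely permits duplicate sessions.", ClaimType.INFERENCE),
        Claim("The callback handler is the only writer.", ClaimType.ASSUMPTION),
        Claim("Move the lock around exchange and acquisition.", ClaimType.RECOMMENDATION),
    ]
    for claim in claims:
        print(render(claim))
```

Because the label is rendered from the `kind` field, an inference cannot be quietly reworded into a fact without someone changing its type.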
2. Build an evidence chain, not a citation decoration
A repository evidence anchor should answer four questions:
- Target: which file, symbol, test, log, or command output was examined?
- Observation: what exactly was found there?
- Relevance: how does that observation affect the claim?
- Coverage: what nearby path or counterexample remains unchecked?
A path alone is not evidence. Neither is a line number that does not contain the claimed behavior. The anchor must connect an observation to the decision and disclose its coverage limits.
Claim → repository target → observation → inference → decision
If any edge is missing, the claim should be downgraded to an assumption or open check rather than silently retained as fact.
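The downgrade rule can be sketched as a small completeness check over the four anchor questions. This is an illustration under assumptions: `EvidenceAnchor`, `missing_edges`, and `classify` are hypothetical names, the test path in the usage example is invented, and the check only detects blank fields. It cannot tell whether an observation is true.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvidenceAnchor:
    target: str       # file, symbol, test, log, or command output examined
    observation: str  # what exactly was found there
    relevance: str    # how the observation affects the claim
    coverage: str     # what nearby path or counterexample remains unchecked


def missing_edges(anchor: EvidenceAnchor) -> list:
    """Return the names of anchor fields that were left blank."""
    return [name for name, value in asdict(anchor).items() if not value.strip()]


def classify(anchor: EvidenceAnchor) -> str:
    """Keep 'observed fact' only for a complete anchor; otherwise downgrade."""
    return "assumption" if missing_edges(anchor) else "observed fact"


if __name__ == "__main__":
    anchor = EvidenceAnchor(
        target="tests/test_session.py::test_duplicate_login",  # hypothetical path
        observation="Assertion failed: two sessions created for one callback.",
        relevance="Shows the lock does not cover exchange and acquisition.",
        coverage="",  # nothing recorded about unchecked paths
    )
    print(classify(anchor), missing_edges(anchor))
```

In the usage example the coverage field is empty, so the claim is reported as an assumption even though the other three fields are filled in.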
3. A practical verification workflow
Step 1: Extract the material claims
Ignore style claims and focus on statements that change the merge decision: root cause, blast radius, compatibility, security boundary, test coverage, migration safety, and rollback behavior.
Step 2: Find the ownership boundary
Start from the named symbol, but inspect callers, configuration, adapters, and tests that own the behavior. A bug often lives at a boundary rather than inside the function named in the prompt.
Step 3: Search for counterevidence
Do not ask only "where is this claim supported?" Ask "what repository fact would make it false?" Search for alternate call paths, feature flags, platform branches, legacy aliases, and error handling.
Step 4: Run bounded checks
A targeted test establishes less than a full suite, and a full suite establishes less than production safety. Record the exact command, exit status, and scope. Do not summarize "tests pass" without scope.
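A bounded check can be recorded with nothing more than the standard library. The sketch below assumes a Python-based verifier. `CheckRecord`, `run_check`, and `summarize` are hypothetical names, and the scope is a free-text field the reviewer must fill in, because the code cannot infer what a command actually covers.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckRecord:
    command: str                # exact command line, shell-quoted
    exit_status: Optional[int]  # None when no exit status was obtained
    scope: str                  # what the check covers, in the reviewer's words
    error: str = ""             # why no exit status exists, if applicable


def run_check(argv: list, scope: str, timeout_s: float = 300.0) -> CheckRecord:
    """Run one command and record the command, its exit status, and its scope."""
    command = shlex.join(argv)
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing executable or timeout: keep the failure instead of hiding it.
        return CheckRecord(command, None, scope, error=f"{type(exc).__name__}: {exc}")
    return CheckRecord(command, done.returncode, scope)


def summarize(record: CheckRecord) -> str:
    """One-line summary that never omits the command or the scope."""
    if record.exit_status is None:
        status = f"no exit status ({record.error})"
    elif record.exit_status == 0:
        status = "passed (exit 0)"
    else:
        status = f"failed (exit {record.exit_status})"
    return f"{status}: `{record.command}` [scope: {record.scope}]"


if __name__ == "__main__":
    # Stand-in command for illustration; a real check would invoke the test runner.
    record = run_check([sys.executable, "-c", "raise SystemExit(3)"], scope="illustration only")
    print(summarize(record))
```

The summary string always carries the command and the scope, so "tests pass" cannot appear on its own.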
Step 5: Reconcile the result
Keep supported claims, downgrade uncertain ones, reject contradicted ones, and emit open checks for missing evidence. "Insufficient evidence" is a valid result when the repository cannot support a stronger verdict.
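The reconciliation step can be expressed as a small decision function. The thresholds here are an assumed policy chosen for illustration, not a rule stated by this article: any contradicting observation rejects the claim, no supporting observation yields "insufficient evidence", and unchecked paths downgrade an otherwise supported claim. `Verdict` and `reconcile` are hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    DOWNGRADED = "downgraded"
    REJECTED = "rejected"
    INSUFFICIENT_EVIDENCE = "insufficient evidence"


def reconcile(supporting: int, contradicting: int, unchecked: int) -> Verdict:
    """Map evidence counts for one claim to a verdict (assumed policy)."""
    if contradicting > 0:
        # Contradicted claims are rejected, not averaged against support.
        return Verdict.REJECTED
    if supporting == 0:
        # Nothing observed either way: the repository cannot support a verdict.
        return Verdict.INSUFFICIENT_EVIDENCE
    if unchecked > 0:
        # Some support, but known gaps remain: keep the claim at lower strength.
        return Verdict.DOWNGRADED
    return Verdict.SUPPORTED


if __name__ == "__main__":
    print(reconcile(supporting=2, contradicting=0, unchecked=1).value)
```

Checking contradiction first is deliberate: a single counterexample outweighs any number of supporting observations.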
4. What failed verification should look like
A verifier should not repair uncertainty with more confident prose. If a file was unavailable, a symbol could not be resolved, or a test could not run, the final artifact should carry that failure forward.
| Failure | Correct response |
|---|---|
| Relevant file not inspected | Mark the claim as assumption or insufficient evidence. |
| Symbol search returns multiple owners | List candidates and request disambiguation. |
| Test cannot run | Report command failure and keep validation open. |
| Reviewers disagree | Preserve the disagreement and the evidence behind each position. |
| Evidence contradicts proposal | Reject or revise the proposal; do not average opinions. |
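Carrying a failure forward is easier to enforce when the report headline is computed from the open checks instead of written by hand. The following is a rough sketch under assumptions: `ReviewReport`, `headline`, and `render_report` are hypothetical names, and the example entries are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewReport:
    supported: list = field(default_factory=list)    # claims with complete evidence
    open_checks: list = field(default_factory=list)  # failures carried forward


def headline(report: ReviewReport) -> str:
    """Derive the headline from the data so open checks cannot be dropped."""
    if report.open_checks:
        return f"insufficient evidence: {len(report.open_checks)} open check(s)"
    if not report.supported:
        return "insufficient evidence: nothing verified"
    return f"verified: {len(report.supported)} claim(s)"


def render_report(report: ReviewReport) -> str:
    """Render the headline followed by every supported claim and open check."""
    lines = [headline(report)]
    lines += [f"  supported: {item}" for item in report.supported]
    lines += [f"  open: {item}" for item in report.open_checks]
    return "\n".join(lines)


if __name__ == "__main__":
    report = ReviewReport(
        supported=["Lock is released in the error path."],
        open_checks=["Race test could not run: test command failed to start."],
    )
    print(render_report(report))
```

With this shape, a test that could not run shows up in the first line of the verdict, not in a footnote.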
5. Reviewer checklist
- Can every merge-critical claim be classified as fact, inference, assumption, or recommendation?
- Does each observed fact have an inspectable repository or execution anchor?
- Were likely counterexamples actively searched for?
- Is test scope stated precisely?
- Are skipped checks and missing context visible in the final verdict?
- Could a reviewer reproduce the decision without trusting the model's tone?
The goal is not to prove that a model never hallucinates. It is to prevent unsupported claims from crossing the engineering trust boundary unnoticed.
See the hallucination-checking use case for the product view, or the execution-flow case study for an example of evidence disappearing between stages.