Verify AI Engineering Claims Against Repository Evidence

"The retry loop is safe," "all callers use the new API," and "the test covers the race" are not equivalent to the repository observations that would make them true. Treating prose as evidence is the central failure mode in AI-assisted engineering review.

1. Separate four kinds of claim

Type	Meaning	Example
Observed fact	Directly inspected or executed	A named test failed with a captured assertion.
Inference	Conclusion derived from facts	The lock boundary likely permits duplicate sessions.
Assumption	Required premise not yet verified	The callback handler is the only writer.
Recommendation	Proposed action	Move the lock around exchange and acquisition.

Problems begin when an inference is formatted as an observed fact, or when a recommendation is presented as if implementation and tests already existed. A useful artifact preserves the boundary.
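
One way to preserve that boundary is to make the claim type an explicit field rather than something implied by tone. The following is a minimal sketch, not a prescribed schema: the names `ClaimKind`, `Claim`, `anchor`, and `boundary_violations` are hypothetical, and the rule it enforces (an observed fact must carry an anchor) is an assumption drawn from the table above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ClaimKind(Enum):
    OBSERVED_FACT = "observed fact"    # directly inspected or executed
    INFERENCE = "inference"            # conclusion derived from facts
    ASSUMPTION = "assumption"          # required premise not yet verified
    RECOMMENDATION = "recommendation"  # proposed action


@dataclass(frozen=True)
class Claim:
    text: str
    kind: ClaimKind
    # Hypothetical field: where the observation was made (file, test, command).
    anchor: Optional[str] = None


def boundary_violations(claims: List[Claim]) -> List[str]:
    """Return the text of claims labeled as observed facts that have no anchor."""
    return [
        c.text
        for c in claims
        if c.kind is ClaimKind.OBSERVED_FACT and not c.anchor
    ]
```

A claim flagged by `boundary_violations` is a candidate for relabeling as an inference or assumption, not necessarily a false statement.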

2. Build an evidence chain, not a citation decoration

A repository evidence anchor should answer four questions:

Target: which file, symbol, test, log, or command output was examined?
Observation: what exactly was found there?
Relevance: how does that observation affect the claim?
Coverage: what nearby path or counterexample remains unchecked?

A path alone is not evidence. Neither is a line number that does not contain the claimed behavior. The anchor must connect an observation to the decision and disclose its coverage limits.

Claim → repository target → observation → inference → decision

If any edge is missing, the claim should be downgraded to an assumption or open check rather than silently retained as fact.
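
The four questions can be sketched as a record whose empty fields trigger the downgrade. This is an illustration under assumptions: `EvidenceAnchor`, `missing_edges`, and `status` are hypothetical names, and treating any unanswered question as a missing edge is one possible reading of the rule above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EvidenceAnchor:
    target: Optional[str] = None       # file, symbol, test, log, or command output
    observation: Optional[str] = None  # what exactly was found there
    relevance: Optional[str] = None    # how the observation affects the claim
    coverage: Optional[str] = None     # what nearby path or counterexample is unchecked


def missing_edges(anchor: EvidenceAnchor) -> List[str]:
    """List the questions the anchor leaves unanswered, in declaration order."""
    return [f.name for f in fields(anchor) if not getattr(anchor, f.name)]


def status(anchor: EvidenceAnchor) -> str:
    """Downgrade the claim when any part of the chain is missing."""
    return "fact" if not missing_edges(anchor) else "assumption or open check"
```

An anchor that names only a path, for example, reports `observation`, `relevance`, and `coverage` as missing and is not treated as a fact.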

3. A practical verification workflow

Step 1: Extract the material claims

Ignore style claims and focus on statements that change the merge decision: root cause, blast radius, compatibility, security boundary, test coverage, migration safety, and rollback behavior.

Step 2: Find the ownership boundary

Start from the named symbol, but inspect callers, configuration, adapters, and tests that own the behavior. A bug often lives at a boundary rather than inside the function named in the prompt.

Step 3: Search for counterevidence

Do not ask only "where is this claim supported?" Ask "what repository fact would make it false?" Search for alternate call paths, feature flags, platform branches, legacy aliases, and error handling.
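
For a claim such as "all callers use the new API," the falsifying repository fact is a remaining reference to the old one. The sketch below searches in-memory source text for such references. The function name, the `sources` mapping, and the symbol names in the usage note are hypothetical, and a plain text search is only a first pass: it will miss aliases, dynamic dispatch, and generated code.

```python
import re
from typing import Dict, List, Tuple


def find_counterexamples(
    sources: Dict[str, str], old_symbol: str
) -> List[Tuple[str, int, str]]:
    """Return (path, line number, line) for each whole-word mention of old_symbol.

    `sources` maps a file path to that file's text.
    """
    pattern = re.compile(r"\b" + re.escape(old_symbol) + r"\b")
    hits = []
    for path, text in sorted(sources.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path, lineno, line.strip()))
    return hits
```

Any hit contradicts the claim until it is explained. An empty result shows only that this particular search found nothing, so the scope of the search belongs in the report.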

Step 4: Run bounded checks

A targeted test establishes less than a full suite, and a full suite establishes less than production safety. Record the exact command, exit status, and scope. Never report "tests pass" without stating which tests ran and what they cover.
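
A check record can be sketched with the standard library's `subprocess.run`, which exposes the exit status as `returncode`. The `CheckRecord` type, the `run_bounded_check` helper, and the scope wording are hypothetical; the command in the usage example is a stand-in for a real test invocation.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CheckRecord:
    command: List[str]  # the exact command that was executed
    exit_status: int    # process return code; 0 conventionally means success
    scope: str          # human-written statement of what the check covers

    def summary(self) -> str:
        outcome = "passed" if self.exit_status == 0 else "failed"
        joined = " ".join(self.command)
        return f"{outcome} (exit {self.exit_status}): {joined} [scope: {self.scope}]"


def run_bounded_check(command: List[str], scope: str) -> CheckRecord:
    """Run one command and keep its command line, exit status, and scope together."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return CheckRecord(command=command, exit_status=completed.returncode, scope=scope)


if __name__ == "__main__":
    # Stand-in command: a real check would invoke the project's test runner.
    record = run_bounded_check(
        [sys.executable, "-c", "raise SystemExit(0)"],
        scope="single targeted check, not the full suite",
    )
    print(record.summary())
```

Because the scope travels with the result, the summary line cannot be shortened to a bare "tests pass" without visibly discarding a field.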

Step 5: Reconcile the result

Keep supported claims, downgrade uncertain ones, reject contradicted ones, and emit open checks for missing evidence. "Insufficient evidence" is a valid result when the repository cannot support a stronger verdict.
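
The reconciliation rule can be written as a small decision function. This is a sketch under stated assumptions: the `Verdict` names and the evidence counters are hypothetical, and the precedence (contradiction first, then missing support, then unchecked paths) is one reasonable ordering rather than the only one.

```python
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    DOWNGRADED = "downgraded"
    REJECTED = "rejected"
    INSUFFICIENT_EVIDENCE = "insufficient evidence"


def reconcile(supporting: int, contradicting: int, unchecked: int) -> Verdict:
    """Map counts of evidence for one claim to a verdict.

    supporting:    observations consistent with the claim
    contradicting: observations that make the claim false
    unchecked:     known paths or counterexamples that were not examined
    """
    if contradicting > 0:
        return Verdict.REJECTED  # contradiction wins over any amount of support
    if supporting == 0:
        return Verdict.INSUFFICIENT_EVIDENCE  # nothing in the repository backs it
    if unchecked > 0:
        return Verdict.DOWNGRADED  # supported, but with open checks remaining
    return Verdict.SUPPORTED
```

Note that no branch upgrades a claim because the surrounding prose sounded confident; only the counts matter.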

4. What failed verification should look like

A verifier should not repair uncertainty with more confident prose. If a file was unavailable, a symbol could not be resolved, or a test could not run, the final artifact should carry that failure forward.

Failure	Correct response
Relevant file not inspected	Mark the claim as an assumption or as insufficient evidence.
Symbol search returns multiple owners	List candidates and request disambiguation.
Test cannot run	Report command failure and keep validation open.
Reviewers disagree	Preserve the disagreement and the evidence behind each position.
Evidence contradicts proposal	Reject or revise the proposal; do not average opinions.
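
Carrying a failure forward can be as simple as a report type that has no way to drop it. The sketch below is illustrative only: `ReviewReport`, its fields, and the status strings are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReviewReport:
    supported: List[str] = field(default_factory=list)
    open_checks: List[str] = field(default_factory=list)

    def record_failure(self, claim: str, reason: str) -> None:
        """Keep the claim visible together with the reason it could not be verified."""
        self.open_checks.append(f"{claim} [unverified: {reason}]")

    def final_status(self) -> str:
        # Any recorded failure keeps validation open, however many claims passed.
        if self.open_checks:
            return "validation open"
        return "validated within stated scope"
```

With this shape, a test that could not run shows up as an open check in the final artifact rather than disappearing behind the claims that did verify.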

5. Reviewer checklist

Can every merge-critical claim be classified as fact, inference, assumption, or recommendation?
Does each observed fact have an inspectable repository or execution anchor?
Were likely counterexamples actively searched for?
Is test scope stated precisely?
Are skipped checks and missing context visible in the final verdict?
Could a reviewer reproduce the decision without trusting the model's tone?

The goal is not to prove that a model never hallucinates. It is to prevent unsupported claims from crossing the engineering trust boundary unnoticed.

See the hallucination-checking use case for the product view, or the execution-flow case study for an example of evidence disappearing between stages.