
Evidence verification

How to Verify AI Engineering Claims Against Repository Evidence

A confident answer can contain correct facts, plausible inference, and invented detail in the same paragraph. Verification starts by separating those categories and demanding an evidence path for material claims.

"The retry loop is safe," "all callers use the new API," and "the test covers the race" are not equivalent to the repository observations that would make them true. Treating prose as evidence is the central failure mode in AI-assisted engineering review.

1. Separate four kinds of claim

| Type | Meaning | Example |
|---|---|---|
| Observed fact | Directly inspected or executed | A named test failed with a captured assertion. |
| Inference | Conclusion derived from facts | The lock boundary likely permits duplicate sessions. |
| Assumption | Required premise not yet verified | The callback handler is the only writer. |
| Recommendation | Proposed action | Move the lock around exchange and acquisition. |

Problems begin when an inference is formatted as an observed fact, or when a recommendation is presented as if implementation and tests already existed. A useful artifact preserves the boundary.
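One way to preserve that boundary is to make the claim type an explicit field rather than something implied by tone. The sketch below is illustrative only: `ClaimKind`, `Claim`, `anchor`, and `effective_kind` are hypothetical names, not part of any existing tool, and the rule that an unanchored "observed fact" is handled as an assumption is one reasonable policy among several.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ClaimKind(Enum):
    OBSERVED_FACT = "observed fact"
    INFERENCE = "inference"
    ASSUMPTION = "assumption"
    RECOMMENDATION = "recommendation"


@dataclass(frozen=True)
class Claim:
    text: str
    kind: ClaimKind
    # Hypothetical field: a test ID, file path plus symbol, or command
    # that a reviewer can inspect. None means nothing was recorded.
    anchor: Optional[str] = None


def effective_kind(claim: Claim) -> ClaimKind:
    """Return the kind a reviewer should treat the claim as.

    A claim labelled as an observed fact but carrying no anchor is
    handled as an assumption, so the label alone cannot promote it.
    """
    if claim.kind is ClaimKind.OBSERVED_FACT and not claim.anchor:
        return ClaimKind.ASSUMPTION
    return claim.kind
```

Under this sketch, `Claim("The retry loop is safe", ClaimKind.OBSERVED_FACT)` is reported as an assumption until someone attaches an anchor to it.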

2. Build an evidence chain, not a citation decoration

A repository evidence anchor should answer four questions:

  1. Target: which file, symbol, test, log, or command output was examined?
  2. Observation: what exactly was found there?
  3. Relevance: how does that observation affect the claim?
  4. Coverage: what nearby path or counterexample remains unchecked?

A path alone is not evidence. Neither is a line number that does not contain the claimed behavior. The anchor must connect an observation to the decision and disclose its coverage limits.

Claim → repository target → observation → inference → decision

If any edge is missing, the claim should be downgraded to an assumption or open check rather than silently retained as fact.
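The chain can be checked mechanically if each edge is stored as its own field. This is a minimal sketch under assumed names (`EvidenceChain`, `missing_edges`, and `status` are hypothetical); it only detects empty edges and says nothing about whether a recorded observation is actually correct.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EvidenceChain:
    # One field per edge: claim -> target -> observation -> inference -> decision
    claim: Optional[str] = None
    target: Optional[str] = None
    observation: Optional[str] = None
    inference: Optional[str] = None
    decision: Optional[str] = None

    def missing_edges(self) -> List[str]:
        """Names of edges that are empty, in chain order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def status(self) -> str:
        """'complete' if every edge is filled, otherwise 'downgrade'.

        'downgrade' means: treat the claim as an assumption or open check.
        """
        return "complete" if not self.missing_edges() else "downgrade"
```

A chain that names a target but records no observation reports `["observation", "inference", "decision"]` as missing, which makes the gap visible instead of leaving it buried in prose.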

3. A practical verification workflow

Step 1: Extract the material claims

Ignore style claims and focus on statements that change the merge decision: root cause, blast radius, compatibility, security boundary, test coverage, migration safety, and rollback behavior.

Step 2: Find the ownership boundary

Start from the named symbol, but inspect callers, configuration, adapters, and tests that own the behavior. A bug often lives at a boundary rather than inside the function named in the prompt.

Step 3: Search for counterevidence

Do not ask only "where is this claim supported?" Ask "what repository fact would make it false?" Search for alternate call paths, feature flags, platform branches, legacy aliases, and error handling.
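A counterevidence search can be as simple as scanning the repository for the thing the claim says should no longer exist. The sketch below assumes a Python codebase and a plain regular-expression search; `find_counterevidence` and its parameters are hypothetical, and a text match is only a lead to inspect, since it can miss dynamic dispatch, aliases, or generated code.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple


def find_counterevidence(
    root: Path, pattern: str, suffixes: Tuple[str, ...] = (".py",)
) -> Iterator[Tuple[Path, int, str]]:
    """Yield (path, line number, stripped line) for each match under root.

    Only files whose suffix is in `suffixes` are read. Each hit is a
    candidate counterexample that still needs human inspection.
    """
    regex = re.compile(pattern)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield path, lineno, line.strip()
```

For the claim "all callers use the new API", a search for calls to the old entry point (say, a hypothetical `fetch_v1(`) either returns nothing within the searched scope or returns concrete lines that contradict the claim. In both cases the searched root and suffixes belong in the record, because an empty result only covers what was scanned.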

Step 4: Run bounded checks

A targeted test establishes less than a full suite, and a full suite establishes less than production safety. Record the exact command, exit status, and scope. Do not summarize "tests pass" without scope.
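Recording command, exit status, and scope together keeps "tests pass" from floating free of what was actually run. The following is a sketch, not a prescribed harness: `CheckRecord` and `run_check` are hypothetical names, and the scope string is supplied by the person running the check because the code cannot infer it.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CheckRecord:
    command: str      # the exact command line, shell-quoted
    exit_status: int  # process return code
    scope: str        # human-written description of what the check covers

    def summary(self) -> str:
        outcome = "passed" if self.exit_status == 0 else "failed"
        return f"{outcome} (exit {self.exit_status}): `{self.command}`; scope: {self.scope}"


def run_check(argv: List[str], scope: str, timeout: float = 600.0) -> CheckRecord:
    """Run one command and record what was run, how it exited, and its scope.

    FileNotFoundError (command missing) and subprocess.TimeoutExpired are
    left to propagate so the caller can report that the check could not run.
    """
    completed = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return CheckRecord(shlex.join(argv), completed.returncode, scope)
```

A call such as `run_check(["pytest", "tests/test_session.py::test_lock", "-q"], scope="one targeted test")` (the test path here is invented for illustration) yields a summary that states its own limits, which is harder to over-read than a bare "tests pass".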

Step 5: Reconcile the result

Keep supported claims, downgrade uncertain ones, reject contradicted ones, and emit open checks for missing evidence. "Insufficient evidence" is a valid result when the repository cannot support a stronger verdict.
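The reconciliation rule is small enough to write down as a lookup. In this sketch the `Evidence` and `Verdict` enums and both functions are hypothetical; the four evidence states are one way to bucket what verification found, and a real review may need finer distinctions.

```python
from enum import Enum
from typing import Dict


class Evidence(Enum):
    SUPPORTS = "supports"        # evidence found and it backs the claim
    PARTIAL = "partial"          # some support, but coverage is incomplete
    CONTRADICTS = "contradicts"  # evidence found and it refutes the claim
    MISSING = "missing"          # nothing could be inspected or executed


class Verdict(Enum):
    KEEP = "keep"
    DOWNGRADE = "downgrade"
    REJECT = "reject"
    INSUFFICIENT = "insufficient evidence"  # emit an open check


_RULES: Dict[Evidence, Verdict] = {
    Evidence.SUPPORTS: Verdict.KEEP,
    Evidence.PARTIAL: Verdict.DOWNGRADE,
    Evidence.CONTRADICTS: Verdict.REJECT,
    Evidence.MISSING: Verdict.INSUFFICIENT,
}


def reconcile(evidence: Evidence) -> Verdict:
    """Map what verification found to the action taken on the claim."""
    return _RULES[evidence]


def reconcile_all(claims: Dict[str, Evidence]) -> Dict[str, Verdict]:
    """Apply reconcile to every claim, keyed by the claim text."""
    return {text: reconcile(found) for text, found in claims.items()}
```

The point of the table form is that there is no path from `MISSING` to `KEEP`: absent evidence can only produce an open check.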

4. What failed verification should look like

A verifier should not repair uncertainty with more confident prose. If a file was unavailable, a symbol could not be resolved, or a test could not run, the final artifact should carry that failure forward.

| Failure | Correct response |
|---|---|
| Relevant file not inspected | Mark the claim as assumption or insufficient evidence. |
| Symbol search returns multiple owners | List candidates and request disambiguation. |
| Test cannot run | Report command failure and keep validation open. |
| Reviewers disagree | Preserve the disagreement and the evidence behind each position. |
| Evidence contradicts proposal | Reject or revise the proposal; do not average opinions. |

5. Reviewer checklist

  • Can every merge-critical claim be classified as fact, inference, assumption, or recommendation?
  • Does each observed fact have an inspectable repository or execution anchor?
  • Were likely counterexamples actively searched for?
  • Is test scope stated precisely?
  • Are skipped checks and missing context visible in the final verdict?
  • Could a reviewer reproduce the decision without trusting the model's tone?

The goal is not to prove that a model never hallucinates. It is to prevent unsupported claims from crossing the engineering trust boundary unnoticed.

See the hallucination-checking use case for the product view, or the execution-flow case study for an example of evidence disappearing between stages.