4.6 Design multi-instance and multi-pass review architectures

When Claude reviews work it just produced, it carries the reasoning it used to generate that work and tends to rationalize its own choices instead of questioning them. In production review pipelines (automated code review, extraction QA), the reliable fixes are architectural: hand the artifact to a fresh independent instance with no generation context, and split large multi-file reviews into focused per-file passes plus a separate cross-file integration pass. These techniques catch subtle bugs that self-review misses and sharply reduce the inconsistent, contradictory findings that a single large review pass produces.

Figure: An independent review instance receives only the diff (no generation transcript), then fans out into per-file local passes plus a cross-file integration pass whose findings are unioned, de-duplicated, and routed by confidence.

Why same-session self-review is weak

When one Claude session generates code and is then asked to review that same code, it still holds the full reasoning context from generation. It already worked out why each choice was correct, so rather than questioning those decisions it tends to rationalize them. This is generator context bias.

Neither an instruction like "now carefully review your own work" nor enabling extended thinking removes the bias, because the biasing material (the assumptions and justifications) is still sitting in the context window. The model reviews through the same lens it wrote with, and extended thinking often just deepens that same line of reasoning.

The exam states this precisely: a model retains reasoning context from generation, making it less likely to question its own decisions in the same session. The correct response is architectural, not a stronger prompt.

Independent review instances (multi-instance)

An independent review instance is a fresh Claude invocation that receives only the artifact to review (the diff, the file, the extracted JSON) and none of the generation transcript. With no prior reasoning to anchor on, it evaluates the work on its merits and catches subtle issues that self-review misses.

In the Claude Agent SDK this is a separate session or agent; in Claude Code CI it is a separate claude -p invocation from the one that produced the change. The generation job writes code, and a distinct review job reads only the resulting diff. This is the same idea as session context isolation from task 3.6: the session that generated the code is less effective at reviewing its own changes than an independent instance.

Key testable ranking: an independent instance beats both self-review instructions and extended thinking for catching subtle problems, because those two operate inside the biased context while the independent instance starts clean.
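A minimal Python sketch of that isolation, assuming a hypothetical GenerationResult record that holds both the diff and the generator's transcript. The names GenerationResult, build_review_prompt, and review_command are illustrative and not part of any SDK; only claude -p and --output-format json come from this lesson. The point is structural: the reviewer's input is built from the diff alone.

```python
from dataclasses import dataclass


@dataclass
class GenerationResult:
    """Hypothetical record of what the generation job produced."""
    diff: str        # the artifact under review
    transcript: str  # the generator's reasoning; must NOT reach the reviewer


def build_review_prompt(diff: str) -> str:
    """Build the reviewer's prompt from the artifact alone."""
    return (
        "You are reviewing a code change you did not write.\n"
        "Report concrete defects only.\n\n"
        "<diff>\n" + diff + "\n</diff>"
    )


def review_command(result: GenerationResult) -> list[str]:
    """Argument list for a fresh, non-interactive review invocation.

    Only result.diff is read; result.transcript is deliberately ignored.
    """
    prompt = build_review_prompt(result.diff)
    return ["claude", "-p", prompt, "--output-format", "json"]


result = GenerationResult(
    diff="- return total\n+ return total or 0",
    transcript="I chose `or 0` because totals are never negative...",
)
cmd = review_command(result)
```

The command list could be handed to subprocess.run in a separate CI job; keeping the transcript out of the function's inputs makes the isolation hard to break by accident.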

Multi-pass review: local passes plus an integration pass

Reviewing many changed files in one prompt dilutes attention. The symptoms are distinctive and heavily tested: detailed feedback on some files but superficial comments on others, obvious bugs missed, and contradictory findings where the model flags a pattern as a bug in one file while approving identical code elsewhere in the same review.

The remedy is prompt chaining into focused passes. Run a per-file local pass on each file for issues contained within that file (null handling, off-by-one errors, local error handling, in-file logic). Then run a separate cross-file integration pass that examines data flow across files: changed function signatures, interface and contract mismatches, shared state, and ordering assumptions.

Because each local pass sees only one file, depth stays consistent and findings stop contradicting each other. The integration pass is where genuinely cross-cutting bugs surface, which is why you cannot simply split the change into smaller reviews and stop there. This is the review-stage application of the decomposition principle in task 1.6.
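The pass structure above can be sketched as a small planning step. This is an illustration under assumptions: ReviewPass and plan_review_passes are hypothetical names, and the focus strings simply restate the issue categories listed in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewPass:
    kind: str               # "local" or "integration"
    files: tuple[str, ...]  # files this pass is allowed to see
    focus: str              # what the pass is asked to look for


LOCAL_FOCUS = "null handling, off-by-one errors, local error handling, in-file logic"
INTEGRATION_FOCUS = (
    "changed function signatures, interface and contract mismatches, "
    "shared state, ordering assumptions"
)


def plan_review_passes(changed_files: list[str]) -> list[ReviewPass]:
    """One local pass per file, then one integration pass over all of them."""
    passes = [ReviewPass("local", (path,), LOCAL_FOCUS) for path in changed_files]
    if changed_files:
        # The integration pass is separate: it sees every changed file at once
        # but is asked only about cross-file concerns.
        passes.append(
            ReviewPass("integration", tuple(changed_files), INTEGRATION_FOCUS)
        )
    return passes


plan = plan_review_passes(["src/orders.ts", "src/positions.ts"])
```

For two changed files this yields three passes: two local passes that each see a single file, and one integration pass that sees both.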

Verification passes with per-finding confidence

A verification pass re-examines the artifact, often with an independent instance, and attaches a confidence level to each finding. Those confidences drive calibrated routing: high-confidence findings can be auto-posted as PR comments, while low-confidence or ambiguous ones are routed to a human or a deeper second look.
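As a rough sketch of that routing step, assuming each finding carries a self-reported confidence between 0.0 and 1.0: the Finding shape, the route_findings name, and the 0.8 default are all hypothetical, and the cutoff should come from calibration rather than from this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    message: str
    confidence: float  # self-reported by the verification pass, 0.0 to 1.0


def route_findings(
    findings: list[Finding], auto_post_threshold: float = 0.8
) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (auto_post, human_review) by per-finding confidence."""
    auto_post: list[Finding] = []
    human_review: list[Finding] = []
    for finding in findings:
        if finding.confidence >= auto_post_threshold:
            auto_post.append(finding)
        else:
            human_review.append(finding)
    return auto_post, human_review


findings = [
    Finding("src/positions.ts", 42, "Possible null dereference", 0.93),
    Finding("src/orders.ts", 10, "Early return may skip cleanup", 0.55),
]
auto_post, human_review = route_findings(findings)
```

Note that nothing is dropped: low-confidence findings change destination, not existence.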

Calibrate the thresholds against a labeled validation set rather than trusting the raw numbers, and keep the scope narrow. Per-finding confidence is a prioritization signal for reviewer attention. It is not a reliable proxy for overall case complexity, and it should not serve as an escalation trigger, because model self-reported confidence is poorly calibrated when used that way (see tasks 5.2 and 5.5).

So use confidence to rank and route the findings coming out of a review, not to decide whether the whole task needs a human.
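One way to calibrate the auto-post cutoff against a labeled validation set is sketched below. It assumes the labeled data is a list of (confidence, is_true_positive) pairs from past reviews and that the goal is a target precision for auto-posted comments; calibrate_threshold and the 0.9 target are hypothetical choices, not a prescribed method.

```python
from typing import Optional


def calibrate_threshold(
    labeled: list[tuple[float, bool]], target_precision: float = 0.9
) -> Optional[float]:
    """Return the lowest confidence cutoff whose selected findings meet the
    target precision on the labeled set, or None if no cutoff does."""
    for cutoff in sorted({conf for conf, _ in labeled}):
        # Findings that would be auto-posted at this cutoff.
        selected = [is_real for conf, is_real in labeled if conf >= cutoff]
        precision = sum(selected) / len(selected)
        if precision >= target_precision:
            return cutoff
    return None


# Hypothetical validation data: (self-reported confidence, human-verified label).
labeled = [
    (0.95, True),
    (0.90, True),
    (0.85, False),
    (0.80, True),
    (0.60, False),
    (0.50, True),
]
cutoff = calibrate_threshold(labeled, target_precision=0.9)
```

On this toy data the raw score 0.85 belongs to a false positive, so the cutoff that reaches 90% precision is 0.9, which shows why the raw numbers cannot be trusted without checking them against labels.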

Composing the architecture in a CI pipeline

Put the pieces together for a scenario-5 style automated review. Generation and review run as separate jobs, which gives you the independent instance for free. The review job runs claude -p in non-interactive mode and emits machine-parseable findings with --output-format json and --json-schema, so results can be posted as inline PR comments.

For a large PR the review job fans out: one local pass per changed file plus one integration pass over the full diff, then it aggregates the findings. On re-runs after new commits, include the prior findings in context and instruct Claude to report only new or still-unaddressed issues, which avoids duplicate comments.

# review job, independent from the generator
claude -p "Review src/orders.ts for local issues" \
  --output-format json --json-schema findings.schema.json

This mirrors the prompt-chaining decomposition from task 1.6, applied to the review stage itself rather than to the original implementation task.
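The re-run behavior described above (carry prior findings into context and ask only for new or still-unaddressed issues) might be assembled like this. The function name, the tag names, and the finding fields are assumptions made for illustration; the exact prompt wording would need tuning in practice.

```python
import json


def build_rerun_prompt(diff: str, prior_findings: list[dict]) -> str:
    """Prompt for a re-run after new commits, carrying earlier findings."""
    prior = json.dumps(prior_findings, indent=2, sort_keys=True)
    return (
        "Review the updated diff below.\n"
        "The findings in <prior_findings> were already posted on this PR.\n"
        "Report only issues that are new, or prior issues that are still "
        "unaddressed; do not repeat findings that have been fixed.\n\n"
        f"<prior_findings>\n{prior}\n</prior_findings>\n\n"
        f"<diff>\n{diff}\n</diff>"
    )


prompt = build_rerun_prompt(
    diff="+ x = 1",
    prior_findings=[
        {"file": "src/orders.ts", "line": 12, "message": "Possible null dereference"}
    ],
)
```

The review instance stays independent in the sense that matters: it sees earlier review output, not the generator's reasoning.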

Anti-patterns to avoid

Avoid: Add "now review your own work carefully" or enable extended thinking in the same session that wrote the code.

Why it fails: The generation reasoning still occupies the context, so the model rationalizes its own choices instead of questioning them; extended thinking tends to deepen the same biased line of thought rather than break out of it.

Instead: Run the review in a second, independent instance that sees only the artifact and none of the generation transcript.

Avoid: Review all changed files together in one large single pass.

Why it fails: Attention dilutes across files, producing uneven depth, missed bugs, and contradictory findings such as flagging a pattern as a bug in one file while approving the identical pattern in another.

Instead: Prompt-chain per-file local passes for in-file issues plus one separate cross-file integration pass for data-flow and contract issues.

Avoid: Run N independent passes and only report findings that appear in at least K of them (majority vote).

Why it fails: Real bugs are often caught only intermittently, so consensus gating suppresses exactly those findings and lowers recall, defeating the purpose of running multiple reviews.

Instead: Union the findings across passes and de-duplicate; use agreement to rank or prioritize findings, not to filter genuine issues out.

Avoid: Switch to a larger-context model so all files fit in one prompt.

Why it fails: The bottleneck is attention quality within the window, not raw capacity; more tokens do not restore consistent per-file depth or stop contradictory findings.

Instead: Split the review into focused per-file and integration passes regardless of the context-window size.
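The union-and-rank alternative to K-of-N voting can be sketched as follows. It assumes findings are dicts that can be de-duplicated on a (file, line, rule) key; that key, and the function name union_findings, are assumptions for illustration, since real model output usually needs fuzzier matching.

```python
from collections import Counter


def union_findings(passes: list[list[dict]]) -> list[dict]:
    """Union findings across passes, de-duplicate, and rank by agreement.

    Agreement only orders the output; a finding seen once is still kept.
    """
    first_seen: dict[tuple, dict] = {}
    votes: Counter = Counter()
    for findings in passes:
        keys_in_pass = set()
        for finding in findings:
            key = (finding["file"], finding["line"], finding["rule"])
            first_seen.setdefault(key, finding)
            keys_in_pass.add(key)
        votes.update(keys_in_pass)  # each pass counts at most once per finding
    ranked = sorted(first_seen, key=lambda k: (-votes[k], k))
    return [{**first_seen[k], "agreement": votes[k]} for k in ranked]


null_deref = {"file": "src/positions.ts", "line": 42, "rule": "null-dereference"}
early_return = {"file": "src/orders.ts", "line": 10, "rule": "early-return"}

# The null dereference is caught by only one of three passes.
merged = union_findings([[early_return], [early_return, null_deref], [early_return]])
```

A 2-of-3 vote would have discarded the null dereference here; the union keeps it and simply ranks it below the finding all three passes agreed on.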

Worked example: Fixing an inconsistent 14-file CI review

Situation. In your CI pipeline (scenario 5), a Claude Code job generates and refactors code, and a review step analyzes the resulting PR. A PR touches 14 files in the stock-tracking module. The single-pass review that reads all 14 files at once returns detailed feedback for a few files, superficial comments for the rest, misses an obvious null-dereference, and even flags an early-return pattern as a bug in positions.ts while approving the identical pattern in orders.ts.

Diagnosis. Two separate problems. First, the review may be running in the same session that generated the code, so it inherits generator context bias. Second, and the dominant issue here, one pass over 14 files causes attention dilution, which is exactly what produces uneven depth and contradictory findings.

Restructure.

1. Make review a distinct claude -p job from generation so it is an independent instance with no generation transcript.
2. Fan out one local pass per file for in-file issues:

# one fresh, independent invocation per changed file
for f in $(git diff --name-only origin/main); do
  claude -p "Review $f for local issues (null handling, error paths, logic)." \
    --output-format json --json-schema findings.schema.json
done

3. Run one cross-file integration pass over the full diff for data-flow and contract issues (changed signatures, shared state, ordering assumptions).
4. Aggregate: union all findings, de-duplicate, attach per-finding confidence, auto-post high-confidence items and route the rest to a human.

Why not the tempting alternatives. Asking developers to split the PR shifts the burden onto humans without improving the system. A bigger context window does not fix attention quality. Running three full passes and keeping only issues seen in two of three would suppress the intermittently-caught null-dereference. Splitting into local passes plus an integration pass attacks the root cause directly, which is why it is the correct answer.
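To make the cost of 2-of-3 voting concrete, here is a back-of-the-envelope calculation. It assumes each pass catches the bug independently with the same probability p; both the independence assumption and the value p = 0.4 are hypothetical and chosen only to illustrate an intermittently caught bug.

```python
def union_recall(p: float, n: int = 3) -> float:
    """Chance that at least one of n independent passes reports the bug."""
    return 1 - (1 - p) ** n


def two_of_three_recall(p: float) -> float:
    """Chance that at least two of three independent passes report the bug."""
    exactly_two = 3 * p**2 * (1 - p)
    all_three = p**3
    return exactly_two + all_three


p = 0.4  # hypothetical per-pass detection rate for an intermittent bug
print(round(union_recall(p), 3))         # 0.784
print(round(two_of_three_recall(p), 3))  # 0.352
```

Under these assumptions the union surfaces the bug about 78% of the time, while the 2-of-3 gate surfaces it about 35% of the time, so the vote does worse than a single pass at p = 0.4.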

Exam tips

✓ Same-session self-review is weak because the model retains its generation reasoning context; an independent instance with no generation context catches more subtle issues than self-review instructions or extended thinking.
✓ The signature symptom that a review needs multi-pass splitting is inconsistent depth plus contradictory findings, for example flagging a pattern as a bug in one file while approving identical code in another.
✓ The multi-pass structure is per-file local passes (in-file issues) plus one separate cross-file integration pass (data flow, contracts, shared state).
✓ A larger context window does NOT fix attention dilution; splitting the work into focused passes is what fixes it.
✓ Majority-vote or K-of-N consensus filtering across runs suppresses real bugs caught only intermittently; prefer the union of findings and use agreement only to rank.
✓ Per-finding confidence self-reporting is for calibrated routing of review findings (calibrated against labeled sets), not a reliable proxy for case complexity or an escalation trigger.

Official exam objectives for 4.6

Knowledge of

Self-review limitations: a model retains reasoning context from generation, making it less likely to question its own decisions in the same session
Independent review instances (without prior reasoning context) are more effective at catching subtle issues than self-review instructions or extended thinking
Multi-pass review: splitting large reviews into per-file local analysis passes plus cross-file integration passes to avoid attention dilution and contradictory findings

Skills in

Using a second independent Claude instance to review generated code without the generator's reasoning context
Splitting large multi-file reviews into focused per-file passes for local issues plus separate integration passes for cross-file data flow analysis
Running verification passes where the model self-reports confidence alongside each finding to enable calibrated review routing

Flashcards from this lesson

Why is same-session self-review unreliable?

The model retains the reasoning context it used to generate the work, so it rationalizes its own decisions rather than questioning them (generator context bias).

What beats self-review instructions and extended thinking for catching subtle bugs?

A second, independent Claude instance that sees only the artifact and none of the generation transcript.

What symptoms signal that a large multi-file review needs to be split into passes?

Inconsistent depth (detailed on some files, superficial on others), missed obvious bugs, and contradictory findings such as flagging a pattern in one file but approving identical code in another.

What is per-finding confidence self-reporting used for, and what is it NOT for?

It is for calibrated routing of review findings (auto-post vs human review), calibrated against labeled sets. It is NOT a reliable proxy for case complexity or an escalation trigger.

Does switching to a larger context window fix a diluted single-pass review?

No. The bottleneck is attention quality within the window, not capacity; you fix it by splitting the review into focused passes.

What are the two pass types in a multi-pass review?

Per-file local passes for in-file issues, plus one separate cross-file integration pass for data flow, changed signatures, contracts, and shared state.

Why is K-of-N majority voting a bad way to filter review findings?

Real bugs are often caught only intermittently, so consensus gating suppresses them and lowers recall; union the findings instead and use agreement only to rank.
