5.5 · 12 min

Design human review workflows and confidence calibration

Before an extraction pipeline is allowed to auto-accept results without a human in the loop, you have to prove it is safe, and a single aggregate accuracy number is not proof. A headline like 97% can hide a document type or a field that is failing badly, so you validate accuracy by segment (document type and field), have the model emit field-level confidence, and calibrate routing thresholds against a labeled validation set. Then you keep watching the automated lane with stratified random sampling to catch novel error patterns before they cause harm, routing low-confidence and ambiguous cases to your limited pool of human reviewers.

Segment the accuracy metric first (the aggregate hides a failing handwritten segment and a weak tax_id field), emit field-level confidence, calibrate per-segment thresholds on a labeled set, then split into auto-accept and human-review lanes while a stratified random sampling audit continuously measures the automated lane and detects novel patterns.

Aggregate accuracy is a trap

A single headline metric like "97% accurate" is an average over the whole document population, and averages are dominated by the common case. If 90% of documents are clean typed invoices scoring 99% and 10% are handwritten receipts scoring 62%, the blended number still reads about 95% (0.90 × 99% + 0.10 × 62% = 95.3%) and looks production-ready, while one entire segment is quietly broken.
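The arithmetic is worth doing once by hand. A minimal sketch, using the volume shares and accuracies from the paragraph above (the dictionary layout and the function name are illustrative, not part of any library):

```python
# Segment mix from the paragraph above: (share of volume, accuracy in segment).
segments = {
    "typed_invoice": (0.90, 0.99),
    "handwritten_receipt": (0.10, 0.62),
}


def blended_accuracy(segments):
    """Volume-weighted average accuracy across all segments."""
    return sum(share * acc for share, acc in segments.values())


overall = blended_accuracy(segments)
print(f"aggregate: {overall:.1%}")
for name, (share, acc) in segments.items():
    print(f"{name}: {acc:.0%} accurate on {share:.0%} of volume")
```

The weighted sum is 0.891 + 0.062 = 0.953, so the headline figure stays in the mid-90s even though one document in ten belongs to a segment that is wrong more than a third of the time.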

The exam frames this precisely: aggregate accuracy metrics may mask poor performance on specific document types or fields. The failure has two axes: document type (typed versus handwritten, one vendor's layout versus another) and field (invoice_number is easy, a multi-line tax breakdown is hard). A field can sit at 70% while the document as a whole scores well because the other fields carry the average.

The design consequence: you must validate accuracy by document type and field segment before automating high-confidence extractions or reducing human review. The decision to remove a human is made per segment, never for the pipeline as a whole based on one blended figure.
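One way to make the per-segment check concrete is to compute accuracy for every (document type, field) pair from labeled results and list the pairs that miss the bar. This is a sketch only: the record keys doc_type, field, and correct are assumed names, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """Return {(doc_type, field): accuracy} from labeled field-level results.

    Each record is a dict with the assumed keys doc_type, field, correct.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        key = (r["doc_type"], r["field"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {key: hits[key] / totals[key] for key in totals}


def segments_below_bar(records, bar):
    """Sorted list of segments whose measured accuracy is below the bar."""
    by_segment = accuracy_by_segment(records)
    return sorted(key for key, acc in by_segment.items() if acc < bar)


# Tiny illustrative sample (made-up rows, not real measurements).
sample = [
    {"doc_type": "typed", "field": "total", "correct": True},
    {"doc_type": "typed", "field": "total", "correct": True},
    {"doc_type": "handwritten", "field": "total", "correct": True},
    {"doc_type": "handwritten", "field": "total", "correct": False},
]
weak = segments_below_bar(sample, bar=0.95)
print(weak)
```

In practice each segment needs enough labeled rows for its accuracy to be meaningful; a segment with a handful of examples should be treated as unmeasured rather than as passing.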

Field-level confidence, not one score per document

Confidence should be reported per field, because difficulty varies field by field within the same document. Ask the model to emit a confidence value alongside each extracted value, which you can do directly in the tool_use output schema from task 4.3.

{
  "invoice_number": { "value": "INV-4471",   "confidence": 0.98 },
  "total":          { "value": 1840.00,      "confidence": 0.95 },
  "tax_id":         { "value": "12-3456789", "confidence": 0.71 }
}

A single whole-document confidence hides the pattern that matters. A receipt can be extracted with high confidence on eight easy fields and low confidence on the one hard field, and it is that one field you need to route to a human. Field-level scores let you review only the risky field instead of discarding, or blindly accepting, the whole extraction.
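A small sketch of field-level routing over the example output shown earlier. The threshold values here are placeholders chosen for illustration, and the function name is hypothetical; real cutoffs should come from calibration, not from this snippet.

```python
def fields_needing_review(extraction, thresholds, default_threshold=1.01):
    """Return names of fields whose confidence is below their threshold.

    Fields with no configured threshold fall back to default_threshold,
    which is above 1.0, so unmeasured fields always go to a human.
    """
    flagged = []
    for name, item in extraction.items():
        if item["confidence"] < thresholds.get(name, default_threshold):
            flagged.append(name)
    return flagged


extraction = {
    "invoice_number": {"value": "INV-4471", "confidence": 0.98},
    "total": {"value": 1840.00, "confidence": 0.95},
    "tax_id": {"value": "12-3456789", "confidence": 0.71},
}
# Placeholder thresholds for illustration only.
thresholds = {"invoice_number": 0.90, "total": 0.90, "tax_id": 0.97}

to_review = fields_needing_review(extraction, thresholds)
print(to_review)
```

With these placeholder cutoffs only tax_id is flagged, so the reviewer checks one field and the other two values are kept.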

Calibrate confidence against a labeled validation set

Raw model confidence is not trustworthy out of the box. A model that reports 0.9 is not necessarily correct 90% of the time, so you cannot pick a threshold by intuition. Calibration means running the model over a labeled validation set (documents with known-correct answers), grouping predictions by reported confidence, and measuring the actual observed accuracy at each confidence level.
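The grouping step can be sketched as a simple reliability table. This assumes you already have, for one segment, a list of (reported confidence, was it correct) pairs from the labeled validation set; the function name and the choice of ten equal-width bins are illustrative assumptions.

```python
from collections import defaultdict


def reliability_table(predictions, n_bins=10):
    """Group (confidence, correct) pairs into equal-width confidence bins.

    Returns a list of (bin_low, bin_high, count, observed_accuracy),
    skipping empty bins.
    """
    counts = defaultdict(int)
    hits = defaultdict(int)
    for confidence, correct in predictions:
        # Clamp so that confidence == 1.0 lands in the top bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        counts[idx] += 1
        hits[idx] += int(correct)
    return [
        (i / n_bins, (i + 1) / n_bins, counts[i], hits[i] / counts[i])
        for i in sorted(counts)
    ]


# Made-up validation results for a single segment.
predictions = [
    (0.95, True), (0.92, True), (0.91, False), (0.97, True),
    (0.55, False), (0.58, True),
]
table = reliability_table(predictions)
for low, high, count, accuracy in table:
    print(f"{low:.1f}-{high:.1f}: n={count}, observed accuracy={accuracy:.0%}")
```

In this toy data the 0.9 to 1.0 bin is only 75% accurate, which is the kind of gap between reported and observed accuracy that calibration is meant to expose. Run it separately for each field and document type.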

The result is a mapping from reported confidence to observed accuracy, computed per segment. You then set the auto-accept threshold where observed accuracy clears your quality bar. That threshold is usually different for each field and document type: 0.95 confidence on invoice_number might correspond to 99% real accuracy, while 0.95 on tax_id corresponds to only 88%, so the two fields need different cutoffs.
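One possible way to turn the calibration data into a cutoff is to look for the lowest confidence at which everything at or above it still meets the bar. The sketch below takes that approach; the function name, the min_support parameter, and the sample data are assumptions for illustration, and other selection rules are equally defensible.

```python
def pick_threshold(predictions, bar, min_support=1):
    """Lowest confidence cutoff whose accepted set meets the accuracy bar.

    predictions: iterable of (confidence, correct) for ONE segment.
    Returns None if no cutoff with at least min_support accepted items
    clears the bar.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    best = None
    hits = 0
    for i, (confidence, correct) in enumerate(ranked, start=1):
        hits += int(correct)
        # Evaluate only at the last item of a run of tied confidences, so
        # the rule "confidence >= cutoff" accepts exactly the items counted.
        is_last_of_tie = i == len(ranked) or ranked[i][0] != confidence
        if is_last_of_tie and i >= min_support and hits / i >= bar:
            best = confidence
    return best


# Made-up validation data for three segments.
easy = [(0.99, True), (0.95, True), (0.92, True), (0.85, False)]
hard = [(0.99, True), (0.95, False), (0.90, True)]
hopeless = [(0.99, False), (0.90, True)]

print(pick_threshold(easy, bar=0.99))      # cutoff for the easy segment
print(pick_threshold(hard, bar=0.99))      # stricter cutoff for the hard one
print(pick_threshold(hopeless, bar=0.99))  # None: never safe to auto-accept
```

A result of None is a legitimate outcome: it means the segment does not clear the bar at any confidence and should stay with a human. With real data, set min_support high enough that the measured accuracy above the cutoff is not based on a handful of items.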

This is the nuance that separates 5.5 from 5.2 and 4.6. Uncalibrated self-reported confidence is an unreliable proxy (task 5.2 warns against using it as an escalation trigger, since the model is often confidently wrong on hard cases). Calibration against labeled data is exactly what turns a raw, untrustworthy number into a usable routing signal. Confidence is only as good as its calibration.

Stratified random sampling of the automated lane

Once high-confidence extractions bypass human review, you still need to know their real error rate, and you cannot review all of them without defeating the automation. So you sample a subset for ongoing measurement. Simple random sampling is not enough: rare document types are under-represented, so a segment that is 3% of volume but failing badly may never appear in the sample.

Stratified random sampling divides the auto-accepted population into strata (by document type, by field, and by confidence band) and samples within each stratum, guaranteeing coverage of rare and high-risk segments. This produces an ongoing, per-segment error-rate measurement rather than a one-time launch gate.
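A minimal sketch of the sampling step, assuming each auto-accepted item is a dict with doc_type, field, and confidence keys. The band boundaries and the fixed per-stratum quota are illustrative choices, not a prescribed scheme.

```python
import random
from collections import defaultdict


def confidence_band(confidence):
    """Illustrative banding; choose bands that suit your own thresholds."""
    if confidence >= 0.99:
        return "0.99+"
    if confidence >= 0.95:
        return "0.95-0.99"
    return "<0.95"


def stratified_sample(items, per_stratum, seed=0):
    """Sample up to per_stratum auto-accepted items from every stratum.

    Strata are (doc_type, field, confidence band). Small strata are taken
    whole, so rare segments are always represented.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        band = confidence_band(item["confidence"])
        strata[(item["doc_type"], item["field"], band)].append(item)
    return {
        key: rng.sample(members, min(per_stratum, len(members)))
        for key, members in strata.items()
    }


# 97 common items and 3 rare ones (made-up population).
items = [{"doc_type": "typed", "field": "total", "confidence": 0.99}] * 97
items += [{"doc_type": "handwritten", "field": "total", "confidence": 0.96}] * 3

sample = stratified_sample(items, per_stratum=5)
for key, chosen in sample.items():
    print(key, len(chosen))
```

The rare stratum contributes all three of its items, whereas a simple random draw of eight from this population would often contain none. Because each stratum is sampled at a different rate, weight the per-stratum error rates by stratum volume if you also want an overall figure.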

The second job of this sampling is detecting novel error patterns. Distributions drift over time: a new vendor layout appears, a form template changes, a new document type enters the stream. The model can be confidently wrong on these, so they land in the auto-accept lane undetected. Stratified sampling is the mechanism that surfaces these emerging failures before they accumulate into a large silent error rate.
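As a sketch of how audit results might be turned into alerts, the function below flags any stratum that was never validated or whose sampled error rate exceeds a bar. The data shapes, the function name, and the 2% bar are assumptions for illustration.

```python
def flag_strata(audit_results, known_strata, max_error_rate):
    """Return {stratum: reason} for strata that look unsafe in an audit.

    audit_results: {stratum: (n_reviewed, n_wrong)} from human-checked samples.
    known_strata: set of strata that were validated before launch.
    """
    flagged = {}
    for stratum, (n_reviewed, n_wrong) in audit_results.items():
        if stratum not in known_strata:
            flagged[stratum] = "new stratum, never validated"
        elif n_reviewed and n_wrong / n_reviewed > max_error_rate:
            rate = n_wrong / n_reviewed
            flagged[stratum] = f"error rate {rate:.0%} above bar"
    return flagged


# Made-up audit of the auto-accept lane.
known = {("typed", "total"), ("statement", "total")}
audit = {
    ("typed", "total"): (50, 0),
    ("statement", "total"): (20, 4),
    ("new_vendor", "total"): (5, 2),
}
flagged = flag_strata(audit, known, max_error_rate=0.02)
for stratum, reason in flagged.items():
    print(stratum, "->", reason)
```

Treating an unseen stratum as an alert in its own right matters because a new layout or document type has no validated threshold yet, whatever its first few samples look like.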

Routing and prioritizing limited reviewer capacity

Human reviewers are the scarce resource, so routing is about spending their attention where it changes outcomes. Route to human review: extractions with low calibrated confidence, and extractions from ambiguous or contradictory source documents (a document that states two different totals, or a scan too degraded to read reliably). Auto-accept: high-confidence extractions in segments you have already validated.

Confidence is the prioritization signal. When reviewer capacity is limited, rank the queue so the lowest-confidence and highest-value items are seen first, and let the clearly-fine cases through. This is the same per-finding-confidence-for-routing idea from task 4.6, applied to extraction QA rather than to code-review findings.
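Queue ordering can be as simple as a sort. The sketch below assumes one particular rule, lowest confidence first with ties broken by highest value; the item keys are hypothetical, and a weighted score combining both signals would be an equally reasonable design.

```python
def prioritize(queue):
    """Order review items: lowest confidence first, then highest value."""
    return sorted(queue, key=lambda item: (item["confidence"], -item["value"]))


# Made-up review queue.
queue = [
    {"id": "A", "confidence": 0.62, "value": 120.0},
    {"id": "B", "confidence": 0.40, "value": 90.0},
    {"id": "C", "confidence": 0.40, "value": 5000.0},
]
ordered_ids = [item["id"] for item in prioritize(queue)]
print(ordered_ids)
```

Items B and C share the lowest confidence, so the higher-value C is reviewed first, followed by B and then A.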

Note the two distinct triggers for a human: model uncertainty (low confidence) and source-document uncertainty (ambiguous or self-contradictory input). A document can be genuinely unresolvable even when the model is confident, which is why ambiguous sources route to a human regardless of the confidence score.
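The two triggers can be kept independent in code. In this sketch, source_flags stands for whatever upstream check detects problems in the document itself; that check, the flag names, and the function name are all assumptions made for illustration.

```python
def route(extraction, thresholds, source_flags):
    """Return ("human", reasons) or ("auto", []).

    source_flags: assumed list of problems found in the source document,
    for example ["two_totals"] or ["unreadable_scan"].
    """
    # Trigger 1: the source itself is ambiguous or contradictory.
    reasons = [f"source:{flag}" for flag in source_flags]
    # Trigger 2: the model is uncertain about a field (or the field has
    # no calibrated threshold yet).
    for name, item in extraction.items():
        threshold = thresholds.get(name)
        if threshold is None or item["confidence"] < threshold:
            reasons.append(f"low_confidence:{name}")
    return ("human", reasons) if reasons else ("auto", [])


extraction = {"total": {"value": 1840.00, "confidence": 0.99}}
thresholds = {"total": 0.90}

confident_but_ambiguous = route(extraction, thresholds, ["two_totals"])
clean = route(extraction, thresholds, [])
print(confident_but_ambiguous)
print(clean)
```

The first call goes to a human even though the model reports 0.99 on the only field, because the source flag is evaluated separately from the confidence score.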

The deployment gate: validate per segment before removing the human

Putting it together, the decision to reduce human review is gated on segment-level evidence, not on the headline number. For each document type and field, confirm on the labeled validation set that calibrated high-confidence accuracy clears your bar. Automate only those segments, and keep a human in the loop for segments that fail or that you have not measured yet.
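The gate itself reduces to a per-segment lookup in which missing evidence counts as a failure. A minimal sketch, with hypothetical segment keys and made-up validation numbers:

```python
def automation_plan(all_segments, validated_accuracy, bar):
    """Map each segment to "auto" or "human".

    validated_accuracy: {segment: accuracy of high-confidence extractions
    on the labeled validation set}. Segments missing from it are treated
    as unmeasured and stay with a human.
    """
    plan = {}
    for segment in all_segments:
        accuracy = validated_accuracy.get(segment)
        passed = accuracy is not None and accuracy >= bar
        plan[segment] = "auto" if passed else "human"
    return plan


# Made-up segments and validation results.
all_segments = [
    ("typed", "total"),
    ("typed", "tax_id"),
    ("handwritten", "total"),
    ("statement", "total"),  # not measured yet
]
validated = {
    ("typed", "total"): 0.995,
    ("typed", "tax_id"): 0.88,
    ("handwritten", "total"): 0.64,
}
plan = automation_plan(all_segments, validated, bar=0.99)
for segment, lane in plan.items():
    print(segment, "->", lane)
```

Only the segment with measured evidence above the bar is automated; the unmeasured statement segment defaults to human review rather than inheriting the result of its neighbors.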

After launch, the stratified sampling audit keeps the gate honest over time. If a segment's sampled error rate rises, or a novel pattern appears, that segment goes back to human review. Automation coverage expands and contracts based on measured, segmented accuracy, and never on a single aggregate figure that can drift out from under you.

Anti-patterns to avoid

Avoid: Turn off human review because overall accuracy is 97%.

Why it fails: The aggregate averages over the whole population and masks tail segments; a document type or field failing at 60-70% is invisible in the blended number, so you would ship those failures straight to downstream systems.

Instead: Break accuracy down by document type and field, and automate only the segments that clear your bar; keep the human in the loop for weak or unmeasured segments.

Avoid: Pick an auto-accept threshold from the raw model confidence (e.g. accept everything above 0.9) by intuition.

Why it fails: Model confidence is poorly calibrated out of the box, so 0.9 does not mean 90% accurate, and the true accuracy at a given confidence differs by field and document type.

Instead: Calibrate against a labeled validation set: measure observed accuracy at each confidence level per segment, then set per-field, per-doc-type thresholds where observed accuracy meets your bar.

Avoid: Spot-check the automated lane with simple random sampling.

Why it fails: Simple random sampling under-represents rare document types and confidence bands, so a small-but-broken segment or an emerging novel pattern can go unmeasured indefinitely.

Instead: Use stratified random sampling across document type, field, and confidence band so rare and high-risk strata are always covered and drift is detected.

Avoid: Emit one confidence score for the whole extraction and route on it.

Why it fails: A single score hides per-field variation, so a document that is correct on nine fields and wrong on one hard field looks uniformly confident and the wrong field never gets reviewed.

Instead: Have the model output field-level confidence, route only the low-confidence fields to a human, and keep the rest.

Worked example: Deciding what to automate in an invoice extraction pipeline

Situation (scenario 6). You run a structured extraction system over incoming invoices and receipts using tool_use with a strict JSON schema (task 4.3). Every extraction is currently checked by a human. Leadership sees 97% aggregate accuracy on the test set and asks you to drop human review to cut cost.

Step 1, segment the metric. Break the 97% down by document type and field. You find: typed vendor invoices 99.2%, PDF statements 98.0%, but handwritten receipts 64%. By field: invoice_number 99.5%, total 97%, tax_id 79%. The aggregate was masking two weak segments. Removing the human wholesale would ship those failures downstream.

Step 2, emit field-level confidence. Extend the extraction schema so each field carries its own confidence:

{
  "total":  { "value": 1840.00,      "confidence": 0.96 },
  "tax_id": { "value": "12-3456789", "confidence": 0.68 }
}

Step 3, calibrate. On a labeled validation set, bin predictions by reported confidence per field and measure observed accuracy. You learn that total is at least 99% accurate above confidence 0.90, that tax_id only reaches that accuracy above 0.97, and that handwritten receipts never clear the bar at any confidence.

Step 4, set routing. Auto-accept typed-invoice fields above their calibrated per-field thresholds. Route to a human: any field below threshold, every handwritten receipt (segment not safe at any confidence), and any document with contradictory sources (two stated totals, unreadable scan). Prioritize the reviewer queue by lowest confidence and highest invoice value so limited capacity goes to the riskiest, most expensive cases first.

Step 5, keep watching. Run stratified random sampling across document types, fields, and confidence bands on the auto-accepted lane on an ongoing basis. When a new vendor's layout starts producing confident-but-wrong totals, the stratified sample catches the novel pattern and that segment is returned to human review.

Why this is the exam-correct shape. It refuses to trust the aggregate, segments by document type and field, uses calibrated field-level confidence for routing, and audits the automated lane with stratified sampling, matching every knowledge and skill bullet for task 5.5.

Exam tips

✓ A high aggregate accuracy (e.g. 97%) can mask poor performance on specific document types or fields; always segment by document type AND field before automating or reducing human review.
✓ Use stratified random sampling, not simple random sampling, to measure the error rate of auto-accepted high-confidence extractions and to detect novel or emerging error patterns; simple random under-samples rare strata.
✓ Prefer field-level confidence over one whole-document score, and calibrate thresholds against a labeled validation set, per field and per document type; raw model confidence is poorly calibrated and 0.9 does not mean 90% accurate.
✓ Route to human review on two independent triggers: low (calibrated) model confidence, and ambiguous or contradictory source documents; use confidence to prioritize a limited reviewer queue.
✓ Validate accuracy by document type and field segment BEFORE automating high-confidence extractions; the automate/keep-human decision is per segment, not pipeline-wide.
✓ Calibration is what makes confidence usable: contrast with task 5.2, where uncalibrated self-reported confidence is an unreliable escalation proxy because the model is often confidently wrong on hard cases.

Official exam objectives for 5.5

Knowledge of

The risk that aggregate accuracy metrics (e.g., 97% overall) may mask poor performance on specific document types or fields
Stratified random sampling for measuring error rates in high-confidence extractions and detecting novel error patterns
Field-level confidence scores calibrated using labeled validation sets for routing review attention
The importance of validating accuracy by document type and field segment before automating high-confidence extractions

Skills in

Implementing stratified random sampling of high-confidence extractions for ongoing error rate measurement and novel pattern detection
Analyzing accuracy by document type and field to verify consistent performance across all segments before reducing human review
Having models output field-level confidence scores, then calibrating review thresholds using labeled validation sets
Routing extractions with low model confidence or ambiguous/contradictory source documents to human review, prioritizing limited reviewer capacity

Flashcards from this lesson

Why is a 97% aggregate accuracy number not enough to justify turning off human review?

The aggregate averages over the whole population and can mask a document type or field that is failing badly (e.g. handwritten receipts at 64%); you must validate accuracy by document type and field segment first.

Why report confidence per field instead of one score per extraction?

Difficulty varies field by field. A document can be high-confidence on easy fields and low-confidence on one hard field; field-level scores let you route only the risky field to a human instead of the whole extraction.

What does it mean to calibrate confidence, and why is it necessary?

Run the model on a labeled validation set, group predictions by reported confidence, and measure observed accuracy at each level, per segment. Raw model confidence is poorly calibrated (0.9 does not mean 90%), so thresholds must come from measured data.

Why use stratified random sampling rather than simple random sampling to audit the automated lane?

Simple random sampling under-represents rare document types and confidence bands. Stratifying by document type, field, and confidence band guarantees coverage of rare and high-risk segments and surfaces novel error patterns from drift.

What are the two independent triggers for routing an extraction to a human reviewer?

Low calibrated model confidence, and ambiguous or contradictory source documents (e.g. two stated totals or an unreadable scan). A document can need a human even when the model is confident.

How is field-level confidence in 5.5 different from the self-reported confidence warned about in 5.2?

In 5.2 uncalibrated self-reported confidence is an unreliable escalation proxy. In 5.5 confidence becomes usable precisely because it is calibrated against a labeled validation set and used per-segment for review routing.

What is the correct gate for automating a segment of extractions?

Confirm on the labeled validation set that calibrated high-confidence accuracy clears your bar for that specific document type and field; automate only those segments and keep humans on weak or unmeasured ones.
