5.1 Manage conversation context to preserve critical information across long interactions

Long sessions are where agents quietly go wrong: exact amounts, dates, and order numbers get flattened into vague summaries, verbose tool outputs crowd out the signal, and findings buried in the middle of a huge input simply get ignored. Because the Claude API is stateless, you own the transcript and its budget, so context management is the core engineering task of any long-running agent. The reliable pattern is to preserve full history for coherence while compressing the expensive parts (verbose tool results), protecting the fragile parts (exact facts) in a persistent case-facts layer, and laying out long inputs so the important material sits where the model actually attends to it.

Figure: how to lay out a single long API request so critical facts survive. Pin the case-facts block and a summary at the top, trim tool output in the middle, and keep recent turns at the end. The attention curve is highest at the start and end and dips in the middle (the lost-in-the-middle effect).

The stateless API makes context management your job

The Claude Messages API is stateless: every request must carry the full messages array, and the model has no server-side memory of prior turns. To maintain conversational coherence across a long interaction you resend the complete conversation history on each request, so the model can reason over everything the customer said and everything the tools returned. In the Agent SDK and the raw API you own that transcript, unlike a consumer chat product that appears to "remember" for you.
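
As a minimal sketch of what "resend the complete history" looks like in practice, the helpers below build each request body from the transcript you hold. The function names and the model placeholder are assumptions made for illustration; only the general shape of a Messages request (model, max_tokens, messages) is taken from the API.

```javascript
// Hypothetical helper: append the new user turn and return a request body
// that carries the whole history, since the API keeps no memory between calls.
function buildRequest(history, userText, model = "your-model-id") {
  const messages = [...history, { role: "user", content: userText }];
  return { model, max_tokens: 1024, messages };
}

// Hypothetical helper: record the assistant's reply so the next request includes it.
function recordReply(request, replyText) {
  return [...request.messages, { role: "assistant", content: replyText }];
}

// Usage: the second request contains all three turns, in order.
let history = [];
const first = buildRequest(history, "I was charged twice for order A-4471.");
history = recordReply(first, "I can help with that. Which charge is the duplicate?");
const second = buildRequest(history, "The second one, for $650.");
```

Keeping the transcript in your own variable makes the ownership explicit: anything you do not put back into messages is simply absent from the model's view on the next turn.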

Because you own the transcript, you also own its cost. Every token of history is re-billed and re-processed on every turn, and the context window is finite. That creates a real tension: dropping older turns to save tokens breaks coherence (the agent forgets what the customer already told it), but keeping everything raw blows the budget and dilutes the model's attention.

The rest of this lesson resolves that tension. Keep the full history the model needs for coherence, but compress the parts that are expensive and low-value (verbose tool results), and protect the parts that are cheap but fragile (exact numbers, dates, IDs) in a dedicated structure that summarization can never touch.

Progressive summarization and what it silently destroys

As history grows, a common tactic is progressive summarization: periodically replace older turns with a condensed summary to reclaim tokens. Summarization is excellent for narrative and intent, for example "the customer is disputing a duplicate charge and is frustrated after two failed calls." It is dangerous for precision.

Prose summaries reliably lose exact numerical values, percentages, dates, order numbers, statuses, and customer-stated expectations, which are the very facts a support or extraction workflow depends on. A summary that reads "the customer wants a refund for a recent order" has silently dropped that it was order #A-4471, that the amount was $650, and that the customer was promised a 3-day turnaround. Two turns later the agent quotes the wrong figure, re-asks a question the customer already answered, or refunds the wrong amount.

The rule is simple: summarize the narrative, never the load-bearing facts. Anything the agent must reproduce exactly should be pulled out and stored verbatim before summarization runs, not entrusted to a paraphrase.
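
One way to enforce that ordering is to run extraction before the summarizer ever sees the older turns. The sketch below is illustrative only: the regular expressions are deliberately simple stand-ins for real fact extraction, and summarize is a hypothetical callback (in practice probably a model call) that you would supply.

```javascript
// Illustrative patterns only; real extraction would be stricter or model-assisted.
const ORDER_ID = /#([A-Z]-\d+)/g;
const USD_AMOUNT = /\$(\d+(?:\.\d{2})?)/g;

// Collect exact order IDs and dollar amounts from a list of turns.
function extractFacts(turns) {
  const text = turns.map((t) => t.content).join("\n");
  return {
    order_ids: [...new Set([...text.matchAll(ORDER_ID)].map((m) => m[1]))],
    amounts_usd: [...new Set([...text.matchAll(USD_AMOUNT)].map((m) => Number(m[1])))],
  };
}

// Pull facts out first, then let the (hypothetical) summarizer condense the narrative.
function compact(turns, keepRecent, summarize) {
  const older = turns.slice(0, Math.max(0, turns.length - keepRecent));
  const recent = turns.slice(older.length);
  const facts = extractFacts(older); // stored verbatim before any paraphrase
  const summary = older.length ? summarize(older) : "";
  return { facts, summary, recent };
}
```

Even if the summary comes back as "the customer wants a refund for a recent order", the returned facts still hold the order ID and the amount exactly as stated.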

The case-facts block: a persistent structured layer

The fix for lossy summarization is to extract transactional facts into a persistent "case facts" block that lives outside the summarized history and is re-injected verbatim into every prompt. Store it as structured data so it stays unambiguous and cheap to carry:

{
  "case_facts": {
    "customer_id": "C-88213",
    "order_id": "A-4471",
    "refund_amount_usd": 650.00,
    "order_date": "2026-06-14",
    "status": "delivered_damaged",
    "promised_resolution": "3-day replacement"
  }
}

Because this block sits outside the region that gets summarized or compacted, no summarization pass can overwrite it. For multi-issue sessions (a single customer with three separate disputes) keep a separate context layer keyed per issue, each carrying its own order IDs, amounts, and statuses, so facts from one issue do not bleed into another. Think of it as a small typed scratchpad that rides along with the conversation. It reflects ground truth only as long as you update it whenever a fact is confirmed or changes.
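
Re-injection can be as simple as serializing the block into the system prompt on every request. The sketch below assumes the facts travel in the request's system field and uses a case_facts tag as a delimiter; both the helper name and the tag are illustrative choices, not a prescribed format.

```javascript
// Hypothetical helper: attach case facts verbatim to every request.
// The facts ride in `system`, so compacting `messages` cannot touch them.
function withCaseFacts(baseInstructions, caseFacts, messages) {
  const system = [
    baseInstructions,
    "<case_facts>",
    JSON.stringify(caseFacts, null, 2),
    "</case_facts>",
    "Treat case_facts as authoritative when quoting amounts, dates, or IDs.",
  ].join("\n");
  return { system, messages: [...messages] };
}
```

Because the function takes the facts as a separate argument, the history you pass in can be trimmed or summarized freely without any risk of altering the stored values.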

Trim verbose tool outputs before they accumulate

Tool results accumulate in context and consume tokens disproportionately to their relevance. A single order lookup might return 40+ fields (billing address, carrier metadata, internal flags, tax breakdowns) when the return workflow only needs about 5: order ID, item, amount, status, and order date. If you append the raw payload every turn, it stays in the transcript forever, gets re-billed on every iteration, and dilutes the model's attention across noise.

Trim tool outputs to only the relevant fields before they enter the transcript. The cleanest place to do this is a PostToolUse hook or the tool wrapper itself, so the trimming is deterministic rather than something you ask the model to do:

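// PostToolUse: keep only return-relevant fields
// (trimOrderLookup is an illustrative name; raw is the full lookup payload)
function trimOrderLookup(raw) {
  const { order_id, item, amount, status, order_date } = raw;
  return { order_id, item, amount, status, order_date };
}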

This is usually the single highest-leverage token saving in an agent, because tool results dominate context growth. Note the difference from a prompt instruction: you physically shrink the payload, you do not merely ask the model to ignore fields, which it may not reliably do.
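
When several tools need trimming, a per-tool allowlist keeps the rule in one place. This is a sketch under assumptions: the RELEVANT_FIELDS table and the trimToolResult name are hypothetical, and how you wire the function into a hook or tool wrapper depends on your framework.

```javascript
// Hypothetical allowlists: the fields each workflow actually needs, per tool.
const RELEVANT_FIELDS = {
  lookup_order: ["order_id", "item", "amount", "status", "order_date"],
};

// Keep only allowlisted fields; tools without an allowlist pass through unchanged.
function trimToolResult(toolName, raw) {
  const keep = RELEVANT_FIELDS[toolName];
  if (!keep) return raw;
  return Object.fromEntries(keep.filter((k) => k in raw).map((k) => [k, raw[k]]));
}
```

Passing unknown tools through unchanged is a conservative default for a sketch; a stricter variant could refuse to forward any result that lacks an allowlist.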

Lost in the middle: lay out long inputs deliberately

Models reliably process information at the beginning and the end of a long input but may omit content buried in the middle. This "lost in the middle" effect is a property of long-context attention, not something you can fully prompt away, and it gets worse as the input grows.

Two layout tactics mitigate it. First, place a key-findings summary at the very beginning of an aggregated input, so the most important conclusions sit in a high-attention region rather than being discovered (or missed) deep inside. Second, organize detailed results with explicit section headers, so each section has a salient anchor and the model can navigate the structure instead of skimming an undifferentiated wall of text. The recent turns and the current question naturally fall at the end, which is also high-attention.

What you must never do is drop a critical figure, or an entire subtopic, into the middle of a 40k-token block and assume it will be used. In a research synthesis, that is exactly how a report ends up covering the first and last sources well while quietly omitting everything in between.
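
The two layout tactics can be sketched as a small assembly function. The heading strings and the function name are assumptions chosen for illustration; what matters is the ordering: key findings first, labelled sections in the middle, the current question last.

```javascript
// keyFindings: string[]; sections: [{ title, body }]; question: string
function layoutAggregatedInput(keyFindings, sections, question) {
  // High-attention start: the conclusions that must not be missed.
  const top = ["KEY FINDINGS", ...keyFindings.map((f) => `- ${f}`)].join("\n");
  // Middle: every section gets an explicit, numbered header as an anchor.
  const middle = sections
    .map((s, i) => `## Section ${i + 1}: ${s.title}\n${s.body}`)
    .join("\n\n");
  // High-attention end: the question the model should answer now.
  const end = `CURRENT QUESTION\n${question}`;
  return [top, middle, end].join("\n\n");
}
```

A critical figure that appears deep in one section should also be repeated in keyFindings, so it exists in a high-attention region as well as in its original place.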

Structured handoffs for downstream context budgets

In multi-agent systems, every downstream agent has its own limited context budget, so what upstream agents return is decisive. If a web-search subagent hands the synthesis agent verbose page content plus its full reasoning chain, the synthesizer's window fills with low-value text and the actual facts get crowded out or lost in the middle.

Instead, modify upstream agents to return structured data: key facts, citations, and relevance scores rather than raw content and reasoning chains. Require subagents to include the metadata downstream synthesis needs to stay accurate, including publication or data-collection dates, source URLs, document names, page numbers, and methodological context. Dates in particular prevent temporal differences from being misread as contradictions, and source mappings preserve attribution through the synthesis step.

{
  "claim": "Adoption rose 34% in 2025",
  "source": "gartner.com/ai-2025",
  "published": "2026-01-12",
  "relevance": 0.91
}

This is the multi-agent expression of the same principle used in the case-facts block: pass the distilled, metadata-rich signal, not the raw transcript, so each agent spends its budget on reasoning rather than on re-reading noise.
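
A downstream agent, or the coordinator in front of it, can enforce this contract before anything enters its context. The following is a sketch assuming findings shaped like the example above; the required-field list, the function name, and the maxItems cap are illustrative assumptions.

```javascript
// Metadata every finding must carry before it is handed downstream.
const REQUIRED = ["claim", "source", "published", "relevance"];

// Drop findings missing required metadata, then keep the most relevant ones.
function prepareHandoff(findings, maxItems) {
  const valid = findings.filter((f) =>
    REQUIRED.every((k) => f[k] !== undefined && f[k] !== null && f[k] !== "")
  );
  return [...valid].sort((a, b) => b.relevance - a.relevance).slice(0, maxItems);
}
```

Silently dropping incomplete findings keeps the sketch short; in a real pipeline you would more likely send them back to the subagent so it can supply the missing date or source.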

Anti-patterns to avoid

Avoid: Running progressive summarization over the whole history, including the exact amounts, dates, order IDs, and customer-stated expectations.

Why it fails: Prose summaries reliably drop precise figures and commitments. The agent later quotes the wrong refund amount, re-asks answered questions, or acts on a vague paraphrase, which is fatal when money or compliance is involved.

Instead: Extract transactional facts into a persistent case-facts block stored verbatim outside the summarized history, and summarize only the narrative around them.

Avoid: Appending raw, full tool outputs (a 40+ field order lookup) to the transcript on every turn.

Why it fails: Tool results consume tokens disproportionately to their relevance, get re-billed each iteration, exhaust the window, and dilute attention across fields the task never needs.

Instead: Trim tool outputs to only the relevant fields in a PostToolUse hook or tool wrapper before they enter context, keeping just the handful the workflow uses.

Avoid: Concatenating many findings into one huge input and assuming the model reads all of it equally.

Why it fails: The lost-in-the-middle effect means content in the middle of a long input is often omitted, so critical figures or whole subtopics silently disappear from the output.

Instead: Put a key-findings summary at the very top, add explicit section headers, and keep the current question at the end so important material sits in high-attention regions.

Avoid: Truncating or dropping earlier conversation turns to save tokens, or resending only a summary in place of the transcript.

Why it fails: The API is stateless, so anything not resent is forgotten. Dropping real turns breaks conversational coherence and the agent loses context the customer already provided.

Instead: Preserve complete history for coherence, but reclaim budget by trimming verbose tool results and compressing narrative, while the case-facts block keeps the precise data intact.

Worked example: Keeping a multi-issue billing dispute coherent over a long support session

In the customer support scenario (Scenario 1), a customer opens a long session with three separate problems: a duplicate charge on order #A-4471 ($650, delivered damaged), a missing loyalty credit, and a subscription they want cancelled with a partial refund. The Agent SDK agent uses get_customer, lookup_order, process_refund, and escalate_to_human. By turn 20 the raw transcript is huge and the naive setup starts failing: a summarization step has compressed the early turns into "customer has some billing issues," and each lookup_order dumped 40+ fields into context. The agent now quotes $65 instead of $650 and asks the customer to repeat the order number.

Step 1: pin a per-issue case-facts layer. Extract structured facts as they are confirmed, one entry per issue, and re-inject verbatim on every request:

{ "issues": [
  { "id": "dup_charge", "order_id": "A-4471", "amount_usd": 650.00, "status": "delivered_damaged", "promised": "3-day replacement" },
  { "id": "loyalty_credit", "expected_points": 500, "status": "missing" },
  { "id": "sub_cancel", "plan": "Pro", "refund_basis": "prorated", "status": "open" }
] }

Because this layer lives outside the summarized history, no later summarization pass can turn $650 into $65 or lose which order maps to which issue.
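
Updating the layer as facts are confirmed can be sketched as an upsert keyed by issue id. The function name is hypothetical, and the layer shape follows the issues array shown in Step 1.

```javascript
// Merge newly confirmed facts into one issue's entry without touching the others.
function upsertIssue(layer, issueId, confirmedFacts) {
  const exists = layer.issues.some((i) => i.id === issueId);
  const issues = exists
    ? layer.issues.map((i) =>
        i.id === issueId ? { ...i, ...confirmedFacts, id: issueId } : i
      )
    : [...layer.issues, { ...confirmedFacts, id: issueId }];
  return { issues }; // a new layer object; the previous one is left unmodified
}
```

Because each update names its issue explicitly, a status change on the duplicate charge cannot overwrite fields that belong to the loyalty credit or the subscription.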

Step 2: trim tool outputs at the source. A PostToolUse hook reduces each lookup_order result to order_id, item, amount, status, and order_date, dropping carrier metadata and internal flags. Keeping 5 of 40+ fields cuts token growth per lookup by roughly 85%, assuming the fields are of broadly similar size, which leaves budget for the actual reasoning.
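
The actual saving depends on your payloads, so it is worth measuring rather than assuming. The sketch below compares serialized character counts as a rough proxy; characters are not tokens, and the helper name is an illustrative assumption.

```javascript
// Compare serialized sizes of the raw and trimmed results.
// Character counts only approximate token counts; use a tokenizer for real budgets.
function sizeReduction(raw, trimmed) {
  const before = JSON.stringify(raw).length;
  const after = JSON.stringify(trimmed).length;
  return { before, after, saved: 1 - after / before };
}
```

Running this over a sample of real lookup_order results shows whether the reduction is near the estimate or whether a few large fields dominate the payload.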

Step 3: lay out the input for attention. When the agent assembles its prompt, the case-facts block and a one-line status summary go at the top, the three issues are separated with explicit section headers, and the customer's current message sits at the end. Nothing load-bearing is buried in the middle.

Result: the agent keeps all three issues straight, quotes exact figures, honors the 3-day commitment, and only escalates the genuinely ambiguous subscription refund, moving first-contact resolution toward the 80% target instead of losing the thread as the session grows.

Exam tips

✓ The Messages API is stateless: you must resend the complete conversation history each request for coherence, and every token is re-billed and re-processed on every turn.
✓ Summarize the narrative, never the load-bearing facts. Exact amounts, dates, order numbers, statuses, and customer-stated expectations must be preserved verbatim in a persistent case-facts block outside the summarized history.
✓ Lost in the middle: models attend reliably to the start and end of a long input but may omit the middle. Put a key-findings summary at the top and use explicit section headers.
✓ Trim verbose tool outputs to only relevant fields (keep ~5 of 40+) in a PostToolUse hook or wrapper; this is usually the biggest token saving because tool results dominate context growth.
✓ For multi-issue sessions, keep a separate structured context layer per issue so facts from one issue do not bleed into another.
✓ For multi-agent handoffs, have upstream agents return structured data (key facts, citations, relevance scores) plus metadata like dates and source URLs, not verbose content and reasoning chains, so downstream agents with small budgets stay accurate.

Official exam objectives for 5.1

Knowledge of

Progressive summarization risks: condensing numerical values, percentages, dates, and customer-stated expectations into vague summaries
The "lost in the middle" effect: models reliably process information at the beginning and end of long inputs but may omit findings from middle sections
How tool results accumulate in context and consume tokens disproportionately to their relevance (e.g., 40+ fields per order lookup when only 5 are relevant)
The importance of passing complete conversation history in subsequent API requests to maintain conversational coherence

Skills in

Extracting transactional facts (amounts, dates, order numbers, statuses) into a persistent "case facts" block included in each prompt, outside summarized history
Extracting and persisting structured issue data (order IDs, amounts, statuses) into a separate context layer for multi-issue sessions
Trimming verbose tool outputs to only relevant fields before they accumulate in context (e.g., keeping only return-relevant fields from order lookups)
Placing key findings summaries at the beginning of aggregated inputs and organizing detailed results with explicit section headers to mitigate position effects
Requiring subagents to include metadata (dates, source locations, methodological context) in structured outputs to support accurate downstream synthesis
Modifying upstream agents to return structured data (key facts, citations, relevance scores) instead of verbose content and reasoning chains when downstream agents have limited context budgets

Flashcards from this lesson

Why must you resend the full conversation history on every Claude API request?

The Messages API is stateless; the model has no server-side memory of prior turns, so complete history is required each request to maintain conversational coherence.

What does progressive summarization tend to destroy, and how do you protect it?

It flattens exact numbers, percentages, dates, order IDs, statuses, and customer-stated expectations into vague prose. Protect them by extracting them into a persistent case-facts block stored verbatim outside the summarized history.

What is the 'lost in the middle' effect and how do you mitigate it?

Models attend reliably to the start and end of a long input but may omit content in the middle. Mitigate by placing a key-findings summary at the top and organizing details with explicit section headers.

Why trim tool outputs, and where should the trimming happen?

Tool results consume tokens disproportionately to relevance and are re-billed each turn. Trim to only relevant fields (e.g. 5 of 40+) deterministically in a PostToolUse hook or tool wrapper, not via a prompt instruction.

How do you keep facts straight in a multi-issue session?

Maintain a separate structured context layer per issue (order IDs, amounts, statuses per issue) so data from one issue does not bleed into another.

What should upstream subagents return when downstream agents have small context budgets?

Structured data (key facts, citations, relevance scores) plus required metadata (dates, source URLs, document names, page numbers, methodological context), not verbose content or full reasoning chains.

Why include publication or data-collection dates in structured subagent outputs?

So temporal differences between sources are interpreted correctly rather than being misread as contradictions during synthesis.
