5.3 · 12 min

Implement error propagation strategies across multi-agent systems

In coordinator-subagent systems, subagents fail routinely, so system reliability depends on how failure information travels back to the coordinator, the only component that can decide whether to retry, switch approaches, or degrade gracefully. Returning structured error context (failure type, attempted operation, partial results, alternatives) lets the coordinator recover intelligently, while generic statuses, silent suppression, and whole-run termination all break recovery. This lesson covers how to propagate errors so multi-agent workflows fail gracefully instead of silently or catastrophically.

Figure: Structured error propagation from a failed subagent to the coordinator, contrasted with the two anti-patterns and the access-failure vs empty-result distinction.

Error propagation is a reliability design decision

In a coordinator-subagent system, subagents fail all the time: a search API times out, an MCP tool returns a 503, a document is unreachable. The reliability of the whole system depends less on preventing these failures than on how failure information travels back to the coordinator. The coordinator is the only component with a global view, so it is the right place to decide whether to retry, try an alternative source, or proceed with reduced coverage. A subagent that hides or flattens its failure robs the coordinator of the context it needs to recover.

The governing principle: propagate enough structure that the coordinator can make an intelligent decision, but no more than it needs. This mirrors the MCP structured-error pattern (isError, errorCategory, isRetryable) from tool design, applied to the agent-to-agent boundary. Errors are data that flow through the system, not exceptions that halt it.

Structured error context: the four load-bearing fields

When a subagent cannot complete its task, it should return a structured error object, not a bare string. The four fields the exam expects are: (1) failure type (timeout, permission, rate limit, not found), (2) what was attempted (the exact query or operation), (3) partial results already gathered, and (4) potential alternative approaches.

Example:

{
  "status": "error",
  "failureType": "timeout",
  "attemptedQuery": "AI impact on film production 2024",
  "partialResults": [{"url": "...", "excerpt": "..."}],
  "alternatives": ["retry with narrower query", "use news API instead of web crawl"]
}

With this, the coordinator can retry with a modified query, delegate to a different source, or accept the partial results and annotate the gap. A bare "search failed" forces the coordinator to either blindly retry or give up, neither of which is an informed decision.
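The four fields can also be pinned down as a typed payload so that no subagent can omit one by accident. The following is a minimal Python sketch rather than a prescribed schema: the names `SubagentError`, `FailureType`, and `to_payload` are assumptions made for illustration, and the output keys simply mirror the JSON example above.

```python
import json
from dataclasses import dataclass, field
from enum import Enum


class FailureType(Enum):
    # The failure categories named in the lesson; extend as needed.
    TIMEOUT = "timeout"
    PERMISSION = "permission"
    RATE_LIMIT = "rate limit"
    NOT_FOUND = "not found"


@dataclass
class SubagentError:
    """Hypothetical container for the four fields of structured error context."""
    failure_type: FailureType
    attempted_query: str
    partial_results: list = field(default_factory=list)
    alternatives: list = field(default_factory=list)

    def to_payload(self) -> dict:
        # Serialize with the same keys as the JSON example in the text.
        return {
            "status": "error",
            "failureType": self.failure_type.value,
            "attemptedQuery": self.attempted_query,
            "partialResults": list(self.partial_results),
            "alternatives": list(self.alternatives),
        }


error = SubagentError(
    failure_type=FailureType.TIMEOUT,
    attempted_query="AI impact on film production 2024",
    alternatives=["retry with narrower query"],
)
print(json.dumps(error.to_payload(), indent=2))
```

Defaulting `partial_results` and `alternatives` to empty lists keeps the payload shape stable even when a subagent has nothing to report for those fields.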

Access failures vs valid empty results

This distinction is subtle and heavily tested. An access failure means the query never really ran: a timeout, a 503, an auth error. It needs a retry-or-alternative decision. A valid empty result means the query ran successfully and simply found nothing: zero matching orders, or no articles on the topic. That is a successful outcome, not an error.

Conflating the two is expensive in both directions. If you report an empty result as an error, the coordinator wastes retries on a query that will always return nothing. If you report an access failure as an empty result, the coordinator believes a topic has no sources when the truth is that the source was unreachable, and the final report becomes silently incomplete. Always encode which of the two actually occurred so the coordinator responds correctly.
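One way to encode the distinction is to make the wrapper around the search call responsible for it. This is a sketch under stated assumptions: `run_search` and the `backend` callable are hypothetical, the built-in exceptions stand in for whatever a real client raises, and treating a permission error as non-retryable is an illustrative policy choice rather than something the lesson specifies.

```python
from typing import Callable


def run_search(query: str, backend: Callable[[str], list]) -> dict:
    """Run a search and report explicitly which kind of outcome occurred."""
    try:
        rows = backend(query)
    except (TimeoutError, ConnectionError, PermissionError) as exc:
        # Access failure: the query never really ran.
        return {
            "status": "error",
            "failureType": type(exc).__name__,
            "attemptedQuery": query,
            # Assumption: auth problems will not fix themselves on retry.
            "isRetryable": not isinstance(exc, PermissionError),
        }
    # The query ran. Zero rows is a valid, successful outcome.
    return {
        "status": "success",
        "attemptedQuery": query,
        "results": rows,
        "resultCount": len(rows),
    }


def empty_backend(query: str) -> list:
    return []  # the source was reachable and had no matches


def unreachable_backend(query: str) -> list:
    raise TimeoutError("source did not respond")


empty = run_search("orders for customer 42", empty_backend)
failed = run_search("orders for customer 42", unreachable_backend)
print(empty["status"], failed["status"])
```

Both calls yield no rows, but only the second asks the coordinator for a retry-or-alternative decision.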

Local recovery first, then propagate

Not every failure should reach the coordinator. Transient failures (a single timeout, a momentary rate limit) are best handled locally by the subagent: retry with backoff, try a fallback endpoint. Only when local recovery is exhausted should the subagent propagate, and when it does, it must include what was attempted and any partial results, not just a terminal status.

This keeps the coordinator focused on decisions it alone can make (re-scoping, cross-agent trade-offs) rather than drowning in low-level retries. It also preserves progress: a subagent that gathered three of five sources before failing should hand those three upward, not discard them. Local recovery for the routine case, structured propagation for the unresolvable case.
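The local-recovery-then-propagate flow might look like the sketch below. Everything named here is hypothetical: `search_with_local_recovery`, the `fetch` callable, and the retry limits are assumptions for illustration, and `TimeoutError` stands in for whatever transient error a real client raises.

```python
import time
from typing import Callable


def search_with_local_recovery(
    queries: list,
    fetch: Callable[[str], object],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry transient timeouts locally; propagate only when retries run out."""
    gathered = []
    for query in queries:
        for attempt in range(max_attempts):
            try:
                gathered.append(fetch(query))
                break  # this query succeeded, move to the next one
            except TimeoutError:
                if attempt == max_attempts - 1:
                    # Local recovery exhausted: propagate with context,
                    # handing up everything gathered so far.
                    return {
                        "status": "error",
                        "failureType": "timeout",
                        "attemptedQuery": query,
                        "attempts": max_attempts,
                        "partialResults": gathered,
                    }
                # Exponential backoff before the next local attempt.
                sleep(base_delay * 2 ** attempt)
    return {"status": "success", "results": gathered}
```

The `sleep` parameter is injectable only so the function can be exercised without real delays; a caller would normally leave it at the default.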

Coverage annotations carry errors into the output

Error propagation does not end at the coordinator; it must survive into the final output. When some subagents returned partial or empty results, the synthesis agent should annotate coverage: which findings are well-supported by multiple sources, and which topic areas have gaps because a source was unavailable.

This is the difference between a report that says "AI has no measurable impact on film" (false, the film search timed out) and one that says "film production could not be assessed because the primary source was unreachable." Coverage annotations turn a silent hole into an explicit, actionable caveat that both the reader and the coordinator can act on.
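A coverage annotation can be derived mechanically from the per-topic results the coordinator already holds. The sketch below assumes a hypothetical `annotate_coverage` helper and invented level names (`well_supported`, `limited`, `no_matches`, `partial`, `gap`); the two-source threshold for "well-supported" is likewise an assumption.

```python
def annotate_coverage(topic_results: dict, min_sources: int = 2) -> dict:
    """Label each topic by how well the gathered sources support it."""
    coverage = {}
    for topic, result in topic_results.items():
        if result["status"] == "error":
            # A source was unavailable: record the gap explicitly.
            sources = result.get("partialResults", [])
            coverage[topic] = {
                "level": "partial" if sources else "gap",
                "sourceCount": len(sources),
                "note": f"source unavailable ({result.get('failureType', 'unknown')})",
            }
            continue
        sources = result.get("results", [])
        if len(sources) >= min_sources:
            level = "well_supported"
        elif sources:
            level = "limited"
        else:
            level = "no_matches"  # valid empty result, not a gap
        coverage[topic] = {"level": level, "sourceCount": len(sources), "note": ""}
    return coverage


report = annotate_coverage({
    "film": {"status": "error", "failureType": "timeout",
             "partialResults": [{"url": "...", "excerpt": "..."}]},
    "music": {"status": "success", "results": ["source A", "source B"]},
})
print(report)
```

Keeping `no_matches` separate from `gap` preserves the access-failure vs empty-result distinction all the way into the final report.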

The two opposite failure extremes

Two opposite mistakes both break reliability. At one extreme is silent suppression: catching an error and returning empty-but-successful, so failure looks like success and the workflow produces confidently incomplete output. At the other extreme is fail-fast termination: letting a single subagent's exception abort the entire multi-agent run, so a recoverable timeout destroys hours of parallel work from other agents.

The correct posture sits between them: surface the failure with structure, let the coordinator decide, and degrade gracefully. A single failed subagent should reduce coverage, not kill the run, and reduced coverage should be visible, not hidden. The exam frequently presents these two extremes as the wrong answers flanking the structured-context correct answer.
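Containment at the coordinator can be sketched with Python's `asyncio.gather` and its `return_exceptions=True` option, which returns a raised exception as a value instead of cancelling the whole batch. The function `run_subagents` and the shape of its `tasks` argument are assumptions for illustration; in the design described here a well-behaved subagent would already return a structured error, and this wrapper is only a safety net for exceptions that escape anyway.

```python
import asyncio


async def run_subagents(tasks: dict) -> dict:
    """Run subagents in parallel; one failure reduces coverage, not the run.

    `tasks` maps a topic name to a zero-argument coroutine function.
    """
    names = list(tasks)
    outcomes = await asyncio.gather(
        *(tasks[name]() for name in names),
        return_exceptions=True,  # failures come back as values
    )
    results = {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            # Surface the failure with structure instead of hiding it
            # (silent suppression) or re-raising it (fail-fast termination).
            results[name] = {
                "status": "error",
                "failureType": type(outcome).__name__,
                "attemptedQuery": name,
                "partialResults": [],
            }
        else:
            results[name] = outcome
    return results


async def music_agent() -> dict:
    return {"status": "success", "results": ["source A"]}


async def film_agent() -> dict:
    raise TimeoutError("search timed out")


results = asyncio.run(run_subagents({"music": music_agent, "film": film_agent}))
print(results["music"]["status"], results["film"]["status"])
```

The work from `music_agent` survives the failure of `film_agent`, and the failure itself stays visible in the results.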

Anti-patterns to avoid

Avoid: Catch the subagent's error and return an empty result marked as success.

Why it fails: Failure becomes indistinguishable from a genuine no-matches outcome. The coordinator proceeds as if the topic was covered, and the final report is silently incomplete. This is the most dangerous failure mode because nothing appears wrong.

Instead: Return a structured error with a failure type and partial results so the coordinator knows the gap exists and can recover or annotate it.

Avoid: Propagate the subagent exception to a top-level handler that terminates the whole research workflow.

Why it fails: A single recoverable timeout throws away correct work from every other subagent, and recovery strategies (retry, alternative source, partial synthesis) never get a chance to run.

Instead: Contain the failure at the subagent, return structured context, and let the coordinator degrade coverage gracefully while other agents continue.

Avoid: Retry internally with backoff and, once exhausted, return a generic "search unavailable" status.

Why it fails: The generic status hides the attempted query, the partial results, and the viable alternatives, so the coordinator can only blindly retry or give up rather than make an informed recovery decision.

Instead: Return the failure type, what was attempted, partial results, and alternatives so the coordinator can choose a different query or source.

Avoid: Treat a query that returned zero rows the same as one that timed out (both as "error").

Why it fails: The coordinator wastes retries on empty-but-successful queries and may misread a real outage as simply "no data." The two situations demand different responses.

Instead: Distinguish valid empty results (success, no retry) from access failures (retry or alternative) explicitly in the response payload.

Worked example: A web search subagent times out mid-research

You are running the multi-agent research system from Scenario 3: a coordinator delegates to a web search subagent, a document analysis subagent, and a synthesis subagent. Mid-run, the web search subagent times out while researching one subtopic.

Two tempting wrong moves. Catching the timeout and returning an empty result marked successful means the coordinator thinks that subtopic has no sources, and the final report omits it with no warning. Letting the exception bubble up to a top-level handler that aborts the workflow throws away the correct work already done by the other two subagents.

The correct move: return structured error context. The web search subagent first retries locally with backoff. When that is exhausted, it propagates:

{
  "status": "error",
  "failureType": "timeout",
  "attemptedQuery": "AI in film production, 2024 studios",
  "isRetryable": true,
  "partialResults": [{"url": "https://...", "excerpt": "VFX house adopts generative pipeline..."}],
  "alternatives": ["narrower query per studio", "query the news API instead of a full web crawl"]
}

Coordinator recovery. With this context the coordinator has real choices: re-delegate a narrower query, switch the subagent to the suggested alternative source, or accept the single partial result and move on. It is no longer forced to choose between a blind retry and killing the run.
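The coordinator's side of this decision can be written as a small policy function over the propagated payload. This is a hedged sketch, not the lesson's prescribed logic: `choose_recovery`, its action names, and the order in which it prefers re-delegation over accepting partial results are all assumptions.

```python
def choose_recovery(error: dict, retries_left: int) -> dict:
    """Pick a recovery action from the structured error context."""
    alternatives = error.get("alternatives", [])
    partial = error.get("partialResults", [])
    if error.get("isRetryable") and retries_left > 0:
        # Prefer a suggested alternative over repeating the same query.
        strategy = alternatives[0] if alternatives else "retry same query"
        return {"action": "redelegate", "strategy": strategy}
    if partial:
        # Keep what was gathered and flag the topic as partially covered.
        return {"action": "accept_partial", "sourceCount": len(partial)}
    # Nothing usable: record an explicit gap for the synthesis step.
    return {"action": "mark_gap", "reason": error.get("failureType", "unknown")}


timeout_error = {
    "status": "error",
    "failureType": "timeout",
    "attemptedQuery": "AI in film production, 2024 studios",
    "isRetryable": True,
    "partialResults": [{"url": "https://...", "excerpt": "VFX house adopts generative pipeline..."}],
    "alternatives": ["narrower query per studio",
                     "query the news API instead of a full web crawl"],
}

print(choose_recovery(timeout_error, retries_left=1))
print(choose_recovery(timeout_error, retries_left=0))
```

None of these branches would be possible with a bare "search failed" string, because every one of them reads a field from the structured payload.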

Carry it into the output. The synthesis agent annotates coverage: film production is marked as a partial-coverage area backed by a single source due to a search timeout, while music and writing are marked well-supported. The reader sees the gap explicitly instead of silently receiving an incomplete report. This is exactly the reasoning that sample question 8 rewards: structured error context beats a generic status, silent suppression, and whole-workflow termination.

Exam tips

✓ Structured error context = failure type + what was attempted + partial results + alternatives. Memorize these four fields.
✓ Access failure (timeout, 503) needs a retry-or-alternative decision. Valid empty result (query ran, zero matches) is a success and is never retried. Never conflate them.
✓ Generic statuses like "search unavailable" are wrong answers because they hide the context the coordinator needs to recover.
✓ Both silent suppression (empty-as-success) and whole-workflow termination on a single failure are anti-patterns; the exam pairs them as the two wrong extremes around the correct structured-context answer.
✓ Subagents recover transient failures locally and propagate only unresolved errors, always including what was attempted plus partial results.
✓ Synthesis output should carry coverage annotations: well-supported findings vs topic areas with gaps from unavailable sources.

Official exam objectives for 5.3

Knowledge of

Structured error context (failure type, attempted query, partial results, alternative approaches) as enabling intelligent coordinator recovery decisions
The distinction between access failures (timeouts needing retry decisions) and valid empty results (successful queries with no matches)
Why generic error statuses ("search unavailable") hide valuable context from the coordinator
Why silently suppressing errors (returning empty results as success) or terminating entire workflows on single failures are both anti-patterns

Skills in

Returning structured error context including failure type, what was attempted, partial results, and potential alternatives to enable coordinator recovery
Distinguishing access failures from valid empty results in error reporting so the coordinator can make appropriate decisions
Having subagents implement local recovery for transient failures and only propagate errors they cannot resolve, including what was attempted and partial results
Structuring synthesis output with coverage annotations indicating which findings are well-supported versus which topic areas have gaps due to unavailable sources

Flashcards from this lesson

What four fields make error context 'structured' enough for coordinator recovery?

Failure type, what was attempted (the query or operation), partial results already gathered, and potential alternative approaches.

Access failure vs valid empty result?

Access failure = the query never truly ran (timeout, 503, auth) and needs a retry or alternative decision. Valid empty result = the query ran and found nothing; it is a success and should not be retried.

Why is returning a generic 'search unavailable' status a poor propagation strategy?

It hides the attempted query, partial results, and alternatives, so the coordinator can only blindly retry or abandon rather than make an informed recovery decision.

Name the two opposite error-propagation anti-patterns.

Silent suppression (returning empty results marked as success) and terminating the entire workflow on a single subagent failure.

When should a subagent propagate an error instead of handling it?

After local recovery for transient failures is exhausted; then it propagates the unresolved error, including what was attempted and any partial results.

What is a coverage annotation in synthesis output?

A note distinguishing well-supported findings from topic areas that have gaps because a source was unavailable, so incompleteness is explicit rather than hidden.
