Resonate HQ · 5 min read · Just published

Multi-agent pipeline with durable handoffs in Python on Resonate

How three sequential LLM agents become four lines of generator code when every yield is a Resonate checkpoint.


A multi-step LLM pipeline (a researcher gathers findings, a writer produces a draft, a reviewer approves it) must survive the failure of any single agent without re-running the earlier ones, because each call is slow, costly, and non-deterministic. The Resonate shape of the solution is to register each agent as a normal function and orchestrate them from a generator workflow in which every yield ctx.run(...) is a durable checkpoint: an agent that raises is retried in place while the results of its sibling steps stay cached in the promise store. The example runs the pipeline on the happy path and with a forced first-attempt failure in the writer, and includes a commented-out extension to a real human-in-the-loop step via ctx.promise.

The shape of the solution

@resonate.register
def orchestrate(ctx: Context, topic: str, crash_on_writer: bool = False):
    # Step 1: Research -- gather findings
    findings = yield ctx.run(researcher, topic)
 
    # Step 2: Write -- produce a draft from findings
    # If crash_on_writer=True, the writer fails on first attempt and retries.
    # The researcher does NOT re-run on retry -- its result is cached.
    draft = yield ctx.run(writer, topic, findings, crash_on_writer)
 
    # Step 3: Review -- check the draft quality
    review = yield ctx.run(reviewer, draft)
 
    # Step 4: Human approval (simulated in this demo).
    # In production:
    #   approval = yield ctx.promise(id=f"approval/{topic}")
    #   approved = yield approval
    # ...
    approved = "APPROVED" in review.upper()
 
    result: OrchestrationResult = {
        "status": "published" if approved else "rejected",
        "topic": topic,
        "findings": findings,
        "draft": draft,
        "review": review,
    }
    return result
# from example-multi-agent-orchestration-py/src/agent.py:162-190

The workflow is a generator function, not async def. Each yield ctx.run(agent, ...) runs the child agent under a durable promise and suspends the orchestrator until the result is checkpointed.
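To make the checkpoint-and-replay idea concrete, here is a minimal sketch of a generator driver in plain Python. It is not the Resonate SDK: run, drive, and the dict-backed store are hypothetical stand-ins, and the "crash" is simulated by letting an exception escape the driver. It only illustrates why re-driving a generator against stored step results skips the steps that already completed.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical step descriptor: (function, args). Not a Resonate type.
Step = Tuple[Callable[..., Any], tuple]

def run(fn: Callable[..., Any], *args: Any) -> Step:
    """Describe a step; the driver decides whether to execute or replay it."""
    return (fn, args)

def drive(workflow: Callable[..., Any], store: Dict[int, Any], *args: Any) -> Any:
    """Drive a generator workflow, reusing results already present in `store`."""
    gen = workflow(*args)
    index = 0
    try:
        step = next(gen)
        while True:
            if index not in store:
                fn, fn_args = step
                store[index] = fn(*fn_args)  # checkpoint the step result
            step = gen.send(store[index])    # resume with the (cached) value
            index += 1
    except StopIteration as done:
        return done.value

calls = []
crash_once = {"pending": True}

def researcher(topic: str) -> str:
    calls.append("researcher")
    return f"findings on {topic}"

def writer(findings: str) -> str:
    calls.append("writer")
    if crash_once["pending"]:
        crash_once["pending"] = False
        raise RuntimeError("simulated worker crash")
    return f"draft from {findings}"

def pipeline(topic: str):
    findings = yield run(researcher, topic)
    draft = yield run(writer, findings)
    return draft

store: Dict[int, Any] = {}
try:
    drive(pipeline, store, "durable execution")
except RuntimeError:
    pass  # the "worker" died; the store survives

# Second run replays step 0 from the store and only re-enters the writer.
result = drive(pipeline, store, "durable execution")
```

In this toy the store is an in-memory dict keyed by step position; a real durable store would outlive the process, which is the part the SDK and server provide.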

The durable primitives in play

  • Resonate() — constructs the Resonate client embedded in the worker process. src/agent.py:27.
  • resonate.set_dependency("openai", OpenAI(...)) — registers the OpenAI client as a worker-level dependency so agent functions can retrieve it via ctx.get_dependency("openai") rather than reading globals. src/agent.py:203.
  • @resonate.register — registers the top-level workflow so the worker can claim and execute it under a caller-supplied promise id. src/agent.py:162.
  • ctx.run(fn, *args) — runs a function as a durable child step. The return value is persisted at the call site; on replay the SDK returns the cached value rather than re-invoking the function. Used for all three agent calls. src/agent.py:165, :170, :173.
  • ctx.get_dependency("openai") — fetches the worker-scoped OpenAI client inside an agent function. src/agent.py:53, :94, :131.
  • ctx.promise(id=...) — referenced in the in-file comment block (src/agent.py:177) as the production human-in-the-loop primitive. Blocks the workflow on an externally-resolved durable promise.
  • resonate.start() / resonate.stop() — start and stop the worker's polling loop against the Resonate server. src/agent.py:225, :231.
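The dependency-lookup pattern in the list above can be illustrated without the SDK or network access. In the sketch below, StubContext and FakeLLM are hypothetical stand-ins invented for illustration, and the (ctx, topic) agent signature is an assumption based on the agents calling ctx.get_dependency("openai"); only the get_dependency name comes from the example.

```python
from typing import Any, Dict

class StubContext:
    """Hypothetical stand-in for the SDK Context; models get_dependency only."""

    def __init__(self, deps: Dict[str, Any]) -> None:
        self._deps = deps

    def get_dependency(self, name: str) -> Any:
        return self._deps[name]

class FakeLLM:
    """Hypothetical fake client so the sketch runs offline."""

    def complete(self, prompt: str) -> str:
        return f"[fake completion for: {prompt}]"

def researcher(ctx: StubContext, topic: str) -> str:
    # The agent asks the context for its client instead of reading a global.
    llm = ctx.get_dependency("openai")
    return llm.complete(f"List key findings about {topic}")

ctx = StubContext({"openai": FakeLLM()})
findings = researcher(ctx, "durable execution")
```

Because the agent never touches a module-level client, a test can hand it a fake under the same dependency name, as the last two lines do.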

What the SDK handles vs. what you write

SDK handles | You write
Checkpointing the return value of each ctx.run(agent, ...) in the durable promise store | The three yield ctx.run(...) calls and the agent functions themselves
Suspending the generator after each yield and resuming when the child promise resolves | The straight-line orchestrator body (findings, draft, review, approved)
Replaying the orchestrator after a crash using the cached step values rather than re-running completed agents | No replay code: the orchestrator is written as if it never crashes
Retrying a registered child function that raises | The actual failure (RuntimeError("Writer agent connection reset (simulated)") at src/agent.py:92), with no retry decorator and no try/except
Holding the workflow on a durable promise via ctx.promise(id=...) until it is resolved externally | The promise id template (f"approval/{topic}") and the resolver call (POST /promises/.../resolve from outside)
Routing the OpenAI client into each agent invocation via the dependency registry | The single resonate.set_dependency(...) registration in main()

The orchestrator body is four assignments, a result dictionary, and a return statement. Retry behaviour, cached intermediate results, resume-after-crash semantics, and blocking on external approval all live in the SDK and server, not in the code the author writes.
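For contrast, this is roughly the retry scaffolding an author would otherwise write by hand around each agent call. It is a sketch under assumptions: run_with_retry, max_attempts, and delay_s are hypothetical names, and the policy shown (fixed delay, three attempts) is not a description of Resonate's actual retry policy.

```python
import time
from typing import Any, Callable, Dict

def run_with_retry(
    fn: Callable[..., Any],
    *args: Any,
    max_attempts: int = 3,
    delay_s: float = 0.0,
) -> Any:
    """Call fn(*args), retrying on any exception up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_s)

attempts: Dict[str, int] = {"writer": 0}

def writer(topic: str, findings: str) -> str:
    attempts["writer"] += 1
    if attempts["writer"] == 1:
        # Same simulated failure message as the example's writer agent.
        raise RuntimeError("Writer agent connection reset (simulated)")
    return f"Draft on {topic}: {findings}"

# `findings` stands for an earlier step's result that is already stored,
# so only the writer is re-entered when it fails.
findings = "three findings"
draft = run_with_retry(writer, "durable execution", findings)
```

The point of the pattern is that none of this wrapper appears in the orchestrator; the sketch only shows the shape of the work being delegated.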

Failure modes covered

  • Writer raises on its first attempt. src/agent.py:90-92 raises RuntimeError("Writer agent connection reset (simulated)") when crash_on_first is set and attempt == 1, and the SDK retries the registered writer function. The README crash-mode transcript (README:90-99) shows the researcher line printed once, then [writer] Writing article (attempt 1), the Resonate retry log line, then [writer] Writing article (attempt 2). The researcher does not re-run because its return value was already checkpointed at the first yield ctx.run(...).
  • Worker crashes between agents. Because the orchestrator is registered under a caller-supplied id (the orchestration.1 / orchestration.crash id in the resonate invoke calls, README:62-65, README:84-88), a worker restart replays the orchestrator and finds the prior ctx.run(...) results in the promise store; only the unfinished step actually re-enters the agent function. README:107 states this explicitly.
  • The orchestration is invoked twice with the same id. The outer id (orchestration.1) is supplied by the caller; a second resonate invoke ... orchestration.1 resolves against the existing promise rather than starting a parallel pipeline.
  • Human approval never arrives (production extension). Using ctx.promise(id=f"approval/{topic}") (src/agent.py:177, README:142-145) the workflow blocks indefinitely on the durable promise. The worker can restart, the server can restart — the promise remains in the store until something external POSTs to /promises/approval/<topic>/resolve (README:151-153). The example does not enable this path by default; it is documented as the production swap-in.

The example does not implement provider-side idempotency on the OpenAI calls — that is outside the workflow's scope and is not claimed.
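The duplicate-invocation and human-approval cases above both rest on promises being addressed by id. The toy store below illustrates that idea only: PromiseStore and its create and resolve methods are hypothetical and in-memory, not the Resonate server API, and the promise ids are illustrative.

```python
from typing import Any, Dict

class PromiseStore:
    """Toy in-memory stand-in for a durable promise store (hypothetical)."""

    def __init__(self) -> None:
        self._promises: Dict[str, Dict[str, Any]] = {}

    def create(self, promise_id: str) -> Dict[str, Any]:
        # Create-if-absent: a repeated id returns the existing record
        # instead of starting a second, parallel one.
        return self._promises.setdefault(
            promise_id, {"id": promise_id, "state": "pending", "value": None}
        )

    def resolve(self, promise_id: str, value: Any) -> Dict[str, Any]:
        promise = self._promises[promise_id]
        if promise["state"] == "pending":  # the first resolution wins
            promise["state"] = "resolved"
            promise["value"] = value
        return promise

store = PromiseStore()

# Invoking twice with the same id lands on the same promise.
first = store.create("orchestration.1")
second = store.create("orchestration.1")
same_promise = first is second

# An approval promise stays pending until something external resolves it.
approval = store.create("approval/durable-execution")
state_before = approval["state"]
store.resolve("approval/durable-execution", "approved")
store.resolve("approval/durable-execution", "rejected")  # ignored: already resolved
```

In the real system the store lives in the Resonate server, which is why the wait can outlast worker and server restarts; the dict here only models the id-based lookup.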

When to reach for this pattern

  • If you are chaining multiple LLM agents where each call is expensive enough that re-running an earlier agent on a later-step failure is unacceptable.
  • If you want straight-line orchestration code for a sequential agent pipeline instead of a DAG framework or a hosted multi-agent runtime.
  • If the pipeline needs to survive worker restarts mid-run and resume at the failed step.
  • If a downstream step (review, approval, payment) needs to block on a human decision that may take minutes, hours, or days — ctx.promise makes the wait durable.
  • If you need per-agent retry without writing per-agent retry decorators or try/except scaffolding around each call.
  • If the agents are written as plain Python functions that happen to call an LLM, and you want the orchestration concern lifted out of them entirely.

Notes on the SDK pin

pyproject.toml:11 requires resonate-sdk>=0.6.7 (a minimum version rather than an exact pin). The 0.6.x line of the Python SDK speaks the legacy Resonate server protocol; the README directs the reader to run resonate serve from resonatehq/resonate-legacy-server (README:27). This predates the post-0.10 brand-aligned API renames on the TypeScript and Rust SDKs, so anyone mapping this pattern to a newer Python release should first re-verify the ctx.run / ctx.promise / resonate.register surface against the current SDK docs.

Sources