AWS Lambda has a 15-minute hard timeout, and document-processing workflows that involve OCR, LLM analysis, database writes, and downstream notifications routinely run longer than that. This example splits the responsibility: Lambda is a stateless trigger that calls resonate.begin_rpc(...) and returns 202 immediately, while the actual workflow runs on a separate long-running Python process registered with the Resonate Server. It shows the dispatch handler, the status-polling handler, and a five-step durable workflow registered on the worker — each step a checkpoint that survives crashes and worker restarts.
The shape of the solution
@resonate.register
def process_document(ctx: Context, job: dict[str, Any]):
"""Five-step durable document pipeline.
Each `yield ctx.run(...)` is checkpointed: if the worker crashes,
only the unfinished step re-runs on resume.
"""
page_count = yield ctx.run(download_document, job)
text = yield ctx.run(extract_text, job, page_count)
analysis = yield ctx.run(analyze_document, job, text)
stored_at = yield ctx.run(
store_results, job, analysis["summary"], analysis["data"]
)
notified_at = yield ctx.run(notify_requester, job, stored_at)
return {
"jobId": job["jobId"],
"type": job["type"],
"pageCount": page_count,
"summary": analysis["summary"],
"extractedData": analysis["data"],
"storedAt": stored_at,
"notifiedAt": notified_at,
}
# from example-lambda-workers-py/worker.py:91-114The Lambda side is shorter — it does no work, only dispatches. The module-scope client is created once per cold start:
resonate = Resonate.remote(
group="gateway",
host=RESONATE_HOST,
store_port=RESONATE_STORE_PORT,
message_source_port=RESONATE_MSG_PORT,
)
# from example-lambda-workers-py/lambda_function.py:45-50The handler parses the request, then dispatches and returns 202:
def _handle_process_document(event: dict[str, Any]) -> dict[str, Any]:
# ... parse and validate body, build `job` dict ...
resonate.options(target="poll://any@worker").begin_rpc(
f"doc/{job_id}",
"process_document",
job,
)
return _response(
202,
{
"status": "accepted",
"jobId": job_id,
"statusUrl": f"/status/{job_id}",
"message": "Processing in background. Poll statusUrl for results.",
},
)
# from example-lambda-workers-py/lambda_function.py:67-111The durable primitives in play
Resonate.remote(group="gateway", ...)—lambda_function.py:45-50. Connects the Lambda container to the Resonate Server as a member of thegatewaygroup. Created at module scope so the connection survives across warm invocations.Resonate.remote(group="worker")—worker.py:34. The worker's client. Joins theworkergroup that the Lambda targets withpoll://any@worker.@resonate.register—worker.py:91. Registersprocess_documentby name with the Resonate Server so the server can route an incoming RPC to it.resonate.options(target="poll://any@worker").begin_rpc(id, func_name, args)—lambda_function.py:97-101. Non-blocking dispatch: enqueues a durable RPC on the named worker group and returns. The promise IDdoc/<jobId>is the idempotency key — the server deduplicates by it.yield ctx.run(fn, *args)—worker.py:98-104. The checkpoint primitive inside a generator workflow. Runsfndurably, persists its return value to the Resonate Server, and on replay the workflow resumes from the next yield.resonate.get(id)—lambda_function.py:126. Returns the durable promise handle for an existing workflow.handle.done()andhandle.result()give the status-poll endpoint its state and return value.resonate.start()+Event().wait()—worker.py:120-121. Starts the worker's poll loop against the server and blocks the main thread so the process stays up.
What the SDK handles vs. what you write
| The SDK handles | You write |
|---|---|
Routing the RPC from gateway group to a worker in worker group via poll://any@worker | The dispatch call: resonate.options(target=...).begin_rpc(id, "process_document", job) |
Persisting each yield ctx.run(...) return value to the Resonate Server as a checkpoint | The generator body with the five yields |
| Replaying the generator on worker restart, skipping completed steps and resuming at the unfinished one | The five step functions (download_document, extract_text, analyze_document, store_results, notify_requester) — plain functions, no SDK awareness |
Deduplicating dispatches by promise ID — begin_rpc("doc/job-001", ...) twice produces one execution | Choosing a meaningful ID. This example uses doc/<jobId> so client retries collapse to one workflow |
Exposing the durable promise to the status-poll Lambda — same doc/<jobId> resolved from a different container | The GET /status/:jobId handler that calls resonate.get(...) |
| Holding workflow state across Lambda's 15-minute timeout — state lives on the Resonate Server, not in any Lambda | The split: Lambda only dispatches and polls; the workflow runs on worker.py |
The Lambda function never executes a step. It only knows the promise ID and the function name. Step bodies live on the worker, and the worker is a long-running Python process — typically on ECS / Fargate or an EC2 instance (README.md:140-152). Lambda's serverless cost profile is preserved on the trigger; the long workflow runs where time limits don't apply.
Failure modes covered
- Workflow exceeds Lambda's 15-minute ceiling. The Lambda invocation ends after the
begin_rpc(...)call returns; it does not wait for the workflow to finish. The workflow's wall-clock time has no relationship to Lambda's timeout because the workflow runs onworker.py. (lambda_function.py:94-101,worker.py:120-121) - The same
jobIdis submitted twice (API Gateway retry on timeout, client retry on network failure, duplicate webhook). Both calls pass the same promise IDdoc/<jobId>. The Resonate Server returns the existing promise; only one workflow execution runs. (lambda_function.py:98,README.md:206-211) - The worker crashes between two steps. When the worker restarts and re-joins the
workergroup, the server replays the generator. Completedyield ctx.run(...)steps return their checkpointed values immediately; the unfinished step re-executes. The workflow's final return value is unchanged. (worker.py:14-18,worker.py:93-97) - The caller wants the result minutes or hours after dispatch.
GET /status/:jobIdis a fresh Lambda invocation. The Lambda container holds no workflow state; it callsresonate.get(f"doc/{job_id}")to read state from the server.handle.done()distinguishes in-flight from completed. (lambda_function.py:118-134)
When to reach for this pattern
- If you have an API Gateway → Lambda entry point but the actual work routinely exceeds 15 minutes (large document OCR, multi-step LLM pipelines, ETL jobs).
- If your handler needs to be idempotent against client and gateway retries — use a domain ID (here,
jobId) as the promise ID and let the server deduplicate. - If the workflow has discrete, durable steps where you want each step's success persisted so a worker crash doesn't restart the whole pipeline.
- If status polling needs to work across cold starts — any Lambda container can
resonate.get(id)because state lives on the server, not in process memory. - If you want to later add a step that blocks on human approval or an external callback. The current example does not use a blocking promise primitive, but the worker process has no execution ceiling, so adding one is an additive change inside the same generator.
Sources
- Example repo: resonatehq-examples/example-lambda-workers-py
- Python SDK: resonatehq/resonate-sdk-py
- SDK source for the primitives used:
resonate/resonate.py(class Resonate,register,begin_rpc,get) - Resonate Server install: docs.resonatehq.io/server/install (this example targets the legacy
resonate serve, notresonate dev) - Resonate docs home: docs.resonatehq.io
