By Resonate HQ · 4 min read

AWS Lambda as a stateless trigger for durable Python workflows

How a Lambda handler hands a multi-step Python workflow to a long-running Resonate worker that survives Lambda's 15-minute timeout.


AWS Lambda has a 15-minute hard timeout, and document-processing workflows that involve OCR, LLM analysis, database writes, and downstream notifications routinely run longer than that. This example splits the responsibility: Lambda is a stateless trigger that calls resonate.begin_rpc(...) and returns 202 immediately, while the actual workflow runs on a separate long-running Python process registered with the Resonate Server. It shows the dispatch handler, the status-polling handler, and a five-step durable workflow registered on the worker — each step a checkpoint that survives crashes and worker restarts.

The shape of the solution

@resonate.register
def process_document(ctx: Context, job: dict[str, Any]):
    """Five-step durable document pipeline.
 
    Each `yield ctx.run(...)` is checkpointed: if the worker crashes,
    only the unfinished step re-runs on resume.
    """
    page_count = yield ctx.run(download_document, job)
    text = yield ctx.run(extract_text, job, page_count)
    analysis = yield ctx.run(analyze_document, job, text)
    stored_at = yield ctx.run(
        store_results, job, analysis["summary"], analysis["data"]
    )
    notified_at = yield ctx.run(notify_requester, job, stored_at)
 
    return {
        "jobId": job["jobId"],
        "type": job["type"],
        "pageCount": page_count,
        "summary": analysis["summary"],
        "extractedData": analysis["data"],
        "storedAt": stored_at,
        "notifiedAt": notified_at,
    }
# from example-lambda-workers-py/worker.py:91-114

The Lambda side is shorter — it does no work, only dispatches. The module-scope client is created once per cold start:

resonate = Resonate.remote(
    group="gateway",
    host=RESONATE_HOST,
    store_port=RESONATE_STORE_PORT,
    message_source_port=RESONATE_MSG_PORT,
)
# from example-lambda-workers-py/lambda_function.py:45-50

The handler parses the request, then dispatches and returns 202:

def _handle_process_document(event: dict[str, Any]) -> dict[str, Any]:
    # ... parse and validate body, build `job` dict ...
    resonate.options(target="poll://any@worker").begin_rpc(
        f"doc/{job_id}",
        "process_document",
        job,
    )
 
    return _response(
        202,
        {
            "status": "accepted",
            "jobId": job_id,
            "statusUrl": f"/status/{job_id}",
            "message": "Processing in background. Poll statusUrl for results.",
        },
    )
# from example-lambda-workers-py/lambda_function.py:67-111
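
The status-polling side is not excerpted in this post. As a rough sketch, assuming only what the post describes (a handle obtained from resonate.get(...) that exposes done() and result()), the response-shaping logic might look like the following. The names `PromiseHandle`, `build_status_response`, and the response fields are hypothetical, and `_FakeHandle` exists only so the sketch runs without a Resonate Server.

```python
import json
from typing import Any, Protocol


class PromiseHandle(Protocol):
    """Minimal shape this sketch assumes of the handle returned by resonate.get(...)."""

    def done(self) -> bool: ...
    def result(self) -> Any: ...


def build_status_response(job_id: str, handle: PromiseHandle) -> dict[str, Any]:
    """Turn a promise handle into an API Gateway style response (hypothetical helper)."""
    if not handle.done():
        # Workflow still running on the worker: report progress, do not block.
        body = {"status": "processing", "jobId": job_id}
        return {"statusCode": 200, "body": json.dumps(body)}

    # Workflow finished: result() yields the workflow's return value.
    body = {"status": "completed", "jobId": job_id, "result": handle.result()}
    return {"statusCode": 200, "body": json.dumps(body)}


class _FakeHandle:
    """In-memory stand-in for a real handle, used only to exercise the sketch."""

    def __init__(self, finished: bool, value: Any = None) -> None:
        self._finished = finished
        self._value = value

    def done(self) -> bool:
        return self._finished

    def result(self) -> Any:
        return self._value


pending = build_status_response("job-001", _FakeHandle(finished=False))
finished = build_status_response("job-001", _FakeHandle(True, {"pageCount": 3}))
```

In a real handler the handle would come from resonate.get(f"doc/{job_id}") instead of the fake. How a failed workflow or an unknown job ID surfaces is not described in the post, so those branches are left out here.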

The durable primitives in play

  • Resonate.remote(group="gateway", ...) (lambda_function.py:45-50). Connects the Lambda container to the Resonate Server as a member of the gateway group. Created at module scope so the connection survives across warm invocations.
  • Resonate.remote(group="worker") (worker.py:34). The worker's client. Joins the worker group that the Lambda targets with poll://any@worker.
  • @resonate.register (worker.py:91). Registers process_document by name with the Resonate Server so the server can route an incoming RPC to it.
  • resonate.options(target="poll://any@worker").begin_rpc(id, func_name, args) (lambda_function.py:97-101). Non-blocking dispatch: enqueues a durable RPC on the named worker group and returns. The promise ID doc/<jobId> is the idempotency key: the server deduplicates by it.
  • yield ctx.run(fn, *args) (worker.py:98-104). The checkpoint primitive inside a generator workflow. Runs fn durably and persists its return value to the Resonate Server. On replay, completed steps return their persisted values without re-running, so the workflow resumes at the first unfinished yield.
  • resonate.get(id) (lambda_function.py:126). Returns the durable promise handle for an existing workflow. handle.done() and handle.result() give the status-poll endpoint its state and return value.
  • resonate.start() + Event().wait() (worker.py:120-121). Starts the worker's poll loop against the server and blocks the main thread so the process stays up.
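
To make the checkpoint-and-replay behaviour of yield ctx.run(...) concrete, here is a toy in-memory model. It is not the Resonate SDK and makes no claim about how the SDK is implemented: `ToyContext`, `drive`, and the `store` list are hypothetical stand-ins, and real checkpoints live on the Resonate Server, not in a Python list.

```python
from typing import Any, Callable, Generator

Step = tuple[Callable[..., Any], tuple[Any, ...]]


class ToyContext:
    """Stand-in for the SDK context: run() only describes a step, it does not execute it."""

    def run(self, fn: Callable[..., Any], *args: Any) -> Step:
        return (fn, args)


def drive(workflow: Callable[[ToyContext], Generator[Step, Any, Any]],
          checkpoints: list[Any]) -> Any:
    """Run a generator workflow, reusing stored results for steps that already finished."""
    gen = workflow(ToyContext())
    index = 0
    try:
        step = next(gen)
        while True:
            fn, args = step
            if index < len(checkpoints):
                value = checkpoints[index]   # replay: reuse the stored result
            else:
                value = fn(*args)            # first execution: run the step
                checkpoints.append(value)    # persist before moving on
            index += 1
            step = gen.send(value)
    except StopIteration as stop:
        return stop.value                    # the workflow's return value


calls: list[str] = []
crash_once = {"armed": True}


def download() -> int:
    calls.append("download")
    return 3


def extract(pages: int) -> str:
    calls.append("extract")
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated worker crash")
    return f"text from {pages} pages"


def pipeline(ctx: ToyContext):
    pages = yield ctx.run(download)
    text = yield ctx.run(extract, pages)
    return {"pageCount": pages, "text": text}


store: list[Any] = []
try:
    drive(pipeline, store)           # first attempt dies inside extract
except RuntimeError:
    pass
result = drive(pipeline, store)      # restart: download is replayed, extract re-runs
```

After the second call, `calls` shows that download ran once and extract ran twice, which mirrors the rule that only the unfinished step re-executes. The toy keys checkpoints by position for brevity; a real system would need a stable identity per step.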

What the SDK handles vs. what you write

| The SDK handles | You write |
| --- | --- |
| Routing the RPC from gateway group to a worker in worker group via poll://any@worker | The dispatch call: resonate.options(target=...).begin_rpc(id, "process_document", job) |
| Persisting each yield ctx.run(...) return value to the Resonate Server as a checkpoint | The generator body with the five yields |
| Replaying the generator on worker restart, skipping completed steps and resuming at the unfinished one | The five step functions (download_document, extract_text, analyze_document, store_results, notify_requester): plain functions, no SDK awareness |
| Deduplicating dispatches by promise ID: begin_rpc("doc/job-001", ...) twice produces one execution | Choosing a meaningful ID. This example uses doc/<jobId> so client retries collapse to one workflow |
| Exposing the durable promise to the status-poll Lambda: same doc/<jobId> resolved from a different container | The GET /status/:jobId handler that calls resonate.get(...) |
| Holding workflow state across Lambda's 15-minute timeout: state lives on the Resonate Server, not in any Lambda | The split: Lambda only dispatches and polls; the workflow runs on worker.py |

The Lambda function never executes a step. It only knows the promise ID and the function name. Step bodies live on the worker, and the worker is a long-running Python process — typically on ECS / Fargate or an EC2 instance (README.md:140-152). Lambda's serverless cost profile is preserved on the trigger; the long workflow runs where time limits don't apply.

Failure modes covered

  • Workflow exceeds Lambda's 15-minute ceiling. The Lambda invocation ends after the begin_rpc(...) call returns; it does not wait for the workflow to finish. The workflow's wall-clock time has no relationship to Lambda's timeout because the workflow runs on worker.py. (lambda_function.py:94-101, worker.py:120-121)
  • The same jobId is submitted twice (API Gateway retry on timeout, client retry on network failure, duplicate webhook). Both calls pass the same promise ID doc/<jobId>. The Resonate Server returns the existing promise; only one workflow execution runs. (lambda_function.py:98, README.md:206-211)
  • The worker crashes between two steps. When the worker restarts and re-joins the worker group, the server replays the generator. Completed yield ctx.run(...) steps return their checkpointed values immediately; the unfinished step re-executes. The workflow's final return value is unchanged. (worker.py:14-18, worker.py:93-97)
  • The caller wants the result minutes or hours after dispatch. GET /status/:jobId is a fresh Lambda invocation. The Lambda container holds no workflow state; it calls resonate.get(f"doc/{job_id}") to read state from the server. handle.done() distinguishes in-flight from completed. (lambda_function.py:118-134)
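
The duplicate-submission case rests on create-if-absent semantics keyed by the promise ID. The following toy model illustrates that idea only. It is not the Resonate Server or its API: `ToyPromiseStore` and `begin` are hypothetical names, and the toy runs the function synchronously, whereas the real dispatch is non-blocking.

```python
from typing import Any, Callable


class ToyPromiseStore:
    """In-memory stand-in for a server that deduplicates work by promise ID."""

    def __init__(self) -> None:
        self._promises: dict[str, dict[str, Any]] = {}
        self.executions = 0

    def begin(self, promise_id: str, fn: Callable[[dict], Any], job: dict) -> dict[str, Any]:
        """Run fn only if promise_id is new; otherwise return the existing promise."""
        existing = self._promises.get(promise_id)
        if existing is not None:
            return existing              # duplicate dispatch: no second execution
        self.executions += 1
        promise = {"id": promise_id, "result": fn(job)}
        self._promises[promise_id] = promise
        return promise


def process(job: dict) -> str:
    return f"processed {job['jobId']}"


store = ToyPromiseStore()
job = {"jobId": "job-001"}
first = store.begin(f"doc/{job['jobId']}", process, job)
retry = store.begin(f"doc/{job['jobId']}", process, job)    # e.g. a client retry
other = store.begin("doc/job-002", process, {"jobId": "job-002"})
```

Here `first` and `retry` are the same promise object and `store.executions` counts two runs for three dispatches. This is why a domain-level ID such as jobId makes a better promise ID than a value generated per request: a fresh ID on every retry would defeat the deduplication.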

When to reach for this pattern

  • If you have an API Gateway → Lambda entry point but the actual work routinely exceeds 15 minutes (large document OCR, multi-step LLM pipelines, ETL jobs).
  • If your handler needs to be idempotent against client and gateway retries — use a domain ID (here, jobId) as the promise ID and let the server deduplicate.
  • If the workflow has discrete, durable steps where you want each step's success persisted so a worker crash doesn't restart the whole pipeline.
  • If status polling needs to work across cold starts — any Lambda container can resonate.get(id) because state lives on the server, not in process memory.
  • If you want to later add a step that blocks on human approval or an external callback. The current example does not use a blocking promise primitive, but the worker process has no execution ceiling, so adding one is an additive change inside the same generator.

Sources