6 min readResonate HQJust published

Worker-group routing for an on-premise FaaS in Python on Resonate

How RPC, RFI, and ctx.detached split a FaaS submission into one durable orchestration on a router process and a pool of replaceable executors in a named worker group.

Resonate brand card on a dark background with a plum spectrum wave at the bottom and the post headline in white Sansation.

A FaaS-style platform needs to accept a function submission, route it to a worker that can run it (a GPU node, a CPU node, a node with a specific accelerator), and return a result whose delivery survives worker crashes. Resonate models this as a durable orchestration on a router process that issues remote calls to a named worker group; the durable promise lives on the Resonate server, so a crashed worker is replaced by another worker in the same group without losing the submission. The example-function-as-a-service-py repo ("Modulate") demonstrates this with one router (modulate.py), one worker (worker.py) hard-coded to the gpu group, and three invocation modes — RPC, RFI, and ctx.detached — selected per submission via the router's wait flag. The repo accepts a --machine CLI flag but does not yet wire it into the poll(...) target; both branches dispatch to the gpu group. Multi-group routing is described in the README's architecture diagram but is not implemented in the code at HEAD; wiring machine_type into the target is left as an extension.

SDK-version note. pyproject.toml:7 pins resonate-sdk>=0.6.7, but the committed source still uses pre-v0.6.7 import paths and kwargs: resonate.targets (removed), resonate.task_sources.poller (renamed to resonate.message_sources.poller), resonate.retry_policy.never (now resonate.retry_policies.Never), the task_source= kwarg (now message_source=), and .options(send_to=poll("gpu")) (now .options(target="poll://any@gpu/<worker-id>"), produced by Poller(...).anycast). The code blocks below are lifted verbatim from the repo for traceability and will not import as-is against the pinned SDK; the orchestration shape they demonstrate is the same on both surfaces.

The shape of the solution

The router's orchestrating function reads the script, then branches on whether the caller wants to block for a result (RPC) or fire-and-forget via a detached child (RFI):

@resonate.register(retry_policy=never())
def prep_execute(ctx: Context, id, script, machine_type, wait):
    try:
        content = ""
        with open(script, "r") as file:
            content = file.read()
 
        print("Sending script to be executed in gpu")
        if wait:
            result = yield ctx.rfc(execute, content, id).options(
                id=id, send_to=poll("gpu")
            )
            return result
        else:
            detached_id = f"detached_{id}"
            yield ctx.detached(detached_id, detached_rfi, content, id, "gpu")
            return None
 
    except FileNotFoundError as ef:
        print("Error: File not found.")
        raise ef
    except Exception as e:
        print(f"An error occurred: {e}")
        raise e
# from example-function-as-a-service-py/modulate.py:30-53

execute is registered on the router as a stub whose body is ..., so its name is in the local registry but the real body lives on the worker:

@resonate.register()
def execute(ctx: Context, script_content, id): ...
# from example-function-as-a-service-py/modulate.py:20-21

The real body is the worker-side execute:

@resonate.register()
def execute(ctx: Context, script_content, script_id):
    # ...
    temp_dir = tempfile.mkdtemp()
    # ...
    result = subprocess.run(
        ["python", "-I", "-S", script_path],
        cwd=temp_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        timeout=120,
        check=True,
    )
    # ...
    return output_filename
# from example-function-as-a-service-py/worker.py:15-76

Router and worker are separate processes pinned to different polling groups. The router is in the entry group; the worker is hard-coded to gpu:

# router process
resonate = Resonate(store=RemoteStore(), task_source=Poller(group="entry"))
# from example-function-as-a-service-py/modulate.py:10
 
# worker process
resonate = Resonate(store=RemoteStore(), task_source=Poller(group="gpu"))
# from example-function-as-a-service-py/worker.py:12

The README documents uv run python worker.py --group gpu (README.md:113), but worker.py has no argparse and no CLI flag — group selection requires editing worker.py:12. Treat the README flag as aspirational.

The durable primitives in play

  • @resonate.register(retry_policy=never()) — registers prep_execute as a durable function with retries disabled, so a FileNotFoundError from the script-load step is fatal rather than replayed against the bad path. modulate.py:30-31.
  • ctx.rfc(execute, content, id).options(id=id, send_to=poll("gpu")) — Remote Function Call: yields the coroutine, blocks prep_execute until the promise identified by id resolves on the Resonate server, and returns the worker's return value. The send_to=poll("gpu") directs the invocation to the gpu worker group's poller. modulate.py:38-42.
  • ctx.detached(detached_id, detached_rfi, content, id, "gpu") — starts a new top-level call graph not tied to the caller's lifetime. The router can return None immediately and the detached child handles the actual remote invocation. modulate.py:43-46.
  • ctx.rfi(execute, content, id).options(id=id, send_to=poll("gpu")) — Remote Function Invocation inside the detached child: fire-and-forget, no yield-blocking on the result. modulate.py:24-27.
  • resonate.promises.get(id=id) — non-durable read of the promise store from outside any orchestration; used by the --get CLI flag to poll for completion. modulate.py:13-17.
  • Poller(group=<name>) task source — each process polls Resonate for tasks addressed to its group. The router polls entry; the worker polls gpu. modulate.py:10, worker.py:12. Group is set in source, not at CLI time.
  • prep_execute.run(execution_id, id, script, machine_type, wait) then handle.result() — the CLI entrypoint starts the durable orchestration and blocks for its return value. modulate.py:101-105.

What the SDK handles vs. what you write

You write the orchestration as straight-line Python with yield for the three remote primitives (rfc, rfi, detached). You write the per-group worker registration and the actual execution body. You write the CLI surface that turns --script foo.py --machine gpu into a prep_execute.run(...) call.

The SDK does the rest. The Resonate server holds the durable promise keyed by the submission id. The Poller running inside each process subscribes to its group's queue and pulls work. When prep_execute yields an rfc, the SDK serialises the call, writes it to the server as a durable promise targeting poll("gpu"), and pauses the coroutine. A gpu worker's poller picks up the task, runs execute, and resolves the promise with the return value (or rejects it with an exception). The SDK resumes prep_execute with the value, on whatever router replica is alive at that moment — not necessarily the one that originally yielded. None of the routing, queueing, promise persistence, or replay is in the application code.

The split also explains the execute = ... stub at modulate.py:20-21: the router never executes execute locally, but its function name must be in the local registry so ctx.rfc(execute, ...) can resolve the registered name to send to the wire.

Failure modes covered

  • Worker crashes mid-execute. The durable promise lives on the Resonate server, not on the worker. When the worker dies, the lease on the task expires and another worker in the same group picks it up on its next poll. The router's yield ctx.rfc(...) is unaffected — it was already suspended waiting for the promise. The promise resolves once for the submission as a whole, but the worker-side execute body can run more than once if a crash happens after side effects but before the promise is resolved; the worker's only externally-visible side effects are the per-script_id .sout/.eout files (worker.py:26-27, 66-72) and the sandboxed subprocess, both of which are safe to re-run on a clean temp directory. The README phrases this as "exactly-once execution semantics" (README.md:240-245), which is accurate for the promise but not for the body. This is a property of the durable-promise + group-polling design, not of code paths in this repo.
  • Router process restarts mid-orchestration. prep_execute is registered durable. When the router restarts, the SDK replays the coroutine from the last yield checkpoint; the open rfc is still on the server as an unresolved promise, so the replay re-attaches to the existing promise rather than re-submitting work. modulate.py:30, 38-42.
  • Caller wants fire-and-forget with no live coupling. wait=False takes the ctx.detached branch (modulate.py:43-46), which spawns a new top-level call graph; the router's prep_execute returns None and exits, but the detached detached_rfi continues to drive the actual rfi against the gpu group. The submission can be queried later via resonate.promises.get(id=id) (modulate.py:13-17).
  • Bad script path. prep_execute is registered with retry_policy=never() (modulate.py:30). The open(script, "r") at modulate.py:34-35 raises FileNotFoundError (modulate.py:48-50); the orchestration fails immediately instead of replaying against a path that will keep failing.
  • User script misbehaves on the worker. The worker sandboxes the script in a tempfile.mkdtemp() directory (worker.py:25) and runs it with subprocess.run(["python", "-I", "-S", ...], timeout=120, check=True) (worker.py:39-47). subprocess.TimeoutExpired and CalledProcessError are caught and reported (worker.py:52-55); the temp directory is always cleaned up in finally (worker.py:58-60). Note that worker.py:53 reports "Error: Execution timed out after 5 seconds" while the actual timeout= argument is 120 seconds — the error string lags the timeout value and should be read as "timed out".

When to reach for this pattern

  • If you need to route function executions to a worker group and the routing decision is made at submission time. Wiring multiple groups (e.g. selecting between gpu and cpu from machine_type) requires threading the argument into the target kwarg; this example fixes the target to gpu and would need that extension.
  • If a submission must outlive the router process — the caller wants to disconnect and come back later for the result via an ID.
  • If you want one orchestration entrypoint to support both blocking (RPC) and fire-and-forget (RFI / detached) submission modes, selectable per call.
  • If workers in a group must be interchangeable and crashes during execution must transparently re-dispatch to a peer.
  • If you need both a durable submission protocol and an out-of-band polling API for status, served from the same promise store.

Sources

  • Example repo: https://github.com/resonatehq-examples/example-function-as-a-service-py
  • Resonate Python SDK: https://github.com/resonatehq/resonate-sdk-py (tag v0.6.7 matches pyproject.toml:7)
  • SDK primitives referenced (v0.6.7):
    • Context.rfiresonate-sdk-py/resonate/resonate.py:1037-1057
    • Context.rfcresonate-sdk-py/resonate/resonate.py:1060-1080
    • Context.detachedresonate-sdk-py/resonate/resonate.py:1083-1127
    • Function.options(target=...)resonate-sdk-py/resonate/resonate.py:1314-1366
    • Poller.anycast (v0.6.7 replacement for targets.poll(...)) — resonate-sdk-py/resonate/message_sources/poller.py:52-54
    • Never retry policy — resonate-sdk-py/resonate/retry_policies/__init__.py:6
  • Resonate docs: https://docs.resonatehq.io/sdk/python (Python SDK reference), https://docs.resonatehq.io/concepts/rpc-rfi (RPC vs RFI background, linked from the example README).