Resonate HQ · 5 min read

Load balancing across a worker group in Python on Resonate

How tagging a Python process with `group="worker-group"` and addressing an RPC to `poll://any@worker-group` handle service discovery, load balancing, and crash recovery.

A single worker process eventually saturates. Running multiple instances forces the caller to solve service discovery, load balancing, and crash recovery for a pool that may grow, shrink, or lose a member mid-execution. Resonate's approach is to tag each worker process with a group name at construction and let the caller address an RPC to "any member" of that group through the durable promise store; the server then routes the invocation to a live worker and re-routes it if the claimant disappears before completing. The example runs N copies of worker.py in "worker-group", then fires repeated non-blocking begin_rpc calls from invoke.py against poll://any@worker-group.

The shape of the solution

from resonate import Resonate
from threading import Event
import time
 
# Initialize an instance of Resonate as a worker in "worker-group"
resonate = Resonate.remote(
    group="worker-group",
)
 
# Register the function with Resonate
@resonate.register
def compute_something(_, id, compute_cost):
    """A function that simulates a computation that takes some time."""
    print(f"starting computation {id}")
    # Using time.sleep(), instead of ctx.sleep(), blocks the thread, simulating a time-consuming task
    time.sleep(compute_cost)
    print(f"computed something that cost {compute_cost} seconds")
    return
 
 
def main():
    # Explicitly start Resonate instance threads
    resonate.start()
    print("worker running...")
    # Keep the main thread alive to allow async tasks to complete
    Event().wait()
# from example-load-balancing-py/worker.py:1-26

from resonate import Resonate
from uuid import uuid4
from random import randint
 
 
# Initialize an instance of Resonate as a client in "invoke-group"
resonate = Resonate.remote(
    group="invoke-group",
)
 
def main():
    # Generate a random compute cost between 1 and 10 seconds
    compute_cost = randint(1, 10)
    # Generate a random promise ID
    promise_id = str(uuid4())
    # Invoke compute_something() on the worker group
    _ = resonate.options(target="poll://any@worker-group").begin_rpc(promise_id, "compute_something", promise_id, compute_cost)
# from example-load-balancing-py/invoke.py:1-17

compute_something is a regular function (not a generator) — the example does not use ctx.run, ctx.sleep, or any context primitive. The durability and routing work happens entirely at the boundaries: at registration on the worker, and at the RPC target on the caller.
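
The repeated non-blocking dispatch described above can be sketched as a small loop. This is a minimal sketch, not code from the example repo: `anycast_target` and `dispatch_batch` are hypothetical helpers, and `client` is assumed to be any object exposing `options(target=...).begin_rpc(...)` the way the Resonate client in invoke.py does.

```python
from random import randint
from uuid import uuid4


def anycast_target(group):
    """Build the anycast target string the example uses for a group name."""
    return f"poll://any@{group}"


def dispatch_batch(client, n, group="worker-group"):
    """Fire n non-blocking RPCs at the group and return the promise ids.

    Hypothetical helper: `client` only needs options(target=...).begin_rpc(...).
    """
    promise_ids = []
    for _ in range(n):
        promise_id = str(uuid4())      # fresh id, so each call is a new execution
        compute_cost = randint(1, 10)  # simulated cost in seconds
        # The returned handle is discarded, as in invoke.py.
        client.options(target=anycast_target(group)).begin_rpc(
            promise_id, "compute_something", promise_id, compute_cost
        )
        promise_ids.append(promise_id)
    return promise_ids
```

Returning the ids keeps the door open for looking the promises up later, even though the handles themselves are dropped.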

The durable primitives in play

  • Resonate.remote(group="worker-group") — constructs a Resonate client that connects to a Resonate Server as both durable promise store and anycast message source, and joins the named group. worker.py:6-8. SDK constructor at v0.6.7: resonate-sdk-py/resonate/resonate.py:194-254.
  • @resonate.register — registers compute_something under its Python name and version 1 in the worker's local Registry, so the server can route an invocation keyed on the name string "compute_something" to this process. worker.py:11. SDK: resonate-sdk-py/resonate/resonate.py:316-399 (v0.6.7) — overload stubs at 316-331, implementation at 332-399.
  • resonate.start() — starts the worker's bridge, processor threads, and the Poller message source that long-polls the Resonate Server for tasks claimed against this group. worker.py:23. SDK: resonate-sdk-py/resonate/resonate.py:266-274 (v0.6.7).
  • resonate.options(target="poll://any@worker-group") — returns a per-call copy of the client with the RPC target overridden to anycast against the named group. invoke.py:17. The poll://any@<group> form is the SDK's anycast convention; see resonate-sdk-py/resonate/message_sources/poller.py:52-54 and resonate-sdk-py/resonate/conventions/remote.py:59-61.
  • resonate.begin_rpc(id, "compute_something", promise_id, compute_cost) — creates a durable promise keyed on id, dispatches the invocation to the targeted group, and returns a Handle[R] without blocking. The example discards the handle. invoke.py:17. SDK: resonate-sdk-py/resonate/resonate.py:633-713 (v0.6.7).
  • Worker ttl on claimed tasks — defaults to 10 seconds (ttl=10 in Resonate.remote's signature). When a worker claims a task it holds the claim for at most ttl seconds; if it dies without completing, the task becomes re-claimable by another member of the group. resonate-sdk-py/resonate/resonate.py:206 (v0.6.7).
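
The claim-and-expire behaviour in the last bullet can be illustrated with a toy model. This is a sketch under stated assumptions, not the Resonate Server's data structure: `ToyTask` is a hypothetical class, time is passed in explicitly as `now` so the run is deterministic, and it models only claim and expiry.

```python
class ToyTask:
    """One claimable task; a simplified model for illustration only."""

    def __init__(self, task_id, ttl=10):
        self.task_id = task_id
        self.ttl = ttl                  # seconds a claim stays valid
        self.claimed_by = None
        self.claim_expires_at = None
        self.completed = False

    def try_claim(self, worker, now):
        """Claim the task if it is unclaimed or the previous claim has expired."""
        if self.completed:
            return False
        expired = self.claim_expires_at is not None and now >= self.claim_expires_at
        if self.claimed_by is None or expired:
            self.claimed_by = worker
            self.claim_expires_at = now + self.ttl
            return True
        return False

    def complete(self, worker, now):
        """Only the current, unexpired claimant may complete the task."""
        if self.claimed_by == worker and now < self.claim_expires_at:
            self.completed = True
            return True
        return False


task = ToyTask("t1", ttl=10)
assert task.try_claim("worker-a", now=0)       # worker-a claims at t=0
assert not task.try_claim("worker-b", now=5)   # claim is still held at t=5
# worker-a crashes and never completes; its claim lapses at t=10
assert task.try_claim("worker-b", now=12)      # worker-b takes over
assert task.complete("worker-b", now=15)
```

The point of the model is that recovery needs no coordination between workers: the second worker only has to ask again after the first claim has lapsed.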

What the SDK handles vs. what you write

| SDK handles | You write |
| --- | --- |
| Discovering which worker instances are currently in worker-group and routing each RPC to one of them | group="worker-group" on the worker, target="poll://any@worker-group" on the caller |
| Persisting a durable promise per invocation keyed on the caller-supplied promise_id, so a duplicate begin_rpc with the same id resolves against the existing promise rather than starting a new execution | The promise_id = str(uuid4()) (invoke.py:15), or any deterministic id you want to use as an idempotency key |
| Holding the task claim for ttl seconds and re-routing it to another worker if the claimant dies | Nothing; ttl=10 is the default |
| Long-polling the Resonate Server for tasks addressed to worker-group from each worker process | Calling resonate.start() once per worker (worker.py:23) |
| Resolving the function name string "compute_something" to the registered function in the receiving worker's registry | Registering the function once with @resonate.register (worker.py:11) and invoking by name string (invoke.py:17) |

The caller does not know how many workers exist, where they run, or which one will pick up a given call. The worker code has no scheduling logic, no health endpoint, no service-discovery client. Both ends are written as if a single in-process function call were happening, and the durable promise plus the anycast target are doing the routing and recovery.
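
To make the "caller does not know which worker" point concrete, here is a toy stand-in for group-addressed routing. It is an illustrative sketch, not the Resonate Server's algorithm: `ToyRouter` is a hypothetical class that keeps one FIFO queue per group name and hands the next task to whichever member polls.

```python
from collections import defaultdict, deque


class ToyRouter:
    """Toy anycast routing: callers address a group, any polling member receives."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def send(self, group, task):
        """Caller side: enqueue for the group without naming a worker."""
        self.queues[group].append(task)

    def poll(self, group):
        """Worker side: take the next task for this group, or None if idle."""
        queue = self.queues[group]
        return queue.popleft() if queue else None


router = ToyRouter()
for i in range(4):
    router.send("worker-group", f"task-{i}")   # sent before any worker exists

# Two workers join later and alternate polls; the waiting tasks drain to both.
handled = {"worker-a": [], "worker-b": []}
for worker in ["worker-a", "worker-b", "worker-a", "worker-b"]:
    handled[worker].append(router.poll("worker-group"))
```

In this model, adding a third worker changes nothing on the caller side: it simply becomes one more process calling `poll` on the same group name.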

Failure modes covered

  • No worker is available when the invocation lands. The durable promise is created on the server when begin_rpc is called. If no worker in worker-group is currently claiming, the task waits in the server queue until one is, then gets routed. invoke.py:17 plus the server's anycast routing.
  • A worker crashes after claiming the task but before completing it. The claim expires after the worker's ttl (10 seconds by default; resonate-sdk-py/resonate/resonate.py:206), the task becomes re-claimable, and another worker in the group picks it up. The README confirms this explicitly: "If you kill one of the workers while it is in the middle of handling executions, you will see the executions recover on another worker" (README.md:77).
  • The caller fires the same promise_id twice. begin_rpc is keyed on id. A second call with the same id resolves against the existing durable promise rather than triggering a second execution. The example happens to use a fresh uuid4() per call (invoke.py:15), which means each invocation is intentionally a new execution; the dedupe guarantee is still there if a caller supplies a stable id.
  • The caller process exits immediately after dispatching. The invocation has already been persisted as a durable promise before begin_rpc returns the handle. Discarding the handle (_ = ...) is safe — the worker will run the function and resolve the promise on the server regardless of whether anyone is still subscribed. invoke.py:17.
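
The dedupe behaviour in the third bullet amounts to create-if-absent keyed on the promise id. The following is a minimal in-memory sketch of that idea, assuming nothing about how the Resonate Server stores promises; `ToyPromiseStore` and its fields are hypothetical.

```python
class ToyPromiseStore:
    """In-memory model of create-if-absent promises keyed on id (illustrative only)."""

    def __init__(self):
        self.promises = {}
        self.executions_started = 0

    def begin(self, promise_id, func_name, *args):
        """Return the existing promise for this id, or create one and start work."""
        existing = self.promises.get(promise_id)
        if existing is not None:
            return existing              # duplicate id: no new execution
        promise = {"id": promise_id, "func": func_name, "args": args, "state": "pending"}
        self.promises[promise_id] = promise
        self.executions_started += 1
        return promise


store = ToyPromiseStore()
first = store.begin("order-42", "compute_something", "order-42", 3)
second = store.begin("order-42", "compute_something", "order-42", 3)   # same id
third = store.begin("order-43", "compute_something", "order-43", 3)    # new id
```

Here `first` and `second` are the same promise and only two executions start, which is why a stable, caller-chosen id works as an idempotency key while a fresh `uuid4()` per call does not.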

What this example does not cover, by design:

  • No retry policy is configured on the call. If compute_something raises, the durable promise rejects; the caller never observes it because the handle is discarded.
  • No ctx.run checkpoint inside the function. The body is a single time.sleep plus two prints; a mid-function crash retries the whole function on recovery, not a sub-step.
  • No fan-out, no compensation, no scheduling.

When to reach for this pattern

  • If you have one stateless function that can run anywhere and you need to scale instances horizontally without writing service-discovery code.
  • If the caller wants fire-and-forget semantics for work that needs to survive worker churn — the durable promise outlives any individual worker.
  • If you need a worker pool where members can crash mid-execution and the in-flight work has to land on another member, without writing the recovery wiring yourself.
  • If you want to deploy a worker group as an independent process pool (one container image, N replicas, no leader, no peer discovery) and address it as a single logical endpoint from the caller side.
  • If you are migrating from an HTTP microservice with a load balancer + retry queue to a single-language workflow runtime and want the same shape with the queue, the routing, and the at-least-once delivery handled in one place.

Sources

  • Example repo: https://github.com/resonatehq-examples/example-load-balancing-py
  • Python SDK repo: https://github.com/resonatehq/resonate-sdk-py
  • Python SDK at the pinned version: https://github.com/resonatehq/resonate-sdk-py/tree/v0.6.7
  • Resonate documentation (worker groups and targets): https://docs.resonatehq.io/concepts/targets
  • Files cited in this post:
    • worker.py:1-26: the worker process (client construction, registration, run loop)
    • invoke.py:1-17: the caller (client construction, RPC target, begin_rpc)
    • pyproject.toml:6-9: Python and SDK pins
    • .python-version:1: Python pin
    • SDK resonate/resonate.py:194-254 (v0.6.7): Resonate.remote(...) classmethod
    • SDK resonate/resonate.py:266-274 (v0.6.7): Resonate.start()
    • SDK resonate/resonate.py:281-314 (v0.6.7): Resonate.options(...) and the target field
    • SDK resonate/resonate.py:316-399 (v0.6.7): @resonate.register (overload stubs 316-331, implementation 332-399)
    • SDK resonate/resonate.py:633-713 (v0.6.7): Resonate.begin_rpc(...)
    • SDK resonate/message_sources/poller.py:49-54: poll://uni@<group> (49-50) and poll://any@<group> (52-54) conventions
    • SDK resonate/conventions/remote.py:59-61: fallback mapping from target to poll://any@<target>
  • Note on SDK version: pyproject.toml pins resonate-sdk>=0.6.7. The 0.6.x line exposes Resonate.remote(group=..., ...) and Resonate.local(...) as classmethod constructors. The current main branch of resonatehq/resonate-sdk-py has collapsed these into a single Resonate(url=..., group=...) constructor with auto-detection from RESONATE_URL. Against a post-0.6.7 release, verify the constructor signature and re-confirm the poll://any@<group> target string against the SDK you are pinning.
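
One way to act on the version note above is a small shim that picks whichever constructor the installed class exposes. This is a hedged sketch: `make_client` is a hypothetical helper, and it assumes only what the note describes, a `remote` classmethod on the 0.6.x line and a plain constructor afterwards. Check it against the SDK you are pinning before relying on it.

```python
def make_client(resonate_cls, group):
    """Build a client for `group` with whichever constructor the class exposes.

    Hypothetical helper based on the constructor shapes described in the note.
    """
    remote = getattr(resonate_cls, "remote", None)
    if callable(remote):
        # 0.6.x style: Resonate.remote(group=...)
        return remote(group=group)
    # Later style: Resonate(group=...), with the URL assumed to come from RESONATE_URL
    return resonate_cls(group=group)
```

A worker would call it as `make_client(Resonate, "worker-group")`. The attribute check is deliberately shallow, so a release that keeps a `remote` attribute with a different signature would still need a manual look.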