Resonate HQ · 4 min read

Durable sleep across process crashes in Rust on Resonate

How `ctx.sleep` collapses long-lived waits into one line by suspending the workflow against a durable timer promise.

A business process that needs to pause for hours, days, or longer cannot rely on holding a single process open — the longer a process lives, the more likely it crashes. Resonate replaces the in-process wait with a server-backed timer promise: the workflow suspends, the worker becomes free, and any worker in the group resumes the workflow when the timer fires. The example-durable-sleep-rs repo shows this in roughly 60 lines split between a worker binary and a client binary.

The shape of the solution

#[resonate::function]
async fn sleeping_workflow(ctx: &Context, secs: u64) -> Result<String> {
    println!("Sleeping for {secs} seconds...");
 
    ctx.sleep(Duration::from_secs(secs)).await?;
 
    Ok(format!("Slept for {secs} seconds"))
}
// from example-durable-sleep-rs/src/bin/worker.rs:10

The whole pattern is one line: ctx.sleep(Duration::from_secs(secs)).await?. Everything around it is framing. The function returns a Result<String> and the ? propagates Error::Suspended to the SDK, which is how the worker is told to release this workflow until the timer resolves.
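To make the suspension path concrete, here is a minimal, synchronous sketch that uses only the standard library. The Error and TimerState enums and the sleep and drive functions are hypothetical stand-ins, not the SDK's own types; the real ctx.sleep is async and talks to the Resonate Server. The sketch only illustrates how ? turns a pending timer into an early return that a caller can treat as a signal to release the worker.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the SDK's error type; only the variant
/// discussed above is modeled.
#[derive(Debug, PartialEq)]
enum Error {
    Suspended,
}

/// Hypothetical view of a timer promise as seen by the worker.
#[derive(Clone, Copy)]
enum TimerState {
    Pending,
    Resolved,
}

/// Stand-in for `ctx.sleep(..).await`: a pending timer yields
/// `Err(Error::Suspended)`, a resolved one yields `Ok(())`.
fn sleep(state: TimerState, _duration: Duration) -> Result<(), Error> {
    match state {
        TimerState::Pending => Err(Error::Suspended),
        TimerState::Resolved => Ok(()),
    }
}

/// Same shape as the workflow body: `?` returns early on `Suspended`,
/// so the `format!` line only runs once the timer has resolved.
fn sleeping_workflow(state: TimerState, secs: u64) -> Result<String, Error> {
    sleep(state, Duration::from_secs(secs))?;
    Ok(format!("Slept for {secs} seconds"))
}

/// What a runtime might do with the outcome (an assumption for
/// illustration): release the worker on `Suspended`, otherwise report.
fn drive(state: TimerState, secs: u64) -> String {
    match sleeping_workflow(state, secs) {
        Err(Error::Suspended) => "suspended: worker released".to_string(),
        Ok(msg) => format!("completed: {msg}"),
    }
}
```

Calling drive(TimerState::Pending, 5) takes the early-return branch, while drive(TimerState::Resolved, 5) runs the body to the end. In the real SDK the second call corresponds to the workflow being re-run after the timer promise has resolved.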

The client side dispatches the workflow by id and function name to the workers group:

let id = "sleep-workflow-1";
let secs: u64 = 5;
 
let result: String = resonate
    .rpc(id, "sleeping_workflow", secs)
    .target("poll://any@workers")
    .await
    .expect("rpc to worker failed");
// from example-durable-sleep-rs/src/bin/client.rs:16

The worker process is registered into the workers group (src/bin/worker.rs:23), the client into the client group (src/bin/client.rs:13), and poll://any@workers is the target string that routes the RPC to any process in workers.
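As a rough illustration of how such a target string could be read, the sketch below splits it into three parts. The grammar scheme://selector@group, the Target struct, and the parse_target and is_eligible functions are all assumptions made for this example; the post does not document the SDK's actual parsing or routing rules.

```rust
/// Hypothetical breakdown of a target string such as "poll://any@workers".
/// The grammar `<scheme>://<selector>@<group>` is an assumption made for
/// illustration, not the SDK's documented format.
#[derive(Debug, PartialEq)]
struct Target<'a> {
    scheme: &'a str,
    selector: &'a str,
    group: &'a str,
}

/// Split the string at "://" and then at '@'; reject empty parts.
fn parse_target<'a>(s: &'a str) -> Option<Target<'a>> {
    let (scheme, rest) = s.split_once("://")?;
    let (selector, group) = rest.split_once('@')?;
    if scheme.is_empty() || selector.is_empty() || group.is_empty() {
        return None;
    }
    Some(Target { scheme, selector, group })
}

/// Under the "any" selector, every process registered in the named
/// group is treated as eligible to receive the work.
fn is_eligible<'a>(target: &Target<'a>, worker_group: &str) -> bool {
    target.selector == "any" && target.group == worker_group
}
```

With this reading, parse_target("poll://any@workers") yields the group "workers", so a process registered in workers is eligible and one registered in client is not.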

The durable primitives in play

  • ctx.sleep(Duration) — creates a durable timer promise on the Resonate Server with the resonate:timer tag, and on Pending causes the awaiting workflow to suspend by returning Err(Error::Suspended). src/bin/worker.rs:14; SDK at resonate/src/context.rs:319 (builder) and resonate/src/context.rs:1065 (IntoFuture impl that returns Err(Error::Suspended) on pending).
  • #[resonate::function] + resonate.register(sleeping_workflow) — registers the function under the name "sleeping_workflow" so the client can dispatch it by string. src/bin/worker.rs:10, src/bin/worker.rs:27; SDK at resonate/src/resonate.rs:295.
  • resonate.rpc(id, name, args).target(...) — durable RPC from outside a workflow: creates a root promise keyed on id, routes the dispatch to the workers group, and awaits the typed result. src/bin/client.rs:19-23; SDK at resonate/src/resonate.rs:361.
  • Worker groups + target("poll://any@workers"): Resonate::new is configured with group: Some("workers".into()) for the worker (src/bin/worker.rs:23) and group: Some("client".into()) for the client (src/bin/client.rs:13); the target string selects any process in the named group.

What the SDK handles vs. what you write

You write: the workflow body, the duration argument, the promise id on the client side, and the group strings ("workers" / "client"). That is the entire surface.

The SDK and the Resonate Server handle the rest: creating the timer promise with the right tags; suspending the workflow's coroutine the moment the timer is Pending (the ? after await propagates Error::Suspended, which the SDK catches before releasing the worker); persisting the timer across worker restarts; firing the timer at the requested wall-clock time; dispatching the resumed workflow to any available process in the workers group; and decoding the resolved-vs-rejected result back into a typed Result<String> on the client. The worker process holds no in-memory wait state: between ctx.sleep(...).await? returning Err(Error::Suspended) and the workflow resuming after the timer fires, no Rust future is kept alive anywhere on the worker.
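The division of labor can be modeled in a few lines of standard-library Rust. This is a toy sketch, not the server's implementation: TimerStore, run_attempt, the explicit now_ms clock, and the promise id "sleep-workflow-1.timer" are hypothetical names chosen for the example. The point it tries to show is that the deadline lives outside the worker, so a second worker can finish what the first one started.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Toy stand-in for the server's timer storage: promise id -> absolute
/// deadline in milliseconds. All names here are hypothetical.
struct TimerStore {
    timeout_at: HashMap<String, u64>,
}

impl TimerStore {
    fn new() -> Self {
        TimerStore { timeout_at: HashMap::new() }
    }

    /// Record a deadline of "now + duration" the first time an id is
    /// seen; later calls with the same id return the stored deadline.
    fn create(&mut self, id: &str, now_ms: u64, d: Duration) -> u64 {
        let deadline = now_ms + d.as_millis() as u64;
        *self.timeout_at.entry(id.to_string()).or_insert(deadline)
    }

    /// A timer has fired once the clock reaches its stored deadline.
    fn is_fired(&self, id: &str, now_ms: u64) -> bool {
        self.timeout_at.get(id).map_or(false, |&t| now_ms >= t)
    }
}

/// One attempt by one worker. The worker keeps no timer state of its
/// own; it re-runs the workflow and asks the store where things stand.
fn run_attempt(store: &mut TimerStore, worker: &str, now_ms: u64) -> Result<String, &'static str> {
    let id = "sleep-workflow-1.timer"; // hypothetical timer promise id
    store.create(id, now_ms, Duration::from_secs(5));
    if store.is_fired(id, now_ms) {
        Ok(format!("{worker} finished after the timer fired"))
    } else {
        Err("suspended")
    }
}
```

If "worker-a" runs an attempt at t = 1000 ms it suspends, and the store holds a deadline of 6000 ms. A different "worker-b" running at t = 6000 ms finds the same deadline already passed and completes, without any state handed over from the first worker.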

Failure modes covered

  • Worker crashes mid-sleep. The timer promise lives on the Resonate Server, not in the worker process. When the timer fires, the server dispatches the resumption to any process in the workers group; this is the poll://any@workers target on the client's RPC plus the worker's group registration at src/bin/worker.rs:23. The README explicitly invites this test: "Try killing the worker mid-sleep and starting it back up — the workflow recovers from the server-side timer promise and finishes" (README.md:67).
  • Long sleep outlives any single process lifetime. Because the wait is a server-side timer, not a tokio::time::sleep, durations of days or years do not depend on a single process surviving that long. The README frames this as the original motivation (README.md:18-25) and the SDK's sleep_create_req writes the resonate:timer tag and a timeout_at derived from the duration (resonate/src/context.rs:286-303).
  • Client retries with the same id. The RPC is keyed on the promise id ("sleep-workflow-1" in src/bin/client.rs:16); a second RPC with the same id reconnects to the existing pending execution rather than starting a new sleep. The client doc comment makes this explicit: "Resonate deduplicates by promise ID: invoking with a workflow ID that already has a PENDING execution reconnects to it; one that has RESOLVED returns the cached result." (src/bin/client.rs:3-6).
  • No worker available when the timer fires. The timer still fires on the server; the resumed workflow waits in the queue for the workers group until a process polls for work. Starting a worker in that group is enough to drain the queue.
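The retry behavior in the third bullet can be sketched as a lookup keyed on the promise id. The Server, Promise, and Invoke types below are hypothetical and exist only to illustrate the deduplication rule quoted from the client doc comment; they are not the Resonate Server's data model.

```rust
use std::collections::HashMap;

/// Hypothetical promise states, mirroring PENDING and RESOLVED.
#[derive(Debug, Clone, PartialEq)]
enum Promise {
    Pending,
    Resolved(String),
}

/// What an invocation with a given id ends up doing.
#[derive(Debug, PartialEq)]
enum Invoke {
    Started,
    Reconnected,
    Cached(String),
}

/// Toy server that deduplicates invocations by promise id.
struct Server {
    promises: HashMap<String, Promise>,
}

impl Server {
    fn new() -> Self {
        Server { promises: HashMap::new() }
    }

    /// Unknown id: start a new execution. Pending id: attach to it.
    /// Resolved id: hand back the stored result.
    fn invoke(&mut self, id: &str) -> Invoke {
        match self.promises.get(id).cloned() {
            None => {
                self.promises.insert(id.to_string(), Promise::Pending);
                Invoke::Started
            }
            Some(Promise::Pending) => Invoke::Reconnected,
            Some(Promise::Resolved(value)) => Invoke::Cached(value),
        }
    }

    /// Resolve a pending promise; already-resolved promises are left alone.
    fn resolve(&mut self, id: &str, value: &str) {
        if let Some(p) = self.promises.get_mut(id) {
            if *p == Promise::Pending {
                *p = Promise::Resolved(value.to_string());
            }
        }
    }
}
```

Invoking "sleep-workflow-1" twice before the timer fires produces Started and then Reconnected, and only one entry exists in the map. After resolve is called, a further invocation returns the cached string instead of starting a second sleep.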

When to reach for this pattern

  • If a workflow must pause for longer than you can comfortably keep a single process running (anything from minutes upward, with no upper bound).
  • If you would otherwise reach for a cron job, scheduled task, or external timer service purely to re-awaken an idle process — and the only reason for the second system is the duration of the wait.
  • If you want the wait to be a single line in the workflow body rather than splitting the workflow into "before the wait" and "after the wait" subroutines connected by external state.
  • If the resumed work must run on whichever worker is healthy at the moment the timer fires, not necessarily the one that started the wait.
  • If you need the wait to be idempotent under retries — a second invocation with the same id during the sleep window must attach to the in-flight timer, not start a second one.
