May 25, 20264 min readResonate HQ

Per-message Kafka worker with crash recovery in TypeScript

How a Kafka-fed worker collapses head-of-line blocking and crash-resume into straight-line code by treating each message as a durable promise.

typescript kafka for-agents

A worker sitting between two Kafka topics has two failure shapes: head-of-line blocking when a slow message stalls a faster one behind it, and re-execution of already-completed steps when the worker crashes mid-task. Resonate turns the per-message handler into a durable workflow keyed by the message id, so the consumer can hand off without blocking and any crashed step is replayed from its last checkpoint. The example app reads record ids off a records_to_be_deleted topic, runs a multi-step batch-deletion loop per id, and publishes a completion to records_that_were_deleted.

The shape of the solution

// The durable workflow: loop until deleteBatch returns false, then publish
// the completion message. Each `yield* ctx.run(...)` is a checkpoint — if the
// worker crashes mid-flight, Resonate replays from the last checkpoint.
export function* workflow(
  ctx: Context,
  recordId: string,
  offset: string,
): Generator<any, void, any> {
  console.log(`processing record ${recordId} in position ${offset}`);
  while (yield* ctx.run(deleteBatch, recordId)) {
    console.log(`record ${recordId} still has rows to delete`);
    yield* ctx.sleep(5_000);
  }
  console.log(`all rows deleted for record ${recordId} in position ${offset}`);
  yield* ctx.run(enqueueCompletion, recordId, offset);
}
 
resonate.register("workflow", workflow);
// from example-kafka-worker-ts/src/workflow.ts:58

The consumer side does no orchestration of its own — it parses each Kafka message and hands the record id to resonate.beginRun, which is non-blocking:

// beginRun is non-blocking: it kicks off (or recovers) the workflow
// identified by recordId and returns a handle. The recordId acts as
// the durable promise id, so duplicate invocations dedupe.
await resonate.beginRun(
  recordId,
  "workflow",
  recordId,
  offset,
);
// from example-kafka-worker-ts/src/consumer.ts:47

The durable primitives in play

resonate.beginRun(id, "workflow", ...) — creates a durable promise keyed by recordId and returns immediately so the consumer can advance to the next Kafka message. If a promise with that id already exists, the call is idempotent. (src/consumer.ts:47)
ctx.run(deleteBatch, recordId) — runs the step as a child durable promise; on application error Resonate retries with the SDK's default policy without re-running earlier steps. (src/workflow.ts:64)
ctx.sleep(5_000) — durable sleep between batch attempts. Survives worker restarts; the workflow does not resume early when a process comes back up. (src/workflow.ts:66)
ctx.run(enqueueCompletion, recordId, offset) — second checkpoint after the loop exits, so the completion publish is not re-issued on replay if it already succeeded. (src/workflow.ts:69)
resonate.register("workflow", workflow) — registers the generator under a name so it can be invoked by string and recovered by any worker in the same group. (src/workflow.ts:72)

What the SDK handles vs. what you write

You write: a generator function that yields ctx.run(...) for each step you want checkpointed, a Kafka consumer that calls beginRun(recordId, "workflow", ...), and the imperative work each step actually does (deleteBatch, enqueueCompletion). You pick the durable promise id (recordId) and you decide what counts as a step.

The SDK handles: persisting a durable promise per workflow keyed by id, persisting a child durable promise per ctx.run step, retrying failed steps automatically, skipping already-completed steps on replay, scheduling durable sleep across restarts, deduplicating beginRun calls with the same id, and routing recovered work to any process registered in the same group (appNodeGroup = "workers", declared at src/workflow.ts:6 and passed to the Resonate constructor at src/workflow.ts:10).

Failure modes covered

Worker crashes between batches. deleteBatch completed N times, then the process dies. On restart, Resonate replays the workflow; the completed ctx.run invocations short-circuit to their stored results and the loop resumes at the next batch attempt. (src/workflow.ts:64)
A batch step throws a transient error. deleteBatch has a simulated 25% chance of throwing (src/workflow.ts:31-34). The thrown error is caught by the SDK's retry policy for that ctx.run invocation; only that step retries, not the whole workflow.
The same record id is delivered twice. Kafka delivery semantics allow duplicates. Because beginRun is keyed on recordId (src/consumer.ts:47), the second invocation attaches to the existing durable promise rather than starting a parallel workflow.
Head-of-line blocking on long-running messages. A single record may take many batch iterations (the loop continues while deleteBatch returns true). Because the consumer uses beginRun and does not await workflow completion (src/consumer.ts:47), the next Kafka message is processed immediately on the same partition. Workflow execution proceeds concurrently in the worker pool.
Worker pool rebalancing. Multiple consumers can run with group: "workers". If the worker currently running a workflow dies, another worker in the same group picks up the recovered invocation. (src/workflow.ts:6-12, README §Recovery)
Completion-publish step crash. If the worker dies after the loop exits but before enqueueCompletion finishes, the replay re-enters at the second ctx.run (src/workflow.ts:69). The publish runs again only if it had not completed; the durable promise records the outcome of the first successful attempt.

When to reach for this pattern

If you have a queue or topic where a single message kicks off a multi-step task and you cannot afford to redo completed steps after a worker restart.
If individual tasks vary widely in duration and you do not want a slow message to stall faster ones behind it on the same partition.
If at-least-once delivery means the same task id may arrive twice and you want the second invocation to dedupe rather than run again.
If the per-message task contains a polling or retry loop with sleeps that must survive process restarts.
If you want to scale the worker pool horizontally and have recovered tasks resume on whichever process is healthy, without writing your own coordination.
If the final step of the task publishes a downstream message and you need exactly-once behavior for that publish across crashes and retries (with the caveat that the broker-side publish is not transactional with the durable promise — the worker can re-publish if it crashes after the broker accepts the message but before the durable promise records the result).

Sources

Example repo: https://github.com/resonatehq-examples/example-kafka-worker-ts
Resonate TypeScript SDK: https://github.com/resonatehq/resonate-sdk-ts (pinned @resonatehq/sdk ^0.10.0 in package.json:15; installed 0.10.0)
resonate.beginRun signature: https://github.com/resonatehq/resonate-sdk-ts/blob/main/src/resonate.ts (beginRun at line 290)
ctx.run / ctx.sleep on Context: https://github.com/resonatehq/resonate-sdk-ts/blob/main/src/context.ts (sleep at line 230, beginRun alias at line 285)
Companion Python port: https://github.com/resonatehq-examples/example-kafka-worker-py