Draining Kubernetes worker nodes for maintenance is a multi-step operation that can stall indefinitely when a Pod Disruption Budget refuses eviction, and a partial drain that loses progress on operator interrupt or worker crash is unsafe to retry. With Resonate, the drain is a single generator function: each per-node drain is a ctx.run checkpoint, and a blocked drain creates a durable promise that the workflow awaits until an operator resolves it via the HTTP gateway. The example registers one workflow (drainAllNodes), exposes it through an Express gateway that triggers it via RPC and settles its blocking promises, and survives killing the worker process mid-operation.
The shape of the solution
function* drainAllNodes(
  ctx: Context,
  operationId: string,
  options: DrainOptions,
  nodeSelector?: Record<string, string>
): Generator<unknown, DrainOperationResult, unknown> {
  // ...
  const nodes = yield* ctx.run(getNodes, nodeSelector);
  // ...
  for (const node of nodes) {
    const result: NodeDrainResult = yield* ctx.run(
      drainSingleNode,
      node.name,
      options
    );
    results.push(result);
    if (!result.success && !options.force) {
      const decisionPromise = yield* ctx.promise<HumanDecision>({});
      console.log(`[Drain] Promise ID for decision: ${decisionPromise.id}`);
      const decision: HumanDecision = yield* decisionPromise;
      switch (decision) {
        case "abort":
          // ...
          return { /* ... */ };
        case "retry":
          // ...
          break;
        case "force":
          // ...
          break;
        case "skip":
          // ...
          break;
      }
    }
  }
  // ...
}
// from example-node-drain-orchestrator-ts/src/worker.ts:60
The retry and force cases each call yield* ctx.run(drainSingleNode, node.name, options) (the force case spreads { ...options, force: true }) and overwrite results[results.length - 1] with the new result; see src/worker.ts:128-147 for the lifted bodies. The workflow is registered with resonate.register("drainAllNodes", drainAllNodes) at src/worker.ts:291. The gateway triggers it as RPC at src/gateway.ts:162:
resonate.rpc(
  operationId,
  "drainAllNodes",
  operationId,
  options,
  body.nodeSelector,
  resonate.options({ target: "poll://any@default" })
);
// from example-node-drain-orchestrator-ts/src/gateway.ts:162
The durable primitives in play
- ctx.run(getNodes, nodeSelector): local function call, checkpointed. After it returns, the node list is durable; a worker crash will not re-list nodes on resume. src/worker.ts:73.
- ctx.run(drainSingleNode, node.name, options): per-node drain as a single checkpoint. Each completed node's result is durable; on resume the workflow continues with the next node. src/worker.ts:95.
- ctx.promise<HumanDecision>({}): creates a durable promise on the Resonate server. The workflow yields it and suspends on the server until the promise is resolved from outside. src/worker.ts:108.
- yield* decisionPromise: awaits the durable promise; this yields control back to the server until the gateway settles it. src/worker.ts:114.
- resonate.register("drainAllNodes", drainAllNodes): registers the generator function so the worker can be assigned invocations of it by name. src/worker.ts:291.
- resonate.rpc(id, "drainAllNodes", ..., resonate.options({ target: "poll://any@default" })): gateway-side invocation that creates the root promise and dispatches the workflow to a worker polling the default group. src/gateway.ts:162.
- resonate.promises.get(operationId): gateway reads workflow state directly from the server, not from in-process memory. src/gateway.ts:67.
- resonate.promises.settle(promiseId, "resolved", { data }): gateway resolves the blocking decision promise from an HTTP handler, unsuspending the workflow. src/gateway.ts:220, also :244, :266, :288, :310.
Note on SDK version skew: the repo pins "@resonatehq/sdk": "^0.10.0", which resolves to 0.10.2. At 0.10.1 and later, settle was made private and the public API is resonate.promises.resolve(id, { data }) / .reject(id, { data }) / .cancel(id, { data }) (each a thin wrapper that calls the now-private settle with the corresponding state; see resonate-sdk-ts/src/promises.ts:98, :110, :122 at v0.10.2). Agents porting this example forward should call resonate.promises.resolve(promiseId, { data }) in place of resonate.promises.settle(promiseId, "resolved", { data }).
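The version-skew note can be absorbed by a small shim while porting. The sketch below is hypothetical: resolveDecisionPromise and LoosePromisesClient are illustrative names, and the two call shapes are copied from the note rather than from verified SDK typings. It prefers resolve when present and falls back to settle otherwise.

```typescript
// Hypothetical compatibility shim; not part of the example repo or the SDK.
type SettlePayload = { data: unknown };

// Loose structural view of the promises client. Both members are optional
// because which one is usable depends on the installed SDK version
// (assumption based on the version-skew note, not on SDK typings).
interface LoosePromisesClient {
  resolve?: (id: string, payload: SettlePayload) => unknown;
  settle?: (id: string, state: "resolved", payload: SettlePayload) => unknown;
}

// Accepts `object` and casts internally, so a class instance whose `settle`
// is declared private still type-checks at the call site.
function resolveDecisionPromise(
  promises: object,
  promiseId: string,
  data: unknown
): unknown {
  const client = promises as LoosePromisesClient;
  if (typeof client.resolve === "function") {
    // Newer call shape: resolve(id, { data })
    return client.resolve(promiseId, { data });
  }
  if (typeof client.settle === "function") {
    // Older call shape: settle(id, "resolved", { data })
    return client.settle(promiseId, "resolved", { data });
  }
  throw new Error("promises client exposes neither resolve nor settle");
}
```

A gateway handler would pass resonate.promises as the first argument and, assuming the underlying method is asynchronous, await the returned value.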
What the SDK handles vs. what you write
You write: the generator function that drains nodes, the per-step Kubernetes calls (getNodes, drainSingleNode, k8s.cordonNode, k8s.evictPod, k8s.deletePod, k8s.waitForPodDeletion, k8s.waitForNodeEmpty), the option/result types, and the HTTP routes that start the workflow and settle the blocking promise. There is no state machine, no progress table, no queue, no retry loop in your code.
The SDK handles: persisting every ctx.run result so a crashed worker resumes on the next node rather than the first; persisting the durable promise created by ctx.promise so the workflow can survive a worker crash while awaiting a human decision; routing the RPC submitted by the gateway to any worker polling poll://any@default; replaying the generator deterministically from durable checkpoints when a worker picks the work back up. The gateway holds no in-process workflow state — when it serves /status, it calls resonate.promises.get(operationId) and decodes the result (src/gateway.ts:67, base64-decoded at :90).
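To make "replaying the generator deterministically from durable checkpoints" concrete, here is a toy model. It is not the Resonate implementation: runWithJournal, Step, and the in-memory journal array are assumptions standing in for results persisted on the server. The point it illustrates is that a restarted generator is fed recorded results for steps that already completed, so their side effects do not run again.

```typescript
// Toy model only: NOT how the Resonate SDK is implemented.
// A step is a named thunk; the journal stands in for durable storage.
type Step = { name: string; fn: () => unknown };
type Journal = unknown[];

function runWithJournal<T>(
  workflow: () => Generator<Step, T, unknown>,
  journal: Journal,
  crashAfterSteps = Infinity // simulate a worker dying after N executed steps
): T {
  const gen = workflow();
  let index = 0; // position of the current step in the journal
  let executed = 0; // steps actually executed during this run
  let next = gen.next();
  while (next.done !== true) {
    const step = next.value;
    if (index >= journal.length) {
      // No recorded result yet: execute the step and record its result.
      if (executed >= crashAfterSteps) throw new Error("simulated worker crash");
      journal.push(step.fn());
      executed++;
    }
    // Feed the recorded (or freshly recorded) result back into the generator.
    next = gen.next(journal[index]);
    index++;
  }
  return next.value;
}

const drained: string[] = []; // side-effect log: nodes that were really drained

function* drainAll(): Generator<Step, string[], unknown> {
  const nodes = (yield {
    name: "getNodes",
    fn: () => ["node-a", "node-b", "node-c"],
  }) as string[];
  const results: string[] = [];
  for (const node of nodes) {
    const result = (yield {
      name: `drain:${node}`,
      fn: () => {
        drained.push(node);
        return `${node}:ok`;
      },
    }) as string;
    results.push(result);
  }
  return results;
}

const journal: Journal = [];
try {
  // First run executes getNodes and node-a, then "crashes" before node-b.
  runWithJournal(drainAll, journal, 2);
} catch {
  // The worker died; the journal survives.
}
// Second run replays the two recorded steps, then continues at node-b.
const finalResults = runWithJournal(drainAll, journal);
```

After the second run, drained contains each node exactly once, which is the property that makes a retry after a crash safe in this model.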
Failure modes covered
- Worker crashes mid-drain (after node 1, before node 2). On restart, resonate.register("drainAllNodes", drainAllNodes) re-attaches the worker to the workflow; the generator replays from durable checkpoints and resumes at the ctx.run(drainSingleNode, ...) for the next un-drained node. src/worker.ts:91-99.
- Pod eviction blocked by PDB. k8s.evictPod returns { success: false }; drainSingleNode returns a NodeDrainResult with success: false and a blockedBy pod (src/worker.ts:215). The orchestrator then enters the human-in-the-loop branch.
- Workflow needs to block on arbitrary human response time. ctx.promise<HumanDecision>({}) creates a durable promise; the workflow suspends on the server. The worker process can be killed and restarted while the workflow is suspended, because the promise lives on the server. src/worker.ts:108, :114.
- Operator wants to skip / retry / force / abort a blocked node. The gateway exposes POST /skip/:promiseId, POST /retry/:promiseId, POST /force/:promiseId, POST /abort/:promiseId, and a generic POST /decision. Each settles the durable promise with a base64-encoded HumanDecision. src/gateway.ts:238, :260, :282, :304, :191.
- Eviction times out waiting for a pod to terminate. k8s.waitForPodDeletion returns false; drainSingleNode returns a success: false result with error: "Timeout waiting for pod ...". The orchestrator routes that into the same human-decision branch. src/worker.ts:236.
- Duplicate /drain POST while an operation is in flight. Gateway looks up the existing operation with resonate.promises.get(currentOperationId) and, if its state is neither resolved nor rejected, returns HTTP 409. src/gateway.ts:139-148.
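Two of the gateway-side checks in the list above reduce to small pure functions. The sketch below is an assumption-laden illustration, not code from src/gateway.ts: the helper names are hypothetical, the decision payload is assumed to be JSON wrapped in base64, and treating a missing prior operation as "not in flight" is also an assumption.

```typescript
// Hypothetical helpers; names and payload shape are assumptions.
type HumanDecision = "abort" | "retry" | "force" | "skip";
const DECISIONS: readonly HumanDecision[] = ["abort", "retry", "force", "skip"];

// Narrow an untrusted request value to one of the four decisions.
function parseDecision(value: unknown): HumanDecision {
  const match = DECISIONS.find((d) => d === value);
  if (match === undefined) {
    throw new Error(`invalid decision: ${String(value)}`);
  }
  return match;
}

// Encode a decision as base64 JSON. The decisions are ASCII-only,
// so btoa/atob are sufficient here.
function encodeDecision(decision: HumanDecision): string {
  return btoa(JSON.stringify(decision));
}

// Inverse of encodeDecision, with validation of the decoded value.
function decodeDecision(data: string): HumanDecision {
  return parseDecision(JSON.parse(atob(data)));
}

// Duplicate-POST guard: an existing operation blocks a new /drain
// unless its promise state is resolved or rejected.
function isOperationInFlight(state: string | undefined): boolean {
  if (state === undefined) return false; // assumption: no prior operation
  return state !== "resolved" && state !== "rejected";
}
```

In a handler, a true result from isOperationInFlight would map to the HTTP 409 response described above, and encodeDecision would produce the data value passed when settling the decision promise.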
When to reach for this pattern
- If you're orchestrating a multi-step Kubernetes (or other infrastructure) mutation where partial progress is meaningful and re-running from scratch is unsafe.
- If a workflow needs to block on arbitrary human response time — minutes, hours, or days — and you can't keep an HTTP request open.
- If the process driving the workflow can crash or be killed (e.g., an operator hits Ctrl+C, the worker pod is evicted) and progress must survive.
- If you want a control-plane HTTP gateway that holds no in-process workflow state and reads everything from the durable execution server.
- If your "ask the human" surface needs to be a small set of REST endpoints (/skip, /retry, /force, /abort) rather than a bespoke state machine.
Sources
- Example repo: https://github.com/resonatehq-examples/example-node-drain-orchestrator-ts
- Resonate TypeScript SDK: https://github.com/resonatehq/resonate-sdk-ts (pinned @resonatehq/sdk ^0.10.0 in package.json; resolves to 0.10.2)
- SDK Context API (incl. run, promise): https://github.com/resonatehq/resonate-sdk-ts/blob/main/src/context.ts
- SDK public promise-resolution API (resolve / reject / cancel): https://github.com/resonatehq/resonate-sdk-ts/blob/main/src/promises.ts
- Resonate docs, human-in-the-loop pattern: https://docs.resonatehq.io/get-started/examples/human-in-the-loop
- Resonate docs, TypeScript SDK reference: https://docs.resonatehq.io/develop/typescript
