4 min readResonate HQJust published

Human-in-the-loop Kubernetes node drain in TypeScript on Resonate

How a node drain that may block for arbitrary human response time is written as a single generator function with checkpointed steps and a blocking durable promise.

Resonate brand card on a dark background with a teal spectrum wave at the bottom and the post headline in white Sansation.

Draining Kubernetes worker nodes for maintenance is a multi-step operation that can stall indefinitely when a Pod Disruption Budget refuses eviction, and a partial drain that loses progress on operator interrupt or worker crash is unsafe to retry. With Resonate, the drain is a single generator function: each per-node drain is a ctx.run checkpoint, and a blocked drain creates a durable promise that the workflow awaits until an operator resolves it via the HTTP gateway. The example registers one workflow (drainAllNodes), exposes it through an Express gateway that triggers it via RPC and settles its blocking promises, and survives killing the worker process mid-operation.

The shape of the solution

function* drainAllNodes(
  ctx: Context,
  operationId: string,
  options: DrainOptions,
  nodeSelector?: Record<string, string>
): Generator<unknown, DrainOperationResult, unknown> {
  // ...
  const nodes = yield* ctx.run(getNodes, nodeSelector);
  // ...
  for (const node of nodes) {
    const result: NodeDrainResult = yield* ctx.run(
      drainSingleNode,
      node.name,
      options
    );
    results.push(result);
 
    if (!result.success && !options.force) {
      const decisionPromise = yield* ctx.promise<HumanDecision>({});
      console.log(`[Drain] Promise ID for decision: ${decisionPromise.id}`);
      const decision: HumanDecision = yield* decisionPromise;
 
      switch (decision) {
        case "abort":
          // ...
          return { /* ... */ };
        case "retry":
          // ...
          break;
        case "force":
          // ...
          break;
        case "skip":
          // ...
          break;
      }
    }
  }
  // ...
}
// from example-node-drain-orchestrator-ts/src/worker.ts:60

The retry and force cases each call yield* ctx.run(drainSingleNode, node.name, options) (the force case spreads { ...options, force: true }) and overwrite results[results.length - 1] with the new result — see src/worker.ts:128-147 for the lifted bodies. The workflow is registered with resonate.register("drainAllNodes", drainAllNodes) at src/worker.ts:291. The gateway triggers it as RPC at src/gateway.ts:162:

resonate.rpc(
  operationId,
  "drainAllNodes",
  operationId,
  options,
  body.nodeSelector,
  resonate.options({ target: "poll://any@default" })
);
// from example-node-drain-orchestrator-ts/src/gateway.ts:162

The durable primitives in play

  • ctx.run(getNodes, nodeSelector) — local function call, checkpointed. After it returns, the node list is durable; a worker crash will not re-list nodes on resume. src/worker.ts:73.
  • ctx.run(drainSingleNode, node.name, options) — per-node drain as a single checkpoint. Each completed node's result is durable; on resume the workflow continues with the next node. src/worker.ts:95.
  • ctx.promise<HumanDecision>({}) — creates a durable promise on the Resonate server. The workflow yields it and suspends on the server until the promise is resolved from outside. src/worker.ts:108.
  • yield* decisionPromise — awaits the durable promise; this yields control back to the server until the gateway settles it. src/worker.ts:114.
  • resonate.register("drainAllNodes", drainAllNodes) — registers the generator function so the worker can be assigned invocations of it by name. src/worker.ts:291.
  • resonate.rpc(id, "drainAllNodes", ..., resonate.options({ target: "poll://any@default" })) — gateway-side invocation that creates the root promise and dispatches the workflow to a worker polling the default group. src/gateway.ts:162.
  • resonate.promises.get(operationId) — gateway reads workflow state directly from the server, not from in-process memory. src/gateway.ts:67.
  • resonate.promises.settle(promiseId, "resolved", { data }) — gateway resolves the blocking decision promise from an HTTP handler, unsuspending the workflow. src/gateway.ts:220, also :244, :266, :288, :310. Note on SDK version skew: the repo pins "@resonatehq/sdk": "^0.10.0", which resolves to 0.10.2. At 0.10.1 and later, settle was made private and the public API is resonate.promises.resolve(id, { data }) / .reject(id, { data }) / .cancel(id, { data }) (each a thin wrapper that calls the now-private settle with the corresponding state — see resonate-sdk-ts/src/promises.ts:98, :110, :122 at v0.10.2). Agents porting this example forward should call resonate.promises.resolve(promiseId, { data }) in place of resonate.promises.settle(promiseId, "resolved", { data }).

What the SDK handles vs. what you write

You write: the generator function that drains nodes, the per-step Kubernetes calls (getNodes, drainSingleNode, k8s.cordonNode, k8s.evictPod, k8s.deletePod, k8s.waitForPodDeletion, k8s.waitForNodeEmpty), the option/result types, and the HTTP routes that start the workflow and settle the blocking promise. There is no state machine, no progress table, no queue, no retry loop in your code.

The SDK handles: persisting every ctx.run result so a crashed worker resumes on the next node rather than the first; persisting the durable promise created by ctx.promise so the workflow can survive a worker crash while awaiting a human decision; routing the RPC submitted by the gateway to any worker polling poll://any@default; replaying the generator deterministically from durable checkpoints when a worker picks the work back up. The gateway holds no in-process workflow state — when it serves /status, it calls resonate.promises.get(operationId) and decodes the result (src/gateway.ts:67, base64-decoded at :90).

Failure modes covered

  • Worker crashes mid-drain (after node 1, before node 2). On restart, resonate.register("drainAllNodes", drainAllNodes) re-attaches the worker to the workflow; the generator replays from durable checkpoints and resumes at the ctx.run(drainSingleNode, ...) for the next un-drained node. src/worker.ts:91-99.
  • Pod eviction blocked by PDB. k8s.evictPod returns { success: false }; drainSingleNode returns a NodeDrainResult with success: false and a blockedBy pod (src/worker.ts:215). The orchestrator then enters the human-in-the-loop branch.
  • Workflow needs to block on arbitrary human response time. ctx.promise<HumanDecision>({}) creates a durable promise; the workflow suspends on the server. The worker process can be killed and restarted while the workflow is suspended — the promise lives on the server. src/worker.ts:108, :114.
  • Operator wants to skip / retry / force / abort a blocked node. The gateway exposes POST /skip/:promiseId, POST /retry/:promiseId, POST /force/:promiseId, POST /abort/:promiseId, and a generic POST /decision. Each settles the durable promise with a base64-encoded HumanDecision. src/gateway.ts:238, :260, :282, :304, :191.
  • Eviction times out waiting for a pod to terminate. k8s.waitForPodDeletion returns false; drainSingleNode returns a success: false result with error: "Timeout waiting for pod ...". The orchestrator routes that into the same human-decision branch. src/worker.ts:236.
  • Duplicate /drain POST while an operation is in flight. Gateway looks up the existing operation with resonate.promises.get(currentOperationId) and, if its state is neither resolved nor rejected, returns HTTP 409. src/gateway.ts:139-148.

When to reach for this pattern

  • If you're orchestrating a multi-step Kubernetes (or other infrastructure) mutation where partial progress is meaningful and re-running from scratch is unsafe.
  • If a workflow needs to block on arbitrary human response time — minutes, hours, or days — and you can't keep an HTTP request open.
  • If the process driving the workflow can crash or be killed (e.g., an operator hits Ctrl+C, the worker pod is evicted) and progress must survive.
  • If you want a control-plane HTTP gateway that holds no in-process workflow state and reads everything from the durable execution server.
  • If your "ask the human" surface needs to be a small set of REST endpoints (/skip, /retry, /force, /abort) rather than a bespoke state machine.

Sources