
Durable workflows on the Postgres you already have

Most background-job problems are not throughput problems. They are reliability problems. A payment charge succeeds but the receipt email never sends because the process died in between. A trial-reminder cron double-fires after a deploy. An AI enrichment pipeline crashes on step 4 of 6 and re-runs the expensive embedding step from scratch. The fix for all of these is durable, step-based execution — and you do not need Temporal or a Redis queue to get it. You need the Postgres you are already running.

The reliability gap in a typical Node app

A normal async function has no memory. If it crashes halfway, everything it did is gone and everything it was about to do never happens. Teams paper over this with cron jobs, ad-hoc jobs tables, or a BullMQ + Redis setup. Each adds infrastructure you now have to run, monitor, and pay for. For a solo dev or a five-person SaaS, that operational tax is the actual problem — not the queue throughput.

flow-keeper takes a different position: your database is already durable, transactional, and backed up. Put the workflow state there. A workflow is just a function whose steps are checkpointed to Postgres, so it can survive a crash, a deploy, or a multi-day pause and resume exactly where it left off.

1. Install & migrate

The whole engine is one package and four tables, created in their own flow_keeper schema so they never touch your app tables.

npm i @flow-keeper/core

// once, on boot:
import { FlowKeeper } from "@flow-keeper/core";
const fk = new FlowKeeper({ connection: process.env.DATABASE_URL });
await fk.migrate();

2. Define steps as plain functions

A workflow is a function. Inside it, anything wrapped in ctx.step() becomes a durable checkpoint: once a step completes successfully, its return value is persisted, and any later replay (after a crash, a retry, or a sleep) returns that stored value instead of executing the step again. That gives you at-least-once execution with memoized results: a step body runs again only if it never recorded a success. Make the side effect itself idempotent (e.g. pass a payment idempotency key) and you get effectively exactly-once behavior.

import { defineWorkflow } from "@flow-keeper/core";

const checkout = defineWorkflow("checkout", async (ctx, order) => {
  // Pass an idempotency key so a retry can't double-charge.
  const charge = await ctx.step("charge-card", () =>
    stripe.charge({ ...order, idempotencyKey: order.id }));

  // Memoized: once recorded complete, replays skip this step entirely.
  await ctx.step("send-receipt", () => email.receipt(order, charge));

  return { charged: charge.id };
});

fk.register(checkout).start();
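
The replay rule can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not flow-keeper's implementation: the Map stands in for the Postgres checkpoint table, steps are synchronous for brevity, and makeCtx, runCheckout, and the counters are hypothetical names invented for the example.

```typescript
// Toy model of memoized steps. Hypothetical names; not flow-keeper internals.
type Checkpoints = Map<string, unknown>;

function makeCtx(store: Checkpoints) {
  return {
    step<T>(stepName: string, fn: () => T): T {
      // Replay path: a recorded success is returned without running fn.
      if (store.has(stepName)) return store.get(stepName) as T;
      const result = fn();         // may throw; nothing is recorded in that case
      store.set(stepName, result); // checkpoint only after success
      return result;
    },
  };
}

let chargeCalls = 0;
let receiptCalls = 0;
let crashOnce = true; // simulates the process dying before the receipt is sent

function runCheckout(store: Checkpoints): string {
  const ctx = makeCtx(store);
  const chargeId = ctx.step("charge-card", () => {
    chargeCalls += 1;
    return "ch_123";
  });
  ctx.step("send-receipt", () => {
    if (crashOnce) {
      crashOnce = false;
      throw new Error("simulated crash");
    }
    receiptCalls += 1;
    return true;
  });
  return chargeId;
}

const checkpointStore: Checkpoints = new Map();
try {
  runCheckout(checkpointStore); // first pass: charge succeeds, receipt step crashes
} catch {
  // a real worker would reschedule the run here
}
const replayResult = runCheckout(checkpointStore); // replay: charge skipped, receipt runs
```

After both passes the charge body has run once and the receipt body has run once, even though the workflow function itself ran twice.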

3. Retries come for free

Every step retries with exponential backoff and jitter. A transient 502 from your payment gateway, a brief network partition, a rate-limit — flow-keeper reschedules the run and replays it, skipping the steps that already succeeded. You configure attempts per step.

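// Keep the idempotency key: a retried charge must not double-bill.
await ctx.step("charge-card",
  () => stripe.charge({ ...order, idempotencyKey: order.id }), {
  retries: 5,        // total attempts before the run fails
  backoffMs: 1000,   // 1s, 2s, 4s, 8s … with jitter
});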

Because completed steps are memoized, retrying is safe: if send-receipt fails on attempt 3, the replay does not re-run charge-card, whose result was recorded on an earlier pass. This is the property that makes at-least-once delivery actually usable.
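
For intuition about the delay schedule, here is one way exponential backoff with jitter could be computed. The exact jitter strategy flow-keeper uses is not stated above, so the ±20% spread, the backoffDelay helper, and its parameters are assumptions made for illustration.

```typescript
// Hypothetical helper: delay before the next attempt, in milliseconds.
// attempt is 1-based: the wait after the first failure uses attempt = 1.
function backoffDelay(
  attempt: number,
  baseMs: number,
  random: () => number = Math.random, // injectable so the result is testable
  jitterRatio = 0.2,                  // assumed spread of plus or minus 20%
): number {
  const exponential = baseMs * 2 ** (attempt - 1); // 1x, 2x, 4x, 8x ...
  const spread = (random() * 2 - 1) * jitterRatio; // in [-0.2, +0.2)
  return Math.round(exponential * (1 + spread));
}

// With the jitter pinned to its midpoint, the schedule is the plain doubling series.
const schedule = [1, 2, 3, 4].map((n) => backoffDelay(n, 1000, () => 0.5));
// schedule is [1000, 2000, 4000, 8000]
```

Jitter matters when many runs fail at the same moment (for example during a gateway outage): without it they would all retry in lockstep.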

4. Durable sleep — pause for seconds or weeks

ctx.sleep() suspends the run and writes the wake time to Postgres. The process can restart, you can deploy ten times, the server can reboot; when the time arrives, a worker picks the run back up, replays past the steps and sleeps that already completed, and continues from the line after the sleep. This is how you build trial reminders, email drips, and cooldowns without a scheduler.

const drip = defineWorkflow("welcome-drip", async (ctx, user) => {
  await ctx.step("day-0", () => email.welcome(user));
  await ctx.sleep("wait-3-days", 3 * 24 * 60 * 60 * 1000);
  await ctx.step("day-3", () => email.tips(user));
  await ctx.sleep("wait-7-days", 7 * 24 * 60 * 60 * 1000);
  await ctx.step("day-10", () => email.upgrade(user));
});
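
Why a sleep survives restarts is easier to see in a toy model: the wake time is stored as data, and a poll compares it with the current time. The sketch below assumes nothing about flow-keeper's schema; SleepingRun, suspend, claimDue, and the in-memory array standing in for a table are all hypothetical.

```typescript
// Toy model of a durable timer. Hypothetical names; the array stands in for a table.
interface SleepingRun {
  runId: string;
  wakeAt: number; // absolute wake time, epoch milliseconds
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Suspend: store an absolute wake time, not a countdown living in process memory.
function suspend(rows: SleepingRun[], runId: string, now: number, sleepMs: number): void {
  rows.push({ runId, wakeAt: now + sleepMs });
}

// Poll: remove and return the ids of runs whose wake time has passed.
function claimDue(rows: SleepingRun[], now: number): string[] {
  const due = rows.filter((r) => r.wakeAt <= now).map((r) => r.runId);
  for (const id of due) {
    rows.splice(rows.findIndex((r) => r.runId === id), 1);
  }
  return due;
}

const sleepingRuns: SleepingRun[] = [];
suspend(sleepingRuns, "run-1", 0, 3 * DAY_MS);

const afterOneDay = claimDue(sleepingRuns, 1 * DAY_MS);    // nothing due yet
const afterThreeDays = claimDue(sleepingRuns, 3 * DAY_MS); // the run is due
```

Because the timer is only a stored timestamp, any worker that starts later can find it; nothing has to stay in memory while the run waits.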

5. Idempotency keys stop duplicate runs

Trigger a workflow with an idempotencyKey and flow-keeper guarantees only one run exists for that key — enforced by a unique index in Postgres, not application logic. Fire the same webhook twice and you still get exactly one checkout.

await fk.trigger("checkout", order, { idempotencyKey: order.id });
await fk.trigger("checkout", order, { idempotencyKey: order.id });
// → both return the same run id. One charge.
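
The deduplication rule can be sketched in memory. This is an illustrative model only: the Map plays the role of the unique index, the trigger function here is a hypothetical stand-in rather than fk.trigger, and scoping the key per workflow name is an assumption.

```typescript
// Toy model of trigger-time deduplication. Hypothetical names; the Map plays the
// role of a unique index on the idempotency key.
const runsByKey = new Map<string, string>();
let nextRunNumber = 1;

function trigger(workflow: string, idempotencyKey: string): { runId: string; created: boolean } {
  const mapKey = `${workflow}:${idempotencyKey}`; // assumed: keys are scoped per workflow
  const existing = runsByKey.get(mapKey);
  if (existing !== undefined) {
    return { runId: existing, created: false }; // duplicate: hand back the original run
  }
  const runId = `run-${nextRunNumber++}`;
  runsByKey.set(mapKey, runId);
  return { runId, created: true };
}

const firstTrigger = trigger("checkout", "order-42");
const secondTrigger = trigger("checkout", "order-42"); // e.g. the same webhook delivered twice
```

In Postgres the check-then-insert above would be racy across processes, which is why enforcing the rule with a unique index (typically paired with INSERT ... ON CONFLICT) is the safer design.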

6. Run a worker (a long-lived process)

fk.start() kicks off a background poll loop — but a worker only does work while its process is alive. Run it as a separate, always-on process (a small Node script under PM2, a container, or a dedicated service), not inside a request handler. Trigger workflows from your app; let the worker process them.

// worker.ts — keep this process running (PM2 / systemd / a container)
import { FlowKeeper } from "@flow-keeper/core";
import { checkout } from "./workflows";

const fk = new FlowKeeper({ connection: process.env.DATABASE_URL });
await fk.migrate();
fk.register(checkout).start();

// The poll loop keeps the process alive; drain gracefully on shutdown.
process.on("SIGTERM", () => fk.stop());
console.log("flow-keeper worker running");

Run as many workers as you like across as many processes — FOR UPDATE SKIP LOCKED ensures no two ever claim the same run. If a worker dies mid-step, its run is reclaimed after the lease window (default 60s) and replayed — that is the crash-safety guarantee in practice.

A note on determinism

Because a run replays its function from the top (skipping completed steps), the workflow body must be deterministic outside of steps. Put anything non-deterministic — Date.now(), random values, network calls, DB reads — inside a ctx.step() so its result is captured and replayed consistently. Don't rename or reorder steps in a workflow that has in-flight runs; step identity is the step name, so a rename makes a replay treat it as new work and re-execute it.
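
The difference between reading the clock inside and outside a step shows up as soon as a run is replayed. The sketch below uses a toy synchronous replay context, not flow-keeper itself; makeReplayCtx, trialWorkflow, and the injected clock (standing in for Date.now()) are hypothetical.

```typescript
// Toy replay model (hypothetical, synchronous for brevity): completed steps return
// their recorded value instead of running again.
function makeReplayCtx(checkpoints: Map<string, unknown>) {
  return {
    step<T>(stepName: string, fn: () => T): T {
      if (checkpoints.has(stepName)) return checkpoints.get(stepName) as T;
      const value = fn();
      checkpoints.set(stepName, value);
      return value;
    },
  };
}

// The clock is passed in so the example is repeatable; it stands in for Date.now().
function trialWorkflow(checkpoints: Map<string, unknown>, clock: () => number) {
  const ctx = makeReplayCtx(checkpoints);
  const unsafeStartedAt = clock();                             // differs on every replay
  const safeStartedAt = ctx.step("started-at", () => clock()); // captured once
  return { unsafeStartedAt, safeStartedAt };
}

const runCheckpoints = new Map<string, unknown>();
const firstPass = trialWorkflow(runCheckpoints, () => 1000);
const replayPass = trialWorkflow(runCheckpoints, () => 9000); // same run, replayed later
```

On the replay, safeStartedAt is still 1000 while unsafeStartedAt has drifted to 9000; any branch that depends on the unsafe value could take a different path than it did the first time.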

How it works under the hood

A worker polls for claimable runs using SELECT … FOR UPDATE SKIP LOCKED. Rows already locked by another worker's transaction are skipped rather than waited on, which is why many workers across many processes never grab the same run. Each step writes a row to flow_keeper.steps on completion; each state transition appends to flow_keeper.events, which is exactly the data the hosted dashboard renders as a timeline. It is plain SQL, so you can query your job state with the same tools you already use for your data.
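
A claim query in this style might look like the sketch below. Only the flow_keeper schema name, the SKIP LOCKED technique, and the 60s default lease come from the text above; the runs table, its columns (status, run_at, locked_by, lease_expires_at), and the claimNextRun helper are hypothetical, and flow-keeper's real schema may differ. The Queryable interface is modeled on the query method of node-postgres so the block needs no imports.

```typescript
// Minimal query interface, modeled on node-postgres's Pool/Client query method.
interface Queryable {
  query(text: string, values?: unknown[]): Promise<{ rows: Array<{ id: string }> }>;
}

// Hypothetical table and column names; flow-keeper's real schema may differ.
const CLAIM_SQL = `
  UPDATE flow_keeper.runs
     SET status = 'running',
         locked_by = $1,
         lease_expires_at = now() + ($2::double precision * interval '1 millisecond')
   WHERE id = (
     SELECT id
       FROM flow_keeper.runs
      WHERE (status = 'pending' AND run_at <= now())
         OR (status = 'running' AND lease_expires_at < now())  -- expired lease: reclaim
      ORDER BY run_at
      LIMIT 1
        FOR UPDATE SKIP LOCKED  -- rows locked by another worker are skipped, not waited on
   )
  RETURNING id`;

// Claims at most one run for this worker; resolves to null when nothing is claimable.
async function claimNextRun(
  db: Queryable,
  workerId: string,
  leaseMs = 60000, // matches the 60s default lease mentioned above
): Promise<string | null> {
  const result = await db.query(CLAIM_SQL, [workerId, leaseMs]);
  return result.rows.length > 0 ? result.rows[0].id : null;
}
```

Selecting and updating in one statement keeps the claim atomic, and treating an expired lease as claimable is what lets a run be picked up again after its worker dies mid-step.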

When to reach for something heavier

flow-keeper is built for the 95% case: thousands to low-millions of runs a day on a single Postgres. If you need cross-datacenter replication of workflow state, millions of concurrent timers, or a polyglot fleet of workers in five languages, Temporal earns its operational cost. For everyone else — the solo dev shipping payments, the small team running email and AI jobs — a second piece of infrastructure is a liability, not a feature.

Want a live timeline, alerting, and one-click replay?

The library is free and open-source. The hosted dashboard watches your runs and pages you when one fails.
