01 — Premise: Why idempotency matters
An idempotent pipeline is one where running it twice produces the same result as running it once. That's it. Simple definition, extremely demanding implementation.
The reason it matters is operational, not theoretical. Every pipeline you ship will eventually fail mid-run. The Snowflake warehouse will time out. A network blip will drop a connection. Someone will deploy a bad change at 4 a.m. The question is whether the recovery is a single retry, or a manual cleanup followed by a retry.
02 — Pattern: Watermark tracking
The first ingredient is a deterministic notion of "what's been processed." For event-style data, this is usually a high-watermark on a monotonic column — updated_at, event_id, or a Snowflake stream offset.
The key discipline: the watermark advances only after the merge has succeeded and validations have passed. If anything fails, the watermark stays where it was, and the next run reprocesses the same window — safely, because the merge is idempotent.
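One way to make "advance the watermark last" automatic is to derive the watermark from the run log itself instead of storing it separately. A minimal sketch, assuming a `pipeline_runs` control table that records `watermark_after` for each successful run (pipeline name illustrative):

```sql
-- Sketch: the current watermark is the last successfully committed one.
-- Because a success row is only written after merge + validation,
-- a failed run leaves this value untouched.
SELECT COALESCE(MAX(watermark_after), '1970-01-01'::TIMESTAMP_NTZ) AS watermark
FROM pipeline_runs
WHERE pipeline_name = 'fact_visit_load'
  AND status = 'success';
```

With this shape there is no separate watermark row to update, so there is no window in which the watermark can advance before the run has actually committed.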
03 — Discipline: Merge, never insert
Plain inserts are how you get duplicates after a partial failure. Use MERGE with a deterministic match key — usually a natural key, plus an effective date for SCD Type 2 dimensions.
```sql
MERGE INTO fact_visit t
USING stg_visit_incoming s
   ON t.tenant_id = s.tenant_id
  AND t.visit_id  = s.visit_id
WHEN MATCHED AND s.row_hash <> t.row_hash THEN UPDATE SET
    status     = s.status,
    updated_at = s.updated_at,
    row_hash   = s.row_hash,
    loaded_at  = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (
    tenant_id, visit_id, status, updated_at, row_hash, loaded_at
) VALUES (
    s.tenant_id, s.visit_id, s.status, s.updated_at, s.row_hash, CURRENT_TIMESTAMP()
);
```
The row_hash column is the trick: it lets you skip writes when nothing meaningful has changed, which keeps your loaded_at column stable for downstream consumers and reduces churn on clustered tables.
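One way to populate `row_hash` in the staging layer, assuming MD5 over the columns that matter for change detection (the source and view names here are illustrative):

```sql
-- Sketch: hash only the meaningful columns, with a delimiter so that
-- adjacent values can't collide ('ab' + 'c' vs 'a' + 'bc').
CREATE OR REPLACE VIEW stg_visit_incoming AS
SELECT
    tenant_id,
    visit_id,
    status,
    updated_at,
    MD5(CONCAT_WS('|', status, COALESCE(updated_at::VARCHAR, ''))) AS row_hash
FROM raw_visit_batch;
```

Keep audit columns like loaded_at out of the hash; otherwise every row looks changed on every run and the skip-unchanged optimization never fires.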
04 — Edge case: Late-arriving data
The watermark approach assumes events arrive in roughly monotonic order. Real systems break this assumption. A source might emit an event with updated_at earlier than your watermark — because of clock skew, retried writes, or batch processing on the source side.
The fix is a small backfill window. On each run, pull from watermark minus N, where N is sized to your worst-case skew (often 1–2 hours). Combined with the merge pattern, late-arriving rows quietly slot into place without duplicating anything.
The backfill window is also where data quality tests pay off. If a row's updated_at is older than your backfill window, log it as a late-late arrival rather than silently dropping it. Surface those to operations weekly.
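A minimal sketch of that late-late check, assuming a hypothetical `late_arrivals_log` table and the same two-hour window:

```sql
-- Sketch: surface rows older than the backfill window instead of
-- silently dropping them (late_arrivals_log is a hypothetical table).
INSERT INTO late_arrivals_log (tenant_id, visit_id, updated_at, seen_at)
SELECT tenant_id, visit_id, updated_at, CURRENT_TIMESTAMP()
FROM stg_visit_incoming
WHERE updated_at < DATEADD(hour, -2, :watermark);
```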
05 — Observability: Control tables & recovery
Every run logs a row to a control table: pipeline name, start time, end time, watermark before/after, row counts at each stage, validation results. This isn't optional — it's the recovery surface when something goes wrong at 3 a.m.
```sql
CREATE TABLE pipeline_runs (
    run_id           VARCHAR,
    pipeline_name    VARCHAR,
    started_at       TIMESTAMP_NTZ,
    ended_at         TIMESTAMP_NTZ,
    watermark_before TIMESTAMP_NTZ,
    watermark_after  TIMESTAMP_NTZ,
    rows_extracted   NUMBER,
    rows_merged      NUMBER,
    validation_pass  BOOLEAN,
    status           VARCHAR,  -- success / fail / partial
    error_detail     VARCHAR
);
```
When a run fails, the recovery procedure becomes mechanical: look up the last successful run's watermark_after, confirm the failed run's row_hash logic was correct, and rerun. No spelunking. No "did it commit before it died?" guesswork.
06 — Takeaways: The shape of resilience
Idempotency isn't a feature you add at the end. It's a posture you take from the first line of code. The patterns are simple — watermarks, merges, validation gates, control tables — but applied consistently, they turn pipelines from fragile shell scripts into infrastructure you can sleep through.
- Make the merge key deterministic and never trust auto-generated IDs as match keys
- Advance the watermark last, only after every step has succeeded
- Always pull a backfill window — late-arriving data is the rule, not the exception
- Log every run, with enough detail that recovery is obvious without context
The first time a 3 a.m. failure becomes a 3:02 a.m. retry instead of a 7 a.m. incident, you'll know the patterns are working.