01 — Premise: Why idempotency matters
An idempotent pipeline is one where running it twice produces the same result as running it once. That's it. Simple definition, extremely demanding implementation.
The reason it matters is operational, not theoretical. Every pipeline you ship will eventually fail mid-run. The Snowflake warehouse will time out. A network blip will drop a connection. Someone will deploy a bad change at 4 a.m. The question is whether the recovery is a single retry, or a manual cleanup followed by a retry.
02 — Pattern: Watermark tracking
The first ingredient is a deterministic notion of "what's been processed." For event-style data, this is usually a high-watermark on a monotonic column — updated_at, event_id, or a Snowflake stream offset.
The key discipline: the watermark advances only after the merge has succeeded and validations have passed. If anything fails, the watermark stays where it was, and the next run reprocesses the same window — safely, because the merge is idempotent.
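One way to make "advance the watermark last" automatic is to derive the watermark from the run log itself instead of storing it separately. A minimal sketch, assuming a `pipeline_runs` control table that records `watermark_after` for each successful run (pipeline name illustrative):

```sql
-- Sketch: the current watermark is the last successfully committed one.
-- Because a success row is only written after merge + validation,
-- a failed run leaves this value untouched.
SELECT COALESCE(MAX(watermark_after), '1970-01-01'::TIMESTAMP_NTZ) AS watermark
FROM pipeline_runs
WHERE pipeline_name = 'fact_visit_load'
  AND status = 'success';
```

With this shape there is no separate watermark row to update, so there is no window in which the watermark can advance before the run has actually committed.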
03 — Discipline: Merge, never insert
Plain inserts are how you get duplicates after a partial failure. Use MERGE with a deterministic match key — usually a natural key, plus an effective date for SCD Type 2 dimensions.
```sql
MERGE INTO fact_visit t
USING stg_visit_incoming s
   ON t.tenant_id = s.tenant_id
  AND t.visit_id  = s.visit_id
WHEN MATCHED AND s.row_hash <> t.row_hash THEN UPDATE SET
    status     = s.status,
    updated_at = s.updated_at,
    row_hash   = s.row_hash,
    loaded_at  = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (
    tenant_id, visit_id, status, updated_at, row_hash, loaded_at
) VALUES (
    s.tenant_id, s.visit_id, s.status, s.updated_at, s.row_hash, CURRENT_TIMESTAMP()
);
```
The row_hash column is the trick: it lets you skip writes when nothing meaningful has changed, which keeps your loaded_at column stable for downstream consumers and reduces churn on clustered tables.
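One way to populate `row_hash` in the staging layer, assuming MD5 over the columns that matter for change detection (the source and view names here are illustrative):

```sql
-- Sketch: hash only the meaningful columns, with a delimiter so that
-- adjacent values can't collide ('ab' + 'c' vs 'a' + 'bc').
CREATE OR REPLACE VIEW stg_visit_incoming AS
SELECT
    tenant_id,
    visit_id,
    status,
    updated_at,
    MD5(CONCAT_WS('|', status, COALESCE(updated_at::VARCHAR, ''))) AS row_hash
FROM raw_visit_batch;
```

Keep audit columns like loaded_at out of the hash; otherwise every row looks changed on every run and the skip-unchanged optimization never fires.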
04 — Edge case: Late-arriving data
The watermark approach assumes events arrive in roughly monotonic order. Real systems break this assumption. A source might emit an event with updated_at earlier than your watermark — because of clock skew, retried writes, or batch processing on the source side.
The fix is a small backfill window. On each run, pull from watermark minus N, where N is sized to your worst-case skew (often 1–2 hours). Combined with the merge pattern, late-arriving rows quietly slot into place without duplicating anything.
The backfill window is also where data quality tests pay off. If a row's updated_at is older than your backfill window, log it as a late-late arrival rather than silently dropping it. Surface those to operations weekly.
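A minimal sketch of that late-late check, assuming a hypothetical `late_arrivals_log` table and the same two-hour window:

```sql
-- Sketch: surface rows older than the backfill window instead of
-- silently dropping them (late_arrivals_log is a hypothetical table).
INSERT INTO late_arrivals_log (tenant_id, visit_id, updated_at, seen_at)
SELECT tenant_id, visit_id, updated_at, CURRENT_TIMESTAMP()
FROM stg_visit_incoming
WHERE updated_at < DATEADD(hour, -2, :watermark);
```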
05 — Observability: Control tables & recovery
Every run logs a row to a control table: pipeline name, start time, end time, watermark before/after, row counts at each stage, validation results. This isn't optional — it's the recovery surface when something goes wrong at 3 a.m.
```sql
CREATE TABLE pipeline_runs (
    run_id           VARCHAR,
    pipeline_name    VARCHAR,
    started_at       TIMESTAMP_NTZ,
    ended_at         TIMESTAMP_NTZ,
    watermark_before TIMESTAMP_NTZ,
    watermark_after  TIMESTAMP_NTZ,
    rows_extracted   NUMBER,
    rows_merged      NUMBER,
    validation_pass  BOOLEAN,
    status           VARCHAR,  -- success / fail / partial
    error_detail     VARCHAR
);
```
When a run fails, the recovery procedure becomes mechanical: look up the last successful run's watermark_after, confirm the failed run's row_hash logic was correct, and rerun. No spelunking. No "did it commit before it died?" guesswork.
06 — Takeaways: The shape of resilience
Idempotency isn't a feature you add at the end. It's a posture you take from the first line of code. The patterns are simple — watermarks, merges, validation gates, control tables — but applied consistently, they turn pipelines from fragile shell scripts into infrastructure you can sleep through.
- Make the merge key deterministic and never trust auto-generated IDs as match keys
- Advance the watermark last, only after every step has succeeded
- Always pull a backfill window — late-arriving data is the rule, not the exception
- Log every run, with enough detail that recovery is obvious without context
The first time a 3 a.m. failure becomes a 3:02 a.m. retry instead of a 7 a.m. incident, you'll know the patterns are working.