$ I'm Gunabhiram Billa — a Data & Analytics Engineer in New York. I build the unglamorous infrastructure underneath your dashboards: idempotent pipelines, multi-tenant warehouses, and reconciliation layers that catch mismatches before finance does. 50+ production systems shipped. 60+ multi-tenant dashboards live. No demos that pass once and break Tuesday.
I work in the unglamorous middle layer, between the source systems that emit data and the dashboards that consume it. Semantic layers, KPI frameworks, idempotent pipelines. The kind of code finance, sales, and CS teams quietly depend on without ever needing to open a ticket.
Across 50+ production projects, I've designed multi-tenant warehouses serving 60+ customer-facing dashboards, automated reporting that turned two-day Mondays into 8 a.m. emails, and reconciliation layers that catch mismatches before finance closes the books.
I care about idempotency, dimensional discipline, and cross-system reconciliation — the foundations that make the dashboards on top actually trustworthy. If a pipeline ran at 3 a.m. and nobody noticed, that's the goal.
A working toolkit, sharpened across data warehousing, automation, and cross-system integration in healthcare operational analytics. No boilerplate, no languages I've touched once in a tutorial. Tap a category to explore.
Real systems I've shipped — the architecture that held, the bugs I traced through three layers, the trade-offs that mattered. Selected from a portfolio of 50+ production projects. Click any row to expand.
Customer Success was spending two full days every week assembling a recurring KPI report — pulling SQL exports, copying values into Excel templates, refreshing charts manually, then re-typing summaries into a Word doc and exporting to PDF. The same humans were doing the same clicks every Monday.
Designed a parameterized Python pipeline that reads customer config from a control table, runs templated Snowflake queries, populates an Excel workbook (preserving formatting and pivot logic), and injects the results plus auto-generated narrative into a Word template via python-docx. Final step exports to PDF for leadership distribution.
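To make the plumbing concrete, here's a minimal sketch of the Word-injection step, assuming a {{placeholder}} convention in the template. File names, keys, and the helper are illustrative, not the production code:

```python
# Sketch of the Word-template injection step (illustrative names, not production code).
from docx import Document  # python-docx

def render_word_report(template_path: str, output_path: str, values: dict) -> None:
    """Replace {{key}} placeholders in a Word template with computed KPI values."""
    doc = Document(template_path)
    for paragraph in doc.paragraphs:
        for key, value in values.items():
            placeholder = "{{" + key + "}}"
            if placeholder in paragraph.text:
                # Assigning paragraph.text collapses run-level formatting; the real
                # pipeline edits runs individually to preserve template styling.
                paragraph.text = paragraph.text.replace(placeholder, str(value))
    doc.save(output_path)

render_word_report(
    "templates/weekly_kpi.docx",
    "out/weekly_kpi_acme.docx",
    {"customer": "Acme Health", "intake_completion": "87.4%"},
)
```

The Excel and PDF steps follow the same principle: write values into the formatted workbook rather than rebuilding it, then hand the finished Word doc to a headless converter for the PDF export.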
The Sales and Customer Success teams couldn't quantify product value at renewal conversations. Each rep was inventing their own ROI math, leading to inconsistent numbers across customers — and no defensible framework when leadership asked "what's the upsell story?"
Built a unified Snowflake-backed ROI model that quantifies four pillars: intake completion lift, scheduling adoption, reminder effectiveness (no-show reduction), and engagement uplift. Each pillar maps to specific time and revenue assumptions configurable per customer segment.
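The configurable assumptions are the point: one formula, different numbers per segment. A sketch of one pillar, with placeholder figures standing in for the real config table:

```python
# Hypothetical per-segment assumptions for the reminder-effectiveness pillar;
# the real model reads these from a Snowflake config table, one row per segment.
ASSUMPTIONS = {
    "enterprise": {"no_show_reduction": 0.18, "revenue_per_visit": 140.0},
    "mid_market": {"no_show_reduction": 0.12, "revenue_per_visit": 110.0},
}

def reminder_roi(segment: str, reminded_visits: int) -> float:
    """Dollar value of avoided no-shows: visits reminded x reduction x revenue."""
    a = ASSUMPTIONS[segment]
    return reminded_visits * a["no_show_reduction"] * a["revenue_per_visit"]

# 2,000 reminded enterprise visits: 2000 * 0.18 * 140 = ~$50,400 protected revenue
print(round(reminder_roi("enterprise", 2000)))  # 50400
```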
Multiple customers, shared infrastructure, divergent needs — and a warehouse that had grown organically into a tangle of one-off views. Cross-tenant benchmarking was unsafe, performance was inconsistent, and onboarding a new customer required custom code rather than configuration.
Redesigned as a layered architecture: staging mirrors source systems, intermediate handles business logic and conformed dimensions, marts expose tenant-isolated semantic models. SCD Type 2 captures historical attribution. Tenant ID propagates through every fact for safe row-level isolation.
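The isolation mechanism, sketched in Snowflake terms with illustrative object names: a row access policy keyed on tenant_id, mapped to roles, attached to every fact.

```python
# Illustrative DDL, executed by the deployment scripts via the Snowflake connector.
POLICY_DDL = """
CREATE OR REPLACE ROW ACCESS POLICY marts.tenant_isolation
AS (tenant_id VARCHAR) RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1
    FROM marts.tenant_role_map m          -- role-to-tenant mapping table
    WHERE m.tenant_id = tenant_id
      AND m.role_name = CURRENT_ROLE()
  )
"""

# Because tenant_id propagates through every fact, the same policy attaches everywhere.
ATTACH_DDL = """
ALTER TABLE marts.fct_visits
  ADD ROW ACCESS POLICY marts.tenant_isolation ON (tenant_id)
"""
```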
Daily pipelines occasionally failed mid-run or produced duplicates after partial recovery. Engineers had to manually clean state before reruns, and late-arriving source records silently corrupted aggregates. Trust in the data was eroding.
Designed an incremental framework around three principles: deterministic watermark tracking, merge-based upserts (never plain inserts), and explicit validation gates. Every run logs to a control table; every failure can be safely retried without manual cleanup.
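The shape of one idempotent increment, assuming a control table etl.watermarks(pipeline, high_water_ts) and a staging table per source (all names illustrative):

```python
# One safe-to-retry increment: read watermark, merge, then advance the watermark.
MERGE_SQL = """
MERGE INTO analytics.fct_orders AS tgt
USING (
    SELECT order_id, status, updated_at
    FROM staging.orders
    WHERE updated_at > %(high_water)s        -- only rows past the last watermark
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET tgt.status = src.status, tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
                      VALUES (src.order_id, src.status, src.updated_at)
"""

def run_increment(conn) -> None:
    cur = conn.cursor()
    # A failed run never advanced the watermark, so a retry reprocesses the
    # same window; the merge makes that replay a no-op instead of a duplicate.
    cur.execute("SELECT high_water_ts FROM etl.watermarks WHERE pipeline = 'orders'")
    high_water = cur.fetchone()[0]
    cur.execute(MERGE_SQL, {"high_water": high_water})
    # Advance the watermark only after the merge succeeds.
    cur.execute(
        "UPDATE etl.watermarks SET high_water_ts = "
        "(SELECT MAX(updated_at) FROM staging.orders) WHERE pipeline = 'orders'"
    )
    conn.commit()
```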
CRM said one number of active contracts. Billing said another. The operational system said a third. Finance was reconciling these by hand each month-end, and discrepancies often weren't surfaced until they had compounded across periods.
Built a reconciliation layer that joins all three systems on canonical entity keys and applies deterministic mismatch rules. Every mismatch is logged with a category, severity, and recommended action. Daily snapshots track drift over time, with exception reports surfaced to operations.
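At its core it's a three-way join plus deterministic CASE rules. A sketch, with illustrative tables standing in for the CRM, billing, and operational models:

```python
RECON_SQL = """
WITH joined AS (
    SELECT
        COALESCE(c.account_key, b.account_key, o.account_key) AS account_key,
        c.active_contracts AS crm_count,
        b.active_contracts AS billing_count,
        o.active_contracts AS ops_count
    FROM crm.contracts c
    FULL OUTER JOIN billing.contracts b ON b.account_key = c.account_key
    FULL OUTER JOIN ops.contracts     o ON o.account_key = COALESCE(c.account_key, b.account_key)
)
SELECT
    account_key, crm_count, billing_count, ops_count,
    CASE
        WHEN crm_count IS NULL          THEN 'MISSING_IN_CRM'
        WHEN billing_count IS NULL      THEN 'MISSING_IN_BILLING'
        WHEN crm_count <> billing_count THEN 'CRM_BILLING_DRIFT'
        ELSE 'CRM_OPS_DRIFT'
    END AS mismatch_category,
    CASE WHEN crm_count IS NULL OR billing_count IS NULL THEN 'high' ELSE 'medium' END AS severity,
    CURRENT_DATE AS snapshot_date       -- daily snapshots make drift measurable
FROM joined
WHERE crm_count IS DISTINCT FROM billing_count
   OR crm_count IS DISTINCT FROM ops_count
"""
```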
Snowflake credit consumption was growing faster than usage justified. A single oversized warehouse was running every workload, ad-hoc queries were doing full table scans, and no one had visibility into which queries were the actual cost drivers.
Mined query history and warehouse metering to attribute cost by workload. Split into purpose-sized warehouses (small for BI, medium for ETL, X-small for ad hoc), enforced auto-suspend, and rewrote the costliest queries to take advantage of clustering and partition pruning. Built an internal cost dashboard for ongoing visibility.
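Attribution starts from views Snowflake already ships in ACCOUNT_USAGE. Two of the cuts, sketched (the dollar rate is a placeholder for the contract price per credit):

```python
# Credits by warehouse and day, from the metering history view.
COST_BY_WAREHOUSE_SQL = """
SELECT
    warehouse_name,
    DATE_TRUNC('day', start_time)  AS usage_day,
    SUM(credits_used)              AS credits,
    SUM(credits_used) * 3.00       AS est_dollars   -- placeholder $/credit
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY credits DESC
"""

# Worst pruning offenders: a scan ratio near 1.0 means clustering isn't helping.
TOP_SCANS_SQL = """
SELECT
    query_id, warehouse_name, partitions_scanned, partitions_total,
    partitions_scanned / NULLIF(partitions_total, 0) AS scan_ratio
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
ORDER BY partitions_scanned DESC
LIMIT 50
"""
```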
Visit lifecycle data lived in a noisy event stream from the UrgentIQ API. Some statuses were skipped, some were emitted out of order, and timestamps were UTC while reporting expected EST. Operations couldn't answer basic questions like "how long are visits stuck in checked-in status?"
Built an incremental ingestion pipeline that normalizes status timelines (one visit fanned out into N ordered status rows), gap-fills missing transitions with deterministic rules, standardizes timezones, and computes transition durations. Output is a clean lifecycle fact ready for funnel and bottleneck analysis.
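A sketch of the normalization step, assuming a canonical status order and raw events already landed in a pandas frame (column names illustrative):

```python
import pandas as pd

STATUS_ORDER = ["scheduled", "checked_in", "in_room", "discharged"]

def normalize_timeline(events: pd.DataFrame) -> pd.DataFrame:
    """events: one row per raw status event, columns visit_id, status, event_ts (UTC)."""
    df = events.copy()
    # Standardize timezone: UTC in, reporting timezone out.
    df["event_ts"] = pd.to_datetime(df["event_ts"], utc=True).dt.tz_convert("America/New_York")
    # Sort by canonical order, not arrival order, to repair out-of-order emissions.
    # (Gap-filling of skipped statuses happens before this step in production.)
    df["status_rank"] = df["status"].map({s: i for i, s in enumerate(STATUS_ORDER)})
    df = df.sort_values(["visit_id", "status_rank"])
    # Time stuck in each status = gap to the next observed transition.
    df["next_ts"] = df.groupby("visit_id")["event_ts"].shift(-1)
    df["seconds_in_status"] = (df["next_ts"] - df["event_ts"]).dt.total_seconds()
    return df
```

That seconds_in_status column is exactly what answers "how long are visits stuck in checked-in?"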
Customers were billed based on provider counts, but "provider" meant different things in different systems. Some part-time providers were counted as full FTEs, some full-time providers were missed entirely, and contract terms specified thresholds nobody was actively checking.
Designed a reconciliation workflow that joins operational provider activity with CRM contract terms and QuickBooks billing records. Applied FTE thresholds (active hours, encounter volume) to classify each provider into a billing category. Standardized monthly reconciliation checkpoints.
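The classification itself is deliberately boring: a pair of thresholds read from the contract terms. A sketch with placeholder numbers:

```python
# Placeholder thresholds; the real workflow reads them from the CRM contract join.
def billing_category(active_hours: float, encounters: int) -> str:
    """Classify one provider-month for the billing reconciliation."""
    if active_hours >= 120 and encounters >= 80:
        return "full_fte"
    if active_hours >= 40 or encounters >= 25:
        return "partial_fte"
    return "not_billable"
```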
SMS costs were inflated by two avoidable issues: messages tipping over the 160-character single-segment threshold (doubling per-message cost) and a fleet of 300+ phone numbers that were billed monthly but no longer in use. Nobody had audited either in years.
Audited message content across templates, identified those tipping over the segment boundary, and rewrote them under a 150-character target with a margin for personalization variables. Cross-referenced active phone numbers against actual outbound traffic and decommissioned the inactive ones.
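Segment math is mechanical once you know the boundaries: GSM-7 fits 160 characters in one segment and 153 per segment after that; any non-GSM character forces UCS-2 (70/67). A sketch of the audit check (the GSM character set below is abbreviated for the example):

```python
GSM_BASIC = set(
    "@£$¥èéùìòÇØøÅå_ÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà\n\r"
)

def segment_count(message: str) -> int:
    if all(ch in GSM_BASIC for ch in message):
        return 1 if len(message) <= 160 else -(-len(message) // 153)  # ceil division
    return 1 if len(message) <= 70 else -(-len(message) // 67)        # UCS-2 encoding

assert segment_count("a" * 161) == 2   # one char over the boundary doubles the cost
assert segment_count("a" * 150) == 1   # the 150-char target leaves personalization headroom
```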
EMR vendor API costs were rising without a clear understanding of what was driving usage. Were certain endpoints inefficient? Were features double-fetching data? Engineering needed evidence before making architectural changes.
Mined six months of API logs to surface usage patterns: top endpoints by volume, redundant call chains, customer-level cost attribution, and timing correlations with downstream features. Output was a structured report engineering used to prioritize caching and batching work.
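The analysis itself was mostly disciplined groupbys. Two of the cuts, assuming logs parsed into a frame with customer_id, endpoint, and billed_units columns (illustrative names):

```python
import pandas as pd

def top_endpoints(logs: pd.DataFrame, n: int = 20) -> pd.DataFrame:
    """Endpoints ranked by billed volume: the first place to look for waste."""
    return (
        logs.groupby("endpoint")
            .agg(calls=("endpoint", "size"), units=("billed_units", "sum"))
            .sort_values("units", ascending=False)
            .head(n)
    )

def cost_by_customer(logs: pd.DataFrame) -> pd.Series:
    """Attribute vendor cost to the customer whose feature triggered the calls."""
    return logs.groupby("customer_id")["billed_units"].sum().sort_values(ascending=False)
```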
Hundreds of business-critical views had grown over years with no tests, no documentation, and no consistent style. Logic was duplicated across views, dependencies were implicit, and a change in one place could quietly break a dashboard halfway across the warehouse.
Migrated transformations to dbt with a layered staging → intermediate → mart structure. Added schema tests for uniqueness, referential integrity, and accepted values. Generated docs auto-publish on every merge. CI runs the full test suite before any change reaches production.
Operational teams needed visibility into events as they happened — not 24 hours later. The existing batch pipeline ran nightly, which meant by the time an issue showed up on the dashboard, the day was over. Switching to streaming meant rethinking watermarks, ordering, and failure recovery.
Built a Kafka-to-Snowflake pipeline using Snowpipe Streaming for sub-minute landing, then Snowflake Streams + Tasks for incremental transformation. Out-of-order events handled via watermark tolerance windows. Quality metrics (lag, dropped events, schema drift) emitted to a monitoring dashboard.
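The transform half, sketched as Snowflake DDL (object names illustrative; the Snowpipe Streaming ingestion and the monitoring emitters are elided):

```python
# A stream tracks new raw rows; a task folds them in every minute. Merging on
# event_id keeps replays and late arrivals idempotent.
STREAM_DDL = "CREATE OR REPLACE STREAM raw.visit_events_stream ON TABLE raw.visit_events"

TASK_DDL = """
CREATE OR REPLACE TASK raw.fold_visit_events
  WAREHOUSE = etl_small
  SCHEDULE  = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw.visit_events_stream')
AS
  MERGE INTO analytics.fct_visit_events AS tgt
  USING (SELECT event_id, visit_id, status, event_ts FROM raw.visit_events_stream) AS src
  ON tgt.event_id = src.event_id
  WHEN NOT MATCHED THEN INSERT (event_id, visit_id, status, event_ts)
                        VALUES (src.event_id, src.visit_id, src.status, src.event_ts)
"""
```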
RAG systems are only as trustworthy as the documents they retrieve from. Stale clinical guidelines, contradictory sources, and ambiguous procedure documentation all silently corrupt LLM answers downstream — with no easy way to detect the failure after the fact.
Designed a quality framework that runs on the source corpus before embedding: freshness checks (last-updated thresholds), deduplication (semantic similarity clustering), contradiction detection (cross-document NLI scoring), and provenance tagging. Documents that fail are quarantined; the rest enter the vector index with quality metadata attached.
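Two of the four gates, sketched (the embedding model and thresholds are placeholder choices, not the production configuration):

```python
from datetime import datetime, timedelta, timezone
import numpy as np
from sentence_transformers import SentenceTransformer

MODEL = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder model choice
MAX_AGE = timedelta(days=365)                     # placeholder freshness threshold
DUP_THRESHOLD = 0.92                              # placeholder similarity cutoff

def is_fresh(last_updated: datetime) -> bool:
    """Freshness gate: stale guidelines are quarantined, not embedded."""
    return datetime.now(timezone.utc) - last_updated <= MAX_AGE

def near_duplicates(texts: list[str]) -> list[tuple[int, int]]:
    """Dedup gate: index pairs whose cosine similarity exceeds the cutoff."""
    emb = MODEL.encode(texts, normalize_embeddings=True)
    sims = np.asarray(emb) @ np.asarray(emb).T    # normalized, so dot product = cosine
    return [
        (i, j)
        for i in range(len(texts))
        for j in range(i + 1, len(texts))
        if sims[i, j] >= DUP_THRESHOLD
    ]
```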
Opinions earned by shipping data systems. Each one has a scar behind it: a 3 a.m. page, a quarter-end mismatch, a dashboard that lied for two months before anyone noticed. I write them down so I don't forget.
Most analytics bugs aren't typos. They're cardinality surprises hiding in a vague join. Pin the grain first, write SQL second. Twenty minutes of sketching saves four hours of debugging.
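Pinning the grain is a one-query habit. Before writing the join, I run something like this (table and keys illustrative) and expect zero rows back:

```python
GRAIN_CHECK_SQL = """
SELECT visit_id, status, COUNT(*) AS dupes
FROM analytics.fct_visit_status
GROUP BY visit_id, status
HAVING COUNT(*) > 1   -- any row back means the grain you assumed is wrong
"""
```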
A pipeline you can't rerun safely is a script with a person attached. Watermarks, merge upserts, control tables: make rerunning safe by construction. The first time a 3 a.m. failure becomes a 3:02 a.m. retry, you'll know it's working.
The row policy either holds for every query or it doesn't. Test isolation like a security control — write queries that try to violate it and confirm they can't. The worst-case bug is silent.
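What that test looks like in practice, sketched with illustrative role and table names:

```python
def test_tenant_isolation(conn) -> None:
    cur = conn.cursor()
    cur.execute("USE ROLE tenant_acme_reader")
    # Deliberately ask for other tenants' rows; the row access policy should
    # filter them all, so anything returned is a breach, not a test artifact.
    cur.execute("SELECT COUNT(*) FROM marts.fct_visits WHERE tenant_id <> 'acme'")
    leaked = cur.fetchone()[0]
    assert leaked == 0, f"row policy leaked {leaked} cross-tenant rows"
```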
Run it daily, review weekly, audit monthly. Drift compounds; cadence catches it early. By month-end, three systems disagreeing has cost real money. Make every mismatch visible within 24 hours.
Write SQL, comments, and docs like the person debugging this at 3 a.m. has zero context. Eventually they will. That stranger might be future you. The empathy is the same.
Technical writing on data modeling, architecture decisions, automation patterns, and governance. Most of these are post-mortems on something that broke first.