Contact-Data Normalization
Solo migration of 4B+ contact records from nested JSON on a monolith model into normalized relational tables, with real-time sync, zero downtime, and zero performance regression.
The single biggest architectural-debt bet I’ve executed at RocketReach.
Problem
Contact data — emails and phones — was stored as nested JSON on the profile model. The decision was made very early in the company’s life and made sense at the time. By 2025 it was the largest single design flaw in the codebase: it blocked indexed reverse lookups (which contact maps to which profile?), made GDPR-style privacy removals nearly impossible, and was the thing standing between us and decomposing the monolith.
The constraint was sharp: a live, heavily loaded system, thousands of call sites across the codebase writing to the JSON contact field, billions of existing rows to backfill, and zero tolerance for downtime or performance regression.
Design
I rejected the obvious options:
- Big-bang migration: too risky on a live system with billions of rows.
- Materialized view: doesn’t give a real relational target and doesn’t solve the write path.
- App-layer dual-write: with thousands of call sites, some writes would inevitably be missed.
The chosen design was a real-time sync at the Django field layer. Low-level field-type overrides ensure that every write to the nested JSON also produces a corresponding write to the new normalized profile_email and profile_phone tables. Pending rows are deduplicated in memory and persisted on profile.save(), with custom managers handling the mapping. The backfill ran alongside the live sync, with batched throughput control and consistency checks against the JSON source of truth.
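To make the field-layer idea concrete, here is a minimal, framework-free sketch. It is not the production code and does not use Django: a plain Python descriptor stands in for the field-type override, and a dict stands in for the database. The names SyncedContactField, normalize, and the tables argument are hypothetical, as is the choice to lowercase values during normalization.

```python
def normalize(value: str) -> str:
    """Trim whitespace and lowercase so equivalent contacts compare equal (assumed rule)."""
    return value.strip().lower()


class SyncedContactField:
    """Descriptor standing in for a field-type override: every assignment
    to the JSON attribute also stages normalized rows for the new table."""

    def __init__(self, kind: str):
        self.kind = kind  # "email" or "phone"

    def __set_name__(self, owner, name):
        self.slot = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.slot, [])

    def __set__(self, obj, entries):
        obj.__dict__[self.slot] = entries  # legacy JSON stays the source of truth
        # In-memory dedup: keying by normalized value keeps one row per contact.
        staged = {normalize(e["value"]): e for e in entries}
        obj.__dict__.setdefault("_pending", {})[self.kind] = staged


class Profile:
    emails = SyncedContactField("email")
    phones = SyncedContactField("phone")

    def __init__(self, profile_id: int):
        self.id = profile_id

    def save(self, tables: dict) -> None:
        """Persist staged rows; `tables` is a hypothetical stand-in for the database."""
        for kind, staged in self.__dict__.pop("_pending", {}).items():
            table = tables.setdefault(f"profile_{kind}", {})
            # Replace this profile's rows so the table mirrors the JSON exactly.
            for key in [k for k in table if k[0] == self.id]:
                del table[key]
            for value in staged:
                table[(self.id, value)] = {"profile_id": self.id, "value": value}


tables = {}
p = Profile(1)
p.emails = [{"value": " Ada@Example.com "}, {"value": "ada@example.com"}]
p.save(tables)
print(sorted(tables["profile_email"]))  # [(1, 'ada@example.com')]
```

Call sites keep assigning to the JSON attribute as before; the normalized write happens underneath them. One limitation of this sketch: a descriptor only sees assignment, so in-place mutation such as p.emails.append(...) would bypass it and would need separate handling.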
Outcome
The new tables now hold 4B+ records combined. The migration shipped solo, with no site disruption and no performance regression. The downstream effects are bigger than the migration itself:
- GDPR removals became tractable: a contact-to-profile reverse lookup was finally possible.
- Monolith decomposition stopped being theoretical: the nested contact-data shape that stood in the way is gone.
- Data-quality wins during backfill: the process surfaced and fixed thousands of duplicates, whitespace inconsistencies, and broken email-domain associations.
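The reverse lookup and the backfill cleanup can be illustrated with a small sketch using Python's built-in sqlite3 module. This is an illustration under stated assumptions, not the production schema or job: the column names, the batch_size and pause_s parameters, the helper names, and the lowercase-and-trim normalization rule are all hypothetical; only the profile_email table name comes from the text above.

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile (id INTEGER PRIMARY KEY, contacts TEXT);
    CREATE TABLE profile_email (
        profile_id INTEGER NOT NULL REFERENCES profile(id),
        email TEXT NOT NULL,
        UNIQUE (profile_id, email)
    );
    CREATE INDEX profile_email_by_email ON profile_email (email);
""")

# Legacy rows: contacts live in nested JSON, with a whitespace/case duplicate.
legacy = [
    (1, {"emails": ["ada@example.com", " Ada@Example.com "]}),
    (2, {"emails": ["ada@example.com", "bob@example.com"]}),
    (3, {"emails": []}),
]
conn.executemany(
    "INSERT INTO profile VALUES (?, ?)", [(i, json.dumps(c)) for i, c in legacy]
)


def clean(emails):
    """Normalize and deduplicate one profile's emails (assumed rule)."""
    return {e.strip().lower() for e in emails}


def backfill(conn, batch_size=2, pause_s=0.0):
    """Walk profiles in primary-key order, one small batch at a time."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, contacts FROM profile WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        for profile_id, contacts in rows:
            emails = clean(json.loads(contacts)["emails"])
            # INSERT OR IGNORE keeps the backfill safe to re-run.
            conn.executemany(
                "INSERT OR IGNORE INTO profile_email VALUES (?, ?)",
                [(profile_id, e) for e in sorted(emails)],
            )
        conn.commit()
        last_id = rows[-1][0]
        time.sleep(pause_s)  # crude throughput control between batches


def inconsistent_profiles(conn):
    """Consistency check: compare each profile's table rows to its JSON."""
    bad = []
    for profile_id, contacts in conn.execute("SELECT id, contacts FROM profile").fetchall():
        expected = clean(json.loads(contacts)["emails"])
        actual = {r[0] for r in conn.execute(
            "SELECT email FROM profile_email WHERE profile_id = ?", (profile_id,))}
        if expected != actual:
            bad.append(profile_id)
    return bad


def profiles_for(conn, email):
    """Reverse lookup: which profiles hold this email? Served by the index."""
    rows = conn.execute(
        "SELECT profile_id FROM profile_email WHERE email = ? ORDER BY profile_id",
        (email.strip().lower(),),
    )
    return [r[0] for r in rows]


backfill(conn)
print(profiles_for(conn, "ADA@example.com"))  # [1, 2]
print(inconsistent_profiles(conn))            # []
```

With contacts in their own indexed table, a removal request for one address becomes a lookup by email followed by targeted deletes, rather than a scan of every profile's JSON.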