Phone Data Normalization
Internal package that restructures and normalizes the entire phone dataset (250M phones across 3B+ data points) into a scalable waterfall ingestion system. The PM called it "the single largest data improvement of the past year."
This is the 2026 follow-on to the contact-data normalization story: the same shape of problem, applied to the phone dataset.
Problem
Phone records had accumulated multiple parallel sources, formats, and conflict-resolution paths. The result was a dataset whose top-line size obscured noisy duplication, inconsistent canonicalization, and long-tail data-quality issues that customers noticed in API responses.
Design
I built an internal Python package that owns phone normalization end-to-end: parse-and-canonicalize, de-duplication, source-priority resolution, and a waterfall ingestion path that absorbs new data without disturbing the live read path. The resulting dataset holds 250M canonical phones across more than 3B underlying data points.
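The package itself is internal, so the following is only a minimal sketch of how the stages named above could fit together, not the actual implementation. Every name in it (PhoneRecord, SOURCE_PRIORITY, canonicalize, ingest) is hypothetical, the source labels are invented, and the parsing assumes 10-digit national numbers with a default country code of 1. A production parser would need real per-country rules.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# Hypothetical source ranking (lower wins). The labels are illustrative only.
SOURCE_PRIORITY = {"verified": 0, "partner": 1, "scraped": 2}


@dataclass(frozen=True)
class PhoneRecord:
    raw: str      # phone string exactly as the source delivered it
    source: str   # label of the source that supplied it


def canonicalize(raw: str, default_country_code: str = "1") -> Optional[str]:
    """Reduce a raw phone string to an E.164-style key, or None if unusable.

    Simplifying assumption: numbers without a leading '+' are 10-digit
    national numbers, optionally prefixed with the default country code.
    """
    stripped = raw.strip()
    digits = re.sub(r"\D", "", stripped)
    if stripped.startswith("+"):
        candidate = digits
    elif len(digits) == 10:
        candidate = default_country_code + digits
    elif len(digits) == 11 and digits.startswith(default_country_code):
        candidate = digits
    else:
        return None
    # E.164 caps numbers at 15 digits; the lower bound of 8 is an assumption.
    if not 8 <= len(candidate) <= 15:
        return None
    return "+" + candidate


def rank(record: PhoneRecord) -> int:
    """Unknown sources sort after every known one."""
    return SOURCE_PRIORITY.get(record.source, len(SOURCE_PRIORITY))


def ingest(
    live: Dict[str, PhoneRecord], batch: Iterable[PhoneRecord]
) -> Dict[str, PhoneRecord]:
    """Merge a batch into a copy of the live map and return the copy.

    Keying on the canonical form de-duplicates; on a conflict the record
    from the higher-priority source wins. `live` is never mutated, so
    readers keep a consistent view until the caller swaps in the result.
    """
    staged = dict(live)
    for record in batch:
        key = canonicalize(record.raw)
        if key is None:
            continue  # unparseable input is dropped in this sketch
        current = staged.get(key)
        if current is None or rank(record) < rank(current):
            staged[key] = record
    return staged


first_batch = [
    PhoneRecord("(415) 555-0132", "scraped"),
    PhoneRecord("+1 415 555 0132", "verified"),  # same phone, better source
    PhoneRecord("415.555.0199", "partner"),
    PhoneRecord("n/a", "partner"),               # rejected by canonicalize
]
live = ingest({}, first_batch)
updated = ingest(live, [PhoneRecord("4155550199", "verified")])
```

Here the four raw records collapse to two canonical phones, and the second call upgrades one of them in the returned map while the original `live` map is left untouched. Building a staged copy and swapping it in is one simple way to keep ingestion off the read path. At the scale described above, the same idea would more plausibly be applied to table or partition snapshots than to an in-memory dict.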
Outcome
The result is a dataset that finally matches the marketing claims behind it. The PM's framing, "the single largest data improvement of the past year," is the line I quote when I'm asked what "data quality" actually means as a deliverable.