SYS JS.DEV
BUILD F3CREO
DATE 2026.04.26
UTC 01:30 UTC
LOC NYC → STANFORD
STATUS OPEN TO ML/SYSTEMS ROLES

Phone Data Normalization

Internal package that restructures and normalizes the entire phone dataset — 250M phones across 3B+ data points — into a scalable waterfall ingestion system. The PM called it "the single largest data improvement of the past year."

Phones normalized: 250M
Data points: 3B+
PM quote: "Largest data improvement of the past year"

The 2026 follow-on to the contact-data normalization story — same shape of problem, applied to the phone dataset.

Problem

Phone records had accumulated multiple parallel sources, formats, and conflict-resolution paths. The result was a dataset whose top-line size obscured noisy duplication, inconsistent canonicalization, and long-tail data-quality issues that customers noticed in API responses.

Design

I built an internal Python package that owns phone normalization end-to-end: parse-and-canonicalize, de-duplication, source-priority resolution, and a waterfall ingestion path that absorbs new data without disturbing the live read path. The result is 250M canonical phones consolidated from more than 3B underlying data points.
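As a rough illustration of the canonicalize, de-duplicate, and source-priority steps, here is a minimal sketch. It is not the package's actual code: the names `PhoneRecord`, `canonicalize`, `resolve`, and the `SOURCE_PRIORITY` ranking are hypothetical, and the parser is a deliberately simplified stand-in that assumes a default country code of 1 rather than doing real per-country validation.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical source ranking for illustration only: lower number wins.
SOURCE_PRIORITY = {"verified": 0, "partner": 1, "scraped": 2}
UNKNOWN_RANK = len(SOURCE_PRIORITY)  # unknown sources rank below all known ones


@dataclass(frozen=True)
class PhoneRecord:
    raw: str      # phone number exactly as the source delivered it
    source: str   # name of the source that supplied it


def canonicalize(raw: str, default_country_code: str = "1") -> Optional[str]:
    """Return a simplified E.164-style string, or None if the input is unusable.

    Assumption: numbers without a leading '+' belong to the default country.
    """
    has_plus = raw.strip().startswith("+")
    digits = re.sub(r"\D", "", raw)  # drop spaces, dots, dashes, parentheses
    if not has_plus:
        if len(digits) == 10:
            digits = default_country_code + digits
        elif not (len(digits) == 11 and digits.startswith(default_country_code)):
            return None
    # E.164 allows at most 15 digits; the lower bound of 8 is an assumption.
    if not 8 <= len(digits) <= 15:
        return None
    return "+" + digits


def resolve(records: Iterable[PhoneRecord]) -> dict[str, PhoneRecord]:
    """De-duplicate by canonical form, keeping the highest-priority source."""
    best: dict[str, PhoneRecord] = {}
    for rec in records:
        key = canonicalize(rec.raw)
        if key is None:
            continue  # unparseable input never reaches the canonical set
        rank = SOURCE_PRIORITY.get(rec.source, UNKNOWN_RANK)
        current = best.get(key)
        if current is None or rank < SOURCE_PRIORITY.get(current.source, UNKNOWN_RANK):
            best[key] = rec
    return best


if __name__ == "__main__":
    sample = [
        PhoneRecord("(212) 555-0100", "scraped"),
        PhoneRecord("+1 212-555-0100", "verified"),
        PhoneRecord("212.555.0100", "partner"),
    ]
    for number, rec in resolve(sample).items():
        print(number, rec.source)
```

In this sketch the three differently formatted inputs collapse to a single canonical key, and the record from the highest-ranked source is the one kept. A production version would likely lean on a dedicated phone-parsing library and write results to a staging table before touching the read path.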

Outcome

The dataset now matches the marketing claims made about it. The PM’s framing, “the single largest data improvement of the past year,” is the line I quote when I’m asked what “data quality” actually means as a deliverable.

STACK · Python · Django · PostgreSQL