SYS JS.DEV
BUILD F3CRDE
DATE 2026.04.26
UTC 01:30
LOC NYC → STANFORD
STATUS OPEN TO ML/SYSTEMS ROLES

Deep-Learning Entity Resolution — BERT Embeddings + MLP

A Python package implementing entity resolution as a two-stage pipeline — BERT embeddings of structured and unstructured tuples, followed by a multilayer perceptron classifier. Built over 2023, advised by a JHU professor.

Entity resolution — do these two records refer to the same entity? — is a central open problem in data integration. It is also exactly the problem RocketReach’s product orbits, which is what motivated the research.

Approach

Read dozens of academic papers on deep-learning approaches to ER, benchmarked several embedding families, and studied vector-similarity geometry and embedding fine-tuning. Settled on a two-stage architecture: BERT embeddings to produce a dense vector representation of each tuple’s contents, followed by a feedforward neural net (multilayer perceptron) trained on labeled match/non-match pairs.
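
A minimal sketch of the two-stage idea, assuming the transformers and torch libraries. The model name, pooling strategy, pair encoding, and layer sizes here are illustrative choices, not the package's actual configuration:

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    # Stage 1: BERT embeddings. Mean-pool the final hidden states into one
    # dense vector per tuple (the pooling choice is illustrative).
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")

    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():                             # encoder stays frozen
            hidden = encoder(**batch).last_hidden_state   # (B, T, 768)
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1)       # mean over real tokens

    # Stage 2: a feedforward classifier over a pair representation. Concatenating
    # the two vectors with their absolute difference and elementwise product is
    # one common pair encoding; the package's actual pairing scheme isn't shown.
    class MatchMLP(nn.Module):
        def __init__(self, dim=768, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(4 * dim, hidden), nn.ReLU(), nn.Dropout(0.1),
                nn.Linear(hidden, 1),
            )

        def forward(self, a, b):
            pair = torch.cat([a, b, (a - b).abs(), a * b], dim=-1)
            return self.net(pair).squeeze(-1)             # match logit per pair

    # Training reduces to binary cross-entropy over labeled pairs:
    model = MatchMLP()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    a = embed(["Acme Corp., New York, acme.com"])
    b = embed(["Acme Corporation, a New York manufacturer"])
    label = torch.tensor([1.0])                           # 1.0 = same entity
    optimizer.zero_grad()
    loss = loss_fn(model(a, b), label)
    loss.backward()
    optimizer.step()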

The package was structured to be agnostic to the input domain — entities can be structured (database rows) or unstructured (free-text descriptions), and the embedding-then-classify pattern works across both. Designed to be imported and trained against any labeled corpus, not just the one I tested on.
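
One way that domain agnosticism typically falls out, sketched here as an assumption rather than the package's actual serializer: flatten every entity to a single string before embedding. The [COL]/[VAL] tagging below is the scheme used by Ditto-style matchers, not necessarily this package's format.

    def serialize(entity):
        # Structured rows become "[COL] name [VAL] value ..." text;
        # free-text descriptions pass through unchanged.
        if isinstance(entity, dict):
            return " ".join(f"[COL] {k} [VAL] {v}" for k, v in entity.items())
        return str(entity)

    print(serialize({"name": "Acme Corp.", "city": "New York"}))
    # [COL] name [VAL] Acme Corp. [COL] city [VAL] New York
    print(serialize("Acme Corporation, a New York manufacturer"))
    # Acme Corporation, a New York manufacturer

Once everything is a string, the same embed-then-classify path applies to both kinds of input.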

Outcome

A working, generalizable Python package for deep-learning entity resolution, with a defensible benchmark story versus the classical heuristics it was meant to replace. The longer arc: this was the work that took my applied-ML practice from “I can use a model someone else built” to “I can read the literature, pick an architecture, and ship it.”

It also became the conceptual ancestor of several production matching projects at RocketReach — including the community-program AI-sampled rules model, which inverts the pattern (LLM-distilled rules instead of an embedding-based classifier) and outperforms it on this particular shape of problem.