AI-Sampled Rules Model — Community Program
Replaced the production ML model on the most-valuable first-party data pipeline with a rules-based model whose rules are extracted by an LLM iteratively sampling labeled data. 30× throughput, 92% cost reduction, 2.5× more usable contact data.
A research-flavored result: most engineers would reach for an ML model here, but an LLM-extracted rules model proved faster, cheaper, more debuggable, and more accurate.
Problem
The community program is the highest-leverage first-party data source the company has, and the contribution most valued by current acquirers and investors in ongoing exit conversations. The existing ML matching pipeline was the bottleneck on its impact: throughput-bound, expensive to operate, and missing contact data we knew was extractable.
A bigger or better-tuned ML model was the obvious move. It was the wrong move.
Design
The system uses an LLM to iteratively sample labeled data, propose rules, evaluate the rules against held-out cases, and refine until the rules generalize. The rules are then compiled into a fast deterministic matcher. The LLM is in the offline loop only; production inference is purely the rules.
This shape — use a large model to write a small model — turns out to be remarkably effective on dirty real-world matching: it forces the LLM to articulate the decision boundary, surfaces edge cases the team wouldn’t have thought to label, and produces a system whose decisions can be audited and amended without retraining.
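The offline loop described above can be sketched as follows. This is a minimal illustration, not the production system: `llm_propose_rules` is a hypothetical stand-in for the real LLM call (here a stub returning two placeholder rules), and the rule shapes, sampling strategy, and accuracy target are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable

# A labeled case: two records plus a ground-truth match label,
# e.g. {"a": {...}, "b": {...}, "match": True}.
Record = dict

@dataclass
class Rule:
    name: str
    predicate: Callable[[dict, dict], bool]

def llm_propose_rules(sample: list[Record]) -> list[Rule]:
    """Stub for the LLM step. In the real system this prompts a large model
    with sampled labeled pairs and parses candidate rules from its reply;
    the two rules below are illustrative placeholders only."""
    return [
        Rule("exact_email",
             lambda a, b: bool(a.get("email")) and a["email"] == b.get("email")),
        Rule("name_and_zip",
             lambda a, b: bool(a.get("name")) and a["name"] == b.get("name")
                          and bool(a.get("zip")) and a["zip"] == b.get("zip")),
    ]

def matches(rules: list[Rule], a: dict, b: dict) -> bool:
    """The compiled matcher: a pure, deterministic OR over the rules.
    This is all that runs at inference time; no LLM is involved."""
    return any(r.predicate(a, b) for r in rules)

def evaluate(rules: list[Rule], held_out: list[Record]) -> float:
    """Accuracy of a candidate rule set on held-out labeled cases."""
    correct = sum(matches(rules, c["a"], c["b"]) == c["match"] for c in held_out)
    return correct / len(held_out)

def extract_rules(labeled: list[Record], rounds: int = 3,
                  sample_size: int = 50, target: float = 0.95):
    """Offline loop: sample labeled data, propose rules, evaluate against
    held-out cases, and keep refining while accuracy improves."""
    cut = max(1, len(labeled) // 5)
    held_out, pool = labeled[:cut], labeled[cut:]
    best_rules, best_acc = [], 0.0
    for _ in range(rounds):
        sample = pool[:sample_size]  # production would resample around errors
        candidate = best_rules + llm_propose_rules(sample)
        acc = evaluate(candidate, held_out)
        if acc > best_acc:
            best_rules, best_acc = candidate, acc
        if best_acc >= target:
            break
    return best_rules, best_acc
```

The split matters: `extract_rules` is the expensive offline loop where the LLM runs, while `matches` is the cheap deterministic function that ships, which is what makes the throughput and cost numbers possible.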
I redesigned the matching service end-to-end alongside the model and built a new scalable matching layer to run it.
Outcome
- Throughput up 30× vs. the prior ML model.
- Cost down 92%.
- 2.5× more usable contact data extracted from the same input.
- This data source has roughly doubled in match rate and ingestion volume, becoming the most impactful contribution for the company’s exit narrative.
A clean, surprising win for the rules-via-LLM pattern over the model-on-model arms race.