Engineering · April 3, 2026 · 4 min read

272,717 Dental Providers: What We Learned Building a National Provider Database

We processed the entire NPI/NPPES bulk file (9.3GB, 7.8M records) to extract 272,717 dental providers across all 50 states. Here's what it took, and how the matching algorithm evolved from 22% to 47% license match rates.

The National Plan and Provider Enumeration System (NPPES) is the federal government's registry of every healthcare provider in the United States. It's a single CSV file. It's 9.3 gigabytes when extracted. It contains 7.8 million records spanning every physician, nurse, therapist, and dentist in the country.

Our job was to extract the 272,717 dental providers from that file and turn them into something useful. That meant solving three hard problems: identification, enrichment, and matching.

Identification: 12 taxonomy codes

Dental providers are identified by Healthcare Provider Taxonomy Codes. There are 12 that matter, from general dentistry (1223G0001X) to oral surgery (1223S0112X) to dental hygiene (124Q00000X). We filter the full NPPES file against these codes to extract the dental universe. Every weekly NPI update gets the same treatment.
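
A minimal sketch of that filter, assuming the standard NPPES column naming (`Healthcare Provider Taxonomy Code_1` through `_15`) and listing only the three codes named above rather than all 12. The function name `iter_dental_rows` is illustrative, not our production code. Streaming row by row keeps memory flat regardless of file size.

```python
import csv
from typing import Iterator

# Only the three codes named in the article; the full list has 12.
DENTAL_TAXONOMY_CODES = {
    "1223G0001X",  # general dentistry
    "1223S0112X",  # oral surgery
    "124Q00000X",  # dental hygiene
}

# Assumption: NPPES exposes up to 15 taxonomy slots per provider.
TAXONOMY_COLUMNS = [
    f"Healthcare Provider Taxonomy Code_{i}" for i in range(1, 16)
]


def iter_dental_rows(csv_path: str) -> Iterator[dict]:
    """Stream the CSV and yield rows with a dental code in any taxonomy slot."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            codes = {row.get(col, "") for col in TAXONOMY_COLUMNS}
            if codes & DENTAL_TAXONOMY_CODES:
                yield row
```

Checking every slot matters because a provider's dental code is not guaranteed to sit in the first taxonomy column.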

Enrichment: state board data

NPI tells you who's registered. State boards tell you who's actually practicing. The Texas State Board of Dental Examiners (TSBDE) publishes daily CSV extracts with license status, expiration dates, disciplinary actions, anesthesia permits, graduation year, and practice type. For Texas, we've currently enriched 14,612 providers with state board data, adding 15+ fields of intelligence per provider. Our license match rate now stands at 47%, more than double the 22% where it started.

Matching: the 6-tier system

Linking NPI to state board data is harder than it sounds. Providers don't use consistent names across systems: the NPI file shows legal names, state boards often use preferred names, nickname variations are common, and maiden names complicate the picture for anyone who's changed their name. Our matching pipeline uses six tiers, each a fallback when the previous tier misses:

  1. License number exact match. NPI embeds state license numbers in 50 "Other Provider Identifier" fields. We extract these as JSONB and match directly against TSBDE license numbers. This catches most providers who kept their NPI registration up to date.
  2. Name + ZIP exact match. First name, last name, and ZIP code all match exactly. Catches providers whose NPI record doesn't include the license number.
  3. Fuzzy name + city with nickname expansion. Handles ZIP mismatches, nickname variations ("Robert" vs "Bob", "Elizabeth" vs "Liz"), and Levenshtein-tolerant last name matching. Uses a 57-group nickname dictionary.
  4. Address match (relaxed multi-provider). When two providers share an address but have different names, cross-reference via the license number to disambiguate.
  5. Name-only with confirmation signals. Match on name alone when graduation year, gender, or specialty confirm the identity. Catches providers who relocated between NPI and TSBDE registration.
  6. Former last name retry. Tries maiden names from TSBDE historical records against NPI registered names. Catches name-change cases.
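
The fallback cascade can be sketched roughly as follows, covering only the first three tiers. Everything here is illustrative: the `Provider` fields, the two-group nickname list (the real dictionary has 57 groups), the one-edit Levenshtein tolerance, the case-insensitive comparison, and the rule that a tier only counts when it yields exactly one candidate are all assumptions, not a description of the production pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Illustrative nickname groups only; the real dictionary has 57.
NICKNAME_GROUPS = [{"robert", "bob"}, {"elizabeth", "liz"}]


@dataclass(frozen=True)
class Provider:
    first: str
    last: str
    zip_code: str
    city: str
    license_no: Optional[str] = None


def same_first_name(a: str, b: str) -> bool:
    """True if the names are equal or belong to the same nickname group."""
    a, b = a.lower(), b.lower()
    return a == b or any(a in g and b in g for g in NICKNAME_GROUPS)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def tier_license(npi: Provider, board: Provider) -> bool:
    return npi.license_no is not None and npi.license_no == board.license_no


def tier_name_zip(npi: Provider, board: Provider) -> bool:
    return (npi.first.lower(), npi.last.lower(), npi.zip_code) == (
        board.first.lower(), board.last.lower(), board.zip_code)


def tier_fuzzy_city(npi: Provider, board: Provider, max_edits: int = 1) -> bool:
    return (npi.city.lower() == board.city.lower()
            and same_first_name(npi.first, board.first)
            and levenshtein(npi.last.lower(), board.last.lower()) <= max_edits)


# Ordered from strictest to loosest; later tiers run only if earlier ones miss.
TIERS = [("license", tier_license), ("name_zip", tier_name_zip),
         ("fuzzy_city", tier_fuzzy_city)]


def match(npi: Provider, board_records: List[Provider]) -> Optional[Tuple[str, Provider]]:
    """Return (tier_name, record) for the first tier with exactly one hit."""
    for name, predicate in TIERS:
        hits = [b for b in board_records if predicate(npi, b)]
        if len(hits) == 1:  # assumption: ambiguous hits fall through
            return name, hits[0]
    return None
```

Returning the tier name alongside the record makes it possible to audit how much each tier contributes to the overall match rate.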

The result is a nationwide base layer with state-specific enrichment bolted on top. Texas is live. California, Florida, and New York are next. The architecture is state-agnostic: adding a new state means a new pipeline adapter and a data ingest, not a change to the core pipeline.
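
One way to picture the adapter boundary is sketched below. The `StateAdapter` protocol, the `BoardRecord` fields, and especially the raw column names (`LIC_NBR`, `FIRST_NME`, `LAST_NME`, `STATUS`) are hypothetical placeholders, not the actual TSBDE extract layout or our internal interfaces.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Protocol


@dataclass(frozen=True)
class BoardRecord:
    """Normalized shape every state adapter emits (hypothetical fields)."""
    state: str
    license_no: str
    first: str
    last: str
    license_status: str


class StateAdapter(Protocol):
    state: str

    def parse(self, raw_rows: Iterable[dict]) -> Iterator[BoardRecord]:
        ...


class TexasAdapter:
    state = "TX"

    def parse(self, raw_rows: Iterable[dict]) -> Iterator[BoardRecord]:
        # Column names below are placeholders for the state's own headers.
        for row in raw_rows:
            yield BoardRecord(
                state=self.state,
                license_no=row["LIC_NBR"],
                first=row["FIRST_NME"],
                last=row["LAST_NME"],
                license_status=row["STATUS"],
            )


# Adding a state means registering one more adapter here.
ADAPTERS: Dict[str, StateAdapter] = {"TX": TexasAdapter()}


def ingest(state: str, raw_rows: Iterable[dict]) -> List[BoardRecord]:
    """Normalize raw board rows via the state's adapter; KeyError if none exists."""
    return list(ADAPTERS[state].parse(raw_rows))
```

Downstream matching would then consume only `BoardRecord` values and never see a state's raw headers.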

Why does this matter? Because the data most teams use is incomplete. Broker databases are curated but narrow. LinkedIn scraping gives you employment history, not license status. NPI alone has no expiration dates, no disciplinary flags, no age indicators. The value is in the join.

Methodology: Enrichment counts queried from the ProviderSignal providers table on April 11, 2026, filtering where state=TX and license_status is not null. Match rate percentage is the ratio of enriched TX dentists to total TSBDE licensees.
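
The calculation described above can be sketched as follows. This uses Python's built-in `sqlite3` purely as a stand-in for the real database; the `providers` table with `state` and `license_status` columns follows the methodology note, while the function names are illustrative and the total licensee count is left as an input.

```python
import sqlite3


def enriched_tx_count(conn: sqlite3.Connection) -> int:
    """Count TX providers that carry state board data."""
    row = conn.execute(
        "SELECT COUNT(*) FROM providers "
        "WHERE state = 'TX' AND license_status IS NOT NULL"
    ).fetchone()
    return row[0]


def match_rate(enriched: int, total_licensees: int) -> float:
    """Ratio of enriched providers to total state board licensees."""
    if total_licensees <= 0:
        raise ValueError("total_licensees must be positive")
    return enriched / total_licensees
```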
