FHIR for AI/ML Engineers: What the Spec Gets Right and Where It Bites

May 3, 2026 · 7 min read

A field guide to FHIR for the ML engineer who just received a half-broken CSV export and a tight deadline. What the standard actually solves, and what it hands you as a research problem in disguise.

Healthcare · FHIR · ML Engineering · Data Engineering

The first time I touched a real FHIR feed, I thought I had a parsing bug. The file was a 1.4 GB NDJSON dump from a hospital partner, labeled "FHIR-compliant," and roughly half the Observation rows had a null effectiveDateTime. Two of the lab values were encoded in LOINC. Three more were in a homegrown vocabulary the hospital had invented around 2009 and never deprecated. The deadline was three weeks. I opened the spec, saw eight hundred pages of resource definitions, and quietly closed the tab.

I did the wrong thing for the next four days. I treated the export as a cleaning problem. I wrote regex repairs and unit conversions on top of the rows as I found them. By the end of that week the pipeline ran, and it was also unauditable: I could not tell you, for any given lab value, which transformations had been applied or in what order. I threw it all out and started again. This post is the field guide I wish I had on day one. It is not a sales pitch for the standard. It is a working engineer's read on what FHIR gets right, where it bites, and how to keep your sanity when the bites compound.

What FHIR Actually Is, in Plain Prose

FHIR (Fast Healthcare Interoperability Resources) is a RESTful API standard for clinical data, maintained by HL7. R4 has been normative since January 2019; R5, published in 2023, sits in trial-use status. R6 entered normative ballot in January 2026, with completion not expected before 2027. If you are building something today, you are almost certainly building against R4, because that is what regulators, EHR vendors, and certified health IT systems actually serve. R5 is interesting if you have a greenfield project and a friendly FHIR server. In production, R4 is the floor and the ceiling.

The mental model worth holding onto: FHIR is a graph of typed JSON resources, exchanged over HTTP. The resources you will touch most often as an ML engineer are Patient, Observation, Condition, MedicationRequest, Encounter, and Bundle. A Bundle is a transactional or search envelope containing other resources. An Observation is the workhorse: lab values, vital signs, imaging findings, behavioral assessments, all wear the Observation shape. A Patient resource ties identity to the rest of the graph by reference. If you can write a JSON API client, you can read FHIR. The hard part is not the syntax. The hard part is what the JSON does not say.
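To make that mental model concrete, here is a minimal sketch of turning a search Bundle into something you can actually work with: an in-memory index keyed by resource type and id. The helper name `index_bundle` is mine; the field names (`entry`, `resource`, `resourceType`, `id`) are standard R4 Bundle structure.

```python
def index_bundle(bundle: dict) -> dict:
    """Group a Bundle's entries into {resourceType: {id: resource}}."""
    index = {}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        rtype = resource.get("resourceType")
        if rtype:
            index.setdefault(rtype, {})[resource.get("id")] = resource
    return index

# A toy searchset Bundle: one Patient, one Observation referencing it.
bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "123"}},
        {"resource": {"resourceType": "Observation", "id": "obs-1",
                      "subject": {"reference": "Patient/123"}}},
    ],
}
idx = index_bundle(bundle)
```

Nothing clever, and that is the point: the syntax layer really is this thin.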

Where FHIR Gets It Right

Three things, and they are not small.

First, identity and references work. A Reference field like subject: { reference: "Patient/123" } is unambiguous, machine-resolvable, and survives across systems that implement the spec correctly. After years of stitching together patient records from CSVs with mismatched MRN columns, this is genuinely a relief. The first time I joined an Observation to a Patient to a Condition across three different source systems without a single fuzzy match, I sent the team a screenshot. It was not a complicated query. It was just one that, in a pre-FHIR world, would have taken a week of identifier reconciliation.
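That join is simple enough to sketch. Assuming the in-memory `{resourceType: {id: resource}}` index shape from above (my convention, not the spec's), resolving a relative Reference is one string split. Absolute URLs, `urn:uuid:` references, and contained resources need extra handling in real feeds; this sketch returns None for them.

```python
def resolve(ref: dict, index: dict):
    """Resolve a relative Reference like {"reference": "Patient/123"}
    against an in-memory index of {resourceType: {id: resource}}."""
    target = ref.get("reference", "")
    rtype, sep, rid = target.partition("/")
    if not sep or rtype not in index:
        return None  # absolute URL, urn:uuid:, or unknown type
    return index[rtype].get(rid)

index = {
    "Patient": {"123": {"resourceType": "Patient", "id": "123"}},
}
obs = {"resourceType": "Observation", "id": "obs-1",
       "subject": {"reference": "Patient/123"}}
patient = resolve(obs["subject"], index)
```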

Second, the resource model is opinionated in useful ways. Observation.value[x] forces you to declare whether a measurement is a quantity, a coded concept, a string, a ratio, or a range. That discipline pushes ambiguity into the open, where you can deal with it, instead of letting it hide inside a free-text column. I have worked with legacy clinical CSVs where a single column contained "120/80", "120 over 80", "WNL", and "see notes". FHIR will not let you do that. It is a rude little spec, and rudeness here is a feature.
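In practice you end up writing one dispatcher that normalizes the `value[x]` variants into a single shape. A sketch of mine below, covering the variants I hit most; the `(kind, value, unit)` tuple is my convention, and `valueRatio`, `valueRange`, and friends are deliberately flagged as unhandled rather than silently dropped.

```python
def extract_value(obs: dict):
    """Normalize common Observation.value[x] variants to (kind, value, unit)."""
    if "valueQuantity" in obs:
        q = obs["valueQuantity"]
        return ("quantity", q.get("value"), q.get("unit"))
    if "valueCodeableConcept" in obs:
        codings = obs["valueCodeableConcept"].get("coding", [])
        return ("coded", codings[0].get("code") if codings else None, None)
    if "valueString" in obs:
        return ("string", obs["valueString"], None)
    if "dataAbsentReason" in obs:
        reasons = obs["dataAbsentReason"].get("coding", [])
        return ("absent", reasons[0].get("code") if reasons else None, None)
    return ("unhandled", None, None)  # valueRatio, valueRange, etc.
```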

Third, the surrounding ecosystem is real. SMART on FHIR App Launch v2.2.0 gives you a tested OAuth 2.0 flow for clinical apps. The Bulk Data Access IG, version 2.0.0 (STU 2) on FHIR R4 gives you an asynchronous export pattern designed for exactly the population-scale extraction that ML training needs. USCDI v6, released by ASTP/ONC in July 2025, defines the floor of what certified EHRs in the United States must expose. None of this is theoretical. You can build against it today.
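For orientation, the Bulk Data kick-off is just an HTTP GET with an async preference header; here is a sketch of building that request. The `$export` endpoint, `_type`, `_since`, and `Prefer: respond-async` come from the Bulk Data Access IG; the base URL is a placeholder, server support varies by vendor, and auth (SMART Backend Services) is omitted entirely.

```python
from urllib.parse import urlencode

def build_export_request(base_url, resource_types=None, since=None):
    """Build the URL and headers for a system-level Bulk Data kick-off."""
    params = {}
    if resource_types:
        params["_type"] = ",".join(resource_types)
    if since:
        params["_since"] = since
    url = base_url.rstrip("/") + "/$export"
    if params:
        url += "?" + urlencode(params)
    headers = {
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",  # required for the asynchronous pattern
    }
    return url, headers

url, headers = build_export_request(
    "https://fhir.example.com/", ["Patient", "Observation"], since="2025-01-01")
```

The server answers 202 with a Content-Location to poll; when the job completes, you get a manifest of NDJSON file URLs, one stream per resource type.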

Where FHIR Bites

Now the honest part.

Bite 1: Terminology harmonization is a research problem, not a config problem.

FHIR tells you to put a code in a CodeableConcept. It does not tell you that the LOINC, SNOMED CT, and ICD-10 systems disagree on what counts as the same clinical event. It does not tell you that one hospital's "Hemoglobin A1c" is LOINC 4548-4 and another hospital's is LOINC 17856-6 (a sub-fraction) and a third hospital just uses an internal code that maps to neither. It does not tell you that ICD-10-CM and SNOMED CT model "diabetic neuropathy" with different ontological commitments and that a naive crosswalk will silently lose patients.

I learned this the hard way. On my first multi-site cohort, my model's recall on diabetic complications was about ten points lower at one site than the other two, and I spent a week chasing class imbalance and label noise before I noticed that the offending site was using a SNOMED CT code that my mapping table simply did not contain. The patients were there. My loader was throwing them away.

If you treat terminology mapping as a preprocessing step you can finish in an afternoon, you will ship a model that performs differently across sites for reasons you cannot explain. Treat it as a first-class engineering problem. Pick your target vocabulary upfront (LOINC for labs, SNOMED CT for clinical findings, RxNorm for medications). Build a normalization layer with explicit, versioned mapping tables. Log every code that fails to map, and review the log weekly. Most importantly, separate the unit normalization step from the code normalization step, because LOINC codes carry implicit unit assumptions and a mismatch there is a silent killer in lab-based features.
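The shape of that normalization layer is not complicated; the discipline is. A minimal sketch, where the mapping table, its version string, and the logger name are all hypothetical placeholders for whatever your site actually maintains:

```python
import logging

UNMAPPED = logging.getLogger("terminology.unmapped")

MAPPING_VERSION = "2026-05-01"  # version every table release; it goes in the log
LOCAL_TO_LOINC = {
    # (source system, source code) -> target LOINC; entries are illustrative
    ("local-lab", "HBA1C"): "4548-4",
}

def normalize_code(system: str, code: str):
    """Map a source code to the target vocabulary; log and return None on miss."""
    target = LOCAL_TO_LOINC.get((system, code))
    if target is None:
        UNMAPPED.warning("unmapped system=%s code=%s table=%s",
                         system, code, MAPPING_VERSION)
    return target
```

The weekly review of that unmapped log is what catches the third hospital's homegrown vocabulary before it silently drops patients.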

Bite 2: CodeableConcept gymnastics.

A single CodeableConcept can carry multiple Coding entries. The spec's intent is graceful: a sender can include both the local code and the standard code, so the receiver can pick whichever it understands. The reality is that you write code like this:

```python
def extract_loinc(observation):
    """Return the first LOINC-coded entry on an Observation, or None."""
    for coding in observation.get("code", {}).get("coding", []):
        if coding.get("system") == "http://loinc.org":
            return coding.get("code")
    return None
```

And then you discover that one of your data sources puts LOINC codes under system: "urn:oid:2.16.840.1.113883.6.1" (the OID form of LOINC), another uses https://loinc.org (note the s), and a third uses LOINC as a bare string because someone wrote a transformer in a hurry. I have seen all three of these in the same week, from the same vendor, in feeds that were all advertised as "R4 conformant." Schema-tolerant loaders are not optional. Write a system-name canonicalizer. Treat it as part of your data contract, not as a workaround. Version it, test it against real exports, and assume the next vendor will invent a fourth variant.
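A canonicalizer for that mess can be small. The variant spellings below are the ones described above; expect to add more, and keep the table under version control like any other data contract.

```python
# Map every observed spelling of a code system to one canonical URI.
CANONICAL_SYSTEMS = {
    "http://loinc.org": "http://loinc.org",
    "https://loinc.org": "http://loinc.org",
    "urn:oid:2.16.840.1.113883.6.1": "http://loinc.org",  # OID form of LOINC
    "loinc": "http://loinc.org",                          # bare-string variant
}

def canonical_system(system):
    if system is None:
        return None
    key = system.strip().rstrip("/").lower()
    return CANONICAL_SYSTEMS.get(key, system)  # pass unknowns through unchanged

def extract_loinc(observation):
    """The same extractor as above, now tolerant of system-name variants."""
    for coding in observation.get("code", {}).get("coding", []):
        if canonical_system(coding.get("system")) == "http://loinc.org":
            return coding.get("code")
    return None
```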

Bite 3: Missing data semantics are subtle and load-bearing.

In a CSV, a missing value is a missing value. In FHIR, you have at least four distinct ways a value can be absent, and they mean different things:

  1. The Observation resource simply does not exist. The measurement was never taken, or was never recorded in this system.
  2. The Observation exists but has dataAbsentReason populated, with a code like not-asked, unknown, or error.
  3. The Observation exists with a value, but its status is preliminary or amended, meaning the value may still change.
  4. The Observation exists in a Bundle you received, but the source system filtered the record on consent grounds before it got to you.

A naive ML pipeline collapses all four into NaN and trains a model whose feature importance scores are an artifact of which kind of absence dominates the cohort. I have done this. On one project I built a sepsis risk feature pipeline that learned, with embarrassing confidence, that "lactate not measured" was protective. It was. The patients with no lactate measurement were the patients who were not sick enough to warrant one. The model had discovered triage, not biology.

If you are building anything clinical, decide on a missingness taxonomy before you write the loader, and preserve the distinctions through to the feature store. "Missing" is data. Throwing it away is a modeling choice, and it should be a deliberate one.
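One way to make that taxonomy concrete is an explicit enum at load time rather than a NaN at feature time. A sketch: the enum names are mine, `obs=None` stands in for case 1, and case 4 (consent filtering upstream) is invisible at the resource level, so it has to be tracked at export time, not here.

```python
from enum import Enum

class Absence(Enum):
    PRESENT = "present"
    NOT_RECORDED = "not_recorded"   # no Observation resource at all (case 1)
    DATA_ABSENT = "data_absent"     # dataAbsentReason populated (case 2)
    PROVISIONAL = "provisional"     # preliminary/amended status (case 3)

def classify(obs):
    """Classify one Observation against the missingness taxonomy above."""
    if obs is None:
        return Absence.NOT_RECORDED
    if "dataAbsentReason" in obs:
        return Absence.DATA_ABSENT
    if obs.get("status") in ("preliminary", "amended"):
        return Absence.PROVISIONAL
    return Absence.PRESENT
```

Carry the enum into the feature store as its own column and the "lactate not measured is protective" failure mode at least becomes visible in the feature importances instead of hiding inside NaN.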

Bite 4: Search parameter quirks.

FHIR search is powerful and uneven. The spec defines parameters like _lastUpdated, date, and code, with prefix modifiers (gt, lt, ge, le) that look SQL-like. In practice, two FHIR servers will return different result sets for the same query because they implement search modifiers, chained parameters, and _include semantics with varying degrees of completeness. I have watched the same Observation?code=...&date=ge... query return different counts on consecutive days from the same server, because of how the vendor handled _include recursion under load. Your training pipeline cannot assume search is portable. If you need reproducible cohorts across sites, prefer the Bulk Data export over interactive search, and persist the raw NDJSON. You will thank yourself in six months when you need to re-derive a feature.

Building ML Feature Pipelines Without Losing Your Mind

A few habits that have saved me hours.

Cache the raw exports immutably. Every Bulk Data run gets a timestamped directory, gzipped NDJSON, and a manifest with the export job ID and the server's reported FHIR version. Re-derive features from cache. Never re-pull because a feature broke; the upstream resource may have been amended, and you will lose the audit trail. Storage is cheap. A reproducible cohort six months from now is not.
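The caching convention is a few lines of code. A sketch, where the directory layout and manifest fields are this pipeline's convention, not anything the FHIR spec mandates:

```python
import gzip
import json
import tempfile
import time
from pathlib import Path

def cache_export(records, root, job_id, fhir_version):
    """Write one Bulk Data run to an immutable, timestamped directory."""
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    outdir = Path(root) / stamp
    outdir.mkdir(parents=True)
    with gzip.open(outdir / "Observation.ndjson.gz", "wt") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    manifest = {"export_job_id": job_id, "fhir_version": fhir_version,
                "exported_at": stamp, "record_count": len(records)}
    (outdir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return outdir

outdir = cache_export([{"resourceType": "Observation", "id": "obs-1"}],
                      tempfile.mkdtemp(), job_id="job-42", fhir_version="4.0.1")
```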

Normalize terminologies upfront, in a stage you can rerun independently. Your feature engineering code should not contain LOINC mapping logic. It should consume already-normalized rows. When the mapping table updates (and it will, every quarter at minimum), you rerun the normalization stage and recompute features, without touching the modeling code.

Write loaders to be schema-tolerant by default. Use Pydantic, marshmallow, or your framework of choice with extra="ignore" on input and explicit validation on output. FHIR servers in the wild include vendor extensions, deprecated fields, and the occasional malformed Reference. Do not let one rogue field crash a million-resource export.
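With Pydantic v2, the tolerant-input half of that looks like this. The field selection is illustrative; the point is `extra="ignore"` on the way in, so a vendor extension never crashes the loader.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict

class ObservationRow(BaseModel):
    """Only the fields this pipeline uses; everything else is dropped."""
    model_config = ConfigDict(extra="ignore")

    resourceType: str
    id: str
    status: Optional[str] = None

row = ObservationRow.model_validate({
    "resourceType": "Observation",
    "id": "obs-1",
    "status": "final",
    "vendorCustomField": {"anything": "goes"},  # silently ignored on input
})
```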

Validate on the way out, not on the way in. The data is already in your system; rejecting it does not bring it back. Log validation failures, route them to a quarantine bucket, and proceed with the records that conform. Then triage the quarantine on a schedule. The single biggest reliability win I have shipped on a FHIR pipeline was switching from "raise on first invalid resource" to "quarantine and continue." Job success rate went from intermittent to boring overnight.
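The quarantine-and-continue loop is almost boring to write, which is rather the point. A sketch, where `validate` is any callable that raises on a bad record and the in-memory list stands in for a real quarantine bucket:

```python
import json

def load_with_quarantine(ndjson_lines, validate):
    """Keep conforming records; quarantine parse and validation failures."""
    accepted, quarantined = [], []
    for line in ndjson_lines:
        try:
            record = json.loads(line)
            validate(record)
        except Exception as exc:
            quarantined.append({"line": line, "error": str(exc)})
            continue  # one rogue record must not kill a million-resource job
        accepted.append(record)
    return accepted, quarantined

def require_id(record):
    if "id" not in record:
        raise ValueError("missing id")

lines = [
    '{"resourceType": "Observation", "id": "obs-1"}',
    '{"resourceType": "Observation"}',  # fails validation
    '{not even json',                   # fails parsing
]
accepted, quarantined = load_with_quarantine(lines, require_id)
```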

Treat the FHIR version as a feature of the data, not a constant. If you ingest from multiple sources, some will be R4 and some will be R5 and at least one will be R3 with custom extensions someone forgot to remove. Tag every record with its source version. Your model card will eventually need it, and the auditor who reads it will care.

The Honest Summary

FHIR is a real interoperability win at the network level. It will not solve your terminology problem, your missingness problem, or your cross-site reproducibility problem, because those are not interoperability problems. They are clinical-informatics problems wearing an interoperability mask.

If you came in expecting a standard that turns clinical data into a clean tabular feature store, you will be disappointed, and you will blame the spec. If you came in expecting a standard that gives you a stable, queryable graph of typed resources from which you must still build a clinical-informatics layer to get to ML-ready features, you will be productive, and you will ship.

The spec is not the model. The spec is the substrate. Build the rest deliberately.
