Clinical Note Summarizer: End-to-End MLOps for Healthcare NLP

healthcare · 7 min read

A FLAN-T5 fine-tuned on MTS-Dialog, wrapped in a FastAPI service that degrades gracefully, behind a CI/CD pipeline that authenticates to GCP via Workload Identity Federation.

Python 3.11 · FastAPI · Pydantic v2 · PyTorch · Hugging Face Transformers · FLAN-T5 · React 19 · Vite · Docker · Kubernetes (GKE Autopilot) · GitHub Actions · Workload Identity Federation · Ruff · pytest

Most public clinical NLP I see lives at one of two extremes. Either it is a notebook that happens to call a transformer, or it is a closed enterprise platform whose engineering you can only guess at from the marketing page. I wanted a small, public reference that demonstrated both halves of the work: the model, and the production scaffolding around it.

The result is MLOPS-Project, a clinical note summarization service. A FLAN-T5 fine-tuned on the public Microsoft MTS-Dialog corpus sits behind a FastAPI backend with the kind of probes, validation, and rate limiting a real engineering team expects. A React 19 SPA is bundled into the same image. The whole thing deploys to GKE Autopilot through a GitHub Actions pipeline that authenticates to GCP using Workload Identity Federation, not service-account JSON keys. Built end to end as a non-PHI demo.

This post walks through the architecture, the engineering decisions I find worth talking about, and what the next pass would look like.

The problem framing

Clinical notes are dense, sectioned, and written for other clinicians. Patients and downstream coders often want a shorter, more accessible version. The summarization task is a natural fit for a sequence-to-sequence model. The harder, more interesting question is what surrounds the model: how it loads, how the service degrades when it cannot, how requests are validated, how the image gets to the cluster, and how a green main branch becomes a running pod without anyone touching kubectl.

I work on production healthcare software at metricHEALTH (FHIR integrations, SOC 2 controls, real clinical product). I wanted a portfolio piece that exercised the same instincts on the AI side, with public data and zero PHI risk.

Architecture

The service is a single FastAPI process that owns three responsibilities: serving the API, holding a reference to the model in memory after startup, and serving the built React SPA from web/dist so the demo runs on one URL. CORS, per-IP rate limiting, and a JSON error envelope sit as middleware in front of the routes. The lifespan handler loads the tokenizer and model once at startup and tears them down on shutdown.

Three endpoints matter:

  • GET /health is a liveness probe. It returns 200 as long as the process is running, plus a model_loaded flag for observability.
  • GET /ready is a readiness probe. It also returns 200, but reports degraded if the model is unavailable, so the deployment can keep serving a stub response while a human operator investigates.
  • POST /summarize is the inference endpoint. It validates with Pydantic v2, enforces a 10,000-character cap, runs the rate-limit check, and either calls model.generate (with beam search by default, sampling when temperature > 0) or falls back to a simple word-trim if the model failed to load.
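The request and response contracts can be sketched with Pydantic v2. This is a minimal illustration, not the service's actual models: the field names and the temperature bounds are assumptions; only the 10,000-character cap and the degraded flag come from the description above.

```python
from pydantic import BaseModel, Field


class SummarizeRequest(BaseModel):
    # 10,000-character cap enforced at validation time, before inference runs.
    text: str = Field(min_length=1, max_length=10_000)
    # temperature > 0 switches generation from beam search to sampling.
    temperature: float = Field(default=0.0, ge=0.0, le=2.0)


class SummarizeResponse(BaseModel):
    summary: str
    # True when the model failed to load and a word-trim stub was returned.
    degraded: bool = False
```

Putting the cap in the schema means an oversized note is rejected with a 422 before any tokenizer or model work happens.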

Engineering decisions worth talking about

Graceful model fallback

The single most useful decision in the whole codebase is also the smallest. If the model cannot be loaded at startup (no network, missing files, transient HF Hub error), the service flips into stub mode. Probes still pass. The UI still works. /summarize returns the first 50 words of the input as a placeholder, with the same response schema and a degraded: true flag in the response so callers can branch. Nothing 500s, nothing crash-loops, and the readiness endpoint surfaces the degraded state honestly.

This is the kind of thing that sounds boring on a slide and matters every time something goes wrong in a real cluster. The first version of the service did not have it, and the first time HF Hub had a transient outage during a CI run the pod went into CrashLoopBackOff and the entire workflow timed out. The fallback was the smallest reasonable response to that: instead of treating "model unavailable" as a fatal error, treat it as a degraded mode with a different SLA. That decision is borrowed wholesale from how I think about FHIR integrations at metricHEALTH: the right answer when an upstream EHR is down is rarely to fail the whole patient flow.
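The fallback logic reduces to a few lines. This is an illustrative sketch, not the service's actual code: `summarize_with_fallback` and the `model` parameter are hypothetical names standing in for whatever the lifespan handler managed (or failed) to load.

```python
def summarize_with_fallback(text: str, model=None) -> dict:
    """Return a summary, degrading to a word-trim stub when no model loaded."""
    if model is None:
        # Stub mode: same response schema, first 50 words as a placeholder,
        # with an explicit flag so callers can branch on degraded output.
        stub = " ".join(text.split()[:50])
        return {"summary": stub, "degraded": True}
    return {"summary": model.generate(text), "degraded": False}
```

The point is that both branches return the same shape, so clients and tests never need a special case for the outage path.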

Workload Identity Federation, not JSON keys

The CI/CD pipeline never holds a long-lived GCP credential. GitHub mints a short-lived OIDC token per workflow run; that token is exchanged for a GCP access token via Workload Identity Federation; that access token authenticates the build, push, and rollout. No JSON key is ever stored in GitHub Secrets. Rotation is automatic.

This is now the recommended pattern for any GitHub-to-GCP pipeline and worth getting right on a small project before doing it on a large one.
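The workflow side of this is small. A hedged sketch of the relevant excerpt, using the official `google-github-actions/auth` action; the project number, pool, provider, and service account names below are placeholders, not the real project's values:

```yaml
permissions:
  id-token: write   # let the job mint a short-lived OIDC token
  contents: read

steps:
  - uses: actions/checkout@v4
  - uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github/providers/my-repo
      service_account: deployer@my-project.iam.gserviceaccount.com
```

The `id-token: write` permission is the piece people forget: without it, GitHub refuses to issue the OIDC token and the exchange fails.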

Multi-stage Docker, non-root user

The Dockerfile uses two stages: a python:3.11 builder that installs dependencies into an isolated prefix, and a python:3.11-slim runtime that copies only the installed packages plus the application source. The runtime container creates an appuser system account and runs the service under it. The frontend is built outside the image (the workflow runs npm ci && npm run build) and the static web/dist/ output is COPY-ed into the runtime stage with --chown=appuser:app. The result is a smaller, less privileged image.

Probes that match how the service actually starts

The Kubernetes Deployment uses three probes for a reason:

  • A startupProbe with a long failure threshold (90 attempts at 10-second intervals, up to 15 minutes) covers the cold-start window during which the model is loading.
  • A readinessProbe with a 90-second initial delay protects against routing traffic to a pod that has not finished loading.
  • A livenessProbe with a 120-second initial delay restarts the pod if it wedges later in its lifecycle.

A common mistake is to use the same /health configuration for everything and end up with restart loops during model load. Splitting the three probes makes the bootstrap behaviour explicit.
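In manifest form, the split looks roughly like this. The thresholds mirror the numbers above; the probe paths match the endpoints described earlier, but the port is an assumption about this project's service config:

```yaml
startupProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 90      # tolerate up to ~15 min of model cold-start
readinessProbe:
  httpGet: { path: /ready, port: 8000 }
  initialDelaySeconds: 90   # don't route traffic before load can finish
livenessProbe:
  httpGet: { path: /health, port: 8000 }
  initialDelaySeconds: 120  # only restart on genuine wedges, not slow starts
```

While the startupProbe is failing, Kubernetes suppresses the other two, which is exactly the behaviour a slow model load needs.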

Per-IP rate limiting (with a known limitation)

The rate limiter is an in-memory sliding window, keyed by client IP. It is documented as such in the code: fine for demos, fine for single-replica deployments, not fine for multi-replica production where you want a shared store. Treating this honestly in the README rather than over-claiming was a deliberate choice. The README's roadmap calls out distributed rate limiting backed by Redis as a future item.

A consistent error envelope

Every HTTPException flows through a single handler that returns {"error": {"code": ..., "message": ...}}. This means the React client can write one error parser instead of branching on which FastAPI default schema came back. Tests assert the envelope shape directly.
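The envelope can be sketched as a single mapping function. In the real service this is registered as a FastAPI exception handler for HTTPException; the `ApiError` class and `to_envelope` name below are illustrative stand-ins:

```python
class ApiError(Exception):
    """Hypothetical application error carrying a status, code, and message."""

    def __init__(self, status: int, code: str, message: str):
        self.status, self.code, self.message = status, code, message


def to_envelope(exc: ApiError) -> tuple[int, dict]:
    # Every error, regardless of origin, serializes to the same shape,
    # so the React client needs exactly one error parser.
    return exc.status, {"error": {"code": exc.code, "message": exc.message}}
```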

CI/CD

CI runs on every push and pull request. It lints with Ruff (check and format), runs the pytest suite with coverage against a tiny hf-internal-testing/tiny-random-t5 model so the test job stays under a minute, and builds the React SPA. The tiny model is the single most important detail here: it lets the test suite exercise the real loader, the real fallback path, and the real generate call without paying for a multi-hundred-megabyte download on every push.

CD only runs on pushes to main. It builds the SPA, federates into GCP, builds and pushes a Docker image tagged with the commit SHA (and latest) to Artifact Registry, then issues kubectl set image and waits for rollout status with a 10-minute timeout. If the rollout fails, the workflow fails, and the cluster keeps serving the previous image.

Training pipeline

The model is fine-tuned on Microsoft MTS-Dialog, a public corpus of synthetic clinician-patient dialogues paired with sectioned notes. I picked it for three reasons: it is realistic enough to learn the section conventions of clinical writing, it is large enough (1,700 encounters) to support a small fine-tune, and it is synthetic so there is no PHI to manage.

The pipeline is three steps under clinical-note-summarizer/scripts/:

  1. preprocess_t5.py reads the CSV, prepends summarize: to each input, tokenizes inputs at max length 2048 and targets at 256, and saves a Dataset to disk. Pad tokens in the labels are masked to -100 so they are excluded from the loss.
  2. prepare_dataset.py produces a 90/10 train/validation split with a fixed seed.
  3. train.py fine-tunes FLAN-T5 with Seq2SeqTrainer, computes ROUGE during evaluation, persists the best checkpoint by rougeL, and copies it to a stable export directory the runtime image expects.

The 2,048-token input length is intentional. Clinical notes are long, and truncating aggressively at 512 destroys the structure the model is supposed to learn to summarize. The first time I tried this I left the input length at the FLAN-T5 default of 512 and watched ROUGE flatline near a generic summarization baseline. The model could not see past the chief complaint. Pushing the input length up was the single change that made the fine-tune behave like it was actually learning the corpus and not paraphrasing the first paragraph.
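The label-masking step in preprocessing deserves one concrete line. A minimal sketch of the idea (the function name is illustrative, not the script's actual code): pad positions in the target ids are replaced with -100, the ignore index of PyTorch's cross-entropy loss, so padding contributes nothing to training.

```python
PAD_ID = 0  # T5's pad token id


def mask_labels(label_ids: list[int], pad_id: int = PAD_ID) -> list[int]:
    # -100 is the ignore_index for torch.nn.CrossEntropyLoss, so these
    # positions are skipped when the seq2seq loss is computed.
    return [tok if tok != pad_id else -100 for tok in label_ids]
```

Without this mask, short targets in a padded batch would let the model rack up easy loss reductions by predicting pad tokens.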

What I would not do again

A few things I would change on a second pass:

  • Move rate limiting to Redis. The in-memory limiter is a known toy. As soon as the HPA scales past one replica, it stops being a real limiter.
  • Add structured JSON logs and request IDs. The current logging is logger.info with extra={...}. Useful, but a step short of what a real ops team wants.
  • Add a Prometheus /metrics endpoint. Latency histograms, in-flight requests, model-load state.
  • Canary the rollouts. A Recreate strategy makes sense at the moment because the model load is heavy and I don't want two replicas competing for memory during deploy. Argo Rollouts with a small canary is the obvious next step.
  • Batch inference. A /summarize/batch endpoint that accepts a list and dispatches one tokenizer call would be more honest than the current per-request loop.

These are listed in the README's roadmap section. They are deliberately out of scope for the current cut.

What this is not

It is not a medical device. It is not validated for clinical use. The README and the /summarize description both say so explicitly. The corpus is synthetic, the deployment is a demo, and the API is rate-limited at 30 requests per minute per IP. Anyone running this against real PHI without a serious compliance review is doing something I am not endorsing.

What it is, instead, is a demonstration of the engineering practice that should sit between a model and a clinical workflow when the time comes for that work to be real.

Try it

If you want a faster local walkthrough than the full FastAPI plus React stack, there is a standalone Streamlit demo in demo_app.py that runs against the public google/flan-t5-base checkpoint. Start with that, then graduate to the API once you want to see the production patterns.

If you build on this or hit something that should work better, please open an issue or a PR. The contribution guide is in CONTRIBUTING.md.