Clinical Note Summarizer: End-to-End MLOps for Healthcare NLP
A FLAN-T5 fine-tuned on MTS-Dialog, wrapped in a FastAPI service that degrades gracefully, behind a CI/CD pipeline that authenticates to GCP via Workload Identity Federation.
Most public clinical NLP I see lives at one of two extremes. Either it is a notebook that happens to call a transformer, or it is a closed enterprise platform whose engineering you can only guess at from the marketing page. I wanted a small, public reference that demonstrated both halves of the work: the model, and the production scaffolding around it.
The result is MLOPS-Project, a clinical note summarization service. A FLAN-T5 fine-tuned on the public Microsoft MTS-Dialog corpus sits behind a FastAPI backend with the kind of probes, validation, and rate limiting a real engineering team expects. A React 19 SPA is bundled into the same image. The whole thing deploys to GKE Autopilot through a GitHub Actions pipeline that authenticates to GCP using Workload Identity Federation, not service-account JSON keys. Built end to end as a non-PHI demo.
This post walks through the architecture, the engineering decisions I find worth talking about, and what the next pass would look like.
The problem framing
Clinical notes are dense, sectioned, and written for other clinicians. Patients and downstream coders often want a shorter, more accessible version. The summarization task is a natural fit for a sequence-to-sequence model. The harder, more interesting question is what surrounds the model: how it loads, how the service degrades when it cannot, how requests are validated, how the image gets to the cluster, and how a green main branch becomes a running pod without anyone touching kubectl.
I work on production healthcare software at metricHEALTH (FHIR integrations, SOC 2 controls, real clinical product). I wanted a portfolio piece that exercised the same instincts on the AI side, with public data and zero PHI risk.
Architecture
The service is a single FastAPI process that owns three responsibilities: serving the API, holding a reference to the model in memory after startup, and serving the built React SPA from web/dist so the demo runs on one URL. CORS, per-IP rate limiting, and a JSON error envelope sit as middleware in front of the routes. The lifespan handler loads the tokenizer and model once at startup and tears them down on shutdown.
Three endpoints matter:
- GET /health is a liveness probe. It returns 200 as long as the process is running, plus a model_loaded flag for observability.
- GET /ready is a readiness probe. It also returns 200, but reports degraded if the model is unavailable, so the deployment can still serve a stub response while a Kubernetes operator investigates.
- POST /summarize is the inference endpoint. It validates with Pydantic v2, enforces a 10,000-character cap, runs the rate-limit check, and either calls model.generate (with beam search by default, sampling when temperature > 0) or falls back to a simple word-trim if the model failed to load.
Engineering decisions worth talking about
Graceful model fallback
The single most useful decision in the whole codebase is also the smallest. If the model cannot be loaded at startup (no network, missing files, transient HF Hub error), the service flips into stub mode. Probes still pass. The UI still works. /summarize returns the first 50 words of the input as a placeholder, with the same response schema and a degraded: true flag in the response so callers can branch. Nothing 500s, nothing crash-loops, and the readiness endpoint surfaces the degraded state honestly.
This is the kind of thing that sounds boring on a slide and matters every time something goes wrong in a real cluster. The first version of the service did not have it, and the first time HF Hub had a transient outage during a CI run the pod went into CrashLoopBackOff and the entire workflow timed out. The fallback was the smallest reasonable response to that: instead of treating "model unavailable" as a fatal error, treat it as a degraded mode with a different SLA. That decision is borrowed wholesale from how I think about FHIR integrations at metricHEALTH; the right answer when an upstream EHR is down is rarely to fail the whole patient flow.
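The shape of the fallback is small enough to sketch in full. This is an illustrative reconstruction, not the project's actual code, and the function and field names are mine; the essential move is that a missing model changes the response content, never the response schema or the status code:

```python
# Sketch of degraded-mode summarization: same schema either way,
# with a flag callers can branch on.
STUB_WORD_LIMIT = 50  # the word-trim described above

def summarize(text: str, model=None) -> dict:
    """Return a real summary, or a word-trimmed stub if the model is absent."""
    if model is None:
        stub = " ".join(text.split()[:STUB_WORD_LIMIT])
        return {"summary": stub, "degraded": True}
    return {"summary": model.generate(text), "degraded": False}
```

The route handler and the readiness probe both read the same "is the model loaded" state, so the degraded flag and the degraded readiness report can never disagree.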
Workload Identity Federation, not JSON keys
The CI/CD pipeline never holds a long-lived GCP credential. GitHub mints a short-lived OIDC token per workflow run; that token is exchanged for a GCP access token via Workload Identity Federation; that access token authenticates the build, push, and rollout. No JSON key is ever stored in GitHub Secrets. Rotation is automatic.
This is now the recommended pattern for any GitHub-to-GCP pipeline and worth getting right on a small project before doing it on a large one.
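The auth step in the workflow reduces to a few lines. This is a sketch: the pool, provider, and service-account values below are placeholders, but the action (google-github-actions/auth) and the id-token permission are the standard pieces of the pattern:

```yaml
# Sketch of WIF auth in GitHub Actions. No stored JSON key: the job's
# OIDC token is exchanged for a short-lived GCP access token.
permissions:
  contents: read
  id-token: write   # lets the job request an OIDC token

steps:
  - uses: actions/checkout@v4
  - id: auth
    uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github/providers/my-repo  # placeholder
      service_account: deployer@my-project.iam.gserviceaccount.com  # placeholder
```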
Multi-stage Docker, non-root user
The Dockerfile uses two stages: a python:3.11 builder that installs dependencies into an isolated prefix, and a python:3.11-slim runtime that copies only the installed packages plus the application source. The runtime container creates an appuser system account and runs the service under it. The frontend is built outside the image (the workflow runs npm ci && npm run build) and the static web/dist/ output is COPY-ed into the runtime stage with --chown=appuser:app. The result is a smaller, less privileged image.
Probes that match how the service actually starts
The Kubernetes Deployment uses three probes for a reason:
- A startupProbe with a long failure threshold (90 attempts at 10 seconds) covers the cold-start window during which the model is loading.
- A readinessProbe with a 90-second initial delay protects against routing traffic to a pod that has not finished loading.
- A livenessProbe with a 120-second initial delay restarts the pod if it wedges later in its lifecycle.
A common mistake is to use the same /health configuration for everything and end up with restart loops during model load. Splitting the three probes makes the bootstrap behaviour explicit.
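In manifest form, the split looks roughly like this. The thresholds and delays come straight from the numbers above; the paths match the endpoints described earlier, and the port is a placeholder:

```yaml
# Sketch of the three-probe split. The startup probe owns the cold-start
# window; readiness and liveness only matter once it has passed.
startupProbe:
  httpGet: {path: /health, port: 8080}
  periodSeconds: 10
  failureThreshold: 90      # up to ~15 minutes for the model to load
readinessProbe:
  httpGet: {path: /ready, port: 8080}
  initialDelaySeconds: 90
livenessProbe:
  httpGet: {path: /health, port: 8080}
  initialDelaySeconds: 120
```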
Per-IP rate limiting (with a known limitation)
The rate limiter is an in-memory sliding window, keyed by client IP. It is documented as such in the code: fine for demos, fine for single-replica deployments, not fine for multi-replica production where you want a shared store. Treating this honestly in the README rather than over-claiming was a deliberate choice. The README's roadmap calls out distributed rate limiting backed by Redis as a future item.
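The limiter itself is a few lines of standard library. A minimal sketch, with hypothetical names (the real service uses 30 requests per minute per IP):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Per-key sliding-window rate limiter. In-memory only, so it is
    per-process: each replica counts independently, which is exactly
    the multi-replica limitation noted above."""

    def __init__(self, limit: int = 30, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Evict timestamps that have slid out of the window
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Swapping the deque-per-key store for Redis sorted sets is what the roadmap item amounts to.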
A consistent error envelope
Every HTTPException flows through a single handler that returns {"error": {"code": ..., "message": ...}}. This means the React client can write one error parser instead of branching on which FastAPI default schema came back. Tests assert the envelope shape directly.
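Stripped of the FastAPI wiring, the envelope is one shape. The builder below is a hypothetical sketch of that shape, not the project's code; the commented wiring shows roughly where it would sit:

```python
def error_envelope(code: str, message: str) -> dict:
    """Single error shape for every failure the API returns.
    The React client parses this one structure instead of branching
    on FastAPI's default {"detail": ...} variants."""
    return {"error": {"code": code, "message": message}}

# Wired up roughly like:
#   @app.exception_handler(HTTPException)
#   async def handler(request, exc):
#       return JSONResponse(status_code=exc.status_code,
#                           content=error_envelope(str(exc.status_code), exc.detail))
```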
CI/CD
CI runs on every push and pull request. It lints with Ruff (check and format), runs the pytest suite with coverage against a tiny hf-internal-testing/tiny-random-t5 model so the test job stays under a minute, and builds the React SPA. The tiny model is the single most important detail here: it lets the test suite exercise the real loader, the real fallback path, and the real generate call without paying for a multi-hundred-megabyte download on every push.
CD only runs on pushes to main. It builds the SPA, federates into GCP, builds and pushes a Docker image tagged with the commit SHA (and latest) to Artifact Registry, then issues kubectl set image and waits for rollout status with a 10-minute timeout. If the rollout fails, the workflow fails, and the cluster keeps serving the previous image.
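The rollout step reduces to two kubectl calls. Deployment, container, and registry names below are placeholders; the timeout matches the one described above:

```yaml
# Sketch of the deploy step: pin the new SHA-tagged image, then block
# on rollout. A failed rollout fails the job; the old image keeps serving.
- name: Deploy
  run: |
    kubectl set image deployment/summarizer \
      summarizer=REGION-docker.pkg.dev/PROJECT/repo/summarizer:${{ github.sha }}
    kubectl rollout status deployment/summarizer --timeout=10m
```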
Training pipeline
The model is fine-tuned on Microsoft MTS-Dialog, a public corpus of synthetic clinician-patient dialogues paired with sectioned notes. I picked it for three reasons: it is realistic enough to learn the section conventions of clinical writing, it is large enough (1,700 encounters) to support a small fine-tune, and it is synthetic so there is no PHI to manage.
The pipeline is three steps under clinical-note-summarizer/scripts/:
- preprocess_t5.py reads the CSV, prepends "summarize: " to each input, tokenizes inputs at max length 2048 and targets at 256, and saves a Dataset to disk. Pad tokens in the labels are masked to -100 so they are excluded from the loss.
- prepare_dataset.py produces a 90/10 train/validation split with a fixed seed.
- train.py fine-tunes FLAN-T5 with Seq2SeqTrainer, computes ROUGE during evaluation, persists the best checkpoint by rougeL, and copies it to a stable export directory the runtime image expects.
An input length of 2,048 tokens is intentional. Clinical notes are long, and truncating aggressively at 512 destroys the structure the model is supposed to learn to summarize. The first time I tried this I left the input length at the FLAN-T5 default of 512 and watched ROUGE flatline near a generic summarization baseline. The model could not see past the chief complaint. Pushing the input length up was the single change that made the fine-tune behave like it was actually learning the corpus rather than paraphrasing the first paragraph.
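The label-masking detail in the preprocessing step is worth seeing concretely. A minimal sketch (the pad token id of 0 matches T5's tokenizer, but treat both the value and the function name as illustrative):

```python
IGNORE_INDEX = -100  # value the cross-entropy loss is configured to skip

def mask_pad_labels(label_ids, pad_token_id: int = 0):
    """Replace padding positions in the target ids with -100 so that
    padded timesteps contribute nothing to the training loss."""
    return [IGNORE_INDEX if t == pad_token_id else t for t in label_ids]
```

Without this mask, the model would be rewarded for predicting padding, which quietly inflates the loss signal on short targets.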
What I would not do again
A few things I would change on a second pass:
- Move rate limiting to Redis. The in-memory limiter is a known toy. As soon as the HPA scales past one replica, it stops being a real limiter.
- Add structured JSON logs and request IDs. The current logging is logger.info with extra={...}. Useful, but a step short of what a real ops team wants.
- Add a Prometheus /metrics endpoint. Latency histograms, in-flight requests, model-load state.
- Canary the rollouts. A Recreate strategy makes sense at the moment because the model load is heavy and I don't want two replicas competing for memory during deploy. Argo Rollouts with a small canary is the obvious next step.
- Batch inference. A /summarize/batch endpoint that accepts a list and dispatches one tokenizer call would be more honest than the current per-request loop.
These are listed in the README's roadmap section. They are deliberately out of scope for the current cut.
What this is not
It is not a medical device. It is not validated for clinical use. The README and the /summarize description both say so explicitly. The corpus is synthetic, the deployment is a demo, and the API is rate-limited at 30 requests per minute per IP. Anyone running this against real PHI without a serious compliance review is doing something I am not endorsing.
What it is, instead, is a demonstration of the engineering practice that should sit between a model and a clinical workflow when the time comes for that work to be real.
Try it
- Repo: github.com/TirtheshJani/MLOPS-Project
- License: MIT.
If you want a faster local walkthrough than the full FastAPI plus React stack, there is a standalone Streamlit demo in demo_app.py that runs against the public google/flan-t5-base checkpoint. Start with that, then graduate to the API once you want to see the production patterns.
If you build on this or hit something that should work better, please open an issue or a PR. The contribution guide is in CONTRIBUTING.md.