Quantum vs Classical for ICU Mortality: A Reproducible Benchmark


Three quantum kernel feature maps, a variational quantum classifier, and a quantum neural network, run head-to-head against classical baselines on the WiDS Datathon 2020 ICU dataset. The honest finding: classical wins on every metric, 100x to 10,000x faster.

Python · Qiskit · qiskit-machine-learning · scikit-learn · NumPy · pandas · ZZFeatureMap · PauliFeatureMap · VQC · QNN · RBF SVM

TL;DR. I built a reproducible end-to-end benchmark of five quantum approaches (three QSVC feature maps, a VQC, and a QNN) against three classical baselines (RBF SVM, logistic regression, random forest) for ICU mortality prediction on the WiDS Datathon 2020 dataset. The README leads with the honest finding: at this scale, on a noiseless simulator, classical kernels match or beat every quantum model on every metric while training 100x to 10,000x faster. The interesting result is the structure of the loss, not the headline number.

I have a physics undergraduate degree. I work as a healthcare ML engineer. The intersection of those two does not appear in any job posting I have ever read, and it took me a long time to take it seriously as a research thread. This project is the artifact where I stopped treating the intersection as a footnote.

The premise is a single, badly-asked question: can a quantum kernel beat a classical kernel on a medically-realistic tabular task? The answer at simulator scale is no. The interesting part is the precise account of where it loses, why it loses, and what the small corner where it does not lose looks like. The honest output of a project like this is not a benchmark trophy. It is a falsifiable account of which assumptions in the QML literature break first when the data is real.

The repository is the work. This page is the field notes. The numbers below are committed to the README; the Honest findings section there is permanent and explicit, not a hedge to be removed when results improve.

The cross-disciplinary stack

The project sits on three things at once.

Physics fluency. A quantum circuit is not a metaphor. It is a unitary operator acting on a Hilbert space, and the kernel between two patients is |<phi(x) | phi(x')>|^2 where phi is the feature map encoding classical data into a quantum state. Building a useful quantum kernel means picking a feature map whose induced inner product captures structure the data actually has. That decision is half quantum mechanics (which Hamiltonian generates the encoding) and half ML (which structure does the dataset reward). The decision is the project.

Healthcare data fluency. ICU mortality prediction is not a clean classification benchmark. The features arrive at irregular times, the missingness is informative, the labels are imbalanced (roughly 9 percent positive), and the operational cost of a wrong prediction is high. Working with the WiDS Datathon 2020 ICU dataset means honoring those properties during preprocessing, before any kernel ever sees the data.

ML engineering fluency. The quantum side is the small box at the end of a pipeline that mostly looks classical: cohort selection, median imputation, one-hot encoding, stratified split, train-only StandardScaler, baseline models, hyperparameter search, scoring. The quantum kernel slots in as a callable on top of sklearn.svm.SVC and the rest of the pipeline does not change.

None of these three is hard in isolation. The signal in the project is having all three loaded at once.

The dataset and the cohort

The WiDS Datathon 2020 dataset (Women in Data Science / MIT GOSSIS) contains roughly 91,000 ICU stays with 186 features per stay: vitals and lab values aggregated over the first 24 hours, demographics, admission diagnosis, ICU type. The task is binary: did the patient die in this hospital admission?

Class balance is a real problem. About 9 percent positive. Any model that predicts "no" for everyone scores 91 percent accuracy and is useless. The right metric is AUROC (area under the receiver operating characteristic) supplemented by AUPRC (area under the precision-recall curve), which weights minority-class performance more honestly.

Preprocessing is the unglamorous part that determines whether anything downstream is meaningful (a code sketch of the pipeline follows the list):

  1. Median imputation for numeric features (most clinical features have informative missingness, but a model on encoded missingness flags is its own project).
  2. One-hot encoding for categoricals (admission diagnosis, ICU type, ethnicity).
  3. Stratified 70/10/20 split for train/val/test, preserving the 9 percent positive rate in each fold.
  4. Train-only StandardScaler fit on the training fold and applied to val/test (no leakage).
  5. SelectKBest with the ANOVA F-statistic to reduce dimensionality before quantum encoding. Each selected feature becomes a qubit, so the feature count has to come down to keep circuit width and simulator runtime tractable.
  6. Class-balanced subsample for the QSVM arm, because the kernel matrix is O(N^2) in the number of training samples and has to stay below a working size.
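
A minimal sketch of how steps 1 through 5 compose in scikit-learn, assuming X and y are the loaded cohort and k=8 is the qubit budget (all hypothetical names here); the repo's script is the source of truth and may order details differently:

from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = Pipeline([
    ("columns", ColumnTransformer([
        # Steps 1 and 4: median-impute numerics, scale with train-fit statistics.
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]),
         selector(dtype_include="number")),
        # Step 2: one-hot encode diagnosis, ICU type, ethnicity.
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         selector(dtype_include="object")),
    ])),
    # Step 5: ANOVA-F selection down to the qubit budget.
    ("select", SelectKBest(f_classif, k=8)),
])

# Step 3: stratified 70/10/20 split preserving the 9 percent positive rate.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.125, stratify=y_trainval, random_state=0)

# Fit on the training fold only (no leakage), then transform val/test.
X_train_q = preprocess.fit_transform(X_train, y_train)
X_val_q, X_test_q = preprocess.transform(X_val), preprocess.transform(X_test)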

The pipeline reproduces from one command. If Kaggle credentials are present, it pulls the live dataset; if not, it falls back to a 5,000-row schema-matched synthetic dataset so the full experiment runs end to end on a fresh clone with no accounts.

python scripts/reproduce_all.py
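
Behind that command, the only branch point is the credential check. A hypothetical sketch (load_wids_from_kaggle and make_synthetic_cohort are illustrative names, not the repo's actual functions):

import os
from pathlib import Path

def load_dataset():
    # Live pull when Kaggle credentials exist; otherwise the 5,000-row
    # schema-matched synthetic fallback, so a fresh clone runs end to end.
    has_creds = (Path("~/.kaggle/kaggle.json").expanduser().exists()
                 or os.getenv("KAGGLE_USERNAME"))
    return load_wids_from_kaggle() if has_creds else make_synthetic_cohort(n_rows=5000)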

Quantum kernels, in plain English

A support-vector machine classifies by finding the hyperplane that best separates two classes in some feature space. The kernel trick lets us compute distances in that feature space without ever materializing it: the kernel function k(x, x') = <phi(x), phi(x')> does the work directly. RBF and polynomial kernels are the standard choices.

A quantum kernel is the same idea with the feature map living on a quantum computer. Each classical feature vector x is encoded into a quantum state |phi(x)> by a parameterized circuit. The kernel between two patients is the squared overlap of their states, estimated by running a small circuit and measuring how often the result is the all-zeros bitstring (the ComputeUncompute fidelity protocol):

from qiskit.circuit.library import ZZFeatureMap
from qiskit_machine_learning.kernels import FidelityQuantumKernel
from sklearn.svm import SVC

# One qubit per feature; linear entanglement couples neighboring qubits.
feature_map = ZZFeatureMap(feature_dimension=n_features, reps=2, entanglement="linear")
# Kernel entries are state fidelities; enforce_psd projects the Gram matrix
# back to positive semidefinite if estimation noise pushes it off.
quantum_kernel = FidelityQuantumKernel(feature_map=feature_map, enforce_psd=True)
# The quantum kernel slots into a stock SVC as a plain kernel callable.
svc = SVC(kernel=quantum_kernel.evaluate)
svc.fit(X_train, y_train)

The interesting design knob is the feature map. Three variants are under test in the QSVC arm:

  • ZZFeatureMap (Havlicek et al. 2019) with linear entanglement is the canonical baseline. Each feature gets a single-qubit rotation, and pairs of qubits are entangled by ZZ(theta_i * theta_j) rotations. The induced kernel is sensitive to interactions between features the entanglement structure connects.
  • PauliFeatureMap with paulis=["Z", "ZZ"] is the same family with explicit Pauli string control. Useful as a closer-to-spec comparison to the literature.
  • A custom H+RZ+CZ ring feature map where the entanglement step is a fixed Clifford circuit. The non-classical contribution comes entirely from the data-dependent rotations. This is the place where the entanglement structure stops being a hyperparameter and becomes a controlled variable: any difference between this map and the standard ones is attributable to the rotation pattern alone.

The custom ring map is the design decision worth dwelling on. The default linear entanglement in ZZFeatureMap couples qubits in topological order: qubit 0 with qubit 1, qubit 1 with qubit 2, and so on. That order has nothing to do with the data. A ring topology with a fixed Clifford entangler isolates one variable at a time, which is the discipline that lets the comparison with classical baselines be honest.
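
The repo's circuit is the source of truth; a minimal sketch of the shape described above (the function name is mine, not the repo's):

from qiskit import QuantumCircuit
from qiskit.circuit import ParameterVector

def ring_feature_map(n_qubits: int, reps: int = 2) -> QuantumCircuit:
    x = ParameterVector("x", n_qubits)
    qc = QuantumCircuit(n_qubits)
    for _ in range(reps):
        for q in range(n_qubits):
            qc.h(q)
            qc.rz(x[q], q)  # the only data-dependent, non-Clifford piece
        for q in range(n_qubits):
            qc.cz(q, (q + 1) % n_qubits)  # fixed Clifford entangler on a ring
    return qc

It drops into FidelityQuantumKernel exactly like the ZZFeatureMap above, so the comparison changes one variable at a time.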

The variational arms (VQC and QNN)

Two further quantum models complete the suite. Both train a parameterized circuit end-to-end against the labels rather than precomputing a kernel.

Variational Quantum Classifier (VQC). A ZZFeatureMap encodes the input, then a RealAmplitudes ansatz applies trainable rotations and entanglement, then an output measurement decodes a class label. Loss is cross-entropy. Optimization is COBYLA via scipy.optimize.minimize. Each iteration costs a full forward pass over the training set, which is the single largest time sink at this scale.
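
A minimal sketch of that arm; the hyperparameters are illustrative, not the repo's tuned values, and the COBYLA import path shifts across Qiskit versions:

from qiskit.circuit.library import RealAmplitudes, ZZFeatureMap
from qiskit_algorithms.optimizers import COBYLA
from qiskit_machine_learning.algorithms.classifiers import VQC

n_features = 8  # matches the SelectKBest cut; one qubit per feature
vqc = VQC(
    feature_map=ZZFeatureMap(feature_dimension=n_features, reps=2),
    ansatz=RealAmplitudes(num_qubits=n_features, reps=3),
    loss="cross_entropy",
    optimizer=COBYLA(maxiter=100),
)
vqc.fit(X_train_q, y_train)  # each COBYLA iteration is a full forward pass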

Quantum Neural Network (QNN). A PauliFeatureMap plus RealAmplitudes ansatz is wrapped as a SamplerQNN and exposed via NeuralNetworkClassifier with a parity interpret function (count of 1s in the output bitstring, mod 2, as the predicted class). The training loop is the same shape as the VQC but with sampling-based output rather than expectation-value output.
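
A minimal sketch of that wiring under the same assumptions (reps and maxiter are illustrative; X_train_q and y_train are the preprocessed fold from earlier):

from qiskit import QuantumCircuit
from qiskit.circuit.library import PauliFeatureMap, RealAmplitudes
from qiskit_algorithms.optimizers import COBYLA
from qiskit_machine_learning.algorithms.classifiers import NeuralNetworkClassifier
from qiskit_machine_learning.neural_networks import SamplerQNN

n_features = 8
feature_map = PauliFeatureMap(feature_dimension=n_features, reps=2, paulis=["Z", "ZZ"])
ansatz = RealAmplitudes(num_qubits=n_features, reps=3)
circuit = QuantumCircuit(n_features)
circuit.compose(feature_map, inplace=True)
circuit.compose(ansatz, inplace=True)

def parity(bitstring: int) -> int:
    # Interpret function: count of 1s in the sampled bitstring, mod 2.
    return bin(bitstring).count("1") % 2

qnn = SamplerQNN(
    circuit=circuit,
    input_params=feature_map.parameters,
    weight_params=ansatz.parameters,
    interpret=parity,
    output_shape=2,  # two classes: survived / died
)
clf = NeuralNetworkClassifier(neural_network=qnn, optimizer=COBYLA(maxiter=100))
clf.fit(X_train_q, y_train)  # same shape of loop as the VQC, sampling-based output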

Both of these add expressivity over the QSVC: trainable parameters, not just a fixed feature map. They also add cost. Each COBYLA iteration is a full forward pass over the training set, so total runtime scales with the iteration count, and at this scale the variational arms end up even slower than the QSVC's O(N^2) kernel matrix.

The classical baselines

Three baselines, matched on the same cohort, the same features, the same train-test splits, the same scoring:

  • Logistic regression with L2 regularization. The linear floor.
  • Random forest with 200 trees. The non-linear floor.
  • RBF SVM with light tuning. The realistic ceiling for kernel methods on this kind of tabular task.

Scoring is AUROC and AUPRC. Every quantum result is reported alongside the matched classical scoreboard. There is no version of this project where the quantum number stands alone.
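
A minimal sketch of the matched scoreboard, with default-ish hyperparameters standing in for the repo's light tuning:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.svm import SVC

baselines = {
    "logistic_regression": LogisticRegression(penalty="l2", max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "rbf_svm": SVC(kernel="rbf", probability=True, random_state=0),
}
for name, model in baselines.items():
    model.fit(X_train_q, y_train)
    scores = model.predict_proba(X_test_q)[:, 1]
    print(name,
          roc_auc_score(y_test, scores),            # AUROC
          average_precision_score(y_test, scores))  # AUPRC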

The honest scoreboard

The pattern across the runs committed to the repo:

Model                       AUROC  AUPRC  Train time
Logistic regression         0.83   0.42   seconds
Random forest (200 trees)   0.85   0.46   seconds
RBF SVM                     0.84   0.44   tens of seconds
QSVC, ZZFeatureMap          0.79   0.36   minutes to hours
QSVC, PauliFeatureMap       0.79   0.37   minutes to hours
QSVC, H+RZ+CZ ring          0.80   0.38   minutes to hours
VQC                         0.76   0.32   hours
QNN                         0.74   0.30   hours

Numbers above are illustrative of the pattern committed to the repo's results/ directory; exact values depend on the random seed and the SelectKBest cut. The qualitative ordering is stable across seeds.

The README's Honest findings section is explicit:

Classical kernels match or beat every quantum model on every metric while training 100x to 10,000x faster on this hardware regime. The runtime gap is the dominant story. A single QSVC kernel matrix on a class-balanced subsample takes minutes on a noiseless simulator and would take orders of magnitude longer on real hardware once shot noise is included. Logistic regression finishes the same job in single-digit seconds on a laptop.

That paragraph stays in the README permanently. It is not a hedge to be removed when results improve. It is the result.

Kernel concentration: the lurking failure mode

The most interesting structural finding from the QSVC sweep is something that has a name in the literature: kernel concentration (Thanasilp et al. 2024). As the feature map gets deeper (more reps), the pairwise quantum kernel values bunch toward a constant and the SVM has nothing to learn from. This is the same phenomenon I documented in the QML-Essentials curriculum, seen here on healthcare data instead of synthetic toy datasets.

The diagnostic is direct. Compute the variance of the off-diagonal entries of the kernel matrix on a held-out set, and watch it collapse as reps grows. Once that variance is below roughly 1e-3 on this dataset, the SVM is no longer a meaningful classifier no matter what AUROC the cross-validation reports. At that point the test-set AUROC starts looking suspiciously close to the prior, which is the giveaway.
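
A minimal version of that diagnostic, assuming X_val_q is the preprocessed held-out block (hypothetical name):

import numpy as np
from qiskit.circuit.library import ZZFeatureMap
from qiskit_machine_learning.kernels import FidelityQuantumKernel

def offdiag_variance(K: np.ndarray) -> float:
    # Variance of the off-diagonal kernel entries: the concentration signal.
    mask = ~np.eye(K.shape[0], dtype=bool)
    return float(K[mask].var())

for reps in (1, 2, 4, 8):
    fmap = ZZFeatureMap(feature_dimension=X_val_q.shape[1], reps=reps)
    K = FidelityQuantumKernel(feature_map=fmap).evaluate(X_val_q)
    print(reps, offdiag_variance(K))  # watch this collapse toward 0 as reps grows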

The structural takeaway: deeper is not better for QSVCs at this scale. The intuition that more circuit depth means more expressivity, ported over from classical neural networks, is exactly wrong for kernel methods on simulators.

Why this is worth doing anyway

Three reasons.

It is the right experiment to run before the hardware catches up. Quantum kernels at simulator scale today are a faithful proxy for what tomorrow's hardware will deliver, modulo noise. Building the pipeline, the cohort, the feature engineering, the baselines, the CI matrix, and the scoring discipline now means that when a 50-qubit fault-tolerant machine becomes routinely accessible, the experiment is ready to drop into it. The infrastructure is the contribution.

The cross-disciplinary stack is rare in healthcare ML. Most healthcare ML practitioners do not read quantum papers. Most quantum ML practitioners do not understand healthcare datasets. Sitting in the overlap, even at modest depth, is the differentiator.

The technical lessons travel. Working through quantum kernel construction, feature-map design, and kernel concentration deepens the intuition for classical kernel methods in ways that staring at scikit-learn does not. After a month of building Pauli feature maps, I went back to a classical XGBoost feature-engineering task and found I was making explicit, written-down decisions about which feature interactions deserved hand-engineered cross terms instead of trusting the trees to find them. That habit was a direct import from the quantum side.

Engineering hygiene

Five things make this repo recruiter-skimmable as MLOps, not just as a research notebook:

  1. CI matrix across Python 3.10, 3.11, 3.12. Tests run in under five seconds.
  2. make targets for test, lint, format, notebooks, all. A reviewer can run the full pipeline without reading the README.
  3. One-command reproduce script. scripts/reproduce_all.py runs the full benchmark end to end.
  4. Numbered notebooks (01 through 06) walking through data exploration, classical baselines, quantum kernels (with 60x60 kernel-matrix heatmaps and eigenvalue spectra), QSVM training, VQC and QNN training, and final results analysis.
  5. MIT licensed, pre-commit hooks, ruff, black, pytest, contributing guide.

What is next

Three concrete items on the roadmap.

  1. Run on real hardware. IBM's free 7-qubit machine is the obvious next step. Half the practical knowledge of QSVCs is invisible in simulation: shot noise, calibration drift, gate errors that are correlated and time-varying.
  2. Domain-informed entanglement structure. Pairing features that clinically co-vary (lactate with creatinine, mean arterial pressure with respiratory rate) in the entanglement graph is the small bet that the kernel will pick up the joint signal of metabolic acidosis and renal failure better than a kernel that pairs by qubit index. Whether the bet pays off is the open experiment; a sketch follows this list.
  3. A targeted scaling study that holds the feature map fixed and varies only the qubit count, to measure kernel concentration empirically on healthcare features rather than synthetic ones.
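
Item 2 is a one-line change in code terms: ZZFeatureMap accepts an explicit list of qubit pairs as its entanglement argument. The indices below are hypothetical placeholders for wherever those features land after the SelectKBest cut:

from qiskit.circuit.library import ZZFeatureMap

# Hypothetical positions: (lactate, creatinine) and (MAP, respiratory rate).
clinical_pairs = [[0, 1], [2, 3]]
fmap = ZZFeatureMap(feature_dimension=4, reps=2, entanglement=clinical_pairs)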

The repo is at https://github.com/TirtheshJani/QML-Healthcare-Diagnostics. The honest framing in the README stays. The alternative is the kind of overclaimed quantum benchmark the field has too many of already.