Causal masked-line ablation reveals partial shortcut learning in a tree-based MK classifier of Gaia-ESO UVES spectra


A LightGBM classifier of stellar spectra hits 92.6% macro-F1 on the Gaia-ESO Survey. A causal masked-line audit reveals that one of the three classes is being read off the wrong features. Strong aggregate metrics can hide class-specific physics gaps that only an interventional probe will surface.

Python · LightGBM · SHAP · Gaia-ESO Survey · FLAMES-UVES · Pickles 1998

TL;DR. A LightGBM classifier hits 0.926 macro-F1 on 3,032 Gaia-ESO UVES spectra across the F, G, and K classes. Two pre-registered masked-line ablations land where they should: F collapses without Balmer, and G drops 21 points without Mg b. The third does not: K-class accuracy is unchanged whether Mg b, Na D, or Ca I is masked. The K class rides a partial shortcut, reading Fe I and Cr I lines in the mid-5200 Å region instead of the canonical neutral-metal diagnostics. Chi-squared matching against Pickles templates corroborates the same per-class disagreement pattern. Preprint on arXiv, May 2026.

A machine learning model trained on stellar spectra reaches a macro-F1 score of 0.926. By every conventional metric, that is a strong classifier. The natural reaction is to ship it, write the paper, and move on.

This paper is about what happens when you do not move on. It asks a different question: when the classifier predicts that a star is type F, G, or K, is it reading the spectral features an astronomer would use to make the same call, or is it cheating? The answer turned out to depend on the class. F and G are textbook. K is not.

The setting in plain terms

Stars are sorted into spectral classes that run roughly hot-to-cool: O, B, A, F, G, K, M. The Sun is type G. The system, called Morgan-Keenan or MK classification, has been the working language of stellar astronomy since the mid-twentieth century. Astronomers identify class by looking at specific absorption lines in a star's spectrum. Hot stars show strong Balmer lines (the hydrogen series including Hα at 6563 Å and Hβ at 4861 Å). Intermediate-temperature stars show the Mg b triplet near 5170 Å. Cooler stars show neutral metal lines like the Na D doublet at 5890 Å and Ca I features at 6162 and 6439 Å. These are the canonical diagnostics, and they are the lines a working spectroscopist would point at.
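
For concreteness, that line list can be written down as a small Python mapping, which the code sketches later in this post reuse. The individual Mg b and Na D component wavelengths are standard reference values, not quoted from the paper.

```python
# Canonical FGK diagnostics (approximate air wavelengths, Å). Component
# values for Mg b and Na D are standard reference wavelengths; the paper
# quotes only the approximate positions given in the text above.
DIAGNOSTICS = {
    "Balmer": [4861.3, 6562.8],          # Hβ, Hα
    "Mg b":   [5167.3, 5172.7, 5183.6],  # triplet near 5170 Å
    "Na D":   [5890.0, 5895.9],          # doublet
    "Ca I":   [6162.2, 6439.1],
}
```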

Modern surveys produce far more spectra than a human can classify by eye, so researchers train classifiers. The question for any such classifier is whether it is doing the job the way an astronomer would, or whether it has found a statistical shortcut: a feature that correlates with class but is disconnected from the canonical physics.

Why correlational interpretability is not enough

The standard tools for opening up a model are SHAP, permutation importance, and saliency maps. They rank features by how heavily the model uses them. They do not, however, test whether a feature matters in a falsifiable sense. A feature can rank high because it correlates with a hidden temperature proxy. A feature can rank low because the model has redundant predictors and treats it as substitutable. Neither outcome tells you what would happen if you simply removed the feature and asked the model to classify again.

This study uses an interventional probe instead: causal masked-line ablation. For each (diagnostic line, class) pair, the bins covering the line are replaced with a continuum value, the masked spectrum is rescored against the same trained model, and the resulting accuracy drop is compared to a null distribution built from 500 random non-line windows of the same total width. The headline pairs were pre-registered before any masking was run, the tests are Bonferroni-corrected, and the drops carry bootstrap confidence intervals. The output is a single number per pair with a p-value, and that number is falsifiable in the strong sense: if it disagrees with the prediction the audit set up, the prediction was wrong.
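
Here is a minimal sketch of that procedure. Everything is illustrative: `model` stands in for the trained LightGBM classifier, `X` for the (n_spectra, n_bins) flux matrix, `wavelengths` for the bin centres in Å, and the window width, continuum value, and null-window placement are my guesses, not the paper's choices.

```python
import numpy as np

def mask_lines(X, wavelengths, centres, half_width, continuum=1.0):
    """Replace every bin within +/- half_width Å of any centre with continuum."""
    X_masked = X.copy()
    for c in np.atleast_1d(centres):
        X_masked[:, np.abs(wavelengths - c) <= half_width] = continuum
    return X_masked

def class_accuracy(model, X, y, cls):
    """Accuracy restricted to spectra whose true label is cls."""
    sel = y == cls
    return float((model.predict(X[sel]) == cls).mean())

def ablation(model, X, wavelengths, y, cls, centres, half_width=5.0,
             n_null=500, seed=0):
    """Accuracy drop from masking a diagnostic, plus an empirical p-value
    against a null of random windows of the same total width. (The paper's
    null excludes windows that overlap known lines; skipped here for brevity.)"""
    rng = np.random.default_rng(seed)
    centres = np.atleast_1d(centres)
    base = class_accuracy(model, X, y, cls)
    drop = base - class_accuracy(
        model, mask_lines(X, wavelengths, centres, half_width), y, cls)
    lo, hi = wavelengths.min() + half_width, wavelengths.max() - half_width
    null = np.array([
        base - class_accuracy(
            model,
            mask_lines(X, wavelengths, rng.uniform(lo, hi, len(centres)),
                       half_width),
            y, cls)
        for _ in range(n_null)])
    p = (1 + (null >= drop).sum()) / (1 + n_null)  # one-sided empirical p
    return drop, p
```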

The dataset and the classifier

The audit was run on 3,032 high-resolution spectra from the Gaia-ESO Survey, taken with the FLAMES-UVES spectrograph at the VLT in Chile. After quality cuts, the sample breaks down into 536 F-class, 1,534 G-class, and 962 K-class stars. The wavelength window covers 4800 to 6800 Å, which contains all the canonical FGK diagnostics.

A LightGBM classifier (gradient-boosted decision trees) was trained on the spectra after a fixed train/validation/test split. On the held-out test set of 456 spectra, it scored:

  • F-class: precision 0.94, recall 0.84, F1 0.89
  • G-class: precision 0.91, recall 0.97, F1 0.94
  • K-class: precision 0.96, recall 0.94, F1 0.95
  • Macro-F1: 0.926
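
Those numbers come from a setup along the following lines; the hyperparameters and variable names here are placeholders, not the paper's.

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report

# X_train/X_test: (n_spectra, 696) flux arrays; y_train/y_test: "F"/"G"/"K".
model = LGBMClassifier(objective="multiclass", random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), digits=3))
```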

A leakage-aware spatial cross-validation (DBSCAN-grouped K-fold over sky coordinates) returned a macro-F1 of 0.903, against 0.913 for standard stratified K-fold. The 1-percentage-point gap sits in the minimal-leakage band reported by Blanco-Cuaresma (2019). The held-out test score sits 1.5 standard deviations above the spatial-CV mean, comfortably within sampling fluctuation.
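
A sketch of that leakage-aware split, assuming DBSCAN cluster labels over sky coordinates serve as the CV groups; `eps` and `min_samples` are placeholder values, not the paper's.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.cluster import DBSCAN
from sklearn.model_selection import GroupKFold, cross_val_score

# ra, dec: per-spectrum sky coordinates in degrees; X, y as before.
groups = DBSCAN(eps=0.5, min_samples=5).fit_predict(np.column_stack([ra, dec]))
# DBSCAN labels outliers -1; give each outlier its own group so GroupKFold
# does not pool all field stars into a single fold.
noise = groups == -1
groups[noise] = groups.max() + 1 + np.arange(noise.sum())

scores = cross_val_score(LGBMClassifier(), X, y, groups=groups,
                         cv=GroupKFold(n_splits=5), scoring="f1_macro")
print(f"spatial CV macro-F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```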

So the classifier is real. The next question is how it works.

The audit, one pair at a time

Three headline pairs were pre-registered: Balmer lines on F-class, Mg b on G-class, and Mg b on K-class. The first two are textbook predictions: an astronomer who masked the Balmer series in an F-type spectrum would expect classification to fall apart, and the same for Mg b on G. The third pair (Mg b on K) was the interesting case, because correlational SHAP analysis already hinted that K-class might not be reading Mg b.

The Balmer mask on F-class was textbook. Accuracy dropped from 0.840 to 0.000. Every F prediction cascaded to G or K. The hierarchical Balmer-then-Mg b reasoning that astronomers use predicts exactly this pattern: with Balmer information removed, F-type spectra are read as cooler.

The Mg b mask on G-class was also textbook. Accuracy dropped from 0.965 to 0.752, a 21-point fall, statistically significant at p < 0.002.

The Mg b mask on K-class was the surprise. Accuracy went from 0.938 to 0.938. Zero change. The 95% confidence interval straddled zero. The p-value against the random-window null was 0.866. The K-class classifier was not using Mg b.

Two further pivot pairs corroborated the substitution. Masking Na D on K-class produced almost no change (Δ = -0.007, CI includes zero). Masking Ca I on K-class produced exactly zero change. The K-class achieves its 94% recall without measurable causal dependence on three of the canonical neutral-metal diagnostics.
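
Run through the ablation sketch from earlier (and the DIAGNOSTICS mapping above), the three K-class masks would look like this; the class labels are assumed to be plain letter strings, and the window width is again illustrative.

```python
for name in ["Mg b", "Na D", "Ca I"]:
    drop, p = ablation(model, X_test, wavelengths, y_test, cls="K",
                       centres=DIAGNOSTICS[name], half_width=5.0)
    print(f"{name} on K: Δacc = {drop:+.3f}, p = {p:.3f}")
```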

What the K-class reads instead

The per-class SHAP traces give the answer.
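
For reference, a sketch of how such per-class traces are pulled, assuming the scikit-learn LightGBM wrapper; older shap releases return a list of arrays (one per class), newer ones a single 3-D array, so both cases are handled.

```python
import numpy as np
import shap

explainer = shap.TreeExplainer(model)
sv = explainer.shap_values(X_test)
k = list(model.classes_).index("K")
k_sv = sv[k] if isinstance(sv, list) else sv[:, :, k]
mean_abs = np.abs(k_sv).mean(axis=0)      # mean |SHAP| per wavelength bin
top = np.argsort(mean_abs)[::-1][:15]
print(np.sort(wavelengths[top]))          # where the K-class signal lives
```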

The K-class classifier reads Fe I and Cr I multiplets in the mid-5200 Å region (specifically Fe I at 5269.54 Å and Cr I at 5206, 5208 and 5345 Å) plus the wings of Hβ and Hα. None of these are wrong features for K-class diagnosis. Iron and chromium line strengths track temperature in K dwarfs, and Balmer wing depth, while not a textbook K diagnostic, carries real spectral information that a tree ensemble can monotonize against the F-G-K sequence. The model has learned a temperature-correlated proxy, and the proxy works because at signal-to-noise ratios above 20 over 696 wavelength bins, there are many features that monotonize against temperature. A gradient-boosted tree does not need any one of them in particular.

This is partial shortcut learning. The aggregate metric is fine. The model is internally inconsistent: physics-aware on F and G, proxy-driven on K. Without an interventional audit, the gap is invisible.

A second view from a non-ML cross-check

To make sure the finding was not a quirk of the LightGBM model, the same test set was compared against the Pickles 1998 stellar flux library by chi-squared template matching, a non-machine-learning method.
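
A minimal version of that cross-check, assuming the Pickles templates have been resampled onto the survey wavelength grid and keyed by full MK type (e.g. "G2V"), and that per-bin flux uncertainties `sigmas` are available; the free amplitude keeps the fit insensitive to overall flux scale.

```python
import numpy as np

def best_template(flux, sigma, templates):
    """MK type of the template minimizing chi-squared, with a free
    amplitude a fit analytically for each template."""
    best, best_chi2 = None, np.inf
    for mk_type, tmpl in templates.items():
        a = np.sum(flux * tmpl / sigma**2) / np.sum(tmpl**2 / sigma**2)
        chi2 = np.sum(((flux - a * tmpl) / sigma) ** 2)
        if chi2 < best_chi2:
            best, best_chi2 = mk_type, chi2
    return best

# Agreement rate: reduce Pickles types like "G2V" to the leading class letter
# and compare with the classifier's predictions on the same test set.
pickles = [best_template(f, s, templates)[0] for f, s in zip(X_test, sigmas)]
agreement = np.mean(np.array(pickles) == model.predict(X_test))
```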

The two methods agree on 68% of the FGK test spectra. Where they disagree, the pattern is informative. The model over-predicts G against Pickles F templates that carry strong Mg b absorption (it has shifted Mg b weight from F to G in its decision boundary). The model under-recovers Pickles K templates that carry textbook Na D and Ca I depth (it is not reading those lines). Two entirely different methods, the supervised classifier and the chi-squared template fit, land on the same per-class disagreement pattern, so the cross-check corroborates the audit independently.

What I think this means for the field

The methodological contribution is that causal masked-line ablation is a falsifiable per-class audit instrument, sharper than the correlational tools that dominate stellar machine learning interpretability. Permutation importance averages over classes and so masks per-class dependence structure. SHAP is per-class but is a Shapley decomposition of the model output, not a counterfactual on the input. Masking is a direct intervention on the input domain, and the audit produces a single pre-registered number per pair with a confidence interval and a p-value against a random-window null.

The wider implication is that strong aggregate metrics on stellar classifiers are not, on their own, evidence of physics-awareness. The community already knows this in the ML interpretability literature (the term "shortcut learning" goes back to Geirhos et al. 2020). The contribution of this paper is to bring an interventional probe into stellar spectroscopy specifically, with pre-registration, a proper null distribution, and a falsifiable per-class verdict.

Honest limits

The audit covers one classifier (LightGBM), one instrument (FLAMES-UVES), one survey (Gaia-ESO DR5.1), and one diagnostic line list (the canonical MK set). The Gaia-ESO target list concentrates on open clusters and known fields, so spectra are not spatially independent. The leakage-aware cross-validation bounds but does not eliminate this contribution. M-class statistics are too thin in the UVES sample to audit reliably, and the bluer Balmer lines and the Ca II H and K diagnostics fall outside the U580 window. The MK system also encodes luminosity class (dwarf, giant, supergiant) that this paper ignores.

A 3,032-star sample is an order of magnitude smaller than what neighbouring papers (Candebat et al. 2024 on FLAMES-GIRAFFE) work with. The audit instrument generalizes to those larger samples without architectural change because the intervention is defined on the input domain, not on internal model representations.

Why I wrote this paper

I came to this problem by way of a different one. I had a strong-looking classifier and the usual SHAP plots, and I could not bring myself to trust the SHAP plots. They told me which features the model used a lot. They did not tell me which features the model needed. The leap from correlation to intervention is small in the abstract and large in practice: it forces you to write down the prediction before you run the experiment, set the null distribution, and live with the verdict. Two of three pre-registered pairs landed where they were supposed to. The third, Mg b on K, did not, and now I trust the audit.

If you work in stellar machine learning, I would value your views on whether masked-line ablation should become a default reporting standard alongside aggregate metrics, and on what it would take to extend the audit to luminosity class and to the OBA and M ends of the MK sequence.


Read the preprint: arXiv, May 2026
Code: github.com/TirtheshJani/stellar-mk-audit (MIT license)
Trained model and dataset: Zenodo DOI pending
Get in touch: tirtheshjani@gmail.com