Stellar Spectra with Gradient Origin Networks

A cross-survey deep generative model for stellar atmospheres. The data engineering across APOGEE DR17, GALAH DR3, and Gaia-ESO DR4 is the harder half of the work, and the half that already ships.

Python · PyTorch · Astropy · Gradient Origin Networks · HDF5 · Parquet · DVC · APOGEE DR17 · GALAH DR3 · Gaia-ESO DR4

The version of this project I want to write a year from now leads with a results table: cross-survey reconstruction fidelity, a confusion plot of identified anomalies, a scatter of physical parameters recovered from the latent space. That table does not exist yet. The training run that would produce it is the next phase.

What does exist, and what this page leads with, is the data engineering. Three of the largest spectroscopic surveys on Earth disagree about almost everything except the stars they happen to share. APOGEE DR17 looks at near-infrared light through fiber spectrographs in New Mexico. GALAH DR3 looks at four narrow optical bands from the Anglo-Australian Telescope. Gaia-ESO DR4 spans the optical at very high resolution from the VLT in Chile. They overlap on tens of thousands of stars, but the data products look entirely different. Getting them into a common format with comparable noise properties is most of the work, and it is the half that determines whether any model on top of it is trustworthy.

Phases 1 (data collection) and 2 (preprocessing) are complete. Phase 3 (training at scale) is in progress. The repository ships the data engineering, the model architecture in PyTorch, and a smoke test suite. When the training run produces results, this page will be updated with reconstruction quality, cross-survey transfer accuracy, and any anomaly detection findings. Until then, the contribution is the pipeline.

Why this is mostly a data engineering problem

Spectroscopy is how astronomy reads chemical fingerprints. Each absorption line in a star's spectrum corresponds to a specific element's electronic transitions; the depths and shapes of those lines encode temperature, surface gravity, metallicity, and motion. Modern surveys have collected millions of these spectra. The bottleneck has shifted from observing to interpreting at scale.

The dominant interpretive tools are physics-based codes that solve radiative transfer through model stellar atmospheres. They work, but they are computationally expensive and brittle outside the parameter grids they were built on. Discriminative ML models that map spectra directly to labels (effective temperature, surface gravity, [Fe/H]) are now standard supplements, but they only solve the forward problem.

Generative models open different doors:

  • Imputation. A spectrum with a detector gap or a cosmic-ray hit can be filled in by sampling from the learned distribution.
  • Anomaly detection. Spectra with low likelihood are candidates for unusual stellar types: chemically peculiar stars, binaries, or things nobody has labeled yet.
  • Cross-survey transfer. A latent representation learned across instruments lets predictions trained on one survey transfer to another with minimal retraining.
  • Calibrated uncertainty. Sampling from the latent distribution gives confidence on derived parameters in a way physics-based codes do not.

The mature comparison points are VAEs (used on SDSS galaxy spectra) and normalizing flows. GONs sit in a middle space: parameter-efficient, encoder-free, amenable to physics-informed design.

The three surveys, and why the union matters

GALAH, APOGEE, and Gaia-ESO map the Milky Way from three different angles. The value of unifying them is exactly that the angles are different.

APOGEE DR17 (Apache Point Observatory Galactic Evolution Experiment) takes near-infrared spectra from 1.514 to 1.696 microns at R approximately 22,500. Infrared light penetrates the dust that obscures the inner galaxy from optical surveys, so APOGEE reaches the bulge and inner disk. DR17 contains roughly 657,000 stars, mostly red giants observable at large distances.

GALAH DR3 (Galactic Archaeology with HERMES) surveys the southern sky at R approximately 28,000 across four narrow optical bands between 471 and 789 nanometers. DR3 contains hundreds of thousands of dwarfs and subgiants within roughly 2 kiloparsecs of the Sun, with up to 30 elemental abundances measured per star.

Gaia-ESO DR4 (UVES) provides very-high-resolution optical spectra at R approximately 47,000, with chemical abundances measured for a smaller but more precisely characterized stellar sample.

The three surveys see overlapping populations of stars at different resolutions, in different wavelength regimes, with different systematic offsets. Cross-calibration between them is an open problem in the field. The cross-match in this project yields roughly 30,000 stars present in all three surveys with quality flags satisfied. That is the dataset the GON gets to learn on.

The data pipeline, in detail

This is the part of the repo I am most willing to defend. Six pieces, each addressing a specific failure mode I hit on an earlier attempt.

Survey-specific FITS readers. Each survey ships its data in a different FITS layout. APOGEE has apStar (combined) and apVisit (single-visit) files. GALAH has four-camera files that need stitching. Gaia-ESO Phase-3 UVES has its own quirks. The repository contains tested readers for each, with a uniform return type so downstream code does not need to know which survey it came from.
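The uniform return type can be sketched as a small dataclass. The names here are illustrative, not the repository's actual API, but they show the contract downstream code relies on:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Spectrum:
    """Survey-agnostic container every reader returns (illustrative names)."""
    survey: str             # "apogee" | "galah" | "gaia_eso"
    star_id: str
    wavelength: np.ndarray  # angstroms, native grid
    flux: np.ndarray
    ivar: np.ndarray        # inverse variance per pixel
    mask: np.ndarray        # True where the pixel is trusted


def validate(spec: Spectrum) -> bool:
    # downstream code only ever checks this contract, never the survey
    n = spec.wavelength.shape[0]
    return all(a.shape == (n,) for a in (spec.flux, spec.ivar, spec.mask))
```

Each survey-specific reader handles its own FITS quirks internally and emits this one shape, so the resampler and the model never branch on the source survey.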

Resumable parallel downloads. APOGEE DR17 alone is hundreds of gigabytes. The download client uses a job queue, hashes the local files for verification, and resumes interrupted downloads from where they stopped. This is unglamorous infrastructure that determines whether the project is reproducible by anyone other than me.

Cross-match pipeline. Building the common-star catalogue requires matching coordinates from APOGEE (the seed catalogue) against GALAH and Gaia-ESO. Coordinate systems differ across surveys, epochs need proper-motion correction, and one-to-many matches are common. The current pipeline uses spatial indexing to keep cross-match runtime tractable across hundreds of thousands of stars; the indexing helper is adapted from Henry Leung's astroNN cross-match utilities, with credit preserved in the source.
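The indexing idea, independent of the astroNN helper the repo actually uses, can be sketched with a k-d tree on unit vectors. Converting an angular radius to a 3D chord length sidesteps the pole and RA-wraparound problems a naive (RA, Dec) tree would have:

```python
import numpy as np
from scipy.spatial import cKDTree


def radec_to_xyz(ra_deg, dec_deg):
    """Unit vectors on the celestial sphere for each (RA, Dec) in degrees."""
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.column_stack(
        [np.cos(dec) * np.cos(ra), np.cos(dec) * np.sin(ra), np.sin(dec)]
    )


def crossmatch(ra1, dec1, ra2, dec2, radius_arcsec=1.0):
    """Nearest-neighbour match of catalogue 1 against catalogue 2.

    Returns (idx, matched): idx[i] indexes catalogue 2; matched[i] is
    False when no star lies within the radius. O(N log N), not O(N^2).
    """
    theta = np.radians(radius_arcsec / 3600.0)
    chord = 2.0 * np.sin(theta / 2.0)   # angular radius -> 3D chord length
    tree = cKDTree(radec_to_xyz(ra2, dec2))
    dist, idx = tree.query(radec_to_xyz(ra1, dec1),
                           distance_upper_bound=chord)
    return idx, np.isfinite(dist)
```

Proper-motion correction to a common epoch happens before this step; the tree only sees epoch-aligned coordinates.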

Common-grid resampling. Each spectrum is resampled onto a log-lambda grid at R approximately 10,000 across roughly 3,500 to 17,000 angstroms. Log-lambda is the right grid because radial-velocity shifts become uniform translations along the wavelength axis, which makes the model architecture simpler downstream. A linear-lambda grid would force the model to learn a position-dependent shift operator, which is a much harder thing to ask of a generative model trained on tens of thousands of spectra. Multiple continuum normalization methods are supported (Gaussian smoothing, polynomial fits, running medians) so I can ablate the choice.
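The uniform-translation property is easy to verify numerically. This sketch uses the APOGEE band endpoints from above; the function names are illustrative:

```python
import numpy as np


def loglam_grid(lam_min, lam_max, R):
    """Grid with constant step 1/R in ln(lambda): constant velocity step."""
    n = int(np.floor(np.log(lam_max / lam_min) * R)) + 1
    return lam_min * np.exp(np.arange(n) / R)


grid = loglam_grid(15140.0, 16960.0, 10000.0)

# A radial velocity v shifts lambda -> lambda * (1 + v/c). On this grid
# that is the SAME pixel offset at every wavelength:
c = 299792.458                 # speed of light, km/s
v = 30.0                       # km/s, a typical stellar radial velocity
shift_pix = np.log(1.0 + v / c) * 10000.0   # ~1 pixel, everywhere
```

On a linear-lambda grid the same velocity would shift the blue end by fewer pixels than the red end, which is exactly the position-dependent operator the model would otherwise have to learn.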

Telluric and detector-gap masking. Earth's atmosphere imprints absorption features that are not properties of the star. Detectors have physical gaps. Both are propagated as masks alongside each spectrum so the loss can be weighted accordingly.
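The weighting itself is simple; a NumPy stand-in for the masked reconstruction loss (the training code does the equivalent in PyTorch):

```python
import numpy as np


def masked_mse(pred, target, good_mask):
    """Reconstruction loss that ignores telluric and detector-gap pixels.

    good_mask is True where the pixel measures the star; masked pixels
    contribute zero loss, so the model is never penalized for them.
    """
    w = good_mask.astype(float)
    return float(np.sum(w * (pred - target) ** 2) / max(np.sum(w), 1.0))
```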

Storage layer. Regridded spectra go into compressed HDF5 for fast random access during training. Native-resolution spectra go into PyArrow Parquet because they are ragged across surveys. Both layers are tracked with DVC so that data and code stay in sync. A teammate who clones the repo and runs dvc pull gets the same intermediate artifacts I have on my machine, without having to re-download hundreds of gigabytes from three observatories.

The pyproject.toml carries the tooling, a pytest smoke suite covers the readers and the resampler, and a design PDF in docs/ walks through the methodology end to end.

Why a Gradient Origin Network

GONs (Bond-Taylor and Willcocks, ICLR 2021) replace the encoder in a standard autoencoder with a single optimization step. To get the latent representation z for an observed spectrum x, you start at z = 0 and take one gradient step on the negative log-likelihood with respect to z. That step is the encoding.
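A minimal numerical sketch of that encoding step, with a random linear map standing in for the real decoder so the gradient can be written analytically (the actual model is a nonlinear coordinate network and uses autograd):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_latent = 100, 8
W = rng.normal(size=(n_pix, n_latent))   # stand-in linear "decoder"


def decode(z):
    return W @ z


def gon_encode(x, lr=0.001):
    """One gradient step on ||decode(z) - x||^2 from z = 0 IS the encoding."""
    z0 = np.zeros(n_latent)
    grad = 2.0 * W.T @ (decode(z0) - x)  # analytic gradient at the origin
    return z0 - lr * grad


x = decode(rng.normal(size=n_latent))    # a "spectrum" the decoder can hit
z = gon_encode(x)                        # no encoder network anywhere
```

Training then adjusts the decoder so that this single step from the origin lands near the latent that best reconstructs each spectrum.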

There are two reasons to care about this for spectroscopy.

Parameter efficiency. A standard autoencoder needs both an encoder and a decoder, roughly doubling the parameter count. GONs need only the decoder. For high-dimensional spectra with thousands of pixels, this halves the model size at no obvious cost in expressivity.

Latent inference is a single optimization, not a learned forward pass. That sounds like a downside (it costs more at inference time) but it has a useful side effect: the same machinery that does encoding can also do imputation, radial-velocity estimation, and parameter inference. The latent z that minimizes reconstruction loss is also the latent z that best explains partial or shifted data. One model, multiple uses.

The original GON paper worked on natural images and small synthetic datasets. I know of no published precedent for applying them to astronomical spectra. The architectural adaptations needed for this domain (Fourier feature encodings, SIREN-style coordinate networks that treat wavelength as a continuous input, line-window-weighted loss functions that emphasize the absorption features the physics actually depends on) are part of the contribution.

What an earlier failure taught the current attempt

This project began earlier as a smaller-scope effort using only two surveys. That earlier work, which I have since archived as Deep-Generative-Model-Stellar-Spectra, taught me three things that shaped the current scope.

Astropy's memory model is unforgiving. FITS files are memory-mapped by default, and the memmap handles can outlive hdul.close(). Processing thousands of spectra sequentially leaks memory until the process is killed. Nothing in the documentation prepares you for this; I lost an evening watching a script consume increasing amounts of memory and crash before I traced the leak to memmap handles surviving past the with block I had wrapped them in. The pattern that finally worked: copy the array I need into an in-memory NumPy buffer, del the FITS HDU's data explicitly, and call gc.collect() periodically. Any pattern that relies on Python's garbage collector doing the right thing across thousands of FITS reads will eventually stop working.

Cross-matching is harder than it looks. A naive O(N²) pairwise match works for thousands of stars and dies for hundreds of thousands. Spatial indexing is the only thing that keeps the pipeline tractable, and proper-motion correction across epochs is not optional if you care about the matches at fainter magnitudes.

Architectural ambition needs to follow data engineering competence. The earlier project tried to train a custom GON on stellar spectra before the data pipeline was reliable. That order is wrong. Without quality-controlled, comparable spectra in a stable format, the model has nothing reliable to learn from. The current project front-loaded the data work, and the model architecture is now the limiting factor rather than the data plumbing. That is the right state to be in. The specific failure that taught me this was a training run that converged smoothly on what I later realized was misaligned wavelength grids: the model had learned to reconstruct an artifact of my resampler, not the underlying spectra. The reconstruction loss curve looked excellent and the science was meaningless. The lesson translated almost word-for-word into the pipeline-first instinct I now hold for any ML project at work.

The earlier repository is archived. This one supersedes it.

What is next

Phase 3 is the model training run at scale. The architecture is implemented in PyTorch with Fourier and SIREN coordinate encoders, empirical-Bayes latent inference, optional gradient-based radial-velocity inference, and line-window-weighted reconstruction losses. The smoke tests pass. The bottleneck is GPU compute time on roughly 30,000 spectra at the chosen resolution, which is an infrastructure problem, not a research one.

When the training results are in, this page will fill in. Reconstruction quality first, then cross-survey transfer experiments (train on APOGEE, test on GALAH overlap), then an anomaly-detection pass over the full APOGEE catalogue. The honest framing of "in progress" stays until the numbers are real.

References

GitHub: https://github.com/TirtheshJani/StellarSpectraWithGONS