Star-Type Classification: A Pipeline I Trust
A six-class stellar classifier in scikit-learn, built around a clean ColumnTransformer pipeline and a reproducible CLI. The artifact is the engineering polish, not the headline accuracy.
There is a version of this project that would be a notebook with a 95-percent accuracy figure at the bottom, a screenshot of a confusion matrix, and a README.md that says "Random Forest classifies stars." That version exists in a thousand repositories and teaches its author very little.
I wanted a different artifact. The dataset is small. The classifier is straightforward. What I cared about was whether I could turn the project into a piece of software a teammate could pick up, run from the command line, retrain on a new CSV, and ship without thinking about it. The model is the easy half. The pipeline around the model is where the engineering lives.
The result is STARTYPE-CLASSIFICATION-: a six-class stellar classifier with a ColumnTransformer preprocessing chain, a Pipeline that bundles preprocessing and model into a single fitted object, two CLI scripts for training and inference, and a smoke test suite that runs in under a second.
The data, honestly
The dataset is Stars.csv. 240 rows, six features, six classes. Every claim in this writeup hinges on those numbers being right, so they go up front.
The features split into four numeric and two categorical:
- Temperature (Kelvin), Luminosity (in solar units, L), Radius (in solar units, R), and Absolute Magnitude (A_M) are continuous physical properties.
- Color is a categorical descriptor (Red, Blue, White, Yellowish, and so on, with the casing inconsistent across rows because real datasets are like that).
- Spectral_Class is the standard OBAFGKM letter grouping.
The target column maps to one of six stellar types: Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Super Giant, Hyper Giant. These correspond to recognisable regions on a Hertzsprung-Russell diagram, which is the part of this dataset that makes it pedagogically nice. A reasonable classifier should align with the physics: hot, dim stars in the white dwarf corner; cool, bright stars in the giants region; the main sequence as the diagonal stripe down the middle.
240 stars is a toy dataset. It is small enough that any sufficiently expressive classifier will reach high accuracy, and small enough that the accuracy figure tells you almost nothing about generalization to a real spectroscopic survey. I did not chase a number I could not defend. The README does not list one. This page does not either.
What the pipeline actually does
The interesting code is in src/startype_classification/pipeline.py. The structure looks like this:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

numeric_features = ["Temperature", "L", "R", "A_M"]
categorical_features = ["Color", "Spectral_Class"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

model = Pipeline(
    steps=[
        # _normalize_color_column is defined earlier in pipeline.py; it
        # collapses the casing and spacing variants in the Color column.
        ("normalize_color", FunctionTransformer(_normalize_color_column)),
        ("preprocess", preprocessor),
        ("classifier", RandomForestClassifier(n_estimators=200, random_state=42)),
    ]
)
There are three deliberate choices in that block.
StandardScaler on the numerics. Random forests do not require feature scaling for the splits themselves. But scaling the inputs before they ever reach the tree means the same fitted preprocessor works if I swap the classifier later for an SVM, a k-NN, or a logistic regression baseline. The Pipeline is the artifact, not the model.
OneHotEncoder with handle_unknown="ignore". This is the one line in the file that matters most for production behaviour. If a future CSV contains a colour the model has not seen ("Whitish-Blue", which is real), the encoder emits zeros for that column instead of crashing. The classifier still produces a prediction. It might be wrong, but the system does not fail. That is the difference between code that runs in a notebook and code that runs in a job. The version of this file before I added the flag did crash, on exactly the kind of input a future user would feed it: a colour string with a hyphen that the training set happened not to contain. That failure mode was the reason the smoke-test suite grew a "predict on unseen-category input" case.
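To make that concrete, here is a standalone sketch (a toy frame, not the project's code) of how the two encoder settings behave on a colour that was never seen during fitting:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Color": ["Red", "Blue", "White"]})
new = pd.DataFrame({"Color": ["Whitish-Blue"]})  # absent from training

# With handle_unknown="ignore", the unseen value encodes as an all-zero row.
safe = OneHotEncoder(handle_unknown="ignore").fit(train)
print(safe.transform(new).toarray())  # [[0. 0. 0.]]

# The default setting raises on exactly the same input.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
except ValueError as exc:
    print(exc)  # "Found unknown categories ..."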
A custom colour normaliser ahead of the encoder. Real stellar catalogues spell "Blue White" four ways: Blue White, Blue-white, Blue white, bluewhite. A FunctionTransformer collapses casing and whitespace before the OneHotEncoder ever sees the column. Without it, the encoder treats spelling variants as distinct categories and fragments the model's view of the data.
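The real _normalize_color_column lives in pipeline.py and is not reproduced here; a minimal sketch of the idea, under the assumption that lowercasing and stripping separators is enough, might look like this:

import pandas as pd

def _normalize_color_column(X: pd.DataFrame) -> pd.DataFrame:
    """Collapse spelling variants: 'Blue White', 'Blue-white', 'Blue white',
    and 'bluewhite' all map to the same token."""
    X = X.copy()
    X["Color"] = (
        X["Color"]
        .str.lower()
        .str.replace(r"[\s\-]+", "", regex=True)  # drop spaces and hyphens entirely
    )
    return X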
The Pipeline wrapping all of this is what makes the rest of the system tidy. model.fit(X, y) and model.predict(new_X) are the only two calls that travel beyond the training script. Preprocessing and inference live behind the same interface, which means the input schema is defined once and enforced everywhere.
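In sketch form, the training side reduces to those two calls (the target column name "Type" is my assumption about Stars.csv; the feature columns are the six listed above):

import pandas as pd

df = pd.read_csv("Stars.csv")
X, y = df.drop(columns=["Type"]), df["Type"]

model.fit(X, y)                        # fits normaliser, encoder, scaler, and forest together
predictions = model.predict(X.head())  # same object, same schema, at inference time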
The CLI
There are two scripts at scripts/, and both exist because I wanted to demonstrate that this project graduates from notebook to tool.
python -m scripts.train --dataset Stars.csv --model models/star_classifier.joblib
python -m scripts.predict --model models/star_classifier.joblib \
--input new_stars.csv --output predictions.csv
train.py reads the CSV, fits the pipeline, prints test-set accuracy, a 5-fold cross-validation summary, and a per-class classification report, then persists the fitted pipeline with joblib.dump. predict.py loads that artifact, accepts a CSV with the same six feature columns, and writes a CSV of predicted classes.
joblib is the choice over pickle because it serializes NumPy arrays more efficiently and is the convention scikit-learn itself uses. The whole fitted object (preprocessing chain plus trained random forest) round-trips through a single file. There is no separate "load the encoder, then load the scaler, then load the model" dance. There is one artifact and one entry point on each side.
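The round trip, in sketch form (new_X stands in for whatever DataFrame predict.py builds from the input CSV):

import joblib

# train.py: persist the whole fitted pipeline -- preprocessing and classifier -- as one file.
joblib.dump(model, "models/star_classifier.joblib")

# predict.py: one load call gives back the full inference chain.
restored = joblib.load("models/star_classifier.joblib")
restored.predict(new_X)  # new_X: a DataFrame with the same six feature columns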
I did not write a Flask service for this. A CLI is the right surface for a classifier you might run as part of a periodic job or a batch pipeline. Web frameworks are useful when there is a user waiting for a synchronous response. There is not, so I left it off.
Why Random Forest, and what I would try next
I picked a 200-tree random forest because it is the right default for this kind of small mixed-feature tabular problem. The alternatives:
- Logistic regression would be the strongest baseline. It is interpretable, fast, and on a six-feature dataset it would not be far behind the forest. I should add it.
- Gradient boosting (XGBoost, LightGBM) is the modern default for tabular and would likely do at least as well. The cost is one more dependency.
- An SVM with an RBF kernel would lean on the StandardScaler I am already running and would be a useful exercise in showing that the Pipeline abstraction makes classifier swaps a one-line change.
The point is that any of those swaps is a one-line change in pipeline.py because the ColumnTransformer upstream of the classifier does not care what comes after it. The cleanliness of the pipeline is what makes those experiments cheap.
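As an illustration, reusing the preprocessor and normaliser defined in the pipeline block above, the logistic regression swap would look something like this:

from sklearn.linear_model import LogisticRegression

model = Pipeline(
    steps=[
        ("normalize_color", FunctionTransformer(_normalize_color_column)),
        ("preprocess", preprocessor),  # unchanged: same scaler, same encoder
        ("classifier", LogisticRegression(max_iter=1000)),  # the only line that moves
    ]
)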
What the tests check
The smoke tests under tests/ are deliberately minimal. They check three things:
- The pipeline fits on the bundled Stars.csv without raising.
- predict on a held-out slice returns labels from the known six-class set.
- The fitted pipeline can be joblib.dump-ed and reloaded, and the reloaded version produces identical predictions.
That third test is the one that catches a whole category of subtle deployment bugs. If the persisted artifact does not reproduce the live model exactly (because of a scikit-learn version mismatch, or an estimator that pickled with a non-deterministic attribute), the test fails immediately at CI time, not in production.
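The actual test files under tests/ may differ; the round-trip assertion, sketched in pytest terms with illustrative fixture names (fitted_model, holdout_X), amounts to this:

import joblib
import numpy as np

def test_roundtrip_identity(tmp_path, fitted_model, holdout_X):
    # fitted_model and holdout_X are assumed fixtures; the names are illustrative.
    path = tmp_path / "star_classifier.joblib"
    joblib.dump(fitted_model, path)
    reloaded = joblib.load(path)
    np.testing.assert_array_equal(
        fitted_model.predict(holdout_X), reloaded.predict(holdout_X)
    )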
What this project taught me
Three things.
The first is that "real engineering" is not about the model. The model is six lines. The engineering is the file boundary between training and inference, the persistence format, the unknown-category handling, the CSV schema being explicit, the test that asserts round-trip identity. None of those are in the random forest.
The second is that the cleanest abstraction in scikit-learn is Pipeline. The library was designed around it, and as soon as I treated the fitted pipeline as the primary artifact (rather than the trained classifier), every downstream concern got simpler. Saving became one call. Loading became one call. Swapping models became one line. The mental model shifted from "I have a classifier and a preprocessor and I have to keep them in sync" to "I have a model artifact, full stop."
The third is that small datasets reward humility. With 240 rows, any number I report on a single train-test split is noise. The cross-validation summary is what I trust. The per-class report is what I read first, because it is where the imbalance lives. Hyper Giants are rare in this dataset. The model can hit 95% overall while being useless on Hyper Giants. The classification report makes that visible. The aggregate accuracy hides it. The same instinct applies to any small clinical-cohort task at work: the headline AUROC across all classes is the number that goes on the slide, and the per-class precision and recall are the numbers that determine whether the model is actually fit for purpose.
Where this fits
The Pipeline-as-artifact pattern from this project carried directly into the Stellar Spectra with Gradient Origin Networks work, where the data engineering surface area is larger and the cost of getting the schema wrong is much higher. The CLI ergonomics carried into the Clinical Note Summarizer, whose scripts/ layout is nearly identical.
A six-class star classifier is not the kind of project that sells anyone a job interview on its own. The point of writing it down is that the pattern underneath it (a ColumnTransformer that defines the schema, a Pipeline that owns the artifact, a CLI that owns the boundary, a test that owns the round-trip) is the same pattern I now use on every supervised learning project regardless of scale. The first place I worked it out cleanly was here.
GitHub: https://github.com/TirtheshJani/STARTYPE-CLASSIFICATION-