AutoML vs LSTM: Forecasting Iowa Liquor Sales
Comparative study on 19.4 million transactions. Hyperopt-sklearn AutoML reached 85 percent accuracy. An LSTM network reached 88 percent. The interesting finding was the trade-off.
The same dataset can be approached two different ways and tell you two different things. This project, completed during the Big Data Analytics program at Georgian College, applied automated machine learning and a deep recurrent neural network to the same Iowa liquor sales forecasting problem. The AutoML approach reached 85 percent accuracy with minimal manual tuning. The LSTM reached 88 percent at the cost of significantly more development time.
The headline takeaway is not which one won. It is what the three-percentage-point gap actually buys, and whether the team that has to keep the model alive can afford it. I came into the project assuming the deep model would win on every axis worth caring about. By the end I was no longer sure I would have chosen it.
The dataset
Iowa is one of seventeen US states that operate as alcoholic beverage control states, meaning all liquor sales pass through state-controlled distribution. The Iowa Department of Commerce publishes the resulting transaction data openly. The full dataset spans 2012 onward and contains over 19 million records with more than 24 features per transaction: store identification, geographic location, product category, brand, vendor, sale price, volume, transaction date, and more.
The forecasting question that anchored the project was a practical one for retail planning. Given historical sales patterns, predict December sales by store and product category. December is the highest-revenue month in the calendar for Iowa liquor retailers, accounting for roughly 15 to 20 percent of annual revenue, so a forecast that misses by even small margins translates into real dollars in inventory and stockout costs.
Two characteristics of the dataset made it well suited to this kind of comparison. The state-controlled distribution structure produces unusually clean data with few of the duplications and misclassifications common in private retail datasets. And the multi-year coverage gives any time-series approach enough history to actually train on.
Approach 1: AutoML with hyperopt-sklearn
The first model used hyperopt-sklearn, an automated machine learning framework that combines algorithm selection and hyperparameter tuning. The search space wraps the usual scikit-learn estimators (random forests, gradient-boosted trees, support vector machines, k-nearest neighbours, and several smaller models) and uses Tree-structured Parzen Estimator (TPE) optimization to sample hyperparameter configurations. TPE is the Bayesian-optimization variant the underlying Hyperopt library ships; the scikit-learn wrapper itself is the one described by Komer, Bergstra, and Eliasmith. It is much smarter than grid search and slightly more frustrating to debug, because what you can inspect is the trial history rather than a single tuning trace.
I sampled 50,000 records stratified by month and store class so the search would not over-fit a handful of high-volume Des Moines outlets. The pipeline ran over a few evenings on a laptop. The framework converged on a gradient-boosting ensemble, and the regularization parameters it picked were close to what I would have set by hand. Final test accuracy: 85 percent on the December classification task.
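For orientation, the search setup looks roughly like the sketch below. It assumes the prepared feature matrices and volume-bucket labels have already been split on the holdout window described in the method notes; the eval count and per-trial timeout are illustrative, not the values from the actual run.

```python
# Sketch of the hyperopt-sklearn search, assuming X_train/X_test and
# volume-bucket labels y_train/y_test from the holdout-window split.
# max_evals and trial_timeout are illustrative.
from hpsklearn import HyperoptEstimator, any_classifier, any_preprocessing
from hyperopt import tpe

estim = HyperoptEstimator(
    classifier=any_classifier("clf"),        # search over the wrapped scikit-learn estimators
    preprocessing=any_preprocessing("pre"),  # optional scaling / decomposition steps
    algo=tpe.suggest,                        # Tree-structured Parzen Estimator
    max_evals=100,                           # number of sampled configurations
    trial_timeout=300,                       # seconds before an individual trial is abandoned
)
estim.fit(X_train, y_train)

print(estim.score(X_test, y_test))  # held-out accuracy
print(estim.best_model())           # winning estimator and its hyperparameters
```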
The case for AutoML is the development economics. The full pipeline (feature preparation, model search, evaluation, persistence) took a few hours of wall-clock time and almost no engineering hours past the data prep. Most of the human time went into deduplication and category harmonization, not modeling. The output was a deployable artifact with documented hyperparameters, feature importances, and a clear performance baseline. A retail analyst with limited deep-learning experience could realistically maintain this pipeline. Re-tuning quarterly is a one-line script.
The case against AutoML is what it could not see. The framework treats the problem as a structured prediction task on tabular features. It does not natively understand that the rows are ordered in time. Whatever temporal patterns exist in the data have to be encoded as features (lag variables, rolling means, seasonal indicators) before the model can use them. I built lag features at 1, 4, and 12 weeks plus rolling 30- and 90-day means, which is the conventional starting kit. That encoding limits how much temporal subtlety the model can learn. Anything that depends on the trajectory of recent sales rather than its summary statistics is invisible.
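A minimal sketch of that encoding, assuming a weekly table with store, category, week, and volume columns; the column names and the weekly conversion of the 30- and 90-day windows are my illustration, not the project's exact code.

```python
# Sketch of the lag-feature encoding on a weekly table with
# ["store", "category", "week", "volume"]; names are illustrative.
import pandas as pd

def add_lag_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(["store", "category", "week"]).copy()
    grp = df.groupby(["store", "category"])["volume"]

    # Lag variables: volume 1, 4, and 12 weeks earlier
    for lag in (1, 4, 12):
        df[f"volume_lag_{lag}w"] = grp.shift(lag)

    # Rolling means over roughly 30 and 90 days (4 and 13 weekly observations),
    # shifted by one week so the current week never leaks into its own feature
    for window in (4, 13):
        df[f"volume_roll_{window}w"] = grp.transform(
            lambda s: s.shift(1).rolling(window).mean()
        )
    return df
```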
Approach 2: LSTM in TensorFlow
The second model was a Long Short-Term Memory network built in TensorFlow with Keras. LSTM cells are designed for sequential data: they maintain an internal cell state that is updated and selectively forgotten through input, forget, and output gates as the network processes each step in a time series. The architecture's strength is exactly the limitation that constrained the AutoML approach: where the gradient-boosting model saw a row of summary features, the LSTM saw the actual sequence.
Input shape was (batch, timesteps, features): each store-category pair contributed one sample per training week, the lookback window was the previous 26 weeks, and the features were weekly volume, weekly revenue, week-of-year, and a holiday indicator. The architecture was a standard configuration: two stacked LSTM layers (64 then 32 units) with dropout at 0.2 between them and a dense projection layer. The loss was mean squared error for the regression formulation and cross-entropy when the prediction was discretized into volume buckets, so the comparison with AutoML was apples to apples. Training used Adam at the Keras default learning rate and ran for around 50 epochs with early stopping on validation loss. The hyperparameter search was manual: I tried unit counts of 32, 64, and 128 in each layer, two and three stacked layers, and dropout from 0.1 to 0.3.
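For the classification framing, which is the one the accuracy comparison uses, the model looks roughly like the sketch below. The unit counts, dropout, lookback, and early stopping match the description above; the projection width, batch size, and patience are illustrative assumptions.

```python
# Sketch of the LSTM classifier, assuming inputs already shaped
# (samples, 26, n_features). Projection width, batch size, and
# patience are illustrative, not the project's exact values.
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_lstm(n_features: int, n_buckets: int) -> tf.keras.Model:
    model = models.Sequential([
        tf.keras.Input(shape=(26, n_features)),    # 26-week lookback window
        layers.LSTM(64, return_sequences=True),    # first stacked LSTM layer
        layers.Dropout(0.2),
        layers.LSTM(32),                           # second layer returns the final state only
        layers.Dropout(0.2),
        layers.Dense(16, activation="relu"),       # dense projection
        layers.Dense(n_buckets, activation="softmax"),  # volume buckets
    ])
    model.compile(
        optimizer="adam",                          # Keras default learning rate
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

early_stop = callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=50, batch_size=256, callbacks=[early_stop])
```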
The final LSTM reached 88 percent accuracy on the same December prediction task. The three-percentage-point improvement over the AutoML model was not uniform across the population. Most of the gain concentrated in stores whose recent weeks showed acceleration or deceleration trends rather than steady-state demand. The AutoML model could not capture that linkage cleanly because the lag-feature encoding flattened a trajectory ("sales accelerating week over week") into a single rolling mean. The LSTM, with the actual 26-week sequence available, learned the shape.
The trade-off, plainly
Three percentage points sounds small. In context, it is not nothing.
Iowa's annual liquor sales are roughly $400 million USD across the regulated channel. December represents 15 to 20 percent of that, call it $70 million. A forecasting improvement of three percentage points across that base translates into something in the low millions of dollars of better inventory positioning, depending on how the model's predictions are operationalized. For a single state's liquor sector, that is meaningful.
But the LSTM is more expensive to maintain. It requires GPU infrastructure for retraining. The hyperparameters are not auto-tuned; whoever owns the model has to know what they are doing. Failures during training are harder to diagnose. A retail planning team that lacks the in-house deep-learning skills to support that operational burden may genuinely be better off with the AutoML solution and accepting the three-percentage-point gap.
The right model is the one that the operating team can actually maintain in production at the level of performance the business needs. That answer changes by organisation.
Method notes
The data preparation work was the dominant time sink for both approaches. Iowa's transaction data needed deduplication, store-level aggregation, category harmonization, and outlier handling before any modeling could start. The same prepared dataset was used for both approaches, which is the only honest way to compare them.
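In rough pandas terms, the preparation looked something like this sketch; the column names, the category map, and the percentile cap for outliers are stand-ins rather than the project's exact choices.

```python
# Sketch of the preparation steps on raw transactions; column names,
# the category_map argument, and the outlier rule are illustrative.
import pandas as pd

def prepare(raw: pd.DataFrame, category_map: dict[str, str]) -> pd.DataFrame:
    df = raw.drop_duplicates(subset="invoice_id")               # deduplication
    df["category"] = df["category_name"].map(category_map)      # harmonize category labels
    df["week"] = pd.to_datetime(df["date"]).dt.to_period("W")   # weekly buckets

    weekly = (
        df.groupby(["store", "category", "week"], as_index=False)
          .agg(volume=("volume_sold", "sum"), revenue=("sale_dollars", "sum"))
    )

    # Simple outlier handling: cap weekly volume at the 99.5th percentile per category
    cap = weekly.groupby("category")["volume"].transform(lambda s: s.quantile(0.995))
    weekly["volume"] = weekly["volume"].clip(upper=cap)
    return weekly
```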
The evaluation framework used a holdout window approach: training on data up to October, validating on November, and testing on December. This avoids the standard pitfall of cross-sectional accuracy evaluation, which on time-series data overstates performance because it does not respect temporal ordering.
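The split itself is a few lines once the weekly table exists; a sketch, assuming a datetime column for the week start and with the year left as a parameter:

```python
# Sketch of the holdout-window split: train through October,
# validate on November, test on December.
import pandas as pd

def holdout_split(df: pd.DataFrame, year: int):
    nov = pd.Timestamp(year=year, month=11, day=1)
    dec = pd.Timestamp(year=year, month=12, day=1)
    train = df[df["week_start"] < nov]                              # everything up to October
    val = df[(df["week_start"] >= nov) & (df["week_start"] < dec)]  # November
    test = df[df["week_start"] >= dec]                              # December
    return train, val, test
```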
The accuracy numbers reported here are from the test set. Both models were evaluated on the same data with the same metric, so the comparison is direct.
What surprised me, and what I got wrong
A few things did not go the way I expected.
The first was that the gradient-boosting model picked by hyperopt-sklearn was harder to beat than I had assumed. Going in, I treated AutoML as the floor and the LSTM as the ceiling. Three percentage points of accuracy is real, but it is not the canyon I expected for a sequence model on a sequence problem. Most of the AutoML model's loss against the LSTM was concentrated in maybe 10 percent of the store-category pairs; on the rest, the two models disagreed by less than a percentage point.
The second was that my first LSTM configuration was worse than the AutoML model. I had used a 12-week lookback because that was a tidy quarter, and the model could not see far enough back to learn the cross-quarter shape it needed. Stretching the lookback to 26 weeks, an honest guess based on plotting autocorrelation rather than a principled hyperparameter choice, was what unlocked the lift. That is the kind of decision a manual deep-learning workflow forces you to think about and an AutoML pipeline does not.
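The autocorrelation check was nothing more sophisticated than a plot; a sketch of it, assuming weekly_volume is a pandas Series for a single store-category pair:

```python
# Sketch of the autocorrelation check behind the 26-week lookback choice.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

fig, ax = plt.subplots(figsize=(8, 3))
plot_acf(weekly_volume, lags=52, ax=ax)  # a full year of weekly lags
ax.set_title("Weekly volume autocorrelation")
plt.show()
```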
The third was that data leakage almost ruined the comparison. My initial split was a random 80/20, which is the wrong call for time-series data: future weeks ended up in the training set, and the LSTM looked artificially strong. Switching to a holdout window (train through October, validate on November, test on December) dropped both numbers but made the comparison real. I leave it on this page as a flag, because it is the single mistake I see most often in posts that compare time-series models.
What this taught me
This was the project that crystallized for me where deep learning is worth the operational cost and where it is not.
The boring answer is also the right one: it depends on the patterns in the data and the team that has to operate the model. For sales forecasting on a relatively well-behaved retail dataset, the AutoML approach captures most of the value at a fraction of the engineering and operational cost. The deep learning model captures real incremental signal, but the incremental value has to clear the incremental cost. If I were advising an Iowa retail planning team I had just met, I would ship the AutoML pipeline first and add the LSTM as an ensemble component for the store-category pairs where it actually moves the residuals.
In healthtech work I do at metricHEALTH, the same calculus applies. Some clinical signals are well captured by classical machine learning approaches operated by analytical staff. Others (clinical text summarization, complex temporal patterns in patient deterioration) genuinely need the modeling sophistication that deep learning provides. The skill is not preferring one over the other. It is matching the architecture to the problem and the team that has to keep it running. The team-fit question is the part I underweighted before this project and now treat as a first-class input to the model-selection decision.
References
- Iowa Liquor Sales Data, Iowa Department of Commerce. Public release.
- Komer, B., Bergstra, J., and Eliasmith, C. Hyperopt-Sklearn: Automatic Hyperparameter Configuration for Scikit-Learn, SciPy 2014; plus the hyperopt-sklearn documentation.
- Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory, Neural Computation, 1997.
GitHub (collaborator repository with the full project including LSTM model): https://github.com/ruwzeta/alcholsalespredictioniowa