From Physics PDEs to PyTorch: What Experimental Physics Taught Me About Debugging ML

May 3, 2026 · 7 min read

A physics undergrad spends a year staring at noisy oscilloscope traces and mis-zeroed polarizers. Years later, the same habits keep ML models honest. The transferable lessons no MOOC teaches.

Machine Learning · Physics · Debugging · Engineering Practice

In my third year of physics, I spent sixteen hours a week in a teaching lab where the equipment was older than I was. A Geiger-Müller counter that ticked at the cosmic background even when the source was in the next room. A Michelson interferometer that drifted if someone slammed a door two floors down. A Franck-Hertz tube whose anode current curve looked clean only if you remembered to wait twenty minutes for the oven to stabilize. By the end of the year I had taken about a thousand pages of lab notes, fit dozens of curves, and developed a deep, almost spiritual mistrust of any dataset that looked too clean.

That mistrust is the most useful thing I brought into machine learning.

ML practitioners and experimental physicists do, at the end of the day, the same job. We try to extract a signal from a noisy process using imperfect instruments. The instruments differ. The bug taxonomy does not. The PyTorch error messages are friendlier than a rubidium lamp that refuses to ignite, but the failure modes underneath are the same family. Below are four habits that physics drilled into me, each of which has paid for itself many times over in production ML.

Habit 1: Sanity-Check Your Instruments Before Your Data

The first thing you learn working with a Geiger-Müller counter is that the counter has a dead time. After a particle is detected, the tube is electrically blind for a brief window. If your source is hot enough, you start undercounting in a systematic, non-obvious way. Plot the count rate against source strength and you see the curve flatten where physics says it should keep climbing. The "data" is wrong because the instrument is lying. The fix is not statistical; you do not get to average your way out of dead-time saturation. You either correct for it analytically with the known dead-time constant, or you back the source off and accept fewer counts.
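For the curious, the analytic correction really is a one-liner. This is a sketch assuming the usual non-paralyzable detector model; the function and argument names are mine, not anything standard:

```python
def correct_dead_time(measured_rate_hz: float, dead_time_s: float) -> float:
    """Recover the true count rate from a rate measured by a detector that is
    blind for dead_time_s seconds after each count (non-paralyzable model):
    n_true = n_measured / (1 - n_measured * tau)."""
    return measured_rate_hz / (1.0 - measured_rate_hz * dead_time_s)
```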

In ML, the equivalent mistake is tweaking model architecture before you have verified the data loader. You start adding attention heads because validation loss is flat, when in fact the loader is silently dropping every fifth batch because of an off-by-one in the sampler. Or your loss curve looks reasonable but the gradients in layer three are zero, because someone froze the wrong parameters and nobody noticed. I once spent two days re-tuning a learning-rate schedule on a model whose tokenizer was, it turned out, lower-casing the inputs and then comparing them against a case-sensitive vocabulary. About half the tokens were <unk>. The model was learning to predict the prior. The architecture was fine. The instrument was lying.

The physics fix and the ML fix are the same. Before you trust the experiment, characterize the apparatus. Run the loader for ten batches and visualize the inputs by hand. Plot the gradient norm per layer for the first hundred steps. Overfit on a single batch and confirm the loss goes to near zero. If those don't pass, no amount of architectural cleverness will save you, because the architecture is not the problem.
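In PyTorch, the whole checklist fits in a short function. This is a sketch rather than a drop-in utility; model, loader, and loss_fn stand in for whatever you actually have, and the step counts and print cadence are arbitrary:

```python
import torch

def sanity_check(model, loader, loss_fn, lr=1e-3, steps=200):
    # 1. Look at the raw batches the loader actually produces.
    for i, (x, y) in enumerate(loader):
        print(i, tuple(x.shape), x.dtype, float(x.min()), float(x.max()), y[:8])
        if i == 9:
            break

    # 2. Overfit a single batch: loss should drop toward zero if the plumbing works.
    x, y = next(iter(loader))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()

        # 3. Per-layer gradient norms: frozen or dead layers show up as zeros.
        if step % 50 == 0:
            for name, p in model.named_parameters():
                g = 0.0 if p.grad is None else p.grad.norm().item()
                print(f"step {step:4d} {name:40s} grad_norm={g:.3e}")
            print(f"step {step:4d} loss={loss.item():.4f}")

        opt.step()
```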

Habit 2: Always Plot the Residuals

In an electron spin resonance (ESR) experiment, you sweep a magnetic field and watch microwave absorption. The absorption peak is roughly Lorentzian, and you fit a Lorentzian to it. The fit will tell you it is excellent. The R-squared will be 0.998. You will be tempted to write down the linewidth and call it a day.

The temptation is wrong. What you do is plot the residuals: the data minus the fit, against the field. If the residuals are noise, your model captures the physics. If the residuals carry structure, a slope, a wave, a periodic ripple, your model is missing something. In ESR, that structure is usually a baseline drift from the lock-in amplifier, or a hyperfine splitting your simple Lorentzian cannot represent. The model fits well on average and badly in the places that matter. I still remember my TA pointing at a residual plot of mine that looked, to me, like static, and saying "look at it sideways." Rotated ninety degrees, the periodic structure was obvious. Once you know the eye-trick, you cannot unsee it.
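The mechanics are nothing special, which is part of why the step gets skipped. Here is roughly what the ritual looks like, with synthetic data standing in for the real sweep and arbitrary starting guesses for the fit:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def lorentzian(B, A, B0, gamma, c):
    return A * gamma ** 2 / ((B - B0) ** 2 + gamma ** 2) + c

# Synthetic stand-in for the swept field and measured absorption.
rng = np.random.default_rng(0)
field = np.linspace(-10.0, 10.0, 400)
absorption = lorentzian(field, 1.0, 0.3, 1.5, 0.05) + 0.01 * rng.standard_normal(field.size)

popt, _ = curve_fit(lorentzian, field, absorption, p0=[1.0, 0.0, 1.0, 0.0])
residuals = absorption - lorentzian(field, *popt)

# The fit will look excellent; any structure lives in the bottom panel.
fig, (ax_fit, ax_res) = plt.subplots(2, 1, sharex=True)
ax_fit.plot(field, absorption, ".", markersize=2, label="data")
ax_fit.plot(field, lorentzian(field, *popt), label="fit")
ax_fit.legend()
ax_res.plot(field, residuals, ".", markersize=2)
ax_res.axhline(0.0, linewidth=0.5)
ax_res.set_xlabel("field")
ax_res.set_ylabel("residual")
plt.show()
```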

ML accuracy is the R-squared of the practitioner. It is a single number that summarizes how the model does on average. It hides the fact that the model is 99 percent accurate on the easy 90 percent of the population and 50 percent accurate on the hard 10 percent who happen to be the entire reason you built the model. Plot the residuals. Bucket your errors by feature, by subpopulation, by time of day, by hospital site, by anything you can think of. On a clinical model I worked on, the aggregate AUC looked acceptable, but the AUC in the youngest age bucket, where the training data was thinnest, dropped by enough to make the model unusable for that population specifically. The aggregate looked fine. The residual plot did not. The interesting bug is never in the average; it is in the residual structure.
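The tooling for this is nothing more than a group-by. A sketch, assuming a DataFrame of predictions with score and label columns plus whatever grouping columns you can assemble; the column names here are placeholders:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_by_bucket(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    rows = []
    for bucket, sub in df.groupby(group_col):
        if sub["label"].nunique() < 2:
            continue  # AUC is undefined on a single-class bucket
        rows.append({
            "bucket": bucket,
            "n": len(sub),
            "auc": roc_auc_score(sub["label"], sub["score"]),
        })
    return pd.DataFrame(rows).sort_values("auc")

# The aggregate number hides exactly the buckets you care about:
# print(auc_by_bucket(preds, "age_bucket"))
# print(auc_by_bucket(preds, "site"))
```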

Habit 3: Trust the Dimensions

Dimensional analysis is a physicist's first line of defense. If you derive a formula for the period of a pendulum and the units come out in meters per second, you have made an algebra mistake somewhere, and you do not need to find it to know it is there. The dimensions told you. You can throw the line away and start the derivation again with confidence that the next attempt will be different.

In ML, the equivalent is shape and unit checking, and almost no one does enough of it. The number of bugs I have seen that resolve to "we passed logits where the loss expected probabilities," or "we computed a per-token loss but reduced it as if it were per-sequence," or "we summed gradients across a batch dimension that was supposed to be averaged" is not small. The model trains. The loss goes down. The numbers look plausible. And the gradients are off by a factor of batch_size, which is exactly the kind of bug that disappears into a learning-rate adjustment and never gets caught. The worst version of this I have personally shipped was a contrastive loss that, due to a broadcasting accident, was effectively averaging over an axis that should have been summed. Training looked fine. The model converged. It was just converging to a slightly different objective than the paper, and downstream evaluation lagged the published number by several percentage points for reasons I could not explain for two weeks.

When you write a forward pass, annotate every intermediate tensor with its expected shape and its expected semantics. Not in a comment. In the code. assert logits.shape == (batch, seq_len, vocab). assert (probs.sum(dim=-1) - 1).abs().max() < 1e-5. These are the unit-checks of ML. They cost a microsecond at runtime and they save days at debugging time. Libraries like jaxtyping and torchtyping formalize this, but you do not need a library. A handful of asserts in the right places is enough.
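Here is roughly what that habit looks like in a forward pass. The model and shapes are placeholders; the point is that the expectations are executable:

```python
import torch
import torch.nn.functional as F

def training_step(model, tokens, targets, vocab_size):
    batch, seq_len = tokens.shape

    logits = model(tokens)                 # unnormalized per-token scores
    assert logits.shape == (batch, seq_len, vocab_size)

    probs = logits.softmax(dim=-1)         # each row must be a distribution
    assert (probs.sum(dim=-1) - 1).abs().max() < 1e-5

    # Per-token loss, explicitly averaged over all tokens (not per sequence):
    loss = F.cross_entropy(
        logits.reshape(batch * seq_len, vocab_size),
        targets.reshape(batch * seq_len),
        reduction="mean",
    )
    assert loss.ndim == 0
    return loss
```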

Habit 4: The Faraday Rotation Lesson

Of all the experiments I did, the one I remember most vividly is Faraday rotation. You shine polarized light through a sample in a magnetic field, and the polarization plane rotates by an angle proportional to the field. To measure the angle, you put a second polarizer (the analyzer) on the far side and rotate it until the transmitted intensity is minimized. The angle of the analyzer at minimum is your data point.

The first time I ran the experiment, my data fit a clean line through the origin, which is what theory predicts. Beautiful. I wrote it up. The TA asked me to flip the magnet polarity and rerun. The new data also fit a clean line, but one offset from the origin by about two degrees. Two degrees that should not have been there.

The bug was that I had not zeroed the analyzer at the start. With zero field, my polarizers were misaligned by two degrees. With the field on, that offset was hiding inside what looked like a clean linear signal. Flipping the field exposed it. If I had only ever taken data in one direction, I would have written up the systematic error as physics, and a TA who graded a few hundred lab reports a year would have caught it in the first paragraph. The lesson stuck because the data really did look perfect on the way in.

This is the lesson I cite most often in ML reviews. A clean training curve is not evidence of a working model. A model that performs well on the held-out set you split from the same distribution as training is not evidence of a working model. The systematic offsets, the bias in your sampling, the leakage between train and test, and the spurious correlations in the source data are all hiding inside clean-looking results, and they only show up when you stress the model in a direction the training data did not.

The analog of flipping the magnet is the adversarial probe. Run the model on a population it was not trained on. Run it on data from a different time period. Permute a feature you believe should not matter and confirm that the prediction does not change. Run it backwards in time and confirm it does not predict tomorrow's labels from yesterday's features. (That last one catches more leakage bugs than I would have guessed before I started using it routinely.) If any of those break, your clean-looking model has a mis-zeroed polarizer somewhere, and the production data will eventually flip the magnet for you.
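The permutation probe, at least, is cheap enough to automate. A sketch, with hypothetical names throughout: predict_fn wraps the trained model, X is a plain feature matrix, and col indexes a feature the model should not care about:

```python
import numpy as np

def permutation_probe(predict_fn, X, col, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    baseline = predict_fn(X)

    X_perm = X.copy()
    rng.shuffle(X_perm[:, col])        # destroy any real signal in that column
    shifted = predict_fn(X_perm)

    delta = float(np.abs(shifted - baseline).mean())
    print(f"mean prediction shift after permuting column {col}: {delta:.5f}")
    return delta < tol                 # True: the model really does ignore it
```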

The Through-Line

The common thread in all four habits is humility about your own apparatus. Physics undergrad labs teach humility brutally and early, because the equipment is old and the TA is unsentimental and the data does not care about your hypothesis. Industrial ML is gentler. The frameworks hide most of the apparatus from you. The data comes pre-cleaned, or at least pre-collected. The metrics report themselves. It is easy, in this environment, to forget that there is an instrument at all, and that the instrument can be wrong.

I think this is part of why a generation of ML researchers with backgrounds in experimental science (physics, biology, chemistry) ends up writing some of the most reliable production code I have read. Not because they know more PyTorch. They almost always know less. They are simply harder to fool, because they have already been embarrassed by a Geiger counter or a misaligned polarizer or a thermocouple they forgot to calibrate, and the embarrassment generalizes.

The best ML engineers I have worked with treat their pipelines the way a good experimentalist treats a Geiger counter. They calibrate before they collect. They plot what the model gets wrong, not just what it gets right. They check the dimensions of every tensor. And when the result looks clean, they flip the magnet, just to be sure.

Physics did not teach me PyTorch. It taught me how to not believe my own results. That, it turns out, is the harder skill.