Neural Networks from First Principles


Backprop with a pencil before backprop in NumPy. A two-layer network for MNIST-shaped inputs, derived from the chain rule and implemented from scratch. The artifact is the math, not the accuracy.

Python · NumPy · Jupyter · Linear Algebra · Calculus

I wrote backprop with a pencil before I wrote it in NumPy. That is the project, and that is what makes it worth a page. Frameworks abstract the chain rule away; the abstraction is correct, and a working ML engineer should still be able to derive the thing the framework is hiding. This was the exercise that made sure I could.

The repository is a teaching-format implementation of a two-layer network shaped for MNIST: 784 input units, a 10-unit hidden layer, a 10-class softmax output. There are no frameworks. There is NumPy, there is Matplotlib for visualizing learning curves, and there is the math. The artifact is the derivation. The accuracy figure is, deliberately, not the point. The README does not list one and this page does not either.

Forward pass: linear, non-linear, repeat

The forward pass is two linear maps separated by a non-linearity, ending in a softmax that turns scores into probabilities.

Z1 = W1 @ X + b1                # shape: (10, m)
A1 = np.maximum(0, Z1)          # ReLU, shape: (10, m)

Z2 = W2 @ A1 + b2               # shape: (10, m)
A2 = softmax(Z2)                # shape: (10, m), columns sum to 1

With a batch of m examples, X is (784, m). W1 is (10, 784) so Z1 lands at (10, m). b1 is (10, 1) and broadcasts along the batch dimension. W2 is (10, 10) because both the hidden layer and the output have 10 units (an unusual but tidy MNIST configuration). A2 is (10, m) where each column is a probability distribution over the ten digit classes.

ReLU is max(0, x). The reason it works is more interesting than the formula. It is non-linear, so stacking layers actually adds expressive power. It does not saturate for positive inputs, so gradients flow cleanly during backprop. It is essentially free to compute. That last point matters more than it sounds when the training run is on a CPU.

Softmax exponentiates each output and normalizes, turning real-valued scores into a probability vector. It pairs cleanly with cross-entropy loss for a reason that becomes the core of the backward derivation, so it goes there.
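
A minimal column-wise softmax consistent with the shapes above might look like this. The max-subtraction is the standard overflow guard and an assumption about the repo's implementation, not a quote from it:

import numpy as np

def softmax(Z):
    # Subtracting the per-column max before exponentiating avoids overflow
    # without changing the result: the shift cancels in the normalization.
    Z_shifted = Z - Z.max(axis=0, keepdims=True)   # (10, m), column maxes become 0
    expZ = np.exp(Z_shifted)
    return expZ / expZ.sum(axis=0, keepdims=True)  # (10, m), columns sum to 1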

The loss

For one training example with one-hot label y and predicted distribution a, cross-entropy is

L = -sum_k y_k log(a_k)

For a batch of m examples this averages to L = -(1/m) sum over examples sum_k y_k log(a_k). The choice of cross-entropy over mean-squared error is not aesthetic. It is what makes the backward pass simplify, which is the next step.
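
In NumPy, with one-hot labels Y shaped (10, m) to match A2, the batched loss is a couple of lines. The clip against log(0) is my defensive addition, not necessarily in the repo:

def cross_entropy(A2, Y, eps=1e-12):
    # A2: (10, m) predicted probabilities, columns sum to 1.
    # Y:  (10, m) one-hot labels.
    m = Y.shape[1]
    # Clamp probabilities away from 0 so log never returns -inf.
    return -np.sum(Y * np.log(np.clip(A2, eps, 1.0))) / m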

Backpropagation: the chain rule, written out

Backprop is the chain rule applied carefully across a computational graph. Once I had the gradients on paper, the code wrote itself. The derivation in the notebook walks through every step. Here is the compressed version.

Output layer. I want dL/dZ2. Cross-entropy with softmax has an algebraic property worth deriving once and remembering forever:

dL/dZ2 = A2 - Y                 shape: (10, m)

That is it. The exponential in softmax cancels the logarithm in cross-entropy, the chain rule collapses, and what remains is predicted minus actual. This is the cleanest gradient in deep learning. It is also the reason this loss-and-activation pairing is the default. From there, the parameter gradients fall out:

dZ2 = A2 - Y                    # (10, m)
dW2 = (1/m) * dZ2 @ A1.T        # (10, 10)
db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)   # (10, 1)

The transpose on A1 is the chain rule expressing that Z2 = W2 @ A1 + b2, so dW2 is built from dZ2 and the activations that fed into it.
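
A habit that pays for itself when deriving by hand (my addition, not claimed to be in the notebook): check the analytic gradient against a finite difference. Using the softmax and cross_entropy sketches above:

def grad_check_W2(W2, b2, A1, Y, i, j, h=1e-5):
    # Central-difference estimate of dL/dW2[i, j].
    def loss_at(W):
        return cross_entropy(softmax(W @ A1 + b2), Y)
    W_plus, W_minus = W2.copy(), W2.copy()
    W_plus[i, j] += h
    W_minus[i, j] -= h
    return (loss_at(W_plus) - loss_at(W_minus)) / (2 * h)

# Random activations and labels; the two numbers should agree to several digits.
rng = np.random.default_rng(0)
A1 = rng.random((10, 32))
Y = np.eye(10)[:, rng.integers(0, 10, size=32)]      # one-hot columns, (10, 32)
W2, b2 = rng.normal(size=(10, 10)) * 0.3, np.zeros((10, 1))
dW2 = (softmax(W2 @ A1 + b2) - Y) @ A1.T / 32        # analytic gradient
print(dW2[3, 7], grad_check_W2(W2, b2, A1, Y, 3, 7))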

Hidden layer. Propagate the signal back through W2, then through ReLU.

dZ1 = (W2.T @ dZ2) * (Z1 > 0)   # (10, m)
dW1 = (1/m) * dZ1 @ X.T         # (10, 784)
db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)   # (10, 1)

The (Z1 > 0) mask is the ReLU derivative: 1 where the unit was active, 0 where it was clipped. That mask is why ReLU networks are sometimes called piecewise linear. For any single forward pass, the network is a linear map specific to which units are active. Different inputs activate different sub-networks. That is the source of expressive power.
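
That claim is checkable in a few lines (my illustration, not from the repo): freeze the mask for one input and the whole network collapses to a single affine map.

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(10, 784)) * 0.05, np.zeros((10, 1))
W2, b2 = rng.normal(size=(10, 10)) * 0.3, np.zeros((10, 1))
x = rng.normal(size=(784, 1))

mask = (W1 @ x + b1) > 0                    # which hidden units fired for this x
W_eff = W2 @ (mask * W1)                    # effective (10, 784) linear map
b_eff = W2 @ (mask * b1) + b2
Z2 = W2 @ np.maximum(0, W1 @ x + b1) + b2   # the usual forward pass
assert np.allclose(W_eff @ x + b_eff, Z2)   # identical pre-softmax scores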

Update. Plain gradient descent.

W1 -= alpha * dW1
b1 -= alpha * db1
W2 -= alpha * dW2
b2 -= alpha * db2

Five lines of math, applied repeatedly. That is the entire training loop.
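
Assembled into one step (the function boundary and name are mine, composing the pieces above with the softmax and cross_entropy sketches; the repo's notebook lays it out differently):

def train_step(X, Y, params, alpha):
    # One forward/backward/update pass. params = [W1, b1, W2, b2].
    W1, b1, W2, b2 = params
    m = X.shape[1]

    # Forward.
    Z1 = W1 @ X + b1
    A1 = np.maximum(0, Z1)
    Z2 = W2 @ A1 + b2
    A2 = softmax(Z2)

    # Backward: the gradients derived above.
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Update in place so params carries the new weights.
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return cross_entropy(A2, Y)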

The bugs that taught me the most

The interesting failures all came from shapes.

Silent broadcasting. NumPy will happily broadcast a (10, 1) array against a (10,) array and give you a (10, 10) result when what you wanted was an element-wise operation between two arrays of the same shape. Drop keepdims=True from the bias-gradient sum and that is exactly what happens on the next parameter update. The error does not surface as a Python exception. It surfaces as a loss curve that looks fine for a while and then plateaus at random-guessing accuracy. The fix was to print every shape at every step until I trusted the algebra. I also got into the habit of writing the expected shape next to each line as a comment, which is why the code blocks above are annotated. That comment-as-assertion style carried into how I write training code in PyTorch now: when in doubt, write the shape down.
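
Made executable, that habit is a three-line helper (hypothetical, not from the repo):

def check(name, arr, expected):
    # Turn a shape comment into an assertion that fails at the offending line.
    assert arr.shape == expected, f"{name}: expected {expected}, got {arr.shape}"

# The keepdims bug, caught where it happens instead of epochs later:
dZ2 = np.ones((10, 32))
db2 = np.sum(dZ2, axis=1)     # oops: shape (10,), not (10, 1)
check("db2", db2, (10, 1))    # AssertionError: db2: expected (10, 1), got (10,)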

Initialization. Initializing all weights to zero produces a network where every neuron in a layer learns the same thing forever, because the gradients are symmetric. Initializing with a Gaussian of variance 1 produces saturated activations and dead gradients on the first step. He initialization, which draws ReLU-layer weights with variance 2 / n_in (standard deviation sqrt(2 / n_in)), was not just a recommendation in a textbook. It was the difference between a network that learned and one that did not. The mechanical insight is that the variance of activations should stay roughly constant as signal flows through layers; if it explodes or vanishes, gradients do the same thing on the way back. Once that clicked, every initializer in modern frameworks (He for ReLU, Glorot for tanh, orthogonal for RNNs) read as a different answer to the same constraint.
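
In NumPy terms, the initializer is a one-liner (a sketch; the helper name and seed are mine, not necessarily the repo's):

rng = np.random.default_rng(0)

def he_init(n_out, n_in):
    # Zero-mean Gaussian, std sqrt(2 / n_in) (variance 2 / n_in), for ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W1, b1 = he_init(10, 784), np.zeros((10, 1))   # biases can start at zero;
W2, b2 = he_init(10, 10), np.zeros((10, 1))    # random weights break the symmetry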

Learning rate. Too high and the loss explodes. Too low and training takes forever or settles in a bad local region. The intuition I built from manual tuning transfers to every modern optimizer. Adam is doing a more sophisticated version of the same balancing act with moving averages of gradients and gradient magnitudes. The first time I saw a real loss curve diverge with a learning rate that was just slightly too high (the loss climbing instead of falling, then sailing off) I understood why the lr-finder visualizations from the fast.ai community look the way they do. A logarithmic sweep over the learning rate is the cheapest experiment in deep learning and the one I still run first on any new problem.
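
The sweep itself is a few lines. This sketch reuses the hypothetical train_step and he_init from above; X_batch and Y_batch stand in for a real mini-batch:

# X_batch: (784, m) inputs, Y_batch: (10, m) one-hot labels from the training set.
for alpha in np.logspace(-4, 0, num=9):        # 1e-4 up through 1e0
    params = [he_init(10, 784), np.zeros((10, 1)),
              he_init(10, 10), np.zeros((10, 1))]
    losses = [train_step(X_batch, Y_batch, params, alpha) for _ in range(50)]
    print(f"alpha={alpha:.0e}  loss {losses[0]:.3f} -> {losses[-1]:.3f}")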

The notebook is currently labeled work-in-progress. It contains the derivations, the forward pass, the backward pass, and a training loop. A full convergence run with a final accuracy figure is not in the README. When the repo publishes that number, this page will be updated. Until then the artifact is the math.

Where this fits in my work

The from-scratch implementation is foundational, not production. The frameworks exist for good reasons: numerical stability, GPU acceleration, automatic differentiation across complex graphs, distributed training. Nobody should write a Transformer in pure NumPy.

But the math the frameworks abstract over does not change. When a model in production is misbehaving, the diagnostic vocabulary is the vocabulary of gradients, activations, learning rates, initialization, and shapes. Knowing how to derive dL/dZ2 = A2 - Y is what lets me read a paper that adds a custom loss term and predict, without running anything, where in the gradient graph the new term will show up.

For the stellar spectra GON work and the clinical text summarizer, that vocabulary is what lets me debug training runs and decide when a model is failing for an architectural reason versus a data reason. The two-layer NumPy network was where I learned to speak it.

References

The two resources I returned to most:

  • 3Blue1Brown's neural network series on YouTube, especially the visual treatment of gradient descent and backpropagation.
  • Michael Nielsen's open book Neural Networks and Deep Learning, which derives the same network and the same equations in more depth than I have here.

GitHub: https://github.com/TirtheshJani/NN-with-math-and-numpy