Cindered Thoughts

Derivations of Gradients w.r.t. Neural Networks

This is a refresher on deriving backpropagation for MLPs with NLL loss and softmax. It builds from vector calculus and basic linear algebra to derive key properties of a basic neural network. This was originally done as an exercise and has become a valuable resource for me to reason about derivatives. I hope it is useful for others, and I will always be open to improving this resource.

This should be a good resource for folks familiar with high school calculus and roughly familiar with vectors. Most of the notation should be searchable and accurately reflects the dimensions of the vectors and matrices. Don't be too intimidated; work through it slowly :)


Refreshers on Derivatives of Vector Valued Functions

A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $a \in \mathbb{R}^n$ if there is an $m \times n$ matrix $A$ such that:

$$\lim_{x \to a} \frac{|f(x) - f(a) - A \cdot (x - a)|}{|x - a|} = 0$$

If such a matrix exists, the matrix $A$ is denoted by $Df(a)$ and is called the Jacobian.

Note that $|x - a|$ is the distance metric defined by the Euclidean distance $\sqrt{(x_1 - a_1)^2 + (x_2 - a_2)^2 + \cdots + (x_n - a_n)^2}$ and is a real-valued scalar.

Formally, this can be derived from the general definition of a derivative:

$$f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a}$$

This holds if and only if:

$$0 = \lim_{x \to a} \left( \frac{f(x) - f(a)}{x - a} - f'(a) \right)$$

which can be transformed to:

$$0 = \lim_{x \to a} \frac{f(x) - f(a) - f'(a)(x - a)}{x - a}$$

and thus, taking the distance of the numerator and denominator from the origin:

$$0 = \lim_{x \to a} \frac{|f(x) - f(a) - f'(a)(x - a)|}{|x - a|}$$

(since the notion of division of two vectors is silly)

Here $f'(a)$ represents our Jacobian matrix of shape $m \times n$. I will refer to this matrix as $Df(x)$ from now on, though in the example above it is being evaluated at the point $a$ in vector space.

Defining the Jacobian in terms of coordinates and Indices

Definitions of Jacobians via multiple functions of f

Let the function $f : \mathbb{R}^n \to \mathbb{R}^m$ be given by the $m$ differentiable functions $f_1(x_1, \ldots, x_n), \ldots, f_m(x_1, \ldots, x_n)$ such that:

$$f(x_1, \ldots, x_n) = \begin{bmatrix} f_1(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{bmatrix}$$

Since we can represent $f$ as a family of functions indexed from $1$ to $m$, we can take the derivative of each function $f_i$ for $i \in \{1, \ldots, m\}$:

$$Df_i(x_1, \ldots, x_n) \to \hat{v_i} \quad \text{such that } v_i \in \mathbb{R}^n$$

In this case, we know $v_i$ to be $n$-dimensional because of our original formulation of the derivative of vector-valued functions. Note that $f'(a)$ is itself another function, $Df(a)$ (a linear map of sorts). We need each $i$th row of $A$ to represent a tangent line, where that tangent line takes as input the displacement vector $x - a$.

Similar to our single valued derivative case (grade school calculus):

$$y = f(a) + f'(a)(x - a)$$

where the above is the tangent line of some differentiable function $y = f(x)$ at some point $a$ (note that this function is linear),

we want to build this same representation for a multivalued function $f_i(x_1, \ldots, x_n)$. Thus we rely on partial derivatives to represent individual linear tangent lines with respect to each individual input $x_j$ where $j \in \{1, \ldots, n\}$. We can then represent each row of $Df(x)$ as:

$$Df_i(x_1, \ldots, x_n) = \left( \frac{\partial f_i}{\partial x_1}, \frac{\partial f_i}{\partial x_2}, \ldots, \frac{\partial f_i}{\partial x_n} \right)$$

So as a result, we can say that $Df_i(x_1, \ldots, x_n)$, when applied to the elements of $x - a$, represents the linear approximation of small wiggles on the vector-valued function $f$, and that this approximation is approximately zero in distance from the exact difference $f(x) - f(a)$.

As a result, we can expand this out to each function $f_1, \ldots, f_m$ in $f$ so that our Jacobian $Df(x_1, \ldots, x_n)$ is:

$$Df(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
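To make this concrete, here is a small numerical sketch (not from the original text): a finite-difference approximation of the Jacobian of a toy map $f : \mathbb{R}^2 \to \mathbb{R}^2$, compared against the matrix of partial derivatives above. The function and evaluation point are made up for illustration.

```python
# Checking the Jacobian definition numerically for a toy map
# f(x, y) = (x*y, x + y^2), whose analytic Jacobian is [[y, x], [1, 2y]].
import numpy as np

def f(v):
    x, y = v
    return np.array([x * y, x + y ** 2])

def jacobian_fd(func, a, h=1e-6):
    """Finite-difference approximation of Df(a): one column per input x_j."""
    a = np.asarray(a, dtype=float)
    cols = []
    for j in range(a.size):
        e = np.zeros_like(a)
        e[j] = h
        cols.append((func(a + e) - func(a - e)) / (2 * h))
    return np.stack(cols, axis=1)  # shape (m, n)

a = np.array([2.0, 3.0])
J_analytic = np.array([[3.0, 2.0],   # [[y, x],
                       [1.0, 6.0]])  #  [1, 2y]] evaluated at (2, 3)
assert np.allclose(jacobian_fd(f, a), J_analytic, atol=1e-4)
```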

Defining coordinates from domain and codomain dimensions

In our case, we will want a convenient notation for handling indices in our Jacobian matrix, and for tracking our output vector's dimensions along with our input vector's dimensions.

Defining Differentials of Compositions of Functions

Let $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}^l$ be differentiable functions. Also there is a composition function:

$$g \circ f : \mathbb{R}^n \to \mathbb{R}^l$$

that is also differentiable, with a derivative given as follows: if $f(a) = b$, then

$$D(g \circ f)(a) = Dg(b) \cdot Df(a)$$

by the chain rule.
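The chain rule above can also be checked numerically. This is a small sketch with made-up functions $f$ and $g$, comparing a finite-difference Jacobian of the composition against the product $Dg(b) \cdot Df(a)$:

```python
# Numeric check of D(g∘f)(a) = Dg(f(a)) · Df(a) on two toy maps.
import numpy as np

def f(v):  # R^2 -> R^2
    return np.array([v[0] ** 2, v[0] + v[1]])

def g(w):  # R^2 -> R^2
    return np.array([w[0] * w[1], w[0] - w[1]])

def jacobian_fd(func, a, h=1e-6):
    """Finite-difference Jacobian: column j perturbs input j."""
    a = np.asarray(a, dtype=float)
    return np.stack([(func(a + h * e) - func(a - h * e)) / (2 * h)
                     for e in np.eye(a.size)], axis=1)

a = np.array([1.5, -0.5])
lhs = jacobian_fd(lambda v: g(f(v)), a)         # D(g∘f)(a) directly
rhs = jacobian_fd(g, f(a)) @ jacobian_fd(f, a)  # Dg(b) · Df(a), with b = f(a)
assert np.allclose(lhs, rhs, atol=1e-4)
```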


Building a Neural Network from an MLP with ReLU and NLL Loss

A Multi-Layer Perceptron (MLP) with ReLU (Rectified Linear Unit) activation is a common neural network architecture, here paired with a Negative Log-Likelihood loss (NLL Loss) for handling multiclass output.

What does an MLP represent mathematically?

An MLP is a neural network architecture where we have layers of perceptrons (sometimes called neurons) which can be seen as connected groups of bipartite graphs, such that each layer of nodes (neurons) is not connected to any other node in its layer. Each neuron receives inputs from all neurons in the previous layer. More specifically, each neuron in each layer multiplies each incoming input by a weight, and these products are all summed together. Each neuron in the layer then passes its summed value to an activation function (such as the ReLU in our example). This will be our neuron's final output.

Let's Define it Better!

As noted, we often model it as a series of layers where each layer is composed of neurons. Each neuron computes a weighted sum of its inputs, which is then passed through an activation function. Here's a simplified LaTeX representation that captures these components across multiple layers:

Consider an MLP with one hidden layer and an output layer. The model consists of an input vector, hidden layers with activation functions, and an output layer. We will denote:

Hidden Layer

Let's say the MLP has one hidden layer. The output of this layer for each neuron $j$ in the hidden layer can be represented as:

$$a_j^{(1)} = \sigma\left( \sum_{i=1}^{n} W_{ji}^{(1)} x_i \right)$$

where $n$ is the number of inputs to each neuron, and $W_{ji}^{(1)}$ are the weights connecting input $i$ to neuron $j$ in the first hidden layer.

Output Layer

For the output layer, if we have k outputs, the output for each neuron k in the output layer can be represented as:

$$y_k = \sigma\left( \sum_{j=1}^{m} W_{kj}^{(2)} a_j^{(1)} \right)$$

where $m$ is the number of neurons in the hidden layer, and $W_{kj}^{(2)}$ are the weights connecting hidden-layer neuron $j$ to output neuron $k$.

Notice that the weighted sum for neuron $j$ is the dot product of the input vector with row $j$ of the matrix $W^{(l)}$. If we iterate over every neuron, we can store each value in a vector $[a_1, a_2, a_3, \ldots, a_k]$. The whole layer can thus be represented as a simple linear transformation! We will show more in the next section.

To underscore: we will reference this indexing later in this writeup.
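This equivalence is easy to see in code. A minimal sketch (weights, inputs, and activation chosen arbitrarily) comparing the per-neuron summation against a single matrix-vector product:

```python
# The per-neuron weighted sums are exactly one matrix-vector product:
# row j of W dotted with the input x, then passed through the activation.
import numpy as np

sigma = np.tanh                      # any elementwise activation works here
x = np.array([0.5, -1.0, 2.0])       # n = 3 inputs
W = np.array([[1.0, 0.0, 0.5],       # m = 2 neurons, one row of weights each
              [-0.5, 1.0, 1.0]])

# Neuron by neuron, following a_j = sigma(sum_i W_ji * x_i):
a_loop = np.array([sigma(sum(W[j, i] * x[i] for i in range(3)))
                   for j in range(2)])

# The same layer as a single linear transformation:
a_vec = sigma(W @ x)
assert np.allclose(a_loop, a_vec)
```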


Defining Operations Concretely

Suppose you have an input $\mathbf{x}$ and weights $\mathbf{W}$ (and biases $\mathbf{b}$, which are common, but we leave them out). The operations in the layer can be described as:

  1. Linear Transformation: $\mathbf{z} = W\mathbf{x}$
  2. ReLU Activation: $\mathbf{a} = \mathrm{ReLU}(\mathbf{z})$, where $\mathrm{ReLU}(z_i) = \max(z_i, 0)$
  3. NLL Loss: $\mathbf{L} = \mathrm{NLLLoss}(\mathbf{z}) = -\log(z_t)$

From our previous example, we can cleanly represent our input-to-output mapping as a linear transformation, rather than a cumbersome summation. We will reference this from now on for the forward pass of the MLP, and come back to this notation later when things get more tricky.

Example of a Forward Pass

We will show an example of a forward pass that works on a simple network, with two layers of two neurons each, three inputs, and two outputs.

Assume for our example, we have:

The Forward Pass

  1. Layer 1: Input $\mathbf{x}$, weights $\mathbf{W}_1$

    • Linear transformation: $\mathbf{z}_1 = \mathbf{W}_1 \mathbf{x}$
    • Activation: $\mathbf{a}_1 = \mathrm{ReLU}(\mathbf{z}_1)$
  2. Output Layer: Input $\mathbf{a}_1$, weights $\mathbf{W}_2$

    • Linear transformation: $\mathbf{z}_2 = \mathbf{W}_2 \mathbf{a}_1$
    • Activation: $\mathbf{a}_2 = \mathrm{ReLU}(\mathbf{z}_2)$
  3. Computing the Loss: Input activations $\mathbf{a}_2$, label $t$

    • Loss computation: $\mathbf{L} = -\log(\mathbf{a}_t)$, where $\mathbf{a}_t$ is the element at index $t$ of $\mathbf{a}_2$
    • $t$ here represents the class label index of the input example. As a result, the possible values must index into the shape of $\mathbf{a}_2$; e.g. if $\mathbf{a}_2 \in \mathbb{R}^3$, then $t \in \{1, \ldots, 3\}$
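The steps above can be sketched in NumPy. The weight values and label index here are made up for illustration:

```python
# A concrete forward pass through the two-layer example network.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

x = np.array([1.0, 2.0, 3.0])            # three inputs
W1 = np.array([[1.0, 0.0, -1.0],         # layer 1: 3 -> 2
               [0.5, 0.5, 0.5]])
W2 = np.array([[1.0, -1.0],              # layer 2: 2 -> 2 outputs
               [0.5, 1.0]])

z1 = W1 @ x          # [-2.0, 3.0]
a1 = relu(z1)        # [ 0.0, 3.0]
z2 = W2 @ a1         # [-3.0, 3.0]
a2 = relu(z2)        # [ 0.0, 3.0]

t = 1                                    # class label index into a2
loss = -np.log(a2[t])                    # note: a ReLU output can be 0, so in
                                         # practice the log is taken on
                                         # (log-)softmax values instead
```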

Quality of Life Reformulations

Given that we have set up the premise, we can go ahead and use our previous mathematical foundation to reformulate some specific parts of our forward pass and the operations themselves. Why do we do this? We want to make it easier to convince oneself of how these models update, or learn (more on this in the gradients and backpropagation section).

Representing MLP transformations in a more convenient Way

We currently represent each layer's linear transformation as a matrix-vector multiplication. This is quite cumbersome to handle in later operations, specifically when trying to differentiate our composed function calls.

We can represent a linear transformation $\mathbf{z}^{l} = \mathbf{W}^{l} \mathbf{x}$ as:

$$f_{\mathbf{W}^{l}}(\mathbf{x}) \quad \text{where } f_{\mathbf{W}^{l}} : \mathbb{R}^n \to \mathbb{R}^m$$

Now, our function is not necessarily multivalued, but represents a frozen transformation on vector valued inputs. Looking back to our refresher on vector valued functions, this means we can focus on differentiating f, the linear transformation directly. This also means we can invoke this function for inputs that are not just vectors, but matrices or tensors, representing collections of inputs.


Deriving Gradients of our MLP

Neural networks learn by updating their weights against the errors of the network relative to the correct answer. To determine what might need to change in the model's parameters (the weights $W^{(l)}$ for us), the training algorithm performs what is called a backward pass on the network.

The backward pass in a neural network, particularly for a Multi-Layer Perceptron (MLP) with ReLU (Rectified Linear Unit) activation, involves computing the gradient of the loss function with respect to the weights. This process allows for the updating of parameters via optimization algorithms like gradient descent. Here's how it works for a MLP layer with ReLU activation:

Backward Pass (Reverse Mode)

The gradients need to be computed with respect to $\mathbf{W}$. Assuming that the gradient of the loss $L$ with respect to the output of this layer $\mathbf{a}$ is known (denoted $\frac{\partial L}{\partial \mathbf{a}}$), the gradients can be calculated as follows:

  1. Gradient through ReLU:
$$\frac{\partial L}{\partial \mathbf{z}} = \frac{\partial L}{\partial \mathbf{a}} \odot \mathrm{ReLU}'(\mathbf{z})$$

Here, $\mathrm{ReLU}'(z_i)$ is the derivative of ReLU, which is $1$ for $z_i > 0$ and $0$ otherwise. This can be proved simply by:

$$\frac{\partial\, \mathrm{ReLU}}{\partial \mathbf{z}} = \frac{\partial}{\partial z_i} \begin{cases} z_i & z_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

Moving the derivative inside the piecewise definition:

$$\frac{\partial\, \mathrm{ReLU}}{\partial \mathbf{z}} = \begin{cases} \frac{\partial z_i}{\partial z_i} & z_i > 0 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 1 & z_i > 0 \\ 0 & \text{otherwise} \end{cases}$$
  2. Gradient w.r.t. Weights:
$$\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{z}} \cdot \mathbf{x}^T \qquad \frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}} \quad \text{(summing over the batch if needed)}$$
  3. Gradient through NLL Loss w.r.t. logits $z_i$:

    Given that the NLL loss takes a vector as input and returns a scalar, $\mathrm{NLL} : \mathbb{R}^c \to \mathbb{R}$, we can apply a similar proof strategy to define a Jacobian that maps some loss $L$ back to the given logits $z_i$.

    First, we define the inner function with respect to the input logits:

$$\frac{\partial\, \mathrm{NLL}_t}{\partial z_i} = \frac{\partial}{\partial z_i} \begin{cases} -\log(z_t) & t = i \\ 0 & i \neq t \end{cases} \qquad\Longrightarrow\qquad \frac{\partial\, \mathrm{NLL}_t}{\partial z_i} = \begin{cases} -\frac{1}{z_t} & t = i \\ 0 & i \neq t \end{cases}$$

where we fix some index $t$ baked into the function (like a closure).

Then the Jacobian can be seen as $D\,\mathrm{NLL}_t(z) : \mathbb{R}^c \to \mathbb{R}$ such that:

$$\begin{bmatrix} \frac{\partial\, \mathrm{NLL}_t}{\partial z_1} & \frac{\partial\, \mathrm{NLL}_t}{\partial z_2} & \cdots & \frac{\partial\, \mathrm{NLL}_t}{\partial z_n} \end{bmatrix}$$

In our implementation, we end up simply using loss = -input[target], mainly because, when simplifying the gradients, we see:

$$\mathrm{NLL}(x, y) = -\log(p_y) = -\log\left( \frac{e^{z_y}}{\sum_j e^{z_j}} \right) = -z_y + \log\left( \sum_j e^{z_j} \right)$$

Here the loss is $-z_y + \log(\sum_j e^{z_j})$; taking the derivative of the $-z_y$ term yields $-1$ for the target index. Since in our case the NLL loss is the last layer and our chain rule starts with this value, we can assume that we don't need the $-\log(p_z)$, and allow gradients to flow directly from this function. Note that our loss value will differ numerically, because we compute on raw logit outputs rather than softmaxed outputs. But for the sake of gradient flow backwards, it is equivalent, as $-\frac{1}{z_i}$ and $-1$ behave the same when the incoming multiplicative factor is $1$ (i.e. $1 \cdot -1$).
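This simplification can be sanity-checked numerically. The sketch below (logits made up) verifies that the gradient of $-z_y + \log(\sum_j e^{z_j})$ with respect to the logits is $\mathrm{softmax}(z)$ minus a one-hot vector at the target index:

```python
# Numeric check: d/dz [-z_y + log(sum_j e^{z_j})] = softmax(z) - onehot(y).
import numpy as np

def nll_from_logits(z, y):
    return -z[y] + np.log(np.sum(np.exp(z)))

z = np.array([1.0, -0.5, 2.0])
y = 2
p = np.exp(z) / np.sum(np.exp(z))        # softmax probabilities
onehot = np.eye(3)[y]

h = 1e-6
grad_fd = np.array([
    (nll_from_logits(z + h * e, y) - nll_from_logits(z - h * e, y)) / (2 * h)
    for e in np.eye(3)
])
assert np.allclose(grad_fd, p - onehot, atol=1e-5)
```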

We now want to derive the loss through the softmax, and understand what the derivative actually looks like when not on raw logits.

Recall $-\log(p_y)$, where $p_j$ represents values after a softmax function is applied. Separately, the softmax function can be written as:

$$p_y = \frac{e^{z_y}}{\sum_j e^{z_j}}$$

We want to get $\frac{\partial p}{\partial z}$ (the derivative of softmax):

Given $f_y(\mathbf{z}) = f_y(z_1, \ldots, z_n) = p_y = \frac{e^{z_y}}{\sum_j e^{z_j}}$ where $\mathbf{z} \in \mathbb{R}^n$, define a vector-valued function $f$:

$$f = \begin{bmatrix} f_1(z_1, \ldots, z_n) \\ f_2(z_1, \ldots, z_n) \\ \vdots \\ f_n(z_1, \ldots, z_n) \end{bmatrix}$$

we want to find $Df_i(\mathbf{z})$, which can be represented as:

$$Df_{ij}(z) \in \mathbb{R}, \qquad Df(z) \in \mathbb{R}^{n \times n}, \qquad Df_{ij} : \mathbb{R}^n \to \mathbb{R}$$

$$Df_{ij}(\mathbf{z}) = Df_{ij}(z_1, \ldots, z_n) = \frac{\partial}{\partial z_j} \frac{e^{z_i}}{\sum_k e^{z_k}}, \quad \text{with two cases: } i \neq j \text{ and } i = j$$

Case 1: where $i \neq j$,

$$\frac{\partial}{\partial z_j} \frac{e^{z_i}}{\sum_k e^{z_k}} = \frac{\partial}{\partial z_j}\, e^{z_i} \cdot \left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^{-1}$$

$$= e^{z_i} \cdot \left( -\left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^{-2} \right) \cdot e^{z_j} = -\frac{e^{z_i} \cdot e^{z_j}}{\left( \sum_k e^{z_k} \right)^2}$$

$$= -\frac{e^{z_i}}{\sum_k e^{z_k}} \cdot \frac{e^{z_j}}{\sum_k e^{z_k}} = -p_i \cdot p_j$$

Case 2: when $i = j$:

Note the derivative: $\frac{d}{dx} \frac{a}{g(x)} = \frac{d}{dx}\left( a \cdot g(x)^{-1} \right) = -a \cdot \frac{g'(x)}{g(x)^2}$

Using the product rule:

$$\frac{\partial}{\partial z_i} \frac{e^{z_i}}{\sum_k e^{z_k}} = \frac{\partial}{\partial z_i}\, e^{z_i} \cdot \left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^{-1}$$

$$= e^{z_i} \cdot \left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^{-1} + e^{z_i} \cdot \left( -\left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^{-2} \right) \cdot e^{z_i}$$

$$= e^{z_i} \left( \frac{1}{e^{z_1} + e^{z_2} + \cdots + e^{z_n}} - \frac{e^{z_i}}{\left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^2} \right)$$

$$= \frac{e^{z_i}}{\sum_k e^{z_k}} \cdot \left( 1 - \frac{e^{z_i}}{\sum_k e^{z_k}} \right) = \mathrm{softmax}(z)_i \left( 1 - \mathrm{softmax}(z)_i \right) = p_i - p_i^2$$
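Both cases combine into the single expression $Df_{ij} = p_i(\delta_{ij} - p_j)$: $p_i - p_i^2$ on the diagonal and $-p_i \cdot p_j$ off it. A short numeric sketch (logits made up) checking this against finite differences:

```python
# Numeric check of the softmax Jacobian Df_ij = p_i * (delta_ij - p_j).
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shifted for numerical stability
    return e / e.sum()

z = np.array([0.2, -1.0, 1.5])
p = softmax(z)
J_analytic = np.diag(p) - np.outer(p, p)   # p_i*delta_ij - p_i*p_j

h = 1e-6
J_fd = np.stack([(softmax(z + h * e) - softmax(z - h * e)) / (2 * h)
                 for e in np.eye(3)], axis=1)  # column j perturbs z_j
assert np.allclose(J_fd, J_analytic, atol=1e-6)
```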

Deriving the gradients w.r.t. the weights W

To understand the geometric interpretation and prove the derivative of a loss function $L$ with respect to a weights matrix $W$ in the context of neural networks, we primarily use concepts from multivariable calculus and matrix calculus. The proof here follows the fundamental principle that the derivative of a function at a point gives the best linear approximation to the function at that point, indicating the direction and rate of steepest ascent from that point.

Setting Up the Scenario

Assume $L$ is a scalar loss function that depends on the output $y$ of a neural network, where $y$ itself is a function of the weights matrix $W$. Suppose the neural network is as described in the forward pass:

y=f(Wx)

Here, f represents the activation function applied element-wise, x is the input vector. The output y thus depends on the weights matrix W.

The Geometric Definition of Derivatives

In the context of functions from $\mathbb{R}^n \to \mathbb{R}^m$, the derivative is a linear map that best approximates the change in the function near a point. For vector-valued functions $F : \mathbb{R}^n \to \mathbb{R}^m$, the derivative at a point is represented by the Jacobian matrix, which consists of all first-order partial derivatives as described before.

For our case, $L : \mathbb{R}^m \to \mathbb{R}$, the derivative with respect to the matrix $W$ (which is itself a tensor) needs to capture how infinitesimal changes in the elements of $W$ affect the change in $L$. We also know that $W$ can be represented as a linear map $W : \mathbb{R}^n \to \mathbb{R}^m$, so we can treat $W$ as a function.

Derivative Computation

  1. Recollect the Forward Pass as:
$$z = W(x), \qquad y = f(z), \qquad L = \mathrm{Loss}(y, \mathrm{target})$$
  2. Applying the Chain Rule: The derivative of $L$ with respect to $W$ involves applying the chain rule through the layers of operations:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial W}$$
  3. Jacobian to Matrix Form: As before, $\frac{\partial z}{\partial W}$ captures $z_i = \sum_j W_{ij} x_j$, so the derivative with respect to each element $W_{ij}$ is simply $x_j$. Representing this as a vector-valued function would first look like:
$$W(\mathbf{x}) = W(x_1, x_2, \ldots, x_n) = \begin{pmatrix} W_1(x_1, x_2, \ldots, x_n) \\ W_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ W_m(x_1, x_2, \ldots, x_n) \end{pmatrix}$$

where the function $W_j$ is one of the $m$ functions that come from $W$, such that $j \in \{1, \ldots, m\}$. Furthermore, we can think of a function $W_j$ as:

$$z_j = W_j(x_1, x_2, \ldots, x_n) = W_{j1} x_1 + W_{j2} x_2 + \cdots + W_{jn} x_n = \sum_k W_{jk} x_k$$

Now, looking at $\frac{\partial z}{\partial W}$, we want to build the differential $DW(x)$, very similar to our derivation in the refresher:

$$Df_i(x_1, \ldots, x_n) = \left( \frac{\partial f_i}{\partial x_1}, \frac{\partial f_i}{\partial x_2}, \ldots, \frac{\partial f_i}{\partial x_n} \right)$$

except we want to take the partial derivatives with respect to $W_{jk}$ instead of $x_k$:

$$Df_j(x_1, x_2, \ldots, x_n) = \left( \frac{\partial f_j}{\partial W_{j1}}, \frac{\partial f_j}{\partial W_{j2}}, \ldots, \frac{\partial f_j}{\partial W_{jn}} \right)$$

And here, we replace $f_j$ with our indexed functions from above, $W_j(\mathbf{x})$:

$$DW_j(x_1, x_2, \ldots, x_n) = \left( \frac{\partial z_j}{\partial W_{j1}}, \frac{\partial z_j}{\partial W_{j2}}, \ldots, \frac{\partial z_j}{\partial W_{jn}} \right)$$

Notice that we replaced $f_j$ with $z_j$, since in our original derivation $f_i$ represents the output as well; we make that substitution here.

Now, to map it back to our partial derivative $\frac{\partial z}{\partial W}$:

$$\frac{\partial z_j}{\partial W_j} = \frac{\partial\, W_j(x_1, x_2, \ldots, x_n)}{\partial W_j}$$

Here we substitute in $DW_j$ as shown before, $\frac{\partial z_j}{\partial W_j} = DW_j$:

$$\frac{\partial z_j}{\partial W_j} = DW_j(x_1, x_2, \ldots, x_n)$$

Note that $DW_j$ is a function, representing the infinitesimal changes of the elements of $W$ with respect to the output $z_j$ at some point $\mathbf{x}$.

We can now map it to the Jacobian, given that we have each function $W_j(\mathbf{x})$:

$$DW(\mathbf{x}) = \begin{bmatrix} \frac{\partial z_1}{\partial W_{11}} & \frac{\partial z_1}{\partial W_{12}} & \cdots & \frac{\partial z_1}{\partial W_{1n}} \\ \frac{\partial z_2}{\partial W_{21}} & \frac{\partial z_2}{\partial W_{22}} & \cdots & \frac{\partial z_2}{\partial W_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial z_m}{\partial W_{m1}} & \frac{\partial z_m}{\partial W_{m2}} & \cdots & \frac{\partial z_m}{\partial W_{mn}} \end{bmatrix}$$

Each entry here directly maps to a value in the summation defined earlier.

Expanding $\frac{\partial z}{\partial W}$ in matrix terms, we have that each component $\frac{\partial z_i}{\partial W_{ij}} = x_j$.

Note as well that we can expand out $\frac{\partial z}{\partial \mathbf{x}}$: since $z_i = \sum_j W_{ij} x_j$ can be differentiated w.r.t. either $W$ or $x$, we get $\frac{\partial z_i}{\partial x_j} = W_{ij}$.

Thus, we can transform the above Jacobian representation in terms of matrix operations:

$$\frac{\partial L}{\partial W} = \left( \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \right) \cdot x^T$$

Here, $\frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z}$ is treated as a column vector (one entry per output $z_i$), so multiplying by $x^T$, the transpose of the input vector $x$, yields an outer product with the same $m \times n$ shape as $W$.
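As a sketch of this shape bookkeeping (the gradient and input values are made up):

```python
# dL/dW as an outer product: with delta = dL/dz a column vector of length m
# and x the input of length n, delta · x^T has W's shape (m, n).
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # input, n = 3
delta = np.array([0.5, -1.0])        # dL/dz for m = 2 outputs (made-up values)

dL_dW = np.outer(delta, x)           # shape (2, 3), same as W
assert dL_dW.shape == (2, 3)
# Each entry is dL/dW_ij = delta_i * x_j:
assert dL_dW[1, 2] == delta[1] * x[2]
```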

Generalizing to an Arbitrarily Deep and Wide MLP

Network Setup and Notation

Assume a neural network with $K$ layers. Each layer $k$ has a weight matrix $W_k$ and an activation function $f_k$.

The output yk of each layer serves as the input xk+1 to the next layer.

Forward Pass

  1. Input Layer: $x_1$ is the input to the network.
  2. Hidden Layers and Output: For each layer $k$:
$$z_k = W_k x_k, \qquad y_k = f_k(z_k) \quad (\text{with } x_{k+1} = y_k \text{ for the next layer})$$
  3. Final Output: The output $y_K$ of the last layer is used to compute the loss $L$ based on a target label, i.e., $L = \mathrm{Loss}(y_K, \mathrm{target})$.

Backward Pass (General Case)

To compute the gradient LWk for each layer's weights Wk, you apply the chain rule in reverse order from the output back to the inputs:

  1. Output Layer Gradient:
$$\frac{\partial L}{\partial y_K} = \text{derivative of the loss function w.r.t. the output of the last layer}$$
  2. Backpropagation Through Layers: For each layer $k$ from $K$ down to $1$, compute:
$$\frac{\partial L}{\partial z_k} = \frac{\partial L}{\partial y_k} \odot f_k'(z_k)$$

where $f_k'(z_k)$ is the derivative of the activation function at layer $k$.

If $k < K$, then:

$$\frac{\partial L}{\partial y_k} = W_{k+1}^T \, \frac{\partial L}{\partial z_{k+1}}$$
  3. Gradient w.r.t. Weights: For each layer $k$, compute:
$$\frac{\partial L}{\partial W_k} = \frac{\partial L}{\partial z_k} \, x_k^T$$
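The general recursion above can be sketched end to end. This toy two-layer version uses the simplified loss $L = -z_K[t]$ from earlier and checks one weight gradient with a finite difference; the sizes, random weights, and the choice to skip ReLU on the final logits are assumptions for illustration:

```python
# End-to-end backward pass for a toy 2-layer ReLU network with L = -z_K[t].
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
W = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]  # K = 2 layers
x1 = rng.normal(size=3)
t = 0

def forward(W):
    xs, zs = [x1], []
    for k, Wk in enumerate(W):
        z = Wk @ xs[-1]
        zs.append(z)
        xs.append(relu(z) if k < len(W) - 1 else z)  # no ReLU on final logits
    return xs, zs, -xs[-1][t]

xs, zs, loss = forward(W)

# Backward pass: start with dL/dz_K, then alternate
# dL/dy_k = W_{k+1}^T dL/dz_{k+1} and dL/dz_k = dL/dy_k ⊙ f'(z_k).
grad_z = -np.eye(len(xs[-1]))[t]              # dL/dz_K for L = -z_K[t]
grads = [None] * len(W)
for k in reversed(range(len(W))):
    grads[k] = np.outer(grad_z, xs[k])        # dL/dW_k = dL/dz_k · x_k^T
    if k > 0:
        grad_y = W[k].T @ grad_z              # dL/dy_{k-1}
        grad_z = grad_y * (zs[k - 1] > 0)     # ReLU derivative mask

# Finite-difference check on one weight entry:
h = 1e-6
W_pert = [w.copy() for w in W]
W_pert[0][1, 2] += h
fd = (forward(W_pert)[2] - loss) / h
assert abs(fd - grads[0][1, 2]) < 1e-4
```

A one-sided difference suffices for the check because this loss is piecewise linear in the weights; a centered difference would be the safer default for smoother losses.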

Summary

The backward pass effectively determines how the weights in each layer should be adjusted to minimize the loss, accounting for how each layer's output influences the loss through subsequent layers. Each LWk points in the direction of greatest increase of the loss as a function of the weights in layer k. By updating the weights in the opposite direction of this gradient, we move towards reducing the loss, implementing the essence of gradient descent.

These derivations are used to compute backpropagation in our MLP example in the modeling/ directory. network.py shows how to forward pass through the network, and how one can backward pass the gradients through an MLP with ReLU activation and Negative Log-Likelihood loss.