Derivations of Gradients w.r.t. Neural Networks
This is a refresher on deriving backpropagation for MLPs with NLL Loss and Softmax. It builds from vector calculus and basic linear algebra to derive key properties of a basic neural network. This was originally done as an exercise and has become a valuable resource for me to reason about derivatives. I hope it is useful for others, and I will always be open to improving this resource.
This should be a good resource for folks familiar with high school calculus and roughly familiar with vectors. Most of the notation should be searchable and accurately reflects the dimensions of the vectors and matrices. Don't be too intimidated; work through it slowly :)
Refreshers on Derivatives of Vector Valued Functions
A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $x_0 \in \mathbb{R}^n$ if there is an $m \times n$ matrix $A$ such that:

$$\lim_{x \to x_0} \frac{\lVert f(x) - f(x_0) - A(x - x_0) \rVert}{\lVert x - x_0 \rVert} = 0$$

If such a matrix exists, it is denoted by $Df(x_0)$ and is called the Jacobian.

Note that $\lVert \cdot \rVert$ is the distance metric defined by the Euclidean distance, so each quotient is a real-valued scalar.

Formally, this can be derived from the general definition of a derivative:

$$f'(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}$$

Where this is only true if and only if:

$$\lim_{x \to x_0} \frac{f(x) - f(x_0) - f'(x_0)(x - x_0)}{x - x_0} = 0$$

which can be transformed to:

$$\lim_{x \to x_0} \frac{f(x) - f(x_0) - A(x - x_0)}{x - x_0} = 0$$

and thus evaluated via the distance of each numerator and denominator from the origin:

$$\lim_{x \to x_0} \frac{\lVert f(x) - f(x_0) - A(x - x_0) \rVert}{\lVert x - x_0 \rVert} = 0$$

(Since the notion of division of two vectors is silly.)

Here $A$ represents our Jacobian matrix of shape $m \times n$. I refer to this matrix as $Df$ from now on, though in the example above it is being evaluated at the point $x_0$ in vector space.
Defining the Jacobian in terms of coordinates and Indices
Definitions of Jacobians via multiple functions of $x$

Let the function $f: \mathbb{R}^n \to \mathbb{R}^m$ be given by the $m$ differentiable functions $f_1, \dots, f_m$ such that:

$$f(x) = (f_1(x), \dots, f_m(x))$$

Supposing we can represent $f$ as a family of functions $f_i$, indexed from $1$ to $m$, we can take the derivative of each function $f_i$ for $i = 1, \dots, m$.

In this case, we know $Df(x_0)$ to be $m \times n$ dimensional, because of our original formulation of the derivative of vector-valued functions. Note that we must compute each $Df_i$, which is the derivative of another function $f_i: \mathbb{R}^n \to \mathbb{R}$. However, we need the rows of $Df$ (a linear map of sorts), one for each $i$th row, to represent a tangent line. This tangent line is conditioned to be the best linear approximation of $f_i$ at an input vector $x_0$ for the resultant $i$th component of $f$.
Similar to our single-valued derivative case (grade school calculus):

$$y = f(a) + f'(a)(x - a)$$

where the above function is the tangent line of some differentiable function $f$ at some point $a$ (note this function is linear in $x$),

we want to build this same representation for a multivalued function $f: \mathbb{R}^n \to \mathbb{R}^m$. Thus we rely on partial derivatives to represent individual linear tangent lines with respect to each individual input $x_j$, where $1 \le j \le n$. Thus, we can represent each row $i$ of $Df$ as:

$$\nabla f_i(x_0) = \left( \frac{\partial f_i}{\partial x_1}(x_0), \dots, \frac{\partial f_i}{\partial x_n}(x_0) \right)$$

So as a result, we can say that $\nabla f_i(x_0)$, when applied to the elements of $x - x_0$, represents the linear approximation of wiggles on the vector-valued function: the difference $f_i(x) - f_i(x_0) - \nabla f_i(x_0)(x - x_0)$ will be approximately zero in distance if we took the difference exactly.

As a result, we can expand this out to each function $f_1, \dots, f_m$ so that our Jacobian is:

$$Df(x_0) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
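To make the Jacobian definition concrete, here is a small pure-Python sketch that checks a hand-derived Jacobian against central finite differences. The function $f$ and its Jacobian are illustrative choices of mine, not examples from the text.

```python
# Finite-difference check of the Jacobian: (Df)_{ij} = d f_i / d x_j.
# f: R^2 -> R^2 with f(x) = (x0^2, x0 * x1) is an illustrative choice.

def f(x):
    return [x[0] ** 2, x[0] * x[1]]

def jacobian_analytic(x):
    # Rows are the gradients of each component f_i.
    return [[2 * x[0], 0.0],
            [x[1], x[0]]]

def jacobian_numeric(f, x, h=1e-6):
    # (Df)_{ij} ~= (f_i(x + h e_j) - f_i(x - h e_j)) / (2h)
    fx = f(x)
    m, n = len(fx), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

x0 = [1.5, -2.0]
Ja = jacobian_analytic(x0)
Jn = jacobian_numeric(f, x0)
assert all(abs(Ja[i][j] - Jn[i][j]) < 1e-4 for i in range(2) for j in range(2))
```

Perturbing each input coordinate $x_j$ one at a time and watching every output $f_i$ fills in the Jacobian column by column, which mirrors the limit definition above.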
Defining coordinates from domain and codomain dimensions
In our case, we will want a convenient notation for handling indices in our Jacobian matrix, tracking our output vector's dimensions along with our input vector's dimensions. We will use $i$ (from $1$ to $m$) for the output/row index and $j$ (from $1$ to $n$) for the input/column index, so that $(Df)_{ij} = \frac{\partial f_i}{\partial x_j}$.
Defining Differentials of Compositions of Functions
Let $f: \mathbb{R}^n \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^p$ be differentiable functions. Also there is a composition function:

$$h = g \circ f, \quad h(x) = g(f(x))$$

that is also differentiable, with a derivative given by: if $y = f(x)$, then

$$Dh(x) = Dg(y) \, Df(x)$$

by the chain rule.
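The composition rule can be sanity-checked numerically. This is a minimal sketch in the scalar case ($n = m = p = 1$), where the Jacobians are plain derivatives and the product $Dg(y)\,Df(x)$ is ordinary multiplication; $f$ and $g$ are illustrative choices of mine.

```python
# Numerical check of the chain rule: D(g∘f)(x0) = Dg(f(x0)) * Df(x0)
# in the scalar case. f, g, and x0 are illustrative choices.
import math

f = lambda x: x ** 2 + 1.0
g = lambda y: math.sin(y)
df = lambda x: 2 * x            # Df
dg = lambda y: math.cos(y)      # Dg

def numeric_derivative(fn, x, h=1e-6):
    # central finite difference
    return (fn(x + h) - fn(x - h)) / (2 * h)

x0 = 0.7
analytic = dg(f(x0)) * df(x0)                       # chain rule
numeric = numeric_derivative(lambda x: g(f(x)), x0)  # direct estimate
assert abs(analytic - numeric) < 1e-6
```

The same identity with matrix multiplication in place of scalar multiplication is exactly what the backward pass exploits later in this write-up.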
Building a Neural Network from an MLP with ReLU and NLL Loss
Multi-Layer Perceptrons (MLPs) with ReLU (Rectified Linear Unit) activation are a common neural network architecture, often paired with a Negative Log-Likelihood loss (NLL Loss) for handling multiclass output.
What does an MLP represent mathematically?
An MLP is a neural network architecture where we have layers of perceptrons (sometimes called neurons) which can be seen as connected groups of bipartite graphs, such that each node (neuron) in a layer is not connected to any other node in its layer. Each neuron receives inputs from all neurons in the previous layer. More specifically, each neuron in each layer takes all of the incoming inputs and multiplies each one by a weight. These products are all summed together. Each neuron in the layer then passes its summed value through an activation function (such as the ReLU in our example). This will be our neuron's final output.
Let's Define it Better!
As noted, we often model it as a series of layers where each layer is composed of neurons. Each neuron computes a weighted sum of its inputs, which is then passed through an activation function. Here's a simplified LaTeX representation that captures these components across multiple layers:
Consider an MLP with one hidden layer and an output layer. The model consists of an input vector, a hidden layer with an activation function, and an output layer. We will denote:
- $x \in \mathbb{R}^n$ as the input vector.
- $W^{(l)}$ as the weights of layer $l$.
- $\sigma$ as the activation function (e.g., sigmoid, ReLU).
Hidden Layer
Let's say the MLP has one hidden layer. The output of this layer for each neuron $j$ in the hidden layer can be represented as:

$$h_j = \sigma\left( \sum_{i=1}^{n} W^{(1)}_{ji} x_i \right)$$

where $n$ is the number of inputs to each neuron, and $W^{(1)}_{ji}$ are the weights connecting input $i$ to neuron $j$ in the first hidden layer.
Output Layer
For the output layer, if we have $K$ outputs, the output for each neuron $k$ in the output layer can be represented as:

$$z_k = \sum_{j=1}^{m} W^{(2)}_{kj} h_j$$

where $m$ is the number of neurons in the hidden layer, and $W^{(2)}_{kj}$ are the weights connecting hidden-layer neuron $j$ to output neuron $k$.

Notice that $z_k$ represents the logit value for a neuron, which comes from a row of the matrix $W^{(2)}$. If we iterate over every output neuron, we can store each value in a vector $z$. This can thus be represented simply as a linear transformation! We will show more in the next section.
To underscore: we will reference this indexing later in this write-up.
Defining Operations Concretely
Suppose you have an input $x \in \mathbb{R}^n$ and weights $W \in \mathbb{R}^{m \times n}$ (and biases $b$, which are common, but we leave them out here). The operations in the layer can be described as:
- Linear Transformation: $z = Wx$
- ReLU Activation: $a = \mathrm{ReLU}(z)$, where $\mathrm{ReLU}(z_i) = \max(0, z_i)$
- NLL Loss: $L = -\log(p_y)$, where $p = \mathrm{softmax}(z)$ and $y$ is the target class index
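The three operations can be sketched directly in pure Python. This is a minimal illustration, not the repo's implementation; the helper names (`linear`, `relu`, `nll_loss`) are my own.

```python
# Minimal sketches of the three layer operations (no biases).
import math

def linear(W, x):
    # z = W x, with W an m*n list of rows and x a length-n list
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def relu(z):
    # a_i = max(0, z_i), applied element-wise
    return [max(0.0, z_i) for z_i in z]

def nll_loss(z, y):
    # L = -log(p_y) with p = softmax(z); y is the target class index
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(z_i - m) for z_i in z]
    p = [e / sum(exps) for e in exps]
    return -math.log(p[y])
```

For example, `nll_loss([0.0, 0.0], 0)` gives $-\log(0.5) \approx 0.693$, since uniform logits produce a uniform softmax.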
From our previous example, we can cleanly represent our input-to-output map as a linear transformation, rather than a cumbersome summation. We will reference this from now on for the forward pass of the MLP, and come back to this notation later when things get more tricky.
Example of a Forward Pass
We will show an example of a forward pass that works on a simple network with two layers: two hidden neurons, three inputs, and two outputs.

Assume for our example, we have:

Input $x \in \mathbb{R}^3$

Weights $W^{(1)} \in \mathbb{R}^{2 \times 3}$, $W^{(2)} \in \mathbb{R}^{2 \times 2}$

Label $y$
The Forward Pass
Layer 1: Input $x$, weights $W^{(1)}$

- Linear transformation: $z^{(1)} = W^{(1)} x$
- Activation: $a^{(1)} = \mathrm{ReLU}(z^{(1)})$

Output Layer: Input $a^{(1)}$, weights $W^{(2)}$

- Linear transformation: $z^{(2)} = W^{(2)} a^{(1)}$
- Activation: $a^{(2)} = \mathrm{ReLU}(z^{(2)})$

Computing the Loss: Input activations $a^{(2)}$, label $y$

- Loss computation: $L = -a^{(2)}_y$
- $y$ here represents the class label index of the input example. As a result, the possible values must be an index into the shape of $a^{(2)}$, e.g. if $a^{(2)} \in \mathbb{R}^2$, then $y \in \{0, 1\}$
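The forward pass above can be run end to end in a few lines of pure Python. The concrete weights, input, and label below are illustrative placeholders of mine (the 3-input, 2-hidden, 2-output shapes match the example, but the numbers do not come from the text).

```python
# Forward pass through the two-layer example (3 inputs, 2 hidden neurons,
# 2 outputs). All concrete numbers are illustrative placeholders.
def linear(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(z):
    return [max(0.0, v) for v in z]

x = [1.0, 2.0, -1.0]              # input (assumed values)
W1 = [[0.5, -0.2, 0.1],           # 2x3 first-layer weights (assumed)
      [0.3, 0.8, -0.5]]
W2 = [[1.0, -1.0],                # 2x2 output-layer weights (assumed)
      [0.25, 0.75]]
y = 1                             # target class index (assumed)

z1 = linear(W1, x)                # z1 = W1 x
a1 = relu(z1)                     # a1 = ReLU(z1)
z2 = linear(W2, a1)               # z2 = W2 a1
a2 = relu(z2)                     # a2 = ReLU(z2)
loss = -a2[y]                     # raw-logit NLL, as used in this write-up
```

Tracing it by hand: $z^{(1)} = (0, 2.4)$, $a^{(1)} = (0, 2.4)$, $z^{(2)} = (-2.4, 1.8)$, $a^{(2)} = (0, 1.8)$, so the loss is $-1.8$.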
Quality of Life Reformulations
Given how we have set up the premise, we can go ahead and use our previous mathematical foundation to reformulate some specific parts of our forward pass and the operations themselves. Why do we do this? We want to make it easier to convince oneself of how these models update, or learn (more on this in the gradients and backpropagation section).
Representing MLP transformations in a more convenient Way
We currently represent each linear transformation for each layer as a matrix-vector multiplication. We find this quite cumbersome to handle when we want to do operations later on, specifically when trying to differentiate our composed function calls.

We can represent a linear transformation as:

$$f_W(x) = Wx$$

Now, our function $f_W$ is not necessarily multivalued, but represents a frozen transformation on vector-valued inputs. Looking back to our refresher on vector-valued functions, this means we can focus on differentiating $f_W$, the linear transformation, directly. This also means we can invoke this function for inputs that are not just vectors, but matrices or tensors, representing collections of inputs.
Deriving Gradients of our MLP
Neural networks learn by updating their weights against the errors of the network relative to the correct answer. To determine what might need to change in the model's parameters (weights, for us), the training algorithm performs what is called a backward pass on the network.
The backward pass in a neural network, particularly for a Multi-Layer Perceptron (MLP) with ReLU (Rectified Linear Unit) activation, involves computing the gradient of the loss function with respect to the weights. This process allows for the updating of parameters via optimization algorithms like gradient descent. Here's how it works for a MLP layer with ReLU activation:
Backward Pass (Reverse Mode)
The gradients need to be computed with respect to $W$. Assuming that the gradient of the loss with respect to the output of this layer is known (denoted as $\frac{\partial L}{\partial a}$), the gradients can be calculated as follows:

- Gradient through ReLU:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \odot \mathrm{ReLU}'(z)$$

Here, $\mathrm{ReLU}'(z)$ is the derivative of ReLU, which is $1$ for $z > 0$ and $0$ otherwise. This can be proved simply by:

$$\mathrm{ReLU}(z) = \begin{cases} z & z > 0 \\ 0 & z \le 0 \end{cases}$$

Moving the derivative inside the piecewise:

$$\mathrm{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}$$

- Gradient w.r.t. Weights:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \, x^T$$
Gradient through NLL Loss w.r.t. logits $z$:

Given that NLL Loss takes as input a vector $a \in \mathbb{R}^m$ and returns a scalar $L$, we can apply a similar proof strategy to define a Jacobian that will map some loss back to the given logits in $z$.

First we just define roughly the inner function with respect to potential input logits:

$$\mathrm{NLL}_y(a) = -a_y$$

where we are fixing $y$ as some index baked into the function (like a closure).

Then the Jacobian can be seen as $D\,\mathrm{NLL}_y \in \mathbb{R}^{1 \times m}$ such that:

$$\frac{\partial L}{\partial a_i} = \begin{cases} -1 & i = y \\ 0 & \text{otherwise} \end{cases}$$

In our implementation, we end up simply using $-1$, mainly due to the fact that, when simplifying the gradients, we see:

Where the loss is $L = -\log(p_y)$, taking the derivative of this function results in $-\frac{1}{p_y}$ for the target index. Since in our case NL_Loss is the last layer and our chain rule starts with this value, we can drop the $\log$ and allow gradients to flow directly from this function. Note, our loss value will be different numerically, because we are going to be computing raw logit outputs and not softmax'd outputs. But, for the sake of gradient flow back, it will be equivalent, as the gradients agree up to the leading multiplicative term, which is $1$ at the start of the chain (i.e. $\frac{\partial L}{\partial L} = 1$).
We want to derive the loss with respect to the logits and understand what the derivative actually looks like when not on raw logits.

Recall: $L = -\log(p_y)$, where $p$ actually represents values after a softmax function is applied. Separately, a softmax function can be written as:

$$p_i = \frac{e^{z_i}}{\sum_{k} e^{z_k}}$$

We want to get $\partial p / \partial z$ (the derivative of softmax):

$$\frac{\partial p_i}{\partial z_j} = p_i \left( \delta_{ij} - p_j \right)$$

where $\delta_{ij}$ is $1$ if $i = j$ and $0$ otherwise. Then we want to find $\frac{\partial L}{\partial z_j}$, which can be represented as:

$$\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial z_j} = -\frac{1}{p_y} \, p_y \left( \delta_{yj} - p_j \right) = p_j - \delta_{yj}$$
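The well-known "softmax minus one-hot" result can be verified numerically. This sketch perturbs each logit and compares the finite-difference gradient of $-\log(\mathrm{softmax}(z)_y)$ against $p_j - \delta_{yj}$; the logits and target index are illustrative.

```python
# Check dL/dz_j = p_j - delta_{yj} for L = -log(softmax(z)_y).
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, y):
    return -math.log(softmax(z)[y])

z = [0.2, -1.0, 0.5]   # illustrative logits
y = 2                  # illustrative target index
p = softmax(z)
analytic = [p_j - (1.0 if j == y else 0.0) for j, p_j in enumerate(p)]

h = 1e-6
for j in range(len(z)):
    zp = list(z); zp[j] += h
    zm = list(z); zm[j] -= h
    numeric = (loss(zp, y) - loss(zm, y)) / (2 * h)
    assert abs(numeric - analytic[j]) < 1e-6
```

A useful sanity check: the analytic gradient entries sum to zero, since the softmax probabilities sum to one.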
Deriving the gradients w.r.t. the weights
To understand the geometric interpretation and prove the derivative of a loss function $L$ with respect to a weights matrix $W$ in the context of neural networks, we primarily use concepts from multivariable calculus and matrix calculus. The proof here will follow the fundamental principle that the derivative of a function at a point gives the best linear approximation to the function at that point, indicating the direction and rate of steepest ascent from that point.
Setting Up the Scenario
Assume $L$ is a scalar loss function that depends on the output $y$ of a neural network, where $y$ itself is a function of the weights matrix $W$. Suppose the neural network follows the forward pass as described before:

$$y = \sigma(Wx)$$

Here, $\sigma$ represents the activation function applied element-wise, and $x$ is the input vector. The output $y$ thus depends on the weights matrix $W$.
The Geometric Definition of Derivatives
In the context of functions from $\mathbb{R}^n$ to $\mathbb{R}$, the derivative is a linear map that best approximates the change in the function near a point. For vector-valued functions $f: \mathbb{R}^n \to \mathbb{R}^m$, the derivative at a point is represented by the Jacobian matrix, which consists of all first-order partial derivatives as described before.

For our case, $L(\sigma(Wx))$, the derivative with respect to the matrix $W$ (which is itself a tensor) needs to capture how infinitesimal changes in the elements of $W$ affect the change in $L$. We also know that $Wx$ can be represented as a linear map $f_W(x) = Wx$. We can treat $f_W$ as a function.
Derivative Computation
- Recollect the forward pass as:

$$z = Wx, \quad y = \sigma(z), \quad L = L(y)$$

- Applying the Chain Rule: The derivative of $L$ with respect to $W$ involves applying the chain rule through the layers of operations:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial W}$$

- $\frac{\partial L}{\partial y}$ is the gradient of the loss with respect to the output $y$, which depends on the specific loss function used.
- $\frac{\partial y}{\partial z} = \sigma'(z)$ is the derivative of the activation function $\sigma$, evaluated element-wise at $z$.
- $\frac{\partial z}{\partial W}$ captures how changes in $W$ affect $z$; for a given element $z_i = \sum_j W_{ij} x_j$, the derivative with respect to each element $W_{ij}$ is simply $x_j$.
- Jacobian to Matrix Form: As before, $\frac{\partial z_i}{\partial W_{ij}} = x_j$: the derivative with respect to each element $W_{ij}$ is simply $x_j$. Representing $z$ as a vector-valued function would first look like:

$$z(W) = (z_1(W), \dots, z_m(W))$$

where the function $z$ is built from functions $z_i$ that come from $\mathbb{R}^{m \times n} \to \mathbb{R}$ such that $z_i(W) = \sum_j W_{ij} x_j$. Furthermore, we can think of each such function as:

$$z_i(W) = W_{i1} x_1 + W_{i2} x_2 + \dots + W_{in} x_n$$

Now looking at $z_i$, we want to build the differential very similar to our derivation in the refresher,

except we want to take the partial derivatives with respect to the entries $W_{kj}$, instead of $x_k$.

And here, we replace $f_i$ with our indexed functions $z_i$ from above.

Notice that we replaced $f$ with $z$, as $z$ in our original derivation represents the output as well, so we make that substitution here.

Now, to map it back to our partial derivative:

$$\frac{\partial z_i}{\partial W_{kj}} = \begin{cases} x_j & i = k \\ 0 & i \neq k \end{cases} = \delta_{ik} \, x_j$$

Here we substitute in $z_i(W) = \sum_j W_{ij} x_j$, as shown before.

Note that $\frac{\partial z}{\partial W}$ is a function, representing the infinitesimal changes of the elements of $W$ with respect to the output $z$ at some point $W_0$.

We can now map it to the Jacobian, given that we have each function $z_i$:

$$\left( \frac{\partial z}{\partial W} \right)_{i,\,kj} = \delta_{ik} \, x_j$$

Each entry here directly maps to a value in the original summation defined: expanding in matrix terms, we have that each component $\frac{\partial z_i}{\partial W_{ij}} = x_j$.

Note as well that we can expand out $z = Wx$ the other way: $f_W(x) = Wx$ can be used to differentiate w.r.t. either $W$ or $x$, so that:

$$\frac{\partial z}{\partial x} = W$$

Thus, we can transform the above Jacobian representation in terms of matrix operations:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \, x^T$$

Here, $\frac{\partial L}{\partial z}$ is treated as a column vector and $x^T$ is the transpose of the input vector $x$ (a row vector), so their outer product has the same shape as $W$.
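The outer-product form of the weight gradient can be checked numerically. Below, I pick the assumed scalar loss $L = \sum_i z_i$ (so that $\frac{\partial L}{\partial z_i} = 1$ for every $i$) and compare $\frac{\partial L}{\partial z}\,x^T$ against finite differences on each entry of $W$; all concrete values are illustrative.

```python
# Check dL/dW = (dL/dz) x^T for z = W x and the assumed loss L = sum(z).
def linear(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def loss(W, x):
    return sum(linear(W, x))   # so dL/dz_i = 1 for every i

x = [0.5, -1.0, 2.0]           # illustrative input
W = [[0.1, 0.2, 0.3],          # illustrative 2x3 weights
     [0.4, 0.5, 0.6]]
dL_dz = [1.0, 1.0]

# outer product: (dL/dW)_{ij} = (dL/dz)_i * x_j
grad = [[d * xj for xj in x] for d in dL_dz]

h = 1e-6
for i in range(2):
    for j in range(3):
        Wp = [row[:] for row in W]; Wp[i][j] += h
        Wm = [row[:] for row in W]; Wm[i][j] -= h
        numeric = (loss(Wp, x) - loss(Wm, x)) / (2 * h)
        assert abs(numeric - grad[i][j]) < 1e-6
```

Note how every row of the gradient is the same here (a scaled copy of $x$), exactly because each $\frac{\partial L}{\partial z_i}$ is $1$; with a real loss the rows are scaled by the per-neuron error terms instead.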
Generalizing to an Arbitrarily Deep and Wide MLP
Network Setup and Notation
Assume a neural network with $N$ layers. Each layer $l$ has:
- A weights matrix $W^{(l)}$
- An activation function $\sigma^{(l)}$
- An input $a^{(l-1)}$
- An output $a^{(l)} = \sigma^{(l)}(z^{(l)})$, where $z^{(l)} = W^{(l)} a^{(l-1)}$ and $a^{(0)} = x$
The output of each layer serves as the input to the next layer.
Forward Pass
- Input Layer: $a^{(0)} = x$ is the input to the network.
- Hidden Layers and Output: For each layer $l = 1, \dots, N$:

$$z^{(l)} = W^{(l)} a^{(l-1)}, \quad a^{(l)} = \sigma^{(l)}(z^{(l)})$$

- Final Output: The output of the last layer, $a^{(N)}$, is used to compute the loss based on a target label $y$, i.e., $L = L(a^{(N)}, y)$.
Backward Pass (General Case)
To compute the gradient for each layer's weights $W^{(l)}$, you apply the chain rule in reverse order from the output back to the inputs:

- Output Layer Gradient:

$$\delta^{(N)} = \frac{\partial L}{\partial a^{(N)}} \odot \sigma'^{(N)}(z^{(N)})$$

- Backpropagation Through Layers: For each layer $l$ from $N-1$ down to $1$, compute:

$$\delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot \sigma'^{(l)}(z^{(l)})$$

where $\sigma'^{(l)}$ is the derivative of the activation function at layer $l$.

If $\sigma^{(l)} = \mathrm{ReLU}$, then:

$$\sigma'^{(l)}(z^{(l)})_i = \begin{cases} 1 & z^{(l)}_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

- Gradient w.r.t. Weights: For each layer $l$, compute:

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \left( a^{(l-1)} \right)^T$$
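The general recursion above can be sketched as a short pure-Python loop. This is a minimal illustration, assuming ReLU at every layer and the raw-logit loss $L = -a_y$ used in this write-up; the weights and input in the example run are placeholders of mine, not the repo's values.

```python
# Sketch of the general backward pass: ReLU layers, raw-logit NLL (L = -a_y).
def linear(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(z):
    return [max(0.0, v) for v in z]

def relu_grad(z):
    return [1.0 if v > 0 else 0.0 for v in z]

def forward(weights, x):
    # Returns pre-activations z^{(l)} and activations a^{(0)}..a^{(N)}.
    zs, activations = [], [x]
    for W in weights:
        z = linear(W, activations[-1])
        zs.append(z)
        activations.append(relu(z))
    return zs, activations

def backward(weights, zs, activations, y):
    # delta at the output: dL/da^{(N)} = -1 at the target index (L = -a_y)
    delta = [0.0] * len(activations[-1])
    delta[y] = -1.0
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        # push dL/da through the ReLU at layer l
        delta = [d * g for d, g in zip(delta, relu_grad(zs[l]))]
        # dL/dW^{(l)} = delta (a^{(l-1)})^T  (outer product)
        grads[l] = [[d * a for a in activations[l]] for d in delta]
        if l > 0:
            # delta for the previous layer: (W^{(l)})^T delta
            delta = [sum(weights[l][i][j] * delta[i]
                         for i in range(len(delta)))
                     for j in range(len(weights[l][0]))]
    return grads

# Illustrative run on 3-input, 2-hidden, 2-output shapes (placeholder numbers).
weights = [[[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],
           [[1.0, -1.0], [0.25, 0.75]]]
zs, acts = forward(weights, [1.0, 2.0, -1.0])
grads = backward(weights, zs, acts, y=1)
```

Each `grads[l]` has the same shape as `weights[l]`, so a gradient-descent step is just an element-wise update `W -= lr * grad`.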
Summary
The backward pass effectively determines how the weights in each layer should be adjusted to minimize the loss, accounting for how each layer's output influences the loss through subsequent layers. Each gradient $\frac{\partial L}{\partial W^{(l)}}$ points in the direction of greatest increase of the loss as a function of the weights in layer $l$. By updating the weights in the opposite direction of this gradient, we move towards reducing the loss, implementing the essence of gradient descent.
These derivations are used to compute how we handle backpropagation in our MLP example in the modeling/ directory. network.py shows an example of how you can forward-pass through the network, and how one can backward-pass the gradients through an MLP with ReLU activation and Negative Log-Likelihood loss.