Cindered Thoughts

Derivations of Gradients w.r.t. Neural Networks

This is a refresher on deriving backpropagation for MLPs with NLL loss and softmax. It builds from vector calculus and basic linear algebra to derive key properties of a basic neural network. This was originally done as an exercise and has become a valuable resource for me to reason about derivatives. I hope it is useful for others, and I will always be open to improving this resource.

This should be a good resource for folks familiar with high school calculus and roughly familiar with vectors. Most of the notation should be searchable and accurately reflects the dimensions of the vectors and matrices. Don't be too intimidated; work through it slowly :)


Refreshers on Derivatives of Vector Valued Functions

A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $a \in \mathbb{R}^n$ if there is an $m \times n$ matrix $A$ such that:

$$\lim_{x \to a} \frac{|f(x) - f(a) - A \cdot (x - a)|}{|x - a|} = 0$$

If such a matrix exists, the matrix $A$ is denoted by $Df(a)$ and is called the Jacobian.

Note that $|x - a|$ is the distance metric defined by the Euclidean distance $\sqrt{(x_1 - a_1)^2 + (x_2 - a_2)^2 + \cdots + (x_n - a_n)^2}$ and is a real-valued scalar.

Formally, this can be derived from the general definition of a derivative:

$$f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a}$$

This holds if and only if:

$$0 = \lim_{x \to a} \left( \frac{f(x) - f(a)}{x - a} - f'(a) \right)$$

which can be transformed to:

$$0 = \lim_{x \to a} \frac{f(x) - f(a) - f'(a)(x - a)}{x - a}$$

and thus, taking the distance of the numerator and denominator from the origin:

$$0 = \lim_{x \to a} \frac{|f(x) - f(a) - f'(a)(x - a)|}{|x - a|}$$

(since the notion of division of two vectors is silly)

Here $f'(a)$ represents our Jacobian matrix of shape $m \times n$. I will refer to this matrix as $Df(x)$ from now on, though in the example above it is being evaluated at the point $a$ in vector space.

Defining the Jacobian in terms of coordinates and Indices

Definitions of Jacobians via multiple functions of f

Let the function $f : \mathbb{R}^n \to \mathbb{R}^m$ be given by the $m$ differentiable functions $f_1(x_1, \ldots, x_n), \ldots, f_m(x_1, \ldots, x_n)$ such that:

$$f(x_1, \ldots, x_n) = \begin{bmatrix} f_1(x_1, \ldots, x_n) \\ \vdots \\ f_m(x_1, \ldots, x_n) \end{bmatrix}$$

Since we can represent $f$ as a family of functions indexed from $1$ to $m$, we can take the derivative of each function $f_i$ for $i \in \{1, \ldots, m\}$:

$$Df_i(x_1, \ldots, x_n) \to \hat{v_i} \quad \text{such that } v_i \in \mathbb{R}^n$$

In this case, we know $v_i$ to be $n$-dimensional because of our original formulation of the derivative of vector-valued functions. Note that $f'(a)$ is itself another function, $Df(a)$ (a linear map of sorts). We need each $i$th row of $A$ to represent a tangent line, where that tangent line takes as input the displacement vector $x - a$.

Similar to our single valued derivative case (grade school calculus):

$$y = f(a) + f'(a)(x - a)$$

where the above is the tangent line of some differentiable function $y = f(x)$ at some point $a$ (note that this function is linear),

we want to build this same representation for a multivalued function $f_i(x_1, \ldots, x_n)$. Thus we rely on partial derivatives to represent individual linear tangent lines with respect to each individual input $x_j$ where $j \in \{1, \ldots, n\}$. We can then represent each row of $Df(x)$ as:

$$Df_i(x_1, \ldots, x_n) = \left( \frac{\partial f_i}{\partial x_1}, \frac{\partial f_i}{\partial x_2}, \ldots, \frac{\partial f_i}{\partial x_n} \right)$$

So as a result, we can say that $Df_i(x_1, \ldots, x_n)$, when applied to the elements of $x - a$, represents the linear approximation of small wiggles on the vector-valued function $f$, and that this approximation is approximately zero in distance from the exact difference $f(x) - f(a)$.

As a result, we can expand this out to each function $f_1, \ldots, f_m$ in $f$ so that our Jacobian $Df(x_1, \ldots, x_n)$ is:

$$Df(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
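To make this concrete, here is a small numerical sketch (not from the original text): a finite-difference approximation of the Jacobian of a toy map $f : \mathbb{R}^2 \to \mathbb{R}^2$, compared against the matrix of partial derivatives above. The function and evaluation point are made up for illustration.

```python
# Checking the Jacobian definition numerically for a toy map
# f(x, y) = (x*y, x + y^2), whose analytic Jacobian is [[y, x], [1, 2y]].
import numpy as np

def f(v):
    x, y = v
    return np.array([x * y, x + y ** 2])

def jacobian_fd(func, a, h=1e-6):
    """Finite-difference approximation of Df(a): one column per input x_j."""
    a = np.asarray(a, dtype=float)
    cols = []
    for j in range(a.size):
        e = np.zeros_like(a)
        e[j] = h
        cols.append((func(a + e) - func(a - e)) / (2 * h))
    return np.stack(cols, axis=1)  # shape (m, n)

a = np.array([2.0, 3.0])
J_analytic = np.array([[3.0, 2.0],   # [[y, x],
                       [1.0, 6.0]])  #  [1, 2y]] evaluated at (2, 3)
assert np.allclose(jacobian_fd(f, a), J_analytic, atol=1e-4)
```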

Defining coordinates from domain and codomain dimensions

In our case, we will want a convenient notation for handling indices in our Jacobian matrix, and for tracking our output vector's dimensions along with our input vector's dimensions.

Defining Differentials of Compositions of Functions

Let $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}^l$ be differentiable functions. Also there is a composition function:

$$g \circ f : \mathbb{R}^n \to \mathbb{R}^l$$

that is also differentiable, with a derivative given as follows: if $f(a) = b$, then

$$D(g \circ f)(a) = Dg(b) \cdot Df(a)$$

by the chain rule.
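The chain rule above can also be checked numerically. This is a small sketch with made-up functions $f$ and $g$, comparing a finite-difference Jacobian of the composition against the product $Dg(b) \cdot Df(a)$:

```python
# Numeric check of D(g∘f)(a) = Dg(f(a)) · Df(a) on two toy maps.
import numpy as np

def f(v):  # R^2 -> R^2
    return np.array([v[0] ** 2, v[0] + v[1]])

def g(w):  # R^2 -> R^2
    return np.array([w[0] * w[1], w[0] - w[1]])

def jacobian_fd(func, a, h=1e-6):
    """Finite-difference Jacobian: column j perturbs input j."""
    a = np.asarray(a, dtype=float)
    return np.stack([(func(a + h * e) - func(a - h * e)) / (2 * h)
                     for e in np.eye(a.size)], axis=1)

a = np.array([1.5, -0.5])
lhs = jacobian_fd(lambda v: g(f(v)), a)         # D(g∘f)(a) directly
rhs = jacobian_fd(g, f(a)) @ jacobian_fd(f, a)  # Dg(b) · Df(a), with b = f(a)
assert np.allclose(lhs, rhs, atol=1e-4)
```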


Building a Neural Network from an MLP with ReLU and NLL Loss

A Multi-Layer Perceptron (MLP) with ReLU (Rectified Linear Unit) activation is a common neural network architecture, here paired with a Negative Log-Likelihood loss (NLL Loss) for handling multiclass output.

What does an MLP represent mathematically?

An MLP is a neural network architecture where we have layers of perceptrons (sometimes called neurons) which can be seen as connected groups of bipartite graphs, such that each layer of nodes (neurons) is not connected to any other node in its layer. Each neuron receives inputs from all neurons in the previous layer. More specifically, each neuron in each layer multiplies each incoming input by a weight, and these products are all summed together. Each neuron in the layer then passes its summed value to an activation function (such as the ReLU in our example). This will be our neuron's final output.

Let's Define it Better!

As noted, we often model it as a series of layers where each layer is composed of neurons. Each neuron computes a weighted sum of its inputs, which is then passed through an activation function. Here's a simplified LaTeX representation that captures these components across multiple layers:

Consider an MLP with one hidden layer and an output layer. The model consists of an input vector, hidden layers with activation functions, and an output layer. We will denote:

Hidden Layer

Let's say the MLP has one hidden layer. The output of this layer for each neuron $j$ in the hidden layer can be represented as:

$$a_j^{(1)} = \sigma\left( \sum_{i=1}^{n} W_{ji}^{(1)} x_i \right)$$

where $n$ is the number of inputs to each neuron, and $W_{ji}^{(1)}$ are the weights connecting input $i$ to neuron $j$ in the first hidden layer.

Output Layer

For the output layer, if we have k outputs, the output for each neuron k in the output layer can be represented as:

$$y_k = \sigma\left( \sum_{j=1}^{m} W_{kj}^{(2)} a_j^{(1)} \right)$$

where $m$ is the number of neurons in the hidden layer, and $W_{kj}^{(2)}$ are the weights connecting hidden-layer neuron $j$ to output neuron $k$.

Notice that the weighted sum for neuron $j$ is the dot product of the input vector with row $j$ of the matrix $W^{(l)}$. If we iterate over every neuron, we can store each value in a vector $[a_1, a_2, a_3, \ldots, a_k]$. The whole layer can thus be represented as a simple linear transformation! We will show more in the next section.

To underscore: we will reference this indexing later in this writeup.
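This equivalence is easy to see in code. A minimal sketch (weights, inputs, and activation chosen arbitrarily) comparing the per-neuron summation against a single matrix-vector product:

```python
# The per-neuron weighted sums are exactly one matrix-vector product:
# row j of W dotted with the input x, then passed through the activation.
import numpy as np

sigma = np.tanh                      # any elementwise activation works here
x = np.array([0.5, -1.0, 2.0])       # n = 3 inputs
W = np.array([[1.0, 0.0, 0.5],       # m = 2 neurons, one row of weights each
              [-0.5, 1.0, 1.0]])

# Neuron by neuron, following a_j = sigma(sum_i W_ji * x_i):
a_loop = np.array([sigma(sum(W[j, i] * x[i] for i in range(3)))
                   for j in range(2)])

# The same layer as a single linear transformation:
a_vec = sigma(W @ x)
assert np.allclose(a_loop, a_vec)
```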


Defining Operations Concretely

Suppose you have an input $\mathbf{x}$ and weights $\mathbf{W}$ (and biases $\mathbf{b}$, which are common, but we leave them out). The operations in the layer can be described as:

  1. Linear Transformation: $\mathbf{z} = W\mathbf{x}$
  2. ReLU Activation: $\mathbf{a} = \mathrm{ReLU}(\mathbf{z})$, where $\mathrm{ReLU}(z_i) = \max(z_i, 0)$
  3. NLL Loss: $\mathbf{L} = \mathrm{NLLLoss}(\mathbf{z}) = -\log(z_t)$

From our previous example, we can cleanly represent our input-to-output mapping as a linear transformation, rather than a cumbersome summation. We will reference this from now on for the forward pass of the MLP, and come back to this notation later when things get more tricky.

Example of a Forward Pass

We will show an example of a forward pass that works on a simple network, with two layers of two neurons each, three inputs, and two outputs.

Assume for our example, we have:

The Forward Pass

  1. Layer 1: Input $\mathbf{x}$, weights $\mathbf{W}_1$

    • Linear transformation: $\mathbf{z}_1 = \mathbf{W}_1 \mathbf{x}$
    • Activation: $\mathbf{a}_1 = \mathrm{ReLU}(\mathbf{z}_1)$
  2. Output Layer: Input $\mathbf{a}_1$, weights $\mathbf{W}_2$

    • Linear transformation: $\mathbf{z}_2 = \mathbf{W}_2 \mathbf{a}_1$
    • Activation: $\mathbf{a}_2 = \mathrm{ReLU}(\mathbf{z}_2)$
  3. Computing the Loss: Input activations $\mathbf{a}_2$, label $t$

    • Loss computation: $\mathbf{L} = -\log(\mathbf{a}_t)$, where $\mathbf{a}_t$ is the element at index $t$ of $\mathbf{a}_2$
    • $t$ here represents the class label index of the input example. As a result, the possible values must index into the shape of $\mathbf{a}_2$; e.g. if $\mathbf{a}_2 \in \mathbb{R}^3$, then $t \in \{1, \ldots, 3\}$
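The steps above can be sketched in NumPy. The weight values and label index here are made up for illustration:

```python
# A concrete forward pass through the two-layer example network.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

x = np.array([1.0, 2.0, 3.0])            # three inputs
W1 = np.array([[1.0, 0.0, -1.0],         # layer 1: 3 -> 2
               [0.5, 0.5, 0.5]])
W2 = np.array([[1.0, -1.0],              # layer 2: 2 -> 2 outputs
               [0.5, 1.0]])

z1 = W1 @ x          # [-2.0, 3.0]
a1 = relu(z1)        # [ 0.0, 3.0]
z2 = W2 @ a1         # [-3.0, 3.0]
a2 = relu(z2)        # [ 0.0, 3.0]

t = 1                                    # class label index into a2
loss = -np.log(a2[t])                    # note: a ReLU output can be 0, so in
                                         # practice the log is taken on
                                         # (log-)softmax values instead
```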

Quality of Life Reformulations

Given that we have set up the premise, we can go ahead and use our previous mathematical foundation to reformulate some specific parts of our forward pass and the operations themselves. Why do we do this? We want to make it easier to convince oneself of how these models update, or learn (more on this in the gradients and backpropagation section).

Representing MLP transformations in a more convenient Way

We currently represent each layer's linear transformation as a matrix-vector multiplication. This is quite cumbersome to handle in later operations, specifically when trying to differentiate our composed function calls.

We can represent a linear transformation $\mathbf{z}^{l} = \mathbf{W}^{l} \mathbf{x}$ as:

$$f_{\mathbf{W}^{l}}(\mathbf{x}) \quad \text{where } f_{\mathbf{W}^{l}} : \mathbb{R}^n \to \mathbb{R}^m$$

Now, our function is not necessarily multivalued, but represents a frozen transformation on vector valued inputs. Looking back to our refresher on vector valued functions, this means we can focus on differentiating f, the linear transformation directly. This also means we can invoke this function for inputs that are not just vectors, but matrices or tensors, representing collections of inputs.


Deriving Gradients of our MLP

Neural networks learn by updating their weights against the errors of the network relative to the correct answer. To determine what might need to change in the model's parameters (the weights $W^{(l)}$ for us), the training algorithm performs what is called a backward pass on the network.

The backward pass in a neural network, particularly for a Multi-Layer Perceptron (MLP) with ReLU (Rectified Linear Unit) activation, involves computing the gradient of the loss function with respect to the weights. This process allows for the updating of parameters via optimization algorithms like gradient descent. Here's how it works for a MLP layer with ReLU activation:

Backward Pass (Reverse Mode)

The gradients need to be computed with respect to $\mathbf{W}$. Assuming that the gradient of the loss $L$ with respect to the output of this layer $\mathbf{a}$ is known (denoted $\frac{\partial L}{\partial \mathbf{a}}$), the gradients can be calculated as follows:

  1. Gradient through ReLU:
$$\frac{\partial L}{\partial \mathbf{z}} = \frac{\partial L}{\partial \mathbf{a}} \odot \mathrm{ReLU}'(\mathbf{z})$$

Here, $\mathrm{ReLU}'(z_i)$ is the derivative of ReLU, which is $1$ for $z_i > 0$ and $0$ otherwise. This can be proved simply by:

$$\frac{\partial\, \mathrm{ReLU}}{\partial \mathbf{z}} = \frac{\partial}{\partial z_i} \begin{cases} z_i & z_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

Moving the derivative inside the piecewise definition:

$$\frac{\partial\, \mathrm{ReLU}}{\partial \mathbf{z}} = \begin{cases} \frac{\partial z_i}{\partial z_i} & z_i > 0 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 1 & z_i > 0 \\ 0 & \text{otherwise} \end{cases}$$
  2. Gradient w.r.t. Weights:
$$\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{z}} \cdot \mathbf{x}^T \qquad \frac{\partial L}{\partial \mathbf{b}} = \frac{\partial L}{\partial \mathbf{z}} \quad \text{(summing over the batch if needed)}$$
  3. Gradient through NLL Loss w.r.t. logits $z_i$:

    Given that the NLL loss takes a vector as input and returns a scalar, $\mathrm{NLL} : \mathbb{R}^c \to \mathbb{R}$, we can apply a similar proof strategy to define a Jacobian that maps some loss $L$ back to the given logits $z_i$.

    First, we define the inner function with respect to the input logits:

$$\frac{\partial\, \mathrm{NLL}_t}{\partial z_i} = \frac{\partial}{\partial z_i} \begin{cases} -\log(z_t) & t = i \\ 0 & i \neq t \end{cases} \qquad\Longrightarrow\qquad \frac{\partial\, \mathrm{NLL}_t}{\partial z_i} = \begin{cases} -\frac{1}{z_t} & t = i \\ 0 & i \neq t \end{cases}$$

where we fix some index $t$ baked into the function (like a closure).

Then the Jacobian can be seen as $D\,\mathrm{NLL}_t(z) : \mathbb{R}^c \to \mathbb{R}$ such that:

$$\begin{bmatrix} \frac{\partial\, \mathrm{NLL}_t}{\partial z_1} & \frac{\partial\, \mathrm{NLL}_t}{\partial z_2} & \cdots & \frac{\partial\, \mathrm{NLL}_t}{\partial z_n} \end{bmatrix}$$

In our implementation, we end up simply using loss = -input[target], mainly because, when simplifying the gradients, we see:

$$\mathrm{NLL}(x, y) = -\log(p_y) = -\log\left( \frac{e^{z_y}}{\sum_j e^{z_j}} \right) = -z_y + \log\left( \sum_j e^{z_j} \right)$$

Here the loss is $-z_y + \log(\sum_j e^{z_j})$; taking the derivative of the $-z_y$ term yields $-1$ for the target index. Since in our case the NLL loss is the last layer and our chain rule starts with this value, we can assume that we don't need the $-\log(p_z)$, and allow gradients to flow directly from this function. Note that our loss value will differ numerically, because we compute on raw logit outputs rather than softmaxed outputs. But for the sake of gradient flow backwards, it is equivalent, as $-\frac{1}{z_i}$ and $-1$ behave the same when the incoming multiplicative factor is $1$ (i.e. $1 \cdot -1$).
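This simplification can be sanity-checked numerically. The sketch below (logits made up) verifies that the gradient of $-z_y + \log(\sum_j e^{z_j})$ with respect to the logits is $\mathrm{softmax}(z)$ minus a one-hot vector at the target index:

```python
# Numeric check: d/dz [-z_y + log(sum_j e^{z_j})] = softmax(z) - onehot(y).
import numpy as np

def nll_from_logits(z, y):
    return -z[y] + np.log(np.sum(np.exp(z)))

z = np.array([1.0, -0.5, 2.0])
y = 2
p = np.exp(z) / np.sum(np.exp(z))        # softmax probabilities
onehot = np.eye(3)[y]

h = 1e-6
grad_fd = np.array([
    (nll_from_logits(z + h * e, y) - nll_from_logits(z - h * e, y)) / (2 * h)
    for e in np.eye(3)
])
assert np.allclose(grad_fd, p - onehot, atol=1e-5)
```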

We now want to derive the loss through the softmax, and understand what the derivative actually looks like when not on raw logits.

Recall $-\log(p_y)$, where $p_j$ represents values after a softmax function is applied. Separately, the softmax function can be written as:

$$p_y = \frac{e^{z_y}}{\sum_j e^{z_j}}$$

We want to get $\frac{\partial p}{\partial z}$ (the derivative of softmax):

Given $f_y(\mathbf{z}) = f_y(z_1, \ldots, z_n) = p_y = \frac{e^{z_y}}{\sum_j e^{z_j}}$ where $\mathbf{z} \in \mathbb{R}^n$, define a vector-valued function $f$:

$$f = \begin{bmatrix} f_1(z_1, \ldots, z_n) \\ f_2(z_1, \ldots, z_n) \\ \vdots \\ f_n(z_1, \ldots, z_n) \end{bmatrix}$$

we want to find $Df_i(\mathbf{z})$, which can be represented as:

$$Df_{ij}(z) \in \mathbb{R}, \qquad Df(z) \in \mathbb{R}^{n \times n}, \qquad Df_{ij} : \mathbb{R}^n \to \mathbb{R}$$

$$Df_{ij}(\mathbf{z}) = Df_{ij}(z_1, \ldots, z_n) = \frac{\partial}{\partial z_j} \frac{e^{z_i}}{\sum_k e^{z_k}}, \quad \text{with two cases: } i \neq j \text{ and } i = j$$

Case 1: where $i \neq j$,

$$\frac{\partial}{\partial z_j} \frac{e^{z_i}}{\sum_k e^{z_k}} = \frac{\partial}{\partial z_j}\, e^{z_i} \cdot \left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^{-1}$$

$$= e^{z_i} \cdot \left( -\left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^{-2} \right) \cdot e^{z_j} = -\frac{e^{z_i} \cdot e^{z_j}}{\left( \sum_k e^{z_k} \right)^2}$$

$$= -\frac{e^{z_i}}{\sum_k e^{z_k}} \cdot \frac{e^{z_j}}{\sum_k e^{z_k}} = -p_i \cdot p_j$$

Case 2: when $i = j$:

Note the derivative: $\frac{d}{dx} \frac{a}{g(x)} = \frac{d}{dx}\left( a \cdot g(x)^{-1} \right) = -a \cdot \frac{g'(x)}{g(x)^2}$

Using the product rule:

$$\frac{\partial}{\partial z_i} \frac{e^{z_i}}{\sum_k e^{z_k}} = \frac{\partial}{\partial z_i}\, e^{z_i} \cdot \left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^{-1}$$

$$= e^{z_i} \cdot \left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^{-1} + e^{z_i} \cdot \left( -\left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^{-2} \right) \cdot e^{z_i}$$

$$= e^{z_i} \left( \frac{1}{e^{z_1} + e^{z_2} + \cdots + e^{z_n}} - \frac{e^{z_i}}{\left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)^2} \right)$$

$$= \frac{e^{z_i}}{\sum_k e^{z_k}} \cdot \left( 1 - \frac{e^{z_i}}{\sum_k e^{z_k}} \right) = \mathrm{softmax}(z)_i \left( 1 - \mathrm{softmax}(z)_i \right) = p_i - p_i^2$$
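Both cases combine into the single expression $Df_{ij} = p_i(\delta_{ij} - p_j)$: $p_i - p_i^2$ on the diagonal and $-p_i \cdot p_j$ off it. A short numeric sketch (logits made up) checking this against finite differences:

```python
# Numeric check of the softmax Jacobian Df_ij = p_i * (delta_ij - p_j).
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shifted for numerical stability
    return e / e.sum()

z = np.array([0.2, -1.0, 1.5])
p = softmax(z)
J_analytic = np.diag(p) - np.outer(p, p)   # p_i*delta_ij - p_i*p_j

h = 1e-6
J_fd = np.stack([(softmax(z + h * e) - softmax(z - h * e)) / (2 * h)
                 for e in np.eye(3)], axis=1)  # column j perturbs z_j
assert np.allclose(J_fd, J_analytic, atol=1e-6)
```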

Deriving the gradients w.r.t. the weights W

To understand the geometric interpretation and prove the derivative of a loss function $L$ with respect to a weights matrix $W$ in the context of neural networks, we primarily use concepts from multivariable calculus and matrix calculus. The proof here follows the fundamental principle that the derivative of a function at a point gives the best linear approximation to the function at that point, indicating the direction and rate of steepest ascent from that point.

Setting Up the Scenario

Assume $L$ is a scalar loss function that depends on the output $y$ of a neural network, where $y$ itself is a function of the weights matrix $W$. Suppose the neural network is as described in the forward pass:

y=f(Wx)

Here, f represents the activation function applied element-wise, x is the input vector. The output y thus depends on the weights matrix W.

The Geometric Definition of Derivatives

In the context of functions from $\mathbb{R}^n \to \mathbb{R}^m$, the derivative is a linear map that best approximates the change in the function near a point. For vector-valued functions $F : \mathbb{R}^n \to \mathbb{R}^m$, the derivative at a point is represented by the Jacobian matrix, which consists of all first-order partial derivatives as described before.

For our case, $L : \mathbb{R}^m \to \mathbb{R}$, the derivative with respect to the matrix $W$ (which is itself a tensor) needs to capture how infinitesimal changes in the elements of $W$ affect the change in $L$. We also know that $W$ can be represented as a linear map $W : \mathbb{R}^n \to \mathbb{R}^m$, so we can treat $W$ as a function.

Derivative Computation

  1. Recollect the Forward Pass as:
$$z = W(x), \qquad y = f(z), \qquad L = \mathrm{Loss}(y, \mathrm{target})$$
  2. Applying the Chain Rule: The derivative of $L$ with respect to $W$ involves applying the chain rule through the layers of operations:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial W}$$
  3. Jacobian to Matrix Form: As before, $\frac{\partial z}{\partial W}$ captures $z_i = \sum_j W_{ij} x_j$, so the derivative with respect to each element $W_{ij}$ is simply $x_j$. Representing this as a vector-valued function would first look like:
$$W(\mathbf{x}) = W(x_1, x_2, \ldots, x_n) = \begin{pmatrix} W_1(x_1, x_2, \ldots, x_n) \\ W_2(x_1, x_2, \ldots, x_n) \\ \vdots \\ W_m(x_1, x_2, \ldots, x_n) \end{pmatrix}$$

where the function $W_j$ is one of the $m$ functions that come from $W$, such that $j \in \{1, \ldots, m\}$. Furthermore, we can think of a function $W_j$ as:

$$z_j = W_j(x_1, x_2, \ldots, x_n) = W_{j1} x_1 + W_{j2} x_2 + \cdots + W_{jn} x_n = \sum_k W_{jk} x_k$$

Now, looking at $\frac{\partial z}{\partial W}$, we want to build the differential $DW(x)$, very similar to our derivation in the refresher:

$$Df_i(x_1, \ldots, x_n) = \left( \frac{\partial f_i}{\partial x_1}, \frac{\partial f_i}{\partial x_2}, \ldots, \frac{\partial f_i}{\partial x_n} \right)$$

except we want to take the partial derivatives with respect to $W_{jk}$ instead of $x_k$:

$$Df_j(x_1, x_2, \ldots, x_n) = \left( \frac{\partial f_j}{\partial W_{j1}}, \frac{\partial f_j}{\partial W_{j2}}, \ldots, \frac{\partial f_j}{\partial W_{jn}} \right)$$

And here, we replace $f_j$ with our indexed functions from above, $W_j(\mathbf{x})$:

$$DW_j(x_1, x_2, \ldots, x_n) = \left( \frac{\partial z_j}{\partial W_{j1}}, \frac{\partial z_j}{\partial W_{j2}}, \ldots, \frac{\partial z_j}{\partial W_{jn}} \right)$$

Notice that we replaced $f_j$ with $z_j$, since in our original derivation $f_i$ represents the output as well; we make that substitution here.

Now, to map it back to our partial derivative $\frac{\partial z}{\partial W}$:

$$\frac{\partial z_j}{\partial W_j} = \frac{\partial\, W_j(x_1, x_2, \ldots, x_n)}{\partial W_j}$$

Here we substitute in $DW_j$ as shown before, $\frac{\partial z_j}{\partial W_j} = DW_j$:

$$\frac{\partial z_j}{\partial W_j} = DW_j(x_1, x_2, \ldots, x_n)$$

Note that $DW_j$ is a function, representing the infinitesimal changes of the elements of $W$ with respect to the output $z_j$ at some point $\mathbf{x}$.

We can now map it to the Jacobian, given that we have each function $W_j(\mathbf{x})$:

$$DW(\mathbf{x}) = \begin{bmatrix} \frac{\partial z_1}{\partial W_{11}} & \frac{\partial z_1}{\partial W_{12}} & \cdots & \frac{\partial z_1}{\partial W_{1n}} \\ \frac{\partial z_2}{\partial W_{21}} & \frac{\partial z_2}{\partial W_{22}} & \cdots & \frac{\partial z_2}{\partial W_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial z_m}{\partial W_{m1}} & \frac{\partial z_m}{\partial W_{m2}} & \cdots & \frac{\partial z_m}{\partial W_{mn}} \end{bmatrix}$$

Each entry here directly maps to a value in the summation defined earlier.

Expanding $\frac{\partial z}{\partial W}$ in matrix terms, we have that each component $\frac{\partial z_i}{\partial W_{ij}} = x_j$.

Note as well that we can expand out $\frac{\partial z}{\partial \mathbf{x}}$: since $z_i = \sum_j W_{ij} x_j$ can be differentiated w.r.t. either $W$ or $x$, we get $\frac{\partial z_i}{\partial x_j} = W_{ij}$.

Thus, we can transform the above Jacobian representation in terms of matrix operations:

$$\frac{\partial L}{\partial W} = \left( \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \right) \cdot x^T$$

Here, $\frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z}$ is treated as a column vector (one entry per output $z_i$), so multiplying by $x^T$, the transpose of the input vector $x$, yields an outer product with the same $m \times n$ shape as $W$.
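As a sketch of this shape bookkeeping (the gradient and input values are made up):

```python
# dL/dW as an outer product: with delta = dL/dz a column vector of length m
# and x the input of length n, delta · x^T has W's shape (m, n).
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # input, n = 3
delta = np.array([0.5, -1.0])        # dL/dz for m = 2 outputs (made-up values)

dL_dW = np.outer(delta, x)           # shape (2, 3), same as W
assert dL_dW.shape == (2, 3)
# Each entry is dL/dW_ij = delta_i * x_j:
assert dL_dW[1, 2] == delta[1] * x[2]
```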

Generalizing to an Arbitrarily Deep and Wide MLP

Network Setup and Notation

Assume a neural network with $K$ layers. Each layer $k$ has a weight matrix $W_k$ and an activation function $f_k$.

The output yk of each layer serves as the input xk+1 to the next layer.

Forward Pass

  1. Input Layer: $x_1$ is the input to the network.
  2. Hidden Layers and Output: For each layer $k$:
$$z_k = W_k x_k, \qquad y_k = f_k(z_k) \quad (\text{with } x_{k+1} = y_k \text{ for the next layer})$$
  3. Final Output: The output $y_K$ of the last layer is used to compute the loss $L$ based on a target label, i.e., $L = \mathrm{Loss}(y_K, \mathrm{target})$.

Backward Pass (General Case)

To compute the gradient LWk for each layer's weights Wk, you apply the chain rule in reverse order from the output back to the inputs:

  1. Output Layer Gradient:
$$\frac{\partial L}{\partial y_K} = \text{derivative of the loss function w.r.t. the output of the last layer}$$
  2. Backpropagation Through Layers: For each layer $k$ from $K$ down to $1$, compute:
$$\frac{\partial L}{\partial z_k} = \frac{\partial L}{\partial y_k} \odot f_k'(z_k)$$

where $f_k'(z_k)$ is the derivative of the activation function at layer $k$.

If $k < K$, then:

$$\frac{\partial L}{\partial y_k} = W_{k+1}^T \, \frac{\partial L}{\partial z_{k+1}}$$
  3. Gradient w.r.t. Weights: For each layer $k$, compute:
$$\frac{\partial L}{\partial W_k} = \frac{\partial L}{\partial z_k} \, x_k^T$$
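The general recursion above can be sketched end to end. This toy two-layer version uses the simplified loss $L = -z_K[t]$ from earlier and checks one weight gradient with a finite difference; the sizes, random weights, and the choice to skip ReLU on the final logits are assumptions for illustration:

```python
# End-to-end backward pass for a toy 2-layer ReLU network with L = -z_K[t].
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
W = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]  # K = 2 layers
x1 = rng.normal(size=3)
t = 0

def forward(W):
    xs, zs = [x1], []
    for k, Wk in enumerate(W):
        z = Wk @ xs[-1]
        zs.append(z)
        xs.append(relu(z) if k < len(W) - 1 else z)  # no ReLU on final logits
    return xs, zs, -xs[-1][t]

xs, zs, loss = forward(W)

# Backward pass: start with dL/dz_K, then alternate
# dL/dy_k = W_{k+1}^T dL/dz_{k+1} and dL/dz_k = dL/dy_k ⊙ f'(z_k).
grad_z = -np.eye(len(xs[-1]))[t]              # dL/dz_K for L = -z_K[t]
grads = [None] * len(W)
for k in reversed(range(len(W))):
    grads[k] = np.outer(grad_z, xs[k])        # dL/dW_k = dL/dz_k · x_k^T
    if k > 0:
        grad_y = W[k].T @ grad_z              # dL/dy_{k-1}
        grad_z = grad_y * (zs[k - 1] > 0)     # ReLU derivative mask

# Finite-difference check on one weight entry:
h = 1e-6
W_pert = [w.copy() for w in W]
W_pert[0][1, 2] += h
fd = (forward(W_pert)[2] - loss) / h
assert abs(fd - grads[0][1, 2]) < 1e-4
```

A one-sided difference suffices for the check because this loss is piecewise linear in the weights; a centered difference would be the safer default for smoother losses.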

Summary

The backward pass effectively determines how the weights in each layer should be adjusted to minimize the loss, accounting for how each layer's output influences the loss through subsequent layers. Each LWk points in the direction of greatest increase of the loss as a function of the weights in layer k. By updating the weights in the opposite direction of this gradient, we move towards reducing the loss, implementing the essence of gradient descent.

These derivations are used to compute backpropagation in our MLP example in the modeling/ directory. network.py shows how to forward pass through the network, and how one can backward pass the gradients through an MLP with ReLU activation and Negative Log-Likelihood loss.