Cindered Thoughts

Formulations of Neural Net Weight Initializations

This write-up is a repost of my personal notes on weight initialization and how it impacts the outputs of an MLP. I use it as a reference for myself, and I hope it can also be of help to others :)

Initializing Weights of the MLP

Initializing neural network weights with a standard deviation of $1/\sqrt{n}$ (equivalently, a variance of $1/n$, where $n$ is the number of input neurons, also known as fan-in) is a strategy designed to maintain the variance of each neuron's output at initialization. Let's delve into a mathematical derivation of why this specific value is chosen, particularly in the context of keeping the variance of the outputs stable.

Background

When we initialize the weights of a neural network, we want to ensure that the signal (i.e., the output of each neuron before applying the activation function) does not vanish or explode as it propagates through the network. This stability helps in maintaining effective gradient propagation during training.
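To see the problem concretely, here is a minimal NumPy sketch (the width of 512 and the depth of 5 are arbitrary choices for illustration): with weights drawn at standard deviation 1, each layer multiplies the signal's scale by roughly $\sqrt{n}$, so it explodes within a few layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                          # neurons per layer (fan-in); arbitrary
x = rng.standard_normal(n)       # unit-variance input signal

# Naive init: weights with std 1. Each matmul multiplies the
# signal's std by roughly sqrt(n), so the signal explodes.
for layer in range(5):
    W = rng.standard_normal((n, n))
    x = W @ x
    print(f"layer {layer}: std of signal = {x.std():.3g}")
```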

Assumptions

For this derivation we make a few simplifying assumptions:

- The weights $w_i$ are drawn i.i.d. with zero mean and variance $\sigma^2$.
- The inputs $x_i$ have zero mean and unit variance, $\mathrm{Var}(x_i) = 1$.
- The weights and inputs are mutually independent.

Output Variance Calculation

Consider a neuron's output $z$ before applying the activation function, calculated as: $z = \sum_{i=1}^{n} w_i x_i$, where $w_i$ are the weights and $x_i$ are the inputs.

Step 1: Calculate the Variance of z

Since the weights and inputs are independent and both have zero mean, the variance of the product $w_i x_i$ for each $i$ is simply the product of their variances: $\mathrm{Var}(w_i x_i) = \mathrm{Var}(w_i) \cdot \mathrm{Var}(x_i) = \sigma^2 \cdot 1 = \sigma^2$ (recall that for simplicity $x_i$ has $\mathrm{Var}(x_i) = 1$).

Since $z$ is the sum of $n$ such independent terms $w_i x_i$, the variance of $z$ is the sum of their variances: $\mathrm{Var}(z) = \mathrm{Var}\left(\sum_{i=1}^{n} w_i x_i\right) = \sum_{i=1}^{n} \mathrm{Var}(w_i x_i) = n\sigma^2$
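We can verify this empirically with a quick NumPy sketch (the values of $n$, $\sigma$, and the trial count are arbitrary): drawing many independent $(w, x)$ pairs, the sample variance of $z$ lands close to $n\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma = 256, 20_000, 0.1   # arbitrary illustrative values

# One z per trial: z = sum_i w_i x_i with w_i ~ N(0, sigma^2), x_i ~ N(0, 1)
w = rng.normal(0.0, sigma, size=(trials, n))
x = rng.standard_normal((trials, n))
z = (w * x).sum(axis=1)

print("empirical Var(z):", z.var())
print("predicted n*sigma^2:", n * sigma**2)
```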

Step 2: Desired Variance of $z$

To keep the variance of the output $z$ similar to the variance of the input across layers, we would like $\mathrm{Var}(z) = 1$. This condition helps prevent vanishing or exploding gradients during training.

Setting $\mathrm{Var}(z) = 1$: $n\sigma^2 = 1 \implies \sigma^2 = \frac{1}{n}$

Therefore, the standard deviation $\sigma$ should be: $\sigma = \frac{1}{\sqrt{n}}$

Conclusion

This derivation shows that setting the standard deviation of the weight initialization to $1/\sqrt{n}$ (i.e., a variance of $1/n$) ensures that the output of each neuron has a variance of 1, assuming the inputs also have a variance of 1. This balance is crucial for effective learning, as it prevents the scale of the neuron outputs from growing or shrinking dramatically across layers, which can lead to numerical instability or poor convergence. This is why the $1/n$ variance factor is commonly used in weight initialization methods like Xavier/Glorot initialization (which further adjusts the variance based on both the number of inputs and outputs).
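A quick simulation of this claim (a sketch; the width and depth are arbitrary, and no nonlinearity is applied): with weights scaled by $1/\sqrt{n}$, the signal's standard deviation stays on the order of 1 even after many layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                          # width; arbitrary
x = rng.standard_normal(n)       # unit-variance input

# Weights scaled by 1/sqrt(n): the pre-activation variance stays
# near 1 layer after layer (no nonlinearity applied here).
for _ in range(20):
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    x = W @ x

print("std after 20 layers:", x.std())   # stays on the order of 1
```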


Proving the identities of linear transformations under Random Variables

Let's go through the mathematical proof of how a linear transformation of a random variable affects its mean and variance. The transformation we are considering is $Y = aX + b$, where $X$ is a random variable with mean $\mu_X$ and variance $\sigma_X^2$, and $a$ and $b$ are constants.

1. Expectation (Mean)

The expectation operator $E$ is linear, which means that for any constants $a$ and $b$, and a random variable $X$: $E[aX + b] = aE[X] + b$

Proof for Mean

Given that $X$ has a mean of $\mu_X$, the mean of $Y$ is calculated as follows: $E[Y] = E[aX + b] = aE[X] + b = a\mu_X + b$

Thus, the mean of $Y$ is $a\mu_X + b$.
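A quick Monte Carlo sanity check of this identity (the constants $a$, $b$, $\mu_X$ and the sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, mu_X = 3.0, -2.0, 5.0      # arbitrary constants

# Transform a large sample of X and compare the empirical mean
# of Y = aX + b against the predicted a*mu_X + b.
X = rng.normal(mu_X, 1.0, size=1_000_000)
Y = a * X + b

print("empirical E[Y]:", Y.mean())
print("predicted a*mu_X + b:", a * mu_X + b)   # 13.0
```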

2. Variance

Variance, denoted $\mathrm{Var}$, measures the spread of a random variable around its mean. The variance of the transformed random variable $Y = aX + b$ is defined as: $\mathrm{Var}(Y) = E[(Y - E[Y])^2]$

Proof for Variance

Substituting $Y = aX + b$ and $E[Y] = a\mu_X + b$ into the variance formula: $\mathrm{Var}(Y) = E[(aX + b - (a\mu_X + b))^2] = E[(aX - a\mu_X)^2] = E[a^2(X - \mu_X)^2] = a^2 E[(X - \mu_X)^2]$

Since $E[(X - \mu_X)^2]$ is the definition of $\mathrm{Var}(X)$, or $\sigma_X^2$: $\mathrm{Var}(Y) = a^2 \sigma_X^2$
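A matching Monte Carlo check for the variance identity (the constants $a$, $b$, $\sigma_X$ and the sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, sigma_X = 3.0, -2.0, 2.0   # arbitrary constants

# The shift b should not change the variance; the scale a should
# multiply it by a^2.
X = rng.normal(0.0, sigma_X, size=1_000_000)
Y = a * X + b

print("empirical Var(Y):", Y.var())
print("predicted a^2 * sigma_X^2:", a**2 * sigma_X**2)   # 36.0
```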

Key Insight

The addition of a constant $b$ shifts the mean but does not affect the spread of the distribution, hence it does not influence the variance. Multiplication by $a$, however, scales the standard deviation by $|a|$ and therefore the variance by $a^2$.

Summary

This proof shows that the mean and variance of a linear transformation of a random variable, $Y = aX + b$, are $a\mu_X + b$ and $a^2\sigma_X^2$ respectively. These properties are foundational in probability and statistics and are used extensively across fields like data science, economics, and engineering to understand and predict the behavior of complex systems based on simpler underlying distributions.


Putting It Together

Applying the Transformation

Given $Y = aX + b$, we substitute $a$ with our target standard deviation $\sigma_T$ and set $b = 0$:

$Y = \sigma_T X$

From the identities above: $\mathrm{Var}(Y) = \sigma_Y^2 = \sigma_T^2 \sigma_X^2$, and taking the square root: $\sigma_Y = \sqrt{\mathrm{Var}(Y)} = \sqrt{\sigma_T^2 \sigma_X^2} = \sigma_T \sigma_X$.

Substituting our target $\sigma_T = 1/\sqrt{n}$ and the origin standard deviation $\sigma_X = 1$: $\sigma_Y = \frac{1}{\sqrt{n}} \cdot 1 = \frac{1}{\sqrt{n}}$

In other words, drawing $X$ from a standard normal and scaling it by $1/\sqrt{n}$ yields weights with exactly the standard deviation the derivation calls for.
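As a final sanity check (a minimal sketch; the fan-in of 1024 is an arbitrary choice), scaling a standard-normal draw by $\sigma_T = 1/\sqrt{n}$ produces weights with the derived standard deviation; since $1/\sqrt{1024} = 1/32$, the target here is exactly 0.03125:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024                              # fan-in; arbitrary

# Y = sigma_T * X: scale a standard-normal draw by the target std.
sigma_T = 1.0 / np.sqrt(n)            # 1/sqrt(1024) = 0.03125
W = sigma_T * rng.standard_normal((n, n))

print("empirical std:", W.std())
print("target sigma_T:", sigma_T)     # 0.03125
```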

This concludes the formulation of MLP weight initialization. Hope this provides a good reference for folks to think about how weight initialization impacts the variance of a network's outputs.