Intuiting Neural Networks

by Justin Skycak

NNs are similar to SVMs in that they project the data to a higher-dimensional space and fit a hyperplane to the data in the projected space. However, whereas SVMs use a predetermined kernel to project the data, NNs automatically construct their own projection.

This post is part of the series Intuiting Predictive Algorithms.


Neural Networks (NNs) consist of layers of “neurons,” where each neuron has an “activity” that is computed as a function of the weighted sum of activities of neurons in the previous layer. The first layer of neurons is activated directly by the data, and the activation of a particular neuron in the last layer represents the likelihood of the data belonging to a particular class. NNs are similar to SVMs in that they project the data to a higher-dimensional space and fit a hyperplane to the data in the projected space. However, whereas SVMs use a predetermined kernel to project the data, NNs automatically construct their own projection by iteratively adjusting (“training”) the weights in the intermediate (“hidden”) layers to minimize a loss function. Unlike with the kernel trick in SVMs, the training of additional layers in NNs incurs significant additional computational cost, and much work has been devoted to optimizing algorithms and hardware usage to speed up the training of Deep Neural Networks (DNNs) consisting of many hidden layers.

Each layer $\ell$ of a NN consists of a parameter matrix $W_\ell,$ where the $i$th row vector contains the weights received by the $i$th neuron in the next layer. If we define $f_\ell$ as the activation function that is applied component-wise (i.e. neuron-wise) at the $\ell$th layer, and include a bias term as a neuron in each layer whose activity is always 1, then the output activities of the network $\ell$ layers after the first layer are given by

$\begin{align*} o_\ell(x_i) &= f_\ell(W_\ell f_{\ell-1}(\ldots W_2 f_1(W_1 x_i) \ldots)) \end{align*}$


We can write this recursively as

$\begin{align*} o_0(x_i) &= x_i \\[5pt] o_\ell(x_i) &= f_\ell(W_\ell o_{\ell-1}(x_i)), \quad \ell \geq 1 \end{align*}$


When counting layers, we do not count the first layer because it reads in the data and is therefore not associated with trainable weight parameters. This way, each layer is associated with a weight matrix.
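To make the recursion concrete, here is a minimal NumPy sketch of the forward pass. The layer sizes, random weights, and sigmoid activation are made up for illustration, and the bias term is handled as described above, by appending an always-on neuron to each layer’s activities.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, activations):
    """Compute o_0, o_1, ..., o_n for a single input vector x.

    weights[l-1] is W_l, with one extra column for the bias "neuron"
    whose activity is always 1; activations[l-1] is f_l.
    """
    o = [x]
    for W, f in zip(weights, activations):
        a = np.append(o[-1], 1.0)   # append the always-on bias neuron
        o.append(f(W @ a))          # o_l = f_l(W_l o_{l-1})
    return o

# Hypothetical 2-16-3 network: sizes and weights are made up for illustration.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 2 + 1))   # +1 column for the bias neuron
W2 = rng.normal(size=(3, 16 + 1))
outputs = forward(np.array([0.5, -1.2]), [W1, W2], [sigmoid, sigmoid])
print(outputs[-1])                  # activities of the last layer, o_n(x_i)
```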

For a regression network with $n$ layers, a loss function $L(o_n)$ is chosen to compare the output $o_n(x_i)$ to the desired target $y_i$ from the data. Common choices for this loss function include the L1 and L2 error. For a classification network with $n$ layers, we normalize the output to $\widehat{o_n}(x_i) = o_n(x_i)/\vert\vert o_n(x_i) \vert\vert$ so that its components sum to 1 and can be interpreted as probabilities (here the norm is taken as the sum of the components, assuming they are nonnegative). Then, a loss function $L(\widehat{o_n})$ is chosen to measure the discrepancy between the output $\widehat{o_n}(x_i)$ and the desired target (“ground truth”) $y_i$ from the data. For classification, the loss function is usually chosen to be the cross-entropy. For each data point, the cross-entropy is given by

$\begin{align*} L(\widehat{o_n}) = -\sum_j \left[ y_{ij} \log \widehat{o_n}_j + (1-y_{ij}) \log \left( 1-\widehat{o_n}_j \right) \right] \end{align*}$


To make sense of the cross-entropy, notice that it can be simplified to

$\begin{align*} L(\widehat{o_n}) = - \log \left[ \prod_j \widehat{o_n}_j^{\,y_{ij}} (1-\widehat{o_n}_j)^{1-y_{ij}} \right] \end{align*}$


and if the ground truth is a single class $c,$ i.e. $y_{ic}=1$ and $y_{ij \neq c}=0,$ then it becomes

$\begin{align*} L(\widehat{o_n}) &= - \log \left[ \widehat{o_n}_c \prod_{j \neq c} (1-\widehat{o_n}_j) \right] \\[5pt] &= -\log \left[ P(\mbox{in correct class})P(\mbox{not in wrong class}) \right] \end{align*}$
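As a quick numerical sanity check on this simplification, the sketch below uses hypothetical values for $\widehat{o_n}(x_i)$ and a one-hot ground truth, and evaluates the cross-entropy both as a sum over classes and as the negative log of the product; the two agree.

```python
import numpy as np

# Hypothetical normalized outputs for one data point, and a one-hot ground truth
o_hat = np.array([0.7, 0.2, 0.1])   # components of o_hat(x_i) sum to 1
y = np.array([1.0, 0.0, 0.0])       # ground truth: class c = 0

# Cross-entropy as a sum over classes
ce_sum = -np.sum(y * np.log(o_hat) + (1 - y) * np.log(1 - o_hat))

# Equivalent form: negative log of the product
ce_log = -np.log(np.prod(o_hat**y * (1 - o_hat)**(1 - y)))

print(ce_sum, ce_log)   # both approximately 0.685
```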


The training algorithm, called gradient descent, consists of iteratively updating the weights at each layer according to

$\begin{align*} W_\ell \to W_\ell - \alpha \frac{\partial L}{\partial W_\ell} \end{align*}$


where $\alpha$ is the learning rate, a parameter that governs how quickly the weights change. There are other variations of gradient descent, such as stochastic gradient descent (SGD), where each update uses the gradient computed on a randomly chosen subset (“mini-batch”) of the training data; the resulting noise helps the weights break out of shallow minima while still allowing them to settle into deeper minima. Another variation, SGD with momentum, is meant to mimic the trajectory of a ball rolling down a bumpy hill into a valley. The main challenge in all of these methods, though, is computing the derivative (“gradient”) of the loss function with respect to the weights. Luckily, there is a pattern to it, which we will see after computing the gradient for $\ell = n, n-1, n-2.$
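Before computing those gradients, here is a minimal sketch of what a single update iteration looks like in code, with the gradients treated as given (they will come from backpropagation below); the momentum decay factor beta is a hypothetical hyperparameter.

```python
def gradient_descent_step(weights, grads, alpha):
    """Plain gradient descent: W_l -> W_l - alpha * dL/dW_l for each layer l."""
    return [W - alpha * dW for W, dW in zip(weights, grads)]

def sgd_momentum_step(weights, velocities, grads, alpha, beta=0.9):
    """SGD with momentum: a running 'velocity' accumulates past gradients,
    mimicking a ball rolling downhill; beta is a hypothetical decay factor."""
    velocities = [beta * v - alpha * dW for v, dW in zip(velocities, grads)]
    weights = [W + v for W, v in zip(weights, velocities)]
    return weights, velocities
```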

Computing for $\ell = n,$ we have

$\begin{align*} \frac{\partial L}{\partial W_n} &= L'(o_n)\frac{\partial o_n}{\partial W_n} \\[5pt] &= L'(o_n) \cdot f_n'(W_n o_{n-1}) \frac{\partial}{\partial W_n} \left[ W_n o_{n-1} \right] \\[5pt] &= \left[ L'(o_n) \cdot f_n'(W_n o_{n-1}) \right] o_{n-1}^T \end{align*}$


where the $(\cdot)$ operation represents the Hadamard (component-wise) product. We define

$\begin{align*} \delta_n = L'(o_n) \cdot f_n'( W_no_{n-1} ) \end{align*}$


so that

$\begin{align*} \frac{\partial L}{\partial W_n} = \delta_n o_{n-1}^T \end{align*}$


Computing for $\ell = n-1,$ we have

$\begin{align*} \frac{\partial L}{\partial W_{n-1}} &= \delta_n \frac{\partial}{\partial W_{n-1}} \left[ W_n o_{n-1} \right] \\[5pt] &= W_n^T \delta_n \frac{\partial o_{n-1}}{\partial W_{n-1}} \\[5pt] &= \left[ W_n^T \delta_n \cdot f_{n-1}'(W_{n-1}o_{n-2}) \right] o_{n-2}^T \end{align*}$


We define

$\begin{align*} \delta_{n-1} = W_n^T \delta_n \cdot f_{n-1}'( W_{n-1} o_{n-2} ) \end{align*}$


so that

$\begin{align*} \frac{\partial L}{\partial W_{n-1}} = \delta_{n-1} o_{n-2}^T \end{align*}$


Computing for $\ell = n-2,$ we have

$\begin{align*} \frac{\partial L}{\partial W_{n-2}} &= \delta_{n-1} \frac{\partial}{\partial W_{n-2}} \left[ W_{n-1} o_{n-2} \right] \\[5pt] &= \left[ W_{n-1}^T \delta_{n-1} \cdot f_{n-2}'(W_{n-2}o_{n-3}) \right] o_{n-3}^T \end{align*}$


We define

$\begin{align*} \delta_{n-2} = W_{n-1}^T \delta_{n-1} \cdot f_{n-2}'( W_{n-2} o_{n-3} ) \end{align*}$


so that

$\begin{align*} \frac{\partial L}{\partial W_{n-2}} = \delta_{n-2} o_{n-3}^T \end{align*}$


Putting it all together, we have that

$\begin{align*} \frac{\partial L}{\partial W_\ell} = \delta_\ell o_{\ell-1}^T \end{align*}$


where

$\begin{align*} \delta_n &= L'(o_n) \cdot f_n'(W_n o_{n-1}) \\[5pt] \delta_{\ell-1} &= W_\ell^T \delta_\ell \cdot f_{\ell-1}'(W_{\ell-1} o_{\ell-2}), \quad 2 \leq \ell \leq n \end{align*}$


This method for computing the gradients is called “backpropagation,” because we propagate the $\delta_\ell$ terms backwards through the layers, from the last layer to the first layer.
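Putting the forward pass and the $\delta_\ell$ recursion together, here is a minimal NumPy sketch of backpropagation for a fully-connected network. The bias neurons are omitted for brevity, and the activation functions $f_\ell,$ their derivatives $f_\ell',$ and $L'$ are passed in as hypothetical callables.

```python
import numpy as np

def backprop(x, weights, fs, dfs, dL):
    """Return dL/dW_l for every layer l, using the delta recursion above.

    weights[l-1] is W_l; fs[l-1] and dfs[l-1] are f_l and f_l'; dL(o_n)
    returns L'(o_n). Bias neurons are omitted for brevity.
    """
    # Forward pass, caching weighted sums z_l = W_l o_{l-1} and activities o_l
    o, z = [x], []
    for W, f in zip(weights, fs):
        z.append(W @ o[-1])
        o.append(f(z[-1]))

    n = len(weights)
    grads = [None] * n

    # delta_n = L'(o_n) . f_n'(W_n o_{n-1}), and dL/dW_n = delta_n o_{n-1}^T
    delta = dL(o[-1]) * dfs[-1](z[-1])
    grads[-1] = np.outer(delta, o[-2])

    # Propagate backwards: delta_l = W_{l+1}^T delta_{l+1} . f_l'(W_l o_{l-1})
    for l in range(n - 1, 0, -1):
        delta = (weights[l].T @ delta) * dfs[l - 1](z[l - 1])
        grads[l - 1] = np.outer(delta, o[l - 1])   # dL/dW_l = delta_l o_{l-1}^T
    return grads
```

Feeding these gradients into the gradient-descent step sketched earlier completes one training iteration.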

The equations for backpropagation also give us insight into our choice of activation function. If we choose a sigmoidal activation function that levels off, then the gradient will vanish for neurons whose weighted-sum inputs are too large in magnitude (i.e. whose activations saturate). But if we choose a linear activation function that maintains a slope of 1 everywhere, then we have nothing more than a linear model, and the network is unable to project the data into a higher-dimensional space before fitting the hyperplane. The solution is to use a “rectified” linear unit (ReLU), which is linear for positive inputs and zero for negative inputs:

$\begin{align*} f(x) = \max(0,x) \end{align*}$


Ideally, we’d use a “softplus” function, $f(x) = \log(1 + e^x),$ which unlike the ReLU is differentiable at zero, but ReLUs are so much faster to compute that we use them anyway. We can usually get away with the slope being zero for negative inputs to the ReLU because the weighted sums in the network tend to be positive at least some of the time. However, if we set the learning rate too high, we can sometimes end up with neurons whose weighted sums are always negative, and consequently whose gradients and activities are always zero. To overcome this problem of “dead” neurons, one can use leaky ReLUs, which have a small gradient and nonzero activity even for negative inputs:

$\begin{align*} f(x) = \max(\epsilon x,x) \end{align*}$
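In code, both activations and the derivatives needed for backpropagation are one-liners; the value of $\epsilon$ below (0.01) is just a typical hypothetical choice.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z > 0).astype(float)        # gradient is zero for negative inputs

def leaky_relu(z, eps=0.01):
    return np.maximum(eps * z, z)

def leaky_relu_prime(z, eps=0.01):
    return np.where(z > 0, 1.0, eps)    # small but nonzero gradient for negative inputs
```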


That being said, ReLUs may not be the best choice for the output layer of the network, which is supposed to represent a regression or classification prediction. For regressions, linear activation functions are a better choice in the final layer, and for classifications, softmax units are a better choice in the final layer.
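For the classification case, a common way to implement the softmax output layer looks like this sketch (subtracting the maximum score is a standard trick for numerical stability, not something discussed above):

```python
import numpy as np

def softmax(z):
    """Map the raw output-layer scores to probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / np.sum(e)
```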

Due to the large number of parameters in NNs, they are prone to overfitting. However, the risk of overfitting can be reduced by “dropout,” a method used to avoid training all of the weights on all of the training data. Dropout involves randomly turning off or “dropping out” neurons from the network during each training iteration, and keeping the weights of those neurons unchanged during the weight update. Dropout also increases training speed, since dropping out half the neurons in a network roughly cuts the number of computations in half.
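Here is a minimal sketch of dropout applied to one layer’s activities during a training iteration; the drop probability p is a hypothetical hyperparameter (often 0.5), and the $1/(1-p)$ rescaling is a standard refinement (“inverted dropout”) not discussed above.

```python
import numpy as np

def dropout(activities, p=0.5, rng=None):
    """Randomly zero out a fraction p of the neurons' activities.

    Dropped neurons contribute no activity and receive no gradient, so their
    weights stay unchanged during this iteration's weight update. Surviving
    activities are rescaled by 1/(1-p) so the expected input to the next
    layer is unchanged ("inverted dropout").
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activities.shape) >= p
    return activities * mask / (1.0 - p)
```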

One type of neural network that has seen widespread success in the realm of image processing is the convolutional neural network (CNN), which reduces the number of parameters (thus enabling deep networks of many layers) by taking advantage of spatially local input patterns. In CNNs, each layer of neurons is really a stack of sub-layers, and each neuron in a sub-layer is connected to only a small region (“receptive field”) of a single sub-layer in the preceding layer. Receptive field weights are shared across neurons within a sub-layer, thus forming a template (“convolution”) that can be interpreted as the pattern of activation that the sub-layer is trained to detect within the sub-layer in the preceding layer. A sub-layer’s convolution can be expressed as a weighted sum of different offsets of the convolution in the sub-layer in the preceding layer, and by carrying the weighted sum through all the layers down to the input layer, one can see the visual feature that the sub-layer is trained to detect within the image. Visual features of sub-layers within lower layers are usually simple, like lines and edges, whereas visual features of sub-layers within higher layers can be complex, like faces or cars.
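To illustrate the weight sharing, here is a minimal NumPy sketch of a single convolutional sub-layer: one small template of weights is applied at every receptive-field position of a 2-D input, so the same few parameters are reused across the whole image. (The kernel values, toy image, single channel, and lack of padding or strides are all simplifications made up for illustration.)

```python
import numpy as np

def convolve2d(image, kernel):
    """Apply one shared weight template at every receptive-field position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output neuron sees only a small kh-by-kw receptive field,
            # and every neuron in this sub-layer uses the same weights.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Hypothetical example: a 3x3 vertical-edge template applied to a toy image
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)
image = np.zeros((6, 6)); image[:, 3:] = 1.0   # left half dark, right half bright
print(convolve2d(image, edge_kernel))           # large-magnitude responses near the edge
```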


This post is part of the series Intuiting Predictive Algorithms.