Summary of the paper: Layer Normalization
When it comes to deep neural networks, training can be really computationally expensive. Methods like batch normalization cut training time significantly for feed-forward networks. However, batch normalization doesn't quite work for recurrent neural networks.
This paper takes batch normalization and modifies it into layer normalization so that it can be used for recurrent neural networks. It gives each neuron its own adaptive bias and gain, and it performs exactly the same computation at training and test time (unlike batch normalization). Compared to previous methods, layer normalization can reduce training time significantly.
Deep neural networks perform incredibly well, especially on supervised learning tasks. However, getting this high performance requires a huge amount of computation. Speeding up training is possible with better hardware, or by computing gradients for different subsets of the training data on different machines, but the latter requires a ton of communication and complex software, which isn't too practical.
Batch normalization is the current best way to reduce training time; it adds extra normalization stages to the network, standardizing each neuron's summed input using the mean and standard deviation computed over a mini-batch of training cases. This is really simple but requires keeping a running average of the summed input statistics, which doesn't carry over cleanly to recurrent neural networks.
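To make that concrete, here's a minimal sketch (my own, not the paper's notation or code) of the training-time computation: the normalization depends on mini-batch statistics, and running averages of those statistics have to be maintained for use at test time. The function and variable names here are made up for illustration.

```python
import numpy as np

def batch_norm_train(a, running_mean, running_var, momentum=0.9, eps=1e-5):
    """Standardize summed inputs `a` (shape: batch x units) using mini-batch
    statistics, and update the running averages used later at test time."""
    mu = a.mean(axis=0)                  # per-unit mean over the mini-batch
    var = a.var(axis=0)                  # per-unit variance over the mini-batch
    a_hat = (a - mu) / np.sqrt(var + eps)
    # exponential running averages, used in place of mu/var at test time
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return a_hat, running_mean, running_var
```

It's exactly this dependence on mini-batch statistics (and their running averages) that becomes awkward when the inputs are variable-length sequences.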
The lengths of sequences in RNNs vary, so applying batch normalization would require maintaining different statistics for each time step.
This paper introduces a new method known as layer normalization, a simple normalization scheme that works for a variety of neural networks - not just feed-forward ones.
Layer normalization directly estimates the normalization statistics from the summed inputs to the neurons within a single layer, so unlike batch normalization it introduces no dependencies between training cases. This means it works well for RNNs and can improve the training time and generalization performance of existing models.
A feed-forward network is a non-linear mapping from an input pattern to an output vector. The parameters of the network are learned with gradient-based optimization, with the gradients computed by backpropagation. The gradients with respect to the weights in one layer are highly dependent on the outputs of the neurons in the previous layer, especially when those outputs change in a highly correlated way. Batch normalization reduces this "covariate shift" by normalizing the summed input to each hidden unit over the training cases, rescaling the summed input according to its mean and variance under the distribution of the data.
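In the paper's notation, with a_i^l the summed input to the i-th unit in layer l and g_i^l a learned gain, the batch-normalized summed input looks like this (the expectations are taken under the data distribution, and in practice estimated from the current mini-batch):

```latex
\bar{a}_i^l = \frac{g_i^l}{\sigma_i^l}\left(a_i^l - \mu_i^l\right),
\qquad
\mu_i^l = \mathbb{E}_{\mathbf{x} \sim P(\mathbf{x})}\!\left[a_i^l\right],
\qquad
\sigma_i^l = \sqrt{\mathbb{E}_{\mathbf{x} \sim P(\mathbf{x})}\!\left[\left(a_i^l - \mu_i^l\right)^2\right]}
```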
Layer normalization gets around these problems with batch normalization by computing the normalization statistics over all of the hidden units in the same layer, like so:
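(Here a_i^l is, as before, the summed input to the i-th hidden unit in layer l.)

```latex
\mu^l = \frac{1}{H} \sum_{i=1}^{H} a_i^l,
\qquad
\sigma^l = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(a_i^l - \mu^l\right)^2}
```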
where H represents the number of hidden units in the layer. All the hidden units in a layer share the same normalization terms, but different training cases get different normalization terms.
Layer normalization doesn't impose any constraints on the size of the mini-batch either, and it can even be used with a batch size of 1.
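As a rough sketch (mine, not the paper's code) of why the batch size doesn't matter: layer normalization only looks across the hidden units of a single example, so an example is normalized the same way whether it arrives in a batch of 128 or on its own.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize each example's summed inputs `a` (shape: batch x H)
    across its own H hidden units, then apply the learned gain and bias."""
    mu = a.mean(axis=-1, keepdims=True)        # per-example mean over the layer
    sigma = a.std(axis=-1, keepdims=True)      # per-example std over the layer
    return gain * (a - mu) / (sigma + eps) + bias

H = 4
gain, bias = np.ones(H), np.zeros(H)
batch = np.random.randn(128, H)
single = batch[:1]                             # "mini-batch" of size 1
# the first example is normalized identically in both cases
assert np.allclose(layer_norm(batch, gain, bias)[0],
                   layer_norm(single, gain, bias)[0])
```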
Layer normalized recurrent neural networks
When applying batch normalization to RNNs, we need to compute and store separate statistics for each time step in a sequence. Layer normalization doesn't have this problem, as its normalization terms depend only on the summed inputs to the layer at the current time step.
With standard RNNs, the average magnitude of the summed inputs to the recurrent units tends to either grow or shrink at every time step, which can lead to exploding or vanishing gradients. Layer normalization keeps these hidden-to-hidden dynamics much more stable.
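Concretely, the paper computes the recurrent summed inputs and then re-centers and re-scales them with the layer statistics before the non-linearity f (g and b are the learned gain and bias, ⊙ is element-wise multiplication):

```latex
\mathbf{a}^t = W_{hh}\,\mathbf{h}^{t-1} + W_{xh}\,\mathbf{x}^t,
\qquad
\mathbf{h}^t = f\!\left[\frac{\mathbf{g}}{\sigma^t} \odot \left(\mathbf{a}^t - \mu^t\right) + \mathbf{b}\right]
```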
Batch normalization has previously been extended to RNNs (recurrent batch normalization) by keeping independent normalization statistics for each time step; that work found that initializing the gain parameter of the recurrent batch normalization layer to 0.1 makes a significant difference to the final model performance. Layer normalization is also related to weight normalization.
In weight normalization, the L2 norm of a neuron's incoming weights is used to normalize its summed inputs, instead of the variance used in batch normalization.
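For reference, weight normalization reparameterizes each incoming weight vector by its L2 norm, so (up to notation) the summed input to neuron i looks like:

```latex
a_i = \frac{g_i}{\lVert \mathbf{v}_i \rVert_2}\, \mathbf{v}_i^\top \mathbf{x}
```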
The next part of the paper investigates the invariance properties of the different normalization schemes.
Invariance under weights and data transformations
Batch, weight, and layer normalization can all be summarized as normalizing the summed inputs to a neuron through two scalars, a mean and a standard deviation; they just compute these scalars differently.
Weight re-scaling and re-centering: under batch and weight normalization, re-scaling the incoming weights of a single neuron has no effect on its normalized summed inputs. Layer normalization is instead invariant to scaling of the entire weight matrix and to a shift applied to all of the incoming weights in the matrix.
Data re-scaling and re-centering: the layer normalization scalars only depend on the current input, so the output is invariant to re-scaling the dataset, and even to re-scaling a single training case.
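A quick numerical sanity check of the weight-matrix re-scaling invariance (my own sketch, not from the paper): multiplying the entire weight matrix by a constant leaves the layer-normalized summed inputs unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))          # weight matrix: H=4 hidden units, 8 inputs
x = rng.standard_normal(8)

def ln(a, eps=1e-8):
    # plain layer normalization of a single summed-input vector
    return (a - a.mean()) / (a.std() + eps)

a = W @ x                                # summed inputs with the original weights
a_scaled = (2.5 * W) @ x                 # summed inputs after re-scaling the whole matrix
assert np.allclose(ln(a), ln(a_scaled))  # identical normalized outputs
```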
The geometry of parameter space during learning
This section analyzes learning behaviour through the geometry of the parameter space. The parameter space of a probabilistic model can be viewed as a Riemannian manifold whose metric comes from the Kullback-Leibler divergence between the model's output distributions; it measures how much the model's output changes when the parameters move in parameter space.
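The standard way to make this precise (and the one the paper uses, up to notation) is that a small parameter change δ moves the output distribution by a KL divergence that is, to second order, a quadratic form in the Fisher information matrix F(θ):

```latex
ds^2 = D_{\mathrm{KL}}\!\left[\, P(y \mid \mathbf{x}; \theta) \;\Vert\; P(y \mid \mathbf{x}; \theta + \delta) \,\right]
\approx \tfrac{1}{2}\, \delta^\top F(\theta)\, \delta
```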
They then focus on a geometric analysis of generalized linear models. The results of this analysis can be applied to deep neural networks via a block-diagonal approximation to the Fisher information matrix, where each block corresponds to the parameters of a single neuron.
A generalized linear model (GLM) parameterizes an output distribution from the exponential family using a weight vector and a bias scalar; the normalized GLM is obtained by applying the normalization methods to the summed inputs of the original model. The analysis also shows an implicit learning rate reduction: as the norm of the weight vector grows, the effective learning rate along that weight direction shrinks.
Layer normalization was evaluated on 6 different tasks, mostly focused on RNNs: image-sentence ranking, question-answering, contextual language modelling, generative modelling, handwriting sequence generation and MNIST classification.
Order embeddings of images and language
They applied layer normalization to the order-embeddings model, which learns a joint embedding space of images and sentences. The same training protocol was followed, with the code modified to include layer normalization.
After training two models (one with layer norm and one without), here were the results:
Layer normalization speeds up convergence on the evaluation metrics and gives the best validation performance of the model.
Teaching machines to read and comprehend
They trained a unidirectional attentive reader model on the CNN corpus to compare layer normalization to recurrent batch normalization. This is a question-answering task: the model has to fill in the answer to a question about a given passage.
The results show that layer normalization trains faster and reaches better validation performance.
Skip-thought vectors
Skip-thoughts are a generalization of the skip-gram model, used for unsupervised distributed sentence representations instead of word representations. The model can produce generic sentence representations that perform well on multiple tasks without fine-tuning.
Once again, layer normalization was able to speed up training over the baseline and give better results.
Modelling binarized MNIST with DRAW
The Deep Recurrent Attentive Writer (DRAW) model was used for generative modelling on the binarized MNIST dataset. It uses an attention mechanism and an RNN to generate pieces of an image sequentially.
Evaluating the effect of layer normalization here, the layer-normalized model converged almost twice as fast as the baseline model.
Handwriting sequence generation
The previous experiments were all in the field of NLP, so they also ran a handwriting generation experiment to show the effectiveness of the method in a different setting.
The goal is to predict the sequence of x and y pen coordinates of the corresponding handwriting line on a whiteboard, given an input text string.
Layer normalization converges to the same log-likelihood as the baseline model, just faster.
Permutation invariant MNIST
The method was then tested in feed-forward networks, comparing layer normalization to batch normalization on MNIST classification. Layer normalization was applied to the fully connected hidden layers, and it turned out to be more robust to the batch size and to converge faster.
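As a rough sketch of what this kind of setup could look like (the hidden sizes and layer count here are my guesses, not the paper's exact architecture), here's a fully connected MNIST classifier with PyTorch's built-in nn.LayerNorm after each hidden layer:

```python
import torch
from torch import nn

# A small fully connected MNIST classifier with layer normalization
# after each hidden layer (a hidden width of 256 is an arbitrary choice).
model = nn.Sequential(
    nn.Flatten(),                 # 28x28 image -> 784-dim vector
    nn.Linear(784, 256),
    nn.LayerNorm(256),            # normalize over the 256 hidden units of each example
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.LayerNorm(256),
    nn.ReLU(),
    nn.Linear(256, 10),           # 10 digit classes
)

# Works the same for any batch size, including 1.
logits = model(torch.randn(1, 1, 28, 28))
print(logits.shape)               # torch.Size([1, 10])
```

Swapping the nn.LayerNorm layers for nn.BatchNorm1d would make the statistics depend on the batch, which is exactly the contrast this experiment draws.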
Convolutional Neural Networks
The paper also suggests experimenting more with CNNs; in preliminary experiments layer norm offered some speedup over an unnormalized baseline, but batch normalization still performed better.
This paper introduced a new technique known as layer normalization that is able to speed up training for multiple types of neural networks. It also provided a theoretical analysis comparing the invariance properties of layer normalization with those of batch and weight normalization.
Recurrent neural networks benefited the most from the layer normalization method, especially for long sequences.
That's pretty much everything you need to know about this paper!