Summary of "Attention is All You Need"

Bagavan Sivam
Jun 27, 2021 · 6 min read

Paper summary: Attention is All You Need

Abstract

Most state-of-the-art sequence transduction models use complex recurrent or convolutional neural networks and connect the encoder and decoder with an attention mechanism.

This paper proposes the Transformer, a simpler architecture that relies solely on attention mechanisms. It was evaluated on two machine translation tasks and performed significantly better than other models.

The architecture also generalizes well to other tasks.

Introduction

RNNs, gated RNNs, and LSTMs are state-of-the-art models for sequence modelling and transduction problems. These recurrent models process the input and output sequences by generating a sequence of hidden states, where each hidden state is a function of the previous hidden state and the input at the current position.

Many researchers have tried to push these models further by improving efficiency through factorization tricks and conditional computation. Attention mechanisms allow dependencies to be modelled without regard to the distance between positions in the input or output sequences. Although these improvements help, the fundamental constraint of sequential computation remains.

This is where the Transformer comes in: an architecture that avoids recurrence and instead relies entirely on attention mechanisms. The Transformer allows significantly more parallelization and reaches a new state of the art in translation tasks.

Background

Earlier efforts to reduce sequential computation use convolutional neural networks as their basic building block. In these models, however, the number of operations needed to relate two positions grows with the distance between them, which makes it harder to learn dependencies between distant positions.

Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. It has been used in various tasks like reading comprehension, abstractive summarization, etc.

End-to-end memory networks are based on a recurrent attention mechanism and perform well on simple question-answering and language modelling tasks.

The Transformer is the first transduction model that relies entirely on self-attention for computing representations of its input and output.

Model Architecture

Most of these models have an encoder-decoder structure. The encoder maps the input sequence to a sequence of continuous representations, and the decoder then generates the output sequence one element at a time.

The Transformer follows this overall structure, using stacked self-attention and point-wise, fully connected layers in both the encoder and decoder.

Figure 1 of the paper gives a visual representation of the model architecture, with the encoder stack on the left and the decoder stack on the right.



Encoder and Decoder Stacks The encoder is a stack of 6 identical layers, each with two sub-layers:

  1. Multi-head self-attention mechanism

  2. Position-wise fully connected feed-forward network.

The decoder is also a stack of 6 identical layers, but each layer has three sub-layers; the third performs multi-head attention over the output of the encoder stack.

Both stacks use a residual connection around each sub-layer, followed by layer normalization.
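
To make the sub-layer structure concrete, here is a minimal NumPy sketch of how a single sub-layer is wrapped with a residual connection and layer normalization; the function names and shapes are illustrative, not taken from the authors' code:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def apply_sublayer(x, sublayer_fn):
    # Residual connection around the sub-layer, followed by layer normalization
    return layer_norm(x + sublayer_fn(x))

# Illustrative stand-in for a sub-layer (e.g. self-attention or the feed-forward network)
x = np.random.randn(10, 512)                 # 10 token positions, model dimension 512
out = apply_sublayer(x, lambda h: h * 0.5)   # placeholder sub-layer function
print(out.shape)                             # (10, 512)
```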

Attention

An attention function maps a query and a set of key-value pairs (all vectors) to an output. The output is a weighted sum of the values, where the weight given to each value is computed from the query and the corresponding key. Two types of attention are used:

  1. Scaled Dot-Product Attention

  2. Multi-Head Attention

Scaled Dot-Product Attention This computes the dot products of the queries with all keys, divides each by √d_k, and applies a softmax function to get the weights on the values. The matrix of outputs is calculated as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

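Here is a minimal NumPy sketch of that computation; the shapes and variable names are my own, not from the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (len_q, d_k), K: (len_k, d_k), V: (len_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # attention weights sum to 1 for each query
    return weights @ V                  # weighted sum of the values

# Toy example: 3 query positions, 4 key/value positions, d_k = d_v = 8
Q, K, V = np.random.randn(3, 8), np.random.randn(4, 8), np.random.randn(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```
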
Multi-Head Attention Instead of performing a single attention function, the queries, keys and values are linearly projected several times, attention is applied to each projection in parallel, and the resulting outputs are concatenated and projected again to give the final values. This works better than a single attention function.
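
As a rough sketch of that idea (assuming the base model's 8 heads and d_model = 512; the randomly initialized projection matrices here are placeholders for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads=8, d_model=512, seed=0):
    rng = np.random.default_rng(seed)
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head linear projections of queries, keys and values (placeholders for learned weights)
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) * 0.02 for _ in range(3))
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        weights = softmax(q @ k.T / np.sqrt(d_k))   # scaled dot-product attention per head
        heads.append(weights @ v)
    Wo = rng.standard_normal((num_heads * d_k, d_model)) * 0.02
    return np.concatenate(heads, axis=-1) @ Wo      # concatenate heads, then project back to d_model

x = np.random.randn(5, 512)                         # 5 token positions
print(multi_head_attention(x, x, x).shape)          # (5, 512) -- the self-attention case
```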

Applications of Attention in the Model Attention is used in three ways in the Transformer:

  1. In the encoder-decoder attention layers, where the queries come from the decoder and the keys and values come from the encoder output

  2. In self-attention layers within the encoder

  3. In self-attention layers within the decoder

Position-wise Feed-Forward Networks

Each layer in the encoder and decoder also contains a fully connected feed-forward network, consisting of two linear transformations with a ReLU activation in between, applied to each position separately and identically.
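
A minimal sketch of that feed-forward block, assuming the base model's dimensions (d_model = 512, inner dimension 2048); the weight matrices here are placeholders for learned parameters:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied identically at every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.standard_normal((10, d_model))              # 10 token positions
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (10, 512)
```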

Embedding and Softmax

The model also uses learned embeddings to convert the input and output tokens into vectors, and a softmax function to convert the decoder output into next-token probabilities.
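
In code terms this is just a lookup table on the input side and a linear projection plus softmax on the output side; a hedged sketch with a placeholder vocabulary and randomly initialized weights standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 512                      # placeholder vocabulary size
embedding = rng.standard_normal((vocab_size, d_model)) * 0.02

token_ids = np.array([5, 42, 7])                     # example input token ids
x = embedding[token_ids]                             # learned embedding lookup -> (3, 512)

# Decoder output -> next-token probabilities via a linear projection and softmax
W_out = rng.standard_normal((d_model, vocab_size)) * 0.02
logits = x @ W_out
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
print(probs.shape, probs.sum(axis=-1))               # (3, 1000), each row sums to 1
```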

Positional Encoding

Positional encodings are added to the embeddings so the model can make use of the order of the sequence, since there is no recurrence or convolution. The authors also experimented with learned positional embeddings, which produced nearly identical results, but kept the sinusoidal version because it may let the model extrapolate to longer sequences.
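
The fixed encoding uses sine and cosine functions of different frequencies; a minimal sketch of how that table is built:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Even dimensions use sine, odd dimensions use cosine, at geometrically spaced frequencies
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2)
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512) -- added to the token embeddings before the first layer
```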

Why Self-Attention

There are several reasons the authors use self-attention instead of the recurrent or convolutional layers found in other models. Self-attention has lower computational complexity per layer for typical sequence lengths, and it requires only a constant number of sequential operations, compared to the O(n) sequential operations of an RNN, so far more of the computation can be parallelized. Finally, learning long-range dependencies is challenging in many tasks, and self-attention shortens the paths between distant positions, making those dependencies easier to learn.

Training

Training Data and Batching

They used two different datasets:

  1. WMT 2014 English-German dataset - 4.5 million sentence pairs

  2. WMT 2014 English-French dataset - 36 million sentences

Each training batch contained a set of sentence pairs with approximately 25,000 source tokens and 25,000 target tokens.

Hardware and Schedule

They used 8 NVIDIA P100 GPUs and trained the base model for 100,000 steps, which took about 12 hours (roughly 0.4 seconds per step).

Optimizer

The Adam optimizer was used, with a learning rate that varies over the course of training according to:

lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5))

For the first warmup_steps training steps, the learning rate increases linearly; after that, it decreases proportionally to the inverse square root of the step number.
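
A direct translation of that schedule into Python (warmup_steps = 4000 is the value used in the paper):

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps steps, then inverse-square-root decay
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 100000):
    print(step, transformer_lrate(step))
```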

Regularization

Two different regularization methods were applied: residual dropout and label smoothing.

Residual Dropout Dropout is applied to the output of each sub-layer before it is added to the sub-layer input and normalized. It is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.

Label Smoothing Label smoothing actually hurts perplexity, as the model learns to be less confident in its predictions, but it improves accuracy and BLEU score.
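
As a sketch of one common formulation of label smoothing (the paper uses a smoothing value of 0.1); the helper below simply builds the softened target distribution:

```python
import numpy as np

def smooth_labels(token_ids, vocab_size, eps=0.1):
    # Keep 1 - eps probability on the correct token and spread eps uniformly over the rest,
    # so the model is never pushed toward full confidence in a single answer
    targets = np.full((len(token_ids), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(token_ids)), token_ids] = 1.0 - eps
    return targets

print(smooth_labels(np.array([2, 0]), vocab_size=5))
```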

Results

The model was tested on two machine translation tasks, using the English-German and English-French datasets. The Transformer outperformed the best previously reported models on both tasks, and did so at a small fraction of their training cost.

Conclusion

The paper proposes the Transformer, the first sequence transduction model that relies entirely on attention mechanisms. It trains significantly faster than other high-performing architectures on two different translation tasks.

The future of the Transformer and its applications is exciting, but further work is needed to address its current limitations.



Those are the main things covered in this paper.

