Summary of "Attention is All You Need"

Bagavan Sivam
Jun 27, 2021 · 6 min read

Paper summary: Attention is All You Need

Abstract

Most state-of-the-art sequence transduction models use complex recurrent or convolutional neural networks and connect the encoder and decoder with an attention mechanism.

This paper proposes the Transformer, a simpler architecture that relies solely on attention mechanisms. It was evaluated on two machine translation tasks and performed significantly better than other models.

The architecture also generalizes well to other tasks.

Introduction

RNNs, gated RNNs, and LSTMs are state-of-the-art models for sequence modelling and transduction problems. These recurrent models process the input and output sequences by generating a sequence of hidden states, where each hidden state is a function of the previous hidden state and the input at the current position.

Many researchers have tried to push these models further by improving efficiency through factorization tricks and conditional computation. Attention mechanisms allow dependencies to be modelled without regard to the distance between positions in the input or output sequences. Although these improvements help, the fundamental constraint of sequential computation remains.

This is where the Transformer comes in: an architecture that avoids recurrence and instead relies entirely on attention mechanisms. The Transformer allows significantly more parallelization and reaches a new state of the art in translation tasks.

Background

Earlier efforts to reduce sequential computation use convolutional neural networks as their basic building block. In these models, however, the number of operations needed to relate two positions grows with the distance between them, which makes it harder to learn dependencies between distant positions.

Self-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. It has been used in various tasks like reading comprehension, abstractive summarization, etc.

End-to-end memory networks are based on a recurrent attention mechanism and perform well on simple question-answering and language modelling tasks.

The Transformer is the first transduction model that relies entirely on self-attention for computing representations of its input and output.

Model Architecture

Most of these models have an encoder-decoder structure. The encoder maps the input sequence to a sequence of continuous representations, and the decoder then generates the output sequence one element at a time.

The Transformer follows this overall structure, using stacked self-attention and point-wise, fully connected layers in both the encoder and decoder.

Figure 1 of the paper gives a visual representation of the model architecture, with the encoder stack on the left and the decoder stack on the right.



Encoder and Decoder Stacks The encoder is a stack of 6 identical layers, each with two sub-layers:

  1. Multi-head self-attention mechanism

  2. Position-wise fully connected feed-forward network.

The decoder is also a stack of 6 identical layers, but each layer has three sub-layers; the third performs multi-head attention over the output of the encoder stack.

Both stacks use a residual connection around each sub-layer, followed by layer normalization.
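
To make the sub-layer structure concrete, here is a minimal NumPy sketch of how a single sub-layer is wrapped with a residual connection and layer normalization; the function names and shapes are illustrative, not taken from the authors' code:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def apply_sublayer(x, sublayer_fn):
    # Residual connection around the sub-layer, followed by layer normalization
    return layer_norm(x + sublayer_fn(x))

# Illustrative stand-in for a sub-layer (e.g. self-attention or the feed-forward network)
x = np.random.randn(10, 512)                 # 10 token positions, model dimension 512
out = apply_sublayer(x, lambda h: h * 0.5)   # placeholder sub-layer function
print(out.shape)                             # (10, 512)
```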

Attention

An attention function maps a query and a set of key-value pairs (all vectors) to an output. The output is a weighted sum of the values, where the weight given to each value is computed from the query and the corresponding key. Two types of attention are used:

  1. Scaled Dot-Product Attention

  2. Multi-Head Attention

Scaled Dot-Product Attention This computes the dot products of the queries with all keys, divides each by √d_k, and applies a softmax function to get the weights on the values. The matrix of outputs is calculated as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

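Here is a minimal NumPy sketch of that computation; the shapes and variable names are my own, not from the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (len_q, d_k), K: (len_k, d_k), V: (len_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # attention weights sum to 1 for each query
    return weights @ V                  # weighted sum of the values

# Toy example: 3 query positions, 4 key/value positions, d_k = d_v = 8
Q, K, V = np.random.randn(3, 8), np.random.randn(4, 8), np.random.randn(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```
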
Multi-Head Attention Instead of performing a single attention function, the queries, keys and values are linearly projected several times, attention is applied to each projection in parallel, and the resulting outputs are concatenated and projected again to give the final values. This works better than a single attention function.
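
As a rough sketch of that idea (assuming the base model's 8 heads and d_model = 512; the randomly initialized projection matrices here are placeholders for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads=8, d_model=512, seed=0):
    rng = np.random.default_rng(seed)
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head linear projections of queries, keys and values (placeholders for learned weights)
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) * 0.02 for _ in range(3))
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        weights = softmax(q @ k.T / np.sqrt(d_k))   # scaled dot-product attention per head
        heads.append(weights @ v)
    Wo = rng.standard_normal((num_heads * d_k, d_model)) * 0.02
    return np.concatenate(heads, axis=-1) @ Wo      # concatenate heads, then project back to d_model

x = np.random.randn(5, 512)                         # 5 token positions
print(multi_head_attention(x, x, x).shape)          # (5, 512) -- the self-attention case
```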

Applications of Attention in the Model Attention is used in three ways in the Transformer:

  1. In the encoder-decoder attention layers, where the queries come from the decoder and the keys and values come from the encoder output

  2. In self-attention layers within the encoder

  3. In self-attention layers within the decoder

Position-wise Feed-Forward Networks

Each layer in the encoder and decoder also contains a fully connected feed-forward network, consisting of two linear transformations with a ReLU activation in between, applied to each position separately and identically.
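
A minimal sketch of that feed-forward block, assuming the base model's dimensions (d_model = 512, inner dimension 2048); the weight matrices here are placeholders for learned parameters:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied identically at every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.standard_normal((10, d_model))              # 10 token positions
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (10, 512)
```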

Embedding and Softmax

The model also uses learned embeddings to convert the input and output tokens into vectors, and a softmax function to convert the decoder output into next-token probabilities.
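
In code terms this is just a lookup table on the input side and a linear projection plus softmax on the output side; a hedged sketch with a placeholder vocabulary and randomly initialized weights standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 512                      # placeholder vocabulary size
embedding = rng.standard_normal((vocab_size, d_model)) * 0.02

token_ids = np.array([5, 42, 7])                     # example input token ids
x = embedding[token_ids]                             # learned embedding lookup -> (3, 512)

# Decoder output -> next-token probabilities via a linear projection and softmax
W_out = rng.standard_normal((d_model, vocab_size)) * 0.02
logits = x @ W_out
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
print(probs.shape, probs.sum(axis=-1))               # (3, 1000), each row sums to 1
```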

Positional Encoding

Positional encodings are added to the embeddings so the model can make use of the order of the sequence, since there is no recurrence or convolution. The authors also experimented with learned positional embeddings, which produced nearly identical results, but kept the sinusoidal version because it may let the model extrapolate to longer sequences.
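
The fixed encoding uses sine and cosine functions of different frequencies; a minimal sketch of how that table is built:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Even dimensions use sine, odd dimensions use cosine, at geometrically spaced frequencies
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2)
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512) -- added to the token embeddings before the first layer
```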

Why Self-Attention

There are several reasons the authors use self-attention instead of the recurrent or convolutional layers found in other models. Self-attention has lower computational complexity per layer for typical sequence lengths, and it requires only a constant number of sequential operations, compared to the O(n) sequential operations of an RNN, so far more of the computation can be parallelized. Finally, learning long-range dependencies is challenging in many tasks, and self-attention shortens the paths between distant positions, making those dependencies easier to learn.

Training

Training Data and Batching

They used two different datasets:

  1. WMT 2014 English-German dataset - 4.5 million sentence pairs

  2. WMT 2014 English-French dataset - 36 million sentences

Each training batch contained a set of sentence pairs with approximately 25,000 source tokens and 25,000 target tokens.

Hardware and Schedule

They used 8 NVIDIA P100 GPUs and trained the base model for 100,000 steps, which took about 12 hours (roughly 0.4 seconds per step).

Optimizer

The Adam optimizer was used, with a learning rate that varies over the course of training according to:

lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5))

For the first warmup_steps training steps, the learning rate increases linearly; after that, it decreases proportionally to the inverse square root of the step number.
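
A direct translation of that schedule into Python (warmup_steps = 4000 is the value used in the paper):

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps steps, then inverse-square-root decay
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 100000):
    print(step, transformer_lrate(step))
```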

Regularization

Two different regularization methods were applied: residual dropout and label smoothing.

Residual Dropout Dropout is applied to the output of each sub-layer before it is added to the sub-layer input and normalized. It is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.

Label Smoothing Label smoothing actually hurts perplexity, as the model learns to be less confident in its predictions, but it improves accuracy and BLEU score.
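
As a sketch of one common formulation of label smoothing (the paper uses a smoothing value of 0.1); the helper below simply builds the softened target distribution:

```python
import numpy as np

def smooth_labels(token_ids, vocab_size, eps=0.1):
    # Keep 1 - eps probability on the correct token and spread eps uniformly over the rest,
    # so the model is never pushed toward full confidence in a single answer
    targets = np.full((len(token_ids), vocab_size), eps / (vocab_size - 1))
    targets[np.arange(len(token_ids)), token_ids] = 1.0 - eps
    return targets

print(smooth_labels(np.array([2, 0]), vocab_size=5))
```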

Results

The model was tested on two machine translation tasks, using the English-German and English-French datasets. The Transformer outperformed the best previously reported models on both tasks, and did so at a small fraction of their training cost.

Conclusion

The paper proposes the Transformer, the first sequence transduction model that relies entirely on attention mechanisms. It trains significantly faster than other high-performing architectures on two different translation tasks.

The future of the Transformer and its applications is exciting, but further work is needed to address its current limitations.



Those are the main things covered in this paper.

