Paper Summary: "Attention Is All You Need"
TL;DR - Take queries, keys, and values: dot the queries with the keys to get a distribution, apply that distribution to the values, and you have attention (plus some other fancy stuff). Stack that over and over, using only attention, and you end up with a novel state-of-the-art model!
Why this is important - This paper introduces the Transformer. It's the first model to throw out recurrence and convolutions and rely solely on attention!
Right now the best models are RNNs + CNNs in an encoder/decoder setup with attention mechanisms. This paper comes in with a new approach that throws out all the recurrence and convolutions. It's called a Transformer, and it only uses attention mechanisms.
Transformers are superior to recurrent encoder-decoder networks: they can be massively parallelized and they require less time to train. Transformers are the state of the art.
RNNs, LSTMs, and GRUs are the state of the art among sequence models for lots of NLP tasks.
[They just go over how they work] {Read my other paper summary for how they work (All sections before Taxonomy of Attention: https://postulate.us/@dickson/p/2021-06-30-Paper-summary:-"An-Attentive-Survey-ij2yTDkcfragy9QC5n2tKc)}
Fundamentally, the issue with recurrent models is that we can't parallelize them. There are techniques that try to improve computational efficiency, but the parallelization issue still remains. Attention also helps, but we need something new.
Introducing Transformers! Kick out all the recurrent stuff and only use attention, attention is all you need! Attention draws the information from the inputs to the outputs. More parallelization + it's the new state of the art.
There have been many previous attempts to kick out the sequential computation by using convolutional blocks instead. But in those models, the number of operations needed to relate two positions in the input and output sequences grows with the distance between them, linearly for some and logarithmically for others, so it's still hard to learn dependencies across longer sequences. Transformers make it a constant number of operations.
Self-attention is when we relate different parts of a sequence to other parts of the same sequence in order to find the relationships between the words.
End-to-end memory networks = using recurrent attention (instead of sequence-aligned recurrence).
Transformers = first of their kind that don't use RNNs or CNNs.
Review of the encoder-decoder network. The encoder takes in x = (x1, x2, x3, etc.) and maps it to a representation z = (z1, z2, z3, etc.). Then the decoder comes in, takes z, and outputs y = (y1, y2, y3, etc.), one symbol at a time. Each step also takes in the previously generated symbols (the model is auto-regressive). The Transformer does this too, except with self-attention and point-wise fully connected layers.
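To make that loop concrete, here's a minimal Python sketch of the encoder-decoder interface described above. The encode() and decode_step() functions are hypothetical placeholders, not the real model.

```python
# Placeholder encoder-decoder loop: encode() and decode_step() stand in for the model.
START, END = "<s>", "</s>"

def encode(x_tokens):
    # stands in for the encoder mapping x = (x1, x2, ...) to z = (z1, z2, ...)
    return list(x_tokens)

def decode_step(z, y_so_far):
    # stands in for one decoder step; here it just echoes the input, then stops
    i = len(y_so_far) - 1
    return z[i] if i < len(z) else END

def generate(x_tokens):
    z = encode(x_tokens)
    y = [START]
    while y[-1] != END:
        # auto-regressive: each step consumes the symbols generated so far
        y.append(decode_step(z, y))
    return y[1:-1]

print(generate(["the", "cat", "sat"]))  # ['the', 'cat', 'sat']
```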
Encoder & Decoder:
The encoder has 6 of these layers (so look at the diagram above and duplicate it 6 times in your head). Each layer has 2 sublayers: a multi-head attention one and a feed-forward one.
There are residual connections around each of the sublayers + layer normalization (which occurs after the residual connection).
The decoder also has 6 of these layers, but with 3 sublayers - 2 multi-head attention ones and 1 feed-forward one. It's the same as the encoder, except this time we shove the outputs of the encoder stack into the second multi-head attention of the decoder. We also do a little masking so each position is prevented from looking at the positions after it.
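Here's a rough NumPy sketch of the "add & norm" pattern wrapped around each sublayer; the sublayer itself (self-attention or feed-forward) is left as a placeholder here.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's feature vector to zero mean / unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # residual connection around the sublayer, then layer normalization
    return layer_norm(x + sublayer(x))

x = np.random.randn(4, 8)                        # 4 positions, d_model = 8
print(add_and_norm(x, lambda h: 0.5 * h).shape)  # (4, 8)

# hypothetical usage for one encoder layer (self_attention and feed_forward
# are placeholders for the two real sublayers):
# x = add_and_norm(x, lambda h: self_attention(h, h, h))
# x = add_and_norm(x, feed_forward)
```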
Attention:
Intuitively, it's when the model starts picking things to pay attention to. This is done by dot-producting the keys with the query and then softmaxing the result (meaning we'll end up pulling towards the keys closest to the query). Keys that are close to the query are important to the network.
We don't exactly pull from the keys. Instead, the dot product gives us a distribution, which we apply to the values to get a weighted sum of them that we can do cool things with.
And that's Scaled Dot-Product Attention! There's a little more math to it though. The softmax's exponential function pushes one of the values way up (so softmax acts almost like argmax). And we scale the dot products down by a factor of sqrt(d_k) so that our gradients don't become extremely small.
Q = queries, K = keys, V = values. Ex: when translating English to French, the query is the current word we're looking at, the keys are all the English words we have, and the values are the French words we're going to translate to.
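Putting that together, a minimal NumPy sketch of scaled dot-product attention might look like this; the shapes and the optional mask argument are my own illustration, not code from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # similarity of each query to each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # block masked positions before softmax
    weights = softmax(scores, axis=-1)             # distribution over keys per query
    return weights @ V                             # weighted sum of the values

# tiny example with made-up shapes
Q = np.random.randn(3, 8)    # 3 queries,  d_k = 8
K = np.random.randn(5, 8)    # 5 keys
V = np.random.randn(5, 16)   # 5 values,   d_v = 16
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 16)
```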
Multi-head Attention:
We take our keys, values, and queries and linearly project them h times (once per head) down to smaller dimensions, using learned linear projections. We run the attention function on each projection in parallel and then concat the results together.
In other words, multi-head attention runs several attention "heads" in parallel, each comparing the queries against all the keys and values in its own learned subspace. We can set h (the number of heads).
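A rough sketch of multi-head attention, reusing the scaled_dot_product_attention function from the sketch above. The projection matrices here are random placeholders standing in for the learned weights.

```python
import numpy as np

def multi_head_attention(Q, K, V, h=8):
    # project Q, K, V h times into smaller subspaces, attend in each "head",
    # then concatenate the heads and project back to d_model.
    # NOTE: real projections (W_Q, W_K, W_V, W_O) are learned; random placeholders here.
    d_model = Q.shape[-1]
    d_k = d_model // h
    heads = []
    for _ in range(h):
        W_q = np.random.randn(d_model, d_k) / np.sqrt(d_model)
        W_k = np.random.randn(d_model, d_k) / np.sqrt(d_model)
        W_v = np.random.randn(d_model, d_k) / np.sqrt(d_model)
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = np.random.randn(h * d_k, d_model) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ W_o

x = np.random.randn(5, 512)                      # 5 positions, d_model = 512
print(multi_head_attention(x, x, x, h=8).shape)  # self-attention -> (5, 512)
```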
How it all fits together:
There are 3 ways in which the multi-head attention is used:
Encoder-decoder attention: queries come from the previous decoder layer, while keys + values come from the encoder output. This mimics the classic encoder-decoder attention mechanisms.
Self-attention in the encoder: keys + values + queries all come from the same place (here, the output of the previous encoder layer).
Self-attention in the decoder: same idea, but we don't want the decoder to read positions that haven't come yet, so we mask them out (a mask sketch follows this list).
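For the decoder self-attention, the mask could be built something like this; a tiny sketch that reuses the mask argument from the attention sketch above.

```python
import numpy as np

# causal ("look-ahead") mask for a decoder over n positions:
# position i may only attend to positions 0..i
n = 5
mask = np.tril(np.ones((n, n), dtype=bool))
print(mask.astype(int))
# passed as `mask` to the scaled_dot_product_attention sketch above, it sets the
# scores for future positions to a huge negative number, so softmax gives them
# effectively zero weight
```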
Position-wise feed-forward network:
Each of the encoder and decoder layers has one. It's just a linear transformation + ReLU + linear transformation, applied to each position separately. Literally just a normal NN.
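A quick NumPy sketch of that position-wise feed-forward layer, with randomly initialized placeholder weights; the paper's d_model = 512 and d_ff = 2048 are only used here for the shapes.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # applied to each position independently: linear -> ReLU -> linear
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048   # dimensions used in the paper
x = np.random.randn(5, d_model)
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (5, 512)
```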
Embeddings and Softmax:
Learned embeddings convert the input and output tokens into vectors of dimension d_model. A linear transformation + softmax converts the decoder output into next-token probabilities. The embedding layers and that pre-softmax linear transformation share the same weight matrix, and in the embedding layers we multiply the weights by sqrt(d_model).
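A small sketch of the weight sharing and the sqrt(d_model) scaling, using a random placeholder embedding matrix and made-up sizes.

```python
import numpy as np

vocab_size, d_model = 10000, 512
E = np.random.randn(vocab_size, d_model) * 0.01   # shared embedding matrix (placeholder)

def embed(token_ids):
    # token -> vector lookup, scaled by sqrt(d_model) as described in the paper
    return E[token_ids] * np.sqrt(d_model)

def next_token_logits(decoder_output):
    # the pre-softmax linear layer reuses the same matrix (transposed);
    # a softmax over these logits gives the next-token probabilities
    return decoder_output @ E.T

print(embed(np.array([1, 2, 3])).shape)                       # (3, 512)
print(next_token_logits(np.random.randn(3, d_model)).shape)   # (3, 10000)
```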
Positional Encoding:
We don't have any recurrence or convolution, thus we've got to tell the network where in the sequence each token sits. So we add positional encodings at the bottom of the encoder and decoder stacks.
To do this we use sinusoids of different frequencies, which should let the model easily learn to attend by relative positions (and let us extrapolate to sequence lengths longer than those seen during training).
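A short NumPy sketch of the sinusoidal encoding described in the paper.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(positional_encoding(50, 512).shape)  # (50, 512); added to the token embeddings
```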
They evaluate self-attention layers against recurrent and convolutional layers theoretically on 3 metrics: computational complexity per layer, number of sequential operations, and maximum path length between positions. For the last 2, self-attention is the best (or tied for the best). For the first one, self-attention wins as long as the sequence length n is smaller than the representation dimension d, which is often the case.
Plus, an additional benefit is that attention models are much more interpretable. Individual heads learn specific features, and we can understand the behavior driving them.
[They just go over the data they use, the hardware and schedule, optimizer, & regularization]
State-of-the-art performance on the WMT 2014 English-to-German and English-to-French translation tasks.
They also played around with the model architecture:
They also show that Transformers generalize well by applying the model to a task it wasn't trained for (English constituency parsing).
Transformers! The first of their kind to kick RNNs + conv layers out of attention models. Transformers are state-of-the-art, faster, and can take advantage of parallelization! Transformers have insane potential and could adapt to other domains in the future!
Special thanks to Saurabh Kumar for helping me understand it!