Generative Pre-Training (GPT) is a transformer-based framework that achieves strong natural language understanding. The model is first pre-trained on a diverse corpus containing long stretches of contiguous text, from which it gains significant "world knowledge." This broad model can then be transferred to more specific, discriminative tasks with minimal changes to the model architecture.
The GPT model outperformed discriminatively trained, task-specific models on 9 out of the 12 tasks studied, a significant performance improvement.
Introduction
Many deep learning methods require vast amounts of labelled data for training. This is a problem, as it restricts their applicability in many NLP domains where annotated data is scarce. Being able to learn from unlabelled data would be a superpower; manual annotation, by contrast, is time-consuming and costly.
Learning from unlabelled text could give the model a huge performance boost; however, there are still significant challenges in doing so:
It is unclear what type of optimization objective learns text representations that transfer well to other tasks.
There is no consensus on the most effective way to transfer the learnt representations to a specific task.
Even once a transfer method is chosen, existing techniques often require considerable task-specific changes to the model architecture, defeating the purpose of being "wide-range."
GPT's approach to this problem is a semi-supervised method for language understanding. The first stage uses unsupervised pre-training to build a broad understanding of language, while the second stage, supervised fine-tuning, adapts the model to a discriminative task.
The goal is for the model to learn a universal representation that is transferable and requires little adaptation across a wide range of tasks, which ultimately makes it easy to use without considerable changes to the architecture.
GPT uses a transformer architecture to perform strongly on various tasks. Transformers create a more structured memory for handling long-term dependencies (compared to RNNs, which mainly capture short-term dependencies), which allows for minimal changes to the architecture when the model is transferred to a discriminative task.
This model follows a two-stage procedure (the corresponding training objectives are sketched after this list):
Using a language modelling objective on unlabelled data to learn the initial parameters of the model
Adapting those parameters to the targeted supervised task.
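As a sketch, the two stages correspond to the following objectives from the paper, where u_i are the tokens of the unlabelled corpus U, k is the context window size, Theta denotes the model parameters, and (x, y) is a labelled example from dataset C with input tokens x^1, ..., x^m; L_3 is the combined objective used during fine-tuning:

```latex
% Stage 1: unsupervised language-modelling objective over the corpus U
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)

% Stage 2: supervised objective over the labelled dataset C, with the
% language-modelling loss kept as a weighted auxiliary objective
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
```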
Unsupervised pre-training
The unsupervised pre-training stage of the model uses a multi-layer Transformer decoder, a variant of the transformer. It is trained on the BooksCorpus dataset, which consists of approximately 7,000 unique unpublished books spanning a wide range of genres.
This will allow the model to understand a wide range of text representations, which can then be applied to discriminative tasks.
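A minimal sketch of that pre-training step (not the paper's original code): the model is trained to predict each token from the ones before it. Here `model` is assumed to map token ids to next-token logits.

```python
# Minimal causal language-modelling loss for the pre-training stage.
# `model` is assumed to map token ids (batch, seq) to logits (batch, seq, vocab);
# this is an illustrative sketch, not the paper's original implementation.
import torch
import torch.nn.functional as F

def pretraining_loss(model: torch.nn.Module, token_ids: torch.Tensor) -> torch.Tensor:
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]  # shift targets by one position
    logits = model(inputs)                                 # (batch, seq-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),               # flatten for cross-entropy
        targets.reshape(-1),
    )
```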
Supervised Fine-tuning
In the second stage, the inputs are first passed through the pre-trained model to obtain the final transformer block's activation. This is then fed into an added linear output layer to predict the label y.
Supervised fine-tuning adjusts the model's parameters to fit the specific task. Keeping language modelling as an auxiliary objective during fine-tuning also helps learning by (see the sketch after this list):
Improving generalization of the supervised model
Accelerating convergence on the target task
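A minimal sketch of that fine-tuning setup, reusing the `pretraining_loss` function above; the `hidden_states` method, the classification head, and the `lm_weight` value are illustrative assumptions, not the paper's exact code:

```python
# Illustrative fine-tuning head: the final transformer block's activation at
# the last token is fed to an added linear layer, and the language-modelling
# loss is kept as a weighted auxiliary objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineTunedClassifier(nn.Module):
    def __init__(self, pretrained: nn.Module, hidden_dim: int,
                 num_classes: int, lm_weight: float = 0.5):
        super().__init__()
        self.pretrained = pretrained                           # pre-trained transformer body
        self.classifier = nn.Linear(hidden_dim, num_classes)   # added linear output layer
        self.lm_weight = lm_weight                             # weight of the auxiliary LM loss

    def forward(self, token_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # `hidden_states` is assumed to return the final block's activations.
        hidden = self.pretrained.hidden_states(token_ids)       # (batch, seq, hidden)
        logits = self.classifier(hidden[:, -1])                 # last token's activation
        clf_loss = F.cross_entropy(logits, labels)              # supervised objective
        lm_loss = pretraining_loss(self.pretrained, token_ids)  # auxiliary LM objective
        return clf_loss + self.lm_weight * lm_loss
```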
Task-specific input transformations
Some tasks, such as question answering or textual entailment, have structured inputs that the model cannot consume directly; handling them natively would mean changing the model's architecture (defeating the whole purpose). GPT solves this problem by converting the structured inputs into ordered token sequences that the pre-trained model can process.
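For example, a textual entailment pair is serialized into a single sequence with a delimiter between premise and hypothesis; the special-token strings below are placeholders, not the paper's actual vocabulary:

```python
# Sketch of the input transformations: structured inputs become one ordered
# token sequence framed by start/extract tokens, with a delimiter between parts.
# The special-token strings are illustrative placeholders.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def entailment_input(premise: list[str], hypothesis: list[str]) -> list[str]:
    # Textual entailment: premise and hypothesis joined by the delimiter.
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text_a: list[str], text_b: list[str]) -> list[list[str]]:
    # Semantic similarity has no inherent ordering of the two sentences, so
    # both orderings are produced and their representations combined later.
    return [
        [START] + text_a + [DELIM] + text_b + [EXTRACT],
        [START] + text_b + [DELIM] + text_a + [EXTRACT],
    ]
```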
The model itself is a 12-layer decoder-only transformer, trained with the Adam optimizer (max learning rate of 2.5e-4) and the Gaussian Error Linear Unit (GELU) activation function, for 100 epochs.
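Collected into a small configuration sketch (the dictionary keys and helper function are mine; the values come from the setup described above):

```python
# Pre-training setup from the description above, gathered for reference.
# Key names are illustrative; values are the reported hyperparameters.
import torch

GPT_CONFIG = {
    "n_layers": 12,        # decoder-only transformer blocks
    "max_lr": 2.5e-4,      # maximum Adam learning rate
    "activation": "gelu",  # Gaussian Error Linear Unit
    "epochs": 100,
}

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    return torch.optim.Adam(model.parameters(), lr=GPT_CONFIG["max_lr"])
```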
After this model had been trained, it was then tested in four specific domains of NLP:
Natural Language Inference (NLI)
Question Answering
Semantic Similarity
Text Classification
After training and testing, GPT outperformed discriminatively trained models on 9 out of the 12 specific tasks and achieved strong performance across the board. This shows how pre-training on unlabelled data can significantly improve performance on discriminative tasks.
That wraps up the summary of this paper, and if you have any questions or want to learn more, you can read the full paper here!