Improving Language Understanding by Generative Pre-Training


Introduction

  • Reading backward from GPT-3 to GPT-1, one of the interesting things is how much the claimed merit of unsupervised learning changes from paper to paper.
  • In GPT-1, the authors suggest that unsupervised pretraining may boost the performance of supervised downstream tasks, whereas in GPT-3 the authors emphasize that supervised fine-tuning is not indispensable if the model is large enough.

Approaches

  • At the beginning of the GPT series, the major concerns about exploiting unlabeled data were that the optimization objectives for learning transferable text representations were unclear, and that there was no agreed-upon way to transfer those representations to a target task.
  • GPT-1 uses a language modeling objective on the unlabeled data to initialize the parameters of the network, then fine-tunes the weights on the labeled data; the two objectives are restated below.
  • Pretraining in an unsupervised manner acts as a regularization scheme, enabling GPT-1 to generalize the linguistic knowledge it learns better.
  • In addition, traversal-style input transformations enable the model to fine-tune effectively with minimal changes to the architecture of the pretrained model.
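  • Concretely, the paper formalizes this two-stage recipe with three objectives; restated here in the paper's notation, where U is the unlabeled corpus, C the labeled dataset, k the context-window size, Θ the model parameters, and λ the auxiliary-loss weight:

    % Unsupervised pretraining: standard left-to-right language modeling
    L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)

    % Supervised fine-tuning: predict the label y from the transformer's final state
    L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)

    % Objective actually optimized during fine-tuning: supervised loss plus an auxiliary LM term
    L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})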

Architecture

Figure 1. GPT-1 Transformer architecture and input transformations for fine-tuning (from the original paper).

  • GPT-1 addresses the long-range dependency problem by introducing the Transformer architecture, which in practice allows the model to capture longer-range linguistic structure.
  • The model is a stack of 12 Transformer blocks, each consisting of a multi-head attention sublayer and a position-wise feed-forward sublayer. Pretraining means optimizing this model with a language modeling objective to predict the next-token probability. The pretrained model can then be used directly for fine-tuning by adding task-specific layers on top of it.
  • Unlike feature-based approaches, which append the outputs of a pretrained model to the input vectors of the downstream task, GPT uses a traversal-style approach: downstream inputs are converted into token sequences and passed through the pretrained model, a task-specific linear layer is appended on top, and the pretrained parameters are updated as well. In addition, using language modeling as an auxiliary objective during the fine-tuning phase improves the generalization of the model and accelerates convergence; a minimal sketch of this setup follows below.
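  • To make this concrete, here is a minimal PyTorch sketch (not the paper's released code) of the setup: a small GPT-1-style stack of Transformer blocks, a newly added task-specific linear head, and the supervised loss combined with the auxiliary language-modeling loss. The hyperparameter defaults mirror the paper (12 layers, 768-dimensional states, 12 heads, 512-token context, λ = 0.5); all class and variable names are illustrative.

    # Minimal, illustrative sketch of GPT-1-style fine-tuning.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyGPT(nn.Module):
        """Decoder-only Transformer body used for pretraining (next-token prediction)."""
        def __init__(self, vocab_size=40000, d_model=768, n_layers=12, n_heads=12, max_len=512):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(max_len, d_model)
            block = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                               activation="gelu", batch_first=True)
            self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)  # pretraining head

        def forward(self, tokens):                      # tokens: (batch, seq)
            T = tokens.size(1)
            pos = torch.arange(T, device=tokens.device)
            h = self.tok_emb(tokens) + self.pos_emb(pos)
            # Causal mask: each position may only attend to earlier positions.
            causal = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), 1)
            return self.blocks(h, mask=causal)          # hidden states: (batch, seq, d_model)

    class ClassifierWithAuxLM(nn.Module):
        """Task-specific head on top of the pretrained body; all parameters are fine-tuned."""
        def __init__(self, pretrained: TinyGPT, n_classes: int, lm_coef: float = 0.5):
            super().__init__()
            self.body, self.lm_coef = pretrained, lm_coef
            self.clf_head = nn.Linear(pretrained.lm_head.in_features, n_classes)  # new layer

        def forward(self, tokens, labels):
            h = self.body(tokens)
            # L2: supervised loss taken from the final position (e.g., an end/extract token).
            clf_loss = F.cross_entropy(self.clf_head(h[:, -1]), labels)
            # L1: auxiliary language-modeling loss on the same task inputs.
            lm_logits = self.body.lm_head(h[:, :-1])
            lm_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                                      tokens[:, 1:].reshape(-1))
            return clf_loss + self.lm_coef * lm_loss    # L3 = L2 + lambda * L1

    # Quick shape check with random data (batch of 4, sequences of 64 tokens):
    model = ClassifierWithAuxLM(TinyGPT(), n_classes=2)
    loss = model(torch.randint(0, 40000, (4, 64)), torch.randint(0, 2, (4,)))
    loss.backward()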

Learning knowledge analysis

  • Having observed that varying the number of transferred layers affects supervised performance, the authors also analyzed the zero-shot behaviors of the pretrained model, i.e., whether the model has already acquired linguistic knowledge useful for downstream tasks through pretraining alone.

    Why a Transformer-based model?

  • The hypothesis is that the underlying generative model, the stack of 12 Transformer blocks, learns to perform downstream tasks while optimizing its language modeling capability, and comparing the performance of pretrained GPT and an LSTM without any fine-tuning makes this hypothesis plausible. GPT shows a steady upward trend with further pretraining, while the LSTM does not; the LSTM also shows larger performance variance across tasks, which suggests it is less suited to zero-shot behavior. Thus, for now, a Transformer-based model looks like the most promising candidate for a pretrained language model; one of the zero-shot heuristics is sketched below.
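  • As a rough illustration of how such zero-shot behavior can be probed, here is a short Python sketch in the spirit of the paper's sentiment heuristic: append a cue word and compare the language model's next-token scores for "positive" versus "negative". The helper next_token_logprobs and the vocab mapping are hypothetical stand-ins for a pretrained model's scoring interface, not the paper's code.

    # Hypothetical interface: next_token_logprobs(text) returns log-probabilities over
    # the vocabulary for the next token; vocab maps a word to its token id.
    def zero_shot_sentiment(next_token_logprobs, vocab, review: str) -> str:
        # Append the cue word and restrict the output distribution to two candidate words.
        logprobs = next_token_logprobs(review + " very")
        pos, neg = logprobs[vocab["positive"]], logprobs[vocab["negative"]]
        return "positive" if pos > neg else "negative"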

    Insights from tendencies

  • Some additional experiments give extra insights for further research. First, the model benefits more from language modeling as an auxiliary objective when it is fine-tuned on a large dataset than on a small one. Second, the less pretraining, the worse the result after fine-tuning. The latter observation, in particular, seems to have influenced GPT-2 and GPT-3, the tremendously large-scale pretrained models that followed.

Conclusion

  • As the number of citations shows, this paper is a milestone in natural language processing. It proposes the learning paradigm of fine-tuning with only a small labeled dataset after pretraining on large amounts of unlabeled data, and shows that a Transformer-based model learns linguistic knowledge from pretraining better than an LSTM, the previous benchmark model.
  • These findings opened the door to exploiting large-scale unannotated data, which had hardly been considered useful for model training. The paper is also meaningful in that it presents the viewpoint of optimizing a model toward zero-shot behavior and provides significant evidence to support it.

Reference and Implementation:

  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.

© 2022.03. by bigshane