What research laid the foundations for large-scale language models such as ChatGPT? A summary of 24 important papers



ChatGPT was released in November 2022, and after it exceeded 1 million users in just one week, chat AIs built on large-scale language models, such as Google's Bard and Microsoft's Bing AI Chat, were released one after another. Sebastian Raschka, a researcher who works on chat AI, has summarized the trajectory of the research that brought chat AI to practical use, focusing on 24 important papers.

Understanding Large Language Models - by Sebastian Raschka
https://magazine.sebastianraschka.com/p/understanding-large-language-models


◆ Contents
・Main architecture and tasks
・Improving scaling and efficiency
・Guiding the language model in the intended direction
・Reinforcement learning with human feedback (RLHF)




・Main architecture and tasks


◆1: Neural Machine Translation by Jointly Learning to Align and Translate (2014)
By introducing 'attention', a mechanism that determines which parts of the input a recurrent neural network (RNN) should emphasize, it became possible to handle longer sentences accurately.
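The paper's formulation is an 'additive' attention: each encoder hidden state is scored against the current decoder state, the scores are softmax-normalized into weights, and the weighted sum becomes a context vector. Below is a minimal NumPy sketch of that idea (the parameter names and shapes are illustrative assumptions, not the paper's exact parameterization).

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W1, W2, v):
    """Bahdanau-style additive attention (simplified sketch).

    decoder_state:  (d,)    current decoder hidden state
    encoder_states: (T, d)  hidden states for each input position
    W1, W2:         (d, d)  learned projections (fixed arrays here)
    v:              (d,)    learned scoring vector
    Returns the context vector: a weighted sum of encoder states,
    where the weights say which input positions to emphasize.
    """
    # score each input position against the current decoder state
    scores = np.tanh(encoder_states @ W1 + decoder_state @ W2) @ v  # (T,)
    # normalize scores into attention weights (numerically stable softmax)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # context vector: positions the model "attends" to contribute more
    return weights @ encoder_states

# toy usage with random stand-ins for learned parameters
rng = np.random.default_rng(0)
d, T = 8, 5
ctx = additive_attention(rng.normal(size=d), rng.normal(size=(T, d)),
                         rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                         rng.normal(size=d))
print(ctx.shape)  # (8,)
```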



◆2: Attention Is All You Need (2017)
This paper introduced the 'transformer' model, which consists of an encoder part and a decoder part, along with a number of concepts that remain foundational today, such as positional input encoding.
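Because self-attention by itself is order-agnostic, positional input encoding injects word order by adding a deterministic sin/cos pattern to each token embedding. A minimal sketch of the paper's sinusoidal formulation (the function name is mine):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Positional input encoding from 'Attention Is All You Need'.

    Each position gets a fixed sin/cos pattern across the embedding
    dimensions; the result is added to the token embeddings so the
    model can tell positions apart. Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

print(sinusoidal_positional_encoding(4, 8).round(2))
```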



◆3: On Layer Normalization in the Transformer Architecture (2020)
This paper showed that placing the transformer's layer normalization ('Norm') before each block (Pre-LN), rather than after it, works better.
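Here is a rough sketch of the two orderings; a plain linear layer stands in for the attention or feed-forward sublayer (an assumption made for brevity, not the full block):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Contrast of Post-LN (original) vs Pre-LN transformer blocks.
    `sublayer` is a placeholder for attention or the feed-forward net."""

    def __init__(self, d_model, pre_norm=True):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = nn.Linear(d_model, d_model)  # placeholder sublayer
        self.pre_norm = pre_norm

    def forward(self, x):
        if self.pre_norm:
            # Pre-LN: normalize *before* the sublayer; the residual path
            # stays untouched, which the paper found stabilizes training.
            return x + self.sublayer(self.norm(x))
        # Post-LN: the original ordering, normalize after the residual add.
        return self.norm(x + self.sublayer(x))

x = torch.randn(2, 16)
print(Block(16, pre_norm=True)(x).shape)  # torch.Size([2, 16])
```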



◆4: Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Neural Networks (1991)
An approach comparable to transformers was already being explored back in 1991. Raschka recommends it 'for those who are interested in history.'



◆5: Universal Language Model Fine-tuning for Text Classification (2018)
This paper showed that training a language model in two stages, pre-training and fine-tuning, enables it to perform tasks well. It was written a year after the transformer paper, but its focus is on ordinary RNNs rather than transformers.



◆6: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
Mirroring the transformer's division into an encoder and a decoder, the research field also split into two directions: encoder-style transformers for tasks such as text classification, and decoder-style transformers for tasks such as translation and summarization.

The BERT paper introduced the technique of masking parts of a sentence and having the model predict them, which lets the language model understand context.
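Below is a minimal sketch of that masking idea; it omits BERT's 80/10/10 replacement rule and the next-sentence prediction objective:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """BERT-style masked language modeling (simplified sketch).

    Randomly hide ~15% of tokens; the model is trained to recover the
    originals from *both* left and right context, which is what makes
    the encoder bidirectional.
    """
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)    # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)   # no loss at unmasked positions
    return masked, labels

random.seed(1)
print(mask_tokens("the cat sat on the mat".split()))
```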



◆7: Improving Language Understanding by Generative Pre-Training (2018)
The first GPT paper. A language model with a decoder-style structure was trained by 'predicting the next word'.
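In code, 'predicting the next word' amounts to a cross-entropy loss whose targets are the input sequence shifted by one position. A minimal sketch (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def next_word_loss(logits, token_ids):
    """Causal language-modeling loss (the GPT training objective, sketched).

    logits:    (seq_len, vocab) model outputs at each position
    token_ids: (seq_len,)       the actual token sequence
    Each position's logits must predict the *next* token, so targets
    are shifted left by one and the final position is dropped.
    """
    return F.cross_entropy(logits[:-1], token_ids[1:])

vocab, seq_len = 100, 6
loss = next_word_loss(torch.randn(seq_len, vocab),
                      torch.randint(0, vocab, (seq_len,)))
print(loss.item())
```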



◆8: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (2019)
This paper combined an encoder-style transformer, which is good at prediction, with a decoder-style transformer, which is good at text generation, so the strengths of both can be exploited.



◆9: Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond (2023)
Not a research paper but a survey summarizing how the different architectures have evolved. As the blue branch on the right of the survey's evolutionary-tree figure shows, the development of decoder-style models has been particularly remarkable.





・Improving scaling and efficiency


◆10: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022)
A wonderfully clear presentation of an algorithm that computes attention quickly while also reducing memory consumption.
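The core trick is to process keys and values in blocks with an 'online' softmax, so the full T×T attention matrix is never materialized. A simplified NumPy sketch of just that math (the real algorithm also tiles queries and is engineered around GPU SRAM and recomputation):

```python
import numpy as np

def flash_like_attention(Q, K, V, block=2):
    """Blockwise attention with a running (online) softmax: the exact
    same result as naive attention, without building the full matrix."""
    T, d = Q.shape
    out = np.zeros_like(V)
    m = np.full(T, -np.inf)   # running max of scores per query row
    l = np.zeros(T)           # running softmax denominator per query row
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)              # scores for this key block
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)              # rescale earlier partials
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]

# sanity check against naive softmax attention
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
S = Q @ K.T / np.sqrt(8)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
print(np.allclose(flash_like_attention(Q, K, V), ref))  # True
```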



◆11: Cramming: Training a Language Model on a Single GPU in One Day (2022)
This paper shows that small models can be trained quickly but are less efficient to train; conversely, even when the model size is increased, training takes a similar amount of time.



◆12: LoRA: Low-Rank Adaptation of Large Language Models (2021)
There are various methods for fine-tuning large-scale language models; among them, 'LoRA' stands out for its parameter efficiency.



LoRA is also explained partway through the article below.

What kind of method is 'QLoRA', which makes it possible to train a language model with a large number of parameters even on limited GPU memory? - GIGAZINE
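The idea behind LoRA: freeze the pretrained weight matrix W and learn only a low-rank update BA on top of it. A minimal PyTorch sketch (a simplified reading of the method, not the official implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: the pretrained weight stays frozen and a
    trainable rank-r update B @ A is added, so only r*(d_in+d_out)
    parameters are tuned instead of d_in*d_out."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # freeze pretrained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 adapter parameters vs. 262144 in the full weight
```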



◆13: Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning (2022)
Pre-trained language models perform well on a variety of tasks, but fine-tuning is necessary to specialize them for specific ones. This paper reviews a number of techniques for efficient fine-tuning.



◆14: Scaling Language Models: Methods, Analysis & Insights from Training Gopher (2022)
Examining how performance improves as the number of parameters in a language model increases, the authors find that larger models get better at tasks such as comprehension, fact-checking, and identifying toxic language. On the other hand, results on logic and mathematical reasoning tasks did not change much.



◆15: Training Compute-Optimal Large Language Models (2022)
This paper showed a new relationship between a model's parameter count and the amount of training data needed to improve performance on generation tasks, pointing out that models such as GPT-3 and Gopher are undertrained.
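The paper's result is often summarized as a rule of thumb of roughly 20 training tokens per model parameter. A back-of-the-envelope sketch using that heuristic (the popularized rule of thumb, not the paper's exact fitted formula):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb reading of the Chinchilla result: for a given
    compute budget, train on ~20 tokens per model parameter."""
    return n_params * tokens_per_param

# GPT-3 has ~175B parameters but was trained on ~300B tokens,
# far below the ~3.5T tokens this heuristic suggests -- "undertrained".
print(f"{chinchilla_optimal_tokens(175e9):.1e} tokens")  # 3.5e+12 tokens
```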



◆16: Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (2023)
This paper studies how language models acquire their abilities over the course of training.



The paper presents findings such as the following:
・Training on duplicated data neither helps nor hurts
・Training order has no effect on memorization
・Words seen many times during pre-training improve performance on related tasks
・Doubling the batch size halves training time without hurting convergence



・Guiding the language model in the intended direction


◆17: Training Language Models to Follow Instructions with Human Feedback (2022)
This paper introduced 'reinforcement learning with human feedback' (RLHF), which incorporates humans into the reinforcement learning loop. It is known as the InstructGPT paper, after the name of the language model trained in it.
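A central ingredient of RLHF is a reward model trained on human preference comparisons: given two responses, the one humans preferred should score higher. A minimal sketch of that pairwise loss (names and toy values are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise loss used to train an RLHF reward model (sketch).

    Human labelers rank two model responses; the reward model is
    trained so the preferred one scores higher:
        loss = -log(sigmoid(r_chosen - r_rejected))
    The trained reward model then scores outputs during the RL step.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy scores for a batch of 3 human-labeled comparison pairs
print(reward_model_loss(torch.tensor([1.2, 0.3, 2.0]),
                        torch.tensor([0.4, 0.9, 1.1])).item())
```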



◆18: Constitutional AI: Harmlessness from AI Feedback (2022)
The authors developed a rule-based self-training mechanism for creating a 'harmless' AI.



◆19: Self-Instruct: Aligning Language Model with Self Generated Instruction (2022)
When fine-tuning a language model, having humans prepare the instruction data makes it hard to scale. This paper describes a mechanism by which the language model generates the instruction data itself. The resulting model performs better than the original language model and than one trained on human-prepared data, but it loses to models trained with RLHF.





・Reinforcement learning with human feedback (RLHF)


Raschka says RLHF is considered the best option available as of May 2023. Expecting the influence of RLHF to keep growing, he introduces additional RLHF papers for those who want to learn more about it.

◆20: Asynchronous Methods for Deep Reinforcement Learning (2016)
This paper introduces policy gradient methods.

◆21: Proximal Policy Optimization Algorithms (2017)
The authors developed proximal policy optimization (PPO), which improves on plain policy gradient methods with better data efficiency and scalability.
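PPO's key idea is a clipped surrogate objective that keeps each policy update close to the previous policy. A minimal sketch (per-sample probability ratios and advantage estimates are assumed to be precomputed):

```python
import torch

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective (sketch).

    ratio:     pi_new(a|s) / pi_old(a|s), probability ratio per sample
    advantage: estimated advantage per sample
    Clipping the ratio to [1-eps, 1+eps] keeps each policy update
    small, which is what makes PPO more stable and data-efficient
    than plain policy gradients.
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()  # minimize the negative

print(ppo_clip_loss(torch.tensor([0.9, 1.5, 1.1]),
                    torch.tensor([1.0, 1.0, -0.5])).item())
```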

◆22: Fine-Tuning Language Models from Human Preferences (2020)
This paper introduced PPO into RLHF.

◆23: Learning to Summarize from Human Feedback (2022)
The authors created a model that achieves better results than ordinary supervised learning through three-step training: 'pre-training' → 'fine-tuning' → 'PPO'.



◆24: Training Language Models to Follow Instructions with Human Feedback (2022)
Paper No. 17 makes a reappearance. Training follows the same three steps as above, but the emphasis is on text generation rather than summarization, and the number of response options presented to human evaluators is increased.

in Software, Posted by log1d_ts