Benchmark measurements of various chat AI language models suggest that Meta's large language model 'LLaMA' could reproduce ChatGPT



In recent years, research in the field of machine learning has progressed at a dizzying pace, and large language models with billions of parameters have been announced one after another. A research team led by Yao Fu, a large language model researcher at the University of Edinburgh, UK, has published results on GitHub comparing the performance of multiple large language models on the team's own benchmarks.

GitHub - FranxYao/chain-of-thought-hub: Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
https://github.com/FranxYao/chain-of-thought-hub



According to the research team, many people claim that 'even a language model with fewer than 10B parameters can achieve performance equivalent to OpenAI's GPT-3.5.' However, when it released GPT-4, OpenAI pointed out that 'the performance differences between large language models only appear on tasks of sufficient complexity.' To verify the performance differences between various large language models against a common set of benchmarks, the team therefore compiled the following 'list of complex reasoning tasks.'

MMLU …… Knowledge questions at high school and college level.
GSM8K …… Arithmetic word problems at elementary school level. Performance gains on this dataset translate directly into everyday mathematical ability when interacting with large language models.
MATH …… Very difficult math and science problems.
BBH …… 27 difficult reasoning tasks.
HumanEval …… A classic dataset for evaluating coding ability.
C-Eval …… A collection of Chinese knowledge-test questions covering 52 disciplines.
TheoremQA …… A question-and-answer dataset based on 350 theorems from fields such as mathematics, physics, electrical and electronic engineering, computer science, and finance.
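
To make the evaluation style concrete, the following is a minimal sketch of chain-of-thought prompting on a GSM8K problem, in the spirit of the chain-of-thought-hub benchmarks. The single exemplar and the answer-extraction rule are simplifying assumptions for illustration, not the repository's exact harness.

```python
import re
import openai  # 0.x-era OpenAI Python client, current when this article was written

openai.api_key = "YOUR_API_KEY"  # assumption: supply your own key

# One few-shot chain-of-thought exemplar followed by the test question
# (the first problem from the public GSM8K dataset; gold answer is 72).
# Real evaluations use several exemplars; one is shown for brevity.
PROMPT = """Q: A bag has 3 red marbles and 5 blue marbles. How many marbles are in the bag?
A: There are 3 red marbles and 5 blue marbles, so 3 + 5 = 8. The answer is 8.

Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
A:"""

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,  # deterministic decoding for benchmark runs
)
completion = resp["choices"][0]["message"]["content"]

# Score by taking the last number in the model's reasoning chain and
# comparing it with the gold answer.
numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
predicted = numbers[-1] if numbers else None
print("correct" if predicted == "72" else f"wrong (got {predicted})")
```

Greedy (temperature 0) decoding and last-number extraction are common conventions for scoring GSM8K-style reasoning chains; a full benchmark run simply loops this over the dataset and reports accuracy.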



The table of the research team's benchmark results is below. The 'Type' column for each large language model indicates its training stage: 'Base' means pre-trained only, 'SIFT' means after supervised instruction fine-tuning, and 'RLHF' means after reinforcement learning from human feedback.

| Model name | Parameters | Type | GSM8K | MATH | MMLU | BBH | HumanEval | C-Eval | TheoremQA |
|---|---|---|---|---|---|---|---|---|---|
| gpt-4 | ? | RLHF | 92.0 | 42.5 | 86.4 | - | 67.0 | 68.7 | 43.4 |
| claude-v1.3 | ? | RLHF | 81.8 | - | 74.8 | 67.3 | - | 54.2 | 24.9 |
| PaLM-2 | ? | Base | 80.7 | 34.3 | 78.3 | 78.1 | - | - | 31.8 |
| gpt-3.5-turbo | ? | RLHF | 74.9 | - | 67.3 | 70.1 | 48.1 | 54.4 | 30.2 |
| claude-instant | ? | RLHF | 70.8 | - | - | 66.9 | - | 45.9 | 23.6 |
| text-davinci-003 | ? | RLHF | - | - | 64.6 | 70.7 | - | - | 22.8 |
| code-davinci-002 | ? | Base | 66.6 | 19.1 | 64.5 | 73.7 | 47.0 | - | - |
| text-davinci-002 | ? | SIFT | 55.4 | - | 60.0 | 67.2 | - | - | 16.6 |
| Minerva | 540B | SIFT | 58.8 | 33.6 | - | - | - | - | - |
| Flan-PaLM | 540B | SIFT | - | - | 70.9 | 66.3 | - | - | - |
| Flan-U-PaLM | 540B | SIFT | - | - | 69.8 | 64.9 | - | - | - |
| PaLM | 540B | Base | 56.9 | 8.8 | 62.9 | 62.0 | 26.2 | - | - |
| LLaMA | 65B | Base | 50.9 | 10.6 | 63.4 | - | 23.7 | 38.8 | - |
| PaLM | 64B | Base | 52.4 | 4.4 | 49.0 | 42.3 | - | - | - |
| LLaMA | 33B | Base | 35.6 | 7.1 | 57.8 | - | 21.7 | - | - |
| InstructCodeT5+ | 16B | SIFT | - | - | - | - | 35.0 | - | 11.6 |
| StarCoder | 15B | Base | 8.4 | 15.1 | 33.9 | - | 33.6 | - | 12.2 |
| Vicuna | 13B | SIFT | - | - | - | - | - | - | 12.9 |
| LLaMA | 13B | Base | 17.8 | 3.9 | 46.9 | - | 15.8 | - | - |
| Flan-T5 | 11B | SIFT | 16.1 | - | 48.6 | 41.4 | - | - | - |
| Alpaca | 7B | SIFT | - | - | - | - | - | - | 13.5 |
| LLaMA | 7B | Base | 11.0 | 2.9 | 35.1 | - | 10.5 | - | - |
| Flan-T5 | 3B | SIFT | 13.5 | - | 45.5 | 35.2 | - | - | - |


Looking at the table, we can see that even within the same model family there is a large difference in performance depending on the number of parameters, and that the scores on each benchmark rise roughly in step with model size.
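
As a quick illustration (our own, not part of the research team's analysis), this rough scaling of score with parameter count can be checked against the LLaMA rows of the table with a few lines of Python:

```python
# Correlate log(parameter count) with benchmark scores for the LLaMA
# family, using the MMLU and GSM8K numbers from the table above.
import math
from statistics import correlation  # requires Python 3.10+

# (parameters in billions, MMLU, GSM8K) per LLaMA model
llama = {
    "LLaMA 7B":  (7,  35.1, 11.0),
    "LLaMA 13B": (13, 46.9, 17.8),
    "LLaMA 33B": (33, 57.8, 35.6),
    "LLaMA 65B": (65, 63.4, 50.9),
}

log_params = [math.log(p) for p, _, _ in llama.values()]
mmlu = [m for _, m, _ in llama.values()]
gsm8k = [g for _, _, g in llama.values()]

# Both correlations come out close to 1.0: scores climb almost
# monotonically as the parameter count grows.
print(f"corr(log params, MMLU)  = {correlation(log_params, mmlu):.3f}")
print(f"corr(log params, GSM8K) = {correlation(log_params, gsm8k):.3f}")
```

From these results, the research team points out the following.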

・'GPT-4' clearly outperforms all other models on GSM8K and MMLU.
・Meta's 'LLaMA', at 65B parameters, performs very close to 'text/code-davinci-002', the GPT-3-based natural language processing engines; if tuned correctly, it may be possible to reproduce ChatGPT on the basis of the 65B LLaMA.
・'Claude', developed by the AI research startup Anthropic, is the only large language model family comparable to the GPT family.
・The fact that 'gpt-3.5-turbo' outperforms 'text-davinci-003' on GSM8K corroborates the 'improved mathematical capabilities' that OpenAI mentioned in its January 30, 2023 release notes (a sketch of how the two model types are queried follows this list).
・On MMLU, 'gpt-3.5-turbo' is slightly better than 'text-davinci-003', but the difference is not large.
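
For context, benchmark harnesses of that period had to query these two model types through different endpoints of the 0.x-era OpenAI Python client, as in the minimal sketch below; the prompt here is a hypothetical stand-in, and the API key is an assumption you would supply yourself.

```python
import openai  # 0.x-era OpenAI Python client, current when this article was written

openai.api_key = "YOUR_API_KEY"  # assumption: supply your own key
PROMPT = "Q: What is 48 + 24?\nA:"  # stand-in for a real benchmark prompt

# 'text-davinci-003' is a completion model: queried with a raw text prompt.
davinci = openai.Completion.create(
    model="text-davinci-003",
    prompt=PROMPT,
    max_tokens=256,
    temperature=0,
)
print(davinci["choices"][0]["text"])

# 'gpt-3.5-turbo' is a chat model: queried with a list of messages.
turbo = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,
)
print(turbo["choices"][0]["message"]["content"])
```

With prompts and scoring held fixed across both endpoints, differences such as the GSM8K gap can be attributed to the models themselves rather than to the querying method.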



The research team notes that it is generally very difficult to compare the performance of large language models rigorously, due to factors such as whether a given model was trained on the corresponding domain and whether the prompts were optimized for it. The results should therefore be viewed as approximate reference values.

in Software, Web Service, Posted by log1h_ik