May 26, 2023 21:00:00

GPT-4-based ChatGPT ranks first in conversational chat AI benchmark rankings, Claude-v1 ranks second, and Google's PaLM 2 also ranks in the top 10

The Large Model Systems Org (LMSYS Org), an open research organization established by UC Berkeley students and faculty in collaboration with UC San Diego and Carnegie Mellon University, has opened its large language model (LLM) benchmark 'Chatbot Arena' to the public.

Chatbot Arena Leaderboard Updates (Week 4) | LMSYS Org
https://lmsys.org/blog/2023-05-25-leaderboard/

In Chatbot Arena, users are invited to FastChat, an open platform for evaluating LLM-based conversational AI, where they chat with two anonymous models and vote on which one gave the more accurate answer. Based on these votes, wins and losses are tallied, ratings are calculated with the Elo rating system widely used in chess and other games, and the resulting standings are published.
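As a rough illustration of how pairwise votes can be turned into Elo ratings, here is a minimal Python sketch. The function names, the input format, the K-factor of 4, and the starting rating of 1000 are assumptions made for this example; the article does not state the exact constants or code LMSYS Org uses.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def compute_elo(battles, k=4.0, initial=1000.0):
    """Update ratings one vote at a time.

    battles: iterable of (model_a, model_b, winner), where winner is
    'a', 'b', or 'tie' (hypothetical format chosen for this sketch).
    k and initial are illustrative assumptions, not confirmed values.
    """
    ratings = defaultdict(lambda: initial)
    actual = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        ea = expected_score(ra, rb)   # expected score for model A
        sa = actual[winner]           # actual score for model A
        # The winner gains what the loser gives up, so the total is conserved.
        ratings[model_a] = ra + k * (sa - ea)
        ratings[model_b] = rb - k * (sa - ea)
    return dict(ratings)
```

Under this model, a 30-point gap such as 1225 versus 1195 corresponds to an expected win probability of roughly 0.54 for the higher-rated model, so even neighbouring ranks in the standings are far from one-sided matchups.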

Below is the ranking based on anonymous voting data of 27,000 votes from April 24th to May 22nd, 2023. ChatGPT, which is based on OpenAI's GPT-4, ranked first, while OpenAI competitor Anthropic's Claude-v1 and its lightweight model came in second and third.

Rank	Model	Elo Rating	Description
1	GPT-4	1225	ChatGPT based on GPT-4
2	Claude-v1	1195	Anthropic Chat AI
3	Claude-instant-v1	1153	Faster and cheaper with Claude's lighter model
4	GPT-3.5-turbo	1143	ChatGPT based on GPT-3.5
5	Vicuna-13B	1054	Chat AI fine-tuned from LLaMA, 13 billion parameters
6	PaLM 2	1042	Chat AI based on 'PaLM 2', the same LLM that underlies Google's chat AI 'Bard'
7	Vicuna-7B	1007	Chat AI fine-tuned from LLaMA, 7 billion parameters
8	Koala-13B	980	Chat AI fine-tuned from LLaMA by UC Berkeley's BAIR
9	mpt-7B-chat	952	Chat AI based on MosaicML's open source LLM 'MPT-7B'
10	FastChat-T5-3B	941	Chat AI developed by LMSYS Org
11	Alpaca-13B	937	Chat AI based on Stanford's 'Alpaca', an LLM fine-tuned from Meta's LLaMA
12	RWKV-4-Raven-14B	928	Chat AI based on an RNN-type LLM with performance comparable to Transformer-based LLMs
13	Oasst-Pythia-12B	921	Open assistant by LAION
14	ChatGLM-6B	921	An Open Bilingual Dialogue Language Model by Tsinghua University
15	StableLM-Tuned-Alpha-7B	882	Chat AI based on Stability AI's language model
16	Dolly-V2-12B	866	Open-source (MIT-licensed) LLM-based chat AI tuned by Databricks
17	LLaMA-13B	854	Chat AI based on Meta's LLaMA-13B

Below is a table that shows, in color, the win rate of each model against each of the others. Higher win rates are shown in blue, and lower win rates are shown in red.
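A win-rate table of this kind can be derived directly from the raw votes. The following Python sketch shows one way to do it; the function name and input format are hypothetical, and excluding ties from the denominator is an assumption of this example rather than something the article confirms.

```python
from collections import defaultdict


def pairwise_win_rates(battles):
    """Return {(model, opponent): fraction of non-tied battles won by model}.

    battles: iterable of (model_a, model_b, winner), where winner is
    'a', 'b', or 'tie' (hypothetical format chosen for this sketch).
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for model_a, model_b, winner in battles:
        if winner == "tie":
            continue  # assumption: ties are left out of the win rate
        # Count the battle for both orderings of the pair.
        games[(model_a, model_b)] += 1
        games[(model_b, model_a)] += 1
        if winner == "a":
            wins[(model_a, model_b)] += 1
        else:
            wins[(model_b, model_a)] += 1
    return {pair: wins.get(pair, 0) / n for pair, n in games.items()}
```

Each entry can then be mapped to a color scale, with values near 1.0 drawn in blue and values near 0.0 in red, to reproduce the layout described above.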

Among these results, LMSYS Org draws particular attention to Google's PaLM 2. PaLM 2 is ranked 6th in the standings and has a good win rate. However, LMSYS Org said, 'PaLM 2 seems to be more strongly regulated than other models. When users ask unclear or difficult questions, PaLM 2 is more likely to refrain from answering than other models.'

For example, when asked to emulate a Linux terminal or a programming language interpreter, PaLM 2 refused. Furthermore, LMSYS Org assessed that 'PaLM 2's reasoning ability is not sufficient.'

PaLM 2 also tended not to answer questions in languages other than English, such as Chinese, Spanish, and Hebrew. PaLM 2 ranked 5th when only conversations in English were taken into account, but fell to 16th when only non-English conversations were counted.

LMSYS Org also noted the high ranking of chatbots based on smaller LLMs such as Vicuna-7B and mpt-7b-chat. 'Small models showed a performance advantage when compared to large models with more than twice the number of parameters. Fine-tuning datasets seem to be more important in some cases,' LMSYS Org said, pointing out that preparing high-quality datasets for pre-training and fine-tuning is an important approach to reducing model size.

May 26, 2023 21:00:00 in Software, Posted by log1i_yk