GPT-4-based ChatGPT ranks first in a conversational chat AI benchmark, Anthropic's Claude-v1 ranks second, and Google's PaLM 2 also places in the top 10



The Large Model Systems Org (LMSYS Org), an open research organization established by UC Berkeley students and faculty in collaboration with UC San Diego and Carnegie Mellon University, has published updated rankings for its large language model (LLM) benchmark 'Chatbot Arena'.

Chatbot Arena Leaderboard Updates (Week 4) | LMSYS Org
https://lmsys.org/blog/2023-05-25-leaderboard/



In Chatbot Arena, users are invited to FastChat, an open platform for evaluating LLM-based conversational AI, where they chat with two anonymous models side by side and vote on which one gives the better answer. Based on these votes, win/loss records and ratings are calculated using the Elo rating system widely used in chess and other games, and the resulting standings are published.
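As a rough illustration of how such pairwise votes can be turned into ratings, below is a minimal sketch of an Elo-style update in Python. The K-factor and starting rating are illustrative assumptions, not the exact parameters LMSYS Org uses.

```python
# Minimal sketch of an Elo-style rating update for pairwise chatbot votes.
# The K-factor and initial rating are illustrative assumptions, not the
# exact parameters used by LMSYS Org.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one vote."""
    score_a = 1.0 if a_wins else 0.0
    exp_a = expected_score(rating_a, rating_b)
    rating_a += k * (score_a - exp_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - exp_a))
    return rating_a, rating_b

# Example: two models start at 1000; model A wins one anonymous comparison.
a, b = update_elo(1000.0, 1000.0, a_wins=True)
print(round(a), round(b))  # -> 1016 984
```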

Below is the ranking based on roughly 27,000 anonymous votes collected from April 24 to May 22, 2023. ChatGPT based on OpenAI's GPT-4 ranked first, while Anthropic's Claude-v1 and its lightweight sibling came in second and third.

Rank  Model                    Elo Rating  Commentary
1     GPT-4                    1225        ChatGPT based on GPT-4
2     Claude-v1                1195        Anthropic's chat AI
3     Claude-instant-v1        1153        A lighter, faster, and cheaper version of Claude
4     GPT-3.5-turbo            1143        ChatGPT based on GPT-3.5
5     Vicuna-13B               1054        Chat AI fine-tuned from Meta's LLaMA, 13 billion parameters
6     PaLM 2                   1042        Chat AI based on 'PaLM 2', the model behind Google's chatbot 'Bard'
7     Vicuna-7B                1007        Chat AI fine-tuned from Meta's LLaMA, 7 billion parameters
8     Koala-13B                980         Chat AI fine-tuned from LLaMA by UC Berkeley's BAIR
9     mpt-7b-chat              952         Chat AI based on MosaicML's open-source LLM 'MPT-7B'
10    FastChat-T5-3B           941         Chat AI developed by LMSYS Org
11    Alpaca-13B               937         Chat AI fine-tuned from Meta's LLaMA by Stanford (Alpaca)
12    RWKV-4-Raven-14B         928         Chat AI based on an RNN-based LLM with performance comparable to Transformer-based LLMs
13    Oasst-Pythia-12B         921         Open Assistant chat AI by LAION, based on Pythia
14    ChatGLM-6B               921         An open bilingual dialogue language model by Tsinghua University
15    StableLM-Tuned-Alpha-7B  882         Chat AI based on Stability AI's language model
16    Dolly-V2-12B             866         Databricks' open-source instruction-tuned LLM (MIT license)
17    LLaMA-13B                854         Chat AI based on Meta's LLaMA-13B


Below is a table showing each model's win rate against the others as a color map: higher win rates are shown in blue, lower win rates in red.
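To give a sense of how the Elo differences in the table above translate into expected head-to-head win rates, the short Python snippet below applies the standard Elo win-probability formula to the published ratings; the actual observed win rates in the color map may differ.

```python
# Expected head-to-head win probabilities implied by the published Elo ratings.
# Uses the standard Elo formula; observed win rates in the color map may differ.

def win_probability(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

ratings = {"GPT-4": 1225, "Claude-v1": 1195, "Vicuna-13B": 1054, "LLaMA-13B": 854}

print(f"GPT-4 vs Claude-v1:  {win_probability(ratings['GPT-4'], ratings['Claude-v1']):.2f}")
print(f"GPT-4 vs Vicuna-13B: {win_probability(ratings['GPT-4'], ratings['Vicuna-13B']):.2f}")
print(f"GPT-4 vs LLaMA-13B:  {win_probability(ratings['GPT-4'], ratings['LLaMA-13B']):.2f}")
```

Under this formula, GPT-4's 30-point lead over Claude-v1 corresponds to an expected win rate of only about 54 percent, while its 371-point lead over LLaMA-13B corresponds to roughly 89 percent.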



Among these results, LMSYS Org draws attention to Google's PaLM 2. PaLM 2 ranks 6th in the standings with a respectable win rate. However, LMSYS Org notes that PaLM 2 appears to be more tightly regulated than other models: when users ask ambiguous or difficult questions, PaLM 2 is more likely than other models to refuse to answer.

For example, when asked to emulate a Linux terminal or a programming-language interpreter, PaLM 2 refused. LMSYS Org also assessed that PaLM 2's reasoning ability is not yet sufficient.

PaLM 2 also tended not to answer questions in languages other than English, such as Chinese, Spanish, and Hebrew. When only English questions were counted, PaLM 2 ranked 5th, but it fell to 16th on non-English questions.



LMSYS Org also noted the strong showing of chatbots based on smaller LLMs such as Vicuna-7B and mpt-7b-chat, which in some cases outperformed models with more than twice as many parameters. The organization suggested that the fine-tuning dataset can matter more than sheer scale in such cases, and that preparing high-quality pre-training and fine-tuning datasets is an important approach to reducing model size.

in Software, Posted by log1i_yk