``Hallucination Leaderboard'' ranking how often various chat AIs hallucinate has been published



Large language models such as ChatGPT can handle language as fluently as a human, but they can also produce ``hallucinations,'' confidently stating things that are not true. AI company Vectara has published the results of a study measuring how often hallucinations occur across various large language models.

vectara/hallucination-leaderboard: Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents
https://github.com/vectara/hallucination-leaderboard


Cut the Bull…. Detecting Hallucinations in Large Language Models - Vectara
https://vectara.com/cut-the-bull-detecting-hallucinations-in-large-language-models/


An example of an actual hallucination is shown below: information that does not appear in the original text shows up in the summary.

Original text:
The plants were discovered during a search of a warehouse near Ashbourne on Saturday morning. Police said they were in a 'sophisticated cultivation facility.' A man in his late 40s was arrested at the scene.

PaLM summary:
Police have arrested a man in his late 40s after cannabis plants worth an estimated £100,000 were discovered in a warehouse near Ashbourne.

The hallucination rate was evaluated by passing the following prompt to each large language model and scoring the output with the Hughes Hallucination Evaluation Model.

You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question 'Provide a concise summary of the following passage, covering the core pieces of information described.' <PASSAGE>

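As a rough sketch of how such a summary could be collected programmatically, the snippet below fills the <PASSAGE> placeholder and sends the prompt to a chat-completion API, here via the OpenAI Python SDK. The model name, temperature setting, and the way the passage is spliced into the prompt are illustrative assumptions; Vectara's actual evaluation harness is not published in this form.

```python
# Minimal sketch: collect a summary from one model under test.
# Assumptions: OPENAI_API_KEY is set, and "gpt-4" stands in for whichever
# model is being evaluated.
from openai import OpenAI

PROMPT_TEMPLATE = (
    "You are a chat bot answering questions using data. You must stick to the "
    "answers provided solely by the text in the passage provided. You are asked "
    "the question 'Provide a concise summary of the following passage, covering "
    "the core pieces of information described.' <PASSAGE>"
)

def summarize(passage: str, model: str = "gpt-4") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = PROMPT_TEMPLATE.replace("<PASSAGE>", passage)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content
```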



The results are shown in the table below. GPT-4 scored best on both accuracy and hallucination rate, followed by GPT-3.5 and Google Gemini Pro. If a generated summary was too short, it was treated as ``no answer'' and reflected in the answer rate.

Model | Accuracy | Hallucination rate | Answer rate | Avg. words per summary
GPT-4 | 97.0% | 3.0% | 100.0% | 81.1
GPT-4 Turbo | 97.0% | 3.0% | 100.0% | 94.3
GPT-3.5 Turbo | 96.5% | 3.5% | 99.6% | 84.1
Google Gemini Pro | 95.2% | 4.8% | 98.4% | 89.5
Llama 2 70B | 94.9% | 5.1% | 99.9% | 84.9
Llama 2 7B | 94.4% | 5.6% | 99.6% | 119.9
Llama 2 13B | 94.1% | 5.9% | 99.8% | 82.1
Cohere-Chat | 92.5% | 7.5% | 98.0% | 74.4
Cohere | 91.5% | 8.5% | 99.8% | 59.8
Anthropic Claude 2 | 91.5% | 8.5% | 99.3% | 87.5
Google PaLM 2 (beta) | 91.4% | 8.6% | 99.8% | 86.6
Mixtral 8x7B | 90.7% | 9.3% | 99.9% | 90.7
Amazon Titan Express | 90.6% | 9.4% | 99.5% | 98.4
Mistral 7B | 90.6% | 9.4% | 98.7% | 96.1
Google PaLM 2 Chat (beta) | 90.0% | 10.0% | 100.0% | 66.2
Google PaLM 2 | 87.9% | 12.1% | 92.4% | 36.2
Google PaLM 2 Chat | 72.8% | 27.2% | 88.8% | 221.1
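The columns relate to one another in a simple way: the answer rate is the share of documents for which the model produced a usable (not too short) summary, and accuracy and hallucination rate are complements computed over those answered documents. The sketch below shows one way such figures could be derived from per-document judgments; the minimum-length cutoff and field names are assumptions for illustration, not Vectara's published pipeline.

```python
# Hedged sketch of deriving the leaderboard columns from per-document judgments.
from dataclasses import dataclass

@dataclass
class Judgment:
    summary: str
    consistent: bool  # True if the evaluator judged the summary factually consistent

MIN_WORDS = 5  # assumed cutoff below which a summary counts as "no answer"

def leaderboard_row(judgments: list[Judgment]) -> dict:
    answered = [j for j in judgments if len(j.summary.split()) >= MIN_WORDS]
    accuracy = sum(j.consistent for j in answered) / len(answered)
    return {
        "Accuracy": f"{accuracy:.1%}",
        "Hallucination rate": f"{1 - accuracy:.1%}",  # complement of accuracy
        "Answer rate": f"{len(answered) / len(judgments):.1%}",
        "Avg. words per summary": round(
            sum(len(j.summary.split()) for j in answered) / len(answered), 1
        ),
    }
```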



The leaderboard will be updated on GitHub whenever a new model appears or an existing one is updated. Vectara states that it measures hallucination rates with a language model, rather than human judges, precisely so that the table can be refreshed regularly.

Note, however, that this study only evaluated factual consistency between each generated summary and the original text, so models that simply copied parts of the original verbatim scored well. Hallucination rate and summary quality are entirely different evaluation axes and should be assessed independently with different metrics.

Although there is still a long way to go before the hallucination problem of large language models is solved, Vectara says that by open-sourcing the ``Hughes Hallucination Evaluation Model'' used in this evaluation, it hopes to involve the community and take efforts against hallucinations to the next level.
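As a minimal sketch, the open-sourced evaluation model could be applied to the PaLM example above roughly as follows. This assumes the model is distributed on Hugging Face as vectara/hallucination_evaluation_model and can be loaded as a sentence-transformers CrossEncoder returning a factual-consistency score between 0 and 1; the exact loading path may differ between model versions.

```python
# Hedged sketch: score a summary against its source with the open-sourced model.
from sentence_transformers import CrossEncoder

source = (
    "The plants were discovered during a search of a warehouse near Ashbourne "
    "on Saturday morning. Police said they were in a 'sophisticated cultivation "
    "facility.' A man in his late 40s was arrested at the scene."
)
summary = (
    "Police have arrested a man in his late 40s after cannabis plants worth an "
    "estimated £100,000 were discovered in a warehouse near Ashbourne."
)

model = CrossEncoder("vectara/hallucination_evaluation_model")
score = model.predict([[source, summary]])[0]  # ~1.0 = consistent, ~0.0 = hallucinated
print(f"factual consistency score: {score:.3f}")
```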

in Software, Posted by log1d_ts