Results published of a test evaluating how well 'ChatGPT,' 'Copilot,' 'Gemini,' 'Claude,' and 'Perplexity' handle everyday conversations
As AI accuracy improves, chat AIs that can handle everyday conversations smoothly, such as ChatGPT, Copilot, and Gemini, are appearing one after another. However, it is difficult for general users to judge which chat AI is the most capable. Against this backdrop, The Wall Street Journal ran a test evaluating how well five chat AIs respond in everyday conversations and published the results.
The Great AI Chatbot Challenge: ChatGPT vs. Gemini vs. Copilot vs. Perplexity vs. Claude - WSJ
https://www.wsj.com/tech/personal-tech/ai-chatbots-chatgpt-gemini-copilot-perplexity-claude-f9e40d26
When AI companies and researchers promote the performance of their own AI, they often cite scores measured with benchmark tools. However, a good benchmark score does not necessarily mean an AI can accurately answer questions asked in everyday conversation. The Wall Street Journal therefore evaluated the responses of five chat AIs, 'ChatGPT,' 'Copilot,' 'Gemini,' 'Claude,' and 'Perplexity,' by giving them questions likely to arise in everyday conversation.
The questions used in the test were created in collaboration with Wall Street Journal editors and columnists and covered a variety of categories such as 'health,' 'finance,' and 'cooking.' For example, the cooking category included questions such as, 'Can you bake a chocolate cake without flour, gluten, dairy, nuts, or eggs? If so, please give me the recipe.' These questions were entered into the five chat AIs, and the editors and columnists rated the responses for 'accuracy,' 'usefulness,' and 'overall quality' without knowing which AI had produced each answer. Paid versions of the chat AIs were used for the test: ChatGPT used 'GPT-4o,' and Gemini used 'Gemini 1.5 Pro.'
The test results are as follows. Although performance varied depending on the question category, Perplexity came in first in the overall evaluation. However, Perplexity had the slowest response time among the five chat AIs. In addition, there was no significant difference between the five chat AIs in coding questions.
| Category | 1st place | 2nd place | 3rd place | 4th place | 5th place |
|---|---|---|---|---|---|
| Health | ChatGPT | Gemini | Perplexity | Claude | Copilot |
| Finance | Gemini | Claude | Perplexity | ChatGPT | Copilot |
| Cooking | ChatGPT | Gemini | Perplexity | Claude | Copilot |
| Work-related writing | Claude | Perplexity | Gemini | ChatGPT | Copilot |
| Creative writing | Copilot | Claude | Perplexity | Gemini | ChatGPT |
| Summarization | Perplexity | Copilot | ChatGPT | Claude | Gemini |
| Current affairs | Perplexity | ChatGPT | Copilot | Claude | Gemini |
| Coding | Perplexity | ChatGPT | Gemini | Claude | Copilot |
| Response time | ChatGPT | Gemini | Copilot | Claude | Perplexity |
| Overall rating | Perplexity | ChatGPT | Gemini | Claude | Copilot |
Microsoft told The Wall Street Journal that it plans to integrate GPT-4o into Copilot in the near future, so Copilot's performance is expected to improve. Note also that The Wall Street Journal's test was conducted in English only.
There are other examples of comprehensive analyses of AI performance. For example, Stanford University has published a report analyzing the performance and impact of AI every year since 2017. The contents of Stanford University's AI Index Report 2024 can be found in the following article.
Stanford University's 'AI Index Report 2024' is released, summarizing 'AI is more powerful than humans, but humans are better in some tests' and 'The learning cost of high-performance AI is several tens of billions of yen' - GIGAZINE
in Software, Posted by log1o_hf