Feb 25, 2025 11:26:00

'Claude 3.7 Sonnet' and 'Claude Code' have been released, and they have succeeded in defeating three 'Pokemon' gym leaders with performance that exceeds OpenAI o1 and DeepSeek-R1

Anthropic has announced ' Claude 3.7 Sonnet '. According to Anthropic, Claude 3.7 Sonnet is 'the first hybrid inference model on the market' and has outperformed OpenAI's o1, o3-mini, and DeepSeek-R1 in benchmark tests.

Claude 3.7 Sonnet and Claude Code \ Anthropic

https://www.anthropic.com/news/claude-3-7-sonnet

Claude's extended thinking \ Anthropic
https://www.anthropic.com/research/visible-extended-thinking

Anthropic's 'hybrid inference model' means a model that can provide both 'real-time answers' that answer questions immediately and 'thoughtful answers' that involve more inference. Users can choose whether to activate the AI model's inference function, and can choose whether to have Claude 3.7 Sonnet answer instantly or ponder over the answer.

Specifically, there are two modes: Normal and Extended. Normal mode is an upgraded version of Claude 3.5 Sonnet. Extended mode iterates reasoning before answering, which is said to improve performance on complex tasks such as solving math and physics problems and coding.

Additionally, when using Claude 3.7 Sonnet via the API, it is now possible to specify token values, giving users the freedom to tailor them to the speed, cost, and quality of answers they require.

Additionally, Anthropic says that in developing its inference models, it has focused less on optimizing for math and computer science competition problems and instead on real-world tasks that better reflect the situations in which companies actually use LLMs.

Below is a comparison of software engineering benchmark results using

SWE-bench Verified . Claude 3.7 Sonnet (far left) has a higher accuracy than Claude 3.5 Sonnet (October 2024 model), OpenAI's o1, o3-mini, and DeepSeek-R1.

In the

TAU-bench benchmark, which evaluates the performance of conversational AI agents in more realistic tasks, Claude 3.7 Sonnet outperformed Claude 3.5 Sonnet and OpenAI o1 in both Retail and Airlines.

Other benchmark results are as follows. The results on the left are for Claude 3.7 Sonnet in Extended mode, and the results on the right are for Normal mode.

In addition, Anthropic is running a benchmark to measure the performance of Claude 3.7 Sonnet's agents by playing the Game Boy game 'Pokémon Red.'

This benchmark test examines how far an AI model can go in playing Pokémon games by giving it screen recognition and basic controls. Claude 3.7 Sonnet was able to reach Vermilion City and defeat the gym leader, Surge.

At the time of writing, it is unclear how many calculations Claude 3.7 Sonnet had to make to get the three gym badges, or how long each one took, but Anthropic reports that 'Claude 3.7 Sonnet performed 35,000 actions before reaching the final gym leader, Matisse.'

Anthropic has also released Claude Code , a coding tool with an AI agent, as a research preview.

Claude Code overview - Anthropic

https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview

Claude Code is a tool that allows you to search and read code, edit files, write and run tests, commit and push to GitHub, use command line tools, and more.

Although Claude Code is still in development, Anthropic said, 'It has already become indispensable for our team in debugging and refactoring.' Tasks that would normally take more than 45 minutes to complete manually have been completed in one pass, reducing development time and overhead. Based on the usage of Anthropic, Claude Code will continue to be improved in the future.

'The Claude 3.7 Sonnet and Claude Code represent an important step towards AI systems that can truly augment human capabilities,' said Anthropic. 'With the ability to reason deeply, work autonomously, and collaborate effectively, AI will enrich what humans can achieve and move us closer to a more augmented future.'

Claude 3.7 Sonnet is available for all plans including Free, Pro, Team, and Enterprise, as well as Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI. Extended mode is available for all paid Claude plans. For API use, the price is $3 (approx. 450 yen) per million input tokens and $15 (approx. 2,250 yen) per million output tokens.

Continued
Claude 3.7 Anthropic starts streaming 'ClaudePlaysPokemon' on Twitch, letting Sonnet play Pokemon, everyone watches the super slow play while inferring - GIGAZINE

Related Posts:

Feb 25, 2025 11:26:00 in Software, Posted by log1i_yk