What was revealed by comparing DeepSeek's reasoning model 'DeepSeek-R1' with OpenAI's o1 & o3?

Most AI benchmarks measure an AI's output accuracy (skill), but skill alone does not represent intelligence. To measure intelligence rather than skill, the benchmark 'ARC-AGI' was created, and the ARC Prize, which runs a competition based on it, has published an analysis comparing DeepSeek's R1 family with OpenAI's o1 and o3.
R1-Zero and R1 Results and Analysis
https://arcprize.org/blog/r1-zero-r1-results-analysis

The goal of the ARC Prize is to define and evaluate new ideas for artificial general intelligence (AGI). To that end, the ARC Prize strives to build the strongest possible global innovation environment. The ARC Prize's stated view is that 'AGI does not yet exist, and innovation remains constrained.'
Meanwhile, DeepSeek has announced the DeepSeek-R1 family, its own reasoning models that achieve performance comparable to OpenAI's reasoning model o1. DeepSeek-R1, DeepSeek-R1-Zero, and o1 all score roughly 15-20% on ARC-AGI, while DeepSeek-R1's operating cost is only about 3.6% of o1's. In addition, since conventional large language models (LLMs) scored at most around 5% on ARC-AGI, the ARC Prize regards DeepSeek-R1 as an excellent AI model.
How did DeepSeek beat OpenAI's O1 at 3% of the cost? - GIGAZINE

However, the reasoning model 'o3' announced by OpenAI in December 2024 achieved very high ARC-AGI scores: 76% in low-compute mode and 88% in high-compute mode. The ARC Prize described it as 'the first computer system to achieve a practical, general level of adaptation to unknown problems,' while also commenting that 'o3's extremely high ARC-AGI scores have received little attention or coverage in the mainstream media.'
The table below summarizes the ARC-AGI scores for R1, R1-Zero, o1 (low, medium, and high computing modes), and o3 (low and high computing modes), along with the average tokens and average operating costs for each model.
Model name | ARC-AGI score | Avg. tokens | Avg. cost |
---|---|---|---|
R1-Zero | 14% | 11K | $0.11 (about 17 yen) |
R1 | 15.8% | 6K | $0.06 (about 9.3 yen) |
o1 (low) | 20.5% | 7K | $0.43 (about 66 yen) |
o1 (medium) | 31% | 13K | $0.79 (about 120 yen) |
o1 (high) | 35% | 22K | $1.31 (about 200 yen) |
o3 (low) | 75.7% | 335K | $20 (about 3,100 yen) |
o3 (high) | 87.5% | 57M | $3,400 (about 530,000 yen) |
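To put the cost differences in perspective, the minimal Python sketch below computes dollars spent per ARC-AGI percentage point from the per-task figures in the table above. The metric is just illustrative arithmetic over the reported numbers, not something the ARC Prize publishes.

```python
# Rough cost-effectiveness comparison using the figures from the table above.
# Scores and average costs are as reported by the ARC Prize; the "dollars per
# percentage point" metric is purely illustrative.

results = {
    "R1-Zero":     {"score": 14.0, "cost_usd": 0.11},
    "R1":          {"score": 15.8, "cost_usd": 0.06},
    "o1 (low)":    {"score": 20.5, "cost_usd": 0.43},
    "o1 (medium)": {"score": 31.0, "cost_usd": 0.79},
    "o1 (high)":   {"score": 35.0, "cost_usd": 1.31},
    "o3 (low)":    {"score": 75.7, "cost_usd": 20.0},
    "o3 (high)":   {"score": 87.5, "cost_usd": 3400.0},
}

for name, r in results.items():
    # dollars spent per ARC-AGI percentage point
    print(f"{name:12s} {r['cost_usd'] / r['score']:.4f} $/point")
```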
The ARC Prize speculates that OpenAI's o1 and o3 reasoning systems work roughly as follows. It calls this speculation because o1 and o3 are closed models, so the process by which they produce an answer is not publicly known.
1. Generate chains of thought (CoT) for the problem domain
2. Label intermediate CoT steps using a combination of human experts (supervised fine-tuning, or SFT) and automated machines (e.g., reinforcement learning)
3. Train the base model on the data from step 2
4. At test time, iteratively sample from the process model (a rough sketch follows this list)
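The following minimal Python sketch illustrates step 4 of the speculated pipeline. Since o1 and o3 are closed, everything here is hypothetical: generate_step and score_step are stand-ins for a trained process model and its step-level scorer, and greedy selection is just one possible search strategy.

```python
# Hypothetical sketch of "iteratively sample from the process model at test time".
# Nothing here reflects OpenAI's actual implementation.
import random

def generate_step(prefix: str) -> list[str]:
    """Hypothetical: propose several candidate next CoT steps."""
    return [f"{prefix} -> step{random.randint(0, 99)}" for _ in range(4)]

def score_step(candidate: str) -> float:
    """Hypothetical: the process model's score for an intermediate step."""
    return random.random()

def solve(problem: str, max_depth: int = 5) -> str:
    chain = problem
    for _ in range(max_depth):
        candidates = generate_step(chain)
        # Keep the candidate the process model rates highest (greedy search;
        # a real system might instead branch into several parallel chains).
        chain = max(candidates, key=score_step)
    return chain

print(solve("example ARC task"))
```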
In contrast, DeepSeek's R1 family is open source, so how its reasoning system works is publicly known. According to the ARC Prize, the key insight of DeepSeek's reasoning system is that adaptability to novelty (and reliability) improves along three dimensions:
1. Adding human labels (SFT) to the training of the CoT process model
2. Using CoT search (parallel per-step CoT inference) instead of linear inference
3. Sampling entire CoT trajectories (parallel trajectory inference); points 2 and 3 are illustrated in the sketch after this list
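To illustrate points 2 and 3, here is a hedged Python sketch of whole-trajectory sampling with majority voting over final answers (often called self-consistency). sample_cot is a hypothetical stand-in for a reasoning model's sampling API; nothing below is taken from DeepSeek's actual implementation.

```python
# Illustrative sketch: instead of one linear chain of thought, sample several
# full trajectories in parallel and keep the most common final answer.
from collections import Counter
import random

def sample_cot(problem: str) -> tuple[str, str]:
    """Hypothetical: return (chain_of_thought, final_answer) for one sample."""
    answer = random.choice(["A", "A", "A", "B"])  # dummy answer distribution
    return (f"reasoning about {problem} ...", answer)

def solve_with_trajectory_sampling(problem: str, n_samples: int = 8) -> str:
    answers = [sample_cot(problem)[1] for _ in range(n_samples)]
    # The most frequent final answer across sampled trajectories wins.
    return Counter(answers).most_common(1)[0][0]

print(solve_with_trajectory_sampling("example problem"))
```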
The ARC Prize states, 'The most interesting thing DeepSeek has done is to publish R1 and R1-Zero separately. R1-Zero is a model that does not use the SFT described in point 1 and instead relies on reinforcement learning. R1-Zero and R1 scored 14% and 15.8%, respectively, on ARC-AGI, and also did well on the benchmarks DeepSeek reported itself. For example, their scores on MATH AIME 2024 were 71% (R1-Zero) and 76% (R1), a significant increase over roughly 40% for DeepSeek-V3.'

However, the R1 developers note in their paper that 'DeepSeek-R1-Zero faces challenges such as poor readability and language mixing.'
Based on these results, the ARC Prize suggests the following three points:
1. SFT (e.g., human expert labeling) is not required for accurate and legible CoT reasoning in domains that can be strongly verified (a reward sketch follows this list)
2. R1-Zero's training process creates its own internal domain-specific language (DSL) within the token space via reinforcement-learning optimization
3. SFT is necessary to increase the generality of CoT reasoning across domains
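As an illustration of what 'strong verification' means in point 1, the sketch below shows a rule-based reward of the kind described in the R1-Zero paper, combining an answer-correctness check with a format check. The tag format and reward weights here are illustrative assumptions, not DeepSeek's exact values.

```python
# Sketch of a verifiable, rule-based reward for pure-RL training in the
# R1-Zero style: no human-labeled CoT steps, only automatic checks.
import re

def rule_based_reward(model_output: str, ground_truth: str) -> float:
    reward = 0.0
    # Format reward: the model is asked to wrap its reasoning in tags
    # (tag names and the 0.1 weight are illustrative assumptions).
    if re.search(r"<think>.*</think>", model_output, re.DOTALL):
        reward += 0.1
    # Accuracy reward: extract the final answer and compare it exactly.
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0
    return reward

output = "<think>2 + 2 is 4</think><answer>4</answer>"
print(rule_based_reward(output, "4"))  # 1.1
```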
The ARC Prize wrote, 'This makes intuitive sense, since language itself is effectively a reasoning DSL. Just like a program, the exact same "words" can be learned in one domain and applied to another. Pure reinforcement-learning approaches are not yet capable of discovering a broad common vocabulary, but we expect this to be a focus of future research.'
In addition, the ARC Prize states, 'DeepSeek is almost certainly targeting OpenAI's o3 next. It will be important to watch whether SFT will be required to add CoT search and sampling, or whether an "R2-Zero" (the next reasoning model) could follow the same logarithmic accuracy-versus-inference-cost scaling curve. Based on the R1-Zero results, we believe SFT will not be necessary for the next model to achieve a high score on ARC-AGI.'

Based on these insights, the ARC Prize writes, 'Economically, two big changes are happening in AI. One is that you can now spend more money to get higher accuracy and reliability, and the other is that spending is shifting from training to inference. Both will create huge demand for inference, and neither will reduce the demand for compute; in fact, demand for compute will increase. AI reasoning systems also promise far greater benefits than improved benchmark accuracy. The biggest problem preventing wider use of AI automation is trust. I've spoken to hundreds of Zapier customers trying to introduce AI agents into their businesses, and the feedback is consistent: "I don't trust them yet because they aren't reliable."'
The author further added, 'Because R1 is open and reproducible, more people and teams will push CoT and search to their limits. This will show us more quickly where the frontier actually is, stimulating a wave of innovation that increases the likelihood of reaching AGI sooner. I've already heard from several people who plan to use an R1-style system for the ARC Prize 2025, and I'm excited to see the results. The fact that R1 is open is great for the world. DeepSeek has advanced the frontier of science,' praising DeepSeek.
in Software, Posted by logu_ii