Allegations of benchmark fraud against Meta's AI model 'Llama 4' surface, but Meta flatly denies them, calling them 'unfounded'

Meta announced its next-generation AI model 'Llama 4' on April 5, 2025, but allegations have since surfaced that its benchmark scores were artificially inflated. Meta flatly denies the allegations.
Meta's benchmarks for its new AI models are a bit misleading | TechCrunch
https://techcrunch.com/2025/04/06/metas-benchmarks-for-its-new-ai-models-are-a-bit-misleading/

Meta exec denies the company artificially boosted Llama 4's benchmark scores | TechCrunch
https://techcrunch.com/2025/04/07/meta-exec-denies-the-company-artificially-boosted-llama-4s-benchmark-scores/

Meta defends Llama 4 release against 'reports of mixed quality,' blames bugs | VentureBeat
https://venturebeat.com/ai/meta-defends-llama-4-release-against-reports-of-mixed-quality-blames-bugs/

Llama 4: Did Meta just push the panic button?
https://www.interconnects.ai/p/llama-4
Meta announced 'Llama 4' on April 5, 2025. It is a natively multimodal model designed from the start to handle multiple formats of information, including not only text but also images and video, in an integrated manner. Its Mixture-of-Experts (MoE) architecture activates only the most suitable specialized sub-networks, known as 'experts,' for each input, maintaining high performance while minimizing wasted compute. It also uses a new architecture called 'iRoPE,' which interleaves attention layers that have no positional embeddings with layers that use rotary position embeddings (RoPE), to mitigate accuracy degradation when processing very long contexts.
In particular, Llama 4 Scout and Llama 4 Maverick, each with 17 billion active parameters, are reported to match or exceed the accuracy of competing models while using fewer computational resources: Scout against Gemma 3, Gemini 2.0 Flash-Lite, and Mistral Small 3.1, and Maverick against GPT-4o and DeepSeek-V3.
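As a rough illustration of how the MoE routing described above works in general, below is a minimal top-k sketch in Python/NumPy. This is not Meta's implementation; the expert count, dimensions, and top-k value are illustrative assumptions.

    # Minimal sketch of top-k mixture-of-experts (MoE) routing, for illustration.
    # Expert count, dimensions, and top_k are illustrative assumptions; Llama 4's
    # actual configuration and routing code are not reproduced here.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, top_k = 64, 8, 2

    # Each "expert" is reduced to a single feed-forward weight matrix here.
    experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
    router_w = rng.standard_normal((d_model, n_experts)) * 0.02  # router weights

    def moe_forward(x):
        """Route one token vector x to its top_k experts and mix their outputs."""
        logits = x @ router_w                # one routing score per expert
        idx = np.argsort(logits)[-top_k:]    # indices of the top_k experts
        gates = np.exp(logits[idx] - logits[idx].max())
        gates /= gates.sum()                 # softmax over the selected experts
        # Only the selected experts run; the rest stay idle, which is how MoE
        # keeps compute per token low despite a large total parameter count.
        return sum(g * (x @ experts[i]) for g, i in zip(gates, idx))

    out = moe_forward(rng.standard_normal(d_model))
    print(out.shape)  # (64,)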
Meta releases next-generation multimodal AI 'Llama 4,' adopting MoE architecture to boast high performance comparable to competing models - GIGAZINE

On the other hand, suspicions surfaced that the version of Llama 4 run on the AI evaluation platform LM Arena was different from the publicly available model, and a Reddit post went further, alleging that the models had been trained on benchmark test sets in order to achieve better scores.
"Serious issues in Llama 4 training. I Have Submitted My Resignation to GenAI"
by u/rrryougi in LocalLLaMA
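For context on what 'training on a test set' would mean in practice: benchmark contamination is commonly screened by measuring n-gram overlap between training documents and test items. The sketch below shows the general idea only; the n-gram size and threshold are arbitrary illustrative choices, and this says nothing about Llama 4 either way.

    # Rough sketch of an n-gram overlap contamination check, illustrative only.
    def ngrams(text, n=8):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def looks_contaminated(train_doc, test_item, n=8, threshold=0.5):
        """Flag a test item whose n-grams mostly also appear in a training doc."""
        test_set = ngrams(test_item, n)
        if not test_set:
            return False
        overlap = len(test_set & ngrams(train_doc, n)) / len(test_set)
        return overlap >= threshold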
AI researcher and author Andriy Burkov criticized Llama 4 Scout, saying, 'Sending more than 256,000 tokens to Scout, which claims to support a very long context window of 10 million tokens, results in very low-quality output.'
I will save you reading time about Llama 4. The declared 10M context is virtual because no model was trained on prompts longer than 256k tokens. This means that if you send more than 256k tokens to it, you will get low-quality output most of the time. And even if your problem…
— Andriy Burkov (@burkov) April 5, 2025
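The practical implication of Burkov's claim, if accurate, is that the usable context is bounded by the maximum training length rather than the advertised limit. Below is a hedged sketch of guarding a prompt against that; the 256k figure comes from the tweet, and the token counter is a crude stand-in for the model's real tokenizer.

    # Sketch: keep a prompt within an assumed effective context window.
    # EFFECTIVE_LIMIT reflects Burkov's claimed 256k training length, not an
    # official figure; count_tokens is a rough stand-in for a real tokenizer.
    EFFECTIVE_LIMIT = 256_000

    def count_tokens(text):
        # Crude heuristic (~4 characters per token); a real check should use
        # the model's own tokenizer.
        return len(text) // 4

    def fit_to_window(prompt, limit=EFFECTIVE_LIMIT):
        """Drop the oldest text until the prompt fits the assumed window."""
        while count_tokens(prompt) > limit:
            prompt = prompt[len(prompt) // 10:]  # shed the oldest 10% per pass
        return prompt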
On Reddit, users also reported that when Llama 4 was given the coding task of 'simulating a ball bouncing inside a rotating heptagon,' its output performed worse than DeepSeek-V3's.
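For reference, the task itself is a compact physics exercise: integrate gravity, rotate the heptagon's walls each step, and reflect the ball off any wall it crosses. Below is a minimal sketch of one way to set it up, without rendering; all constants are illustrative, and the momentum the moving walls would impart to the ball is ignored for brevity.

    # Minimal sketch of the "ball bouncing inside a rotating heptagon" task,
    # without rendering. All constants are illustrative.
    import math

    N_SIDES, RADIUS = 7, 1.0
    GRAVITY, SPIN, DT = -9.8, 0.5, 0.001  # gravity in m/s^2, spin in rad/s

    def walls(angle):
        """Vertices of the heptagon after rotating by `angle` (counter-clockwise)."""
        return [(RADIUS * math.cos(angle + 2 * math.pi * k / N_SIDES),
                 RADIUS * math.sin(angle + 2 * math.pi * k / N_SIDES))
                for k in range(N_SIDES)]

    def step(pos, vel, angle):
        vx, vy = vel[0], vel[1] + GRAVITY * DT      # integrate gravity
        x, y = pos[0] + vx * DT, pos[1] + vy * DT   # advance the ball
        angle += SPIN * DT                          # advance the rotation
        verts = walls(angle)
        for k in range(N_SIDES):
            (x1, y1), (x2, y2) = verts[k], verts[(k + 1) % N_SIDES]
            ex, ey = x2 - x1, y2 - y1
            nx, ny = -ey, ex                        # inward normal (CCW polygon)
            norm = math.hypot(nx, ny)
            nx, ny = nx / norm, ny / norm
            d = (x - x1) * nx + (y - y1) * ny       # signed distance, >0 inside
            if d < 0:                               # crossed this wall
                if vx * nx + vy * ny < 0:           # still heading outward
                    dot = vx * nx + vy * ny
                    vx, vy = vx - 2 * dot * nx, vy - 2 * dot * ny
                x, y = x - 2 * d * nx, y - 2 * d * ny  # mirror back inside
        return (x, y), (vx, vy), angle

    pos, vel, angle = (0.0, 0.0), (0.3, 0.0), 0.0
    for _ in range(10_000):
        pos, vel, angle = step(pos, vel, angle)
    print(pos, vel)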
Furthermore, Nathan Lambert, a former Meta researcher who is now a senior research scientist at the Allen Institute for Artificial Intelligence, said, 'Meta's decision not to disclose the model it used to create its marketing pitch is a major problem.'
Ahmad Al-Dahle, vice president of generative AI at Meta, responded: 'We've heard claims that Llama 4 was trained on a test set, but this is completely untrue. The variations in quality that some users are reporting stem from implementations that are still being stabilized.'
We're glad to start getting Llama 4 in all your hands. We're already hearing lots of great results people are getting with these models. That said, we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were…
— Ahmad Al-Dahle (@Ahmad_Al_Dahle) April 7, 2025
Al-Dahle also claimed that 'some users are confusing Llama 4 Maverick with Llama 4 Scout across the various cloud providers hosting the model.' He added, 'We have taken down the publicly available model, will spend a few days adjusting it, and will re-release it when it is ready. We will continue development by fixing bugs in Llama 4 and onboarding partners.'