Allegations of benchmark fraud surface against Meta's AI model 'Llama 4'; Meta flatly denies them as 'unfounded'



Meta announced its next-generation AI model, Llama 4, on April 5, 2025. It is designed to maintain high performance while minimizing wasted resources: Llama 4 Maverick, with 17 billion active parameters, reportedly matches or exceeds the accuracy of OpenAI's GPT-4o and DeepSeek-V3 while using fewer computing resources. However, some developers allege that Llama 4's high benchmark scores were earned by an 'experimental version tuned for conversational use' rather than the released model. Meta counters that this is 'not true.'

Meta's benchmarks for its new AI models are a bit misleading | TechCrunch
https://techcrunch.com/2025/04/06/metas-benchmarks-for-its-new-ai-models-are-a-bit-misleading/



Meta exec denies the company artificially boosted Llama 4's benchmark scores | TechCrunch
https://techcrunch.com/2025/04/07/meta-exec-denies-the-company-artificially-boosted-llama-4s-benchmark-scores/

Meta defends Llama 4 release against 'reports of mixed quality,' blames bugs | VentureBeat
https://venturebeat.com/ai/meta-defends-llama-4-release-against-reports-of-mixed-quality-blames-bugs/

Llama 4: Did Meta just push the panic button?
https://www.interconnects.ai/p/llama-4

Meta announced 'Llama 4' on April 5, 2025. It is a natively multimodal model, designed from the start to handle multiple formats of information, including not only text but also images and video, in an integrated way. Its Mixture-of-Experts (MoE) architecture selectively activates only the specialized sub-models, called 'experts,' best suited to each task, maintaining high performance while minimizing wasted resources. It also uses a new position-embedding technique called iRoPE ('interleaved' Rotary Position Embeddings) to mitigate accuracy degradation when processing long contexts.
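The expert routing described above can be sketched in a few lines. This is an illustrative toy of generic top-k MoE gating, not Meta's implementation: the expert functions, gate scores, and top_k value are all hypothetical.

```python
import math

# Toy sketch of Mixture-of-Experts (MoE) routing: only the top-k
# highest-scoring "experts" run for a given token, which is how MoE
# keeps active parameters (and compute) low per token.
# Everything here (expert functions, gate scores, top_k) is hypothetical.

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_scores, top_k=2):
    """Route `token` to the top_k highest-scoring experts and return
    their outputs combined with renormalized gate weights."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])
    # The unchosen experts never execute -- that is the resource saving.
    return sum(w * experts[i](token) for w, i in zip(weights, chosen))

# Four toy "experts" (real ones are feed-forward networks); in practice a
# learned gating network would produce gate_scores from the token itself.
experts = [lambda x: x * 2, lambda x: x + 10, lambda x: x - 1, lambda x: x * 0.5]
gate_scores = [0.1, 3.0, 0.2, 2.5]
output = moe_forward(4.0, experts, gate_scores, top_k=2)
```

In a real MoE transformer the same idea applies per layer: a router picks a couple of experts out of many, so only the 'active parameters' (17 billion for Maverick) are exercised per token even though the total parameter count is far larger.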

In particular, Llama 4 Scout and Llama 4 Maverick, each with 17 billion active parameters, are reported to match or exceed the accuracy of competing models such as Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1, as well as GPT-4o and DeepSeek-V3, while using fewer computational resources.

Meta releases next-generation multimodal AI 'Llama 4,' adopting MoE architecture to boast high performance comparable to competing models - GIGAZINE



On the other hand, it is suspected that the version of these models submitted to the AI evaluation platform LM Arena differs from the publicly released Llama 4 and was trained on test sets in order to achieve better scores.

'Serious issues in Llama 4 training. I Have Submitted My Resignation to GenAI'
by u/rrryougi in LocalLLaMA



AI researcher and author Andriy Burkov criticized Llama 4 Scout, saying that sending more than 256,000 tokens to Scout, which claims to support a very long context window of 10 million tokens, 'results in very low-quality output.'




A Reddit user also reported that when Llama 4 was given the coding task of 'simulating a ball bouncing inside a rotating heptagon,' its performance was lower than DeepSeek-V3's.

I'm incredibly disappointed with Llama-4
by u/Dr_Karminski in LocalLLaMA



Furthermore, Nathan Lambert, a former Meta researcher and now a senior research scientist at the Allen Institute for Artificial Intelligence, said that 'Meta's decision not to release the model it actually used to create its marketing pitch is a major problem.'

Ahmad Al-Dahle, vice president of generative AI at Meta, responded: 'We've heard claims that Llama 4 was trained on test sets, but that is simply not true. The variable quality some users are seeing stems from implementations that still need to be stabilized.'




Al-Dahle also claimed that 'some users are confusing Llama 4 Maverick with Llama 4 Scout across the various cloud providers hosting the model.' He added, 'We have taken down the publicly available model for now. We will spend a few days tuning it and re-release it when it is ready. We will keep working on development, fixing bugs in Llama 4 and onboarding partners.'

in AI,   Software, Posted by log1r_ut