Allegations of benchmark fraud surface against Meta's AI model 'Llama 4'; Meta flatly denies them as 'unfounded'



Meta announced its next-generation AI model, Llama 4, on April 5, 2025. It is designed to maintain high performance while minimizing wasted resources: Llama 4 Maverick, with 17 billion active parameters, reportedly matches or exceeds the accuracy of OpenAI's GPT-4o and DeepSeek-V3 while using fewer computing resources. However, some developers allege that Llama 4's high benchmark scores were earned by an 'experimental version tuned for conversational use' rather than the released model. Meta counters that this is 'not true.'

Meta's benchmarks for its new AI models are a bit misleading | TechCrunch
https://techcrunch.com/2025/04/06/metas-benchmarks-for-its-new-ai-models-are-a-bit-misleading/



Meta exec denies the company artificially boosted Llama 4's benchmark scores | TechCrunch
https://techcrunch.com/2025/04/07/meta-exec-denies-the-company-artificially-boosted-llama-4s-benchmark-scores/

Meta defends Llama 4 release against 'reports of mixed quality,' blames bugs | VentureBeat
https://venturebeat.com/ai/meta-defends-llama-4-release-against-reports-of-mixed-quality-blames-bugs/

Llama 4: Did Meta just push the panic button?
https://www.interconnects.ai/p/llama-4

Meta announced 'Llama 4' on April 5, 2025. It is a natively multimodal model, designed from the start to handle multiple formats of information, including not only text but also images and video, in an integrated way. Its Mixture-of-Experts (MoE) architecture selectively activates only the specialized sub-models, called 'experts,' best suited to each task, maintaining high performance while minimizing wasted resources. It also uses a new position-embedding technique called iRoPE ('interleaved' Rotary Position Embeddings) to mitigate accuracy degradation when processing long contexts.
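The expert routing described above can be sketched in a few lines. This is an illustrative toy of generic top-k MoE gating, not Meta's implementation: the expert functions, gate scores, and top_k value are all hypothetical.

```python
import math

# Toy sketch of Mixture-of-Experts (MoE) routing: only the top-k
# highest-scoring "experts" run for a given token, which is how MoE
# keeps active parameters (and compute) low per token.
# Everything here (expert functions, gate scores, top_k) is hypothetical.

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_scores, top_k=2):
    """Route `token` to the top_k highest-scoring experts and return
    their outputs combined with renormalized gate weights."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])
    # The unchosen experts never execute -- that is the resource saving.
    return sum(w * experts[i](token) for w, i in zip(weights, chosen))

# Four toy "experts" (real ones are feed-forward networks); in practice a
# learned gating network would produce gate_scores from the token itself.
experts = [lambda x: x * 2, lambda x: x + 10, lambda x: x - 1, lambda x: x * 0.5]
gate_scores = [0.1, 3.0, 0.2, 2.5]
output = moe_forward(4.0, experts, gate_scores, top_k=2)
```

In a real MoE transformer the same idea applies per layer: a router picks a couple of experts out of many, so only the 'active parameters' (17 billion for Maverick) are exercised per token even though the total parameter count is far larger.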

In particular, Llama 4 Scout and Llama 4 Maverick, each with 17 billion active parameters, are reported to match or exceed the accuracy of competing models such as Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1, as well as GPT-4o and DeepSeek-V3, while using fewer computational resources.

Meta releases next-generation multimodal AI 'Llama 4,' adopting MoE architecture to boast high performance comparable to competing models - GIGAZINE



On the other hand, it is suspected that the version of these models submitted to the AI evaluation platform LM Arena differs from the publicly released Llama 4 and was trained on test sets in order to achieve better scores.

'Serious issues in Llama 4 training. I Have Submitted My Resignation to GenAI'
by u/rrryougi in LocalLLaMA



AI researcher and author Andriy Burkov criticized Llama 4 Scout, saying that sending more than 256,000 tokens to Scout, which claims to support a very long context window of 10 million tokens, 'results in very low-quality output.'




A Reddit user also reported that when Llama 4 was given the coding task of 'simulating a ball bouncing inside a rotating heptagon,' its performance was lower than DeepSeek-V3's.

I'm incredibly disappointed with Llama-4
by u/Dr_Karminski in LocalLLaMA



Furthermore, Nathan Lambert, a former Meta researcher and now a senior research scientist at the Allen Institute for Artificial Intelligence, said that 'Meta's decision not to release the model it actually used to create its marketing pitch is a major problem.'

Ahmad Al-Dahle, vice president of generative AI at Meta, responded: 'We've heard claims that Llama 4 was trained on test sets, but that is simply not true. The variable quality some users are seeing stems from implementations that still need to be stabilized.'




Al-Dahle also claimed that 'some users are confusing Llama 4 Maverick with Llama 4 Scout across the various cloud providers hosting the model.' He added, 'We have taken down the publicly available model for now. We will spend a few days tuning it and re-release it when it is ready. We will keep working on development, fixing bugs in Llama 4 and onboarding partners.'

in AI,   Software, Posted by log1r_ut