Meta used an “experimental” version of Llama 4 Maverick to secure top AI benchmark rankings, sparking backlash and policy changes
Meta’s grand unveiling of its latest Llama 4 AI models — Scout and Maverick — quickly turned sour this weekend after the company was caught using an “experimental” version of Maverick to secure top rankings on a leading AI benchmark site. The model’s performance, initially hailed as a serious threat to GPT-4o and Gemini 2.0, is now under scrutiny.
The controversy began when Meta proudly announced that Llama 4 Maverick had climbed to the number-two spot on LMArena, a popular crowd-sourced benchmark platform where users vote on model responses in head-to-head battles. Boasting an impressive Elo score of 1417, Maverick was portrayed as outperforming OpenAI’s GPT-4o and trailing only Google’s Gemini 2.5 Pro.
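For readers unfamiliar with how an arena score like 1417 comes about, the sketch below shows the classic Elo update applied to pairwise preference votes: each time voters prefer one model’s answer over another’s, the winner’s rating rises and the loser’s falls, with larger swings for upsets. This is an illustrative assumption, not LMArena’s published methodology (the site describes a statistical fit over all votes rather than sequential updates), and the K factor and example ratings here are hypothetical.

```python
# Illustrative Elo-style update from crowd-sourced head-to-head votes.
# Not LMArena's actual implementation; K and the example ratings are assumptions.
K = 32  # step size per vote; real leaderboards tune this or fit ratings statistically


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return both models' new ratings after a single head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + K * (score_a - exp_a)
    new_b = rating_b + K * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Example: a model rated 1400 wins a vote against one rated 1350.
a, b = update(1400.0, 1350.0, a_won=True)
print(round(a, 1), round(b, 1))  # winner gains a few points, loser drops the same amount
```

The practical upshot is that a model tuned to charm human voters in chat-style comparisons can accumulate rating points quickly, which is exactly why an undisclosed “conversationality” optimization matters.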
But buried in Meta’s own documentation was a critical admission: the model submitted to LMArena wasn’t the publicly available version. Instead, Meta used a “Maverick-03-26-Experimental” variant, which had been “optimized for conversationality” — a tweak specifically designed to appeal to human voters in chat-based comparisons.
The discrepancy was first flagged by researchers and amplified on social media. A screenshot shared by AI researcher Zain Hasan showed Meta’s subtle disclosure, which quickly went viral. The revelation ignited a backlash from the AI community, with many accusing Meta of gaming the system to inflate the model’s standing.
Two days after the model’s release, LMArena posted a statement on X (formerly Twitter), clarifying its position. “Meta’s interpretation of our policy did not match what we expect from model providers,” the platform stated. “Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference.”
As a direct result of the incident, LMArena has announced changes to its leaderboard policies, aiming to ensure transparent and reproducible evaluations. The update is designed to prevent model creators from submitting unrepresentative or tuned versions of their systems without full disclosure.
The benchmark fudging raises broader questions about how AI performance is measured and marketed. In an increasingly crowded and competitive landscape, top spots on benchmark charts like LMArena are valuable — both for investor confidence and public perception. A high Elo score translates into trust and buzz, especially for models claiming to compete with the likes of OpenAI and Google.
Meta’s open-weight Llama models are already seen as outliers in an industry leaning toward closed, proprietary systems. That openness has earned the company praise for democratising AI research and deployment. But this latest episode could erode some of that goodwill, especially among developers and researchers relying on benchmark integrity.
It’s unclear how much the “optimized” tweaks skewed Maverick’s performance. Meta has yet to publicly address whether the experimental model meaningfully differed from the released one on tasks beyond conversational chat. However, relying on a fine-tuned variant the public cannot access undermines any claim that the released Maverick outperforms its competitors.
As for the future of benchmark testing, LMArena’s response suggests a tightening of rules — and perhaps a recalibration of how much weight such scores should carry in public discourse.
In the arms race of AI supremacy, every benchmark win counts. But as Meta just discovered, stacking the deck doesn’t go unnoticed.