Study accuses LM Arena of helping top AI labs game its benchmark


By [email protected]


A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of their competitors.

According to the authors, LM Arena allowed some industry-leading AI companies, such as Meta, OpenAI, Google, and Amazon, to privately test multiple variants of their AI models, then withhold the scores of the lowest performers. This made it easier for those companies to climb the platform's leaderboard, the authors say, though the opportunity was not afforded to every firm.

"Only a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is just so much more than others," said Cohere's VP of AI research and co-author of the study, Sara Hooker, in an interview with TechCrunch. "This is gamification."

Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by putting answers from two different AI models side by side in a "battle" and asking users to choose the better one. It's not uncommon to see unreleased models competing in the arena under a pseudonym.

Votes over time contribute to a model's score, and consequently its placement on the Chatbot Arena leaderboard. While many commercial actors participate in Chatbot Arena, LM Arena has long maintained that its benchmark is an impartial and fair one.
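For readers unfamiliar with this kind of pairwise leaderboard, the sketch below shows an Elo-style rating update of the sort commonly used to turn head-to-head votes into a ranking. The article doesn't spell out Chatbot Arena's exact scoring formula, so the constants and function names here are illustrative assumptions, not the platform's actual code.

```python
# Minimal sketch of an Elo-style rating update for pairwise "battles".
# Constants (K-factor, 400 scale) and names are illustrative assumptions;
# the article does not describe Chatbot Arena's actual scoring method.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_ratings(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one user vote."""
    exp_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    rating_a += k * (actual_a - exp_a)
    rating_b += k * ((1.0 - actual_a) - (1.0 - exp_a))
    return rating_a, rating_b

# Example: a model that keeps winning battles steadily climbs the leaderboard.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for _ in range(10):
    ratings["model_a"], ratings["model_b"] = update_ratings(
        ratings["model_a"], ratings["model_b"], a_won=True
    )
print(ratings)
```

The point of the sketch is simply that more battles means more opportunities to move a rating, which is why the sampling and withholding practices described below matter.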

However, this is not what the authors of the paper say they discovered.

One company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March ahead of the tech giant's Llama 4 release, the authors allege. At launch, Meta only publicly revealed the score of a single model, one that happened to rank near the top of the Chatbot Arena leaderboard.


A chart pulled from the study. (Credit: Singh et al.)

In an email to TechCrunch, LM Arena co-founder and UC Berkeley professor Ion Stoica said the study was full of "inaccuracies" and "questionable analysis."

"We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preferences," LM Arena said in a statement to TechCrunch. "If one model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly."

Armand Joulin, a lead researcher at Google DeepMind, also noted in a post on X that some of the study's numbers were inaccurate, claiming Google only sent one Gemma 3 AI model to LM Arena for pre-release testing. Hooker responded to Joulin on X, promising the authors would make a correction.

Allegedly favored labs

The paper's authors began conducting their research in November 2024 after learning that some AI companies may have been given preferential access to Chatbot Arena. In total, they measured more than 2.8 million Chatbot Arena battles over a five-month period.

The authors say they found evidence that LM Arena allowed certain AI companies, including Meta, OpenAI, and Google, to collect more data from Chatbot Arena by having their models appear in a higher number of "battles." This increased sampling rate gave these companies an unfair advantage, the authors allege.

Using additional data from LM Arena could improve a model's performance on Arena Hard, another benchmark LM Arena maintains, by 112%, the authors found. However, LM Arena said in a post on X that Arena Hard performance is not directly correlated with Chatbot Arena performance.

Hooker said it's unclear how certain AI companies came to receive priority access, but that it's incumbent on LM Arena to increase its transparency regardless.

In a post on X, LM Arena said that several of the claims in the paper don't reflect reality. The organization pointed to a blog post it published earlier this week indicating that models from non-major labs appear in more Chatbot Arena battles than the study suggests.

One important limitation of the study is that it relied on "self-identification" to determine which AI models were being privately tested on Chatbot Arena. The authors prompted AI models several times about their company of origin and relied on the models' answers to classify them, a method that isn't foolproof.

However, Hooker said that when the authors reached out to LM Arena to share their preliminary findings, the organization didn't dispute them.

TechCrunch reached out to Meta, Google, OpenAI, and Amazon, all of which were mentioned in the study, for comment. None immediately responded.

LM Arena in hot water

In the paper, the authors call on LM Arena to implement a number of changes aimed at making Chatbot Arena more "fair." For example, the authors say, LM Arena could set a clear and transparent limit on the number of private tests AI labs can conduct, and publicly disclose the scores from these tests.

In a post on X, LM Arena pushed back on these suggestions, claiming it has published information on pre-release testing since March 2024. The benchmarking organization also said it "makes no sense to show scores for pre-release models which are not publicly available," because the AI community cannot test those models for themselves.

The researchers also say LM Arena could adjust Chatbot Arena's sampling rate to ensure that all models in the arena appear in the same number of battles. LM Arena has been receptive to this recommendation publicly, indicating that it will create a new sampling algorithm (a simple illustrative sketch of equalized sampling follows below).
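As an illustration of what equalized sampling could look like, here is a minimal sketch that always pairs the two least-sampled models. This is an assumption for illustration only; it is neither the algorithm proposed in the paper nor the one LM Arena said it would build.

```python
import random

# Illustrative sketch only: one way to even out battle counts is to pair the
# two models that have appeared in the fewest battles so far. Neither the
# paper's proposal nor LM Arena's planned algorithm is described in this article.

def pick_battle_pair(battle_counts: dict[str, int]) -> tuple[str, str]:
    """Choose two distinct models, prioritizing those with the fewest battles."""
    models = sorted(battle_counts, key=lambda m: (battle_counts[m], random.random()))
    return models[0], models[1]

counts = {"model_a": 120, "model_b": 45, "model_c": 45, "model_d": 300}
a, b = pick_battle_pair(counts)
print(a, b)  # the two least-sampled models battle next
```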

The paper comes weeks after Meta was caught gaming benchmarks in Chatbot Arena around the launch of its above-mentioned Llama 4 models. Meta optimized one of its Llama 4 models for "conversationality," which helped it achieve an impressive score on Chatbot Arena's leaderboard. But the company never released the optimized model, and the vanilla version ended up performing much worse on Chatbot Arena.

At the time, LM Arena said Meta should have been more transparent in its approach to benchmarking.

Earlier this month, LM Arena announced it was launching a company, with plans to raise capital from investors. The study adds to scrutiny of the private benchmark organization, and of whether it can be trusted to assess AI models without corporate influence clouding the process.


