A new AI coding challenge just published its first results, and they aren't pretty

A new AI coding challenge has revealed its first winner, and set a new bar for AI-powered software engineers.

On Wednesday at 5 pm, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he won with correct answers to just 7.5% of the questions on the test.

“We're glad we built a benchmark that is actually hard,” said Konwinski. “Benchmarks should be hard if they're going to matter,” he continued, adding: “Scores would be different if the big labs had entered with their biggest models. But that's kind of the point. K Prize runs offline with limited compute, so it favors smaller and open models. I love that.”

Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.

Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub as a measure of how well models can deal with real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12. The K Prize organizers then built the test using only GitHub issues flagged after that date.
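The date cutoff described above can be illustrated with a minimal Python sketch. It is not the organizers' actual pipeline: the `Issue` record, the `build_test_set` helper, the sample data, and the year in `DEADLINE` are all assumptions made for illustration (the article gives only “March 12”).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Issue:
    """Hypothetical, simplified record of a flagged GitHub issue."""
    repo: str
    number: int
    flagged_on: date


def build_test_set(issues, submission_deadline):
    """Keep only issues flagged strictly after the model submission deadline.

    Models frozen at the deadline cannot have seen these issues,
    which is the idea behind a contamination-free test set.
    """
    return [issue for issue in issues if issue.flagged_on > submission_deadline]


# Assumed year; the article states only the day and month.
DEADLINE = date(2025, 3, 12)

# Made-up sample issues, for illustration only.
issues = [
    Issue("example/repo", 1, date(2025, 3, 1)),   # before the deadline: excluded
    Issue("example/repo", 2, date(2025, 3, 12)),  # on the deadline: excluded
    Issue("example/repo", 3, date(2025, 4, 2)),   # after the deadline: included
]

test_set = build_test_set(issues, DEADLINE)
print([issue.number for issue in test_set])
```

Because entries are locked before the test issues exist, a high score cannot come from having trained on the test problems themselves.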

The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a 75% top score on its easier “Verified” test and 34% on its harder “Full” test. Konwinski still isn't sure whether the disparity is due to contamination on SWE-Bench or just the challenge of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.

“As we get more runs of the thing, we'll have a better sense,” he said, “because we expect people to adapt to the dynamics of competing on this every few months.”

It might seem like an odd place to fall short, given the wide range of AI coding tools already publicly available, but with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI's growing evaluation problem.

Among them is Princeton researcher Sayash Kapoor, who put forward a similar idea in a recent paper. “Without such experiments, we can't actually tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop,” he said.

For Konwinski, it's not just a better benchmark, but an open challenge to the rest of the industry. “If you listen to the hype, it's like we should be seeing AI doctors and AI lawyers and AI software engineers, and that's just not true,” he says. “If we can't even get more than 10% on a contamination-free SWE-Bench, that's the reality check for me.”
