AI models are failing in production: here's how to fix model selection





Enterprises need to know whether the models powering their applications and agents work in real-life scenarios. That kind of evaluation can be complicated, because it is hard to predict every specific scenario. An updated version of the RewardBench benchmark aims to give enterprises a better picture of real-world model performance.

The Allen Institute for AI (AI2) launched RewardBench 2, an updated version of its reward model benchmark, which it says provides a more holistic view of model performance and assesses how well models align with an enterprise's goals and standards.

AI2 built the platform with classification tasks that measure correlations through inference-time compute and downstream training. RewardBench mainly deals with reward models (RMs), which can act as judges and evaluate LLM outputs. RMs assign a score or "reward" that guides reinforcement learning with human feedback (RLHF).
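
To make the "judge" role concrete, here is a minimal sketch of how a reward model scores candidate LLM responses. The checkpoint name is a placeholder, and the chat-template pattern is an assumption about a typical Hugging Face sequence-classification reward model, not AI2's specific setup.

```python
# Minimal sketch: scoring LLM outputs with a reward model acting as a judge.
# "org/reward-model-placeholder" is a hypothetical checkpoint name.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "org/reward-model-placeholder"  # placeholder, not a real model
tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME, num_labels=1)

def score(prompt: str, response: str) -> float:
    """Return a scalar 'reward' for one prompt/response pair."""
    chat = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors="pt")
    with torch.no_grad():
        return reward_model(input_ids).logits[0][0].item()

# In RLHF, scores like these steer the policy model toward preferred outputs.
question = "What is the capital of France?"
candidates = ["Paris is the capital of France.", "The capital of France is Berlin."]
best = max(candidates, key=lambda r: score(question, r))
```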

Nathan Lambert, a senior research scientist at AI2, told VentureBeat that the first RewardBench served its purpose when it launched. However, the model environment evolved rapidly, and so should its benchmarks.

"As reward models became more advanced and use cases more nuanced, we quickly recognized with the community that the first version didn't fully capture the complexity of real-world human preferences," he said.

Lambert added that with RewardBench 2, "we set out to improve both the breadth and depth of evaluation, providing more diverse and challenging prompts and refining the methodology to better reflect how humans actually judge AI outputs in practice." He said the second version uses unseen human prompts, a more challenging scoring setup and new domains.

Using evaluations of models that judge

While reward models test how well models perform, it's also important that RMs align with a company's values; otherwise, the fine-tuning and reinforcement learning process can reinforce bad behavior, such as hallucinations, reduce generalization, and score harmful responses too highly.

RewardBench 2 covers six different domains: factuality, precise instruction following, math, safety, focus and ties.
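
As a rough illustration of how results across domains like these can be aggregated, the sketch below assumes each evaluated prompt carries a domain label and a correct/incorrect flag; the record layout is illustrative, not RewardBench 2's actual output format.

```python
# Sketch: aggregating per-domain accuracy from benchmark-style results.
# The record fields ("domain", "correct") are assumptions for illustration.
from collections import defaultdict

results = [
    {"domain": "factuality", "correct": True},
    {"domain": "math", "correct": False},
    {"domain": "safety", "correct": True},
    # ... one record per evaluated prompt
]

def per_domain_accuracy(records):
    """Compute accuracy separately for each domain."""
    totals, hits = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record["domain"]] += 1
        hits[record["domain"]] += int(record["correct"])
    return {domain: hits[domain] / totals[domain] for domain in totals}

print(per_domain_accuracy(results))
```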

"Enterprises should use RewardBench 2 in two different ways depending on their application. If they're performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines, because reward models need on-policy training recipes (that is, reward models that mirror the model they're trying to train with RL)," said Lambert.

Lambert noted that benchmarks like RewardBench give users a way to evaluate the models they choose based on "the dimensions that matter most to them, rather than relying on a narrow one-size-fits-all score." He said the notion of performance, which many evaluation methods claim to assess, is highly subjective, because a good response from a model depends heavily on the user's context and goals. At the same time, human preferences are becoming increasingly nuanced.
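
One way to act on that advice is to weight domain scores by what matters to a given deployment instead of ranking on a single average. The sketch below is a hypothetical example; the per-domain scores and weights are made up for illustration.

```python
# Sketch: weighting benchmark domains by application needs rather than
# relying on one overall score. All numbers below are made up for illustration.

def weighted_score(domain_accuracy: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-domain accuracies, normalized by total weight."""
    total_weight = sum(weights.values())
    return sum(domain_accuracy.get(d, 0.0) * w for d, w in weights.items()) / total_weight

# Hypothetical per-domain scores for one candidate reward model.
accuracy = {"factuality": 0.78, "precise instruction following": 0.71,
            "math": 0.64, "safety": 0.88, "focus": 0.82, "ties": 0.69}

# A customer-support deployment might weight safety and focus most heavily.
weights = {"safety": 3.0, "focus": 2.0, "factuality": 1.0}
print(round(weighted_score(accuracy, weights), 3))
```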

AI2 released the first version of RewardBench in March 2024. At the time, the company said it was the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged. Researchers at Meta debuted reWordBench, and DeepSeek released a new technique called Self-Principled Critique Tuning for smarter and more scalable RMs.

How the models performed

Since RewardBench 2 is an updated version of RewardBench, AI2 tested both existing and newly trained models to see if they continue to rank highly. These included a variety of models, such as versions of Gemini, Claude, GPT-4.1 and Llama-3.1, along with datasets and models such as Qwen, Skywork and AI2's own Tulu.

The company found that larger reward models perform best on the benchmark because their base models are stronger. Overall, the strongest-performing models are variants of Llama-3.1. In the focus and safety domains, Skywork data is "particularly helpful," and Tulu did well on factuality.

AI2 said that while it believes RewardBench 2 "is a step forward in broad, multi-domain evaluation" of reward models, it cautioned that model evaluations should be used mainly as a guide for picking the models that work best for an enterprise's needs.


