Chain-of-thought (CoT) reasoning — the process by which models break problems down into "thoughts" that can be examined before producing an answer — has become an integral part of the latest generation of frontier large language models (LLMs).
However, the inference costs of reasoning models can pile up quickly as the models generate more tokens. In a new paper, researchers at Carnegie Mellon University propose an LLM training technique that gives developers more control over the length of the CoT.
Called length controlled policy optimization (LCPO), the technique conditions the model to provide correct answers while keeping its "thoughts" within a predetermined token budget. Experiments show that LCPO-trained models provide a smooth tradeoff between accuracy and cost, and can surprisingly outperform larger models at equal reasoning lengths. LCPO can help dramatically reduce inference costs in enterprise applications by saving thousands of tokens in each round of conversation with an LLM.
Longer reasoning chains improve LLM performance
Reasoning models such as OpenAI o1 and DeepSeek-R1 are trained with reinforcement learning (RL) to use test-time scaling and generate CoT traces before producing an answer. Empirical evidence shows that when models "think" longer, they tend to perform better on reasoning tasks.
For example, R1 was initially trained with pure RL, without human-labeled examples. One of the insights was that as the model's performance improved, it also learned to generate longer reasoning traces.
While long CoT chains generally lead to more accurate responses, they also create a compute bottleneck when reasoning models are deployed at scale. There is currently very little control over the test-time compute budget, and sequences can easily stretch to tens of thousands of tokens without delivering significant gains. There have been some efforts to control the length of reasoning chains, but they usually degrade the model's performance.
Length controlled policy optimization (LCPO)
Classic RL training conditions LLMs only to produce the correct response. LCPO changes this paradigm by introducing two training objectives: 1) get the correct result, and 2) keep the CoT chain within a specified token length. Therefore, if the model produces the correct answer but generates too many tokens, it receives a penalty and is forced to find a reasoning chain that reaches the same answer with a smaller token budget.
"LCPO-trained models learn to satisfy length constraints while optimizing reasoning performance, rather than relying on hand-engineered heuristics," the researchers write.
They propose two flavors of LCPO: (1) LCPO-Exact, which requires the generated reasoning to be exactly the target length, and (2) LCPO-Max, which requires the output to be no longer than the target length.
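To make the idea concrete, here is a minimal Python sketch of how such rewards could be scored, assuming a simple linear length penalty with an illustrative weight `alpha`. The paper's actual reward formulation may differ in its constants, clipping and normalization; this is a conceptual illustration, not the authors' implementation.

```python
def lcpo_exact_reward(is_correct: bool, n_generated: int, n_target: int,
                      alpha: float = 0.001) -> float:
    """Reward a correct answer, penalizing any deviation from the target length.

    `alpha` trades off accuracy against length adherence (illustrative value).
    """
    correctness = 1.0 if is_correct else 0.0
    return correctness - alpha * abs(n_target - n_generated)


def lcpo_max_reward(is_correct: bool, n_generated: int, n_target: int,
                    alpha: float = 0.001) -> float:
    """Reward a correct answer, penalizing only tokens beyond the budget.

    Staying under the target length incurs no penalty, so the model is free
    to answer in fewer tokens when the problem allows it.
    """
    correctness = 1.0 if is_correct else 0.0
    overshoot = max(0, n_generated - n_target)
    return correctness - alpha * overshoot
```

During RL training, a reward of this shape pushes the model to search for reasoning chains that stay accurate while fitting the requested budget.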
To test the technique, the researchers fine-tuned a 1.5B-parameter reasoning model (Qwen-Distilled-R1-1.5B) on the two proposed LCPO schemes to create the L1-Max and L1-Exact models. Training was based on mathematical problems with distinct and verifiable results. However, the evaluation included math problems as well as out-of-distribution tasks such as the massive multitask language understanding (MMLU) benchmark and the graduate-level Google-proof question answering benchmark (GPQA).
Their results show that L1 models can precisely balance token budget against reasoning performance, smoothly interpolating between short, efficient reasoning and longer, more accurate reasoning simply by prompting the model with different length constraints. Importantly, on some tasks the L1 models can match the performance of the original reasoning model at a lower token budget.
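In practice, a team evaluating this tradeoff would sweep the same question across several budgets and compare accuracy against tokens spent. The helper below is a generic sketch of that loop; the prompt wording and the `generate_fn` callback are placeholders, not the authors' published interface.

```python
def budget_sweep(question: str, budgets, generate_fn):
    """Run the same question at several token budgets and collect the outputs.

    `generate_fn(prompt, max_tokens)` stands in for whatever inference call
    you use; it should return the generated text.
    """
    results = []
    for budget in budgets:
        # The instruction wording is illustrative; follow the prompt format
        # documented with the released models.
        prompt = f"{question}\nThink for a maximum of {budget} tokens."
        answer = generate_fn(prompt, budget)
        results.append({"budget": budget, "answer": answer})
    return results
```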

Compared to S1, the only other method that constrains the length of CoT, L1 models show up to 150% performance gains across different token budgets.
"This large difference can be attributed to two key factors," the researchers write. "(1) L1 intelligently adapts its CoT to fit within specified length constraints without disrupting the reasoning process, while S1 often truncates mid-reasoning; and (2) L1 is explicitly trained to generate high-quality reasoning chains of varying lengths, effectively distilling reasoning patterns from longer chains to shorter ones."
L1 also outperforms its non-reasoning counterpart by 5% and GPT-4o by 2% at equal generation lengths. "To the best of our knowledge, this is the first demonstration that a 1.5B model can outperform frontier models such as GPT-4o, despite using the same generation length," the researchers write.
Interestingly, the model's CoT traces show that it learns to adjust its reasoning process based on its token budget. For example, on longer budgets, the model is more likely to generate tokens associated with self-correction and verification (e.g., "but" and "wait") and with drawing conclusions ("therefore" and "so").

Beyond improved length control in the standard math reasoning setting, the L1 models generalize surprisingly well to out-of-distribution tasks, including GPQA and MMLU.
This new line of research into models whose reasoning budget can be adjusted could have important real-world uses, giving enterprises the ability to scale reasoning models without runaway costs. It is a powerful alternative to simply deploying larger, more expensive models, and could be a crucial factor in making AI more economically viable for real-world applications.
The researchers have open-sourced the LCPO code and the weights for the L1 models.
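For readers who want to try the released weights, a length-conditioned model of this kind would typically be loaded like any other causal LLM with Hugging Face transformers. The sketch below uses a placeholder repository ID and an assumed prompt format; check the authors' released code and model cards for the actual names and instructions.

```python
# Minimal sketch of loading a length-controlled reasoning model.
# The repository ID and prompt wording are placeholders, not the real ones.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/released-L1-model"  # replace with the published repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

budget = 512
prompt = f"Solve: 17 * 24 = ?\nThink for a maximum of {budget} tokens."

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=budget)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```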