Researchers at Nvidia have developed a new technology that flips the script on how large language models (LLMs) learn to reason.
The method, called reinforcement learning pre-training (RLP), incorporates reinforcement learning into the initial training phase instead of saving it for the end.
This approach encourages the model to "think for itself before predicting what comes next, thus teaching independent thinking behavior early in pre-training," the researchers state in their paper.
By learning to think over plain text without the need for external verification tools, models trained with RLP show significant improvements on complex reasoning tasks, pointing to a future of AI that is more capable and adaptable to real-world tasks.
The typical LLM training cycle
Typically, large language models are first pre-trained on huge amounts of text with a next-token prediction objective: they are given a string of text and asked to continually guess the next word (or token). At this stage, they learn basic rules of language, facts, and associations.
In a subsequent post-training phase, models typically learn complex reasoning abilities such as chain-of-thought (CoT), where the model lays out its logic step by step. This phase often involves supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), both of which require specialized, curated datasets.
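To make that objective concrete, here is a minimal PyTorch sketch of the standard next-token prediction loss. The model interface and variable names are illustrative assumptions, not code from the paper:

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Standard next-token prediction (cross-entropy) objective.

    token_ids: LongTensor of shape (batch, seq_len) holding a chunk of text.
    The model is asked to predict token t+1 from tokens 0..t.
    """
    inputs = token_ids[:, :-1]           # all tokens except the last
    targets = token_ids[:, 1:]           # all tokens except the first (shifted by one)
    logits = model(inputs)               # assumed shape: (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```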
The paper's authors argue that this sequential process does not match how humans comprehend, which is "not a linear, token-by-token process, but rather a parallel integration of input with prior knowledge." Current pre-training methods lack this mechanism, hindering a model's ability to develop deep reasoning from the start.
How reinforcement learning pre-training works
RLP reframes this process by treating CoT generation as an action the model takes before predicting the next token. At each step, the model first generates an internal "thought," or chain of reasoning. It then predicts the next word in the text, using the original context augmented by that new thought.
The model receives a reward based on how much its thought improves the accuracy of its prediction compared to a baseline that did not generate a thought (pure next-token prediction). This reward signal is computed automatically from the change in probability, eliminating the need for external verifiers or human-labeled data.
The reward is positive only when the generated thought helps the model better predict the next token. By rewarding thoughts based on their predictive benefit, RLP effectively teaches the model how to reason usefully over the same massive, unstructured datasets used in standard pre-training. A sketch of this reward signal appears below.
This constant feedback loop allows the model to learn when a simple predictive guess is sufficient and when it needs to engage in deeper thinking. As the researchers put it, RLP is "designed to shape thinking in base models by rewarding only those thoughts that measurably help next-token prediction."
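In code terms, that reward can be sketched roughly as the difference between two log-probabilities: the likelihood of the true next token with the thought in context versus without it. The following minimal PyTorch sketch is illustrative only; the function and variable names are not from Nvidia's implementation, and it omits details such as the paper's baseline model and the policy-gradient update:

```python
import torch

@torch.no_grad()
def rlp_style_reward(model, context_ids, thought_ids, next_token_id):
    """Reward a generated 'thought' by how much it improves prediction.

    Returns log p(next token | context + thought) - log p(next token | context),
    which is positive only when the thought actually helps the prediction.
    """
    def next_token_logprob(input_ids):
        logits = model(input_ids.unsqueeze(0))               # assumed shape: (1, len, vocab)
        log_probs = torch.log_softmax(logits[0, -1], dim=-1)  # distribution over the next token
        return log_probs[next_token_id]

    baseline = next_token_logprob(context_ids)                        # no-think baseline
    augmented = next_token_logprob(torch.cat([context_ids, thought_ids]))  # context + thought
    return (augmented - baseline).item()
```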
However, this foundational approach does not make subsequent fine-tuning stages obsolete. According to Bryan Catanzaro, vice president of applied deep learning research at Nvidia and co-author of the paper, RLP is designed to complement, not replace, these critical steps. "RLP is not intended to replace subsequent post-training stages such as supervised fine-tuning or reinforcement learning from human feedback," Catanzaro told VentureBeat. "These stages remain critical to refining model behavior… it is actually designed to amplify the effectiveness of those later stages by giving the model an early head start."
RLP in action
In experiments with Qwen3-1.7B and Nemotron-Nano-12B, the Nvidia team tested RLP across a suite of math and science benchmarks. The results show that models augmented with RLP consistently outperformed their conventionally trained counterparts, with particularly strong gains on reasoning-heavy tasks.
For enterprises, this improved reasoning could translate into more reliable outputs in multi-step workflows such as financial analysis or legal document summarization.
"During pre-training, RLP encourages the model to think before it makes a prediction, which helps the model accommodate a more coherent thinking style." Catanzaro said. "This can help reduce subtle logic errors, especially in longer workflows.
While Catanzaro stressed that models trained on RLP will still need the usual guardrails such as validation layers, human moderation, and consistency checks, Catanzaro said that “RLP gives you a stronger baseline."
Importantly, the benefits of RLP compound rather than disappear during subsequent fine-tuning phases. (Catastrophic forgetting is a common problem in LLM training, where later training stages cause the model to forget previously learned skills and knowledge.) RLP-trained models achieved an overall score 7-8% higher than baselines after an identical post-training regimen. The researchers concluded that RLP "establishes a strong reasoning foundation that is not erased by later alignment but instead compounds through post-training."
The technique's efficiency is a key finding. On the Qwen3-1.7B model, RLP improved performance by 17% over standard continuous pre-training and also beat a similar technique, reinforcement pre-training with prefix-matching rewards (RPT). This advantage persisted even when the baseline model was trained on 35 times more data to match the compute cost, confirming that the gains come from the method itself, not simply from more processing.
Furthermore, RLP proved scalable and versatile, successfully extracting a reasoning signal from general-purpose web data, not just curated reasoning datasets. When applied to Nemotron-Nano-12B, a hybrid Mamba-Transformer model, RLP achieved a 35% relative improvement over a heavily trained baseline while using only a small fraction of the data.
While these findings point to a more efficient path for building robust models, Catanzaro positions the innovation as a fundamental shift in the learning process itself, rather than an immediate solution to high training costs.
"This research is exciting because it presents a shift in how models absorb information during pre-training, leading to a smarter learning process," he explained. "It will not replace large-scale pre-training, but it offers another creative way to build the best possible models."
A new foundation for artificial intelligence training
Ultimately, RLP points toward a future where pre-training is no longer a monolithic process of predicting the next token. Instead, the next generation of models could be built on a combination of goals, creating AI that learns to think more robustly from day one. Catanzaro offers a powerful analogy to frame this shift:
"Predicting the next symbol tells the model what the world looks like; Reinforcement style goals like RLP can teach him how to think about what he sees," He said. "Combining these two goals can help models develop deeper, more structured thinking very early in training… Tools like RLP can build on this foundation, making learning more active, more curious, and even more efficient."
There is still a lot to learn about the dynamics of pre-training reinforcement learning, but what seems clear is that “introducing exploration early in training opens up a new axis of expansion — not just in scale, but in how models learn to think,” Catanzaro said.