Self-improving language models are now a reality thanks to MIT’s updated SEAL technology




Researchers at the Massachusetts Institute of Technology (MIT) are drawing renewed interest after updating and open-sourcing a technique that allows large language models (LLMs) – such as those underpinning ChatGPT and most modern AI-powered chatbots – to improve themselves by generating synthetic data to fine-tune on.

The technique, known as SEAL (Self-Adapting LLMs), was first described in a research paper published in June and was covered by VentureBeat at the time.

A significantly expanded and updated version of the paper was released last month, alongside open-source code published on GitHub (under an MIT license, allowing for commercial and institutional use), and it is making new waves among AI users on the social network X this week.

SEAL allows LLMs to autonomously create and apply their own fine-tuning strategies. Unlike conventional approaches that rely on fixed external data and human-crafted optimization pipelines, SEAL enables models to evolve by producing their own synthetic training data and corresponding fine-tuning directives.

The work comes from a team affiliated with MIT’s Improbable AI Lab, including Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, and Pulkit Agrawal. Their research was recently presented at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025).

Background: From “beyond static AI” to self-adaptive systems

Earlier this year, VentureBeat first reported on SEAL as an early-stage framework that allowed language models to generate their own synthetic data and train on it — a potential cure for the stagnation of pre-trained models once deployed.

At that point, SEAL was formulated as a proof of concept that could allow enterprise AI agents to learn continuously in dynamic environments without manual retraining.

Since then, the research has advanced significantly. The new version expands the earlier framework by showing that SEAL’s self-adaptation ability scales with model size, incorporates reinforcement learning more effectively to reduce catastrophic forgetting, and formalizes SEAL’s two-loop structure (an inner supervised fine-tuning loop and an outer reinforcement learning loop) for reproducibility.

The updated paper also presents evaluations across different prompt formats, improves stability during learning cycles, and discusses practical deployment challenges at inference time.

Addressing the limitations of fixed models

While LLMs have demonstrated remarkable abilities in text generation and comprehension, their adaptation to new tasks or knowledge is often manual, fragile, or context dependent.

SEAL challenges this status quo by giving models the ability to generate what the authors call “self-edits”: natural-language outputs that specify how the model should update its own weights.

These self-edits may take the form of restated information, logical implications, or configurations for data augmentation and training tools. Once a self-edit is generated, the model fine-tunes itself on it. The process is guided by reinforcement learning, where the reward signal comes from improved performance on a downstream task.
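To make the loop concrete, here is a minimal Python sketch of one self-edit step. The prompt wording and the `finetune` and `evaluate` helpers are assumptions introduced for illustration; this is a sketch of the idea described above, not the authors’ released code.

```python
# Minimal sketch of a single SEAL-style self-edit step (illustrative only).
# `finetune` and `evaluate` are hypothetical helpers passed in by the caller.

def self_edit_step(model, context, eval_task, finetune, evaluate):
    # 1. The model writes a "self-edit": natural-language training data / directives.
    self_edit = model.generate(
        f"Rewrite the following passage as training material (implications, restatements):\n{context}"
    )

    # 2. The self-edit is applied as a supervised fine-tuning update (e.g. a LoRA update).
    updated_model = finetune(model, training_text=self_edit)

    # 3. The reward is the improvement on a downstream evaluation task.
    reward = evaluate(updated_model, eval_task) - evaluate(model, eval_task)
    return self_edit, reward
```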

The design mimics how human learners might reformulate or reorganize course material to better absorb information. Restructuring knowledge prior to assimilation is a major advantage over models that passively consume new data “as is.”

Performance across tasks

SEAL was tested in two main settings: knowledge incorporation and few-shot learning.

In the knowledge incorporation setting, the researchers evaluated how well the model could internalize new factual content from passages similar to those in SQuAD, a standard reading comprehension dataset released by Stanford University in 2016 that contains more than 100,000 crowd-sourced question-answer pairs based on Wikipedia articles (Rajpurkar et al., 2016).

Instead of fine-tuning directly on the passage text, the model generated synthetic implications of the passage and then fine-tuned on them.

After two rounds of reinforcement learning, the model improved question-answering accuracy from 33.5% to 47.0% on a no-context version of SQuAD – exceeding results obtained using synthetic data generated by GPT-4.1.
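As an illustration of that evaluation protocol, the sketch below shows how a no-context check of this kind might be wired up. The prompt wording and the `finetune` helper are assumptions, not the paper’s implementation.

```python
# Illustrative sketch of the no-context knowledge-incorporation evaluation.
# `model` and `finetune` are assumed to exist; this is not the authors' code.

def knowledge_incorporation_eval(model, finetune, passage, qa_pairs):
    """Adapt on self-generated implications of `passage`, then quiz without the passage."""
    # The model rewrites the passage as synthetic implications to train on ...
    implications = model.generate(f"List the implications of this passage:\n{passage}")
    adapted = finetune(model, training_text=implications)

    # ... and is then asked the questions with no passage in context (the no-context setup).
    correct = sum(
        answer.lower() in adapted.generate(question).lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)
```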

In the few-shot learning setting, SEAL was evaluated on a subset of the ARC benchmark, where tasks require reasoning from only a handful of examples. Here, SEAL generated self-edits specifying data augmentations and training hyperparameters.

After reinforcement learning, the success rate on held-out tasks jumped to 72.5%, compared with 20% for self-edits generated without reinforcement learning. Models that relied solely on in-context learning, with no adaptation at all, scored 0%.
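The sketch below shows, in hypothetical form, what such a self-edit might look like as a configuration object. The augmentation names, hyperparameter values, and helper functions are invented for illustration; in SEAL the model itself would generate this specification.

```python
# Hypothetical shape of a few-shot self-edit in the ARC setting: rather than writing raw
# training text, the model selects data augmentations and training hyperparameters.
# All names and values below are invented; `finetune` and `augment` are assumed helpers.

def apply_arc_self_edit(model, finetune, augment, task):
    self_edit = {  # in SEAL this dict would be produced by the model, hard-coded here for brevity
        "augmentations": ["rotate_90", "flip_horizontal", "permute_colors"],
        "hyperparameters": {"learning_rate": 1e-4, "epochs": 3, "lora_rank": 16},
    }
    train_pairs = augment(task.demo_pairs, self_edit["augmentations"])
    adapted = finetune(model, train_pairs, **self_edit["hyperparameters"])
    # Success on the held-out test grid is what feeds the success-rate figure cited above.
    return adapted.solve(task.test_input) == task.test_output
```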

Technical framework

SEAL works using a two-loop architecture: the inner loop performs supervised fine-tuning based on the self-edits, while the outer loop uses reinforcement learning to optimize the policy that generates those self-edits.

The reinforcement learning algorithm is based on ReSTEM, which combines sampling with filtered behavior cloning. During training, only self-edits that lead to improved performance are reinforced. This effectively teaches the model which kinds of edits are most useful for its own learning.
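A minimal sketch of that outer loop, assuming hypothetical `finetune` and `evaluate` helpers, might look like the following. It illustrates the sample-filter-clone pattern rather than reproducing the released training code.

```python
# Sketch of a ReSTEM-style outer loop: sample candidate self-edits, keep only those whose
# inner fine-tuning improves the downstream score, then behavior-clone on the survivors.
# All helper callables are assumptions introduced for illustration.

def outer_rl_round(model, tasks, finetune, evaluate, samples_per_task=4):
    winners = []  # (prompt, self_edit) pairs that produced a measurable improvement
    for task in tasks:
        baseline = evaluate(model, task)
        for _ in range(samples_per_task):
            self_edit = model.generate(task.self_edit_prompt)   # sample a candidate edit
            adapted = finetune(model, training_text=self_edit)  # inner loop: SFT on the edit
            if evaluate(adapted, task) > baseline:              # filter on reward
                winners.append((task.self_edit_prompt, self_edit))
    # Behavior cloning: supervised fine-tuning on the successful (prompt, edit) pairs.
    return finetune(model, sft_pairs=winners)
```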

For efficiency, SEAL applies LoRA-based fine-tuning rather than full parameter updates, enabling rapid experimentation and low-cost adaptation.
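For readers unfamiliar with LoRA, the snippet below shows one common way to attach low-rank adapters using Hugging Face’s peft library. The checkpoint name, rank, and target modules are illustrative choices, not the paper’s exact configuration.

```python
# One common LoRA setup with Hugging Face's peft library (illustrative values only).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # any causal LM checkpoint
lora_cfg = LoraConfig(
    r=16,                                   # low-rank dimension: only small adapter matrices train
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],    # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # a small fraction of the full parameter count
```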

Strengths and limitations

The researchers report that SEAL can produce highly useful training data with minimal supervision, outperforming even large external models such as GPT-4.1 on specific tasks.

They also demonstrate that SEAL generalizes beyond its original setup: it continues to perform well when moving from single-pass updates to continued pretraining over multiple documents.

However, the framework is not without limitations. One problem is catastrophic forgetting, where updates to incorporate new information can lead to decreased performance on previously learned tasks.

In response to this concern, co-author Jyo Pari told VentureBeat by email that reinforcement learning (RL) appears to mitigate forgetting more effectively than supervised fine-tuning (SFT), citing a recent paper on the topic. Combining this insight with SEAL could lead to new variants in which SEAL learns not only its training data but also its reward function, he added.

Another challenge is computational overhead: evaluating each self-edit requires fine-tuning and performance testing, which can take 30 to 45 seconds per edit, far more than a typical reinforcement learning step.
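A rough back-of-envelope calculation illustrates the cost. The task and sample counts below are arbitrary assumptions; only the 30-to-45-second figure comes from the discussion above.

```python
# Back-of-envelope cost of one outer RL round under the reported per-edit latency.
seconds_per_edit = (30 + 45) / 2          # midpoint of the reported 30-45 s range
tasks, samples_per_task = 50, 5           # arbitrary illustrative workload
total_hours = tasks * samples_per_task * seconds_per_edit / 3600
print(f"~{total_hours:.1f} hours per round")  # roughly 2.6 hours of sequential fine-tune + eval
```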

As Pari explained, “Training SEAL is non-trivial because it requires two optimization loops, an outer RL loop and an inner SFT loop. At inference time, updating the model weights will also require new systems infrastructure.” He stressed the need for future research into deployment systems as a critical path to making SEAL practical.

In addition, SEAL’s current design assumes the presence of paired tasks and reference answers for each context, which limits its direct applicability to unlabeled corpora. However, Pari explained that as long as there is a downstream task with a computable reward, SEAL can be trained to adapt accordingly, even in safety-critical domains. In principle, a SEAL-trained model could learn to avoid training on harmful or malicious inputs if guided by the appropriate reward signal.

AI community feedback

The AI research and builder community has reacted with a mixture of excitement and speculation to the SEAL paper. On X, formerly Twitter, several prominent AI-focused accounts weighed in on its potential impact.

User @VraserX, a self-described AI educator and enthusiast, described SEAL as “the birth of continuous self-learning AI” and predicted that models like OpenAI’s GPT-6 could adopt a similar architecture.

In their words, SEAL represents “the end of the era of frozen weights,” ushering in systems that evolve as the world around them changes.

They highlighted SEAL’s ability to form persistent memories, repair knowledge, and learn from real-time data, comparing it to a foundational step toward models that don’t just use information but absorb it.

Meanwhile, @alex_prompter, co-founder of an AI-powered marketing venture, framed SEAL as a leap toward models that literally rewrite themselves. “MIT has just built an AI that can rewrite its own code to become smarter,” he wrote. Citing the paper’s key findings – a 40% boost in factual recall and outperformance of GPT-4.1 using self-generated data – he described the results as confirmation that self-adapting LLMs “are no longer science fiction.”

This enthusiasm reflects a broader appetite in AI for models that can evolve without constant retraining or human oversight — especially in rapidly changing fields or personal use cases.

Future directions and open questions

In response to questions about scaling SEAL to larger models and tasks, Pari pointed to experiments (Appendix B.7) showing that as model size increases, so does the ability to self-adapt. He compared this to students improving their study techniques over time: larger models are simply better at generating useful self-edits.

When asked whether SEAL generalizes to new prompt formats, he confirmed that it does, citing Table 10 in the paper. However, he also acknowledged that the team has not yet tested SEAL’s ability to transfer across entirely new domains or model architectures.

“SEAL is an initial work that showcases the possibilities,” he said. “But it requires more testing.” He added that generalization may improve as SEAL is trained on a broader distribution of tasks.

Interestingly, the team found that only a small number of reinforcement learning steps already produced measurable performance gains. “This is exciting, because it means that with more compute we can hopefully get more improvements,” Pari noted. He suggested that future experiments explore more advanced reinforcement learning methods beyond ReSTEM, such as Group Relative Policy Optimization (GRPO).

Towards more adaptive and effective models

SEAL represents a step toward models that can improve themselves over time, whether by incorporating new knowledge or by reshaping how they learn. The authors envision future extensions in which SEAL could support self-training, continual learning, and the development of agentic systems: models that interact with evolving environments and adapt incrementally.

In such settings, a model could use SEAL to synthesize weight updates after each interaction, gradually internalizing behaviors or insights. This could reduce the need for repeated supervision and manual intervention, especially in domains with limited or specialized data.

As public web text becomes saturated and scaling LLMs becomes restricted by data availability, self-directed approaches such as SEAL can play a critical role in pushing the boundaries of what LLMs can achieve.

You can access the SEAL Project, including additional code and documentation, at: https://jyopari.github.io/posts/seal


