Researchers at the University of Illinois Urbana-Champaign and Google Cloud AI Research have developed a framework that enables large language model (LLM) agents to organize their experiences into a memory bank, helping them get better at complex tasks over time.
The framework, called ReasoningBank, extracts "generalizable reasoning strategies" from an agent's successful and failed attempts to solve problems. The agent then draws on this memory during reasoning to avoid repeating past mistakes and to make better decisions when faced with new problems. The researchers show that when combined with test-time scaling techniques, in which an agent makes multiple attempts at a problem, ReasoningBank significantly improves the performance and efficiency of LLM agents.
Their findings show that ReasoningBank consistently outperforms classical memory mechanisms across web browsing and software engineering benchmarks, providing a practical path toward building more adaptive and reliable AI agents for enterprise applications.
The challenge of LLM agent memory
As LLM agents are deployed in applications that run for long periods, they face a constant stream of tasks. A major limitation of current LLM agents is their failure to learn from this accumulated experience. By treating each task in isolation, they inevitably repeat past mistakes, discard valuable insights from related problems, and fail to develop skills that would make them more capable over time.
The solution to this limitation is to give agents some kind of memory. Previous efforts to give agents memory have focused on storing past interactions for reuse by organizing information in various forms from plain text to structured graphs. However, these methods often fall short. Many of them use raw interaction logs or just store examples of successful tasks. This means that they cannot extract high-level, transferable patterns of thinking and, more importantly, they do not extract and use valuable information from the agent’s failures. As the researchers note in their paper, “Current memory designs often remain limited to passive record keeping rather than providing actionable and generalizable guidelines for future decision making.”
How does ReasoningBank work?
ReasoningBank is a memory framework designed to overcome these limitations. Its central idea is to extract useful strategies and logical hints from past experiences and turn them into structured memory elements that can be stored and reused.
According to Jun Yan, a research scientist at Google and co-author of the paper, this represents a fundamental shift in how agents operate. "Traditional agents operate statically, with each task being processed individually," Yan explained. "ReasoningBank changes this by turning every significant experience (successful or failed) into an organized, reusable reasoning memory. As a result, the agent is not starting from scratch with each task; it remembers and adapts proven strategies from previous similar situations."
The framework distills both successful and failed experiences into a set of useful strategies and cautionary lessons. The agent judges success and failure using an LLM-as-a-judge, avoiding the need for human labeling.
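To make this concrete, here is a minimal sketch of what a distilled memory item and an LLM-as-a-judge check might look like. The field names (`title`, `description`, `content`) and the prompt wording are illustrative assumptions, not the paper's exact schema; `llm` stands in for any callable that returns model text.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    """One distilled reasoning strategy (schema is an illustrative assumption)."""
    title: str          # short name of the strategy
    description: str    # one-sentence summary
    content: str        # actionable guidance distilled from the experience
    from_success: bool  # whether it came from a successful or a failed attempt

def judge_success(llm, task: str, trajectory: str) -> bool:
    """LLM-as-a-judge: ask a model whether the trajectory solved the task,
    instead of relying on human labels."""
    prompt = (
        f"Task: {task}\n"
        f"Agent trajectory: {trajectory}\n"
        "Did the agent complete the task? Answer YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")
```

A failed attempt would then yield items like `MemoryItem("Refine search queries", ...)` with `from_success=False`, stored alongside strategies mined from successes.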
Yan offers a practical example of this process in action. An agent tasked with finding Sony headphones may fail because its broad search query returns more than 4,000 unrelated products. "ReasoningBank will first try to find out why this approach failed," Yan said. "Then, strategies such as 'refining your search query' and 'narrowing products with category filtering' will be extracted. These strategies will be very useful for successfully completing future similar tasks."
The process works as a closed loop. When an agent encounters a new task, it uses embedding-based search to retrieve relevant memories from ReasoningBank to guide its actions. These memories are inserted into the agent's system prompt, providing context for its decision-making. Once the task is complete, the framework extracts insights from both successes and failures and distills them into new memory items. This new knowledge is then integrated into ReasoningBank, allowing the agent to continually evolve and improve its capabilities.
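The retrieval half of that loop can be sketched in a few lines. This assumes memories have already been embedded into vectors; the function names and prompt layout are hypothetical, and cosine similarity stands in for whatever embedding search the authors actually use.

```python
import numpy as np

def retrieve(query_vec, memory_vecs, memories, k=3):
    """Embedding-based retrieval: return the top-k memories most similar
    to the new task's embedding (cosine similarity)."""
    if len(memories) == 0:
        return []
    sims = memory_vecs @ query_vec / (
        np.linalg.norm(memory_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top = np.argsort(sims)[::-1][:k]
    return [memories[i] for i in top]

def build_prompt(task, retrieved):
    """Inject the retrieved strategies into the agent's system prompt."""
    lessons = "\n".join(f"- {m}" for m in retrieved)
    return f"Relevant strategies from past experience:\n{lessons}\n\nTask: {task}"
```

After the task finishes, newly distilled items (and their embeddings) are appended to the store, closing the loop.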
Combining memory with test-time scaling
The researchers found a strong synergy between memory and test-time scaling. Classic test-time scaling involves generating multiple independent answers to the same question, but the researchers argue that "this vanilla form is suboptimal because it does not take advantage of the inherent contrastive signal that arises from redundant exploration of the same problem."
To address this, they propose memory-aware test-time scaling (MaTTS), which integrates scaling with ReasoningBank. MaTTS comes in two forms. In "parallel scaling," the system generates multiple reasoning trajectories for the same query, then compares and contrasts them to identify consistent reasoning patterns. In "sequential scaling," the agent iteratively refines its reasoning within a single attempt, with intermediate feedback and corrections also serving as valuable memory signals.
This creates a virtuous cycle: the memory in ReasoningBank steers the agent toward more promising solutions, while the diverse experiences generated through scaling let the agent create higher-quality memories to store in ReasoningBank.
"This synergy positions memory-driven experience scaling as a new scaling dimension for agents," the researchers write.
ReasoningBank in action
The researchers tested the framework on WebArena (web browsing) and SWE-Bench-Verified (software engineering) using models including Google's Gemini 2.5 Pro and Anthropic's Claude 3.7 Sonnet. They compared ReasoningBank against baselines including memory-free agents and agents using trajectory-based or workflow-based memory frameworks.
The results show that ReasoningBank consistently outperforms these baselines across datasets and LLM backbones. On WebArena, it improved the overall success rate by up to 8.3 percentage points over a memory-free agent. It also generalized better to harder, cross-domain tasks while reducing the number of interaction steps needed to complete them. When combined with MaTTS, both parallel and sequential scaling further boosted performance, consistently outperforming standard test-time scaling.
This gain in efficiency has a direct impact on operating costs. Yan points to a case where a memory-free agent took eight trial-and-error steps just to find the right product listing on a website. "Those trial-and-error costs can be avoided by leveraging relevant insights from ReasoningBank," he noted. "In this case, we cut operating costs almost in half," which also improves the user experience by resolving issues faster.
For enterprises, ReasoningBank could help build cost-effective agents that learn from experience and adapt over time in complex workflows and domains such as software development, customer support, and data analysis. As the researchers conclude, "Our findings suggest a practical path toward building resilient, lifelong-learning agents."
Yan emphasized that the findings point toward agents that can compose the skills they learn. For example, a coding agent can pick up distinct skills such as API integration and database administration from separate tasks. "Over time, these modular skills…become building blocks that the agent can flexibly reassemble to solve more complex tasks," he said, suggesting a future where agents autonomously pool their knowledge to handle entire workflows with minimal human oversight.