A new neural network architecture developed by researchers at Google may solve one of the big challenges facing large language models (LLMs): extending their memory at inference time without exploding memory and compute costs. Named Titans, the architecture allows models to find small but important pieces of information in long sequences and store them during inference.
Titans combines traditional LLM attention blocks with “neural memory” layers that enable models to handle short- and long-term memory tasks efficiently. According to the researchers, LLMs that use neural long-term memory can scale to millions of tokens and outperform both classic LLMs and alternatives such as Mamba while using far fewer parameters.
Attention layers and linear models
The classic transformer architecture used in LLMs relies on the self-attention mechanism to compute the relationships between tokens. This is a powerful technique that can learn complex and fine-grained patterns in token sequences. However, as the sequence grows longer, the compute and memory costs of calculating and storing attention increase quadratically.
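As a rough illustration (a minimal PyTorch sketch of vanilla attention, not Google's code), the n-by-n score matrix is the source of that quadratic cost:

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    """Plain self-attention: the (n x n) score matrix is what makes
    cost grow quadratically with sequence length n."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # shape (n, n)
    return F.softmax(scores, dim=-1) @ v

n, d = 4_096, 64
q = k = v = torch.randn(n, d)
out = naive_attention(q, k, v)
# The score matrix alone holds n*n floats: ~67M values (~268 MB in float32)
# at n=8,192, so doubling the context roughly quadruples memory and compute.
```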
Recent proposals include alternative architectures with linear complexity that can scale without exploding memory and computation costs. However, the Google researchers argue that linear models do not show competitive performance compared with classic transformers, because they compress their contextual data and tend to miss important details.
They suggest that an ideal architecture should contain different memory components that can be coordinated to use existing knowledge, memorize new facts, and learn abstractions from their context.
“We argue that in an effective learning paradigm, similar to the human brain, there are distinct yet interconnected modules, each of which is responsible for a component crucial to the learning process,” the researchers write.
Long-term neural memory
“Memory is an association of systems—for example, short-term memory, working memory, and long-term memory—each serving a different function with different neural structures, and each capable of functioning independently,” the researchers wrote.
To fill the gap in current language models, the researchers propose a “neural long-term memory” module that can learn new information at inference time without the inefficiencies of the full attention mechanism. Instead of storing information during training, the neural memory module learns a function that can memorize new facts at inference time and dynamically adapts the memorization process based on the data it encounters. This sidesteps the generalization problem that other neural network architectures suffer from.
To determine which pieces of information are worth storing, the neural memory module uses the concept of “surprise.” The more the sequence of tokens differs from the type of information stored in the existing model weights and memory, the more surprising it is and thus worth memorizing. This allows the module to make efficient use of its limited memory and store only the portions of data that add useful information to what the model already knows.
To handle very long sequences of data, the neural memory module has an adaptive forgetting mechanism that allows it to remove information that is no longer needed, which helps manage limited memory capacity.
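To make the idea concrete, here is a heavily simplified sketch of a test-time memory module in PyTorch: a small MLP whose weights take a gradient step toward each new key-value association, with the prediction error standing in for “surprise” and a constant weight decay standing in for the paper's adaptive forgetting gate. The class name, hyperparameters, and update rule are illustrative assumptions, not Google's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralMemory(nn.Module):
    """Illustrative test-time memory: a tiny MLP updated at inference.
    A simplification of the paper's idea, not Google's released code."""

    def __init__(self, dim: int, lr: float = 0.01, decay: float = 0.001):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.lr, self.decay = lr, decay

    @torch.enable_grad()
    def update(self, key: torch.Tensor, value: torch.Tensor) -> float:
        """Store the (key -> value) association; the loss acts as 'surprise'."""
        pred = self.net(key)
        surprise = F.mse_loss(pred, value)          # large when the input is unexpected
        grads = torch.autograd.grad(surprise, list(self.net.parameters()))
        with torch.no_grad():
            for p, g in zip(self.net.parameters(), grads):
                p.mul_(1 - self.decay)              # constant decay stands in for adaptive forgetting
                p.add_(g, alpha=-self.lr)           # write: one gradient step toward the new fact
        return surprise.item()

    def read(self, query: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            return self.net(query)

# Usage: stream token representations through the memory at inference time.
mem = NeuralMemory(dim=64)
for chunk in torch.randn(10, 64):
    mem.update(chunk, chunk)   # self-association for brevity; real models use projected keys/values
retrieved = mem.read(torch.randn(64))
```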
The memory module would complement the attention mechanism in current transformer models, which the researchers describe as “short-term memory modules, attending to the current context window size. On the other hand, our neural memory with the ability to continuously learn from data and store it in its weights can play the role of a long-term memory.”
Titans architecture

The researchers describe Titans as a family of models that integrate existing transformer blocks with neural memory modules. The architecture has three main components: a “core” module, which acts as short-term memory and uses the classic attention mechanism to attend to the current segment of input tokens being processed by the model; a “long-term memory” module, which uses the neural memory architecture to store information beyond the current context; and a “persistent memory” module, a set of learnable parameters that remain fixed after training and store time-independent knowledge.
The researchers propose different ways of connecting the three components, but in general the main advantage of the architecture is that it enables the attention and memory modules to complement each other. For example, the attention layers can use historical and current context to determine which parts of the current context window should be stored in long-term memory. Meanwhile, long-term memory provides historical knowledge that is not present in the current attention context.
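Connecting the pieces might look roughly like the sketch below, which reuses the illustrative `NeuralMemory` class from above in a “memory as context” style block: persistent parameters and recalled long-term memory are prepended to the current segment before short-term attention, and the processed segment is then written back into memory. The wiring and module names are assumptions; the paper describes several variants.

```python
import torch
import torch.nn as nn

class TitansStyleBlock(nn.Module):
    """Schematic block combining short-term attention, long-term neural
    memory, and persistent memory. Illustrative only, not the official code."""

    def __init__(self, dim: int, n_heads: int = 8, n_persistent: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)   # short-term memory
        self.long_term = NeuralMemory(dim)             # test-time memory from the sketch above
        self.persistent = nn.Parameter(torch.randn(1, n_persistent, dim))   # fixed after training

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, segment_length, dim) -- the current chunk of tokens.
        # 1) Recall historical knowledge associated with this segment.
        recalled = self.long_term.read(segment)
        # 2) Attend over persistent knowledge + recalled history + current segment.
        context = torch.cat(
            [self.persistent.expand(segment.size(0), -1, -1), recalled, segment], dim=1
        )
        out, _ = self.attn(segment, context, context)
        # 3) Write the processed segment back into long-term memory
        #    (a single crude update on the segment mean, for brevity).
        self.long_term.update(segment.mean(dim=(0, 1)), out.mean(dim=(0, 1)).detach())
        return out

block = TitansStyleBlock(dim=64)
tokens = torch.randn(2, 128, 64)   # (batch, segment length, embedding dim)
out = block(tokens)
```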
The researchers ran small-scale tests of Titans models, ranging from 170 million to 760 million parameters, on a variety of tasks, including language modeling and long-sequence language tasks. They compared the performance of Titans with various transformer-based models, linear models such as Mamba, and hybrid models such as Samba.

Titans demonstrated strong language modeling performance compared to other models and outperformed both transformers and linear models of similar sizes.
The performance difference is especially pronounced on tasks involving long sequences, such as “needle in a haystack,” where the model must retrieve a piece of information from a very long sequence, and BABILong, where the model must reason across facts distributed in very long documents. On these tasks, Titans outperformed models with orders of magnitude more parameters, including GPT-4 and GPT-4o-mini, as well as a Llama-3 model augmented with retrieval-augmented generation (RAG).
Furthermore, the researchers were able to expand Titans’ context window to up to 2 million tokens while keeping memory costs at a modest level.
The models still need to be tested at larger sizes, but the results suggest the researchers have not yet hit the ceiling of Titans’ potential.
What does it mean for enterprise applications?
With Google at the forefront of long-context models, we can expect this technique to find its way into both proprietary and open models such as Gemini and Gemma.
As LLMs gain support for longer context windows, there is growing potential to build applications where you squeeze new knowledge into the prompt rather than relying on techniques like RAG. The development cycle for building and iterating on prompt-based applications is much faster than that of complex RAG pipelines. At the same time, architectures like Titans can help reduce inference costs for very long sequences, allowing companies to deploy LLM applications for more use cases.
Google plans to release the PyTorch and JAX code for training and evaluating Titans models.