New LLM optimization technology reduces memory costs by up to 75%



Researchers at Tokyo-based startup Sakana AI have developed a new technique that enables language models to use memory more efficiently, helping companies cut the costs of building applications on top of large language models (LLMs) and other transformer-based models.

The technique, called “universal transformer memory,” uses special neural networks to optimize LLMs so they retain important pieces of information and discard redundant details from their context.

Optimizing transformer memory

The responses of transformer models, the backbone of LLMs, depend on the contents of their “context window,” that is, what they receive as input from users.

The context window can be thought of as the model’s working memory. Modifying its contents can have a huge impact on the model’s performance, which has given rise to an entire field of “prompt engineering.”

Current models support very long context windows running to hundreds of thousands, or even millions, of tokens (the numerical units LLMs use to represent the words, word parts, phrases, concepts and numbers that users enter in their prompts).

Longer context windows enable users to cram more information into their prompts. However, longer prompts come with higher compute costs and slower performance. Optimizing prompts by removing unnecessary tokens while keeping the important information can reduce costs and increase speed.
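For a sense of scale, here is a rough back-of-the-envelope sketch (not taken from the paper) of how the key-value (KV) cache a transformer keeps for its context grows with prompt length. The Llama-3-8B-like dimensions and 16-bit values are assumptions chosen purely for illustration:

```python
# Back-of-the-envelope sketch: rough KV-cache memory for a Llama-3-8B-like
# transformer, to show why trimming tokens from the context matters.
# All dimensions below are illustrative assumptions, not measured figures.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Memory for keys + values cached across all layers, in bytes."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # keys and values
    return seq_len * per_token

for tokens in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens -> ~{gib:.1f} GiB of KV cache")

# Dropping, say, 75% of cached tokens shrinks this memory (and the attention
# work done over it) roughly proportionally.
```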

Existing prompt-optimization techniques are resource-intensive or require users to manually test different configurations to shrink their prompts.

Neural attention memory models

Universal transformer memory optimizes prompts using neural attention memory models (NAMMs), simple neural networks that decide whether to “remember” or “forget” each token stored in the LLM’s memory.

“This new capability allows transformers to discard unhelpful or redundant details and focus on the most critical information, something we find crucial for tasks requiring long-context reasoning,” the researchers wrote.
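As a toy picture of that keep-or-forget decision, the short sketch below applies a per-token mask to cached key and value tensors and reports how much of the cache is freed. It is an illustration only: the scores are random placeholders for what a trained NAMM would output, not Sakana AI’s code.

```python
import numpy as np

# Toy illustration of the remember/forget decision over a model's KV cache.
# The per-token scores are random stand-ins for a trained NAMM's output.

rng = np.random.default_rng(0)

n_tokens, head_dim = 1_000, 128
keys = rng.normal(size=(n_tokens, head_dim))     # cached keys for one attention head
values = rng.normal(size=(n_tokens, head_dim))   # cached values for the same head

namm_scores = rng.normal(size=n_tokens)          # placeholder per-token scores
keep = namm_scores > 0                           # positive score -> "remember"

keys, values = keys[keep], values[keep]
print(f"cache shrank from {n_tokens} to {keep.sum()} tokens "
      f"({100 * (1 - keep.mean()):.0f}% freed)")
```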

Universal transformer memory (Source: Sakana AI)

NAMMs are trained separately from the LLM and are combined with the pre-trained model at inference time, which makes them flexible and easy to deploy. However, they need access to the model’s internal activations, which means they can only be applied to open-source models.

Like other techniques developed by Sakana AI, NAMMs are trained through evolutionary algorithms instead of gradient-based optimization methods. By iteratively mutating and selecting the best-performing models through trial and error, evolutionary algorithms optimize NAMMs for efficiency and performance. This is especially important because NAMMs are trying to learn a non-differentiable objective: keeping or discarding tokens.
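To make that idea concrete, here is a minimal, generic evolution-strategy loop in the spirit of that description. It is a toy sketch, not Sakana AI’s training setup: the fitness function, population sizes and mutation scale are all placeholder assumptions.

```python
import numpy as np

# Minimal sketch of why evolutionary search fits here: the keep/drop decisions
# are discrete, so the objective is non-differentiable and gradients are
# unavailable. A simple (mu, lambda) evolution strategy mutates candidate
# scorer parameters, evaluates each candidate, and keeps the best.

rng = np.random.default_rng(0)

def task_score(params):
    """Stand-in fitness. In practice this would run the LLM with the pruned
    cache and measure downstream task accuracy."""
    target = np.linspace(-1.0, 1.0, params.size)   # hypothetical optimum
    return -np.sum((params - target) ** 2)

dim, pop_size, n_parents, sigma = 16, 32, 8, 0.3
parents = rng.normal(size=(n_parents, dim))

for generation in range(50):
    # Mutate: each child is a random parent plus Gaussian noise.
    idx = rng.integers(0, n_parents, size=pop_size)
    children = parents[idx] + sigma * rng.normal(size=(pop_size, dim))
    # Select: keep the top-scoring children as the next generation's parents.
    fitness = np.array([task_score(c) for c in children])
    parents = children[np.argsort(fitness)[-n_parents:]]

print(f"best fitness after search: {task_score(parents[-1]):.3f}")
```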

NAMMs operate on the attention layers of LLMs, a key component of the transformer architecture that determines the relationships and importance of each token in the model’s context window. Based on the attention values, NAMMs decide which tokens to keep and which to discard from the LLM’s context window. This attention-based mechanism makes it possible to use a trained NAMM on different models without further modification. For example, a NAMM trained only on text data can be applied to vision or multimodal models without additional training.
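The sketch below illustrates that attention-only interface: per-token features derived from how much attention each cached token has received are the only inputs, and a tiny stand-in scoring network decides what to keep. The feature choice and the scorer are illustrative assumptions, not the NAMM architecture described in the paper.

```python
import numpy as np

# Illustrative sketch of the attention-based interface: the scorer sees only
# the attention each cached token has received, never the model's embeddings,
# which is why a trained module can transfer to other transformers.
# The features and the scoring network below are assumptions for illustration.

rng = np.random.default_rng(0)

n_cached, n_recent_queries = 512, 64
# attention[i, j]: weight that query i places on cached token j (rows sum to 1).
logits = rng.normal(size=(n_recent_queries, n_cached))
attention = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Per-token features derived purely from attention values.
features = np.stack([attention.mean(axis=0),          # average attention received
                     attention.max(axis=0)], axis=1)  # peak attention received

# Tiny stand-in scoring network: one linear layer plus a threshold at zero.
w, b = rng.normal(size=2), 0.0
scores = features @ w + b
keep = scores > 0

print(f"keeping {keep.sum()}/{n_cached} cached tokens based on attention alone")
```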

Neural attention memory models (NAMMs) examine layers of attention to determine which tokens to retain or discard from the context window (Source: Sakana AI)

Universal transformer memory in action

To test the universal transformer memory concept in practice, the researchers trained a NAMM on top of the open-source Meta Llama 3-8B model. Their experiments show that with NAMMs, transformer-based models perform better on natural language and coding problems over very long sequences. At the same time, by discarding unnecessary tokens, the NAMM enabled the LLM to save up to 75% of its cache memory while performing those tasks.

“Across our benchmarks, NAMMs provide clear performance improvements to the Llama 3 8B transformer,” the researchers wrote. “Furthermore, our memory systems yield notable side benefits, reducing the context size of each layer, while never being explicitly optimized for memory efficiency.”

NAMMs compete with leading prompt optimization techniques while improving the model’s performance (Source: Sakana AI)

They also tested NAMMs on the 70B version of Llama, as well as transformer models designed for other modalities and tasks, such as Llava (computer vision) and Decision Transformer (reinforcement learning).

“Even in these out-of-distribution settings, NAMMs retain their benefits by eliminating tokens such as redundant video frames and suboptimal actions, allowing their new underlying models to focus on the most relevant information to improve performance,” the researchers wrote.

Task-dependent behavior

Another interesting finding is that NAMMs automatically adjust their behavior based on the task.

For example, in coding tasks, the model discards contiguous chunks of tokens corresponding to comments and whitespace that do not affect the code’s execution.

In natural language tasks, on the other hand, the model discards tokens that represent syntactic redundancies and do not affect the meaning of the sequence.

The researchers have released the code for creating your own NAMMs. Techniques such as universal transformer memory can be very useful for enterprise applications that process millions of tokens and stand to gain from speed boosts and cost reductions. The reusability of a trained NAMM also makes it a versatile tool to use across different applications in an enterprise.

Looking ahead, the researchers suggest more advanced techniques, such as using NAMMs during the training of LLMs to further extend their memory capabilities.

“This work has only begun to tap into the potential of our new class of memory models, which we anticipate might offer many new opportunities to advance future generations of transformers,” the researchers wrote.


