Meta proposes new scalable memory layers that improve knowledge and reduce hallucinations

As companies continue to adopt large language models (LLMs) in various applications, one of the major challenges they face is improving the models' real-world knowledge and reducing hallucinations. In a new paper, researchers at Meta AI propose "scalable memory layers," which could be one of several possible solutions to this problem.

Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute. The architecture is useful for applications where you can spare extra memory for factual knowledge but also want the inference speed of nimbler models.

Dense and memory layers

Traditional language models use "dense layers" to encode vast amounts of information in their parameters. In dense layers, all parameters are used at full capacity and are mostly activated at the same time during inference. Dense layers can learn complex functions, but increasing their capacity requires additional computational and energy resources.

In contrast, for simple factual knowledge, much simpler layers with associative memory architectures would be more efficient and interpretable. This is what memory layers do. They use simple sparse activations and key-value lookup mechanisms to encode and retrieve knowledge. Sparse layers take up more memory than dense layers but only use a small portion of their parameters at a time, which makes them much more compute-efficient.
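As a rough illustration of that key-value lookup idea, the sketch below (in PyTorch) shows a memory layer that scores a large table of learnable keys, keeps only the top-k matches, and returns a weighted sum of the corresponding values. The class name, sizes, and brute-force key scoring are assumptions for clarity, not Meta's implementation; real designs of this kind typically use tricks such as product-key lookup to avoid scoring every key.

```python
# Minimal sketch of a sparse key-value memory layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    def __init__(self, dim: int, num_keys: int = 65_536, top_k: int = 32):
        super().__init__()
        # Learnable keys and values: memory-heavy, but only top_k values are touched per query.
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.top_k = top_k

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim). Score every key, then keep only the best top_k (sparse activation).
        scores = query @ self.keys.t()                         # (batch, num_keys)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # indices of the few active slots
        weights = F.softmax(top_scores, dim=-1)                # (batch, top_k)
        selected = self.values[top_idx]                        # (batch, top_k, dim)
        return (weights.unsqueeze(-1) * selected).sum(dim=1)   # weighted sum of selected values
```

Because only `top_k` values participate in each lookup, the compute cost barely grows as the key-value table grows, which is the memory-for-compute trade-off described above.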

Memory layers have existed for several years but are rarely used in modern deep learning architectures, largely because they are not optimized for current hardware accelerators.

Current frontier LLMs typically use some form of "mixture of experts" (MoE), an architecture that relies on a mechanism vaguely similar to memory layers. MoE models are composed of many smaller expert components, each specialized in specific tasks. At inference time, a routing mechanism determines which experts to activate based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts, providing more granular control over the parameters that become activated during inference.
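For comparison, a stripped-down mixture-of-experts layer might look like the sketch below: a small router scores a handful of expert feed-forward networks and only the top-scoring ones run for each input. This is a generic illustration of the routing idea, not the PEER architecture or any production MoE implementation, and all names and sizes are assumptions.

```python
# Minimal sketch of a mixture-of-experts layer with top-k routing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # scores each expert for a given input
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Only the top_k experts run for each input row.
        gate_scores, expert_idx = self.router(x).topk(self.top_k, dim=-1)
        gate_weights = F.softmax(gate_scores, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = expert_idx[:, slot] == e          # rows routed to expert e in this slot
                if mask.any():
                    out[mask] += gate_weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```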

Upgrading memory layers

Memory layers are light on compute but heavy on memory, which presents specific challenges for current hardware and software frameworks. In their paper, the Meta researchers propose several modifications that solve these challenges and make them practical for widespread use.

Memory layers can store knowledge in parallel across multiple GPUs without slowing down the model (Source: arXiv)

First, the researchers configured the memory layers for parallelization, distributing them across several GPUs to store millions of key-value pairs without changing other layers in the model. They also implemented a special CUDA kernel for handling high-memory-bandwidth operations. And they developed a parameter-sharing mechanism that supports a single set of memory parameters across multiple memory layers within a model. This means that the keys and values used for lookups are shared across layers.

These modifications allow memory layers to be implemented within LLMs without slowing down the model.
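A rough way to picture the parameter-sharing idea: a single memory module (reusing the hypothetical SimpleMemoryLayer from the earlier sketch) is referenced by several blocks in the network, so there is only one pool of keys and values in GPU memory no matter how many layers perform lookups. The class and attribute names here are illustrative assumptions, not Meta's code.

```python
# Sketch of parameter sharing: several blocks look up into the SAME memory module.
import torch.nn as nn

class SharedMemoryBlock(nn.Module):
    def __init__(self, dim: int, shared_memory: SimpleMemoryLayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.memory = shared_memory            # not a copy: the very same module object

    def forward(self, x):
        return x + self.memory(self.norm(x))   # residual lookup into the shared memory

# One set of keys/values, referenced by three blocks -> memory cost does not triple.
shared = SimpleMemoryLayer(dim=512, num_keys=100_000, top_k=16)
blocks = nn.ModuleList(SharedMemoryBlock(512, shared) for _ in range(3))
```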

“Memory layers with their sparse activations nicely complement dense networks, providing increased capacity for knowledge acquisition while being light on computation,” the researchers wrote. “They can be scaled efficiently, providing practitioners with an attractive new direction for trading memory for compute.”

To test memory layers, the researchers modified Llama models by replacing one or more dense layers with a shared memory layer. They compared the memory-enhanced models against dense LLMs as well as MoE and PEER models on several tasks, including factual question answering, world and scientific knowledge, and coding.
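Conceptually, that swap can be as simple as replacing a block's dense feed-forward sub-module with the shared memory layer; the snippet below sketches this for a generic Llama-style model. The attribute names (`model.layers[i].mlp`) are assumptions about a typical transformer implementation, not the exact setup used in the paper.

```python
# Hypothetical sketch: swap the dense FFN in selected transformer blocks for a shared memory layer.
def replace_ffn_with_memory(model, layer_indices, shared_memory):
    for i in layer_indices:
        block = model.layers[i]       # assumed attribute layout of a Llama-style model
        block.mlp = shared_memory     # dense feed-forward replaced by the shared memory layer
    return model
```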

The 1.3B memory model (solid line), trained on 1 trillion tokens, approaches the performance of the 7B model (dashed line) on factual question-answering tasks as it is given more memory parameters (Source: arXiv)

Their results show that memory models improve significantly over dense baselines and compete with models that use 2X to 4X more compute. They also match the performance of MoE models that have the same compute budget and parameter count. The models' performance is especially notable on tasks that require factual knowledge. For example, on factual question answering, a memory model with 1.3 billion parameters approaches the performance of Llama-2-7B, which was trained on twice as many tokens and with 10X more compute.

Furthermore, the researchers found that the benefits of memory models remain consistent across model sizes as they scaled their experiments from 134 million to 8 billion parameters.

“Given these results, we strongly advocate that memory layers should be integrated into all next-generation AI architectures,” the researchers wrote, adding that there is still room for improvement. “In particular, we hope that new learning methods can be developed to push the effectiveness of these layers even further, enabling less forgetting, fewer hallucinations and continual learning.”


