The AI research community continues to find new ways to improve large language models (LLMs), the latest of which is a new architecture introduced by scientists at Meta and the University of Washington.
Their technique, the Byte Latent Transformer (BLT), could be the next important architecture for making LLMs more versatile and scalable.

BLT addresses one of the long-standing limitations of LLMs by operating at the byte level instead of on tokens. It could open the way to new models that can process raw data, are robust to small changes in their input, and do not rely on a fixed vocabulary.
Tokens versus bytes
Most LLMs are trained on a fixed vocabulary of tokens: pre-defined groups of byte sequences.
During inference, the tokenizer splits the input sequence into tokens before passing them to the LLM.
This makes models more efficient in their use of computational resources, but it also creates biases that can reduce performance when the model encounters byte sequences not well covered by its vocabulary.
For example, many leading language models can become slow and more expensive when faced with languages that are underrepresented on the web, because their words are not included in the model’s token vocabulary. Misspelled words can also cause the model to encode the input incorrectly. And token-based models can struggle with character-level tasks, such as manipulating individual characters in a sequence.
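To see how tokenization can fragment unfamiliar input, here is a quick illustration using the open-source tiktoken library (chosen for this example only; it is not the tokenizer of the models discussed here, and the exact counts depend on the vocabulary). A misspelled or rare word typically splits into more tokens than its common form:

```python
# Illustrative only: a misspelling or an uncommon word often fragments into
# more tokens than the familiar form, even though the byte count barely changes.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello world", "helo wrold", "Schmetterlingssammlung"]:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(tokens)} tokens, {len(text.encode('utf-8'))} bytes")
```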
Moreover, modifying the vocabulary requires retraining the model, and expanding the token vocabulary may require architectural changes to accommodate the added complexity.
Alternatively, LLMs can be trained directly on individual bytes, which can solve many of the above problems. However, byte-level LLMs are prohibitively expensive to train at scale and cannot handle very long sequences, which is why tokenization remains an essential part of today’s LLMs.
Byte Latent Transformer (BLT)
Byte Latent Transformer (BLT) is a token-free architecture that learns directly from raw bytes and matches the performance of token-based models. To overcome the shortcomings of other byte-level LLMs, BLT uses a dynamic method that groups bytes into patches based on the amount of information they contain.
“Central to our architecture is the idea that models should dynamically allocate computation when it is needed,” the researchers wrote.
Unlike token-based models, BLT does not have a fixed vocabulary. Instead, it groups arbitrary sequences of bytes into patches using entropy measures. BLT performs this dynamic patching through a novel architecture with three transformer blocks: two small byte-level encoder/decoder models and a large “global latent transformer.”
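As a rough illustration of the idea (not Meta’s implementation), the sketch below starts a new patch whenever a small byte-level model’s predicted next-byte entropy crosses a threshold. The `next_byte_entropy` function and the threshold value here are stand-ins; in BLT, that signal would come from a trained byte-level language model.

```python
import math
from typing import Callable, List

def entropy_patches(
    data: bytes,
    next_byte_entropy: Callable[[bytes, int], float],  # stand-in for a small byte-LM's entropy at position i
    threshold: float,
) -> List[bytes]:
    """Split a byte sequence into patches, starting a new patch whenever the
    byte-level model is 'surprised' (entropy above the threshold)."""
    patches, start = [], 0
    for i in range(1, len(data)):
        if next_byte_entropy(data, i) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# Toy stand-in for the byte-LM: pretend entropy spikes at the start of each word.
def toy_entropy(data: bytes, i: int) -> float:
    return 3.0 if data[i - 1 : i] == b" " else 0.5

print(entropy_patches(b"the cat sat on the mat", toy_entropy, threshold=1.0))
```

With this toy entropy signal, patch boundaries land at word starts, which matches the intuition the researchers describe: predictable continuations get folded into long patches, while surprising bytes open new ones.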

The encoder and decoder are lightweight models. The local encoder takes the raw input bytes and creates the patch representations that are fed to the global transformer. At the other end, the local decoder takes the patch representations processed by the global transformer and decodes them into raw bytes.
The global latent transformer is the main backbone of the model. It takes the patch representations generated by the encoder and predicts the next patch in the sequence. When processed by the decoder, this patch is broken down into one or several bytes.
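The PyTorch sketch below shows how such a three-block layout could be wired together. It is a simplified toy, not the released BLT code: patch pooling here is a plain mean over each patch’s byte states rather than the paper’s cross-attention, the sizes are arbitrary, and causal masking is omitted for brevity.

```python
# A heavily simplified sketch of the three-block layout described above.
# NOT Meta's implementation: pooling, sizes, and masking are toy choices.
import torch
import torch.nn as nn

def blocks(d_model: int, n_layers: int, n_heads: int = 4) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=n_layers)

class ToyBLT(nn.Module):
    def __init__(self, d_local: int = 64, d_global: int = 128):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_local)             # raw bytes, no vocabulary
        self.local_encoder = blocks(d_local, n_layers=1)         # small byte-level encoder
        self.to_global = nn.Linear(d_local, d_global)
        self.global_transformer = blocks(d_global, n_layers=4)   # large latent transformer over patches
        self.from_global = nn.Linear(d_global, d_local)
        self.local_decoder = blocks(d_local, n_layers=1)         # small byte-level decoder
        self.byte_head = nn.Linear(d_local, 256)                 # next-byte logits

    def forward(self, byte_ids: torch.Tensor, patch_lens: list[int]) -> torch.Tensor:
        # byte_ids: (1, T) byte values; patch_lens: length of each patch, summing to T.
        h = self.local_encoder(self.byte_embed(byte_ids))                     # (1, T, d_local)
        patch_reprs = torch.stack(
            [p.mean(dim=1) for p in torch.split(h, patch_lens, dim=1)], dim=1
        )                                                                      # (1, P, d_local)
        g = self.global_transformer(self.to_global(patch_reprs))              # (1, P, d_global)
        # Broadcast each patch's global state back to its bytes and decode to byte logits.
        per_byte = torch.repeat_interleave(
            self.from_global(g), torch.tensor(patch_lens), dim=1
        )                                                                      # (1, T, d_local)
        out = self.local_decoder(h + per_byte)                                 # (1, T, d_local)
        return self.byte_head(out)                                             # (1, T, 256)

data = b"the cat sat"
byte_ids = torch.tensor([list(data)])
logits = ToyBLT()(byte_ids, patch_lens=[4, 4, 3])
print(logits.shape)  # torch.Size([1, 11, 256])
```

Note how the expensive block only sees one position per patch, while the cheap byte-level blocks handle the full-length sequence.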
The global transformer accounts for the largest share of computing resources during training and inference. The patching mechanism therefore determines how the global transformer is used, and it can help control the amount of computation spent on different parts of the input and output.
BLT redefines the trade-off between vocabulary size and compute requirements. In standard LLMs, a larger vocabulary means longer tokens on average, which can reduce the number of steps required to process a sequence. However, it also requires larger projection layers inside the transformer, which itself consumes more resources.
In contrast, BLT can balance computing resources based on data complexity rather than vocabulary size. For example, the ending of most words is easy to predict and requires fewer resources. On the other hand, predicting the first byte of a new word or the first word of a sentence requires more computation cycles.
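A back-of-the-envelope calculation illustrates the lever: if the global transformer dominates compute and its cost grows with the number of patches it processes, then longer average patches mean fewer global steps. The figures below are hypothetical, not the paper’s FLOP accounting.

```python
# Hypothetical numbers, for intuition only: fewer, longer patches means fewer
# passes through the dominant global transformer, freeing budget that could
# instead go into a larger global model.
sequence_bytes = 1_000_000
flops_per_patch = 1e9  # assumed cost of one global-transformer step

for avg_patch_bytes in (4, 6, 8):
    patches = sequence_bytes / avg_patch_bytes
    print(f"avg patch {avg_patch_bytes} bytes -> "
          f"{patches:,.0f} global steps, ~{patches * flops_per_patch:.2e} FLOPs")
```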
“BLT opens up a new dimension of scaling, allowing simultaneous increases in model and patch size within a fixed inference budget,” the researchers wrote. “This new paradigm becomes advantageous for compute regimes commonly encountered in practical settings.”
BLT in action
The researchers conducted experiments on BLT and classical transformers on models of various scales, ranging from 400 million to 8 billion parameters.
According to the authors, this is “the first flop-controlled scaling study of byte-level models with up to 8B parameters and 4T training bytes, demonstrating that we can train a model end-to-end at scale from bytes without fixed-vocabulary tokenization.”
Their results show that when controlling for the amount of compute allocated to training, BLT matches the performance of Llama 3 while using up to 50% fewer FLOPs at inference time. This efficiency comes from the model’s dynamic patching, which produces longer groups of bytes on average, saving computation that can be reallocated to growing the global latent transformer.
“To our knowledge, BLT is the first byte-level transformer architecture that achieves scaling trends matching those of BPE-based models in compute-optimal regimes,” the researchers wrote.
Beyond efficiency, BLT models proved more robust to noisy inputs than token-based models. They showed enhanced character-level understanding and improved performance on tasks such as character manipulation and low-resource machine translation. According to the researchers, BLT’s ability to process raw bytes directly rather than tokens “provides significant improvements in modeling the long tail of data,” meaning the models are better at handling patterns that rarely appear in the training corpus.
This is still the beginning of what could become a new standard for building language models. The researchers note that existing transformer libraries and codebases are designed to be highly efficient for token-based transformer architectures, which means BLT still has room to benefit from software and hardware improvements.