Paper Review: Byte Latent Transformer: Patches Scale Better Than Tokens

The Byte Latent Transformer (BLT) is a byte-level LLM that rivals tokenization-based LLMs in performance while offering better inference efficiency and robustness. It encodes bytes into dynamically sized patches, adjusting patch size to data complexity (the entropy of the next byte) so that compute is spent where it is needed. BLT scales, with experiments up to 8 billion parameters and 4 trillion training bytes, and it removes the need for a fixed vocabulary. Dynamic patch selection makes both training and inference more efficient and improves reasoning and long-tail generalization. At a fixed inference cost, BLT scales better than tokenization-based models because patch size and model size can be increased together.

Patching: From Individual Bytes to Groups of Bytes

Patching schemes

BLT segments byte sequences into patches to dynamically allocate compute based on context. Formally, a patching function determines where patches begin in the sequence, impacting the computational cost of the Transformer, which depends on the number of patches. The average patch size is a key factor in processing cost during training and inference.

There are multiple patching methods explored in this paper:

  • Fixed-Size Patching: This straightforward method groups bytes into patches of fixed size. While easy to implement and control computational cost, it fails to allocate compute efficiently to areas of high complexity and causes inconsistent patching across similar byte sequences.
  • Whitespace Patching: An improvement on fixed-size patching that creates patches at spaces, so the same word is always patched the same way and compute is focused on harder predictions. However, it cannot handle all languages or domains, and the patch size cannot be varied.
  • Entropy-Based Patching: A data-driven approach that places patch boundaries where next-byte prediction is uncertain. A small language model estimates the entropy of each next byte, and boundaries are drawn using one of two rules: a global threshold (a patch starts wherever the entropy exceeds a fixed value) or a relative threshold (a patch starts wherever the entropy rises above that of the previous byte by more than a set margin, breaking the roughly monotonic decrease of entropy within a patch).
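The patching rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the set of space-like bytes, the choice to start a patch right after a space, and the threshold names theta_g and theta_r are all assumptions made for the example. The entropies are assumed to come from a separate small model, with entropies[t] being the entropy of the prediction for byte t given the bytes before it.

```python
import math
from typing import List, Sequence

SPACE_BYTES = b" \t\n"  # assumed set of space-like bytes


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-byte distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fixed_size_starts(n_bytes: int, patch_size: int) -> List[int]:
    """Fixed-size patching: a new patch starts every patch_size bytes."""
    return list(range(0, n_bytes, patch_size))


def whitespace_starts(data: bytes) -> List[int]:
    """Whitespace patching: a new patch starts right after each space-like byte."""
    return [0] + [i + 1 for i, b in enumerate(data[:-1]) if b in SPACE_BYTES]


def global_threshold_starts(entropies: Sequence[float], theta_g: float) -> List[int]:
    """Start a patch at every position whose entropy exceeds theta_g."""
    starts = [0]  # the first byte always opens a patch
    starts += [t for t in range(1, len(entropies)) if entropies[t] > theta_g]
    return starts


def relative_threshold_starts(entropies: Sequence[float], theta_r: float) -> List[int]:
    """Start a patch where entropy rises by more than theta_r over the previous byte."""
    starts = [0]
    starts += [
        t for t in range(1, len(entropies))
        if entropies[t] - entropies[t - 1] > theta_r
    ]
    return starts
```

For the entropies [2.0, 0.5, 0.4, 1.8, 0.3, 0.9], a global threshold of 1.0 yields patch starts [0, 3], while a relative threshold of 0.4 yields [0, 3, 5]. Every rule only looks at the current and earlier positions, so a boundary never depends on bytes that come later.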

BLT replaces fixed-vocabulary tokens with dynamically sized patches, avoiding trade-offs like larger embedding tables in token-based models. BLT requires patching decisions to be made independently of future bytes, so the patches of a prefix stay the same however the sequence continues. This differentiates patches from subword tokenization methods like BPE, where the tokenization of a prefix can change depending on the bytes that follow it.

BLT Architecture

BLT Modules

The Latent Global Transformer (LGT) is an autoregressive transformer that maps a sequence of latent input patch representations to output patch representations. It uses a block-causal attention mask, limiting attention to the current and preceding patches within the same document. The LGT is the most expensive component, consuming most of the FLOPs during both pre-training and inference. Controlling when it is invoked therefore lets BLT allocate compute dynamically, according to the complexity of the input and output.
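The block-causal mask described above can be illustrated with a small sketch. It assumes each patch carries a document id (the name doc_ids is hypothetical) and builds a boolean matrix in plain Python rather than in any particular attention library.

```python
from typing import List


def block_causal_mask(doc_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when patch i may attend to patch j.

    Patch i sees itself and earlier patches, but only those that belong
    to the same document.
    """
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[j] == doc_ids[i] for j in range(n)]
        for i in range(n)
    ]
```

With doc_ids = [0, 0, 1, 1, 1], the third patch opens a new document and therefore attends only to itself, even though two patches precede it in the packed sequence.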

Local Encoder

The Local Encoder Model in BLT is a lightweight transformer. Its main purpose is to map input byte sequences into expressive patch representations. First, the input bytes are embedded into vectors using a learnable matrix. These embeddings are then optionally augmented with hash n-gram embeddings, which provide contextual information by incorporating preceding byte sequences. A hash function (RollPolyHash) maps these byte n-grams to indices in a fixed-size embedding table, creating robust and expressive representations.
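A rough sketch of the hash n-gram lookup follows. It is written under several assumptions that the text above does not fix: the hash base, the n-gram sizes, the table size, and the choice to add the n-gram embeddings to the byte embedding are all illustrative, and the randomly initialised tables stand in for learnable parameters.

```python
import random
from typing import Dict, List, Sequence

HASH_BASE = 1_000_000_007  # assumed base; any large prime serves for the sketch


def roll_poly_hash(ngram: bytes, base: int = HASH_BASE) -> int:
    """Polynomial hash of a byte n-gram, evaluated with Horner's rule."""
    h = 0
    for b in ngram:
        h = h * base + b
    return h


def ngram_indices(data: bytes, i: int, sizes: Sequence[int], table_size: int) -> Dict[int, int]:
    """Map each n-gram that ends at byte i to a row of its hash table."""
    indices = {}
    for n in sizes:
        start = i - n + 1
        if start >= 0:  # skip n-grams that would reach before the sequence start
            indices[n] = roll_poly_hash(data[start : i + 1]) % table_size
    return indices


def make_table(rows: int, dim: int, rng: random.Random) -> List[List[float]]:
    """Randomly initialised stand-in for a learnable embedding table."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def embed_bytes(data: bytes, byte_table, hash_tables, table_size: int) -> List[List[float]]:
    """Byte embedding plus the hashed n-gram embeddings available at each position."""
    out = []
    for i, b in enumerate(data):
        vec = list(byte_table[b])
        for n, idx in ngram_indices(data, i, sorted(hash_tables), table_size).items():
            vec = [v + h for v, h in zip(vec, hash_tables[n][idx])]
        out.append(vec)
    return out
```

Because the table has a fixed size, different n-grams can collide on the same row. The same n-gram always maps to the same row, which is what lets the embedding carry information about the preceding bytes.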

To transform these byte embeddings into patch representations, the model employs a multi-headed cross-attention module inspired by the Perceiver architecture. Unlike the Perceiver, which uses a fixed set of latents, this design adapts to the variable-sized patches in BLT: the queries are patch representations, initialized by pooling the byte representations of each patch, while the keys and values are projections of the byte representations. Each query is restricted to attend only to the bytes within its own patch, ensuring patch-level locality.
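The patch-level locality of the encoder's cross-attention can be sketched as a mask plus a pooling step. The helper names are hypothetical, patch starts are assumed to be sorted and to begin at 0, and mean pooling is used only for illustration, since the text above does not say which pooling function is applied.

```python
from typing import List, Sequence


def patch_ids_from_starts(starts: Sequence[int], n_bytes: int) -> List[int]:
    """Label every byte with the index of the patch that contains it."""
    start_set = set(starts)
    ids, patch = [], -1
    for t in range(n_bytes):
        if t in start_set:
            patch += 1
        ids.append(patch)
    return ids


def encoder_cross_mask(patch_ids: Sequence[int]) -> List[List[bool]]:
    """mask[p][t] is True when patch query p may attend to byte t."""
    n_patches = max(patch_ids) + 1
    return [[pid == p for pid in patch_ids] for p in range(n_patches)]


def init_patch_queries(byte_vecs: Sequence[Sequence[float]], patch_ids: Sequence[int]) -> List[List[float]]:
    """Mean-pool the byte vectors of each patch to initialise its query."""
    n_patches = max(patch_ids) + 1
    dim = len(byte_vecs[0])
    sums = [[0.0] * dim for _ in range(n_patches)]
    counts = [0] * n_patches
    for vec, pid in zip(byte_vecs, patch_ids):
        counts[pid] += 1
        for d in range(dim):
            sums[pid][d] += vec[d]
    return [[s / counts[p] for s in sums[p]] for p in range(n_patches)]
```

For patch starts [0, 3, 5] over six bytes, the bytes are labelled [0, 0, 0, 1, 1, 2], so the second patch query sees only bytes 3 and 4.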

The Local Encoder Model uses a local block-causal attention mask, so each byte can only attend to a fixed window of preceding bytes and never across document boundaries. To enhance stability and efficiency, the cross-attention module applies pre-LayerNorm to queries, keys, and values, uses no positional embeddings, and is wrapped in a residual connection.
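The byte-level window mask can be sketched in the same style. Whether the window counts the current byte is not stated above, so this example assumes it does: a byte sees itself and the window - 1 bytes before it, as long as they share its document id.

```python
from typing import List


def local_window_mask(doc_ids: List[int], window: int) -> List[List[bool]]:
    """mask[i][j] is True when byte i may attend to byte j.

    Byte j must lie within the last `window` positions up to and including i,
    and must belong to the same document as byte i.
    """
    n = len(doc_ids)
    return [
        [0 <= i - j < window and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]
```

Unlike the patch-restricted cross-attention, nothing here refers to patch boundaries: the only limits are the window length and the document id.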

The Local Decoder is also a lightweight transformer-based model. It decodes global patch representations into raw bytes, using a sequence of cross-attention and transformer layers. The decoder operates autoregressively, predicting raw bytes as a function of previously decoded bytes, with its input derived from the hidden representations produced by the local encoder.

In the decoder’s cross-attention mechanism, the roles of queries and key/values are reversed compared to the encoder: byte representations act as queries, while patch representations serve as key/values.

Similar to the encoder, the decoder uses multi-headed attention, pre-LayerNorm, residual connections, and no positional embeddings.

Experiments

The model is pre-trained on the Llama 2 dataset (2 trillion tokens) and BLT-1T (a new dataset with 1 trillion tokens).

Scaling trends

BLT matches or outperforms BPE models in scaling trends, validating its effectiveness in compute-optimal regimes. Larger BLT patch sizes (6–8 bytes) improve performance and reduce inference FLOPs by up to 50%. While models with larger patch sizes start with lower performance (e.g., at 1B parameters), they surpass BPE models at larger scales (e.g., 7B parameters).

Comparison

The BLT-Entropy model outperforms the Llama 3 model on 4 out of 7 tasks while being trained on the same number of bytes.

BLT allows simultaneous increases in model size and patch size while maintaining constant training and inference FLOPs and data volume. BLT models exhibit better scaling trends than BPE-based models, especially beyond the compute-optimal regime. While BPE models perform better with smaller training budgets, BLT quickly surpasses them as compute budgets increase.
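The trade between patch size and model size can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes the common approximation of about 2 FLOPs per parameter per forward step, a global model that runs once per patch, and local models that run once per byte. The parameter counts are made-up round numbers, not figures from the paper.

```python
def flops_per_byte(global_params: float, local_params: float, avg_patch_size: float) -> float:
    """Approximate forward FLOPs per byte.

    Assumes roughly 2 FLOPs per parameter per step, a global model that runs
    once per patch, and local models that run once per byte.
    """
    return 2.0 * global_params / avg_patch_size + 2.0 * local_params


def max_global_params(budget_per_byte: float, local_params: float, avg_patch_size: float) -> float:
    """Largest global model that fits a per-byte FLOP budget at a given patch size."""
    return (budget_per_byte - 2.0 * local_params) * avg_patch_size / 2.0


# Doubling the average patch size leaves room for a global model twice as large
# at the same per-byte cost (hypothetical sizes).
small = flops_per_byte(global_params=1e9, local_params=1e8, avg_patch_size=4.0)
large = flops_per_byte(global_params=2e9, local_params=1e8, avg_patch_size=8.0)
```

Under these assumptions both configurations cost the same per byte, which is the sense in which patch size and model size can grow together at a fixed budget.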

Robustness

BLT demonstrates significant advantages over tokenizer-based models in robustness, character-level understanding, and low-resource language tasks due to its byte-level architecture:

  • Robustness to Noisy Data: BLT outperforms tokenizer-based models, including Llama 3.1, by an average of 8 points on noised versions of benchmark tasks such as HellaSwag. The noising strategies include random case changes, character repetition, and uppercase transformations.
  • Phoneme Mapping: In Grapheme-to-Phoneme tasks from the Phonology Bench, BLT surpasses Llama 3, showing its superior ability to process character-level representations and map them to phonemes.
  • Character-Level Understanding: On the CUTE benchmark, which tests composition, orthographic similarity, and sequence manipulation, BLT outperforms Llama 3 by over 25 points. It excels in character manipulation tasks, achieving near-perfect scores on spelling tasks, demonstrating that character-level information is harder to learn for BPE models.
  • Low-Resource Machine Translation: On the FLORES-101 benchmark, BLT surpasses Llama 3 in translating into English by 2 BLEU points and translating from English by 0.5 points. It performs especially well in low-resource languages.
Translation
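The noising strategies named in the robustness list (random case changes, character repetition, uppercasing) can be sketched as simple string transforms. The probabilities and repeat counts below are assumptions chosen for the example, not the settings used in the paper's evaluation.

```python
import random


def random_case(text: str, rng: random.Random) -> str:
    """Flip each character to upper or lower case with equal probability."""
    return "".join(c.upper() if rng.random() < 0.5 else c.lower() for c in text)


def repeat_chars(text: str, rng: random.Random, prob: float = 0.2, max_repeat: int = 4) -> str:
    """With probability prob, repeat a character between 2 and max_repeat times."""
    out = []
    for c in text:
        if rng.random() < prob:
            out.append(c * rng.randint(2, max_repeat))
        else:
            out.append(c)
    return "".join(out)


def to_upper(text: str) -> str:
    """Uppercase the whole input."""
    return text.upper()
```

Transforms like these change the byte sequence only slightly, but they can change a subword tokenization substantially, which is one way to read the robustness gap reported above.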

Limitations and possible improvements

Many experiments were conducted on models of up to 1B parameters, so the architectural choices may change as BLT scales to 8B parameters and beyond, which could bring further performance improvements.

Existing transformer libraries are optimized for tokenizer-based architectures. While BLT uses efficient components like FlexAttention, its implementation is not yet fully optimized and could benefit from further refinement.

BLT currently relies on a separately trained entropy model for patching. Learning patching in an end-to-end manner is identified as a promising direction for enhancing performance.
