Tag: attention
12 posts
Beyond Positional Bias: How DroPE Unlocks Zero-Shot Long Context in LLMs
A review of DroPE, a simple but counterintuitive method that extends LLM context length by dropping positional embedd...
Paper Review: NeoBERT: A Next-Generation BERT
A compact 250M-parameter bidirectional encoder that incorporates RoPE, SwiGLU, and modern pretraining to outperform m...
Paper Review: Titans: Learning to Memorize at Test Time
A new architecture that pairs attention with a learnable long-term memory module, scaling to 2M+ tokens and outperfor...
Paper Review: Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
BERT rebuilt with modern tricks — 2 trillion training tokens, 8192 context length, Flash Attention, and rotary embedd...
Paper Review: Differential Transformer
My review of the paper Differential Transformer
Paper Review: Masked Attention is All You Need for Graphs
My review of the paper Masked Attention is All You Need for Graphs
Paper Review: Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
My review of the paper Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
Paper Review: Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
My review of the paper Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Paper Review: DocLLM: A layout-aware generative language model for multimodal document understanding
My review of the paper DocLLM: A layout-aware generative language model for multimodal document understanding
Paper Review: Long-Short Transformer: Efficient Transformers for Language and Vision
My review of the paper Long-Short Transformer: Efficient Transformers for Language and Vision
Paper Review: CoAtNet: Marrying Convolution and Attention for All Data Sizes
My review of the paper CoAtNet: Marrying Convolution and Attention for All Data Sizes
Paper Review: Linformer: Self-Attention with Linear Complexity
My review of the paper Linformer: Self-Attention with Linear Complexity