Spotify has recently expanded its offerings to include audiobooks, presenting challenges in personalized recommendation due to the inability to skim audiobooks before purchase, data scarcity with the introduction of a new content type, and the need for a fast, scalable model. To overcome these obstacles, Spotify developed a novel recommendation system, 2T-HGNN, which combines a Heterogeneous Graph Neural Network (HGNN) with a Two Tower (2T) model.
By decoupling users from the HGNN graph and using a multi-link neighbor sampler, the complexity of the HGNN model is significantly reduced, ensuring low latency and scalability. Empirical evaluations with millions of users demonstrated a substantial improvement in personalized recommendations, resulting in a 46% increase in the rate of starting new audiobooks and a 23% increase in streaming rates. The model also positively impacted podcast recommendations, indicating broader applicability beyond audiobooks.
The graph connects audiobooks and podcasts as nodes based on user interactions. Node features are augmented by embeddings from titles and descriptions via multi-language Sentence-BERT, facilitating the HGNN’s learning of complex patterns from both content and user preferences.
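As an illustration of the text-feature step, here is a minimal sketch (not Spotify's code) of encoding titles and descriptions with a multilingual Sentence-BERT model; the specific checkpoint name is an assumption.

```python
# Hypothetical sketch: building text-based node features with multilingual SBERT.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed checkpoint

def node_text_features(titles, descriptions):
    # One embedding per node, from its concatenated title and description.
    texts = [f"{t}. {d}" for t, d in zip(titles, descriptions)]
    return encoder.encode(texts, normalize_embeddings=True)  # (num_nodes, dim)
```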
The model iteratively updates node features by first aggregating neighbor features based on their relationships, then combining these with the node’s original features across several layers. It normalizes node embeddings for training stability and search efficiency and extends the GraphSAGE framework to handle heterogeneous graphs. HGNN uses a contrastive loss function during training to enhance the similarity of connected node embeddings while distancing those of unconnected nodes, optimizing the network to produce meaningful embeddings reflective of the graph’s structure.
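A minimal sketch of the two ideas above — relation-aware neighbor aggregation in a GraphSAGE-style layer with L2-normalized outputs, plus a contrastive loss over connected/unconnected node pairs. This is an illustration under assumptions, not the production HGNN.

```python
import torch
import torch.nn.functional as F

class HeteroSAGELayer(torch.nn.Module):
    def __init__(self, dim, relations):
        super().__init__()
        self.rel_lin = torch.nn.ModuleDict({r: torch.nn.Linear(dim, dim) for r in relations})
        self.self_lin = torch.nn.Linear(dim, dim)

    def forward(self, x, neigh):  # neigh[r]: (num_nodes, num_sampled_neighbors, dim)
        msg = sum(self.rel_lin[r](n.mean(dim=1)) for r, n in neigh.items())  # per-relation mean aggregation
        h = F.relu(self.self_lin(x) + msg)            # combine with the node's own features
        return F.normalize(h, dim=-1)                 # L2-normalize for stability and search

def contrastive_loss(anchor, positive, negative, margin=0.5):
    # Pull connected nodes together, push unconnected nodes apart.
    pos = (anchor * positive).sum(-1)
    neg = (anchor * negative).sum(-1)
    return F.relu(margin - pos + neg).mean()
```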
To counteract the imbalance in the co-listening graph, which has more podcast-podcast and audiobook-podcast edges than audiobook-audiobook connections, a multi-link neighborhood sampler was developed. By undersampling the majority edge types and selecting equal numbers of audiobook-audiobook and audiobook-podcast connections, it ensures diverse and comprehensive training data coverage across epochs.
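A toy sketch of the balancing idea: undersample every edge type down to the size of the rarest one before each epoch. The paper's sampler operates per neighborhood; the names below are illustrative.

```python
import random

def balanced_edge_sample(edges_by_type, seed=0):
    """edges_by_type: dict like {"ab-ab": [...], "ab-pod": [...], "pod-pod": [...]}"""
    rng = random.Random(seed)
    n = min(len(e) for e in edges_by_type.values())   # size of the rarest edge type
    sample = []
    for etype, edges in edges_by_type.items():
        sample.extend(rng.sample(edges, n))           # undersample majority edge types
    rng.shuffle(sample)
    return sample
```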
The 2T-HGNN model uses a Two Tower structure to enhance user and audiobook representations, combining two deep neural networks, one for users and another for audiobooks. The user tower inputs include demographic information and historical interactions with music, audiobooks, and podcasts—the latter two being represented by averaged HGNN embeddings from recent interactions. Additionally, it incorporates streams and “weak signals” like follows and previews. The audiobook tower processes metadata such as language and genre, along with embeddings from titles and descriptions, and the specific HGNN embedding for each audiobook.
The model produces separate output vectors for users and audiobooks, optimizing a loss function that aligns user vectors closer to audiobooks they’ve engaged with while distancing them from unrelated audiobook vectors.
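One common way to implement this objective is an in-batch softmax, where each user's engaged audiobook is the positive and the other audiobooks in the batch act as negatives. The sketch below assumes that formulation; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def two_tower_loss(user_vecs, audiobook_vecs, temperature=0.05):
    u = F.normalize(user_vecs, dim=-1)
    a = F.normalize(audiobook_vecs, dim=-1)
    logits = u @ a.T / temperature                   # (batch, batch) similarity matrix
    labels = torch.arange(len(u), device=u.device)   # the positive pair sits on the diagonal
    return F.cross_entropy(logits, labels)
```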
The 2T-HGNN model generates user and audiobook vectors daily for personalized recommendations. Each day begins with training the HGNN model to update podcast and audiobook embeddings, which are then used to train the 2T model. After training, audiobook vectors are created, and a Nearest Neighbor index is built for real-time recommendations. For now, a brute-force search is used for the relatively small audiobook catalog, with plans to switch to an approximate k-NN index for efficiency as the catalog grows. User vectors are generated on-the-fly to ensure recommendations are up-to-date, especially for new users, with a latency target under 100 ms.
The HGNN can produce embeddings for new or unstreamed audiobooks using only their metadata, allowing for inductive inference.
HGNN models are implemented in PyTorch and optimized with Adam using a two-layer architecture. The 2T model, built in TensorFlow, includes three fully connected layers in each tower and uses demographic and interaction features for users, alongside metadata and LLM embeddings for audiobooks.
Training is done on a single machine with 16 Intel vCPUs and 128 GB of memory.
Ablations:
Audiobook recommendation:
Podcast recommendation:
The integration of audiobooks and podcasts into a single graph for the 2T-HGNN model has significantly enhanced podcast recommendations on an existing online platform that previously only featured podcasts. This approach has not only improved HR@10 by 7% but also remarkably increased coverage by 80% for both warmstart and coldstart users.
Production A/B Experiment:
An A/B test involving 11.5 million monthly active Spotify users compared the online performance of the 2T-HGNN model against the current production model and a 2T model for personalizing audiobook recommendations. Users were divided into three groups, each receiving recommendations from one of the models. Results demonstrated that the 2T-HGNN model notably improved both the rate at which new audiobooks were started and the overall audiobook streaming rate compared to the other models. The 2T model, while competitive, offered a lesser increase in new audiobook start rates and didn’t significantly affect streaming rates.
NaturalSpeech 3 is a new text-to-speech system that efficiently models intricate speech with disentangled subspaces using factorized diffusion models and a neural codec with factorized vector quantization. It generates attributes for each subspace based on prompts, leading to high-quality, natural speech generation. It outperforms current TTS systems in quality, similarity, prosody, and intelligibility and can be scaled up to 1 billion parameters and 200,000 hours of training data.
FACodec is designed to convert speech waveforms into distinct subspaces for content, prosody, timbre, and acoustic details, then reconstruct them into high-quality speech. It includes a speech encoder, a timbre extractor, three factorized vector quantizers for the different attributes, and a speech decoder. The encoder uses convolutional blocks to create a pre-quantization latent representation of the speech, which the timbre extractor, using a Transformer encoder, turns into a vector for timbre attributes. The other attributes are processed by the factorized vector quantizers into discrete tokens. The decoder, larger than the encoder, combines these representations, incorporating timbre through conditional layer normalization, to reconstruct the speech waveform with high quality.
Techniques for better speech attribute disentanglement in a factorized neural speech codec include:
The model generates speech attributes sequentially with discrete diffusion in a non-autoregressive model. Duration is generated first and acoustic details last, following the natural dependency order between attributes. The generative model receives attribute-specific prompts and applies discrete diffusion within each attribute’s subspace. Utilizing a codec to break down speech prompts into attribute prompts facilitates in-context learning. For example, for prosody generation, the authors concatenate the prosody prompt (without noise) and the target sequence (with noise) and then gradually remove noise from the target sequence conditioned on the prosody prompt.
The factorized diffusion model consists of a phoneme encoder and diffusion modules for each attribute, following the same discrete diffusion formulation but excluding explicit timbre generation, which is derived directly from prompts. Speech synthesis combines these attributes and decodes them via a codec decoder to produce the target speech.
NaturalSpeech 3 advances the NaturalSpeech TTS series by focusing on high-quality and diverse speech synthesis across single and multi-speaker, multi-style, and multi-lingual scenarios. It builds upon the foundational encoder/decoder and duration prediction components of its predecessors, distinguishing itself by introducing factorized diffusion models for generating discrete speech attributes. Unlike the flow-based models of NaturalSpeech and the latent diffusion models of NaturalSpeech 2, NaturalSpeech 3 uses the FACodec to disentangle speech into subspaces like prosody, content, acoustic details, and timbre, simplifying speech modeling and enhancing synthesis quality and diversity.
Evaluations of NaturalSpeech 3 show that it closely matches ground-truth recordings in quality and significantly surpasses baseline systems across various metrics. Speech quality assessed through CMOS tests shows NaturalSpeech 3 produces high-quality and natural speech. Speaker similarity evaluations, both objective and subjective, indicate it closely mimics speaker characteristics, outperforming other models. Prosody similarity assessments further demonstrate its superiority in capturing the nuances of speech prosody. Additionally, its robust zero-shot TTS capabilities are highlighted by lower word error rates compared to both ground truth and other baselines, indicating high intelligibility and superior performance.
Ablation studies:
Evaluations on the factorized diffusion model’s scaling effects show that both data and model scaling significantly enhance performance. With data scaling, even with just 1K hours of speech data, NaturalSpeech 3 shows effective speech generation, improving as data scales up to 200K hours, indicating clear benefits in speaker similarity and robustness. Model scaling, from 500M to 1B parameters, further boosts these metrics, suggesting larger models and more data contribute to better zero-shot TTS performance, with potential for even greater improvements with extended training and larger model sizes.
P. S. You can read my review of NaturalSpeech2 here.
Hawk and Griffin are new RNNs by DeepMind. Hawk surpasses Mamba’s performance using gated linear recurrences, while Griffin, a hybrid combining gated linear recurrences and local attention, matches Llama-2’s performance with significantly less training data. Griffin also excels in processing longer sequences than seen during training. Both models are as hardware efficient as Transformers but offer lower latency and higher throughput during inference. Griffin is scaled to 14 billion parameters.
The architecture has three main components: a residual block, an MLP block, and a temporal-mixing block. The residual and MLP blocks are consistent across models, while three types of temporal-mixing blocks are considered: global Multi-Query Attention (MQA), local MQA, and a novel recurrent block.
Residual Block: Inspired by pre-norm Transformers, it processes input sequences through several layers, applying RMSNorm for final activations and using a shared linear layer for token probabilities.
MLP Block: Uses a gated mechanism with an expansion factor, applying linear layers and a GeLU non-linearity, followed by element-wise multiplication and a final linear layer.
Temporal-Mixing Blocks:
The Real-Gated Linear Recurrent Unit (RG-LRU) features a recurrence gate and an input gate, both using the sigmoid function for non-linearity, and performs element-wise operations for stable recurrence. The RG-LRU uses a learnable parameter to ensure stable gating values between 0 and 1. The gates don’t depend on the recurrent state, which allows efficient computation.
The recurrence gate allows the model to discard the input and preserve all information from the previous history.
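A sequential, unoptimized sketch of an RG-LRU-style update consistent with the description above: sigmoid gates computed from the input only, a learnable decay parameter kept in (0, 1), and an element-wise recurrence. The constant `c` and the exact parameterization are assumptions; the real layer runs as a fused scan.

```python
import torch

class RGLRU(torch.nn.Module):
    def __init__(self, dim, c=8.0):
        super().__init__()
        self.c = c
        self.w_rec = torch.nn.Linear(dim, dim)            # recurrence gate
        self.w_in = torch.nn.Linear(dim, dim)             # input gate
        self.lam = torch.nn.Parameter(torch.randn(dim))   # learnable decay parameter

    def forward(self, x):                                 # x: (batch, time, dim)
        h = torch.zeros(x.shape[0], x.shape[2], device=x.device)
        outs = []
        for t in range(x.shape[1]):
            r = torch.sigmoid(self.w_rec(x[:, t]))        # gates depend only on x_t, not on h
            i = torch.sigmoid(self.w_in(x[:, t]))
            log_a = -self.c * torch.nn.functional.softplus(self.lam) * r
            a = torch.exp(log_a)                          # per-channel decay in (0, 1)
            h = a * h + torch.sqrt(1 - a**2) * (i * x[:, t])
            outs.append(h)
        return torch.stack(outs, dim=1)
```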
All three model families are trained across a range of scales from 100M to 14B parameters, adhering to the Chinchilla scaling laws and using the MassiveText dataset. All models show a linear scaling relationship between validation loss and training FLOPs. Griffin notably achieves a lower validation loss than a Transformer baseline across all FLOP budgets without using global attention layers, whereas Hawk shows slightly higher validation loss, which narrows with increased training budget.
For downstream task evaluation, models were trained for 300B tokens and compared against Mamba-3B and Llama-2, which were trained on significantly more tokens. Despite this, Hawk and Griffin demonstrated very strong performance, with Hawk outperforming Mamba-3B at the 3B scale and Griffin not only surpassing Mamba-3B but also being competitive with Llama-2 at the 7B and 14B scales, despite the vast difference in training data. Griffin also outperforms the MQA Transformer baseline, showing the effectiveness of these models in achieving high performance with fewer training tokens.
For large-scale training, the authors use Megatron-style sharding for MLP and MQA blocks and block-diagonal weights for RG-LRU gates to reduce cross-device communication. ZeRO parallelism and bfloat16 representations are used to manage memory consumption.
To address the computational challenge of the RG-LRU layer’s low FLOPs-to-byte ratio, the authors write a custom kernel in Pallas (JAX) with a linear scan, resulting in an almost 3x speedup.
Training speed comparisons across different model sizes and sequence lengths show that as sequence length increases, Griffin maintains consistent training time, contrasting with slower Transformer baseline times, especially at smaller model sizes. This efficiency is attributed to the computational scaling of linear layers versus the RG-LRU and attention mechanisms. However, for short sequences, Griffin trains slightly slower than the MQA baseline due to its slightly higher parameter and FLOP count.
Inference in LLMs involves two stages: “prefill” where the prompt is processed in parallel, leading to compute-bound operations similar in speed to those during training, and “decode” where tokens are generated auto-regressively, with recurrent models showcasing lower latency and higher throughput, especially at longer sequence lengths due to smaller key-value cache sizes compared to Transformers.
Latency and throughput are the main metrics for evaluating inference speed.
During decoding, both Transformer and recurrent models are memory bound, particularly when batch sizes are moderate. Recurrent models, with smaller recurrent state sizes compared to the KV cache of Transformers, have lower latency and allow for larger batch sizes, thus improving throughput. This difference becomes particularly notable for longer sequences.
In an inference performance comparison of 1B parameter models, Hawk and Griffin demonstrated superior latency and throughput compared to an MQA Transformer baseline, especially with long sequences. The lower latency of Hawk and Griffin becomes particularly evident with increased prefill lengths, highlighting the efficiency of linear recurrences and local attention mechanisms.
The authors evaluate Hawk and Griffin’s capacity to utilize longer contexts for improved predictions. Both models demonstrate enhanced next-token prediction with extended contexts, with Griffin showing notable extrapolation capability. Further exploration with models trained on 8k token sequences against those trained on 2k token sequences reveals that models adapted to longer contexts (Hawk-8k and Griffin-8k) perform better on longer sequences, showcasing their ability to learn from extended contexts. However, for shorter sequences, models trained on 2k tokens (Hawk-2k and Griffin-2k) slightly outperform, suggesting the importance of aligning training sequence length with the anticipated application needs of the model.
Hawk and Griffin’s capabilities in copying and retrieving tokens from context are explored through synthetic tasks and a practical phone number lookup task, comparing them against an MQA Transformer baseline. In the Selective Copying and Induction Heads tasks, Griffin matches the learning speed of Transformers and demonstrates superior extrapolation abilities on longer sequences, unlike the Transformer baseline, which struggles with extrapolation. Hawk, while slower in learning, shows exceptional extrapolation on the Induction Heads task.
In a real-world phonebook lookup task, pre-trained Hawk, Griffin, and the MQA Transformer models were tested for their ability to memorize and retrieve correct phone numbers. Hawk performs well on short phonebook lengths but struggles as length increases due to its fixed-size state. The Transformer baseline succeeds up to its training sequence length but fails beyond that. Griffin stands out by solving the task up to its local attention window size and extrapolating better to longer sequences, though its performance declines once the context exceeds this window.
This paper introduces a novel concept named Programmable Gradient Information (PGI) to address the issue of data loss in deep learning networks as data undergoes layer-by-layer feature extraction and spatial transformation. PGI aims to provide complete input information for calculating the objective function, ensuring reliable gradient information for network weight updates. Alongside PGI, the authors present a new lightweight network architecture called Generalized Efficient Layer Aggregation Network (GELAN), which is designed based on gradient path planning. GELAN leverages conventional convolution operators to achieve better parameter utilization compared to state-of-the-art methods that use depth-wise convolution. The effectiveness of GELAN and PGI is demonstrated through object detection tasks on the MS COCO dataset, showing that this approach allows train-from-scratch models to surpass the performance of models pre-trained on large datasets.
The Information Bottleneck Principle highlights the inevitable information loss data X experiences during transformation in deep neural networks; it illustrates that with each layer the data passes through, the likelihood of information loss increases, potentially leading to unreliable gradients and poor network convergence due to incomplete information about the prediction target. One proposed solution to mitigate this issue is to enlarge the model with more parameters, allowing for a more comprehensive data transformation and improving the chances of retaining sufficient information for accurate target mapping. However, this approach does not address the fundamental issue of unreliable gradients in very deep networks. The authors suggest exploring reversible functions as a potential solution to maintain information integrity throughout the network, aiming to achieve better convergence by preserving essential data through the network layers.
The concept of reversible functions means that a function and its inverse can transform data without loss of information. This principle is applied in architectures like PreAct ResNet, which ensures data is passed through layers without loss, aiding in deep network convergence but potentially compromising the depth’s advantage in solving complex problems. An analysis using the information bottleneck principle reveals that retaining critical information mapping data to targets is essential for training effectiveness, especially in lightweight models. The aim is to develop a new training method that generates reliable gradients for model updates and is feasible for both shallow and lightweight neural networks, addressing the core issue of significant information loss during data transformation.
PGI consists of three components: a main branch for inference without extra cost, an auxiliary reversible branch to counteract the effects of network depth, and multi-level auxiliary information to mitigate error accumulation in deep supervision and lightweight models with multiple prediction branches.
Auxiliary Reversible Branch helps maintain complete information flow from data to targets, mitigating the risk of false correlations due to incomplete features. However, integrating a reversible architecture with a main branch significantly increases inference costs. To counteract this, PGI treats the reversible branch as an expansion of deep supervision, enhancing the main branch’s ability to capture relevant information without the necessity of retaining complete original data. This approach allows for effective parameter learning and application to shallower networks. Importantly, the auxiliary reversible branch can be omitted during inference, preserving the network’s original inference efficiency.
Multi-level auxiliary information aims to address the information loss in deep supervision architectures, particularly in object detection tasks using multiple prediction branches and feature pyramids for detecting objects of various sizes. This component integrates a network between the feature pyramid layers and the main branch to merge gradient information from different prediction heads. This integration ensures that each feature pyramid receives comprehensive target object information, enabling the main branch to retain complete information for learning predictions across various targets. By aggregating gradient information containing data about all target objects, the main branch’s learning is not skewed towards specific object information, mitigating the issue of fragmented information in deep supervision.
YOLOv9 outperforms existing real-time object detectors across various model sizes, achieving higher accuracy with fewer parameters and reduced computational requirements. Specifically, YOLOv9 surpasses lightweight and medium models like YOLO MS in terms of parameter efficiency and accuracy, matches the performance of general models such as YOLOv7 AF with significantly fewer parameters and calculations, and exceeds the large model YOLOv8-X in both efficiency and accuracy.
Additionally, when compared to models using depth-wise convolution or ImageNet pretraining, YOLOv9 demonstrates superior parameter utilization and computational efficiency. The success of YOLOv9, particularly in deep models, is attributed to the PGI, which enhances the ability to retain and extract crucial information for data-target mapping, leading to performance improvements while maintaining lower parameter and computation demands.
LiRank is a large-scale ranking framework developed by LinkedIn that incorporates state-of-the-art modeling architectures and optimization techniques, including Residual DCN, Dense Gating, and Transformers. It introduces novel calibration methods and uses deep learning-based explore/exploit strategies, along with model training and compression techniques like quantization and vocabulary compression for efficient deployment.
Applied to LinkedIn’s Feed, Jobs Recommendations, and Ads CTR prediction, LiRank has led to significant performance improvements: a 0.5% increase in member sessions for the Feed, a 1.76% rise in qualified job applications, and a 4.3% boost in Ads CTR.
Feed Ranking Model. The primary Feed ranking model at LinkedIn uses a point-wise approach to predict the likelihood of various actions (like, comment, share, vote, click, and long dwell) for each member and candidate post pair. These predictions are linearly combined to calculate the final score of a post.
The model is built on a TensorFlow multi-task learning architecture with two main components: a click tower for click and long dwell probabilities, and a contribution tower for contribution actions and related predictions. Both towers use the same normalized dense features and multiple fully-connected layers, while sparse ID embedding features are converted into dense embeddings via lookup in specific embedding tables.
Ads CTR Model. The ads selection uses a click-through-rate prediction model to estimate the likelihood of members clicking on recommended ads, which then informs ad auction decisions. Advertisers can define what constitutes a chargeable click, with some counting social interactions like ‘likes’ or ‘comments’ and others only considering visits to the ad’s website. The CTR prediction model is an MTL model with three heads for different chargeability categories, grouping similar chargeable actions together. Each head uses independent interaction blocks, including MLP and DCNv2. The model incorporates traditional features from members and advertisers, as well as ID features to represent advertisers, campaigns, and advertisements.
Residual DCN. To enhance feature interaction capture, DCNv2 was utilized. To manage the high parameter count from DCNv2’s large feature input dimension, the authors replaced the weight matrix with two low-rank matrices and reduced input feature dimension through embedding-table look-ups, achieving nearly a 30% reduction in parameters. Additionally, the cross-network of DCNv2 was improved by introducing an attention mechanism with a low-rank approximation.
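A minimal sketch of a DCNv2-style cross layer with the low-rank factorization described above (the full weight matrix replaced by a down/up projection pair). It is illustrative rather than LinkedIn's exact Residual DCN, which also adds an attention mechanism.

```python
import torch

class LowRankCrossLayer(torch.nn.Module):
    def __init__(self, dim, rank):
        super().__init__()
        self.V = torch.nn.Linear(dim, rank, bias=False)   # down-projection
        self.U = torch.nn.Linear(rank, dim, bias=True)    # up-projection (carries the bias)

    def forward(self, x0, xl):
        # x_{l+1} = x0 * (U(V(xl)) + b) + xl  -- explicit feature crossing with a residual
        return x0 * self.U(self.V(xl)) + xl
```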
Isotonic Calibration Layer in DNN. Model calibration is essential for ensuring that estimated class probabilities accurately reflect real-world occurrences, critical for applications like Ads pricing based on CTR. Traditional calibration methods like Platt scaling and isotonic regression face challenges with deep neural networks due to parameter space constraints and scalability issues with multiple features. To overcome these, a customized isotonic regression layer, designed to integrate directly with deep neural networks, was developed. This layer, trainable within the network, uses a piece-wise fitting approach to bucketize predicted values and assigns trainable weights to each bucket, updated during training. The isotonic property is maintained through non-negative weights, ensured by the ReLU activation function. For calibration with multiple features, weights are combined with an embedding representation of calibration features, enhancing the model’s calibration capability.
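A minimal sketch of a trainable isotonic calibration layer in the spirit described above: raw predictions are bucketized, each bucket gets a ReLU-ed (non-negative) trainable weight, and the calibrated score sums bucket weights in proportion to how much of each bucket lies below the prediction, which is monotone by construction. Bucket count and the combination with calibration-feature embeddings are assumptions.

```python
import torch

class IsotonicCalibration(torch.nn.Module):
    def __init__(self, num_buckets=50):
        super().__init__()
        self.num_buckets = num_buckets
        self.weights = torch.nn.Parameter(torch.ones(num_buckets) / num_buckets)

    def forward(self, p):                                  # p: raw scores in [0, 1], shape (batch,)
        edges = torch.linspace(0, 1, self.num_buckets + 1, device=p.device)
        # fraction of each bucket covered by the prediction, clamped to [0, 1]
        frac = ((p.unsqueeze(-1) - edges[:-1]) * self.num_buckets).clamp(0, 1)
        w = torch.relu(self.weights)                       # non-negative weights -> isotonic output
        return (frac * w).sum(-1)
```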
Dense Gating and Large MLP. Personalized embeddings were added to global models to facilitate interactions among dense features, including multi-dimensional count-based and categorical features, by flattening these into a single dense vector and combining them with embeddings for processing in MLP layers. Increasing the width of MLP layers was found to enhance model performance by enabling more complex interactions, with the largest tested configuration being an MLP with 4 layers, each 3500 units wide, showing gains primarily when personalized embeddings were used. Additionally, a gating mechanism inspired by Gate Net was introduced to hidden layers to regulate information flow, enhancing learning with minimal additional computational cost and consistently improving online performance.
Incremental Training. Large-scale recommender systems need to frequently update to include new content like Ads, news feed updates, and job postings. The authors use incremental training, which not only initializes weights from the previous model but also adds an informative regularization term based on the difference between the current and previous model weights, adjusted by a forgetting factor. To further mitigate catastrophic forgetting, both the initial cold start model and the prior model are used for weight initialization and regularization, introducing a new parameter called cold weight to balance the influence of the initial and previous models.
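A minimal sketch of the regularization idea above: the new model is pulled toward a mixture of the cold-start weights and the previous model's weights, controlled by a "cold weight" and a forgetting-factor-style coefficient. The names and exact functional form are assumptions.

```python
import torch

def incremental_regularizer(model, cold_state, prev_state, cold_weight=0.2, lam=1e-4):
    """cold_state / prev_state: dicts of parameter tensors keyed by parameter name."""
    reg = 0.0
    for name, w in model.named_parameters():
        anchor = cold_weight * cold_state[name] + (1 - cold_weight) * prev_state[name]
        reg = reg + ((w - anchor) ** 2).sum()
    return lam / 2 * reg

# total_loss = task_loss + incremental_regularizer(model, cold_state, prev_state)
```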
Member History Modeling. To capture member interactions with content on the platform, a method involving historical interaction sequences for each member is used, where item embeddings are combined with action embeddings and the embedding of the item being evaluated. This combined data is processed by a two-layer Transformer-Encoder, with the max-pooling token serving as a feature in the ranking model. Additionally, the last five steps of the sequence are flattened, concatenated, and used as extra input features to enhance the model’s information. To minimize latency, experiments were conducted with shorter sequences and reduced dimensions of the feed-forward network within the Transformer. While longer sequences can increase relevance, the additional training and serving time does not justify their use.
Explore and Exploit. The authors predict values using the last layer’s weights and input, then apply Bayesian linear regression to obtain the posterior probability of the weights, which is fed into Thompson Sampling. This approach does not require independently training a model for the last layer’s representation but updates the posterior probability of weights incrementally after each offline training period.
Dwell Time Modeling. To better understand member behavior and preferences on LinkedIn, a ‘long dwell’ signal was introduced to detect passive content consumption, addressing the challenge of capturing passive but positive engagement. Direct or logarithmic prediction of dwell time was found unsuitable due to data volatility, and static thresholds for defining ‘long dwell’ lacked adaptability and consistency, potentially biasing towards content with inherently longer dwell times. To overcome these issues, a binary classifier was developed to predict whether the time spent on a post exceeds a certain percentile, with specific percentiles adjusted based on contextual features like ranking position, content type, and platform. This approach allows for dynamic adjustment of long-dwell thresholds, capturing evolving user preferences and reducing bias and noise.
To enhance the scalability of training large ranking models, several optimization techniques were used, resulting in significant reductions in training time:
Incremental training was applied to both Feed ranking and Ads CTR models, showing significant improvements in metrics and reductions in training time after tuning parameters.
For Feed ranking, an offline “replay” metric was used to compare models by estimating the online contribution rate (likes, comments, re-posts) through a pseudo-random ranking method. This method allowed for unbiased offline comparison of models, with various production modeling techniques like Isotonic calibration, low-rank DCNv2, and others leading to a 0.5% relative increase in member sessions.
In Jobs Recommendations, embedding dictionary compression and task-specific DCN layers were used without performance loss, achieving significant offline AUC lifts for Job Search and JYMBII models. This resulted in a 1.76% improvement in Qualified Applications in online A/B testing.
For Ads CTR, incremental improvements were made using techniques like ID embeddings, Quantization, and Isotonic calibration, among others, following a multilayer perceptron baseline model. These techniques led to a 4.3% relative improvement in CTR in online A/B tests.
Scaling up Feed Training Data Generation. To manage the increased volume of training data from scaling up to use 100% of sessions, two major optimizations were implemented. First, the data pipeline was adjusted to explode only post features and keys before joining with the labels dataset, and then adding session-level features in a subsequent join. This approach reduced the overall shuffle write size by 60%. Second, tuning Spark compression further reduced shuffle write size by 25%.
Model Convergence with DCNv2. Initial experiments with DCNv2 showed a high divergence rate. To stabilize training, the learning rate warm-up period was increased from 5% to 50% of training steps, which not only resolved instability issues but also enhanced offline relevance gains. Batch normalization was applied to numeric input features, and it was discovered that the model was under-fitting at the current number of training steps. However, increasing training steps was impractical for production. Instead, tripling the learning rate, given the extended warm-up period, effectively closed the gap in relevance metrics without extending training duration.
Additionally, optimization strategies varied across models. While Adam was effective in general, models with many sparse features performed better with AdaGrad. Learning rate warm-up and gradient clipping were particularly useful for larger batch sizes, improving model generalization. The practice of increasing the learning rate proportionally with batch size, without exceeding 60% of total training steps, was found to enhance generalization across different settings and mitigate generalization gaps at larger batch sizes.
Lag-Llama is a new foundation model designed for univariate probabilistic time series forecasting, using a decoder-only transformer architecture with lags as covariates. It is pretrained on a diverse corpus of time series data from various domains, showcasing exceptional zero-shot generalization capabilities. When fine-tuned on small subsets of new datasets, Lag-Llama achieves superior performance, surpassing previous deep learning methods and setting new benchmarks in time series forecasting.
In univariate time series modelling, the dataset comprises one or more time series, each sampled at discrete time points, with the goal of predicting future values. Instead of using the entire history of each time series for prediction, a fixed context window is used to learn an approximation of the distribution of the next values, incorporating covariates. Predictions are made through an autoregressive model, leveraging the chain rule of probability, and are conditioned on learned neural network parameters.
The tokenization process for Lag-Llama involves generating lagged features from prior time series values using specified lag indices that cover quarterly, monthly, weekly, daily, hourly, and second-level frequencies. These lag indices create a vector for each time value, where each element corresponds to the value at a specific lag. Date-time features across different frequencies, from second-of-minute to quarter-of-year, are integrated to provide supplementary information and help the model understand the frequency of the time series. The resulting token dimension equals the number of lag indices plus the number of date-time features. However, a limitation of this approach is the need for a context window at least as large as the largest lag used (by definition).
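A small sketch of the lag-feature construction: for each time step, collect past values at a fixed set of lag offsets, then append date-time features. The lag set below is illustrative, not the paper's exact list.

```python
import numpy as np

LAGS = [1, 2, 3, 7, 14, 24, 168, 720]   # illustrative: recent steps plus daily/weekly/monthly offsets

def lag_features(series, t):
    """series: 1-D array of past values; t: index to tokenize (requires t >= max(LAGS))."""
    return np.array([series[t - lag] for lag in LAGS])

# token = concatenate([lag_features(y, t), datetime_features(t)])
```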
Lag-Llama uses a decoder-only transformer architecture, based on LLaMA, designed for univariate time series forecasting. The model processes sequences by first tokenizing them along with covariates into a sequence of tokens, which are then mapped to a hidden dimension suitable for the attention module. It incorporates pre-normalization techniques like RMSNorm and Rotary Positional Encoding to enhance its attention mechanism, aligning with the practices of the LLaMA architecture. The transformer layers, which are causally masked to prevent future information leakage, output the parameters of the forecast distribution for the next time step. The model’s training objective is to minimize the negative log-likelihood of this predicted distribution across all time steps.
For predictions, Lag-Llama takes a feature vector from a time series, generating a distribution for the next time point through greedy autoregressive decoding. This process allows for the simulation of multiple future trajectories up to a predefined prediction horizon. From these simulations, uncertainty intervals can be calculated, aiding in downstream decision-making and evaluation against held-out data.
The final component of Lag-Llama is the distribution head, a layer that translates the model’s learned features into parameters of a specific probability distribution. In their experiments, the creators adopted a Student’s t-distribution, configuring the distribution head to output its three parameters: degrees of freedom, mean, and scale, with special adjustments to maintain the positivity of these parameters.
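A minimal sketch of such a distribution head: a linear layer emits three raw values per time step, with softplus keeping degrees of freedom and scale positive (the `+ 2.0` offset on the degrees of freedom is an assumption, not from the paper).

```python
import torch
import torch.nn.functional as F

class StudentTHead(torch.nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, 3)

    def forward(self, h):
        raw_df, loc, raw_scale = self.proj(h).unbind(-1)
        df = 2.0 + F.softplus(raw_df)            # keep degrees of freedom positive
        scale = F.softplus(raw_scale) + 1e-6     # keep scale strictly positive
        return torch.distributions.StudentT(df, loc, scale)

# loss = -head(h).log_prob(target).mean()       # negative log-likelihood training objective
```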
To handle the diversity in numerical magnitudes across different time series datasets during pretraining, Lag-Llama employs a scaling heuristic. For each univariate window, it calculates the mean and variance of the time series within the window and standardizes the time series data by subtracting the mean and dividing by the variance. Additionally, the mean and variance are included as time-independent covariates (summary statistics) alongside each token to inform the model about the input data’s statistical properties.
Furthermore, the model adopts a Robust Standardization: normalizing the time series by subtracting the median and scaling by the Interquartile Range, making the preprocessing step more robust to extreme values in the data.
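A small sketch of the robust standardization step described above: center each context window by its median and scale by its interquartile range.

```python
import numpy as np

def robust_standardize(window, eps=1e-8):
    median = np.median(window)
    q1, q3 = np.percentile(window, [25, 75])
    iqr = max(q3 - q1, eps)                  # guard against constant windows
    return (window - median) / iqr, (median, iqr)
```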
During training, the authors use stratified sampling and the augmentation techniques Freq-Mix and Freq-Mask.
Lag-Llama demonstrates strong performance in time series forecasting, comparing favorably with supervised baselines across unseen datasets in both zero-shot and fine-tuned settings. In the zero-shot scenario, it matches the performance of all baselines with an average rank of 6.714. Fine-tuning further enhances its capabilities, leading to state-of-the-art performance in three of the used datasets and significantly improved performance in others, achieving the best average rank of 2.786. This performance underscores Lag-Llama’s potential as a go-to method for diverse datasets without prior data knowledge, fulfilling a foundational model’s key requirement.
The experiments suggest that at scale, decoder-only transformers may outperform other architectures in time series forecasting, mirroring observations from the NLP community regarding the impact of inductive bias.
Lag-Llama was also evaluated on its ability to adapt to different amounts of historical data, with experiments conducted using only the last 20%, 40%, 60%, and 80% of the data from training sets. Lag-Llama was fine-tuned and consistently achieved the best average rank across all levels of available history, showcasing its strong adaptation capabilities. As the volume of available history increased, so did Lag-Llama’s performance, widening the performance gap between it and baseline models.
However, it’s noted that in the exchange-rate dataset, which represented a new domain and frequency not seen during pretraining, Lag-Llama was frequently outperformed by the TFT model, suggesting that Lag-Llama benefits from more historical data in scenarios where the dataset is significantly different from the pretraining corpus.
Lumiere is a novel text-to-video diffusion model that stands out for its ability to synthesize videos with realistic, diverse, and coherent motion. It differs from traditional models by using a Space-Time U-Net architecture, generating videos in a single pass instead of the usual method of creating keyframes and then adding details. This approach, which includes spatial and temporal down- and up-sampling, helps maintain global temporal consistency. Lumiere uses a pre-trained text-to-image diffusion model, enabling it to produce full-frame-rate, low-resolution videos at multiple space-time scales. It achieves state-of-the-art results in text-to-video generation and is adaptable for various content creation and video editing tasks, such as image-to-video conversion, video inpainting, and stylized video generation.
Lumiere uses Diffusion Probabilistic Models for video generation, which approximate a data distribution through denoising steps, starting from Gaussian noise and gradually refining it. Lumiere’s framework includes a base model for generating low-resolution video clips and a spatial super-resolution model for upscaling to high resolution.
The Space-Time U-Net (STUnet) in Lumiere downsamples both spatially and temporally, focusing computation on a compact space-time representation. This architecture, inspired by biomedical data processing techniques, interleaves temporal blocks with spatial resizing modules and incorporates temporal convolutions and attention. The temporal attention is used at the coarsest resolution to manage computational demands.
For spatial super-resolution, Lumiere uses Multidiffusion to handle memory constraints and avoid temporal artifacts. This involves splitting the video into overlapping segments, processing each with SSR, and then combining them. The combination is optimized to minimize differences between the segments and their SSR predictions, ensuring smooth transitions in the final high-resolution video.
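A simplified sketch of the overlap-and-blend step: the video is split into overlapping temporal segments, each segment is upscaled, and overlapping frames are averaged. MultiDiffusion optimizes this combination during denoising; the code below only shows the overlap-weighted merge.

```python
import numpy as np

def blend_segments(segments, starts, total_frames):
    """segments: list of (T, H, W, C) arrays; starts: first frame index of each segment."""
    out = np.zeros((total_frames,) + segments[0].shape[1:])
    weight = np.zeros((total_frames, 1, 1, 1))
    for seg, s in zip(segments, starts):
        out[s:s + len(seg)] += seg
        weight[s:s + len(seg)] += 1.0
    return out / np.maximum(weight, 1e-8)    # average frames where segments overlap
```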
For stylized video generation, Lumiere uses a technique inspired by GAN-based interpolation, blending the fine-tuned T2I weights with the original weights. Different styles result in distinct motion characteristics in the generated videos, with examples including watercolor painting for realistic motion and line drawing or cartoon styles for unique, non-realistic motion.
Lumiere is extended to generate videos based on additional input signals, such as a noisy video, a conditioning video (or image), or a binary mask. The applications include image-to-video generation, inpainting, and cinemagraphs.
In the qualitative assessment, Lumiere outshines its competitors by producing 5-second videos with greater motion magnitude, better temporal consistency, and overall quality. Gen2 and Pika, while high in per-frame visual quality, tend to produce nearly static videos. ImagenVideo, AnimateDiff, and ZeroScope, despite showing noticeable motion, suffer from visual artifacts and generate shorter videos.
In quantitative terms, Lumiere is evaluated on the UCF101 dataset for zero-shot text-to-video generation and achieves competitive Frechet Video Distance and Inception Score metrics. However, these metrics have their limitations in accurately reflecting human perception and capturing long-term motion.
Further validation comes from a user study conducted using the Two-alternative Forced Choice protocol on Amazon Mechanical Turk. Participants compared Lumiere’s videos with those from baseline methods, focusing on visual quality, motion, and alignment with text prompts. Lumiere was consistently preferred over the baselines, indicating its superior performance in aligning with text prompts and in overall video quality.
AIM is a collection of vision models pre-trained with an autoregressive objective inspired by LLMs and demonstrating similar scaling properties. The authors’ findings include the scaling of visual feature performance with model capacity and data quantity and a correlation between the objective function value and model performance in downstream tasks. 7B AIM pre-trained on 2 billion images achieves 84.0% on ImageNet1k with a frozen trunk without showing performance saturation, indicating a potential new frontier in large-scale vision model training. AIM’s pre-training process mirrors that of LLMs and doesn’t require unique image-specific strategies for stable scaling.
The models are pre-trained on the DFN dataset, a subset of 12.8B image-text pairs from Common Crawl, refined to 2B high-quality images by removing inappropriate content, blurring faces, deduplicating, and ranking based on image-caption alignment. No content-based curation is involved, allowing the potential use of larger, less aligned image collections. For pre-training, a blend of DFN-2B (80%) and ImageNet-1k (20%) is used, emulating the LLM practice of oversampling high-quality data, creating the DFN-2B+ dataset.
The training objective uses an autoregressive model on image patches, treating each image as a sequence of non-overlapping patches. The probability of an image is the product of conditional probabilities of each patch, given previous patches. The training loss is the negative log-likelihood of these probabilities, aiming to learn the true image distribution. The basic loss is a normalized pixel-level regression, minimizing the L2 distance between predicted and actual patch values. A cross-entropy loss variant with discrete tokens is also tested, but pixel-wise loss yields stronger features.
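A minimal sketch of the pixel-regression objective: the prediction at patch position k is trained against the (per-patch normalized) ground-truth patch at position k+1, with a mean L2 loss.

```python
import torch

def aim_loss(predicted_patches, target_patches):
    """Both tensors: (batch, num_patches, patch_dim); targets are per-patch normalized pixels."""
    pred = predicted_patches[:, :-1]   # prediction at position k ...
    tgt = target_patches[:, 1:]        # ... is scored against the patch at position k+1
    return ((pred - tgt) ** 2).mean()
```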
AIM uses ViT architecture, prioritizing width over depth for scaling. It employs causal masks during pre-training for autoregressive modeling of image patches, ensuring efficient computation. To bridge the gap between autoregressive pre-training and bidirectional attention in downstream tasks, a Prefix Transformer approach is introduced, treating initial patches as context. Simple MLP prediction heads are used during pre-training to maintain feature generality for downstream tasks. AIM doesn’t require optimization stability mechanisms and uses sinusoidal positional embeddings and a standard MLP design.
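A small sketch of the prefix attention mask implied above: the first `prefix_len` patches attend bidirectionally (matching downstream bidirectional use), while the remaining patches stay causal.

```python
import torch

def prefix_causal_mask(seq_len, prefix_len):
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    mask[:, :prefix_len] = True        # every position may attend to the prefix patches
    return mask                        # True = attention allowed

# prefix_causal_mask(6, 2) -> bidirectional within the first 2 patches, causal afterwards
```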
For downstream adaptation, AIM focuses on scenarios with fixed model weights, only training a classification head to minimize adaptation costs and overfitting risks. Unlike contrastive learning, AIM’s loss is computed per patch without global image descriptors. It uses attention pooling over patch features to create a global descriptor, enhancing performance with minimal parameter increase. This method (Attentive Probe) maintains the benefits of linear probing, such as low parameter count and reduced overfitting risk, while offering improved performance.
AIM demonstrates strong performance among generative methods, outperforming BEiT and MAE models of similar or larger capacities, even when the latter is trained on a large private dataset. It also shows competitive results against joint embedding methods such as DINO, iBOT, and DINOv2. Although AIM is outperformed by DINOv2, which uses higher-resolution inputs and various optimization techniques, AIM’s pre-training is simpler and more scalable in terms of parameters and data, leading to consistent improvements.
Interestingly, higher-quality features can be extracted from shallower layers of the model rather than the last layer, suggesting that the generative nature of AIM’s pre-training objective allows for the distribution of semantically rich features across different layers, not just concentrating them around the last layer.
LoRA can further improve the performance.
Ferret is a new Multimodal Large Language Model adept at understanding and linking detailed descriptions to specific areas within images, regardless of their shape or size. It employs a novel hybrid region representation combining discrete coordinates with continuous features for accurate image region representation. A spatial-aware visual sampler enables it to handle various region inputs, including points, bounding boxes, and free-form shapes. Ferret was trained using GRIT, a comprehensive dataset with over 1.1 million samples, including 95,000 hard negative examples to enhance robustness. It excels in traditional image referring and grounding tasks, outperforms existing MLLMs in region-based multimodal communication, and shows improved capabilities in detailed image description and reduced object hallucination.
Usually, there are three formats for referencing image regions: point, box, and free-form shapes. Points and boxes are represented by simple coordinates, but free-form shapes, which include various types like scribbles and polygons, are more complex. To effectively represent these shapes, the authors propose a hybrid region representation, combining discrete coordinates with continuous visual features: a 2D binary mask is created for the region and visual features are extracted using a spatial-aware visual sampler. Points are represented by their coordinates and a fixed-radius circle, while boxes and free-form shapes use corner coordinates and a feature extracted from the defined region.
Ferret consists of three main components: an image encoder, a spatial-aware visual sampler, and an LLM that jointly models image, text, and region features. Referred regions are written directly into the text as coordinates followed by a special token carrying the continuous region feature (e.g., a cat [100, 50, 200, 300] ⟨SPE⟩), which can be combined with ordinary text to form complete sentences.

GRIT has ~1.1 million multimodal dialogues for training the model. It combines three types of data: public datasets converted into instruction-following format (individual objects, object relationships, and region descriptions), instruction-tuning dialogues generated with GPT assistance, and hard negative samples for robustness.
Training takes roughly 5 and 2.5 days on 8 A100 GPUs for Ferret-13B and Ferret-7B, respectively.
Input Referring: In this task, the model classifies objects in a specified image region, presented as a binary-choice question. The evaluation uses the LVIS dataset, covering over 1000 object categories, and tests three types of referring: point, box, and free-form shape. Ferret outperforms previous models in all referring types.
Output Grounding: The model’s grounding capability is evaluated in two sub-tasks: visual grounding and grounded captioning. Visual grounding involves grounding language queries into image regions. Grounded captioning requires generating a caption and grounding all noun phrases to image regions. Ferret demonstrates outstanding performance in these tasks, achieving state-of-the-art results.
Ferret-Bench: Multimodal Chatting with Referring and Grounding: To assess Ferret’s practical application in multimodal chatting, a new benchmark called Ferret-Bench is introduced. It includes three types of region-based questions: Referring Description, Referring Reasoning, and Grounding in Conversation. These tasks test the model’s ability to describe, reason, and accurately ground objects or regions in an image. Ferret significantly outperforms previous models in these tasks, showcasing its strong spatial understanding and reasoning capabilities.
DocLLM is a new extension to LLMs designed for processing visual documents such as forms and invoices. It uniquely uses bounding box information instead of image encoders to understand document layouts. The model modifies the attention mechanism in transformers to separately handle text and spatial information. It is pre-trained to fill in text segments, aiding in managing diverse layouts and content in visual documents. After pre-training, DocLLM is fine-tuned on a large dataset for four core document intelligence tasks. It outperforms state-of-the-art LLMs in most tested datasets and shows strong generalization capabilities to new datasets.
DocLLM represents inputs as pairs of text tokens and their corresponding bounding boxes. It encodes bounding boxes into separate hidden vectors and decomposes the attention mechanism into four scores: text-to-text, text-to-spatial, spatial-to-text, and spatial-to-spatial, using projection matrices and hyperparameters to balance the importance of each score. The hidden vectors for spatial information are reused across layers, reducing the number of extra parameters compared to image-based models. By not simply adding spatial information to text (which would couple layout with semantics), DocLLM maintains a disentangled representation, enabling selective focus on either modality when necessary.
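A minimal sketch of the disentangled attention scores described above: separate query/key projections for text and spatial (bounding-box) hidden states, with the four cross terms mixed by lambda hyperparameters. Heads, the value path, and masking are omitted; this is an illustration, not the paper's implementation.

```python
import torch

class DisentangledScores(torch.nn.Module):
    def __init__(self, dim, lam_ts=1.0, lam_st=1.0, lam_ss=1.0):
        super().__init__()
        self.qt, self.kt = torch.nn.Linear(dim, dim), torch.nn.Linear(dim, dim)  # text projections
        self.qs, self.ks = torch.nn.Linear(dim, dim), torch.nn.Linear(dim, dim)  # spatial projections
        self.lam = (lam_ts, lam_st, lam_ss)

    def forward(self, text_h, spatial_h):   # both: (batch, seq, dim)
        Qt, Kt = self.qt(text_h), self.kt(text_h)
        Qs, Ks = self.qs(spatial_h), self.ks(spatial_h)
        lam_ts, lam_st, lam_ss = self.lam
        scores = (Qt @ Kt.transpose(-2, -1)                 # text-to-text
                  + lam_ts * Qt @ Ks.transpose(-2, -1)      # text-to-spatial
                  + lam_st * Qs @ Kt.transpose(-2, -1)      # spatial-to-text
                  + lam_ss * Qs @ Ks.transpose(-2, -1))     # spatial-to-spatial
        return scores / (text_h.shape[-1] ** 0.5)
```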
DocLLM is pre-trained in a self-supervised manner on a large dataset of unlabeled documents. Instead of focusing on individual tokens, DocLLM considers larger segments or blocks of related tokens, such as text blocks or linear sequences. Additionally, DocLLM employs a text infilling objective, where predictions are based on both the preceding and following tokens rather than just the preceding ones. This approach is particularly effective for handling OCR noise, misalignments, and relationships between different document fields.
During pre-training, text blocks identified from OCR information are randomly masked and shuffled for reconstruction. This block infilling is done autoregressively, with special tokens indicating the start and end of blocks. This method is used only in pre-training and not in subsequent fine-tuning or downstream tasks. The model minimizes a cross-entropy loss.
DocLLM is fine-tuned using instruction-based methods on 16 datasets covering four DocAI tasks: Visual Question Answering (VQA), Natural Language Inference (NLI), Key Information Extraction (KIE), and Document Classification (CLS). Different templates are used for each task, for example:

- VQA: "{document} What is the deadline for scientific abstract submission for ACOG - 51st annual clinical meeting?"
- NLI: "{document} \"The UN commission on Korea include 2 Australians.\", Yes or No?"
- KIE: "{document} What is the value for the \"charity number\"?"
- CLS: "{document} What type of document is this? Possible answers: [budget, form, file folder, questionnaire]."
The models are trained on 24 GB A10G GPUs.
In the SDDS (Same Datasets, Different Splits) setting, DocLLM-7B outperforms other models in 12 out of 16 datasets, including GPT-4 and Llama2 in zero-shot settings. In particular, it excels in layout-intensive tasks like KIE and CLS. However, in the other two tasks, it is outperformed by GPT-4, likely due to GPT-4’s better reasoning and abstraction capabilities.
In the STDD (Same Tasks, Different Datasets), DocLLM surpasses Llama2 in four out of five datasets and achieves the best scores in two, especially in the KIE task. However, DocLLM’s classification accuracy is lower, possibly due to its training on only one classification dataset, which might limit its generalization to new datasets.
Disentangled Spatial Attention: Focusing solely on spatial-to-spatial interaction led to the highest accuracy in understanding documents with rich layouts. This finding emphasizes the importance of incorporating spatial features in document analysis.
Autoregressive Block Infilling: The block infilling approach with spatial modality demonstrated the best performance, underscoring the value of spatial information in the learning process.
Prefix Decoder vs. Causal Decoder: The authors compared a prefix decoder, which allows bidirectional visibility of the whole document, with a conventional causal decoder. The experiments showed only marginal differences between the two decoders across various configurations, with the causal decoder slightly outperforming the prefix decoder.