Spotify has recently expanded its offerings to include audiobooks, presenting challenges in personalized recommendation due to the inability to skim audiobooks before purchase, data scarcity with the introduction of a new content type, and the need for a fast, scalable model. To overcome these obstacles, Spotify developed a novel recommendation system, 2T-HGNN, which combines a Heterogeneous Graph Neural Network (HGNN) with a Two Tower (2T) model.
By decoupling users from the HGNN graph and using a multi-link neighbor sampler, the complexity of the HGNN model is significantly reduced, ensuring low latency and scalability. Empirical evaluations with millions of users demonstrated a substantial improvement in personalized recommendations, resulting in a 46% increase in the rate of starting new audiobooks and a 23% increase in streaming rates. The model also positively impacted podcast recommendations, indicating broader applicability beyond audiobooks.
The graph connects audiobooks and podcasts as nodes based on user interactions. Node features are augmented by embeddings from titles and descriptions via multi-language Sentence-BERT, facilitating the HGNN’s learning of complex patterns from both content and user preferences.
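As an illustration of the text-feature step, here is a minimal sketch (not Spotify's code) of encoding titles and descriptions with a multilingual Sentence-BERT model; the specific checkpoint name is an assumption.

```python
# Hypothetical sketch: building text-based node features with multilingual SBERT.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed checkpoint

def node_text_features(titles, descriptions):
    # One embedding per node, from its concatenated title and description.
    texts = [f"{t}. {d}" for t, d in zip(titles, descriptions)]
    return encoder.encode(texts, normalize_embeddings=True)  # (num_nodes, dim)
```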
The model iteratively updates node features by first aggregating neighbor features based on their relationships, then combining these with the node’s original features across several layers. It normalizes node embeddings for training stability and search efficiency and extends the GraphSAGE framework to handle heterogeneous graphs. HGNN uses a contrastive loss function during training to enhance the similarity of connected node embeddings while distancing those of unconnected nodes, optimizing the network to produce meaningful embeddings reflective of the graph’s structure.
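A minimal sketch of the two ideas above — relation-aware neighbor aggregation in a GraphSAGE-style layer with L2-normalized outputs, plus a contrastive loss over connected/unconnected node pairs. This is an illustration under assumptions, not the production HGNN.

```python
import torch
import torch.nn.functional as F

class HeteroSAGELayer(torch.nn.Module):
    def __init__(self, dim, relations):
        super().__init__()
        self.rel_lin = torch.nn.ModuleDict({r: torch.nn.Linear(dim, dim) for r in relations})
        self.self_lin = torch.nn.Linear(dim, dim)

    def forward(self, x, neigh):  # neigh[r]: (num_nodes, num_sampled_neighbors, dim)
        msg = sum(self.rel_lin[r](n.mean(dim=1)) for r, n in neigh.items())  # per-relation mean aggregation
        h = F.relu(self.self_lin(x) + msg)            # combine with the node's own features
        return F.normalize(h, dim=-1)                 # L2-normalize for stability and search

def contrastive_loss(anchor, positive, negative, margin=0.5):
    # Pull connected nodes together, push unconnected nodes apart.
    pos = (anchor * positive).sum(-1)
    neg = (anchor * negative).sum(-1)
    return F.relu(margin - pos + neg).mean()
```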
To counteract the imbalance in the co-listening graph, which has more podcast-podcast and audiobook-podcast edges than audiobook-audiobook connections, a multi-link neighborhood sampler was developed. By undersampling the majority edge types and selecting equal numbers of audiobook-audiobook and audiobook-podcast connections, it ensures diverse and comprehensive training data coverage across epochs.
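A toy sketch of the balancing idea: undersample every edge type down to the size of the rarest one before each epoch. The paper's sampler operates per neighborhood; the names below are illustrative.

```python
import random

def balanced_edge_sample(edges_by_type, seed=0):
    """edges_by_type: dict like {"ab-ab": [...], "ab-pod": [...], "pod-pod": [...]}"""
    rng = random.Random(seed)
    n = min(len(e) for e in edges_by_type.values())   # size of the rarest edge type
    sample = []
    for etype, edges in edges_by_type.items():
        sample.extend(rng.sample(edges, n))           # undersample majority edge types
    rng.shuffle(sample)
    return sample
```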
The 2T-HGNN model uses a Two Tower structure to enhance user and audiobook representations, combining two deep neural networks, one for users and another for audiobooks. The user tower inputs include demographic information and historical interactions with music, audiobooks, and podcasts—the latter two being represented by averaged HGNN embeddings from recent interactions. Additionally, it incorporates streams and “weak signals” like follows and previews. The audiobook tower processes metadata such as language and genre, along with embeddings from titles and descriptions, and the specific HGNN embedding for each audiobook.
The model produces separate output vectors for users and audiobooks, optimizing a loss function that aligns user vectors closer to audiobooks they’ve engaged with while distancing them from unrelated audiobook vectors.
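One common way to implement this objective is an in-batch softmax, where each user's engaged audiobook is the positive and the other audiobooks in the batch act as negatives. The sketch below assumes that formulation; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def two_tower_loss(user_vecs, audiobook_vecs, temperature=0.05):
    u = F.normalize(user_vecs, dim=-1)
    a = F.normalize(audiobook_vecs, dim=-1)
    logits = u @ a.T / temperature                   # (batch, batch) similarity matrix
    labels = torch.arange(len(u), device=u.device)   # the positive pair sits on the diagonal
    return F.cross_entropy(logits, labels)
```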
The 2T-HGNN model generates user and audiobook vectors daily for personalized recommendations. Each day begins with training the HGNN model to update podcast and audiobook embeddings, which are then used to train the 2T model. After training, audiobook vectors are created, and a Nearest Neighbor index is built for real-time recommendations. For now, a brute-force search is used for the relatively small audiobook catalog, with plans to switch to an approximate k-NN index for efficiency as the catalog grows. User vectors are generated on-the-fly to ensure recommendations are up-to-date, especially for new users, with a latency target under 100 ms.
The HGNN can produce embeddings for new or unstreamed audiobooks using only their metadata, allowing for inductive inference.
HGNN models are implemented in PyTorch and optimized with Adam using a two-layer architecture. The 2T model, built in TensorFlow, includes three fully connected layers in each tower and uses demographic and interaction features for users, alongside metadata and LLM embeddings for audiobooks.
Training is done on a single machine with 16 Intel vCPUs and 128 GB of memory.
Ablations:
Audiobook recommendation:
Podcast recommendation:
The integration of audiobooks and podcasts into a single graph for the 2T-HGNN model has significantly enhanced podcast recommendations on an existing online platform that previously only featured podcasts. This approach has not only improved HR@10 by 7% but also remarkably increased coverage by 80% for both warmstart and coldstart users.
Production A/B Experiment:
An A/B test involving 11.5 million monthly active Spotify users compared the online performance of the 2T-HGNN model against the current production model and a 2T model for personalizing audiobook recommendations. Users were divided into three groups, each receiving recommendations from one of the models. Results demonstrated that the 2T-HGNN model notably improved both the rate at which new audiobooks were started and the overall audiobook streaming rate compared to the other models. The 2T model, while competitive, offered a lesser increase in new audiobook start rates and didn’t significantly affect streaming rates.
NaturalSpeech 3 is a new text-to-speech system that efficiently models intricate speech with disentangled subspaces using factorized diffusion models and a neural codec with factorized vector quantization. It generates attributes for each subspace based on prompts, leading to high-quality, natural speech generation. It outperforms current TTS systems in quality, similarity, prosody, and intelligibility and can be scaled up to 1 billion parameters and 200,000 hours of training data.
FACodec is designed to convert speech waveforms into distinct subspaces for content, prosody, timbre, and acoustic details, then reconstruct them into high-quality speech. It includes a speech encoder, a timbre extractor, three factorized vector quantizers for the different attributes, and a speech decoder. The encoder uses convolutional blocks to create a pre-quantization latent representation of the speech, which the timbre extractor, using a Transformer encoder, turns into a vector for timbre attributes. The other attributes are processed by the factorized vector quantizers into discrete tokens. The decoder, larger than the encoder, combines these representations, incorporating timbre through conditional layer normalization, to reconstruct the speech waveform with high quality.
Techniques for better speech attribute disentanglement in a factorized neural speech codec include:
The model generates speech attributes sequentially with discrete diffusion in a non-autoregressive model. Duration is generated first and acoustic details last, following the natural dependency order between attributes. The generative model receives attribute-specific prompts and applies discrete diffusion within each attribute’s subspace. Utilizing a codec to break down speech prompts into attribute prompts facilitates in-context learning. For example, for prosody generation, the authors concatenate the prosody prompt (without noise) and the target sequence (with noise) and then gradually remove noise from the target sequence conditioned on the prosody prompt.
The factorized diffusion model consists of a phoneme encoder and diffusion modules for each attribute, following the same discrete diffusion formulation but excluding explicit timbre generation, which is derived directly from prompts. Speech synthesis combines these attributes and decodes them via a codec decoder to produce the target speech.
NaturalSpeech 3 advances the NaturalSpeech TTS series by focusing on high-quality and diverse speech synthesis across single and multi-speaker, multi-style, and multi-lingual scenarios. It builds upon the foundational encoder/decoder and duration prediction components of its predecessors, distinguishing itself by introducing factorized diffusion models for generating discrete speech attributes. Unlike the flow-based models of NaturalSpeech and the latent diffusion models of NaturalSpeech 2, NaturalSpeech 3 uses the FACodec to disentangle speech into subspaces like prosody, content, acoustic details, and timbre, simplifying speech modeling and enhancing synthesis quality and diversity.
Evaluations of NaturalSpeech 3 show that it closely matches ground-truth recordings in quality and significantly surpasses baseline systems across various metrics. Speech quality assessed through CMOS tests shows NaturalSpeech 3 produces high-quality and natural speech. Speaker similarity evaluations, both objective and subjective, indicate it closely mimics speaker characteristics, outperforming other models. Prosody similarity assessments further demonstrate its superiority in capturing the nuances of speech prosody. Additionally, its robust zero-shot TTS capabilities are highlighted by lower word error rates compared to both ground truth and other baselines, indicating high intelligibility and superior performance.
Ablation studies:
Evaluations on the factorized diffusion model’s scaling effects show that both data and model scaling significantly enhance performance. With data scaling, even with just 1K hours of speech data, NaturalSpeech 3 shows effective speech generation, improving as data scales up to 200K hours, indicating clear benefits in speaker similarity and robustness. Model scaling, from 500M to 1B parameters, further boosts these metrics, suggesting larger models and more data contribute to better zero-shot TTS performance, with potential for even greater improvements with extended training and larger model sizes.
P. S. You can read my review of NaturalSpeech2 here.
Hawk and Griffin are new RNNs by DeepMind. Hawk surpasses Mamba’s performance using gated linear recurrences, while Griffin, a hybrid combining gated linear recurrences and local attention, matches Llama-2’s performance with significantly less training data. Griffin also excels in processing longer sequences than seen during training. Both models are as hardware efficient as Transformers but offer lower latency and higher throughput during inference. Griffin is scaled to 14 billion parameters.
The architecture has three main components: a residual block, an MLP block, and a temporal-mixing block. The residual and MLP blocks are consistent across models, while three types of temporal-mixing blocks are considered: global Multi-Query Attention (MQA), local MQA, and a novel recurrent block.
Residual Block: Inspired by pre-norm Transformers, it processes input sequences through several layers, applying RMSNorm for final activations and using a shared linear layer for token probabilities.
MLP Block: Uses a gated mechanism with an expansion factor, applying linear layers and a GeLU non-linearity, followed by element-wise multiplication and a final linear layer.
Temporal-Mixing Blocks:
The Real-Gated Linear Recurrent Unit (RG-LRU) features a recurrence gate and an input gate, both using the sigmoid function for non-linearity, and performs element-wise operations for stable recurrence. The RG-LRU uses a learnable parameter to ensure stable gating values between 0 and 1. The gates don’t depend on the recurrent state, which allows efficient computation.
The recurrence gate allows the model to discard the input and preserve all information from the previous history.
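A sequential, unoptimized sketch of an RG-LRU-style update consistent with the description above: sigmoid gates computed from the input only, a learnable decay parameter kept in (0, 1), and an element-wise recurrence. The constant `c` and the exact parameterization are assumptions; the real layer runs as a fused scan.

```python
import torch

class RGLRU(torch.nn.Module):
    def __init__(self, dim, c=8.0):
        super().__init__()
        self.c = c
        self.w_rec = torch.nn.Linear(dim, dim)            # recurrence gate
        self.w_in = torch.nn.Linear(dim, dim)             # input gate
        self.lam = torch.nn.Parameter(torch.randn(dim))   # learnable decay parameter

    def forward(self, x):                                 # x: (batch, time, dim)
        h = torch.zeros(x.shape[0], x.shape[2], device=x.device)
        outs = []
        for t in range(x.shape[1]):
            r = torch.sigmoid(self.w_rec(x[:, t]))        # gates depend only on x_t, not on h
            i = torch.sigmoid(self.w_in(x[:, t]))
            log_a = -self.c * torch.nn.functional.softplus(self.lam) * r
            a = torch.exp(log_a)                          # per-channel decay in (0, 1)
            h = a * h + torch.sqrt(1 - a**2) * (i * x[:, t])
            outs.append(h)
        return torch.stack(outs, dim=1)
```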
All three model families are trained across a range of scales from 100M to 14B parameters, adhering to the Chinchilla scaling laws and using the MassiveText dataset. All models show a linear scaling relationship between validation loss and training FLOPs. Griffin notably achieves a lower validation loss than a Transformer baseline across all FLOP budgets without using global attention layers, whereas Hawk shows slightly higher validation loss, which narrows with increased training budget.
For downstream task evaluation, models were trained for 300B tokens and compared against Mamba-3B and Llama-2, which were trained on significantly more tokens. Despite this, Hawk and Griffin demonstrated very strong performance, with Hawk outperforming Mamba-3B at the 3B scale and Griffin not only surpassing Mamba-3B but also being competitive with Llama-2 at the 7B and 14B scales, despite the vast difference in training data. Griffin also outperforms the MQA Transformer baseline, showing the effectiveness of these models in achieving high performance with fewer training tokens.
For large-scale training, the authors use Megatron-style sharding for MLP and MQA blocks and block-diagonal weights for RG-LRU gates to reduce cross-device communication. ZeRO parallelism and bfloat16 representations are used to manage memory consumption.
To address the computational challenge of the RG-LRU layer’s low FLOPs-to-byte ratio, the authors write a custom kernel in Pallas (JAX) with a linear scan, resulting in an almost 3x speedup.
Training speed comparisons across different model sizes and sequence lengths show that as sequence length increases, Griffin maintains consistent training time, contrasting with slower Transformer baseline times, especially at smaller model sizes. This efficiency is attributed to the computational scaling of linear layers versus the RG-LRU and attention mechanisms. However, for short sequences, Griffin trains slightly slower than the MQA baseline due to its slightly higher parameter and FLOP count.
Inference in LLMs involves two stages: “prefill” where the prompt is processed in parallel, leading to compute-bound operations similar in speed to those during training, and “decode” where tokens are generated auto-regressively, with recurrent models showcasing lower latency and higher throughput, especially at longer sequence lengths due to smaller key-value cache sizes compared to Transformers.
Latency and throughput are the main metrics for evaluating inference speed.
During decoding, both Transformer and recurrent models are memory bound, particularly when batch sizes are moderate. Recurrent models, with smaller recurrent state sizes compared to the KV cache of Transformers, have lower latency and allow for larger batch sizes, thus improving throughput. This difference becomes particularly notable for longer sequences.
In an inference performance comparison of 1B parameter models, Hawk and Griffin demonstrated superior latency and throughput compared to an MQA Transformer baseline, especially with long sequences. The lower latency of Hawk and Griffin becomes particularly evident with increased prefill lengths, highlighting the efficiency of linear recurrences and local attention mechanisms.
The authors evaluate Hawk and Griffin’s capacity to utilize longer contexts for improved predictions. Both models demonstrate enhanced next-token prediction with extended contexts, with Griffin showing notable extrapolation capability. Further exploration with models trained on 8k token sequences against those trained on 2k token sequences reveals that models adapted to longer contexts (Hawk-8k and Griffin-8k) perform better on longer sequences, showcasing their ability to learn from extended contexts. However, for shorter sequences, models trained on 2k tokens (Hawk-2k and Griffin-2k) slightly outperform, suggesting the importance of aligning training sequence length with the anticipated application needs of the model.
Hawk and Griffin’s capabilities in copying and retrieving tokens from context are explored through synthetic tasks and a practical phone number lookup task, comparing them against an MQA Transformer baseline. In the Selective Copying and Induction Heads tasks, Griffin matches the learning speed of Transformers and demonstrates superior extrapolation abilities on longer sequences, unlike the Transformer baseline, which struggles with extrapolation. Hawk, while slower in learning, shows exceptional extrapolation on the Induction Heads task.
In a real-world phonebook lookup task, pre-trained Hawk, Griffin, and the MQA Transformer models were tested for their ability to memorize and retrieve correct phone numbers. Hawk performs well on short phonebook lengths but struggles as length increases due to its fixed-size state. The Transformer baseline succeeds up to its training sequence length but fails beyond that. Griffin stands out by solving the task up to its local attention window size and extrapolating better to longer sequences, though its performance declines once the context exceeds this window.
This paper introduces a novel concept named Programmable Gradient Information (PGI) to address the issue of data loss in deep learning networks as data undergoes layer-by-layer feature extraction and spatial transformation. PGI aims to provide complete input information for calculating the objective function, ensuring reliable gradient information for network weight updates. Alongside PGI, the authors present a new lightweight network architecture called Generalized Efficient Layer Aggregation Network (GELAN), which is designed based on gradient path planning. GELAN leverages conventional convolution operators to achieve better parameter utilization compared to state-of-the-art methods that use depth-wise convolution. The effectiveness of GELAN and PGI is demonstrated through object detection tasks on the MS COCO dataset, showing that this approach allows train-from-scratch models to surpass the performance of models pre-trained on large datasets.
The Information Bottleneck Principle highlights the inevitable information loss data X experiences during transformation in deep neural networks; it illustrates that with each layer the data passes through, the likelihood of information loss increases, potentially leading to unreliable gradients and poor network convergence due to incomplete information about the prediction target. One proposed solution to mitigate this issue is to enlarge the model with more parameters, allowing for a more comprehensive data transformation and improving the chances of retaining sufficient information for accurate target mapping. However, this approach does not address the fundamental issue of unreliable gradients in very deep networks. The authors suggest exploring reversible functions as a potential solution to maintain information integrity throughout the network, aiming to achieve better convergence by preserving essential data through the network layers.
The concept of reversible functions means that a function and its inverse can transform data without loss of information. This principle is applied in architectures like PreAct ResNet, which ensures data is passed through layers without loss, aiding in deep network convergence but potentially compromising the depth’s advantage in solving complex problems. An analysis using the information bottleneck principle reveals that retaining critical information mapping data to targets is essential for training effectiveness, especially in lightweight models. The aim is to develop a new training method that generates reliable gradients for model updates and is feasible for both shallow and lightweight neural networks, addressing the core issue of significant information loss during data transformation.
PGI consists of three components: a main branch for inference without extra cost, an auxiliary reversible branch to counteract the effects of network depth, and multi-level auxiliary information to mitigate error accumulation in deep supervision and lightweight models with multiple prediction branches.
Auxiliary Reversible Branch helps maintain complete information flow from data to targets, mitigating the risk of false correlations due to incomplete features. However, integrating a reversible architecture with a main branch significantly increases inference costs. To counteract this, PGI treats the reversible branch as an expansion of deep supervision, enhancing the main branch’s ability to capture relevant information without the necessity of retaining complete original data. This approach allows for effective parameter learning and application to shallower networks. Importantly, the auxiliary reversible branch can be omitted during inference, preserving the network’s original inference efficiency.
Multi-level auxiliary information aims to address the information loss in deep supervision architectures, particularly in object detection tasks using multiple prediction branches and feature pyramids for detecting objects of various sizes. This component integrates a network between the feature pyramid layers and the main branch to merge gradient information from different prediction heads. This integration ensures that each feature pyramid receives comprehensive target object information, enabling the main branch to retain complete information for learning predictions across various targets. By aggregating gradient information containing data about all target objects, the main branch’s learning is not skewed towards specific object information, mitigating the issue of fragmented information in deep supervision.
YOLOv9 outperforms existing real-time object detectors across various model sizes, achieving higher accuracy with fewer parameters and reduced computational requirements. Specifically, YOLOv9 surpasses lightweight and medium models like YOLO MS in terms of parameter efficiency and accuracy, matches the performance of general models such as YOLOv7 AF with significantly fewer parameters and calculations, and exceeds the large model YOLOv8-X in both efficiency and accuracy.
Additionally, when compared to models using depth-wise convolution or ImageNet pretraining, YOLOv9 demonstrates superior parameter utilization and computational efficiency. The success of YOLOv9, particularly in deep models, is attributed to the PGI, which enhances the ability to retain and extract crucial information for data-target mapping, leading to performance improvements while maintaining lower parameter and computation demands.
LiRank is a large-scale ranking framework developed by LinkedIn that incorporates state-of-the-art modeling architectures and optimization techniques, including Residual DCN, Dense Gating, and Transformers. It introduces novel calibration methods and uses deep learning-based explore/exploit strategies, along with model training and compression techniques like quantization and vocabulary compression for efficient deployment.
Applied to LinkedIn’s Feed, Jobs Recommendations, and Ads CTR prediction, LiRank has led to significant performance improvements: a 0.5% increase in member sessions for the Feed, a 1.76% rise in qualified job applications, and a 4.3% boost in Ads CTR.
Feed Ranking Model. The primary Feed ranking model at LinkedIn uses a point-wise approach to predict the likelihood of various actions (like, comment, share, vote, click, and long dwell) for each member and candidate post pair. These predictions are linearly combined to calculate the final score of a post.
The model is built on a TensorFlow multi-task learning architecture with two main components: a click tower for click and long dwell probabilities, and a contribution tower for contribution actions and related predictions. Both towers use the same normalized dense features and multiple fully-connected layers, while sparse ID embedding features are converted into dense embeddings via lookup in specific embedding tables.
Ads CTR Model. The ads selection uses a click-through-rate prediction model to estimate the likelihood of members clicking on recommended ads, which then informs ad auction decisions. Advertisers can define what constitutes a chargeable click, with some counting social interactions like ‘likes’ or ‘comments’ and others only considering visits to the ad’s website. The CTR prediction model is an MTL model with three heads for different chargeability categories, grouping similar chargeable actions together. Each head uses independent interaction blocks, including MLP and DCNv2. The model incorporates traditional features from members and advertisers, as well as ID features to represent advertisers, campaigns, and advertisements.
Residual DCN. To enhance feature interaction capture, DCNv2 was utilized. To manage the high parameter count from DCNv2’s large feature input dimension, the authors replaced the weight matrix with two low-rank matrices and reduced input feature dimension through embedding-table look-ups, achieving nearly a 30% reduction in parameters. Additionally, the cross-network of DCNv2 was improved by introducing an attention mechanism with a low-rank approximation.
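A minimal sketch of a DCNv2-style cross layer with the low-rank factorization described above (the full weight matrix replaced by a down/up projection pair). It is illustrative rather than LinkedIn's exact Residual DCN, which also adds an attention mechanism.

```python
import torch

class LowRankCrossLayer(torch.nn.Module):
    def __init__(self, dim, rank):
        super().__init__()
        self.V = torch.nn.Linear(dim, rank, bias=False)   # down-projection
        self.U = torch.nn.Linear(rank, dim, bias=True)    # up-projection (carries the bias)

    def forward(self, x0, xl):
        # x_{l+1} = x0 * (U(V(xl)) + b) + xl  -- explicit feature crossing with a residual
        return x0 * self.U(self.V(xl)) + xl
```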
Isotonic Calibration Layer in DNN. Model calibration is essential for ensuring that estimated class probabilities accurately reflect real-world occurrences, critical for applications like Ads pricing based on CTR. Traditional calibration methods like Platt scaling and isotonic regression face challenges with deep neural networks due to parameter space constraints and scalability issues with multiple features. To overcome these, a customized isotonic regression layer, designed to integrate directly with deep neural networks, was developed. This layer, trainable within the network, uses a piece-wise fitting approach to bucketize predicted values and assigns trainable weights to each bucket, updated during training. The isotonic property is maintained through non-negative weights, ensured by the ReLU activation function. For calibration with multiple features, weights are combined with an embedding representation of calibration features, enhancing the model’s calibration capability.
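A minimal sketch of a trainable isotonic calibration layer in the spirit described above: raw predictions are bucketized, each bucket gets a ReLU-ed (non-negative) trainable weight, and the calibrated score sums bucket weights in proportion to how much of each bucket lies below the prediction, which is monotone by construction. Bucket count and the combination with calibration-feature embeddings are assumptions.

```python
import torch

class IsotonicCalibration(torch.nn.Module):
    def __init__(self, num_buckets=50):
        super().__init__()
        self.num_buckets = num_buckets
        self.weights = torch.nn.Parameter(torch.ones(num_buckets) / num_buckets)

    def forward(self, p):                                  # p: raw scores in [0, 1], shape (batch,)
        edges = torch.linspace(0, 1, self.num_buckets + 1, device=p.device)
        # fraction of each bucket covered by the prediction, clamped to [0, 1]
        frac = ((p.unsqueeze(-1) - edges[:-1]) * self.num_buckets).clamp(0, 1)
        w = torch.relu(self.weights)                       # non-negative weights -> isotonic output
        return (frac * w).sum(-1)
```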
Dense Gating and Large MLP. Personalized embeddings were added to global models to facilitate interactions among dense features, including multi-dimensional count-based and categorical features, by flattening these into a single dense vector and combining them with embeddings for processing in MLP layers. Increasing the width of MLP layers was found to enhance model performance by enabling more complex interactions, with the largest tested configuration being an MLP with 4 layers, each 3500 units wide, showing gains primarily when personalized embeddings were used. Additionally, a gating mechanism inspired by Gate Net was introduced to hidden layers to regulate information flow, enhancing learning with minimal additional computational cost and consistently improving online performance.
Incremental Training. Large-scale recommender systems need to frequently update to include new content like Ads, news feed updates, and job postings. The authors use incremental training, which not only initializes weights from the previous model but also adds an informative regularization term based on the difference between the current and previous model weights, adjusted by a forgetting factor. To further mitigate catastrophic forgetting, both the initial cold start model and the prior model are used for weight initialization and regularization, introducing a new parameter called cold weight to balance the influence of the initial and previous models.
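A minimal sketch of the regularization idea above: the new model is pulled toward a mixture of the cold-start weights and the previous model's weights, controlled by a "cold weight" and a forgetting-factor-style coefficient. The names and exact functional form are assumptions.

```python
import torch

def incremental_regularizer(model, cold_state, prev_state, cold_weight=0.2, lam=1e-4):
    """cold_state / prev_state: dicts of parameter tensors keyed by parameter name."""
    reg = 0.0
    for name, w in model.named_parameters():
        anchor = cold_weight * cold_state[name] + (1 - cold_weight) * prev_state[name]
        reg = reg + ((w - anchor) ** 2).sum()
    return lam / 2 * reg

# total_loss = task_loss + incremental_regularizer(model, cold_state, prev_state)
```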
Member History Modeling. To capture member interactions with content on the platform, a method involving historical interaction sequences for each member is used, where item embeddings are combined with action embeddings and the embedding of the item being evaluated. This combined data is processed by a two-layer Transformer-Encoder, with the max-pooling token serving as a feature in the ranking model. Additionally, the last five steps of the sequence are flattened, concatenated, and used as extra input features to enhance the model’s information. To minimize latency, experiments were conducted with shorter sequences and reduced dimensions of the feed-forward network within the Transformer. While longer sequences can increase relevance, the additional training and serving time does not justify their use.
Explore and Exploit. The authors predict values using the last layer’s weights and input, then apply Bayesian linear regression to obtain the posterior probability of the weights, which is fed into Thompson Sampling. This approach does not require independently training a model for the last layer’s representation but updates the posterior probability of weights incrementally after each offline training period.
Dwell Time Modeling. To better understand member behavior and preferences on LinkedIn, a ‘long dwell’ signal was introduced to detect passive content consumption, addressing the challenge of capturing passive but positive engagement. Direct or logarithmic prediction of dwell time was found unsuitable due to data volatility, and static thresholds for defining ‘long dwell’ lacked adaptability and consistency, potentially biasing towards content with inherently longer dwell times. To overcome these issues, a binary classifier was developed to predict whether the time spent on a post exceeds a certain percentile, with specific percentiles adjusted based on contextual features like ranking position, content type, and platform. This approach allows for dynamic adjustment of long-dwell thresholds, capturing evolving user preferences and reducing bias and noise.
To enhance the scalability of training large ranking models, several optimization techniques were used, resulting in significant reductions in training time:
Incremental training was applied to both Feed ranking and Ads CTR models, showing significant improvements in metrics and reductions in training time after tuning parameters.
For Feed ranking, an offline “replay” metric was used to compare models by estimating the online contribution rate (likes, comments, re-posts) through a pseudo-random ranking method. This method allowed for unbiased offline comparison of models, with various production modeling techniques like Isotonic calibration, low-rank DCNv2, and others leading to a 0.5% relative increase in member sessions.
In Jobs Recommendations, embedding dictionary compression and task-specific DCN layers were used without performance loss, achieving significant offline AUC lifts for Job Search and JYMBII models. This resulted in a 1.76% improvement in Qualified Applications in online A/B testing.
For Ads CTR, incremental improvements were made using techniques like ID embeddings, Quantization, and Isotonic calibration, among others, following a multilayer perceptron baseline model. These techniques led to a 4.3% relative improvement in CTR in online A/B tests.
Scaling up Feed Training Data Generation. To manage the increased volume of training data from scaling up to use 100% of sessions, two major optimizations were implemented. First, the data pipeline was adjusted to explode only post features and keys before joining with the labels dataset, and then adding session-level features in a subsequent join. This approach reduced the overall shuffle write size by 60%. Second, tuning Spark compression further reduced shuffle write size by 25%.
Model Convergence with DCNv2. Initial experiments with DCNv2 showed a high divergence rate. To stabilize training, the learning rate warm-up period was increased from 5% to 50% of training steps, which not only resolved instability issues but also enhanced offline relevance gains. Batch normalization was applied to numeric input features, and it was discovered that the model was under-fitting at the current number of training steps. However, increasing training steps was impractical for production. Instead, tripling the learning rate, given the extended warm-up period, effectively closed the gap in relevance metrics without extending training duration.
Additionally, optimization strategies varied across models. While Adam was effective in general, models with many sparse features performed better with AdaGrad. Learning rate warm-up and gradient clipping were particularly useful for larger batch sizes, improving model generalization. The practice of increasing the learning rate proportionally with batch size, without exceeding 60% of total training steps, was found to enhance generalization across different settings and mitigate generalization gaps at larger batch sizes.
Lag-Llama is a new foundation model designed for univariate probabilistic time series forecasting, using a decoder-only transformer architecture with lags as covariates. It is pretrained on a diverse corpus of time series data from various domains, showcasing exceptional zero-shot generalization capabilities. When fine-tuned on small subsets of new datasets, Lag-Llama achieves superior performance, surpassing previous deep learning methods and setting new benchmarks in time series forecasting.
In univariate time series modelling, the dataset comprises one or more time series, each sampled at discrete time points, with the goal of predicting future values. Instead of using the entire history of each time series for prediction, a fixed context window is used to learn an approximation of the distribution of the next values, incorporating covariates. Predictions are made through an autoregressive model, leveraging the chain rule of probability, and are conditioned on learned neural network parameters.
The tokenization process for Lag-Llama involves generating lagged features from prior time series values using specified lag indices that cover quarterly, monthly, weekly, daily, hourly, and second-level frequencies. These lag indices create a vector for each time value, where each element corresponds to the value at a specific lag. Date-time features across different frequencies, from second-of-minute to quarter-of-year, are integrated to provide supplementary information and help the model understand the frequency of the time series. The resulting token dimension equals the number of lag indices plus the number of date-time features. However, a limitation of this approach is the need for a context window at least as large as the largest lag used (by definition).
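A small sketch of the lag-feature construction: for each time step, collect past values at a fixed set of lag offsets, then append date-time features. The lag set below is illustrative, not the paper's exact list.

```python
import numpy as np

LAGS = [1, 2, 3, 7, 14, 24, 168, 720]   # illustrative: recent steps plus daily/weekly/monthly offsets

def lag_features(series, t):
    """series: 1-D array of past values; t: index to tokenize (requires t >= max(LAGS))."""
    return np.array([series[t - lag] for lag in LAGS])

# token = concatenate([lag_features(y, t), datetime_features(t)])
```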
Lag-Llama uses a decoder-only transformer architecture, based on LLaMA, designed for univariate time series forecasting. The model processes sequences by first tokenizing them along with covariates into a sequence of tokens, which are then mapped to a hidden dimension suitable for the attention module. It incorporates pre-normalization techniques like RMSNorm and Rotary Positional Encoding to enhance its attention mechanism, aligning with the practices of the LLaMA architecture. The transformer layers, which are causally masked to prevent future information leakage, output the parameters of the forecast distribution for the next time step. The model’s training objective is to minimize the negative log-likelihood of this predicted distribution across all time steps.
For predictions, Lag-Llama takes a feature vector from a time series, generating a distribution for the next time point through greedy autoregressive decoding. This process allows for the simulation of multiple future trajectories up to a predefined prediction horizon. From these simulations, uncertainty intervals can be calculated, aiding in downstream decision-making and evaluation against held-out data.
The final component of Lag-Llama is the distribution head, a layer that translates the model’s learned features into parameters of a specific probability distribution. In their experiments, the creators adopted a Student’s t-distribution, configuring the distribution head to output its three parameters: degrees of freedom, mean, and scale, with special adjustments to maintain the positivity of these parameters.
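A minimal sketch of such a distribution head: a linear layer emits three raw values per time step, with softplus keeping degrees of freedom and scale positive (the `+ 2.0` offset on the degrees of freedom is an assumption, not from the paper).

```python
import torch
import torch.nn.functional as F

class StudentTHead(torch.nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, 3)

    def forward(self, h):
        raw_df, loc, raw_scale = self.proj(h).unbind(-1)
        df = 2.0 + F.softplus(raw_df)            # keep degrees of freedom positive
        scale = F.softplus(raw_scale) + 1e-6     # keep scale strictly positive
        return torch.distributions.StudentT(df, loc, scale)

# loss = -head(h).log_prob(target).mean()       # negative log-likelihood training objective
```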
To handle the diversity in numerical magnitudes across different time series datasets during pretraining, Lag-Llama employs a scaling heuristic. For each univariate window, it calculates the mean and variance of the time series within the window and standardizes the time series data by subtracting the mean and dividing by the variance. Additionally, the mean and variance are included as time-independent covariates (summary statistics) alongside each token to inform the model about the input data’s statistical properties.
Furthermore, the model adopts a Robust Standardization: normalizing the time series by subtracting the median and scaling by the Interquartile Range, making the preprocessing step more robust to extreme values in the data.
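A small sketch of the robust standardization step described above: center each context window by its median and scale by its interquartile range.

```python
import numpy as np

def robust_standardize(window, eps=1e-8):
    median = np.median(window)
    q1, q3 = np.percentile(window, [25, 75])
    iqr = max(q3 - q1, eps)                  # guard against constant windows
    return (window - median) / iqr, (median, iqr)
```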
During training, the authors use stratified sampling and the augmentation techniques Freq-Mix and Freq-Mask.
Lag-Llama demonstrates strong performance in time series forecasting, comparing favorably with supervised baselines across unseen datasets in both zero-shot and fine-tuned settings. In the zero-shot scenario, it matches the performance of all baselines with an average rank of 6.714. Fine-tuning further enhances its capabilities, leading to state-of-the-art performance in three of the used datasets and significantly improved performance in others, achieving the best average rank of 2.786. This performance underscores Lag-Llama’s potential as a go-to method for diverse datasets without prior data knowledge, fulfilling a foundational model’s key requirement.
The experiments suggest that at scale, decoder-only transformers may outperform other architectures in time series forecasting, mirroring observations from the NLP community regarding the impact of inductive bias.
Lag-Llama was also evaluated on its ability to adapt to different amounts of historical data, with experiments conducted using only the last 20%, 40%, 60%, and 80% of the data from training sets. Lag-Llama was fine-tuned and consistently achieved the best average rank across all levels of available history, showcasing its strong adaptation capabilities. As the volume of available history increased, so did Lag-Llama’s performance, widening the performance gap between it and baseline models.
However, it’s noted that in the exchange-rate dataset, which represented a new domain and frequency not seen during pretraining, Lag-Llama was frequently outperformed by the TFT model, suggesting that Lag-Llama benefits from more historical data in scenarios where the dataset is significantly different from the pretraining corpus.
Lumiere is a novel text-to-video diffusion model that stands out for its ability to synthesize videos with realistic, diverse, and coherent motion. It differs from traditional models by using a Space-Time U-Net architecture, generating videos in a single pass instead of the usual method of creating keyframes and then adding details. This approach, which includes spatial and temporal down- and up-sampling, helps maintain global temporal consistency. Lumiere uses a pre-trained text-to-image diffusion model, enabling it to produce full-frame-rate, low-resolution videos at multiple space-time scales. It achieves state-of-the-art results in text-to-video generation and is adaptable for various content creation and video editing tasks, such as image-to-video conversion, video inpainting, and stylized video generation.
Lumiere uses Diffusion Probabilistic Models for video generation, which approximate a data distribution through denoising steps, starting from Gaussian noise and gradually refining it. Lumiere’s framework includes a base model for generating low-resolution video clips and a spatial super-resolution model for upscaling to high resolution.
The Space-Time U-Net (STUnet) in Lumiere downsamples both spatially and temporally, focusing computation on a compact space-time representation. This architecture, inspired by biomedical data processing techniques, interleaves temporal blocks with spatial resizing modules and incorporates temporal convolutions and attention. The temporal attention is used at the coarsest resolution to manage computational demands.
For spatial super-resolution, Lumiere uses Multidiffusion to handle memory constraints and avoid temporal artifacts. This involves splitting the video into overlapping segments, processing each with SSR, and then combining them. The combination is optimized to minimize differences between the segments and their SSR predictions, ensuring smooth transitions in the final high-resolution video.
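A simplified sketch of the overlap-and-blend step: the video is split into overlapping temporal segments, each segment is upscaled, and overlapping frames are averaged. MultiDiffusion optimizes this combination during denoising; the code below only shows the overlap-weighted merge.

```python
import numpy as np

def blend_segments(segments, starts, total_frames):
    """segments: list of (T, H, W, C) arrays; starts: first frame index of each segment."""
    out = np.zeros((total_frames,) + segments[0].shape[1:])
    weight = np.zeros((total_frames, 1, 1, 1))
    for seg, s in zip(segments, starts):
        out[s:s + len(seg)] += seg
        weight[s:s + len(seg)] += 1.0
    return out / np.maximum(weight, 1e-8)    # average frames where segments overlap
```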
For stylized video generation, Lumiere uses a technique inspired by GAN-based interpolation, blending the fine-tuned T2I weights with the original weights. Different styles result in distinct motion characteristics in the generated videos, with examples including watercolor painting for realistic motion and line drawing or cartoon styles for unique, non-realistic motion.
Lumiere is extended to generate videos based on additional input signals, such as a noisy video, a conditioning video (or image), or a binary mask. The applications include image-to-video generation, inpainting, and cinemagraphs.
In the qualitative assessment, Lumiere outshines its competitors by producing 5-second videos with greater motion magnitude, better temporal consistency, and overall quality. Gen2 and Pika, while high in per-frame visual quality, tend to produce nearly static videos. ImagenVideo, AnimateDiff, and ZeroScope, despite showing noticeable motion, suffer from visual artifacts and generate shorter videos.
In quantitative terms, Lumiere is evaluated on the UCF101 dataset for zero-shot text-to-video generation and achieves competitive Frechet Video Distance and Inception Score metrics. However, these metrics have their limitations in accurately reflecting human perception and capturing long-term motion.
Further validation comes from a user study conducted using the Two-alternative Forced Choice protocol on Amazon Mechanical Turk. Participants compared Lumiere’s videos with those from baseline methods, focusing on visual quality, motion, and alignment with text prompts. Lumiere was consistently preferred over the baselines, indicating its superior performance in aligning with text prompts and in overall video quality.
AIM is a collection of vision models pre-trained with an autoregressive objective inspired by LLMs and demonstrating similar scaling properties. The authors’ findings include the scaling of visual feature performance with model capacity and data quantity and a correlation between the objective function value and model performance in downstream tasks. 7B AIM pre-trained on 2 billion images achieves 84.0% on ImageNet1k with a frozen trunk without showing performance saturation, indicating a potential new frontier in large-scale vision model training. AIM’s pre-training process mirrors that of LLMs and doesn’t require unique image-specific strategies for stable scaling.
The models are pre-trained on the DFN dataset, a subset of 12.8B image-text pairs from Common Crawl, refined to 2B high-quality images by removing inappropriate content, blurring faces, deduplicating, and ranking based on image-caption alignment. No content-based curation is involved, allowing the potential use of larger, less aligned image collections. For pre-training, a blend of DFN-2B (80%) and ImageNet-1k (20%) is used, emulating the LLM practice of oversampling high-quality data, creating the DFN-2B+ dataset.
The training objective uses an autoregressive model on image patches, treating each image as a sequence of non-overlapping patches. The probability of an image is the product of conditional probabilities of each patch, given previous patches. The training loss is the negative log-likelihood of these probabilities, aiming to learn the true image distribution. The basic loss is a normalized pixel-level regression, minimizing the L2 distance between predicted and actual patch values. A cross-entropy loss variant with discrete tokens is also tested, but pixel-wise loss yields stronger features.
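A minimal sketch of the pixel-regression objective: the prediction at patch position k is trained against the (per-patch normalized) ground-truth patch at position k+1, with a mean L2 loss.

```python
import torch

def aim_loss(predicted_patches, target_patches):
    """Both tensors: (batch, num_patches, patch_dim); targets are per-patch normalized pixels."""
    pred = predicted_patches[:, :-1]   # prediction at position k ...
    tgt = target_patches[:, 1:]        # ... is scored against the patch at position k+1
    return ((pred - tgt) ** 2).mean()
```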
AIM uses ViT architecture, prioritizing width over depth for scaling. It employs causal masks during pre-training for autoregressive modeling of image patches, ensuring efficient computation. To bridge the gap between autoregressive pre-training and bidirectional attention in downstream tasks, a Prefix Transformer approach is introduced, treating initial patches as context. Simple MLP prediction heads are used during pre-training to maintain feature generality for downstream tasks. AIM doesn’t require optimization stability mechanisms and uses sinusoidal positional embeddings and a standard MLP design.
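A small sketch of the prefix attention mask implied above: the first `prefix_len` patches attend bidirectionally (matching downstream bidirectional use), while the remaining patches stay causal.

```python
import torch

def prefix_causal_mask(seq_len, prefix_len):
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    mask[:, :prefix_len] = True        # every position may attend to the prefix patches
    return mask                        # True = attention allowed

# prefix_causal_mask(6, 2) -> bidirectional within the first 2 patches, causal afterwards
```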
For downstream adaptation, AIM focuses on scenarios with fixed model weights, only training a classification head to minimize adaptation costs and overfitting risks. Unlike contrastive learning, AIM’s loss is computed per patch without global image descriptors. It uses attention pooling over patch features to create a global descriptor, enhancing performance with minimal parameter increase. This method (Attentive Probe) maintains the benefits of linear probing, such as low parameter count and reduced overfitting risk, while offering improved performance.
AIM demonstrates strong performance among generative methods, outperforming BEiT and MAE models of similar or larger capacities, even when the latter is trained on a large private dataset. It also shows competitive results against joint embedding methods such as DINO, iBOT, and DINOv2. Although AIM is outperformed by DINOv2, which uses higher-resolution inputs and various optimization techniques, AIM’s pre-training is simpler and more scalable in terms of parameters and data, leading to consistent improvements.
Interestingly, higher-quality features can be extracted from shallower layers of the model rather than the last layer, suggesting that the generative nature of AIM’s pre-training objective allows for the distribution of semantically rich features across different layers, not just concentrating them around the last layer.
LoRA can further improve the performance.
Ferret is a new Multimodal Large Language Model adept at understanding and linking detailed descriptions to specific areas within images, regardless of their shape or size. It employs a novel hybrid region representation combining discrete coordinates with continuous features for accurate image region representation. A spatial-aware visual sampler enables it to handle various region inputs, including points, bounding boxes, and free-form shapes. Ferret was trained using GRIT, a comprehensive dataset with over 1.1 million samples, including 95,000 hard negative examples to enhance robustness. It excels in traditional image referring and grounding tasks, outperforms existing MLLMs in region-based multimodal communication, and shows improved capabilities in detailed image description and reduced object hallucination.
Usually, there are three formats for referencing image regions: point, box, and free-form shapes. Points and boxes are represented by simple coordinates, but free-form shapes, which include various types like scribbles and polygons, are more complex. To effectively represent these shapes, the authors propose a hybrid region representation, combining discrete coordinates with continuous visual features: a 2D binary mask is created for the region and visual features are extracted using a spatial-aware visual sampler. Points are represented by their coordinates and a fixed-radius circle, while boxes and free-form shapes use corner coordinates and a feature extracted from the defined region.
Ferret consists of three main components: an image encoder, a spatial-aware visual sampler, and an LLM that jointly models image, text, and region features. Referred regions are written directly into the text as coordinates followed by a special token carrying the continuous region feature (e.g., a cat [100, 50, 200, 300] ⟨SPE⟩), which can be combined with ordinary text to form complete sentences.

GRIT has ~1.1 million multimodal dialogues for training the model. It combines three types of data: public datasets converted into instruction-following format (individual objects, object relationships, and region descriptions), instruction-tuning dialogues generated with GPT assistance, and hard negative samples for robustness.
Training takes roughly 5 and 2.5 days on 8 A100 GPUs for Ferret-13B and Ferret-7B, respectively.
Input Referring: In this task, the model classifies objects in a specified image region, presented as a binary-choice question. The evaluation uses the LVIS dataset, covering over 1000 object categories, and tests three types of referring: point, box, and free-form shape. Ferret outperforms previous models in all referring types.
Output Grounding: The model’s grounding capability is evaluated in two sub-tasks: visual grounding and grounded captioning. Visual grounding involves grounding language queries into image regions. Grounded captioning requires generating a caption and grounding all noun phrases to image regions. Ferret demonstrates outstanding performance in these tasks, achieving state-of-the-art results.
Ferret-Bench: Multimodal Chatting with Referring and Grounding: To assess Ferret’s practical application in multimodal chatting, a new benchmark called Ferret-Bench is introduced. It includes three types of region-based questions: Referring Description, Referring Reasoning, and Grounding in Conversation. These tasks test the model’s ability to describe, reason, and accurately ground objects or regions in an image. Ferret significantly outperforms previous models in these tasks, showcasing its strong spatial understanding and reasoning capabilities.
DocLLM is a new extension to LLMs designed for processing visual documents such as forms and invoices. It uniquely uses bounding box information instead of image encoders to understand document layouts. The model modifies the attention mechanism in transformers to separately handle text and spatial information. It is pre-trained to fill in text segments, aiding in managing diverse layouts and content in visual documents. After pre-training, DocLLM is fine-tuned on a large dataset for four core document intelligence tasks. It outperforms state-of-the-art LLMs in most tested datasets and shows strong generalization capabilities to new datasets.
DocLLM represents inputs as pairs of text tokens and their corresponding bounding boxes. It encodes bounding boxes into separate hidden vectors and decomposes the attention mechanism into four scores: text-to-text, text-to-spatial, spatial-to-text, and spatial-to-spatial, using projection matrices and hyperparameters to balance the importance of each score. The hidden vectors for spatial information are reused across layers, reducing the number of extra parameters compared to image-based models. By not simply adding spatial information to text (which would couple layout with semantics), DocLLM maintains a disentangled representation, enabling selective focus on either modality when necessary.
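A minimal sketch of the disentangled attention scores described above: separate query/key projections for text and spatial (bounding-box) hidden states, with the four cross terms mixed by lambda hyperparameters. Heads, the value path, and masking are omitted; this is an illustration, not the paper's implementation.

```python
import torch

class DisentangledScores(torch.nn.Module):
    def __init__(self, dim, lam_ts=1.0, lam_st=1.0, lam_ss=1.0):
        super().__init__()
        self.qt, self.kt = torch.nn.Linear(dim, dim), torch.nn.Linear(dim, dim)  # text projections
        self.qs, self.ks = torch.nn.Linear(dim, dim), torch.nn.Linear(dim, dim)  # spatial projections
        self.lam = (lam_ts, lam_st, lam_ss)

    def forward(self, text_h, spatial_h):   # both: (batch, seq, dim)
        Qt, Kt = self.qt(text_h), self.kt(text_h)
        Qs, Ks = self.qs(spatial_h), self.ks(spatial_h)
        lam_ts, lam_st, lam_ss = self.lam
        scores = (Qt @ Kt.transpose(-2, -1)                 # text-to-text
                  + lam_ts * Qt @ Ks.transpose(-2, -1)      # text-to-spatial
                  + lam_st * Qs @ Kt.transpose(-2, -1)      # spatial-to-text
                  + lam_ss * Qs @ Ks.transpose(-2, -1))     # spatial-to-spatial
        return scores / (text_h.shape[-1] ** 0.5)
```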
DocLLM is pre-trained in a self-supervised manner on a large dataset of unlabeled documents. Instead of focusing on individual tokens, DocLLM considers larger segments or blocks of related tokens, such as text blocks or linear sequences. Additionally, DocLLM employs a text infilling objective, where predictions are based on both the preceding and following tokens rather than just the preceding ones. This approach is particularly effective for handling OCR noise, misalignments, and relationships between different document fields.
During pre-training, text blocks identified from OCR information are randomly masked and shuffled for reconstruction. This block infilling is done autoregressively, with special tokens indicating the start and end of blocks. This method is used only in pre-training and not in subsequent fine-tuning or downstream tasks. The model minimizes a cross-entropy loss.
DocLLM is fine-tuned using instruction-based methods on 16 datasets covering four DocAI tasks: Visual Question Answering (VQA), Natural Language Inference (NLI), Key Information Extraction (KIE), and Document Classification (CLS). Different templates are used for each task, for example:

- VQA: "{document} What is the deadline for scientific abstract submission for ACOG - 51st annual clinical meeting?"
- NLI: "{document} \"The UN commission on Korea include 2 Australians.\", Yes or No?"
- KIE: "{document} What is the value for the \"charity number\"?"
- CLS: "{document} What type of document is this? Possible answers: [budget, form, file folder, questionnaire]."
The models are trained on 24 GB A10G GPUs.
In the SDDS (Same Datasets, Different Splits) setting, DocLLM-7B outperforms other models in 12 out of 16 datasets, including GPT-4 and Llama2 in zero-shot settings. In particular, it excels in layout-intensive tasks like KIE and CLS. However, in the other two tasks, it is outperformed by GPT-4, likely due to GPT-4’s better reasoning and abstraction capabilities.
In the STDD (Same Tasks, Different Datasets), DocLLM surpasses Llama2 in four out of five datasets and achieves the best scores in two, especially in the KIE task. However, DocLLM’s classification accuracy is lower, possibly due to its training on only one classification dataset, which might limit its generalization to new datasets.
Disentangled Spatial Attention: Focusing solely on spatial-to-spatial interaction led to the highest accuracy in understanding documents with rich layouts. This finding emphasizes the importance of incorporating spatial features in document analysis.
Autoregressive Block Infilling: The block infilling approach with spatial modality demonstrated the best performance, underscoring the value of spatial information in the learning process.
Prefix Decoder vs. Causal Decoder: The authors compared a prefix decoder, which allows bidirectional visibility of the whole document, with a conventional causal decoder. The experiments showed only marginal differences between the two decoders across various configurations, with the causal decoder slightly outperforming the prefix decoder.