Paper Review: Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Paper link

Main image

Agent K v1.0 is an autonomous data science agent designed to handle the entire data science lifecycle through experience-based learning. Unlike traditional methods, it uses a flexible, structured reasoning framework to optimize memory storage and retrieval, enabling complex decision-making without fine-tuning or backpropagation. In evaluations on Kaggle competitions, Agent K v1.0 performs tasks autonomously, leveraging Bayesian optimization, feature engineering, and various popular libraries to handle multiple data modalities. It achieves a 92.5% success rate across tasks and ranks in the top 38% of Kaggle competitors, reaching Expert-level performance and a medal record comparable to a Kaggle Grandmaster: six gold, three silver, and seven bronze medals.

Learning to Reason by Experience

The authors propose a novel approach for teaching LLMs to handle data science tasks without traditional fine-tuning and backpropagation, which are computationally demanding. Instead, the agent uses an internal working memory and an external long-term memory database to dynamically adapt and learn from experience. The process is modeled as a Markov Decision Process: the agent makes decisions based on the environmental state, its working memory, and the external database, all of which it can update over time to guide reasoning and actions.

The agent performs three types of actions: managing long-term memory, updating short-term working memory, and interacting with the environment (e.g., generating code or submitting results). A reward-based policy selectively stores high-utility data and code snippets, continuously optimizing decision-making. This framework enables the agent to adapt its reasoning without changing core model parameters, maximizing performance in data science tasks through memory-based learning rather than extensive data collection and backpropagation.
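
To make this learning-by-experience loop more concrete, here is a minimal Python sketch of how such a memory-driven decision loop could look. The class and method names (ExperienceAgent, retrieve, store, etc.) are illustrative assumptions on my part, not the authors’ implementation.

```python
# Illustrative sketch of an experience-based agent loop (assumed structure, not the paper's code).
# The agent adapts via memory reads and writes instead of gradient updates.
class ExperienceAgent:
    def __init__(self, llm, long_term_db):
        self.llm = llm                    # frozen LLM: no fine-tuning or backpropagation
        self.working_memory = []          # short-term, in-context state
        self.long_term_db = long_term_db  # external retrieval database

    def step(self, env_state):
        # Retrieve relevant past experiences to condition the next decision.
        recalled = self.long_term_db.retrieve(query=env_state)
        prompt = (f"STATE:\n{env_state}\n"
                  f"WORKING MEMORY:\n{self.working_memory}\n"
                  f"RECALLED EXPERIENCE:\n{recalled}")
        # The action can be a memory update, generated code, or a submission.
        return self.llm.generate(prompt)

    def observe(self, env_state, action, reward):
        # Reward-based storage policy: keep only high-utility experiences.
        self.working_memory.append((env_state, action, reward))
        if reward > 0:
            self.long_term_db.store(env_state, action, reward)
```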

Autonomous Data Science Agents

Data Science pipeline

Agent K v1.0 is designed as a scalable, autonomous data science agent capable of handling diverse multimodal tasks, including tabular, time series, computer vision, NLP, and multimodal data. Each task is represented by a tuple with a natural language description and the relevant datasets. Agent K starts by automatically fetching and scraping tasks from Kaggle using their URLs. Then it sets up the solution generation process.

Phase I (Automation) - Setting Up Data Science Tasks

Automatic setup

The setup pipeline begins with extracting raw data and task descriptions, then creating a task-specific plan based on identified data modalities. This plan standardizes data into structured “input maps” for different types like tabular, image, or text data, allowing automated processing and unit testing at each stage.
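
As a rough illustration, an “input map” can be thought of as a standardized, modality-keyed description of where the data lives and how to read it; the field names below are hypothetical, not taken from the paper.

```python
# Hypothetical shape of a standardised "input map" for a mixed-modality task
# (field names are illustrative only).
input_map = {
    "tabular": {"train_csv": "data/train.csv", "target_column": "label"},
    "image":   {"train_dir": "data/images/train", "id_column": "image_id"},
    "text":    {"train_csv": "data/train.csv", "text_column": "review"},
}
```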

Agent K’s setup process is modeled as a Markov Decision Process that defines the state space (pipeline stages and workspace content) and assigns rewards based on unit test outcomes. Each pipeline stage involves actions like creating files or summarizing data, with success determined by passing stage-specific and meta-unit tests. If a unit test fails, the agent remains in the current stage to reattempt; passing all tests progresses the setup to the next stage. Transition dynamics adjust based on which tests pass or fail, allowing the agent to adaptively repeat or advance stages to refine setup processes for diverse data science tasks.
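
A rough sketch of such a stage-gated setup loop, with unit tests deciding whether to retry or advance, might look like the following; the stage names and agent methods are assumptions for illustration.

```python
# Illustrative stage-gated setup loop (assumed structure, not the paper's code).
SETUP_STAGES = ["fetch_data", "build_input_maps", "create_loaders", "define_metric"]

def run_setup(agent, task, max_attempts_per_stage=5):
    stage_idx = 0
    while stage_idx < len(SETUP_STAGES):
        stage = SETUP_STAGES[stage_idx]
        for _ in range(max_attempts_per_stage):
            code = agent.generate_stage_code(task, stage)          # LLM action
            stage_ok = agent.run_stage_unit_tests(code, stage)     # stage-specific tests
            meta_ok = agent.run_meta_unit_tests(code, task)        # cross-stage tests
            if stage_ok and meta_ok:
                stage_idx += 1          # all tests pass: advance to the next stage
                break
            agent.record_failure(stage, code)  # stay in the current stage and retry
        else:
            raise RuntimeError(f"Setup did not converge at stage: {stage}")
    return agent.workspace
```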

Action Generation through Structured Reasoning

Standard RL methods struggle with the complex, high-dimensional state and action spaces in Agent K’s setup process, where actions generate long token sequences and rewards are sparse. To address this, the authors extend the Pangu-Agent framework to data science scenarios, allowing Agent K to use intrinsic functions alongside extrinsic actions, structured by an internal policy that manages both memory and task actions. The setup automation policy aims to maximize rewards by executing generated code and assessing it via per-stage and cross-stage unit tests, which evaluate code quality without being specific to each task.

Tackling Credit Assignment with Nested Reasoning

Credit assignment (CA) helps determine which parts of the generated code need modification to pass meta-unit tests. Instead of traditional RL state critics, Agent K uses LLM-generated thoughts to analyze errors when a meta-unit test fails. The agent revisits each stage in the process and selectively rewrites code only if it suspects an error at that stage, efficiently isolating the root cause of failure.

The CA process involves a sequence of intrinsic actions (a rough sketch in code follows the list):

  • The agent generates a “META-ERROR-THOUGHT” based on the failure, storing insights in memory;
  • A second action, “CA-THOUGHT,” evaluates if previous code contributed to the error;
  • A new code segment is generated based on this reflection, which is then executed and validated.
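
A minimal sketch of this nested credit-assignment loop is below; the prompt wording and agent methods are my assumptions, and only the META-ERROR-THOUGHT / CA-THOUGHT structure comes from the paper.

```python
# Illustrative nested credit assignment (assumed helpers, not the paper's code).
def credit_assignment(agent, stages, failed_meta_test):
    # 1) Reflect on the meta-unit-test failure and store the insight in memory.
    error_thought = agent.llm.generate(
        f"META-ERROR-THOUGHT: explain the likely cause of this failure:\n{failed_meta_test}"
    )
    agent.working_memory.append(error_thought)

    # 2) Revisit each stage and decide whether its code contributed to the error.
    for stage in stages:
        ca_thought = agent.llm.generate(
            f"CA-THOUGHT: given the analysis below, did the code of stage "
            f"'{stage.name}' contribute to the failure? Answer YES or NO.\n{error_thought}"
        )
        if "YES" in ca_thought.upper():
            # 3) Rewrite only the suspected stage, then execute and re-validate.
            stage.code = agent.generate_stage_code(stage, context=error_thought)
            if agent.run_meta_unit_tests(stage.code, stage.task):
                return True   # root cause isolated and fixed
    return False
```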

Phase II (Optimisation) - Solving Data Science Tasks

After Agent K v1.0 sets up a task by creating valid data loaders and task-specific metrics, another policy generates code to optimize task performance, including feature engineering, model training, and hyperparameter tuning. The agent then submits predictions on Kaggle to evaluate performance based on leaderboard scores.

Solution generation is modeled as an MDP, with the goal of maximizing rewards based on final leaderboard results. Agent K v1.0 uses intrinsic functions and specialized tools - deep learning models, feature engineering, AutoML frameworks, and Bayesian optimization - to iteratively improve solutions. The optimization process allows the agent to make reasoned, memory-based adjustments to each component of the solution pipeline, ultimately aiming to maximize task-specific metrics through structured reasoning and advanced tools.

Solution generation process

Agent K v1.0’s solution process adapts based on task modality, using customized approaches for different data types. For tabular tasks, the agent uses an AutoML tool built on the RAMP library to format data and handle tasks like column encoding and hyperparameter tuning. For computer vision, NLP, and multimodal tasks, the agent uses deep neural networks with late-fusion architectures, leveraging pretrained models from Torchvision and Torchtext.
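
To illustrate the late-fusion idea, a simplified PyTorch module might encode each modality separately and concatenate the features just before a shared classification head; the specific encoders, dimensions, and layer sizes below are my assumptions, not the paper’s architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

# Simplified late-fusion classifier: per-modality encoders whose features are
# concatenated and passed through a shared head (illustrative, not the paper's code).
class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, tab_dim=32, num_classes=10):
        super().__init__()
        self.image_encoder = tvm.resnet18(weights=tvm.ResNet18_Weights.DEFAULT)
        self.image_encoder.fc = nn.Identity()       # 512-d image features
        self.text_proj = nn.Linear(text_dim, 256)   # e.g. pooled transformer embeddings
        self.tab_proj = nn.Linear(tab_dim, 64)      # engineered tabular features
        self.head = nn.Sequential(
            nn.Linear(512 + 256 + 64, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, image, text_emb, tabular):
        feats = torch.cat([
            self.image_encoder(image),
            self.text_proj(text_emb),
            self.tab_proj(tabular),
        ], dim=-1)
        return self.head(feats)
```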

Additionally, the agent uses HEBO (Bayesian optimization) to tune hyperparameters and a blending tool (an MLP trained on top of the predictions of several models).
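
For reference, the open-source HEBO library exposes a simple suggest/observe loop (as in its README); the search space and objective below are purely illustrative placeholders, not what Agent K actually tunes.

```python
import numpy as np
import pandas as pd
from hebo.design_space.design_space import DesignSpace
from hebo.optimizers.hebo import HEBO

# Illustrative hyperparameter space; the spaces Agent K tunes are task-specific.
space = DesignSpace().parse([
    {"name": "learning_rate", "type": "num", "lb": 1e-5, "ub": 1e-1},
    {"name": "num_layers", "type": "int", "lb": 1, "ub": 6},
])

def objective(params: pd.DataFrame) -> np.ndarray:
    # Placeholder: train a model with these params and return the validation loss
    # as an (n_suggestions, 1) array that HEBO minimises.
    return np.random.rand(len(params), 1)

opt = HEBO(space)
for _ in range(20):
    suggestions = opt.suggest(n_suggestions=1)   # pandas DataFrame of candidate configs
    opt.observe(suggestions, objective(suggestions))

print("Best objective so far:", opt.y.min())
```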

Phase III (Generalisation) - Multi-Task and Active Task Selection

Agent K v1.0 can be extended to handle multiple tasks across domains, evolving into a multi-task and continual learning agent. In this multi-task setup, Agent K maximizes performance by sharing long-term memory, allowing knowledge transfer across data science domains. Each task’s setup and solution phases are formulated as optimization problems, incorporating shared memory to improve efficiency.

The agent builds a curriculum of tasks that balances exploration and exploitation by drawing on successful and failed experiences stored in a retrieval database. Task selection prioritizes tasks similar to successful cases and dissimilar to failures, gradually increasing in difficulty to enhance learning. To measure task similarity and difficulty, the agent generates matrices based on cosine similarity of task descriptions and structural metadata. By factoring in task similarity, difficulty, and recency, Agent K builds an efficient learning sequence, starting with simpler tasks and advancing to more challenging ones, adjusting based on success outcomes.
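
A minimal sketch of how such a similarity-driven curriculum score could be computed is shown below; the embedding inputs, weights, and exact combination are assumptions rather than the paper’s formulation.

```python
import numpy as np

# Illustrative curriculum scoring: prefer tasks similar to past successes,
# dissimilar from past failures, and of gradually increasing difficulty.
def score_candidates(task_embs, success_embs, failure_embs, difficulty,
                     alpha=1.0, beta=1.0, gamma=0.5):
    def cosine(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T                     # (n_candidates, n_references) similarity matrix

    sim_success = cosine(task_embs, success_embs).mean(axis=1) if len(success_embs) else 0.0
    sim_failure = cosine(task_embs, failure_embs).mean(axis=1) if len(failure_embs) else 0.0
    # Higher score: close to successes, far from failures, easier tasks first.
    return alpha * sim_success - beta * sim_failure - gamma * np.asarray(difficulty)

# The next task would be the highest-scoring candidate:
# next_task = int(np.argmax(score_candidates(task_embs, succ_embs, fail_embs, difficulty)))
```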

Agent design

Experiments

Kaggle performance

Agent K v1.0 was tested on 65 Kaggle competitions, where it autonomously set up tasks, generated submissions, and was evaluated on leaderboards using standard Kaggle guidelines. Using the open-source Qwen-2.5 72B model, Agent K achieved a performance level comparable to a Kaggle Grandmaster, earning medals across various data science domains: 6 gold, 3 silver, and 7 bronze. Gold medals spanned tasks in tabular data, computer vision, and NLP, including high-participation competitions in sentiment analysis and large-scale image classification.

Agent K placed above the 80th percentile in 22 tasks and exceeded the 50th percentile in 62% of competitions. Challenges such as non-converging losses, incorrect submission files, and class imbalances were identified in a subset of tasks where Agent K performed below the 20th percentile.

Human comparison

Agent K v1.0’s performance was evaluated against 5,856 human participants who competed in at least three of the same Kaggle competitions. Using a multiplayer Elo scoring system, Agent K achieved an Elo-MMR score of 1542, placing it in the top 38%, outperforming 63% of competitors and ranking between the first and third quartiles of Grandmasters.

A comparative analysis showed Agent K’s performance improvements relative to specific users, with notable gains such as a 200% improvement over “tracyporter” and over 100% against “grapestone5321.” However, Agent K underperformed against some top competitors, such as “tunguz” (by 43%) and “alexryzhkov” (by 30%), indicating room for enhancement.

My personal opinion

While the approach described in this paper is interesting, the results raise some questions. All the competitions where Agent K got a gold or a silver medal are playground or community competitions, which means they did not attract a significant number of experienced participants. The comparison to human performance is also questionable: for example, some of the participants marked as Masters hold this rank in Notebooks or Discussions. While these ranks have a meaning, they aren’t relevant in the context of measuring competition performance.

The approach itself, setting up an automated pipeline with tests, is interesting, but its performance would benefit from further refinement.

paperreview deeplearning nlp llm agent