artgor

MiniMax Sparse Attention: Per-Group Block Selection for Cheap Million-Token Inference

2026-06-15T00:00:00+00:00

Long-context LLMs keep promising the same thing: feed more tokens into the prompt and let the model reason over them. The bottleneck is rarely the window itself: it is the cost of attending over it once agentic workflows, repo-scale code reasoning, and persistent memory push the context into the hundreds of thousands or millions of tokens. Softmax attention is quadratic, so million-token context is not just a modeling challenge, but an inference-cost and deployment challenge.

MiniMax Sparse Attention (MSA) is the attention design behind MiniMax M3, and it tackles this problem with blockwise sparsity built on top of Grouped Query Attention. The idea is to keep exact softmax attention but run it over a tiny, query-dependent subset of the key-value history instead of the whole thing. A lightweight Index Branch decides which blocks matter, and the expensive Main Branch performs exact softmax attention only over those selected blocks. At 1M-token context, MSA cuts per-token attention FLOPs by 28.4x against a dense GQA baseline of the same 109B-parameter configuration, while staying on par with it on quality.

This is the same framing DeepSeek used for its sparse attention. The interesting question is no longer maximum context length, but whether the model can compute over a long context cheaply enough to deploy. MSA’s particular bet is per-group block selection, and most of the design follows from that choice.

You can read about my experience with MSA in the MiniMax M3 review, which covers its multimodal capabilities.

The approach

Two branches over one GQA backbone

MSA splits each attention layer into an Index Branch and a Main Branch that share the same GQA backbone. The Index Branch adds one index-query head per GQA group and a single shared index-key head, scores every causally visible key token, then max-pools those scores up to the block level. For each GQA group it keeps the top-16 of those 128-token blocks, always force-including the local block that holds the query, for a fixed 2,048-token selection budget. The Main Branch then runs standard exact softmax attention restricted to exactly those selected blocks. Per-query cost drops from growing with sequence length to a constant set by the budget, so attention compute stays flat as context grows.
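The selection step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the MSA kernel: the function name `select_blocks` and the flat list of per-token index scores are assumptions made for the example, and counting the forced local block as one of the 16 is my reading of the 16 × 128 = 2,048-token budget.

```python
from typing import List

BLOCK_SIZE = 128  # tokens per block (from the text)
TOP_K = 16        # blocks kept per GQA group (from the text)


def select_blocks(token_scores: List[float], query_pos: int,
                  block_size: int = BLOCK_SIZE, top_k: int = TOP_K) -> List[int]:
    """Return the sorted block ids chosen for one query in one GQA group.

    Hypothetical helper: `token_scores` stands in for the index-branch scores
    of this query against every key position.
    """
    # Only causally visible keys (positions <= query_pos) are candidates.
    visible = token_scores[: query_pos + 1]
    # Max-pool token scores up to block level.
    n_blocks = (len(visible) + block_size - 1) // block_size
    block_scores = [max(visible[b * block_size:(b + 1) * block_size])
                    for b in range(n_blocks)]
    # The block that holds the query is always kept.
    local = query_pos // block_size
    # Rank the remaining blocks by pooled score and fill the rest of the budget.
    others = sorted((b for b in range(n_blocks) if b != local),
                    key=lambda b: block_scores[b], reverse=True)
    return sorted([local] + others[: top_k - 1])


# Usage: 4,096 visible tokens (32 blocks), one strongly scored key in block 2.
scores = [0.0] * 4096
scores[300] = 5.0
chosen = select_blocks(scores, query_pos=4095)
print(len(chosen) * BLOCK_SIZE)  # tokens the Main Branch would attend to
```

However long the history gets, the returned list never exceeds `top_k` blocks, which is why the Main Branch cost per query stays constant.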

The design choice that separates MSA from its neighbors is that the Top-k selection is shared per GQA group rather than across all query heads, and it applies at block granularity rather than token granularity. Each group retrieves its own blocks, which preserves multi-group selectivity, while block-level selection keeps the KV reads contiguous so the kernel can stay efficient.

Training a non-differentiable selector

Top-k selection is not differentiable, so the Index Branch cannot learn from the language-model loss directly. MSA trains it with a KL alignment loss that matches the Index Branch’s block-score distribution to the Main Branch’s actual attention distribution over the selected tokens, so the selector learns to predict which blocks the exact attention would have wanted. Gradient detach confines this auxiliary loss to the index projections alone, keeping it from perturbing the rest of the model. Training starts with an indexer warmup that runs full attention before switching to sparse, which gives the indexer a stable target to imitate. The same warmup recipe is what converts a pretrained dense checkpoint into a sparse one.
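A rough sketch of what such an alignment loss computes, in plain Python. The names `kl_alignment_loss`, `index_block_scores`, and `main_block_attention` are hypothetical, and the KL direction (target relative to prediction) is an assumption, since the text above does not specify it. In a real framework the target would be detached so gradients reach only the index projections; here it is just a list of floats.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_alignment_loss(index_block_scores: List[float],
                      main_block_attention: List[float]) -> float:
    """KL(target || predicted) over blocks (direction is an assumption).

    predicted: softmax of the Index Branch block scores.
    target:    Main Branch attention mass per block, normalized to sum to 1
               and treated as a constant.
    """
    predicted = softmax(index_block_scores)
    total = sum(main_block_attention)
    target = [a / total for a in main_block_attention]
    return sum(t * math.log(t / p)
               for t, p in zip(target, predicted) if t > 0)


# Usage: the loss shrinks as the indexer's ranking approaches the real attention.
attention_mass = [0.7, 0.2, 0.1]
print(kl_alignment_loss([0.0, 0.0, 0.0], attention_mass))   # uninformed indexer
print(kl_alignment_loss([2.0, 0.5, 0.0], attention_mass))   # closer to the target
```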

That conversion path is worth separating from training from scratch, because the two behave differently. MSA-PT is trained sparse from the start; MSA-CPT takes an existing dense GQA checkpoint, swaps in MSA, and continues training.

Experiments

On quality, MSA-PT matches or slightly beats the dense Full-Attention baseline across most of the benchmark table, with the largest margins on multimodal and long-context tasks. MSA-CPT, the conversion route, stays close to the dense checkpoint it started from and is strongest on text, code, and perplexity. The “sparse beats dense” claims warrant some caution: these are research-scale 109B models trained on only 3T tokens, and the multimodal jumps may reflect native sparse pretraining acting as a form of regularization at this scale rather than evidence that sparse attention wins at frontier scale.

The ablations show that a sliding-window baseline held to the same FLOP budget has uniformly higher perplexity than MSA.

At 1M context, MSA delivers a measured 14.2x prefill and 7.6x decode wall-clock speedup on H800 against the dense GQA baseline, both growing with context length. The paper is honest that the runtime gains are smaller than the 28.4x FLOPs reduction would suggest: index construction, Top-k selection, and a less regular memory-access pattern all eat into the theoretical win.
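To put the gap in numbers, here is a small back-of-the-envelope calculation using only the three figures quoted above. The variable names are mine, and "realized fraction" is an informal measure I am using for illustration, not a metric from the paper.

```python
# Figures quoted in the text for 1M-token context on H800.
flops_reduction = 28.4
measured_speedup = {"prefill": 14.2, "decode": 7.6}

# Share of the theoretical FLOPs reduction that shows up as wall-clock speedup.
realized = {phase: speedup / flops_reduction
            for phase, speedup in measured_speedup.items()}

for phase, fraction in realized.items():
    print(f"{phase}: {fraction:.0%} of the FLOPs reduction realized")
# prefill: 50% of the FLOPs reduction realized
# decode: 27% of the FLOPs reduction realized
```

Prefill realizes about half of the theoretical saving and decode roughly a quarter.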

Conclusions

There are two major approaches within sparse-attention research. One sparsifies a model that was already pretrained densely: methods such as H2O, SnapKV, Quest, MInference, and InfLLM reduce the cost of long-context inference by pruning, compressing, or selectively retrieving from the KV cache. While effective, they inherit the quadratic cost of dense pretraining and can only approximate the attention patterns the model originally learned.

MSA belongs to the second group, where sparsity is built into training from the beginning. The model learns to operate under a fixed attention budget rather than being sparsified after the fact. Several architectures pursue this idea in different ways. NSA combines compressed, selected, and sliding-window attention branches; MoBA performs routing at block granularity using block-level summaries and learns routing implicitly through the language-model objective; InfLLM-V2 retrieves relevant blocks without introducing a learned routing network. The most similar is DeepSeek’s DSA, which performs sparse token-level retrieval on top of Multi-head Latent Attention using a lightweight ReLU-based indexer.

MSA’s distinguishing feature is its combination of per-GQA-group retrieval and block-level sparsity. Instead of forcing all heads to share the same sparse view of the context, each GQA group selects its own relevant blocks. At the same time, operating on contiguous blocks rather than individual tokens produces a sparsity pattern that maps efficiently onto GPU kernels, allowing MSA to improve both retrieval flexibility and hardware efficiency.

I like that MSA is a complete, deployed system rather than a proposal, with a released inference kernel and the cheap MSA-CPT conversion route that lets an existing dense model become sparse without retraining from scratch. At the same time, there are certain caveats: the runtime speedup trails the FLOPs reduction, retrieval-heavy long-context subtasks don’t show outstanding performance, and the scope focuses on pretraining, with no RL or post-training and no benchmark against NSA, MoBA, or DSA.

Testing MiniMax M3 on real tasks: repo refactor, screenshot debugging, and Spotify recommendations

2026-06-10T00:00:00+00:00

I got early access to MiniMax M3, so I plugged it into Claude Code and used it on a few tasks that I had wanted to complete for some time: a code audit and refactor of my old web game, two UI bugs from it that I had been putting off, and a music-recommendation experiment built from my Spotify history. I used M3 for the implementation work, then asked Opus 4.8 to review it.

M3 is the first open-weights model (soon to be fully open-sourced on HuggingFace and GitHub) to combine three things in one release: frontier-level coding and agentic ability, a 1M-token context window, and native multimodality. I reviewed MiniMax M2.7 earlier, and M3 is a clear step up from it in the areas I tested.

M3 was most useful when I gave it concrete artifacts — a repo, tests, screenshots, and data exports. It did a lot of real work quickly, but an independent review still caught some regressions.

What MSA is, and why MiniMax keeps changing its attention

MiniMax has changed its attention twice (if you want to know more about attention, you can read my note). MiniMax-01 and M1 used lightning attention, a linear-attention variant, in a 7:1 hybrid — seven linear layers per softmax layer. M2 and M2.7 then reverted to full attention; the team’s candid post Why Did M2 End Up as a Full Attention Model? blamed linear attention’s precision sensitivity, immature infra, and multi-hop deficits — all costs of approximating the softmax.

M3 uses MiniMax Sparse Attention (MSA), which keeps the softmax exact and only narrows where it runs. An index branch cheaply scores blocks of context (one lightweight query per GQA group → block-max-pool → top-k), then the real query heads run ordinary full attention over just the selected blocks. MiniMax reports it running 4× faster than Flash-Sparse-Attention, at ~1/20 the per-token compute of M2, with 9× prefill and 15× decode speedups — their own numbers, unreproducible until the weights and report ship.

So MSA “matching full attention on the vast majority of capabilities” isn’t surprising: it doesn’t approximate, it selects. The only thing that can break is the selection — drop a block that mattered and the answer is gone. The real question is how good the selector is at long range.

Auditing and refactoring an old idle game

A year ago, I vibe-coded an idle game, Eternum Alchemist, with Sonnet, and I wanted to pick it up again. Before adding anything new, I asked M3 to carefully review the code for bugs, security issues, and logic problems. It spent roughly 30 minutes reading and analyzing the repository, which isn’t surprising given that the codebase has ~100 files and ~26k lines of code.

The report was quite good. It was organized by severity (12 critical, around 20 high, 30 medium, 20 low), carried file paths and line numbers, and included a recommended order of work. Some of the most important issues were:

shouldAttack using an integer-modulo model that made every enemy with an attack speed above 1 always attack, so the snake monster was effectively slower than intended.
A lot of unfinished code/configs. For example, after skills reached prestige, they couldn’t level up, because their XP scaling was nested under ranks[rank] while the function read a top-level field and got NaN.

I asked M3 to fix all issues. It worked for ~2h 40m across three phases, increased the number of tests from 188 to 237, and most of the fixes were correct and well tested.

But then I asked Opus to review the changes, and it found two critical regressions that M3’s own green tests had hidden.

M3 added schema validation to the import path, changing the data format and conflicting with the save format. Thankfully, the game is in alpha or pre-alpha stage, so this is fine, but if this were in production, the saves would be broken.
M3 fixed non-working multipliers, but forgot that the crit hit chance was applied in two places, which resulted in it scaling as 1.05 to the power of twice the level. It was exactly the config-drift pattern the audit itself had flagged elsewhere and not fixed here.
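The double application is easy to see with numbers. The snippet below is a toy reconstruction, not the game's code: `crit_multiplier` and its parameters are hypothetical, and only the 1.05 per-level factor comes from the description above.

```python
def crit_multiplier(level: int, per_level: float = 1.05,
                    applications: int = 1) -> float:
    """Multiplier after the per-level bonus is applied `applications` times."""
    return (per_level ** level) ** applications


level = 10
intended = crit_multiplier(level)                 # 1.05 ** 10
doubled = crit_multiplier(level, applications=2)  # same as 1.05 ** 20

print(f"intended: x{intended:.2f}, double-applied: x{doubled:.2f}")
```

Because the exponent doubles, the error compounds with level instead of staying a fixed offset, which is why this kind of bug is easy to miss at low levels.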

Other than that, Opus found that six fixes were partial and six issues were untouched. As a takeaway, I can say that M3 did a large amount of correct, well-structured work quickly. But it was my mistake to let M3 both write the tests and fix the code issues. Next time, I’ll use two separate sessions for it.

Two UI bugs that needed a screenshot

The next two problems were UI-related, and that’s where M3’s multimodality came in handy.

The first was a freeze. On the Skills screen, clicking a skill froze the whole panel, and every click after the first did nothing. Describing the symptom in words got me nowhere — the model kept guessing at event handlers and React state. So I shared the screen: a screenshot plus the open DevTools. After about fifteen minutes of reading the actual rendered DOM, it found the cause, and it was not in any handler. There were two stacked global-CSS collisions, both from Create React App shipping non-scoped CSS. The first was real but only cosmetic: a .main-content grid rule squeezed the Skills window into the left half. The actual click-blocker was a .progress-text rule in ProgressBar.css — position: absolute; top: 0; left: 0; width: 100%; height: 100% — meant to center a percentage label over a progress bar. That global class leaked onto SkillsView’s own .progress-text, and because the nearest positioned ancestor was .main-content, it expanded into a full-size invisible overlay covering the entire view. That caused the issue that I encountered: with no skill selected, there is no XP-info element and no overlay, so the first click renders the overlay and every later click hits it instead. The fix was to put ProgressBar.css under a .progress-bar-container, and a regression test now fails if this breaks.

The second bug was simple. The “Current Bonuses” panel listed every future unlock — at level 10+, 25+, 30+, and so on — as if it were already active. I took a screenshot, described the problem (“this section shows all future bonuses; it should show only the current ones”), and asked it to ultrathink and fix it. It split the panel: “Current Bonuses” now shows only what is active at the current level (a single “3% increased gathering speed” for my level 3 skill), and the future unlocks moved into their own “Upcoming Bonuses” section.

Both fixes were small once found, and both were much easier to diagnose with screenshots. Being able to ask a model to reason about the image demonstrates the value of multimodal models.

Music recommendations from years of Spotify history

The last task was for fun. I have listened to music offline for years (Winamp, AIMP, VOX), but several years ago, I switched to Spotify after my friends pressured me to try new music, and I was curious. At first, I loved Spotify recommendations, but over time, they drifted into either repetition or noise. I wanted to see if M3 could do better, so I exported my extended streaming history and asked it to analyze it in depth, identify my tastes, and recommend new and exploratory artists and songs, with the output as an HTML report plus a CSV of the full list.

The analysis was mostly what I expected (I know what I listen to, and Spotify Wrapped helps too). M3 built a listening profile from about 74k streams over five years (4.3k hours, 2.1k artists): a melancholic-romantic core with an anime and synth-pop streak and a symphonic-metal second life, broken into genre pillars by hours, with Japanese city pop and 80s J-pop near 700 hours, symphonic and power metal near 500, anime and game OST near 400, then everything else.

One funny thing was the visualization of the countries. I have never been to France or Germany, but they were at the top of the chart thanks to VPN.

The analysis was cool and interesting, but some of the plots were questionable. The Platforms chart showed six platforms even though two of them accounted for about 99.9% of the streams and the rest were insignificant.

The recommendations were organized into tiers, from safe to exploratory (“cross-niche bridges,” “adventurous picks with hidden bridges”), each with a one-line reason explaining how the artist connects to something I already listen to. I have not worked through all of them, but the results are positive: most songs are okay, a few are completely off, and several are my new favorites. One of the best finds was 中森明菜 (Akina Nakamori) with DESIRE -情熱-, exactly the kind of song I like and one that Spotify had never shown me.

Conclusion

Across the audit, the UI bugs, and the music experiment, M3 was most useful where the task gave it something concrete to work against: a test suite, a screenshot, a data export. It is fast and cheap enough to run several supervised passes, and its multimodality is a real practical advantage for debugging. The one thing I would not skip is an independent review of anything that matters, because a model that writes both the fix and the tests can be confidently wrong on both, and here it took a second model to catch it. I will keep using it for supervised refactors and screenshot-driven debugging, with a separate reviewer in the loop for the parts I care about.

This post was written in a paid partnership with the MiniMax team. If you want to try MiniMax, you can use this code for a 12% discount.

Book Review: 50 ML Projects to Understand LLMs

2026-06-09T00:00:00+00:00

Amazon

Author’s website

I was offered a copy of 50 ML Projects To Understand LLMs by Mike X Cohen in exchange for an honest review. Rather than building, fine-tuning, or prompting LLMs, the book treats GPT-2 as a scientific specimen and teaches you to investigate it with code, statistics, and controlled experiments across 50 hands-on projects. As I have spent a lot of time working on model validation and LLM evaluation, I liked that the author focuses on statistical discipline throughout the book: permutation tests, multiple-comparison corrections, control baselines, manipulation checks. That kind of validation rigor is what separates a real result from a lucky one, both at work and in ML competitions.

The overall structure

The 50 projects are organized into six chapters that loosely follow the flow of data through a transformer model:

Tokenization: how text becomes integers, whether tokenization is really compression, and how strongly tokenizers favor English.
Embeddings: cosine similarity, comparing models with representational similarity analysis, semantic axes, and analogy vectors.
Output logits: softmax, sampling strategies, the loss function, perplexity, evaluation with HellaSwag, and measuring language bias.
Transformer outputs: the residual stream, hidden states, the logit lens, and patching hidden states to find where a capability lives.
Attention: query-key-value weights and activations, raw versus softmax attention scores, head silencing, and patching attention heads.
MLP: neuron characteristics, grammar tuning, lesioning neurons, supervised probing with XGBoost, and a deliberately silly recommender-system capstone.

Each project has a similar structure: a bit of background, a task box telling you what to build, a figure to reproduce, and an interpretation of what you found. Every project includes two notebooks, a “helper” with gaps to fill and a complete solution, so you can pick your difficulty: code from a blank notebook, fill in the helper, or read the solution. Cohen explicitly tells you it is not cheating to look at the solutions. The whole book uses GPT-2, which is small enough to run on a laptop and open enough to inspect completely.

What I liked

There were many things I liked in this book, and I want to highlight several in particular:

Statistical rigor is built into the analysis: Cohen’s d for effect sizes, permutation testing with a proper discussion of exchangeability, FDR and Bonferroni corrections, split-half cross-validation, and manipulation checks.
The controls and ablations are thorough and precise. One project ablates the least-tuned neurons as a baseline before moving on to the interesting ones; another compares real hidden states against shuffled ones to estimate effective dimensionality. Without that kind of baseline, it is far too easy to convince yourself that a meaningless number means something.
The “try the obvious thing, watch it fail, learn the right approach” structure. You compare embeddings across two models with plain cosine similarity, see why it cannot work, and arrive at representational similarity analysis. You fit a linear regression to categorical token positions, watch it misbehave, and switch to logistic regression.
GPT-2 by itself. Because the model is small, you actually run every experiment yourself instead of reading about someone else’s results.
I liked the conversational tone of the book. “Are you disappointed in the results? I was when I first saw them!” appears more than once, and one whole project ends with the weight distributions being “not terribly interesting.” That is an unusual and welcome thing in a teaching book.
The tokenizer-bias project, where the same sentence needs 36 GPT-2 tokens in English but 557 in Tamil.

The projects I had the most fun with are the ones that go deep into model internals. In Project 32, you patch hidden states inside an indirect-object-identification task and fit a sigmoid to find the layer where the model settles on the answer. In Project 44, you go looking for grammar in MLP projections: at the population level, nouns and verbs look identical, but once you drill down to individual neurons, the difference becomes visible. In Project 35, you analyze why negative raw attention scores enforce sparsity after softmax. These are real interpretability experiments, and you come out able to run your own.
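The softmax point from Project 35 is easy to check by hand. The sketch below uses made-up scores, not values from the book or from GPT-2. Since softmax depends only on differences between scores, what matters is how far a score sits below the largest one; strongly negative raw scores next to a positive one are simply the usual way that gap appears.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative raw attention scores: one positive, the rest negative.
raw_scores = [2.0, -3.0, -5.0, -4.0]
weights = softmax(raw_scores)

for score, weight in zip(raw_scores, weights):
    print(f"raw {score:+.1f} -> weight {weight:.4f}")
```

Almost all of the attention mass lands on the single positive score, and each negative score keeps well under 1% of it.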

I also liked the short section on AI assistance in the introduction. Cohen writes that “the more time goes on, the less I use AI for coding and writing,” explains that he mainly used LLMs for code review and brainstorming, and warns readers that “if you have code you do not understand and solutions you cannot explain, then you’re probably relying too heavily on AI”.

What could have been better

There are a few small things that could have been handled differently:

Top-p sampling is called “nuclear sampling” in a couple of places, but the common name is “nucleus sampling”. A tiny fix, but it would make it easier to find the original paper.
The book focuses almost entirely on GPT-2, with only occasional references to BERT, RoBERTa, and Pythia. Several findings are presented as general properties of transformers, and one or two more cross-model or cross-scale comparisons would make those claims more convincing. The author flags this himself, so it is more a wish than a complaint.
The tokenization chapter spends seven projects on tokenizers but doesn’t work through a byte-pair-encoding example by hand. A short walk-through of how BPE builds its vocabulary would round out an otherwise excellent chapter.

But these are small nitpicks that are completely overshadowed by the book’s strengths.

Conclusion

This book is a good fit for people who already use transformers and want to see inside them: ML engineers and data scientists comfortable in Python, students looking for a hands-on on-ramp into mechanistic interpretability, and anyone who learns better by running experiments than by reading papers. You do need real Python skills and some patience, but you do not need a background in LLM internals, since that is exactly what the book gives you.

GPT-2 is old by the standards of the field, and the specific numbers in any interpretability result will date quickly. The value that lasts is the way of thinking: form a hypothesis, build the right control, test it properly, and stay skeptical of your own results even when they look exciting. Those habits carry over to any model, and they are what I will keep from this book long after GPT-2 stops being a useful teaching tool.

Gamma-World: Simplex Agent Encoding and Hub Attention for Multi-Agent World Models

2026-06-01T00:00:00+00:00

Paper

Code

Project

Most interactive video world models still assume a single agent: one user, one action stream, one generated future. γ-World adopts a harder, more realistic setting: several independently acting agents share the same evolving world. This is essential for games, robotics, embodied AI, social simulation, and agent training environments, where the key problem is not only visual fidelity, but whether multiple agents can act, interact, and remain consistent over time.

The paper’s central contribution is a clean multi-agent design for generative world modeling. It introduces Simplex Rotary Agent Encoding to represent agent identities without fixed slots or arbitrary ordering, Sparse Hub Attention to let agents exchange information without expensive all-to-all attention, and a teacher-student distillation setup that turns a full-context diffusion model into a causal streaming model. γ-World itself is a DiT-based latent video diffusion model trained with flow-matching, extended along an explicit agent axis. The result is a model that can produce action-responsive multi-agent rollouts in real time, while preserving independent controllability and even generalizing from two to four players without additional training.

The approach

The model uses a transformer-based latent video diffusion model adapted for autoregressive generation, in which rotary position embeddings encode spatial and temporal locations. The authors modify this position embedding to account for agent identities and implement a multi-agent-aware attention masking mechanism to reduce computational cost. Training first produces a bidirectional teacher model and then a causal student model that supports the streaming setting.

Unlike traditional world models that generate a future for a single player, γ-World generates futures for all agents simultaneously.

The model:

Receives the first observation from each agent.
Receives an action sequence for each agent.
Predicts future observations for every agent jointly.

The key goal is consistency across agents and time. If agent A moves left and agent B observes agent A, both generated views should agree on what happened.

Shared Action Conditioning

Every agent has its own action sequence. A single shared encoder maps actions to latent representations, and the resulting action features are injected into the transformer at every layer as additive biases. This means that each agent can be controlled independently while the parameters are shared.

Simplex Rotary Agent Encoding

Standard 3D RoPE gives video transformers rotary bands for time, height, and width. Simplex Rotary Agent Encoding (SRAE) adds a fourth band for agent identity: instead of assigning each agent a learned ID embedding, which fixes the roster and breaks symmetry the moment you reorder players, SRAE puts agents at the vertices of a regular simplex in rotary-angle space. Every pair of agents then sits at equal distance, so no agent is privileged, and the encoding does not care which slot a player occupies. The architecture never changes when an agent is added, which is what lets a model trained on two players accept four without retraining.
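The equal-distance property is the part that is easy to verify numerically. The sketch below builds regular-simplex vertices in the simplest way I know (one-hot vectors minus their centroid) and checks the pairwise distances. It is an illustration of the geometry only: the function names are mine, this construction uses one coordinate per agent, and how the paper embeds the simplex into a fixed rotary band is not something this example covers.

```python
import math
from itertools import combinations
from typing import List


def simplex_vertices(n_agents: int) -> List[List[float]]:
    """Vertices of a regular simplex: one-hot basis vectors minus their centroid."""
    centroid = 1.0 / n_agents
    return [[(1.0 if i == j else 0.0) - centroid for j in range(n_agents)]
            for i in range(n_agents)]


def pairwise_distances(points: List[List[float]]) -> List[float]:
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


for n in (2, 4):
    distances = pairwise_distances(simplex_vertices(n))
    print(n, "agents:", len(distances), "pairs, spread",
          max(distances) - min(distances))
```

Because every pair is equally far apart, relabeling or reordering the agents leaves the set of pairwise distances unchanged, which is the symmetry a learned per-slot ID embedding does not guarantee.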

Sparse Hub Attention

The other half of the design is how agents share information. Dense all-to-all attention across agents is quadratic in the number of agents, so it scales poorly for large numbers of agents. Sparse Hub Attention (SHA) routes cross-agent interaction through a small set of learnable hub tokens that act as a compact representation of the environment state. Each agent attends to its own stream and to the hub tokens. The information flow is linear (agent -> hub -> agent) rather than quadratic.
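One way to picture the routing is as an attention mask. The sketch below is a simplified, hypothetical layout (agent tokens first, hub tokens last) and it ignores causality over time. Letting hub queries read from every token is my assumption, made so that the agent -> hub -> agent path exists; the text only states what agents attend to.

```python
from typing import List, Optional


def hub_attention_mask(n_agents: int, tokens_per_agent: int,
                       n_hubs: int) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    n_agent_tokens = n_agents * tokens_per_agent
    total = n_agent_tokens + n_hubs

    def owner(idx: int) -> Optional[int]:
        # Agent id for agent tokens, None for hub tokens.
        return idx // tokens_per_agent if idx < n_agent_tokens else None

    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            if owner(q) is None:
                row.append(True)                    # hub reads everything (assumption)
            elif owner(k) is None:
                row.append(True)                    # agent reads the hubs
            else:
                row.append(owner(q) == owner(k))    # agent reads only its own stream
        mask.append(row)
    return mask


def allowed_pairs(mask: List[List[bool]]) -> int:
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


# Allowed pairs grow linearly with agents; dense attention grows quadratically.
for agents in (2, 4, 8):
    sparse = allowed_pairs(hub_attention_mask(agents, tokens_per_agent=16, n_hubs=4))
    dense = (agents * 16) ** 2
    print(agents, "agents:", sparse, "hub-routed pairs vs", dense, "dense pairs")
```

No agent token attends directly to another agent's tokens, so any cross-agent information has to pass through the hubs.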

This design significantly reduces computational cost while maintaining a communication pathway between agents. Together, these two ideas allow the model to generate coherent multi-agent rollouts, preserve consistency across viewpoints, and scale beyond the two-player settings that dominate most previous work.

Model training and inference

A major challenge in world modeling is balancing generation quality with real-time interactive inference. High-quality diffusion models typically rely on bidirectional attention, allowing them to look into the future during training, but this makes them unsuitable for online generation. Conversely, causal models support streaming generation but often suffer from exposure bias because they are trained on ground-truth histories while being evaluated on their own predictions.

γ-World uses a three-stage training pipeline:

First, the authors train a powerful bidirectional teacher model that has access to the full multi-agent trajectory and can learn rich temporal dependencies and cross-agent interactions.
Next, they train a causal student model using Diffusion Forcing, enabling autoregressive generation while preserving multi-agent communication through Sparse Hub Attention.
Finally, the causal model is distilled into a few-step generator using Conditional Self-Forcing, where the model learns under its own rollout distribution and is encouraged to remain faithful to both the initial observations and the specified action sequences.

This training strategy allows γ-World to combine the strengths of both paradigms: the visual quality and consistency of diffusion models with the low-latency streaming capabilities required for interactive simulation. During inference, the distilled model generates future blocks autoregressively using KV-cached attention, while maintaining cross-agent coordination through shared hub states. The result is a real-time multi-agent world model capable of streaming coherent rollouts at 24 FPS.

Experiments

Training is on two-agent Minecraft trajectories, and the generation-quality numbers are all two-player. Against the concurrent Solaris (a multiplayer-Minecraft world model that uses dense joint attention and learned per-player IDs) and a frame-concatenation baseline, γ-World achieves significantly lower FID and FVD scores in tasks requiring memory, grounding, movement, building, and cross-view consistency. On Memory it cuts FVD from Solaris’s 333.8 to 184.1 and FID from 51.7 to 24.8; on Consistency, the hardest protocol, FVD drops from 443.1 to 280.0.

The ablation studies indicate that each of the paper’s major design decisions contributes to performance. Treating agents as separate streams is more effective than spatially concatenating their observations, Simplex Rotary Agent Encoding consistently outperforms learned view embeddings, and Sparse Hub Attention preserves quality while providing a scalable communication mechanism between agents. Together, these components produce the strongest overall results, supporting the authors’ central claim that agents should be modeled as distinct but exchangeable entities connected through a shared interaction state.

Four-player generation is shown only qualitatively, zero-shot, with no metrics.

Conclusions

The agent-encoding idea loosely echoes recent work such as DroPE, in the sense that both papers treat rotary embeddings as an architectural degree of freedom rather than a fixed implementation detail. However, the motivation is almost opposite: DroPE removes positional embeddings to improve length extrapolation in LLMs, while γ-World reallocates rotary dimensions to introduce a permutation-symmetric agent axis for multi-agent world modeling.

γ-World belongs to the emerging family of interactive video world models, alongside systems such as Oasis/Genie-style single-agent worlds, Matrix-Game-style real-time long-horizon models, MultiWorld-style multi-agent multi-view models, and ActionParty-style action-binding models. Compared with these, γ-World’s strongest distinguishing idea is not just “better video generation”, but a principled treatment of agent exchangeability: the model should not depend on Player 1/Player 2 ordering, learned identity slots, or dense pairwise attention.

Its contribution is therefore architectural and conceptual. Matrix-Game 3.0 emphasizes high-resolution real-time generation and long-horizon memory; MultiWorld emphasizes multi-view consistency; ActionParty emphasizes subject-action binding. γ-World contributes scalable, permutation-symmetric, independently controllable multi-agent generation. The paper is important because it pushes world models closer to actual shared environments rather than controllable video demos.

Testing MiniMax M2.7 via API on three real ML and coding workflows

2026-05-18T00:00:00+00:00

Testing MiniMax M2.7 via API on three real ML and coding workflows

I recently got access to some MiniMax M2.7 API credits, so I decided to plug this model directly into Claude Code and run it on three workflows I do regularly. The same tasks were run using Claude Opus 4.7 as the comparison baseline.

The three workflows: scaffolding an entry for an active Kaggle competition, drafting and auditing knowledge-base notes for my Obsidian vault, and updating an old PyTorch project that became outdated. I wanted to find out how well M2.7 works inside an agentic loop when the task has clear boundaries. The results were consistent across the three runs: M2.7 was useful when the constraints were explicit, and the output format was concrete. It stumbled when important context was left implicit, though some of the same gaps appeared with Opus 4.7 as well.

For the more open-ended cases, I would still keep a human review pass in the loop.

Setup

I added a claude-mm command that points Claude Code at the MiniMax API and ran M2.7 with thinking set to max in the CC interface. I ran on MiniMax’s Plus tier (High-Speed, $40/month), where the context window and per-day throughput were no longer bottlenecks for multi-step agentic work.

claude-mm() {
  ANTHROPIC_BASE_URL="https://api.minimax.io/anthropic" \
  ANTHROPIC_AUTH_TOKEN="$MINIMAX_API_KEY" \
  ANTHROPIC_MODEL="MiniMax-M2.7" \
  ANTHROPIC_DEFAULT_SONNET_MODEL="MiniMax-M2.7" \
  ANTHROPIC_DEFAULT_OPUS_MODEL="MiniMax-M2.7" \
  ANTHROPIC_DEFAULT_HAIKU_MODEL="MiniMax-M2.7" \
  ANTHROPIC_SMALL_FAST_MODEL="MiniMax-M2.7" \
  API_TIMEOUT_MS="3000000" \
  CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC="1" \
  claude "$@"
}

In agentic work, the harness can be as important as the model itself. Most of the failures I describe below had similar reasons: the prompt did not explicitly state a constraint the task depended on, and the model filled the gap with a plausible default. In practice, model quality and harness design are hard to separate. A stronger model may infer missing constraints; a better harness may make those constraints explicit. I treated this as a workflow test, not a pure model benchmark.

Refactoring an old PyTorch project

The first workflow was a refactor: my pytorch_tempest repo is a framework for training neural nets using Hydra + PyTorch Lightning. I wanted to update dependencies, modernize the tooling, and clean up the code issues that had accumulated over time. The merged result is PR: refactoring old code and updating dependencies.

The changes:

Updated CI versions and pre-commit hooks.
Replaced black and flake8 with ruff for both linting and formatting.
Enabled fsdp_sharding_strategy in the Lightning trainer config.
Refreshed the documentation.
Added uv for environment management.
Switched to modern Python typing (list[X] over List[X], X | None over Optional[X]).
Removed duplicate code paths.
Fixed a lot of small issues.

I guided M2.7 explicitly: provided step-by-step requirements (“switch black + flake8 to ruff”, “update the pre-commit config”), reviewed each change before moving to the next, and provided feedback when the diff went outside scope. I had enough tests to check whether anything broke after the changes, and rerunning model training took only several minutes. I had some challenges running CI, and the agent helped me fix them one by one.

A lot of engineers I know do not want to give an agent free rein over a codebase they care about; they want to supervise the execution and know every existing line of code. M2.7 fits this approach well. You can write short, narrow-scope prompts, conduct line-level review, and then move to the next step.

Knowledge notes for the Obsidian vault

The second workflow was writing and auditing notes for my Obsidian vault, where I keep around ML reference notes. I write most of them by hand; sometimes I have an LLM draft a parallel version to compare against and take inspiration from.

It is important to remember that different models prefer different prompt styles. A 100-line prompt tuned for Opus 4.7 does not transfer one-to-one to M2.7. To handle that, I did a small bootstrap: I asked both models to generate notes from the same starting prompt, then asked M2.7 to read both notes and propose an improved prompt for itself. The next iteration used the M2.7-tuned prompt.

I used two prompts (a writer command and a critic agent), each around 100 lines. Here is a condensed version of the first one:

Fill one broken-link stub in the DSWoK vault: research the topic, draft the note in DSWoK voice, run draft-critic-mm, save to the right folder.

1. Read context: writing style guide, frontmatter taxonomy, alias rule.
2. Pick the stub.
3. Locate references — Grep for [[]] across the vault.
4. Pick the destination folder based on topical group.
5. Find a structural template from neighbouring notes.
6. Research via 3–5 sources, search-first — don't trust memory for citations, formula conventions, or post-2024 work.
6.5. Verify each cited URL before pasting it. Hard-to-verify URLs are blocking errors.
7. Determine note type and structure.
8. Draft the note with frontmatter taxonomy + style rules.
9. Cross-link inline to adjacent notes.
10. Run draft-critic-mm and address every blocking issue.
</code></pre></div></div>

<p>The critic agent has a similarly explicit checklist. The point of writing two detailed prompts is to make the evaluation criteria concrete: this means the model needs to make fewer judgment calls and can self-audit its output.</p>

<p>I shared gists with the <a href="https://gist.github.com/Erlemar/37c62e7afca0e25d7553547fabc18afd">command</a> and the <a href="https://gist.github.com/Erlemar/f58422e8923458ce12ea345f4017bd3f">critic</a>.</p>

<p>I tested both M2.7 and Opus on four notes: Negative Sampling, MAP (Mean Average Precision), Cold Start (a recommender-systems problem), and RMSE.</p>

<p>In the RMSE note, M2.7 got several things right:</p>

<ul>
  <li>It flagged that RMSE “does not decompose into bias and variance” the way MSE does, because the square root is nonlinear.</li>
  <li>It cited Hyndman & Koehler 2006 (the canonical forecasting paper introducing MASE and scaled errors) at the right place.</li>
  <li>The Properties section used inline mini-headers with bold formatting, as defined in the style guide.</li>
  <li>The intro was tighter than Opus’s version.</li>
</ul>

<p>What needed editing:</p>

<ul>
  <li>Bullet-label bold (<code class="language-plaintext highlighter-rouge">**Rating prediction.**</code>, <code class="language-plaintext highlighter-rouge">**Not robust to heavy-tailed noise.**</code>) - this is against the style guide, but easily fixed.</li>
  <li>Missing Variants section: RMSLE, NRMSE, and weighted RMSE are absent. This wasn’t defined in the prompt, but it would be a very good addition to the text.</li>
  <li>The Willmott reference pointed to a 2006 JAM paper (DOI 10.1175/JAM2472.1) rather than the canonical 2005 Climate Research paper that practitioners usually cite.</li>
</ul>

<p>The other three notes had the same pattern: solid first drafts, accuracy in the technical core, occasional citation mistakes, and occasional ignoring of style rules. Most of these issues (except the hallucinations) are easy to notice and to fix.</p>

<p>One additional experiment: I asked M2.7 to audit my existing notes and find possible problems. The audit was useful: the model found many formatting issues, including incorrect tags, typos, and missing cross-links. One flagged item was funny, though:</p>

<blockquote>
  <table>
    <tbody>
      <tr>
        <td><code class="language-plaintext highlighter-rouge">Metrics and losses/f1 score.md</code></td>
        <td>Only 1 tag (<code class="language-plaintext highlighter-rouge">evaluation</code>); missing domain tag (<code class="language-plaintext highlighter-rouge">recsys</code>, <code class="language-plaintext highlighter-rouge">nlp</code>, or <code class="language-plaintext highlighter-rouge">cv</code>)</td>
      </tr>
    </tbody>
  </table>
</blockquote>

<p>The F1 score is a general classification metric and does not need a domain tag by my taxonomy. M2.7 inferred a rule by analyzing larger notes, even though such a rule doesn’t exist. The fix was to include the tag hierarchy in the prompt next time, just as I include the writing-style guide for the drafting task.</p>

<p>Across the four notes and the audit run, M2.7 worked well for creating a first draft. It created useful tables and small visualizations, and the technical content was usually right, but references needed checking.</p>

<p>Here are the final versions of the notes after review, heavy editing, and the addition of more ideas:</p>

<ul>
  <li><a href="https://dswok.com/Deep-Learning/Negative-sampling">Negative Sampling</a></li>
  <li><a href="https://dswok.com/General-ML/Cold-start">Cold start</a></li>
</ul>

<h3 id="kaggle-rogii--wellbore-geology-prediction">Kaggle: ROGII — Wellbore Geology Prediction</h3>

<p>The final task was the <a href="https://www.kaggle.com/competitions/rogii-wellbore-geology-prediction/overview">ROGII Wellbore Geology Prediction</a> competition: predicting geological layer tops along well paths from drilling-time measurements. Quasi-spatial data, anisotropic distances, a handful of wells with target labels, and per-well prediction error as the scoring metric.</p>

<p>I’m a Competition Master and Notebook Grandmaster, and I was curious to see how well an agent could perform in a new competition. I intentionally started with a high-level prompt rather than a fully specified implementation plan, because that is a realistic simulation of a first interaction with Kaggle. I’ve accumulated notes, code, and write-ups from earlier Kaggle competitions over the years; I shared them as context, along with explanations of what Kaggle is, what competitions are, and how to participate. The goal was to create a first submission that could be iterated on.</p>

<p>M2.7 spent considerable time on the analysis. The first working result was this notebook <a href="https://www.kaggle.com/code/artgor/rogii-wellbore-final-kriging?scriptVersionId=316989130">rogii-wellbore-final-kriging</a>: a 5-fold validation split by well, ~40 features, and a gradient boosting model. The validation split was not standard:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># 5-fold GroupKFold by well_id
</span><span class="n">unique_wells</span> <span class="o">=</span> <span class="n">pre_ps_train</span><span class="p">[</span><span class="s">"well_id"</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">well_to_fold</span> <span class="o">=</span> <span class="p">{</span><span class="n">w</span><span class="p">:</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">5</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">unique_wells</span><span class="p">)}</span>
<span class="n">fold_assignments</span> <span class="o">=</span> <span class="n">pre_ps_train</span><span class="p">[</span><span class="s">"well_id"</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span><span class="n">well_to_fold</span><span class="p">)</span>
</code></pre></div></div>

<p>The usual approach would be to use <code class="language-plaintext highlighter-rouge">GroupKFold</code> from sklearn, but this “cheap” version was fine for a first pass.</p>
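<p>For comparison, here is a minimal sketch of the same split done with <code class="language-plaintext highlighter-rouge">GroupKFold</code>. The toy arrays stand in for the real training frame; the number of wells, the feature matrix, and the variable names are assumptions made for illustration, not the notebook’s code.</p>

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in for the training data: 12 hypothetical wells, 4 rows each.
well_id = np.repeat(np.arange(12), 4)
X = np.random.default_rng(0).normal(size=(len(well_id), 3))
y = X.sum(axis=1)

# Assign each row to the fold in which it is used for validation.
fold_assignments = np.full(len(well_id), -1)
splitter = GroupKFold(n_splits=5)
for fold, (_, valid_idx) in enumerate(splitter.split(X, y, groups=well_id)):
    fold_assignments[valid_idx] = fold

# Sanity check: every well must land in exactly one fold.
folds_per_well = {w: set(fold_assignments[well_id == w]) for w in np.unique(well_id)}
```

<p>Both versions keep all rows of a well in the same fold; the sklearn splitter additionally tries to balance fold sizes when wells have very different row counts.</p>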

<p>There were two issues, both due to Kaggle-specific mechanics rather than the model’s ML reasoning. ROGII is a <strong>kernel-only</strong> competition: at submission time, the test set you see (three rows with target values exposed) gets swapped out for the real test set (much larger, no target values). A model can easily miss these mechanics unless they are stated in the prompt, and that is what happened here:</p>

<ul>
  <li>The model assumed the three exposed test rows were the entire test set and hardcoded them.</li>
  <li>It treated the exposed target column as a regular feature and used it during feature engineering.</li>
</ul>

<p>The first submission didn’t succeed: with the target leaked into the feature set, the model trained against a column it would not have at inference, and the submission failed due to hardcoding the three available test samples.</p>

<p>Interestingly, <strong>Opus 4.7 also used the exposed target</strong> in feature engineering in the same setup. The kernel-only rules are not something either model picks up from “this is a Kaggle competition” — they have to be in the prompt. After I explicitly explained the mechanics, M2.7 fixed both bugs in one pass, and the submission worked.</p>
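<p>The two fixes can be sketched roughly as follows. This is an illustration under assumptions, not the notebook’s code: the column names (<code class="language-plaintext highlighter-rouge">row_id</code>, <code class="language-plaintext highlighter-rouge">target</code>, <code class="language-plaintext highlighter-rouge">depth</code>) and both helper functions are hypothetical.</p>

```python
import pandas as pd

TARGET = "target"   # hypothetical target column name
ID_COL = "row_id"   # hypothetical id column name

def make_features(df):
    # Drop the target if it happens to be present, so the exposed test rows
    # and the hidden test rows produce the same feature columns.
    return df.drop(columns=[TARGET, ID_COL], errors="ignore")

def make_submission(test_df, predict):
    # Predict for whatever test frame is loaded at run time,
    # instead of hardcoding the rows visible during development.
    feats = make_features(test_df)
    return pd.DataFrame({ID_COL: test_df[ID_COL].to_numpy(), TARGET: predict(feats)})

# Exposed test set: three rows, target visible.
exposed = pd.DataFrame({"row_id": [0, 1, 2], "depth": [1.0, 2.0, 3.0], "target": [5.0, 6.0, 7.0]})
# Hidden test set: larger, no target column.
hidden = pd.DataFrame({"row_id": range(5), "depth": [1.0, 2.0, 3.0, 4.0, 5.0]})

dummy_predict = lambda feats: feats["depth"].to_numpy() * 2.0  # placeholder model
submission = make_submission(hidden, dummy_predict)
```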

<p>It then produced a more advanced version: the <a href="https://www.kaggle.com/code/artgor/rogii-idw-lightgbm-residual?scriptVersionId=317035608">rogii-idw-lightgbm-residual</a> notebook, with inverse-distance-weighting features and a LightGBM residual model (without leaks) on top, scoring better than the first attempt.</p>

<p>In terms of participating in Kaggle competitions, M2.7 worked well for building a scaffold for future work: setting up basic validation, starting feature engineering, and training a model. After that, it can iteratively improve the solution if you provide strict constraints and specify the direction (e.g., improving a specific metric).</p>

<h3 id="cost-and-throughput">Cost and throughput</h3>

<p>I ran this on MiniMax’s $40/month Plus plan and never came close to the rate limits across five days of intensive Claude Code sessions. The subscription dashboard showed that M2.7 processed roughly 91M total tokens, with most of them cache reads. At M2.7’s PAYG rates ($0.30/$1.20 per million input/output, $0.06 per million cache reads), that’s around $8 worth of usage. I didn’t log Opus 4.7’s token usage, but at its rates ($5/$25, $0.50 cache reads), it would cost around 10x.</p>
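<p>The arithmetic behind these estimates can be sketched as below. The prices are the ones quoted above; the split of the ~91M tokens into input, output, and cache reads is a hypothetical one chosen only to be consistent with “mostly cache reads”, since the exact breakdown is not given here.</p>

```python
# Prices in USD per million tokens, as quoted in the text.
M27 = {"input": 0.30, "output": 1.20, "cache_read": 0.06}
OPUS = {"input": 5.00, "output": 25.00, "cache_read": 0.50}

def cost_usd(tokens_m, prices):
    """Total cost for a usage dict expressed in millions of tokens."""
    return sum(tokens_m[kind] * prices[kind] for kind in tokens_m)

# Per-category price ratios (Opus / M2.7): these follow from the prices alone.
ratios = {kind: OPUS[kind] / M27[kind] for kind in M27}

# HYPOTHETICAL split of ~91M tokens, dominated by cache reads.
usage_m = {"input": 5.0, "output": 1.0, "cache_read": 85.0}
m27_cost = cost_usd(usage_m, M27)
opus_cost = cost_usd(usage_m, OPUS)
```

<p>The per-category ratios range from about 8x (cache reads) to about 21x (output), so a cache-heavy workload lands near the lower end of that range.</p>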

<p>In terms of speed, M2.7 returned tool calls and completed multi-step plans faster than Opus 4.7 on the same tasks, subjectively around 2x. I didn’t benchmark rigorously, but the difference was noticeable. Combined with the cost ratio, this means you can run several supervised iterations on M2.7 within the time and budget of one Opus iteration.</p>

<h3 id="where-id-use-m27-going-forward-and-where-i-wouldnt">Where I’d use M2.7 going forward, and where I wouldn’t</h3>

<p>Across the ROGII submission, the four Obsidian notes, and the pytorch_tempest refactor, the results are similar. M2.7 works well when the task has clear boundaries, explicit evaluation criteria, and concrete output requirements. The cases where it fell short had a common cause: the prompt left a piece of context unstated, and the model filled the gap with a reasonable but wrong assumption. In some cases, the same prompt produced the same gap in Opus.</p>

<p>I would use M2.7 going forward for:</p>

<ul>
  <li>Supervised refactors with a narrow scope and rapid iteration.</li>
  <li>First-draft technical content that I am going to review anyway: knowledge notes, drafts, or boilerplate for new repos.</li>
  <li>Audit of existing documents: when I explicitly provide the taxonomy or a list of checks.</li>
  <li>Iterating over existing machine learning code to improve the metrics given explicit constraints.</li>
</ul>

<p>What I would not yet hand to M2.7 unsupervised:</p>

<ul>
  <li>Open-ended ML competition strategy beyond the initial setup. The decisions should be made by humans or by an advanced model. When the direction is split into tasks, M2.7 can start implementing them.</li>
  <li>Reference-heavy technical writing without verification. This is not specific to M2.7 — citation hallucinations happen with most models I have tested. The workaround is the same: verify every URL, and treat it as another step in the workflow.</li>
</ul>

<p>Across the three workflows, M2.7 was the right tool when I could define the constraints. When the task required the model to figure out the constraints itself (what “kernel-only” implies, what taxonomy applies to F1), both M2.7 and Opus failed, and Opus failed less. The trade is roughly 10x in cost per equivalent task. For supervised work with rapid iteration, using M2.7 is worth it.</p>

<p>This post was written in partnership with the MiniMax team. If you are interested in trying MiniMax, you can use this <a href="https://platform.minimax.io/subscribe/coding-plan?code=2Q1yZ8xHj9&source=link">code</a> for a <strong>12% discount</strong>.</p>

<p><strong>UPD</strong>: Now you can read my next blogpost: <a href="https://andlukyane.com/blog/minimax-m3">Testing MiniMax M3 on real tasks: repo refactor, screenshot debugging, and Spotify recommendations</a>.</p>
</article>
<article>
<h1>DeepSeek-V4 Review: Why Million-Token Context Needs Efficient Attention, Not Just Larger Windows</h1>
<p>2026-04-24T00:00:00+00:00</p>
<h2 id="deepseek-v4-review-why-million-token-context-needs-efficient-attention-not-just-larger-windows">DeepSeek-V4 Review: Why Million-Token Context Needs Efficient Attention, Not Just Larger Windows</h2>

<p><a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf">Paper</a></p>

<p><a href="https://deepseek.ai/deepseek-v4">Project</a></p>

<p>Long-context LLMs usually promise a simple capability: put more tokens into the prompt and let the model reason over them. This works up to a point, but it hides a structural bottleneck: a long context window is only useful if the model can actually afford to attend over it during inference, tool use, and long reasoning trajectories.</p>

<p><strong>DeepSeek-V4</strong> changes the focus from maximum context length to <strong>efficient long-horizon computation</strong>. Both available models (<strong>V4-Pro</strong> with 1.6T total / 49B active parameters and <strong>V4-Flash</strong> with 284B / 13B active) support <strong>1M-token context windows</strong>. The whole architecture is built around making that window usable: hybrid compressed attention (Compressed Sparse, Heavily Compressed, and Sliding Window, interleaved across layers), a scaled MoE with <strong>Manifold-Constrained Hyper-Connections</strong>, the Muon optimizer, reduced KV-cache cost, and a post-training recipe that replaces unified-policy RL with <strong>on-policy distillation</strong> of independently trained domain specialists.</p>

<p>The central claim is that future reasoning and agentic systems will not be limited only by model quality, but also by whether the model can maintain useful state over very long trajectories. DeepSeek-V4 is interesting because it treats long context as an infrastructure problem inside the model itself.</p>

<h3 id="sparse-moe-mhc-and-training-stability">Sparse MoE, mHC, and training stability</h3>

<p>The DeepSeekMoE backbone from V3 scales up: V4-Flash has <strong>256 routed experts + 1 shared</strong>, V4-Pro <strong>384 + 1 shared</strong>, both activating 6 experts per token. Load balancing uses V3’s auxiliary-loss-free scheme plus a sequence-wise balance loss to prevent pathological routing on individual sequences. DeepSeek-V4 keeps multiple parts of the previous DeepSeek design: DeepSeekMoE for feed-forward layers, Multi-Token Prediction, and the broader MoE approach. The model also replaces dense FFN layers in the early transformer blocks with MoE layers using hash routing, while keeping the MTP strategy from DeepSeek-V3.</p>

<p>V4 integrates <a href="https://andlukyane.com/blog/paper-review-mhc">mHC</a> directly into the backbone, projecting the residual mixing matrix onto the <strong>Birkhoff polytope of doubly stochastic matrices</strong> via Sinkhorn–Knopp with ~20 normalization iterations. This keeps the residual connection in the generalized identity regime that plain Hyper-Connections break. V4 is the first frontier-scale deployment of mHC, and the authors report it trains cleanly where unconstrained HC diverges.</p>
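<p>To make the projection step concrete, here is a minimal NumPy sketch of Sinkhorn–Knopp normalization. It only illustrates the general technique: the exponential parameterization, the matrix size, and the function name are assumptions, not details taken from the paper.</p>

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=20):
    """Approximately map a square matrix to a doubly stochastic one."""
    m = np.exp(logits - logits.max())          # strictly positive entries (assumed parameterization)
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)   # make every row sum to 1
        m = m / m.sum(axis=0, keepdims=True)   # make every column sum to 1
    return m

rng = np.random.default_rng(0)
mix = sinkhorn_knopp(rng.normal(size=(4, 4)))  # 4 residual streams is a hypothetical size
```

<p>Because every row and every column of the result sums to one, mixing residual streams with it is a weighted average that neither amplifies nor shrinks the total signal, which is the intuition behind staying close to an identity mapping.</p>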

<p>Two stability mechanisms are added on top, both presented as empirical fixes without theoretical grounding. <strong>Anticipatory Routing</strong> computes and caches routing indices <code class="language-plaintext highlighter-rouge">Δt</code> steps earlier, using historical router parameters, then applies them during the later main training step. <strong>SwiGLU Clamping</strong> clamps the gate’s linear component to <code class="language-plaintext highlighter-rouge">[-10, 10]</code> and caps the gate component above at 10, which reportedly eliminates the loss spikes that emerge at trillion-parameter scale.</p>
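<p>A minimal sketch of what SwiGLU clamping could look like, assuming the usual SwiGLU form <code class="language-plaintext highlighter-rouge">silu(gate) * linear</code>. Which projection counts as the “gate” and which as the “linear” component is my reading of the description, and the function names are hypothetical.</p>

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def clamped_swiglu(gate, linear, limit=10.0):
    gate = np.minimum(gate, limit)            # cap the gate branch from above only
    linear = np.clip(linear, -limit, limit)   # clamp the linear branch to [-limit, limit]
    return silu(gate) * linear

# An extreme activation pair is bounded after clamping...
extreme = clamped_swiglu(np.array([100.0]), np.array([-50.0]))
# ...while ordinary activations pass through unchanged.
normal = clamped_swiglu(np.array([1.0]), np.array([2.0]))
```

<p>With both branches bounded by 10, the magnitude of the product cannot exceed roughly 100, which removes the rare outliers that could otherwise blow up the activations.</p>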

<h3 id="hybrid-attention">Hybrid attention</h3>

<ul>
  <li><strong>Compressed Sparse Attention (CSA)</strong> first compresses the KV cache along the sequence dimension, then applies DeepSeek Sparse Attention over the compressed representation. Instead of allowing every query to attend densely to the full history, it compresses groups of tokens into fewer KV entries and then selects a limited number of compressed blocks for each query.</li>
  <li><strong>Heavily Compressed Attention (HCA)</strong> uses a much larger compression ratio, but removes sparse selection. The compressed sequence becomes short enough that dense attention over compressed blocks is affordable. In other words, CSA preserves more selectivity, while HCA provides an aggressively compressed global view.</li>
</ul>
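<p>The difference between the two modes can be illustrated with a toy single-query sketch. Everything here is a simplification under assumptions: mean pooling stands in for the learned compressor, the compression ratios and top-k value are hypothetical, and causality, multiple heads, and the sliding-window branch are ignored.</p>

```python
import numpy as np

def compress(kv, ratio):
    """Mean-pool consecutive groups of `ratio` tokens (stand-in for a learned compressor)."""
    n, d = kv.shape
    n_blocks = n // ratio
    return kv[: n_blocks * ratio].reshape(n_blocks, ratio, d).mean(axis=1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def compressed_attention(q, keys, values, ratio, top_k=None):
    ck, cv = compress(keys, ratio), compress(values, ratio)
    scores = ck @ q / np.sqrt(q.shape[0])
    if top_k is not None and top_k < len(scores):
        keep = np.argsort(scores)[-top_k:]      # CSA-style: keep only the best compressed blocks
        cv, scores = cv[keep], scores[keep]
    return softmax(scores) @ cv, len(scores)    # output and number of entries attended to

rng = np.random.default_rng(0)
keys, values, q = rng.normal(size=(1024, 16)), rng.normal(size=(1024, 16)), rng.normal(size=16)

csa_out, csa_n = compressed_attention(q, keys, values, ratio=4, top_k=32)   # mild compression + selection
hca_out, hca_n = compressed_attention(q, keys, values, ratio=128)           # heavy compression, dense
```

<p>In this toy setup the CSA-style call attends to 32 of 256 compressed entries, while the HCA-style call attends densely to all 8 heavily compressed ones.</p>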

<ul>
  <li><strong>Attention sinks</strong> add learnable sink logits to the attention denominator in CSA and HCA. This means each query head does not have to distribute all attention mass over previous tokens or compressed blocks: the total attention assigned to actual context can be less than 1, and even close to 0. This is useful in long-context attention because not every query should be forced to attend to some distant or weakly relevant context block.</li>
</ul>
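<p>The effect of a sink logit on the softmax can be shown in a few lines. This is a generic sketch of the mechanism for one query, with made-up scores; the function name and the sink value are assumptions.</p>

```python
import numpy as np

def attention_with_sink(scores, sink_logit):
    """Softmax over context scores with one extra sink logit in the denominator."""
    m = max(scores.max(), sink_logit)              # shift for numerical stability
    e = np.exp(scores - m)
    return e / (e.sum() + np.exp(sink_logit - m))  # weights over real context only

scores = np.array([-4.0, -5.0, -3.5])              # weakly relevant context entries
plain = attention_with_sink(scores, -np.inf)       # sink disabled: ordinary softmax
sinked = attention_with_sink(scores, 2.0)          # sink absorbs most of the mass
```

<p>Without the sink, the weights are forced to sum to 1 even though no entry is relevant; with it, the total weight on real context drops below 1%.</p>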

<h3 id="systems-and-precision">Systems and precision</h3>

<p>Pretraining uses <strong>Muon</strong> as the main optimizer (AdamW on embeddings, prediction heads, and RMSNorm weights) across <strong>32T tokens for Flash and 33T for Pro</strong>. Sequence length is ramped 4K → 16K with dense attention; the sparse-attention path is switched on at a 64K stage. The query-key indexer path is quantized FP32 → BF16 for a 2× speedup with 99.7% recall on the top-k set.</p>

<p>Compared with DeepSeek-V3.2, DeepSeek-V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache size at one million tokens. DeepSeek-V4-Flash reduces this further to 10% of the FLOPs and 7% of the KV cache size.</p>

<p>In DeepSeek-V3.2, reasoning traces were preserved across tool-result rounds but discarded when a new user message arrived. DeepSeek-V4 changes this for tool-calling scenarios. If the conversation contains tool calls, the reasoning content is preserved across the entire conversation, including across user message boundaries. This matters because a long-running coding agent needs to remember why it changed a file, which tests failed, which hypotheses were rejected, and what the next step should be.</p>

<p>The authors introduce Quick Instruction tokens for auxiliary tasks such as deciding whether to trigger search or recognizing intent. Instead of using a separate small model that requires redundant prefilling, these special tokens reuse the already-computed KV cache. The point is not just model quality; it is reducing orchestration overhead around the model.</p>

<h3 id="post-training-on-policy-distillation-and-generative-reward-models">Post-training: On-Policy Distillation and Generative Reward Models</h3>

<p>The post-training recipe diverges from V3 and <a href="https://andlukyane.com/blog/paper-review-deepseekr1">DeepSeek-R1</a>. R1 ran GRPO on a single unified policy with rule-based rewards; V4 instead trains <strong>N domain specialists independently</strong> (math, competitive coding, agent use, instruction following, and others), each with its own RL loop on high-quality in-domain data. The merge happens via <strong>On-Policy Distillation (OPD)</strong>: a weighted sum of <strong>full-vocabulary KL divergences</strong> from each specialist’s output distribution into a single student policy, with the student trained on its own on-policy rollouts.</p>

<p>The KL is computed over the full vocabulary rather than from a token-level estimate, which stabilizes gradients when specialists disagree. The per-specialist weighting is tunable. Specialists explore different regions of behavior, and the final model learns to absorb their distributions in contexts it generates itself.</p>
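<p>A minimal sketch of such a loss, assuming logits for the student and each specialist are available at every position of a student-generated rollout. The KL direction (student relative to teacher), the averaging over positions, and all names are assumptions for illustration; the paper’s exact formulation may differ.</p>

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def full_vocab_kl(student_logits, teacher_logits):
    """KL(student || teacher) per position, summed over the whole vocabulary."""
    s, t = log_softmax(student_logits), log_softmax(teacher_logits)
    return (np.exp(s) * (s - t)).sum(axis=-1)

def opd_loss(student_logits, specialist_logits, weights):
    """Weighted sum of mean per-position KLs, one term per specialist."""
    per_specialist = [full_vocab_kl(student_logits, t).mean() for t in specialist_logits]
    return float(np.dot(weights, per_specialist))

rng = np.random.default_rng(0)
student = rng.normal(size=(6, 50))                        # 6 positions, toy vocabulary of 50
specialists = [rng.normal(size=(6, 50)) for _ in range(2)]

loss = opd_loss(student, specialists, weights=[0.7, 0.3])
zero = opd_loss(student, [student, student], weights=[0.5, 0.5])
```

<p>Summing over the full vocabulary uses the entire teacher distribution at each position, instead of a noisy estimate based only on the token that was actually sampled.</p>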

<p>This is why DeepSeek-V4 supports multiple reasoning-effort modes: the model is trained to operate under different inference budgets.</p>

<p>For hard-to-verify tasks, V4 also moves away from conventional scalar reward models. The authors use rubric-guided RL data and a Generative Reward Model, where the actor itself functions as the evaluator. This is less clean than rule-based verification, but it gives them a way to apply RL to tasks where correctness cannot be reduced to tests or exact answers.</p>

<h3 id="experiments">Experiments</h3>

<p>For base models, DeepSeek-V4-Pro-Base improves over DeepSeek-V3.2-Base across many knowledge, reasoning, coding, and long-context benchmarks. For long context, the MRCR results show stable retrieval up to 128K tokens, with degradation beyond that point but still meaningful performance at one million tokens. DeepSeek-V4-Pro-Max reports 0.59 average MMR on MRCR 8-needle at 1M tokens, while V4-Flash-Max reports 0.49.</p>

<p>Overall, the authors claim to reach open-source SOTA in agentic coding, strong world knowledge among open models, and reasoning performance that rivals top closed models.</p>

<p>Some evaluations are blank because APIs were too busy to return responses, and GPT-5.4 was not evaluated on some long-context tasks because its API failed to respond to many queries.</p>

<h3 id="limitations">Limitations</h3>

<p>First, the model is released as a preview. The technical report is detailed, but many practical questions will only be answered by external usage: how stable the one-million-token context is across real agent traces, how often compression loses critical details, and how well the tool-use thinking path generalizes outside DeepSeek’s own harness.</p>

<p>Second, the evaluations are strong but not fully independent. Several evaluations use internal frameworks, internal tasks, or vendor-controlled harnesses. This is normal for frontier model reports, but it means the most useful evidence will come from external SWE-bench-style, terminal, retrieval, and long-context evaluations.</p>

<h3 id="conclusions">Conclusions</h3>

<p>V4 is the first DeepSeek release where the architectural part is more interesting than the RL one. Hybrid compressed attention makes 1M context servable at a fraction of V3.2’s cost, and <strong>on-policy distillation of independent domain specialists</strong> replaces the unified GRPO pipeline from <a href="https://andlukyane.com/blog/paper-review-deepseekr1">DeepSeek-R1</a> with a compositional alternative. R1 showed that RL on a base model can elicit reasoning; V4 now claims that decomposing into specialists and merging via full-vocabulary KL is better than holding every skill in one policy. Compared to <a href="https://andlukyane.com/blog/paper-review-kimik25">Kimi K2.5</a>, V4 focuses on a different bottleneck: K2.5 targets native multimodality and learned agent orchestration, while V4 targets sparse attention and compositional post-training.</p>

<p>Instead of treating context length as a static model property, DeepSeek-V4 treats it as part of the runtime system for reasoning and tool use. This is the right direction. Long-horizon agents will not work just because models become smarter. They need memory that is cheap enough to keep, structured enough to retrieve from, and stable enough to support many steps of reasoning.</p>

<p>I like that the paper is honest about what is still open. The stability tricks are empirical without theory. Opus 4.6 retains a 13-point lead on internal R&D coding, and long-context performance degrades gradually rather than staying flat at 1M.</p>
</article>
<article>
<h1>FIPO: Teaching LLMs Which Thoughts Actually Matter</h1>
<p>2026-04-20T00:00:00+00:00</p>
<h2 id="fipo-teaching-llms-which-thoughts-actually-matter">FIPO: Teaching LLMs Which Thoughts Actually Matter</h2>

<p><a href="https://arxiv.org/abs/2603.19835">Paper</a></p>

<p><a href="https://qwen-pilot.notion.site/fipo">Notion writeup</a></p>

<p><a href="https://github.com/qwenpilot/FIPO">Code</a></p>

<p>Most reasoning models today rely on outcome-based RL: generate a solution, check if the answer is correct, and reinforce the whole trajectory. As reasoning becomes longer, the learning signal collapses - important steps and irrelevant tokens receive the same credit, and performance plateaus.</p>

<p>FIPO addresses this directly by introducing <strong>token-level credit assignment</strong> based on future impact. This turns a sparse, outcome-only signal into a dense, structured one. The authors train <strong>Qwen2.5-32B-Base</strong> on top of the DAPO recipe inside VeRL and evaluate on <strong>AIME 2024</strong>. They report Pass@1 going from 50.0% (DAPO) to a peak of 58.0%, converging around 56.0%. Response lengths roughly double during training, from around 4k tokens to more than 10k.</p>

<h3 id="fipo">FIPO</h3>

<p>For each position in a response, the authors compute the log-ratio between the current and old policy at every subsequent position, then sum these shifts with a discount factor that decays with distance. This is the <strong>Future-KL</strong>: a per-token statistic that measures how much the policy’s behaviour on the rest of the rollout has shifted since the last update. Closer future tokens matter more than distant ones, and the discount works as a soft half-life rather than a hard cutoff.</p>
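<p>The discounted sum can be sketched with a single backward pass over the rollout. This is an illustration under assumptions: whether the sum starts at the current token or the next one, the half-life value, and the function name are all my choices, not details confirmed by the paper.</p>

```python
import numpy as np

def future_kl(logp_new, logp_old, half_life=32.0):
    """Discounted sum of log-ratio shifts from each position to the end of the rollout."""
    delta = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    gamma = 0.5 ** (1.0 / half_life)        # weight halves every `half_life` tokens
    out = np.zeros_like(delta)
    acc = 0.0
    for t in range(len(delta) - 1, -1, -1):  # walk backwards, reusing the running sum
        acc = delta[t] + gamma * acc
        out[t] = acc
    return out

# Toy rollout where every token's log-probability rose by 1 after the update.
fkl = future_kl([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], half_life=1.0)
```

<p>With a half-life of one token, the three positions get 1.75, 1.5, and 1.0: earlier tokens accumulate more future shift, but each extra step of distance counts half as much.</p>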

<p>Future-KL can be unstable due to distributional shifts, so it needs some mechanisms to avoid destabilizing the training process. First, a <strong>dual-clip mask</strong> zeroes out tokens whose importance ratio exceeds a threshold around 10, so that extreme values don’t hurt the gradients. Second, the future-KL value is mapped through an exponential clip into a bounded <strong>influence weight</strong>. Tokens whose future trajectory shifted a lot after the update receive larger effective advantages; tokens that did not move the future of the rollout receive less weight.</p>
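<p>The two safeguards can be sketched as small element-wise operations. The threshold of 10 comes from the text; the weight bounds, the exact point where the mask is applied, and the function names are hypothetical.</p>

```python
import numpy as np

def dual_clip_mask(ratio, cap=10.0):
    """1.0 for tokens with a sane importance ratio, 0.0 for extreme ones."""
    return (np.asarray(ratio, dtype=float) <= cap).astype(float)

def influence_weight(future_kl, w_min=0.8, w_max=1.2):
    """Map Future-KL through an exponential, then clip into a bounded range (bounds are assumed)."""
    return np.clip(np.exp(np.asarray(future_kl, dtype=float)), w_min, w_max)

ratio = [0.5, 12.0, 3.0]            # the second token has an extreme importance ratio
mask = dual_clip_mask(ratio)

fkl = [-5.0, 0.0, 5.0]              # toy Future-KL values
weights = influence_weight(fkl)

advantage = 1.0                     # outcome-level advantage shared by the rollout
token_advantages = advantage * mask * weights
```

<p>A token whose future did not move keeps a weight of 1, so in that case the update reduces to the uniform-credit baseline.</p>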

<h3 id="experiments">Experiments</h3>

<ul>
  <li>FIPO: 58.0% peak, around 56.0% at convergence.</li>
  <li>DAPO baseline on Qwen2.5-32B-Base: 50.0% Pass@1. Reproduced DeepSeek-R1-Zero-32B: ~47%.</li>
</ul>

<p>Both DAPO and FIPO show length growth, but the FIPO curves are longer at the same step count. That is consistent with denser credit letting the model commit to longer reasoning - useful tokens in the middle of a rollout can now be rewarded directly.</p>

<p>The main drivers of FIPO’s effectiveness are: the emergence of length-based scaling in reasoning chains, the positive learning signal, and the significantly improved stability of the optimization process.</p>

<h3 id="limitations">Limitations</h3>

<ul>
  <li>Compute overhead from per-step future-KL accumulation and 10k-plus rollouts;</li>
  <li>Narrow task scope (math reasoning only);</li>
  <li>Fixed dataset (DAPO’s open math corpus, not scaled up);</li>
  <li>A base-model confound that makes it hard to isolate the contribution of the RL recipe from the underlying pretraining and SFT.</li>
</ul>

<h3 id="conclusions">Conclusions</h3>

<p>FIPO targets a different problem than methods like GRPO, DAPO, and recent reasoning systems (<a href="https://andlukyane.com/blog/paper-review-deepseekr1">DeepSeek-R1</a>, o-series). Most existing approaches improve reasoning by scaling models, improving sampling, or refining reward signals. FIPO instead changes <strong>how the reward is distributed within a trajectory</strong>.</p>

<p>This makes it fundamentally different:</p>

<ul>
  <li>Compared to GRPO/DAPO, it removes uniform credit assignment</li>
  <li>Compared to PPO-style RLHF, it avoids critics while still providing dense signals</li>
  <li>Compared to top reasoning models, it suggests gains can come from training signal design, not just scale</li>
</ul>

<p>I am less sure how well the gains hold up outside AIME or under a compute-matched comparison. The short-horizon future-KL proxy probably rewards some variance that is not reasoning-relevant. For now, it looks like a promising direction for reasoning-specific training rather than a drop-in upgrade.</p>
</article>
<article>
<h1>Book Review: Unlocking Data with Generative AI and RAG, Second Edition</h1>
<p>2026-04-09T00:00:00+00:00</p>
<h2 id="book-review-unlocking-data-with-generative-ai-and-rag-second-edition">Book Review: Unlocking Data with Generative AI and RAG, Second Edition</h2>

<p><a href="https://www.amazon.com/Unlocking-Data-Generative-RAG-fundamentals-ebook/dp/B0G2B5VLL8/">Amazon</a></p>


<p>I was offered an opportunity to read <strong>Unlocking Data with Generative AI and RAG, Second Edition</strong>, by Keith Bourne, in exchange for an honest review. I <a href="https://artgor.medium.com/book-review-unlocking-data-with-generative-ai-and-rag-3ec7cab074a5">reviewed the first edition</a> back in 2024 and liked it, so I was curious to see how the second edition would handle the fact that <strong>RAG</strong> has since absorbed an entirely new layer of the stack — agents, graph retrieval, semantic caches, and memory systems. The book covers the modern RAG landscape across 20 chapters in three parts: classical RAG foundations, production-grade retrieval and evaluation, and an entirely new Part III on agentic RAG spanning <strong>LangGraph</strong>, ontology-driven graph RAG with <strong>Neo4j</strong>, semantic caching, and the <strong>CoALA</strong> memory framework. I liked this book, and its biggest strength is the continuous running example that carries you from a basic retriever in Chapter 2 all the way to a stateful, learning agent in Chapter 19 — a thread very few books manage to present.</p>

<h3 id="the-overall-structure">The overall structure</h3>


<p>The book is organized into three parts:</p>
<ul>
  <li>Part I (Chapters 1–6) introduces RAG and its vocabulary, gets a complete pipeline running by Chapter 2, and then layers in practical applications, a security chapter with red and blue teaming, and a Gradio UI for demos.</li>
  <li>Part II (Chapters 7–11) goes deeper into the components: vectors and vector stores, similarity search, evaluation with <strong>Ragas</strong>, the <strong>LangChain</strong> retrievers and integrations, and the loaders, splitters, and output parsers that hold a real RAG system together.</li>
  <li>Part III (Chapters 12–19) is where the second edition contributes the most: agents and <strong>LangGraph</strong>, ontology engineering with <strong>Protégé</strong>, graph RAG on <strong>Neo4j</strong>, semantic caching, the <strong>CoALA</strong> memory framework, and a capstone investment-advisor agent that integrates all four CoALA memory types.</li>
</ul>

<p>The implementations use real libraries rather than building everything from scratch, which mirrors how most teams actually work and saves the reader from the usual from-scratch boilerplate.</p>

<h3 id="what-i-liked">What I liked</h3>


<p>There were many things I liked in this book, and I want to highlight several in particular:</p>

<ul>
  <li>The single running example is used throughout all 20 chapters. I’ve read enough ML books where each chapter is its own throwaway notebook, so I really appreciate it when one isn’t, and the continuity makes it easier for me to follow the ideas presented in the book.</li>
  <li>The small but important Chapter 3 detail of returning sources alongside answers. Until we can fully trust LLMs, being able to answer the question “where did this answer come from?” is very important.</li>
  <li>Chapter 9’s evaluation is built around <strong>Ragas</strong> — faithfulness, answer relevancy, context precision, context recall, plus direct insights from a Ragas co-founder. A lot of evaluation work happens after deployment, which matches my experience on several real projects.</li>
  <li>Chapters 13–14 guide you from building a financial ontology in <strong>Protégé</strong> to loading it into Neo4j with hybrid embeddings that blend text and graph structure. Designing a clean ontology has the same slow, painful, valuable feel as the months we spent organizing data labeling on one of my past projects: it is the kind of upfront work that pays off everywhere downstream.</li>
  <li>The Chapter 5 red team / blue team prompt-injection lab. Security chapters in ML books usually read like policy documents, while this one actually has you attacking and defending your own RAG pipeline, which is a much better way to build practice.</li>
</ul>


<p>The standout production chapter for me was Chapter 15 on semantic caches. The long-tail framing of real-world queries, the cross-encoder verification step, the adaptive thresholds, and the eviction policy discussion are exactly the level of detail that separates a toy semantic cache from one that won’t embarrass you in production. Chapter 16 had a great comparison of three agentic memory frameworks: <strong>Mem0</strong>, <strong>LangMem</strong>, and <strong>Zep/Graphiti</strong>. And Chapter 19 brings everything together into a capstone investment-advisor agent that uses working, episodic, semantic, and procedural memory simultaneously.</p>

<h3 id="what-could-have-been-better">What could have been better</h3>

<p>There are a few small things that could have been handled differently:</p>

<ul>
  <li>Failure modes are under-discussed. Almost every chapter sells the upside of its technique and skips what breaks. Agents make systems slower and harder to debug, cached answers can become stale, and memory stores can grow in ways that hurt retrieval. Such a comprehensive book could be more honest about the failure side, and I’d love a “what goes wrong” subsection in each Part III chapter in a future edition.</li>
  <li>A few places could use head-to-head numbers. When graph-based RAG is introduced, or when the memory-equipped agent is presented, I’d have liked even one quantitative comparison against a simpler baseline on the same questions — the architectures are convincing in concept, and a small numbers-on-a-table moment would make them convincing in practice too.</li>
  <li>The introduction chapter includes a table listing popular models and their context lengths as of October 2025; some of these models were already outdated at that time.</li>
</ul>

<p>But these are small nitpicks that are completely overshadowed by the good sides of the book.</p>

<h3 id="conclusion">Conclusion</h3>

<p>This book is a good fit for RAG practitioners past the hello-world stage — people who already know what an embedding is and what a retriever does, and who now want to go beyond basic vector search. It is particularly useful for engineers being asked to turn a working RAG prototype into something that handles evaluation, security, caching, and agentic workflows without falling apart in production. If you already own the first edition, the question is whether Part III is worth the upgrade, and the answer is yes — Chapters 12 through 19 are essentially a self-contained book on agentic RAG, and very few other resources cover this terrain end-to-end with working code.</p>

<p>RAG is moving fast, and any book on the topic will age quickly in its specifics — LangChain APIs will churn, new memory libraries will appear, and several embedding-model tables will look dated within a year. The value here is less in the details of any particular notebook and more in the mental scaffold: how retrieval quality connects to evaluation, how agents extend the RAG loop, why memory systems matter once an agent lives longer than a single turn, and where graph-based retrieval earns its weight. That scaffold holds up, even as the field continues to evolve.</p>
</article>
<article>
<h1>Book Review: A Practical Guide to Reinforcement Learning from Human Feedback</h1>
<p>2026-04-06T00:00:00+00:00</p>
<h2 id="book-review-a-practical-guide-to-reinforcement-learning-from-human-feedback">Book Review: A Practical Guide to Reinforcement Learning from Human Feedback</h2>

<p><a href="https://www.amazon.com/Practical-Guide-Reinforcement-Learning-Feedback/dp/B0FV3414ST/ref=sr_1_2?crid=160SZQL0IJGZU&dib=eyJ2IjoiMSJ9.pfBEIkmcg4uKFShSCU2chy-aN-16qA3kbGpR5fyvBd8O7iy9IVMBkOHkVgTDYo51RtXqDqPzCZazu69tUQVyBlVHY9fxOWlk4j4Ji7oRJVZivh0VPH22I4bBuchG9Mvkk5DHx_i4js2-PlfBkMtAeR5wzQ8QZ5wzg8piIqagfVGRp02-AcKkHKxlXu05Kn-YIWCTXv_7qgOLv5W1vYOnxQ.9tFnY5IzjLbWu4nmf_TXK51XfEcRPt26t-fkPPg5UVQ&dib_tag=se&keywords=reinforcement+learning+human+feedback&qid=1773219960&sprefix=reinforcement+learning+human%2Caps%2C288&sr=8-2">Amazon</a></p>

<p><a href="https://www.linkedin.com/in/sandipdkulkarni/">Author’s LinkedIn page</a></p>


<p>I was offered the chance to read <strong>A Practical Guide to Reinforcement Learning from Human Feedback</strong> by Sandip Kulkarni in exchange for an honest review. The book covers the full RLHF pipeline across 12 chapters: from classical reinforcement learning through reward modeling and PPO-based fine-tuning, and then into newer methods like <strong>DPO</strong>, <strong>RLAIF</strong>, and <strong>Constitutional AI</strong>. I liked this book and consider it a well-structured learning resource with solid theoretical foundations and a lot of practical examples.</p>

<h3 id="the-overall-structure">The overall structure</h3>


<p>The book follows a deliberate three-stage progression:</p>
<ul>
  <li>Explaining the core principles of reinforcement learning and policy optimization</li>
  <li>Building a complete RLHF pipeline for LLMs</li>
  <li>Exploring the evolution of alignment research</li>
</ul>

<p>By the time you reach DPO in Chapter 10, you understand why it exists, because you have already built a reward model, dealt with PPO’s clipped objective, and seen the full canonical RLHF loop in action. That kind of understanding is hard to get from blog posts or paper summaries alone.</p>

<p>I also liked that the implementations use real libraries (<strong>TRL</strong>, <strong>PEFT</strong>, Hugging Face ecosystem) rather than building everything from scratch. In industry, we use existing tooling to avoid bugs and save time, and the book does the same. The memory optimization advice scattered through the middle chapters (gradient checkpointing, staged model loading, working within Colab constraints) is the kind of practical wisdom that many books skip entirely.</p>

<h3 id="what-i-liked">What I liked</h3>


<p>There were many things in this book that I liked, and I want to highlight several in particular:</p>
<ul>
  <li>The discussions on annotator bias and labeling evaluation were especially interesting for me, as on one of my projects we spent months organizing data labeling</li>
  <li>Using human keyboard inputs for Mountain Car demonstrations was fun</li>
  <li>The code walkthroughs for configuring PEFT and LoRA were highly practical and well-written</li>
  <li>Using smaller models like Qwen2-0.5B-Instruct makes it easier to play with them when you don’t have good enough hardware for larger models</li>
  <li>The visualizations explaining the intuition behind the algorithms were great.</li>
</ul>


<p>Chapters 9–12 are very interesting and useful; they cover modern approaches and offer many practical tips. The Constitutional AI section includes deployment architecture, transparency mechanisms, and governance considerations that feel like they were written by someone thinking about production systems, not just research experiments. The comparison between DPO and PPO trade-offs is well-argued and clearly presented. The evaluation chapter tackles genuinely hard problems (self-preference bias of LLM judges, mode collapse, the difficulty of proxy evaluations) without pretending there are clean answers.</p>

<h3 id="what-could-have-been-better">What could have been better</h3>

<p>There are some things that could have been handled differently or better:</p>
<ul>
  <li>I’m not sure if it was necessary to explain self-attention and transformers in such detail.</li>
  <li>The TRL library versions are inconsistent across chapters, and since TRL’s API changed significantly between these versions, it would be great if all chapters used the latest version.</li>
  <li>I’d love to see some examples of UI/UX for data collection and user annotation.</li>
</ul>

<p>But these are small nitpicks that are completely overshadowed by the good sides of the book.</p>

<h3 id="conclusion">Conclusion</h3>

<p>This book is a good fit for ML practitioners seeking a single, structured resource that covers the RLHF landscape from foundations to modern methods. It is particularly useful for people transitioning into post-training roles, or for anyone who learns better from implementations than from papers. The code is reproducible, the tooling is up to date, and the pedagogical progression genuinely helps build intuition.</p>

<p>RLHF is moving fast, and any book on the topic will age quickly in its specifics. The value here is less in the details of any particular implementation and more in the mental scaffold: what reward models do, why evaluation is hard, how newer methods relate to older recipes, and why alignment is an engineering discipline rather than a collection of tricks. That scaffold holds up, even as the field continues to evolve.</p>
</article>
<article>
<h1>Redesigning My Personal Website with Claude Code</h1>
<p>2026-03-23T00:00:00+00:00</p>
<h2 id="redesigning-my-personal-website-with-claude-code">Redesigning My Personal Website with Claude Code</h2>


<p>Three years ago, I wrote about <a href="https://andlukyane.com/blog/how-i-created-this-website">creating this website</a> using Jekyll, GitHub Pages, and a custom domain. That post is still one of the most-read things on my blog. The site served me well since then — it helped during job interviews, got me a few consulting projects, and became a home for 190+ ML paper reviews.</p>

<p>But I felt that by early 2026, the site was in need of a refresh:</p>
<ul>
  <li>Some content was outdated (About and Career pages weren’t updated for years)</li>
  <li>Some pages weren’t structured well (the Activities page was a flat chronological list, the Projects page had exactly two entries, and the Tags page was a wall of unsorted tags)</li>
  <li>And the site itself lacked many basic features, like search or dark mode support</li>
</ul>

<p>I did the redesign in two waves. I started with infrastructure improvements (search, dark mode, structured data, footer, sharing). Then I did a full content and layout overhaul using <strong>Claude Code</strong>.</p>

<h3 id="wave-1-infrastructure-improvements">Wave 1: infrastructure improvements</h3>

<p>Before touching any content, I wanted to fix the technical foundation. These changes took several sessions.</p>

<h4 id="full-text-search">Full-text search</h4>

<p>The site had no search functionality at all. With 190+ posts, finding something specific meant scrolling through the blog listing or using the browser’s Ctrl+F on the tags page. I added a client-side search powered by a JSON index file that Jekyll generates at build time. The search modal opens with <code class="language-plaintext highlighter-rouge">Ctrl + K</code> (or <code class="language-plaintext highlighter-rouge">Cmd + K</code> on Mac) and supports multi-term filtering across titles, descriptions, tags, and content.</p>

<p>The search index is generated from a <code class="language-plaintext highlighter-rouge">search.json</code> template written in Liquid:</p>

<div class="language-liquid highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[<span class="p">{%</span><span class="w"> </span><span class="nt">for</span><span class="w"> </span><span class="nv">post</span><span class="w"> </span><span class="nt">in</span><span class="w"> </span><span class="nv">site.posts</span><span class="w"> </span><span class="p">%}</span>
  {
    "title": <span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">title</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">jsonify</span><span class="w"> </span><span class="p">}}</span>,
    "url": "<span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">url</span><span class="w"> </span><span class="p">}}</span>",
    "date": "<span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">date</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">date</span><span class="p">:</span><span class="w"> </span><span class="s1">'%b %d, %Y'</span><span class="w"> </span><span class="p">}}</span>",
    "description": <span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">description</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">jsonify</span><span class="w"> </span><span class="p">}}</span>,
    "tags": <span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">tags</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">jsonify</span><span class="w"> </span><span class="p">}}</span>,
    "content": <span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">content</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">strip_html</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">truncatewords</span><span class="p">:</span><span class="w"> </span><span class="mi">50</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">jsonify</span><span class="w"> </span><span class="p">}}</span>
  }<span class="p">{%</span><span class="w"> </span><span class="kr">unless</span><span class="w"> </span><span class="nb">forloop.last</span><span class="w"> </span><span class="p">%}</span>,<span class="p">{%</span><span class="w"> </span><span class="kr">endunless</span><span class="w"> </span><span class="p">%}</span>
<span class="p">{%</span><span class="w"> </span><span class="nt">endfor</span><span class="w"> </span><span class="p">%}</span>]
</code></pre></div></div>

<p>The JavaScript fetches this JSON once on the first search, then filters locally. It is not as powerful as Algolia or Lunr, but it works well enough for a static site with a few hundred posts and requires no external service.</p>
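<p>The fetch-once-then-filter pattern can be sketched roughly as follows. This is an illustration rather than the code that runs on the site: the function names <code class="language-plaintext highlighter-rouge">loadIndex</code> and <code class="language-plaintext highlighter-rouge">filterPosts</code>, the <code class="language-plaintext highlighter-rouge">/search.json</code> URL, and the rule that every term must match are assumptions.</p>

```javascript
// Hypothetical sketch of a lazy, client-side search over the JSON index.
let indexPromise = null;

// Fetch the index on first use only; later calls reuse the same promise.
// fetchFn is injectable so the function can be exercised without a network.
function loadIndex(fetchFn = fetch, url = "/search.json") {
  if (!indexPromise) {
    indexPromise = fetchFn(url).then((res) => res.json());
  }
  return indexPromise;
}

// Multi-term filter: a post matches when every whitespace-separated term
// appears somewhere in its title, description, tags, or content.
function filterPosts(posts, query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [];
  return posts.filter((post) => {
    const haystack = [
      post.title,
      post.description,
      (post.tags || []).join(" "),
      post.content,
    ].join(" ").toLowerCase();
    return terms.every((term) => haystack.includes(term));
  });
}
```

<p>A search handler would then call <code class="language-plaintext highlighter-rouge">loadIndex().then((posts) =&gt; filterPosts(posts, input.value))</code> on each keystroke, paying the network cost only once.</p>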

<h4 id="dark-mode">Dark mode</h4>

<p>The site already had a basic dark mode toggle from an earlier refactor, but it was incomplete — many elements (code blocks, tables, form inputs, the career timeline) still used hardcoded light colors. I expanded the <code class="language-plaintext highlighter-rouge">dark-mode.css</code> file from about 100 lines to over 250, covering every component on the site. The dark mode toggle also needed to sync with the Utterances comment theme, so switching to dark mode now sends a <code class="language-plaintext highlighter-rouge">postMessage</code> to the Utterances iframe:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">updateUtterancesTheme</span> <span class="o">=</span> <span class="p">(</span><span class="nx">theme</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
  <span class="kd">const</span> <span class="nx">utterancesFrame</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nx">querySelector</span><span class="p">(</span><span class="dl">'</span><span class="s1">.utterances-frame</span><span class="dl">'</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">utterancesFrame</span><span class="p">)</span> <span class="p">{</span>
    <span class="nx">utterancesFrame</span><span class="p">.</span><span class="nx">contentWindow</span><span class="p">.</span><span class="nx">postMessage</span><span class="p">(</span>
      <span class="p">{</span> <span class="na">type</span><span class="p">:</span> <span class="dl">'</span><span class="s1">set-theme</span><span class="dl">'</span><span class="p">,</span> <span class="na">theme</span><span class="p">:</span> <span class="nx">theme</span> <span class="p">},</span>
      <span class="dl">'</span><span class="s1">https://utteranc.es</span><span class="dl">'</span>
    <span class="p">);</span>
  <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Speaking of Utterances: the comments had an embarrassing bug. The comment section did not appear at all until you manually refreshed the page. The cause was that the site uses AJAX navigation (via History.js in <code class="language-plaintext highlighter-rouge">personal.js</code>), so navigating to a post did not trigger a full page load. The Utterances <code class="language-plaintext highlighter-rouge">&lt;script&gt;</code> tag was embedded inline in the post layout, and inline scripts only execute on the initial load, so AJAX page transitions skip them entirely. The fix was to replace the inline script with a placeholder <code class="language-plaintext highlighter-rouge">&lt;div id="utterances-container"&gt;</code> and move the Utterances initialization into <code class="language-plaintext highlighter-rouge">personal.js</code>, where it runs on every page transition.</p>

<p>This also handles the initial dark mode state — the theme is read from <code class="language-plaintext highlighter-rouge">localStorage</code> at injection time, so comments load with the correct theme from the start. The <code class="language-plaintext highlighter-rouge">postMessage</code> approach described above is still needed when the user toggles dark mode while comments are already visible.</p>
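<p>A rough sketch of that initialization is shown below. It is not the actual contents of <code class="language-plaintext highlighter-rouge">personal.js</code>: the function names, the <code class="language-plaintext highlighter-rouge">"theme"</code> storage key, the <code class="language-plaintext highlighter-rouge">"pathname"</code> issue mapping, and the repository argument are assumptions. The script URL and attribute names follow the standard Utterances embed snippet.</p>

```javascript
// Hypothetical sketch: inject the Utterances client on every page transition.
// Builds the attributes for the embed script from the stored theme preference.
function utterancesAttrs(storedTheme, repo) {
  return {
    src: "https://utteranc.es/client.js",
    repo: repo, // "owner/name" of the GitHub repo holding the comments
    "issue-term": "pathname", // assumed mapping between pages and issues
    theme: storedTheme === "dark" ? "github-dark" : "github-light",
    crossorigin: "anonymous",
  };
}

// doc and storage are passed in (document, localStorage in the browser)
// so the function can be exercised without a real DOM.
function injectUtterances(doc, storage, repo) {
  const container = doc.getElementById("utterances-container");
  if (!container) return false; // not a post page, nothing to do
  container.innerHTML = ""; // drop any widget left from a previous page
  const script = doc.createElement("script");
  const attrs = utterancesAttrs(storage.getItem("theme"), repo);
  for (const [name, value] of Object.entries(attrs)) {
    script.setAttribute(name, value);
  }
  script.async = true;
  container.appendChild(script);
  return true;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">injectUtterances(document, localStorage, "owner/repo")</code> from the page-transition callback recreates the script element each time, which is what makes it execute after AJAX navigation.</p>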

<h4 id="structured-data-and-opengraph">Structured data and OpenGraph</h4>

<p>I added JSON-LD structured data for both the site (WebSite schema) and individual blog posts (BlogPosting schema). The blog post schema includes headline, description, author, publication date, and keywords from tags:</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;script </span><span class="na">type=</span><span class="s">"application/ld+json"</span><span class="nt">&gt;</span>
<span class="p">{</span>
  <span class="dl">"</span><span class="s2">@context</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">https://schema.org</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">@type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">BlogPosting</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">headline</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Redesigning My Personal Website with Claude Code</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">description</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">A practical walkthrough of redesigning a Jekyll personal website — adding search, dark mode, structured data, and a full card-based redesign using Claude Code. What changed, what broke, and what it is like to iterate on a site with an AI coding assistant.</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">datePublished</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">2026-03-23T00:00:00+00:00</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">author</span><span class="dl">"</span><span class="p">:</span> <span class="p">{</span>
    <span class="dl">"</span><span class="s2">@type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Person</span><span class="dl">"</span><span class="p">,</span>
    <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Andrey Lukyanenko</span><span class="dl">"</span><span class="p">,</span>
    <span class="dl">"</span><span class="s2">url</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">https://andlukyane.com</span><span class="dl">"</span>
  <span class="p">},</span>
  <span class="dl">"</span><span class="s2">keywords</span><span class="dl">"</span><span class="p">:</span> <span class="p">[</span><span class="dl">"</span><span class="s2">blogpost</span><span class="dl">"</span><span class="p">,</span><span class="dl">"</span><span class="s2">career</span><span class="dl">"</span><span class="p">]</span>
<span class="p">}</span>
<span class="nt">&lt;/script&gt;</span>
</code></pre></div></div>

<p>I also added missing OpenGraph tags (<code class="language-plaintext highlighter-rouge">og:url</code>, <code class="language-plaintext highlighter-rouge">og:type</code>) and a Twitter Card <code class="language-plaintext highlighter-rouge">twitter:site</code> tag. These are small changes, but they improve how links appear when shared on LinkedIn, Twitter, and other platforms. The <code class="language-plaintext highlighter-rouge">og:type</code> dynamically switches between <code class="language-plaintext highlighter-rouge">article</code> for blog posts and <code class="language-plaintext highlighter-rouge">website</code> for other pages.</p>

<h4 id="floating-share-bar-and-footer-redesign">Floating share bar and footer redesign</h4>

<p>Blog posts got a floating share bar on desktop: four social buttons (Twitter, LinkedIn, Facebook, Reddit) that appear when you start reading and disappear when you scroll past the article. Visibility is controlled by checking the article’s bounding rectangle on every scroll event:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">check</span><span class="p">()</span> <span class="p">{</span>
  <span class="kd">var</span> <span class="nx">rect</span> <span class="o">=</span> <span class="nx">article</span><span class="p">.</span><span class="nx">getBoundingClientRect</span><span class="p">();</span>
  <span class="kd">var</span> <span class="nx">show</span> <span class="o">=</span> <span class="nx">rect</span><span class="p">.</span><span class="nx">top</span> <span class="o"><</span> <span class="nb">window</span><span class="p">.</span><span class="nx">innerHeight</span> <span class="o">*</span> <span class="mf">0.3</span>
          <span class="o">&&</span> <span class="nx">rect</span><span class="p">.</span><span class="nx">bottom</span> <span class="o">></span> <span class="nb">window</span><span class="p">.</span><span class="nx">innerHeight</span> <span class="o">*</span> <span class="mf">0.5</span><span class="p">;</span>
  <span class="nx">bar</span><span class="p">.</span><span class="nx">classList</span><span class="p">.</span><span class="nx">toggle</span><span class="p">(</span><span class="dl">'</span><span class="s1">visible</span><span class="dl">'</span><span class="p">,</span> <span class="nx">show</span><span class="p">);</span>
<span class="p">}</span>
<span class="nb">window</span><span class="p">.</span><span class="nx">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">scroll</span><span class="dl">'</span><span class="p">,</span> <span class="nx">check</span><span class="p">,</span> <span class="p">{</span> <span class="na">passive</span><span class="p">:</span> <span class="kc">true</span> <span class="p">});</span>
</code></pre></div></div>


<p>The footer was redesigned from a single-column layout to a three-column grid (about, quick links, social icons) using CSS Grid. I also added proper focus-visible styles for accessibility across all interactive elements.</p>

<h3 id="wave-2-content-and-layout-overhaul-with-claude-code">Wave 2: content and layout overhaul with Claude Code</h3>

<p>With the infrastructure in place, the site still had the same content problems: outdated positioning, buried content, and flat hierarchies. For this part, I used Claude Code and did the whole thing in a single session, iterating step by step rather than one-shotting it.</p>

<h4 id="gathering-ideas">Gathering ideas</h4>

<p>First, I prepared a list of what I wanted to change myself, then asked three AI assistants (ChatGPT, Claude, Gemini) to analyze the live site and suggest improvements. All three converged on the same core problems: the strongest professional signals were buried or absent, good content was difficult to find, and some content was years out of date. I saved their suggestions into Markdown files, compared the overlapping recommendations, and wrote a combined improvement plan.</p>

<h4 id="about-page">About page</h4>


<p>The old page was a chronological autobiography. The <a href="https://andlukyane.com/">new page</a> is structured as a hub: a short intro, a row of credential badges, a horizontally scrolling “Latest Paper Reviews” section, curated “Featured Work” and “Beyond Work” card grids, and a collapsible section for fun facts at the bottom.</p>

<p>The “Latest Paper Reviews” section is populated automatically using a Liquid include:</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{% assign reviews = site.posts | where_exp: "post",
    "post.tags contains 'paperreview'" %}
{% for post in reviews limit:3 %}
<span class="nt"><a</span> <span class="na">href=</span><span class="s">"{{ post.url }}"</span> <span class="na">class=</span><span class="s">"card"</span><span class="nt">></span>
  <span class="nt"><div</span> <span class="na">class=</span><span class="s">"card__meta"</span><span class="nt">></span>{{ post.date | date: "%b %d, %Y" }}<span class="nt"></div></span>
  <span class="nt"><div</span> <span class="na">class=</span><span class="s">"card__title"</span><span class="nt">></span>{{ post.title }}<span class="nt"></div></span>
  <span class="nt"><div</span> <span class="na">class=</span><span class="s">"card__description"</span><span class="nt">></span>
    {{ post.description | truncate: 140 }}
  <span class="nt"></div></span>
<span class="nt"></a></span>
{% endfor %}
</code></pre></div></div>

<p>The horizontal scroll is done with pure CSS using flexbox and <code class="language-plaintext highlighter-rouge">overflow-x: auto</code>:</p>

<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">.card-scroll</span> <span class="p">{</span>
  <span class="nl">display</span><span class="p">:</span> <span class="n">flex</span><span class="p">;</span>
  <span class="py">gap</span><span class="p">:</span> <span class="m">20px</span><span class="p">;</span>
  <span class="nl">overflow-x</span><span class="p">:</span> <span class="nb">auto</span><span class="p">;</span>
  <span class="py">scroll-snap-type</span><span class="p">:</span> <span class="n">x</span> <span class="n">mandatory</span><span class="p">;</span>
  <span class="nl">-webkit-overflow-scrolling</span><span class="p">:</span> <span class="n">touch</span><span class="p">;</span>
<span class="p">}</span>

<span class="nc">.card-scroll</span> <span class="nc">.card</span> <span class="p">{</span>
  <span class="nl">min-width</span><span class="p">:</span> <span class="m">280px</span><span class="p">;</span>
  <span class="nl">max-width</span><span class="p">:</span> <span class="m">340px</span><span class="p">;</span>
  <span class="nl">flex-shrink</span><span class="p">:</span> <span class="m">0</span><span class="p">;</span>
  <span class="py">scroll-snap-align</span><span class="p">:</span> <span class="n">start</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The main technical challenge was that Jekyll refused to process Liquid tags in the root <code class="language-plaintext highlighter-rouge">index.html</code>. After debugging, the cause turned out to be a quirk of how Jekyll handles root-level files — possibly a conflict with the paginator plugin. The fix was moving the homepage to <code class="language-plaintext highlighter-rouge">_pages/home.html</code> with <code class="language-plaintext highlighter-rouge">permalink: /</code>. Not obvious, and the kind of thing that wastes hours.</p>
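
<p>For reference, the relocated homepage might look roughly like the sketch below. Only the file path and <code class="language-plaintext highlighter-rouge">permalink: /</code> come from the fix itself; the layout, the title, and the include name are placeholders, and the sketch assumes that <code class="language-plaintext highlighter-rouge">_pages</code> is configured as an output collection in <code class="language-plaintext highlighter-rouge">_config.yml</code>:</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
# _pages/home.html (layout and title are assumptions)
layout: page
title: Home
permalink: /
---
{% comment %} Hypothetical include name; the real file may be called something else. {% endcomment %}
{% include latest-reviews.html %}
</code></pre></div></div>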

<h4 id="card-component-system">Card component system</h4>

<p>Most visual changes are built on a reusable CSS card system (<code class="language-plaintext highlighter-rouge">css/cards.css</code>, 237 lines). The core card is a simple flexbox column with a hover effect:</p>

<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">.card</span> <span class="p">{</span>
  <span class="nl">background</span><span class="p">:</span> <span class="n">var</span><span class="p">(</span><span class="n">--background-alt-color</span><span class="p">,</span> <span class="m">#f4f5f6</span><span class="p">);</span>
  <span class="nl">border-radius</span><span class="p">:</span> <span class="m">8px</span><span class="p">;</span>
  <span class="nl">padding</span><span class="p">:</span> <span class="m">24px</span><span class="p">;</span>
  <span class="nl">transition</span><span class="p">:</span> <span class="n">box-shadow</span> <span class="m">0.3s</span> <span class="n">ease</span><span class="p">,</span> <span class="n">transform</span> <span class="m">0.2s</span> <span class="n">ease</span><span class="p">;</span>
  <span class="nl">display</span><span class="p">:</span> <span class="n">flex</span><span class="p">;</span>
  <span class="nl">flex-direction</span><span class="p">:</span> <span class="n">column</span><span class="p">;</span>
<span class="p">}</span>

<span class="nc">.card</span><span class="nd">:hover</span> <span class="p">{</span>
  <span class="nl">box-shadow</span><span class="p">:</span> <span class="m">0</span> <span class="m">4px</span> <span class="m">20px</span> <span class="n">rgba</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0.1</span><span class="p">);</span>
  <span class="nl">transform</span><span class="p">:</span> <span class="n">translateY</span><span class="p">(</span><span class="m">-2px</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Cards are placed in responsive grids that go from one column on mobile to two or three on desktop:</p>

<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">.card-grid</span> <span class="p">{</span>
  <span class="nl">display</span><span class="p">:</span> <span class="n">grid</span><span class="p">;</span>
  <span class="py">grid-template-columns</span><span class="p">:</span> <span class="m">1</span><span class="n">fr</span><span class="p">;</span>
  <span class="py">gap</span><span class="p">:</span> <span class="m">20px</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">@media</span> <span class="p">(</span><span class="n">min-width</span><span class="p">:</span> <span class="m">768px</span><span class="p">)</span> <span class="p">{</span>
  <span class="nc">.card-grid</span> <span class="p">{</span> <span class="py">grid-template-columns</span><span class="p">:</span> <span class="nb">repeat</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">1</span><span class="n">fr</span><span class="p">);</span> <span class="p">}</span>
<span class="p">}</span>

<span class="k">@media</span> <span class="p">(</span><span class="n">min-width</span><span class="p">:</span> <span class="m">1024px</span><span class="p">)</span> <span class="p">{</span>
  <span class="nc">.card-grid--3</span> <span class="p">{</span> <span class="py">grid-template-columns</span><span class="p">:</span> <span class="nb">repeat</span><span class="p">(</span><span class="m">3</span><span class="p">,</span> <span class="m">1</span><span class="n">fr</span><span class="p">);</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Tag chips are small pill-shaped links that appear on cards and in the blog listing:</p>

<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">.tag-chip</span> <span class="p">{</span>
  <span class="nl">display</span><span class="p">:</span> <span class="n">inline-block</span><span class="p">;</span>
  <span class="nl">font-size</span><span class="p">:</span> <span class="m">12px</span><span class="p">;</span>
  <span class="nl">padding</span><span class="p">:</span> <span class="m">4px</span> <span class="m">10px</span><span class="p">;</span>
  <span class="nl">border-radius</span><span class="p">:</span> <span class="m">12px</span><span class="p">;</span>
  <span class="nl">background</span><span class="p">:</span> <span class="n">var</span><span class="p">(</span><span class="n">--background-color</span><span class="p">,</span> <span class="m">#ffffff</span><span class="p">);</span>
  <span class="nl">color</span><span class="p">:</span> <span class="n">var</span><span class="p">(</span><span class="n">--text-light-color</span><span class="p">,</span> <span class="m">#6B7B8D</span><span class="p">);</span>
  <span class="nl">border</span><span class="p">:</span> <span class="m">1px</span> <span class="nb">solid</span> <span class="n">var</span><span class="p">(</span><span class="n">--border-color</span><span class="p">,</span> <span class="m">#dddddd</span><span class="p">);</span>
  <span class="nl">transition</span><span class="p">:</span> <span class="n">background</span> <span class="m">0.2s</span> <span class="n">ease</span><span class="p">,</span> <span class="n">color</span> <span class="m">0.2s</span> <span class="n">ease</span><span class="p">;</span>
<span class="p">}</span>

<span class="nt">a</span><span class="nc">.tag-chip</span><span class="nd">:hover</span> <span class="p">{</span>
  <span class="nl">background</span><span class="p">:</span> <span class="n">var</span><span class="p">(</span><span class="n">--accent-color</span><span class="p">,</span> <span class="m">#3498db</span><span class="p">);</span>
  <span class="nl">color</span><span class="p">:</span> <span class="m">#ffffff</span><span class="p">;</span>
  <span class="nl">border-color</span><span class="p">:</span> <span class="n">var</span><span class="p">(</span><span class="n">--accent-color</span><span class="p">,</span> <span class="m">#3498db</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

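<p>The background, text, border, and accent colors all come from CSS custom properties with light-theme fallbacks, so the components follow dark mode automatically once the theme redefines those variables.</p>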
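
<p>As a rough sketch of why this works: a dark theme only has to redefine the variables that the card and chip rules already read. The variable names are taken from the rules above, but the <code class="language-plaintext highlighter-rouge">[data-theme="dark"]</code> selector and the color values are assumptions for illustration, not the site's actual stylesheet:</p>

<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* Hypothetical dark-theme override: selector and colors are assumptions. */
[data-theme="dark"] {
  --background-color: #1e1e1e;      /* read by .tag-chip */
  --background-alt-color: #2a2a2a;  /* read by .card */
  --text-light-color: #a0aab4;      /* read by .tag-chip */
  --border-color: #3a3a3a;          /* read by .tag-chip */
  --accent-color: #3498db;          /* read by a.tag-chip:hover */
}
</code></pre></div></div>

<p>With an override like this in place, the component rules themselves need no dark-mode variants.</p>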

<h4 id="blog-listing-tag-chips">Blog listing tag chips</h4>


<p>Every post in the <a href="https://andlukyane.com/blog/">blog</a> listing now shows up to four clickable tag chips. The first attempt had an interesting bug: the entire blog post card was clickable via a jQuery handler in <code class="language-plaintext highlighter-rouge">personal.js</code>, so clicking a tag chip navigated to the post instead of the tag page. The fix was a guard in the click handler:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">$</span><span class="p">(</span><span class="nb">document</span><span class="p">).</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">click</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">.post</span><span class="dl">'</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">e</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">$</span><span class="p">(</span><span class="nx">e</span><span class="p">.</span><span class="nx">target</span><span class="p">).</span><span class="nx">closest</span><span class="p">(</span><span class="dl">'</span><span class="s1">.tag-chip</span><span class="dl">'</span><span class="p">).</span><span class="nx">length</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span><span class="p">;</span>  <span class="c1">// let the tag link handle its own navigation</span>
  <span class="p">}</span>
  <span class="c1">// ... existing post navigation code</span>
<span class="p">});</span>
</code></pre></div></div>
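
<p>The chips themselves can be rendered with the same <code class="language-plaintext highlighter-rouge">limit</code> parameter used for the paper reviews. This is only a sketch: the tag URL pattern is an assumption and may not match the site's real tag pages.</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{% comment %} Sketch only: the /tag/... URL pattern is an assumption. {% endcomment %}
{% for tag in post.tags limit:4 %}
  &lt;a class="tag-chip" href="/tag/{{ tag | slugify }}/"&gt;{{ tag }}&lt;/a&gt;
{% endfor %}
</code></pre></div></div>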

<h4 id="other-pages">Other pages</h4>


<p>I restructured the <a href="https://andlukyane.com/activities">Activities</a> page from flat collapsible lists into cards for publications and talks, data tables for Kaggle notebooks and competitions, and compact inline archives for the full talk history. The <a href="https://andlukyane.com/project">Projects</a> page went from two entries to eight, organized into Featured Projects (with images), Production ML, and Tools and Writing. The <a href="https://andlukyane.com/tags">Tags</a> page now groups 120+ tags into 12 thematic categories displayed as a two-column card grid, and individual tag pages show posts as cards instead of plain lists.</p>

<h3 id="working-with-claude-code">Working with Claude Code</h3>

<p>Most of the changes were done in a single Claude Code session over a couple of days. The most useful pattern was version comparison: for each major page redesign, I asked it to create three versions at temporary URLs (<code class="language-plaintext highlighter-rouge">/about-v1</code>, <code class="language-plaintext highlighter-rouge">/about-v2</code>, <code class="language-plaintext highlighter-rouge">/about-v3</code>), previewed them all in the browser, and described which elements to combine. It is faster to compare three concrete implementations than to describe an abstract preference in words. One version might have better visual appeal, another a better structure, and a third a useful new feature.</p>

<p>Not everything worked well on the first attempt; I had to spend considerable time on debugging and testing. For the Liquid processing issue, Claude Code tried several approaches (adding <code class="language-plaintext highlighter-rouge">layout: page</code> to the front matter, then moving the Liquid into an include file) before identifying that moving the page into the <code class="language-plaintext highlighter-rouge">_pages</code> collection was the correct fix. It diagnosed problems by inspecting the built <code class="language-plaintext highlighter-rouge">_site/index.html</code> output and checking whether the Liquid tags had been resolved. For the tag chip click issue, it found the jQuery handler, understood the event propagation, and added the guard.</p>

<p>All content decisions were mine: which posts to feature, what text to write, what to include or exclude. Some CSS took several iterations because only I could see the rendered result, while Claude Code could only verify its changes by grepping the HTML output. When I sent screenshots of broken layouts, it could usually identify the issue, though not always on the first try. It also occasionally generated text that was too promotional, which I had to tone down.</p>

<h3 id="summary-of-changes">Summary of changes</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Before</th>
      <th>After</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Search</td>
      <td>None</td>
      <td>Full-text with Ctrl+K</td>
    </tr>
    <tr>
      <td>Dark mode coverage</td>
      <td>Partial</td>
      <td>Complete (250+ CSS rules)</td>
    </tr>
    <tr>
      <td>Structured data</td>
      <td>None</td>
      <td>WebSite + BlogPosting JSON-LD</td>
    </tr>
    <tr>
      <td>OpenGraph tags</td>
      <td>Partial</td>
      <td>Complete (og:url, og:type, twitter:site)</td>
    </tr>
    <tr>
      <td>Share functionality</td>
      <td>Bottom of post only</td>
      <td>Floating sidebar + bottom</td>
    </tr>
    <tr>
      <td>About page</td>
      <td>Chronological bio</td>
      <td>6 sections with cards and badges</td>
    </tr>
    <tr>
      <td>Projects shown</td>
      <td>2</td>
      <td>8</td>
    </tr>
    <tr>
      <td>Activities structure</td>
      <td>Flat lists</td>
      <td>Cards, tables, compact archives</td>
    </tr>
    <tr>
      <td>Tag organization</td>
      <td>Flat list</td>
      <td>12 thematic card groups</td>
    </tr>
    <tr>
      <td>Blog tag chips</td>
      <td>None</td>
      <td>Up to 4 per post</td>
    </tr>
  </tbody>
</table>

<h3 id="conclusions">Conclusions</h3>

<p>The site now reflects where I am in 2026 rather than where I was in 2021 or 2023. The design is still built on a purchased Jekyll theme with incremental modifications, but it is much better at helping people discover useful content. For me, this is good enough.</p>

<p>Additionally, this was an interesting exercise in using AI to improve a personal website. I remember that it took me almost a month to set up the initial version, and then days or weeks to make specific changes because I didn’t know web development well enough. This redesign took less than a week in total, and I learnt a lot about both web development and working with AI.</p>
</article>
</main></body></html>