<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://andlukyane.com//feed.xml" rel="self" type="application/atom+xml" /><link href="https://andlukyane.com//" rel="alternate" type="text/html" /><updated>2026-04-20T09:58:04+00:00</updated><id>https://andlukyane.com//feed.xml</id><title type="html">artgor</title><subtitle>Machine Learning Engineer at Meta in London. Kaggle Competition Master, Notebook Grandmaster, Google Developer Expert. Polyglot. Writing about applied ML, paper reviews, systems, and learning.</subtitle><entry><title type="html">FIPO: Teaching LLMs Which Thoughts Actually Matter</title><link href="https://andlukyane.com//blog/paper-review-fipo" rel="alternate" type="text/html" title="FIPO: Teaching LLMs Which Thoughts Actually Matter" /><published>2026-04-20T00:00:00+00:00</published><updated>2026-04-20T00:00:00+00:00</updated><id>https://andlukyane.com//blog/paper-review-fipo</id><content type="html" xml:base="https://andlukyane.com//blog/paper-review-fipo"><![CDATA[<h2 id="fipo-teaching-llms-which-thoughts-actually-matter">FIPO: Teaching LLMs Which Thoughts Actually Matter</h2>

<p><a href="https://arxiv.org/abs/2603.19835">Paper</a></p>

<p><a href="https://qwen-pilot.notion.site/fipo">Notion writeup</a></p>

<p><a href="https://github.com/qwenpilot/FIPO">Code</a></p>

<p><img src="https://andlukyane.com/images/paper_reviews/fipo/2026-04-20_10-26-59.jpg" alt="Main image" /></p>

<p>Most reasoning models today rely on outcome-based RL: generate a solution, check if the answer is correct, and reinforce the whole trajectory. As reasoning becomes longer, the learning signal collapses - important steps and irrelevant tokens receive the same credit, and performance plateaus.</p>
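
<p>To make the failure mode concrete, here is a minimal sketch (my own simplification, not code from the paper) of the uniform, outcome-only credit that GRPO/DAPO-style recipes assign:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def outcome_advantages(rewards, lengths):
    """Uniform outcome credit: one scalar advantage per rollout,
    broadcast to every token of that rollout."""
    # rewards: [N] tensor of 0/1 correctness, lengths: list of token counts
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # A decisive reasoning step and a filler token in the same rollout
    # receive exactly the same credit.
    return [adv[i].expand(int(lengths[i])) for i in range(len(lengths))]
</code></pre></div></div>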

<p>FIPO addresses this directly by introducing <strong>token-level credit assignment</strong> based on future impact. This turns a sparse, outcome-only signal into a dense, structured one. The authors train <strong>Qwen2.5-32B-Base</strong> on top of the DAPO recipe inside VeRL and evaluate on <strong>AIME 2024</strong>. They report Pass@1 going from 50.0% (DAPO) to a peak of 58.0%, converging around 56.0%. Response lengths roughly double during training, from around 4k tokens to more than 10k.</p>

<h3 id="fipo">FIPO</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/fipo/2026-04-20_09-58-03.jpg" alt="Training instability" /></p>

<p>FIPO measures, for each token, how much the policy’s behaviour on the rest of the rollout has shifted since the last update. For each position in a response, the authors compute the log-ratio between the current and old policy at every subsequent position, then sum these shifts with a discount factor that decays with distance. The resulting per-token statistic is the <strong>Future-KL</strong>. Closer future tokens matter more than distant ones, and the discount works as a soft half-life rather than a hard cutoff.</p>
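
<p>A minimal sketch of how such a statistic could be computed, assuming a single-sample estimate built from the sampled tokens’ log-probabilities (the paper may use a different estimator, and the discount value here is illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def future_kl(logp_new, logp_old, gamma=0.9):
    """Discounted sum of subsequent policy shifts for each position.

    logp_new, logp_old: [T] log-probs of the sampled tokens under the
    current and the rollout-time (old) policy.
    """
    delta = logp_new - logp_old  # per-token log-ratio shift
    T = delta.shape[0]
    fkl = torch.zeros(T)
    running = torch.tensor(0.0)
    # Walk backwards so each position accumulates only future shifts,
    # discounted by distance (a soft half-life, not a hard cutoff).
    for t in reversed(range(T)):
        fkl[t] = running
        running = delta[t] + gamma * running
    return fkl
</code></pre></div></div>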

<p>Future-KL can be unstable under distributional shift, so two safeguards keep it from destabilizing training. First, a <strong>dual-clip mask</strong> zeroes out tokens whose importance ratio exceeds a threshold of around 10, so that extreme values don’t corrupt the gradients. Second, the future-KL value is mapped through an exponential clip into a bounded <strong>influence weight</strong>. Tokens whose future trajectory shifted a lot after the update receive larger effective advantages; tokens that did not move the future of the rollout receive less weight.</p>
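
<p>Continuing the sketch, the two safeguards could look roughly like this; the threshold of around 10 follows the description above, while the exponential form and the weight cap are assumptions rather than the paper’s exact formula:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def influence_weights(fkl, ratio, ratio_threshold=10.0, w_max=2.0):
    """Map future-KL into bounded per-token advantage weights.

    fkl:   [T] future-KL values from the sketch above.
    ratio: [T] importance ratios of the sampled tokens.
    """
    # Dual-clip mask: drop tokens whose importance ratio is extreme so
    # they cannot dominate the gradient.
    mask = (ratio &lt; ratio_threshold).float()
    # Exponential clip: tokens whose future shifted more receive larger,
    # but bounded, effective advantages.
    weights = torch.exp(fkl).clamp(max=w_max)
    return weights * mask

# Usage: weighted_adv = influence_weights(fkl, ratio) * token_advantages
</code></pre></div></div>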

<h3 id="experiments">Experiments</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/fipo/2026-04-20_10-00-26.jpg" alt="Experiments" /></p>

<ul>
  <li>FIPO: 58.0% peak, around 56.0% at convergence.</li>
  <li>DAPO baseline on Qwen2.5-32B-Base: 50.0% Pass@1. Reproduced DeepSeek-R1-Zero-32B: ~47%.</li>
</ul>

<p>Both DAPO and FIPO show length growth, but FIPO reaches longer responses at the same step count. That is consistent with denser credit letting the model commit to longer reasoning - useful tokens in the middle of a rollout can now be rewarded directly.</p>

<p>The main drivers of FIPO’s effectiveness are the emergence of length-based scaling in reasoning chains, the denser per-token learning signal, and the significantly improved stability of the optimization process.</p>

<h3 id="limitations">Limitations</h3>

<ul>
  <li>Compute overhead from per-step future-KL accumulation and 10k-plus rollouts;</li>
  <li>Narrow task scope (math reasoning only);</li>
  <li>Fixed dataset (DAPO’s open math corpus, not scaled up);</li>
  <li>A base-model confound that makes it hard to isolate the contribution of the RL recipe from the underlying pretraining and SFT.</li>
</ul>

<h3 id="conclusions">Conclusions</h3>

<p>FIPO targets a different problem compared to methods like GRPO, DAPO, and recent reasoning systems (<a href="https://andlukyane.com/blog/paper-review-deepseekr1">DeepSeek-R1</a>, o-series). Most existing approaches improve reasoning by scaling models, improving sampling or refining reward signals. FIPO instead changes <strong>how the reward is distributed within a trajectory</strong>.</p>

<p>This makes it fundamentally different:</p>

<ul>
  <li>Compared to GRPO/DAPO, it removes uniform credit assignment</li>
  <li>Compared to PPO-style RLHF, it avoids critics while still providing dense signals</li>
  <li>Compared to top reasoning models, it suggests gains can come from training signal design, not just scale</li>
</ul>

<p>I am less sure how well the gains hold up outside AIME or under a compute-matched comparison. The short-horizon future-KL proxy probably rewards some variance that is not reasoning-relevant. For now, it looks like a promising direction for reasoning-specific training rather than a drop-in upgrade.</p>]]></content><author><name></name></author><category term="paperreview" /><category term="deeplearning" /><category term="llm" /><category term="rl" /><category term="reasoning" /><category term="optimization" /><summary type="html"><![CDATA[FIPO - an RL algorithm that fixes one of the core limitations of RL for LLM reasoning - credit assignment. Instead of giving every token in a rollout the same outcome advantage, it re-weights tokens by a discounted future-KL signal, enabling longer and more effective reasoning chains.]]></summary></entry><entry><title type="html">Book Review: Unlocking Data with Generative AI and RAG, Second Edition</title><link href="https://andlukyane.com//blog/book-review-unlocking-data-genai-rag" rel="alternate" type="text/html" title="Book Review: Unlocking Data with Generative AI and RAG, Second Edition" /><published>2026-04-09T00:00:00+00:00</published><updated>2026-04-09T00:00:00+00:00</updated><id>https://andlukyane.com//blog/book-review-unlocking-data-genai-rag</id><content type="html" xml:base="https://andlukyane.com//blog/book-review-unlocking-data-genai-rag"><![CDATA[<h2 id="book-review-unlocking-data-with-generative-ai-and-rag-second-edition">Book Review: Unlocking Data with Generative AI and RAG, Second Edition</h2>

<p><a href="https://www.amazon.com/Unlocking-Data-Generative-RAG-fundamentals-ebook/dp/B0G2B5VLL8/">Amazon</a></p>

<p><img src="https://andlukyane.com/images/book_reviews/genai_rag2/2026-04-09_09-44-43.jpg" alt="Main image" /></p>

<p>I was offered an opportunity to read <strong>Unlocking Data with Generative AI and RAG, Second Edition</strong>, by Keith Bourne, in exchange for an honest review. I <a href="https://artgor.medium.com/book-review-unlocking-data-with-generative-ai-and-rag-3ec7cab074a5">reviewed the first edition</a> back in 2024 and liked it, so I was curious to see how the second edition would handle the fact that <strong>RAG</strong> has since absorbed an entirely new layer of the stack — agents, graph retrieval, semantic caches, and memory systems. The book covers the modern RAG landscape across 20 chapters in three parts: classical RAG foundations, production-grade retrieval and evaluation, and an entirely new Part III on agentic RAG spanning <strong>LangGraph</strong>, ontology-driven graph RAG with <strong>Neo4j</strong>, semantic caching, and the <strong>CoALA</strong> memory framework. I liked this book, and its biggest strength is the continuous running example that carries you from a basic retriever in Chapter 2 all the way to a stateful, learning agent in Chapter 19 — a thread very few books manage to sustain.</p>

<h3 id="the-overall-structure">The overall structure</h3>

<p><img src="https://andlukyane.com/images/book_reviews/genai_rag2/2026-04-09_09-29-19.jpg" alt="The cover" /></p>

<p>The book is organized into three parts:</p>
<ul>
  <li>Part I (Chapters 1–6) introduces RAG and its vocabulary, gets a complete pipeline running by Chapter 2, and then layers in practical applications, a security chapter with red and blue teaming, and a Gradio UI for demos.</li>
  <li>Part II (Chapters 7–11) goes deeper into the components: vectors and vector stores, similarity search, evaluation with <strong>Ragas</strong>, the <strong>LangChain</strong> retrievers and integrations, and the loaders, splitters, and output parsers that hold a real RAG system together.</li>
  <li>Part III (Chapters 12–19) is where the second edition contributes the most: agents and <strong>LangGraph</strong>, ontology engineering with <strong>Protégé</strong>, graph RAG on <strong>Neo4j</strong>, semantic caching, the <strong>CoALA</strong> memory framework, and a capstone investment-advisor agent that integrates all four CoALA memory types.</li>
</ul>

<p>The implementations use real libraries rather than building everything from scratch, which mirrors how most teams actually work and saves the reader from the usual from-scratch boilerplate.</p>

<h3 id="what-i-liked">What I liked</h3>

<p><img src="https://andlukyane.com/images/book_reviews/genai_rag2/2026-04-09_09-43-51.jpg" alt="RAG graphs" /></p>

<p>There were many things I liked in this book, and I want to highlight several in particular:</p>

<ul>
  <li>The single running example is used throughout all 20 chapters. I’ve read enough ML books where each chapter is its own throwaway notebook, so I really appreciate it when one isn’t, and the continuity makes it easier for me to follow the ideas presented in the book.</li>
  <li>The small but important Chapter 3 detail of returning sources alongside answers. Until we can fully trust LLM outputs, being able to answer the question “where did this answer come from?” is very important.</li>
  <li>Chapter 9’s evaluation is built around <strong>Ragas</strong> — faithfulness, answer relevancy, context precision, context recall, plus direct insights from a Ragas co-founder. A lot of evaluation work happens after deployment, which matches my experience on several real projects.</li>
  <li>Chapters 13–14 guide you from building a financial ontology in <strong>Protégé</strong> to loading it into Neo4j with hybrid embeddings that blend text and graph structure. Designing a clean ontology has the same slow, painful, valuable feel as the months we spent organizing data labeling on one of my past projects: it is the kind of upfront work that pays off everywhere downstream.</li>
  <li>The Chapter 5 red team / blue team prompt-injection lab. Security chapters in ML books usually read like policy documents, while this one actually has you attacking and defending your own RAG pipeline, which is a much better way to build practice.</li>
</ul>

<p><img src="https://andlukyane.com/images/book_reviews/genai_rag2/2026-04-09_09-40-58.jpg" alt="Memory" /></p>

<p>The standout production chapter for me was Chapter 15 on semantic caches. The long-tail framing of real-world queries, the cross-encoder verification step, the adaptive thresholds, and the eviction policy discussion are exactly the level of detail that separates a toy semantic cache from one that won’t embarrass you in production. Chapter 16 had a great comparison of three agentic memory frameworks: <strong>Mem0</strong>, <strong>LangMem</strong>, and <strong>Zep/Graphiti</strong>. And Chapter 19 brings everything together into a capstone investment-advisor agent that uses working, episodic, semantic, and procedural memory simultaneously.</p>

<h3 id="what-could-have-been-better">What could have been better</h3>

<p>There are a few small things that could have been handled differently:</p>

<ul>
  <li>Failure modes are under-discussed. Almost every chapter sells the upside of its technique and skips what breaks. Agents make systems slower and harder to debug, cached answers can become stale, and memory stores can grow in ways that hurt retrieval. Such a comprehensive book could be more honest about the failure side, and I’d love a “what goes wrong” subsection in each Part III chapter in a future edition.</li>
  <li>A few places could use head-to-head numbers. When graph-based RAG is introduced, or when the memory-equipped agent is presented, I’d have liked even one quantitative comparison against a simpler baseline on the same questions — the architectures are convincing in concept, and a small numbers-on-a-table moment would make them convincing in practice too.</li>
  <li>The introduction chapter includes a table listing popular models and their context lengths as of October 2025; some of these models were already outdated at that time.</li>
</ul>

<p>But these are small nitpicks that are completely overshadowed by the good sides of the book.</p>

<h3 id="conclusion">Conclusion</h3>

<p>This book is a good fit for RAG practitioners past the hello-world stage — people who already know what an embedding is and what a retriever does, and who now want to go beyond basic vector search. It is particularly useful for engineers being asked to turn a working RAG prototype into something that handles evaluation, security, caching, and agentic workflows without falling apart in production. If you already own the first edition, the question is whether Part III is worth the upgrade, and the answer is yes — Chapters 12 through 19 are essentially a self-contained book on agentic RAG, and very few other resources cover this terrain end-to-end with working code.</p>

<p>RAG is moving fast, and any book on the topic will age quickly in its specifics — LangChain APIs will churn, new memory libraries will appear, and several embedding-model tables will look dated within a year. The value here is less in the details of any particular notebook and more in the mental scaffold: how retrieval quality connects to evaluation, how agents extend the RAG loop, why memory systems matter once an agent lives longer than a single turn, and where graph-based retrieval earns its weight. That scaffold holds up, even as the field continues to evolve.</p>]]></content><author><name></name></author><category term="blogpost" /><category term="books" /><category term="llm" /><category term="rag" /><category term="agents" /><summary type="html"><![CDATA[A review of Keith Bourne’s second edition of Unlocking Data with Generative AI and RAG, covering the running example that ties all 20 chapters together, explanations of agentic RAG, and where the book is most useful in practice.]]></summary></entry><entry><title type="html">Book Review: A Practical Guide to Reinforcement Learning from Human Feedback</title><link href="https://andlukyane.com//blog/book-review-rlhf" rel="alternate" type="text/html" title="Book Review: A Practical Guide to Reinforcement Learning from Human Feedback" /><published>2026-04-06T00:00:00+00:00</published><updated>2026-04-06T00:00:00+00:00</updated><id>https://andlukyane.com//blog/book-review-rlhf</id><content type="html" xml:base="https://andlukyane.com//blog/book-review-rlhf"><![CDATA[<h2 id="book-review-a-practical-guide-to-reinforcement-learning-from-human-feedback">Book Review: A Practical Guide to Reinforcement Learning from Human Feedback</h2>

<p><a href="https://www.amazon.com/Practical-Guide-Reinforcement-Learning-Feedback/dp/B0FV3414ST/ref=sr_1_2?crid=160SZQL0IJGZU&amp;dib=eyJ2IjoiMSJ9.pfBEIkmcg4uKFShSCU2chy-aN-16qA3kbGpR5fyvBd8O7iy9IVMBkOHkVgTDYo51RtXqDqPzCZazu69tUQVyBlVHY9fxOWlk4j4Ji7oRJVZivh0VPH22I4bBuchG9Mvkk5DHx_i4js2-PlfBkMtAeR5wzQ8QZ5wzg8piIqagfVGRp02-AcKkHKxlXu05Kn-YIWCTXv_7qgOLv5W1vYOnxQ.9tFnY5IzjLbWu4nmf_TXK51XfEcRPt26t-fkPPg5UVQ&amp;dib_tag=se&amp;keywords=reinforcement+learning+human+feedback&amp;qid=1773219960&amp;sprefix=reinforcement+learning+human%2Caps%2C288&amp;sr=8-2">Amazon</a></p>

<p><a href="https://www.linkedin.com/in/sandipdkulkarni/">Author’s LinkedIn page</a></p>

<p><img src="https://andlukyane.com/images/book_reviews/rlhf_practical_guide/2026-04-05_18-54-49.jpg" alt="Main image" /></p>

<p>I was offered the chance to read <strong>A Practical Guide to Reinforcement Learning from Human Feedback</strong> by Sandip Kulkarni in exchange for an honest review. The book covers the full RLHF pipeline across 12 chapters: from classical reinforcement learning through reward modeling and PPO-based fine-tuning, and then into newer methods like <strong>DPO</strong>, <strong>RLAIF</strong>, and <strong>Constitutional AI</strong>. I liked this book and consider it a well-structured learning resource that provides good theory basics and a lot of practical examples.</p>

<h3 id="the-overall-structure">The overall structure</h3>

<p><img src="https://andlukyane.com/images/book_reviews/rlhf_practical_guide/1773220342507.png" alt="The cover" /></p>

<p>The book follows a deliberate three-stage progression:</p>
<ul>
  <li>Explaining the core principles of reinforcement learning and policy optimization</li>
  <li>Building a complete RLHF pipeline for LLMs</li>
  <li>Exploring the evolution of alignment research</li>
</ul>

<p>By the time you reach DPO in Chapter 10, you understand why it exists, because you have already built a reward model, dealt with PPO’s clipped objective, and seen the full canonical RLHF loop in action. That kind of understanding is hard to get from blog posts or paper summaries alone.</p>

<p>I also liked that the implementations use real libraries (<strong>TRL</strong>, <strong>PEFT</strong>, Hugging Face ecosystem) rather than building everything from scratch. In industry, we use existing tooling to avoid bugs and save time, and the book does the same. The memory optimization advice scattered through the middle chapters (gradient checkpointing, staged model loading, working within Colab constraints) is the kind of practical wisdom that many books skip entirely.</p>

<h3 id="what-i-liked">What I liked</h3>

<p><img src="https://andlukyane.com/images/book_reviews/rlhf_practical_guide/2026-03-22_20-20-14.jpg" alt="RLHF" /></p>

<p>There were many things in this book that I liked, and I want to highlight several in particular:</p>
<ul>
  <li>The discussions on annotator bias and labeling evaluation were especially interesting for me, as on one of my projects we spent months organizing data labeling</li>
  <li>Using human keyboard inputs for Mountain Car demonstrations was fun</li>
  <li>The code walkthroughs for configuring PEFT and LoRA were highly practical and well-written</li>
  <li>Using smaller models like Qwen2-0.5B-Instruct makes it easier to play with them when you don’t have good enough hardware for larger models</li>
  <li>The visualizations explaining the intuition behind the algorithms were great</li>
</ul>

<p><img src="https://andlukyane.com/images/book_reviews/rlhf_practical_guide/2026-04-05_18-34-31.jpg" alt="Mountain car" /></p>

<p>Chapters 9–12 are very interesting and useful; they cover modern approaches and offer many practical tips. The Constitutional AI section includes deployment architecture, transparency mechanisms, and governance considerations that feel like they were written by someone thinking about production systems, not just research experiments. The comparison between DPO and PPO trade-offs is well-argued and clearly presented. The evaluation chapter tackles genuinely hard problems (self-preference bias of LLM judges, mode collapse, the difficulty of proxy evaluations) without pretending there are clean answers.</p>

<h3 id="what-could-have-been-better">What could have been better</h3>

<p>There are some things that could have been handled differently or better:</p>
<ul>
  <li>I’m not sure if it was necessary to explain self-attention and transformers in such detail.</li>
  <li>The TRL library versions are inconsistent across chapters, and since TRL’s API changed significantly between these versions, it would be great if all chapters used the latest version.</li>
  <li>I’d love to see some examples of UI/UX for data collection and user annotation.</li>
</ul>

<p>But these are small nitpicks that are completely overshadowed by the good sides of the book.</p>

<h3 id="conclusion">Conclusion</h3>

<p>This book is a good fit for ML practitioners seeking a single, structured resource that covers the RLHF landscape from foundations to modern methods. It is particularly useful for people transitioning into post-training roles, or for anyone who learns better from implementations than from papers. The code is reproducible, the tooling is up to date, and the pedagogical progression genuinely helps build intuition.</p>

<p>RLHF is moving fast, and any book on the topic will age quickly in its specifics. The value here is less in the details of any particular implementation and more in the mental scaffold: what reward models do, why evaluation is hard, how newer methods relate to older recipes, and why alignment is an engineering discipline rather than a collection of tricks. That scaffold holds up, even as the field continues to evolve.</p>]]></content><author><name></name></author><category term="blogpost" /><category term="books" /><category term="rl" /><category term="rlhf" /><category term="llm" /><summary type="html"><![CDATA[A review of Sandip Kulkarni’s book on RLHF, covering its strengths as a structured learning resource, its reliance on both older and newer models, and who will benefit most from reading it.]]></summary></entry><entry><title type="html">Redesigning My Personal Website with Claude Code</title><link href="https://andlukyane.com//blog/redesigning-my-personal-website" rel="alternate" type="text/html" title="Redesigning My Personal Website with Claude Code" /><published>2026-03-23T00:00:00+00:00</published><updated>2026-03-23T00:00:00+00:00</updated><id>https://andlukyane.com//blog/redesigning-my-personal-website</id><content type="html" xml:base="https://andlukyane.com//blog/redesigning-my-personal-website"><![CDATA[<h2 id="redesigning-my-personal-website-with-claude-code">Redesigning My Personal Website with Claude Code</h2>

<p><img src="https://andlukyane.com/images/website_redesign/2026-03-23_08-55-01.jpg" alt="Main image" /></p>

<p>Three years ago, I wrote about <a href="https://andlukyane.com/blog/how-i-created-this-website">creating this website</a> using Jekyll, GitHub Pages, and a custom domain. That post is still one of the most-read things on my blog. The site has served me well since then — it helped during job interviews, got me a few consulting projects, and became a home for 190+ ML paper reviews.</p>

<p>But I felt that by early 2026, the site was in need of a refresh:</p>
<ul>
  <li>Some content was outdated (About and Career pages weren’t updated for years)</li>
  <li>Some pages weren’t structured well (Activities page was a flat chronological list, Projects page had exactly two entries, Tags page was a wall of unsorted tags)</li>
  <li>And the site itself lacked many basic features, like search or dark mode support</li>
</ul>

<p>I did the redesign in two waves. I started with infrastructure improvements (search, dark mode, structured data, footer, sharing). Then I did a full content and layout overhaul using <strong>Claude Code</strong>.</p>

<h3 id="wave-1-infrastructure-improvements">Wave 1: infrastructure improvements</h3>

<p>Before touching any content, I wanted to fix the technical foundation. These improvements took several sessions to complete.</p>

<h4 id="full-text-search">Full-text search</h4>

<p>The site had no search functionality at all. With 190+ posts, finding something specific meant scrolling through the blog listing or using the browser’s Ctrl+F on the tags page. I added a client-side search powered by a JSON index file that Jekyll generates at build time. The search modal opens with <code class="language-plaintext highlighter-rouge">Ctrl + K</code> (or <code class="language-plaintext highlighter-rouge">Cmd + K</code> on Mac) and supports multi-term filtering across titles, descriptions, tags, and content.</p>

<p>The search index is generated by a <code class="language-plaintext highlighter-rouge">search.json</code> file with Liquid:</p>

<div class="language-liquid highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[<span class="p">{%</span><span class="w"> </span><span class="nt">for</span><span class="w"> </span><span class="nv">post</span><span class="w"> </span><span class="nt">in</span><span class="w"> </span><span class="nv">site.posts</span><span class="w"> </span><span class="p">%}</span>
  {
    "title": <span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">title</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">jsonify</span><span class="w"> </span><span class="p">}}</span>,
    "url": "<span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">url</span><span class="w"> </span><span class="p">}}</span>",
    "date": "<span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">date</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">date</span><span class="p">:</span><span class="w"> </span><span class="s1">'%b %d, %Y'</span><span class="w"> </span><span class="p">}}</span>",
    "description": <span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">description</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">jsonify</span><span class="w"> </span><span class="p">}}</span>,
    "tags": <span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">tags</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">jsonify</span><span class="w"> </span><span class="p">}}</span>,
    "content": <span class="p">{{</span><span class="w"> </span><span class="nv">post</span><span class="p">.</span><span class="nv">content</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">strip_html</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">truncatewords</span><span class="p">:</span><span class="w"> </span><span class="mi">50</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="nf">jsonify</span><span class="w"> </span><span class="p">}}</span>
  }<span class="p">{%</span><span class="w"> </span><span class="kr">unless</span><span class="w"> </span><span class="nb">forloop.last</span><span class="w"> </span><span class="p">%}</span>,<span class="p">{%</span><span class="w"> </span><span class="kr">endunless</span><span class="w"> </span><span class="p">%}</span>
<span class="p">{%</span><span class="w"> </span><span class="nt">endfor</span><span class="w"> </span><span class="p">%}</span>]
</code></pre></div></div>

<p>The JavaScript fetches this JSON once on the first search, then filters locally. It is not as powerful as Algolia or Lunr, but it works well enough for a static site with a few hundred posts and requires no external service.</p>

<h4 id="dark-mode">Dark mode</h4>

<p>The site already had a basic dark mode toggle from an earlier refactor, but it was incomplete — many elements (code blocks, tables, form inputs, the career timeline) still used hardcoded light colors. I expanded the <code class="language-plaintext highlighter-rouge">dark-mode.css</code> file from about 100 lines to over 250, covering every component on the site. The dark mode toggle also needed to sync with the Utterances comment theme, so switching to dark mode now sends a <code class="language-plaintext highlighter-rouge">postMessage</code> to the Utterances iframe:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">updateUtterancesTheme</span> <span class="o">=</span> <span class="p">(</span><span class="nx">theme</span><span class="p">)</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="kd">const</span> <span class="nx">utterancesFrame</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nx">querySelector</span><span class="p">(</span><span class="dl">'</span><span class="s1">.utterances-frame</span><span class="dl">'</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">utterancesFrame</span><span class="p">)</span> <span class="p">{</span>
    <span class="nx">utterancesFrame</span><span class="p">.</span><span class="nx">contentWindow</span><span class="p">.</span><span class="nx">postMessage</span><span class="p">(</span>
      <span class="p">{</span> <span class="na">type</span><span class="p">:</span> <span class="dl">'</span><span class="s1">set-theme</span><span class="dl">'</span><span class="p">,</span> <span class="na">theme</span><span class="p">:</span> <span class="nx">theme</span> <span class="p">},</span>
      <span class="dl">'</span><span class="s1">https://utteranc.es</span><span class="dl">'</span>
    <span class="p">);</span>
  <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Speaking of Utterances — the comments had an embarrassing bug. The comment section did not appear at all until you manually refreshed the page. The cause was that the site uses AJAX navigation (via History.js in <code class="language-plaintext highlighter-rouge">personal.js</code>), so navigating to a post did not trigger a full page load. The Utterances <code class="language-plaintext highlighter-rouge">&lt;script&gt;</code> tag was embedded inline in the post layout, and inline scripts only execute on initial load — AJAX page transitions skip them entirely. The fix was to replace the inline script with a placeholder <code class="language-plaintext highlighter-rouge">&lt;div id="utterances-container"&gt;</code> and move the Utterances initialization into <code class="language-plaintext highlighter-rouge">personal.js</code>, where it runs on every page transition.</p>

<p>This also handles the initial dark mode state — the theme is read from <code class="language-plaintext highlighter-rouge">localStorage</code> at injection time, so comments load with the correct theme from the start. The <code class="language-plaintext highlighter-rouge">postMessage</code> approach described above is still needed when the user toggles dark mode while comments are already visible.</p>

<h4 id="structured-data-and-opengraph">Structured data and OpenGraph</h4>

<p>I added JSON-LD structured data for both the site (WebSite schema) and individual blog posts (BlogPosting schema). The blog post schema includes headline, description, author, publication date, and keywords from tags:</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;script </span><span class="na">type=</span><span class="s">"application/ld+json"</span><span class="nt">&gt;</span>
<span class="p">{</span>
  <span class="dl">"</span><span class="s2">@context</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">https://schema.org</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">@type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">BlogPosting</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">headline</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Redesigning My Personal Website with Claude Code</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">description</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">A practical walkthrough of redesigning a Jekyll personal website — adding search, dark mode, structured data, and a full card-based redesign using Claude Code. What changed, what broke, and what it is like to iterate on a site with an AI coding assistant.</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">datePublished</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">2026-03-23T00:00:00+00:00</span><span class="dl">"</span><span class="p">,</span>
  <span class="dl">"</span><span class="s2">author</span><span class="dl">"</span><span class="p">:</span> <span class="p">{</span>
    <span class="dl">"</span><span class="s2">@type</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Person</span><span class="dl">"</span><span class="p">,</span>
    <span class="dl">"</span><span class="s2">name</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Andrey Lukyanenko</span><span class="dl">"</span><span class="p">,</span>
    <span class="dl">"</span><span class="s2">url</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">https://andlukyane.com/</span><span class="dl">"</span>
  <span class="p">},</span>
  <span class="dl">"</span><span class="s2">keywords</span><span class="dl">"</span><span class="p">:</span> <span class="p">[</span><span class="dl">"</span><span class="s2">blogpost</span><span class="dl">"</span><span class="p">,</span><span class="dl">"</span><span class="s2">career</span><span class="dl">"</span><span class="p">]</span>
<span class="p">}</span>
<span class="nt">&lt;/script&gt;</span>
</code></pre></div></div>

<p>I also added missing OpenGraph tags (<code class="language-plaintext highlighter-rouge">og:url</code>, <code class="language-plaintext highlighter-rouge">og:type</code>) and a Twitter Card <code class="language-plaintext highlighter-rouge">twitter:site</code> tag. These are small changes, but they improve how links appear when shared on LinkedIn, Twitter, and other platforms. The <code class="language-plaintext highlighter-rouge">og:type</code> dynamically switches between <code class="language-plaintext highlighter-rouge">article</code> for blog posts and <code class="language-plaintext highlighter-rouge">website</code> for other pages.</p>

<h4 id="floating-share-bar-and-footer-redesign">Floating share bar and footer redesign</h4>

<p>Blog posts got a floating share bar on desktop — four social buttons (Twitter, LinkedIn, Facebook, Reddit) that appear when you start reading and disappear when you scroll past the article. The visibility is controlled by checking the article’s bounding rectangle on the scroll:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">check</span><span class="p">()</span> <span class="p">{</span>
  <span class="kd">var</span> <span class="nx">rect</span> <span class="o">=</span> <span class="nx">article</span><span class="p">.</span><span class="nx">getBoundingClientRect</span><span class="p">();</span>
  <span class="kd">var</span> <span class="nx">show</span> <span class="o">=</span> <span class="nx">rect</span><span class="p">.</span><span class="nx">top</span> <span class="o">&lt;</span> <span class="nb">window</span><span class="p">.</span><span class="nx">innerHeight</span> <span class="o">*</span> <span class="mf">0.3</span>
          <span class="o">&amp;&amp;</span> <span class="nx">rect</span><span class="p">.</span><span class="nx">bottom</span> <span class="o">&gt;</span> <span class="nb">window</span><span class="p">.</span><span class="nx">innerHeight</span> <span class="o">*</span> <span class="mf">0.5</span><span class="p">;</span>
  <span class="nx">bar</span><span class="p">.</span><span class="nx">classList</span><span class="p">.</span><span class="nx">toggle</span><span class="p">(</span><span class="dl">'</span><span class="s1">visible</span><span class="dl">'</span><span class="p">,</span> <span class="nx">show</span><span class="p">);</span>
<span class="p">}</span>
<span class="nb">window</span><span class="p">.</span><span class="nx">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">scroll</span><span class="dl">'</span><span class="p">,</span> <span class="nx">check</span><span class="p">,</span> <span class="p">{</span> <span class="na">passive</span><span class="p">:</span> <span class="kc">true</span> <span class="p">});</span>
</code></pre></div></div>

<p><img src="https://andlukyane.com/images/website_redesign/2026-03-22_20-20-14.jpg" alt="The footer" /></p>

<p>The footer was redesigned from a single-column layout to a three-column grid (about, quick links, social icons) using CSS Grid. I also added proper focus-visible styles for accessibility across all interactive elements.</p>

<h3 id="wave-2-content-and-layout-overhaul-with-claude-code">Wave 2: content and layout overhaul with Claude Code</h3>

<p>With the infrastructure in place, the site still had the same content problems: outdated positioning, buried content, and flat hierarchies. For this part, I used Claude Code to do the whole thing in a single session (not one-shot).</p>

<h4 id="gathering-ideas">Gathering ideas</h4>

<p>First, I prepared a list of what I wanted to change myself, then asked three AI assistants (ChatGPT, Claude, Gemini) to analyze the live site and suggest improvements. All three converged on the same core problems: the strongest professional signals were buried or absent, good content was difficult to find, and some content was years out of date. I saved their suggestions into Markdown files, compared the overlapping recommendations, and wrote a combined improvement plan.</p>

<h4 id="about-page">About page</h4>

<p><img src="https://andlukyane.com/images/website_redesign/2026-03-22_20-24-24.jpg" alt="About page redesign" /></p>

<p>The old page was a chronological autobiography. The <a href="https://andlukyane.com/">new page</a> is structured as a hub: a short intro, a row of credential badges, a horizontally scrolling “Latest Paper Reviews” section, curated “Featured Work” and “Beyond Work” card grids, and a collapsible section for fun facts at the bottom.</p>

<p>The “Latest Paper Reviews” section autopopulates using a Liquid include:</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{% assign reviews = site.posts | where_exp: "post",
    "post.tags contains 'paperreview'" %}
{% for post in reviews limit:3 %}
<span class="nt">&lt;a</span> <span class="na">href=</span><span class="s">"{{ post.url }}"</span> <span class="na">class=</span><span class="s">"card"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"card__meta"</span><span class="nt">&gt;</span>{{ post.date | date: "%b %d, %Y" }}<span class="nt">&lt;/div&gt;</span>
  <span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"card__title"</span><span class="nt">&gt;</span>{{ post.title }}<span class="nt">&lt;/div&gt;</span>
  <span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"card__description"</span><span class="nt">&gt;</span>
    {{ post.description | truncate: 140 }}
  <span class="nt">&lt;/div&gt;</span>
<span class="nt">&lt;/a&gt;</span>
{% endfor %}
</code></pre></div></div>

<p>The horizontal scroll is done with pure CSS using flexbox and <code class="language-plaintext highlighter-rouge">overflow-x: auto</code>:</p>

<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">.card-scroll</span> <span class="p">{</span>
  <span class="nl">display</span><span class="p">:</span> <span class="n">flex</span><span class="p">;</span>
  <span class="py">gap</span><span class="p">:</span> <span class="m">20px</span><span class="p">;</span>
  <span class="nl">overflow-x</span><span class="p">:</span> <span class="nb">auto</span><span class="p">;</span>
  <span class="py">scroll-snap-type</span><span class="p">:</span> <span class="n">x</span> <span class="n">mandatory</span><span class="p">;</span>
  <span class="nl">-webkit-overflow-scrolling</span><span class="p">:</span> <span class="n">touch</span><span class="p">;</span>
<span class="p">}</span>

<span class="nc">.card-scroll</span> <span class="nc">.card</span> <span class="p">{</span>
  <span class="nl">min-width</span><span class="p">:</span> <span class="m">280px</span><span class="p">;</span>
  <span class="nl">max-width</span><span class="p">:</span> <span class="m">340px</span><span class="p">;</span>
  <span class="nl">flex-shrink</span><span class="p">:</span> <span class="m">0</span><span class="p">;</span>
  <span class="py">scroll-snap-align</span><span class="p">:</span> <span class="n">start</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The main technical challenge was that Jekyll refused to process Liquid tags in the root <code class="language-plaintext highlighter-rouge">index.html</code>. After debugging, the cause turned out to be a quirk of how Jekyll handles root-level files — possibly a conflict with the paginator plugin. The fix was moving the homepage to <code class="language-plaintext highlighter-rouge">_pages/home.html</code> with <code class="language-plaintext highlighter-rouge">permalink: /</code>. Not obvious, and the kind of thing that wastes hours.</p>

<h4 id="card-component-system">Card component system</h4>

<p>Most visual changes are built on a reusable CSS card system (<code class="language-plaintext highlighter-rouge">css/cards.css</code>, 237 lines). The core card is a simple flexbox column with a hover effect:</p>

<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">.card</span> <span class="p">{</span>
  <span class="nl">background</span><span class="p">:</span> <span class="n">var</span><span class="p">(</span><span class="n">--background-alt-color</span><span class="p">,</span> <span class="m">#f4f5f6</span><span class="p">);</span>
  <span class="nl">border-radius</span><span class="p">:</span> <span class="m">8px</span><span class="p">;</span>
  <span class="nl">padding</span><span class="p">:</span> <span class="m">24px</span><span class="p">;</span>
  <span class="nl">transition</span><span class="p">:</span> <span class="n">box-shadow</span> <span class="m">0.3s</span> <span class="n">ease</span><span class="p">,</span> <span class="n">transform</span> <span class="m">0.2s</span> <span class="n">ease</span><span class="p">;</span>
  <span class="nl">display</span><span class="p">:</span> <span class="n">flex</span><span class="p">;</span>
  <span class="nl">flex-direction</span><span class="p">:</span> <span class="n">column</span><span class="p">;</span>
<span class="p">}</span>

<span class="nc">.card</span><span class="nd">:hover</span> <span class="p">{</span>
  <span class="nl">box-shadow</span><span class="p">:</span> <span class="m">0</span> <span class="m">4px</span> <span class="m">20px</span> <span class="n">rgba</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0.1</span><span class="p">);</span>
  <span class="nl">transform</span><span class="p">:</span> <span class="n">translateY</span><span class="p">(</span><span class="m">-2px</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Cards are placed in responsive grids that go from one column on mobile to two or three on desktop:</p>

<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">.card-grid</span> <span class="p">{</span>
  <span class="nl">display</span><span class="p">:</span> <span class="n">grid</span><span class="p">;</span>
  <span class="py">grid-template-columns</span><span class="p">:</span> <span class="m">1</span><span class="n">fr</span><span class="p">;</span>
  <span class="py">gap</span><span class="p">:</span> <span class="m">20px</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">@media</span> <span class="p">(</span><span class="n">min-width</span><span class="p">:</span> <span class="m">768px</span><span class="p">)</span> <span class="p">{</span>
  <span class="nc">.card-grid</span> <span class="p">{</span> <span class="py">grid-template-columns</span><span class="p">:</span> <span class="nb">repeat</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">1</span><span class="n">fr</span><span class="p">);</span> <span class="p">}</span>
<span class="p">}</span>

<span class="k">@media</span> <span class="p">(</span><span class="n">min-width</span><span class="p">:</span> <span class="m">1024px</span><span class="p">)</span> <span class="p">{</span>
  <span class="nc">.card-grid--3</span> <span class="p">{</span> <span class="py">grid-template-columns</span><span class="p">:</span> <span class="nb">repeat</span><span class="p">(</span><span class="m">3</span><span class="p">,</span> <span class="m">1</span><span class="n">fr</span><span class="p">);</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Tag chips are small pill-shaped links that appear on cards and in the blog listing:</p>

<div class="language-css highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">.tag-chip</span> <span class="p">{</span>
  <span class="nl">display</span><span class="p">:</span> <span class="n">inline-block</span><span class="p">;</span>
  <span class="nl">font-size</span><span class="p">:</span> <span class="m">12px</span><span class="p">;</span>
  <span class="nl">padding</span><span class="p">:</span> <span class="m">4px</span> <span class="m">10px</span><span class="p">;</span>
  <span class="nl">border-radius</span><span class="p">:</span> <span class="m">12px</span><span class="p">;</span>
  <span class="nl">background</span><span class="p">:</span> <span class="n">var</span><span class="p">(</span><span class="n">--background-color</span><span class="p">,</span> <span class="m">#ffffff</span><span class="p">);</span>
  <span class="nl">color</span><span class="p">:</span> <span class="n">var</span><span class="p">(</span><span class="n">--text-light-color</span><span class="p">,</span> <span class="m">#6B7B8D</span><span class="p">);</span>
  <span class="nl">border</span><span class="p">:</span> <span class="m">1px</span> <span class="nb">solid</span> <span class="n">var</span><span class="p">(</span><span class="n">--border-color</span><span class="p">,</span> <span class="m">#dddddd</span><span class="p">);</span>
  <span class="nl">transition</span><span class="p">:</span> <span class="n">background</span> <span class="m">0.2s</span> <span class="n">ease</span><span class="p">,</span> <span class="n">color</span> <span class="m">0.2s</span> <span class="n">ease</span><span class="p">;</span>
<span class="p">}</span>

<span class="nt">a</span><span class="nc">.tag-chip</span><span class="nd">:hover</span> <span class="p">{</span>
  <span class="nl">background</span><span class="p">:</span> <span class="n">var</span><span class="p">(</span><span class="n">--accent-color</span><span class="p">,</span> <span class="m">#3498db</span><span class="p">);</span>
  <span class="nl">color</span><span class="p">:</span> <span class="m">#ffffff</span><span class="p">;</span>
  <span class="nl">border-color</span><span class="p">:</span> <span class="n">var</span><span class="p">(</span><span class="n">--accent-color</span><span class="p">,</span> <span class="m">#3498db</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Everything uses CSS custom properties, so dark mode works automatically.</p>

<h4 id="blog-listing-tag-chips">Blog listing tag chips</h4>

<p><img src="https://andlukyane.com/images/website_redesign/2026-03-22_20-30-15.jpg" alt="Tag chips example" /></p>

<p>Every post in the <a href="https://andlukyane.com/blog/">blog</a> listing now shows up to four clickable tag chips. The first attempt had an interesting bug: the entire blog post card was clickable via a jQuery handler in <code class="language-plaintext highlighter-rouge">personal.js</code>, so clicking a tag chip navigated to the post instead of the tag page. The fix was a guard in the click handler:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">$</span><span class="p">(</span><span class="nb">document</span><span class="p">).</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">click</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">.post</span><span class="dl">'</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">e</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">$</span><span class="p">(</span><span class="nx">e</span><span class="p">.</span><span class="nx">target</span><span class="p">).</span><span class="nx">closest</span><span class="p">(</span><span class="dl">'</span><span class="s1">.tag-chip</span><span class="dl">'</span><span class="p">).</span><span class="nx">length</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span><span class="p">;</span>  <span class="c1">// let the tag link handle its own navigation</span>
  <span class="p">}</span>
  <span class="c1">// ... existing post navigation code</span>
<span class="p">});</span>
</code></pre></div></div>

<h4 id="other-pages">Other pages</h4>

<p><img src="https://andlukyane.com/images/website_redesign/2026-03-22_20-32-24.jpg" alt="Activities change" /></p>

<p>I restructured the <a href="https://andlukyane.com/activities">Activities</a> page from flat collapsible lists into cards for publications and talks, data tables for Kaggle notebooks and competitions, and compact inline archives for the full talk history. The <a href="https://andlukyane.com/project">Projects</a> page went from two entries to eight, organized into Featured Projects (with images), Production ML, and Tools and Writing. The <a href="https://andlukyane.com/tags">Tags</a> page now groups 120+ tags into 12 thematic categories displayed as a two-column card grid, and individual tag pages show posts as cards instead of plain lists.</p>

<h3 id="working-with-claude-code">Working with Claude Code</h3>

<p>Most of the changes were done in a single Claude Code session over a couple of days. The most useful pattern was version comparison: for each major page redesign, I asked it to create three versions at temporary URLs (<code class="language-plaintext highlighter-rouge">/about-v1</code>, <code class="language-plaintext highlighter-rouge">/about-v2</code>, <code class="language-plaintext highlighter-rouge">/about-v3</code>), previewed them all in the browser, and described which elements to combine. It is faster to compare three concrete implementations than to describe an abstract preference in words. One version might have better visual appeal, another a better structure, and a third a useful new feature.</p>

<p>Not everything worked well on the first attempt; I had to spend considerable time on debugging and testing. For the Liquid processing issue, Claude Code tried several approaches — adding <code class="language-plaintext highlighter-rouge">layout: page</code> to the front matter, moving Liquid into an include file, and finally identifying that the <code class="language-plaintext highlighter-rouge">_pages</code> collection was the correct fix. It diagnosed problems by inspecting the built <code class="language-plaintext highlighter-rouge">_site/index.html</code> output and checking whether Liquid tags had been resolved. For the tag chip click issue, it found the jQuery handler, understood the event propagation, and added the guard.</p>

<p>All content decisions were mine — which posts to feature, what texts to write, what to include or exclude. Some CSS required several iterations because I could see the rendered result, but Claude Code could only verify by grepping the HTML output. When I sent screenshots of broken layouts, it could usually identify the issue, though not always on the first try. It also occasionally generated text that was too promotional, which I had to tone down.</p>

<h3 id="summary-of-changes">Summary of changes</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Before</th>
      <th>After</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Search</td>
      <td>None</td>
      <td>Full-text with Ctrl+K</td>
    </tr>
    <tr>
      <td>Dark mode coverage</td>
      <td>Partial</td>
      <td>Complete (250+ CSS rules)</td>
    </tr>
    <tr>
      <td>Structured data</td>
      <td>None</td>
      <td>WebSite + BlogPosting JSON-LD</td>
    </tr>
    <tr>
      <td>OpenGraph tags</td>
      <td>Partial</td>
      <td>Complete (og:url, og:type, twitter:site)</td>
    </tr>
    <tr>
      <td>Share functionality</td>
      <td>Bottom of post only</td>
      <td>Floating sidebar + bottom</td>
    </tr>
    <tr>
      <td>About page</td>
      <td>Chronological bio</td>
      <td>6 sections with cards and badges</td>
    </tr>
    <tr>
      <td>Projects shown</td>
      <td>2</td>
      <td>8</td>
    </tr>
    <tr>
      <td>Activities structure</td>
      <td>Flat lists</td>
      <td>Cards, tables, compact archives</td>
    </tr>
    <tr>
      <td>Tag organization</td>
      <td>Flat list</td>
      <td>12 thematic card groups</td>
    </tr>
    <tr>
      <td>Blog tag chips</td>
      <td>None</td>
      <td>Up to 4 per post</td>
    </tr>
  </tbody>
</table>

<h3 id="conclusions">Conclusions</h3>

<p>The site now reflects where I am in 2026 rather than where I was in 2021 or 2023. The design is still built on a purchased Jekyll theme with incremental modifications, but it is much better at helping people discover useful content. For me, this is good enough.</p>

<p>Additionally, this was an interesting exercise in using AI to improve a personal website. I remember that it took me almost a month to set up the initial version, then days or weeks to make specific changes because I didn’t know web development well enough. This redesign took less than a week in total, and I learnt a lot about using AI and web development.</p>]]></content><author><name></name></author><category term="blogpost" /><category term="career" /><summary type="html"><![CDATA[A practical walkthrough of redesigning a Jekyll personal website — adding search, dark mode, structured data, and a full card-based redesign using Claude Code. What changed, what broke, and what it is like to iterate on a site with an AI coding assistant.]]></summary></entry><entry><title type="html">Collaborative Reinforcement Learning: Why HACRL Trains Models in Teams Instead of Isolation</title><link href="https://andlukyane.com//blog/paper-review-haclr" rel="alternate" type="text/html" title="Collaborative Reinforcement Learning: Why HACRL Trains Models in Teams Instead of Isolation" /><published>2026-03-16T00:00:00+00:00</published><updated>2026-03-16T00:00:00+00:00</updated><id>https://andlukyane.com//blog/paper-review-haclr</id><content type="html" xml:base="https://andlukyane.com//blog/paper-review-haclr"><![CDATA[<h2 id="collaborative-reinforcement-learning-why-hacrl-trains-models-in-teams-instead-of-isolation">Collaborative Reinforcement Learning: Why HACRL Trains Models in Teams Instead of Isolation</h2>

<p><a href="https://arxiv.org/abs/2603.02604">Paper</a></p>

<p><a href="https://zzx-peter.github.io/hacrl/">Project</a></p>

<p><img src="https://andlukyane.com/images/paper_reviews/haclr/2026-03-15_18-43-39.jpg" alt="Main image" /></p>

<p>Modern reinforcement learning pipelines for large models are usually <strong>isolated</strong>: each model generates its own rollouts, evaluates them with a reward model, and updates its policy independently. This setup is inefficient because different models often explore different parts of the solution space, but their discoveries remain locked inside each training run.</p>

<p>This paper explores a different idea: <strong>what if multiple heterogeneous models could learn from each other during training</strong>? HACRL (Heterogeneous Agent Collaborative Reinforcement Learning) enables multiple models to <strong>share successful trajectories while remaining independent during inference</strong>. The key algorithm, HACPO, enables this rollout sharing while correcting for policy mismatch between agents. In practice, this collaborative training improves reasoning performance and reduces rollout costs compared to standard RL approaches such as GSPO. The result is a simple yet powerful shift: instead of training models in isolation, HACRL lets them <strong>improve collectively</strong> through shared exploration.</p>

<h3 id="heterogeneous-agent-collaborative-reinforcement-learning">Heterogeneous Agent Collaborative Reinforcement Learning</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/haclr/2026-03-15_17-48-37.jpg" alt="Shared rollouts" /></p>

<p>Most reinforcement learning pipelines train models separately, so even when multiple models are trained on the same task, their discoveries stay siloed. One model might find a useful reasoning strategy, but the others never benefit from it.</p>

<p>HACRL proposes a different approach: <strong>multiple heterogeneous agents collaborate during training by sharing successful trajectories</strong>. The agents differ in architecture, size, or capability. When a model discovers a high-quality solution, the trajectory is shared with the others, allowing them to learn from that experience. The collaboration happens <strong>only during training</strong> - at inference time, each model still operates independently.</p>

<p>Because the agents differ in capability, naively reusing one agent’s trajectories to update another can introduce bias or destabilize learning. HACPO addresses this with two mechanisms (see the sketch after this list):</p>
<ul>
  <li><strong>Exponential importance sampling</strong> adjusts the learning signal based on how compatible a shared trajectory is with the agent’s own policy: trajectories that look plausible receive stronger weight, while very unlikely ones have their influence reduced.</li>
  <li><strong>Stepwise clipping</strong> further stabilizes training by limiting how large the policy ratio can become at each step of the sequence, preventing extremely unlikely tokens from producing large gradients.</li>
</ul>
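
<p>To make the mechanics concrete, here is a minimal sketch of how such a correction could look. The temperature <code>tau</code> and the clipping bounds are my illustrative assumptions, not values from the paper:</p>

<pre><code class="language-python">
import torch

def hacpo_weighted_advantage(logp_new, logp_behavior, advantages,
                             tau=1.0, clip_lo=0.8, clip_hi=1.2):
    # All inputs are per-token tensors of shape (seq_len,);
    # logp_behavior comes from the agent that generated the shared rollout.

    # Exponential importance sampling: trajectories that look plausible
    # under the learner keep a weight near 1; very unlikely ones decay.
    log_ratio = logp_new - logp_behavior
    weight = torch.exp(torch.clamp(log_ratio / tau, max=0.0))

    # Stepwise clipping: bound the per-token policy ratio so extremely
    # unlikely tokens cannot produce huge gradients.
    ratio = torch.clamp(torch.exp(log_ratio), clip_lo, clip_hi)

    return weight * ratio * advantages
</code></pre>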

<p>The result is a collaborative training loop where agents both <strong>explore the problem space independently and learn from each other’s discoveries</strong>. Stronger models may uncover effective reasoning strategies, while weaker models contribute additional exploration diversity. Instead of repeating the same exploration across multiple training runs, HACRL turns training into a <strong>shared learning process</strong>, improving both performance and sample efficiency.</p>

<h3 id="experiments">Experiments</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/haclr/2026-03-15_18-40-35.jpg" alt="Experiments" /></p>

<p>The experiments evaluate HACRL using multiple heterogeneous language models collaborating during reinforcement learning. The results show that collaborative training consistently improves reasoning performance compared to standard single-agent RL optimization.</p>

<p>In particular, HACRL outperforms strong baselines such as <strong>GSPO</strong>, while requiring <strong>significantly fewer rollout samples</strong>. The gains are especially noticeable when agents have complementary capabilities: stronger models help guide exploration, while weaker models contribute additional trajectory diversity. These results suggest that <strong>shared exploration across models can improve both performance and training efficiency</strong>, supporting the central idea behind collaborative reinforcement learning.</p>

<h3 id="conclusions">Conclusions</h3>

<p>Traditional RL post-training methods such as <strong>PPO</strong>, <strong>GRPO</strong>, or <strong>GSPO</strong> optimize each model independently using its own rollouts. <strong>Knowledge distillation</strong> allows models to learn from others, but information typically flows in one direction—from teacher to student. <strong>Multi-agent reinforcement learning</strong> trains multiple agents together, but those systems usually require joint deployment or shared policies.</p>

<p>HACRL proposes a different design: <strong>collaboration during training, independence during inference</strong>. Models exchange useful experience without becoming dependent on each other. This makes the approach particularly attractive in modern AI ecosystems where many models coexist. Instead of competing for training data, models can <strong>collectively explore the solution space and accelerate learning</strong>.</p>]]></content><author><name></name></author><category term="paperreview" /><category term="deeplearning" /><category term="rl" /><category term="llm" /><category term="optimization" /><category term="agent" /><category term="finetuning" /><summary type="html"><![CDATA[HACRL proposes a new paradigm for reinforcement learning - instead of training models in isolation, multiple agents collaborate by sharing successful trajectories during training. This simple idea enables more efficient exploration and improves performance across heterogeneous models.]]></summary></entry><entry><title type="html">Beyond Positional Bias: How DroPE Unlocks Zero-Shot Long Context in LLMs</title><link href="https://andlukyane.com//blog/paper-review-drope" rel="alternate" type="text/html" title="Beyond Positional Bias: How DroPE Unlocks Zero-Shot Long Context in LLMs" /><published>2026-02-23T00:00:00+00:00</published><updated>2026-02-23T00:00:00+00:00</updated><id>https://andlukyane.com//blog/paper-review-drope</id><content type="html" xml:base="https://andlukyane.com//blog/paper-review-drope"><![CDATA[<h2 id="beyond-positional-bias-how-drope-unlocks-zero-shot-long-context-in-llms">Beyond Positional Bias: How DroPE Unlocks Zero-Shot Long Context in LLMs</h2>

<p><a href="https://arxiv.org/abs/2512.12167">Paper</a></p>

<p><a href="https://github.com/SakanaAI/DroPE">Code</a></p>

<p><a href="https://pub.sakana.ai/DroPE/">Project</a></p>

<p><img src="https://andlukyane.com/images/paper_reviews/drope/2026-02-23_10-19-38.jpg" alt="Main image" /></p>

<p>Modern LLMs struggle to generalize to sequences longer than their pretraining context; they typically require <strong>expensive long-context fine-tuning</strong> or architecture changes to extend usable context length. DroPE challenges this paradigm by <strong>removing positional embeddings after training</strong> and showing that while explicit positional biases aid convergence, they are not fundamentally needed at test time and, in fact, limit generalization to longer sequences.</p>

<p>The core insight is simple but counterintuitive: while positional embeddings (Rotary Positional Embeddings) accelerate early training by providing inductive structure, they become a bottleneck for generalizing to longer contexts. After a brief recalibration phase with positional embeddings dropped, pretrained LLMs exhibit <strong>zero-shot context extension far beyond their original sequence length</strong> without compromising performance on standard tasks. DroPE outperforms prior positional scaling methods and specialized architectures across different models and datasets at a fraction of the cost.</p>

<h3 id="explicit-positional-embeddings-are-beneficial-for-training">Explicit positional embeddings are beneficial for training</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/drope/2026-02-23_09-55-44.jpg" alt="RoPE outperforms NoPE" /></p>

<p>Although <strong>NoPE (No Positional Embedding)</strong> transformers are theoretically as expressive as RoPE models (they can reconstruct positional information using the causal mask), they consistently train worse in practice. Empirically, NoPE models show higher perplexity throughout training.</p>

<p><img src="https://andlukyane.com/images/paper_reviews/drope/2026-02-23_09-51-12.jpg" alt="RoPE transformers have higher positional bias gradients at initialization" /></p>

<p>The authors argue that the difference is not about expressivity, but <strong>optimization dynamics</strong>. In NoPE transformers, positional information and attention non-uniformity (diagonal or near-diagonal attention patterns) develop only gradually because the gradients driving positional bias are bounded and small at initialization. Explicit positional embeddings like RoPE, however, inject strong positional bias from the start, enabling faster formation of useful attention patterns and more efficient training.</p>

<p>This means that <strong>NoPE can represent position, but learns it too slowly</strong>, while RoPE provides an inductive bias that accelerates training by immediately shaping attention toward position-sensitive structures.</p>

<h3 id="rope-prevents-effective-zero-shot-context-extension">RoPE prevents effective zero-shot context extension</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/drope/2026-02-23_10-03-00.jpg" alt="YaRN crops effective retrieval context" /></p>

<p>RoPE scaling methods such as YaRN can maintain stable perplexity on longer sequences, but they <strong>fail to truly generalize beyond the training context</strong>: on downstream tasks, they behave similarly to simply cropping inputs to the original context length, effectively ignoring information that appears far into the sequence. This failure is clearly visible in long-context retrieval tasks like needle-in-a-haystack, where important distant information is missed despite unchanged perplexity.</p>

<p>The reason is <strong>frequency compression</strong>, which is unavoidable in RoPE scaling. To keep positional phases within the training distribution, all post-hoc scaling methods must aggressively compress <strong>low frequencies</strong>. However, low frequencies are primarily used by <strong>semantic attention heads</strong>, not positional ones. As a result, scaling leaves positional (high-frequency) heads mostly intact but <strong>warps semantic attention at long distances</strong>, increasingly so as sequence length grows. This shifts attention away from the correct tokens in long-range settings, explaining why RoPE scaling cannot deliver true zero-shot context extension.</p>
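
<p>A small numerical sketch of this effect, under my simplified version of NTK-by-parts/YaRN-style scaling (the real methods use smoother ramps):</p>

<pre><code class="language-python">
import numpy as np

def rope_inv_freq(dim, base=10000.0):
    # One rotation frequency per channel pair: early pairs rotate fast
    # (high-frequency, positional), later pairs slowly (low-frequency, semantic).
    idx = np.arange(0, dim, 2)
    return base ** (-idx / dim)

def compress_low_freqs(inv_freq, orig_ctx, scale):
    # Pairs whose full wavelength fits inside the training context keep
    # their frequency; pairs with longer wavelengths - exactly the
    # low-frequency channels that semantic heads rely on - are slowed
    # down by `scale` so their phases stay inside the training distribution.
    wavelength = 2.0 * np.pi / inv_freq
    return np.where(np.greater(wavelength, orig_ctx), inv_freq / scale, inv_freq)

inv_freq = rope_inv_freq(dim=128)
scaled = compress_low_freqs(inv_freq, orig_ctx=4096, scale=4.0)
print((scaled != inv_freq).sum())  # only the slowest (semantic) pairs change
</code></pre>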

<h3 id="drope-dropping-positional-embeddings-after-pretraining">DroPE: Dropping positional embeddings after pretraining</h3>

<p>The conclusion is that positional embeddings are crucial for efficient language model training, but they fundamentally limit long-context generalization. The key insight is that positional information is needed only during training, not at inference.</p>

<p>Based on this, the authors introduce <strong>DroPE</strong>, a simple procedure that removes all positional embeddings from a pretrained model and applies a short recalibration phase. After this, the model preserves its original in-context performance while gaining <strong>strong zero-shot generalization to much longer sequences</strong>, outperforming RoPE scaling methods and alternative long-context architectures.</p>
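
<p>The procedure itself is compact. A rough sketch, assuming a HuggingFace-style model interface; the <code>use_rope</code> flag is hypothetical, standing in for however a given codebase applies rotary embeddings:</p>

<pre><code class="language-python">
import torch

def drope_recalibrate(model, data_loader, optimizer, steps=1000):
    # Step 1: drop positional embeddings entirely; positional information
    # must now be recovered from the causal mask alone.
    for layer in model.layers:
        layer.self_attn.use_rope = False  # hypothetical flag

    # Step 2: a short recalibration with the ordinary LM objective on the
    # original context length - no long-sequence data is needed.
    model.train()
    for step, batch in enumerate(data_loader):
        if step == steps:
            break
        loss = model(batch["input_ids"], labels=batch["labels"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
</code></pre>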

<p><img src="https://andlukyane.com/images/paper_reviews/drope/2026-02-23_10-14-02.jpg" alt="Results" /></p>

<p>In training-from-scratch experiments (0.5B models on 16B tokens), the authors train a RoPE model normally for 14B tokens, then remove positional embeddings and continue training for the final 2B tokens as a short recalibration phase. This DroPE model matches the original RoPE model’s in-context perplexity while clearly outperforming NoPE and other long-context architectures (ALiBi, RNoPE-SWA) on long-context retrieval tasks from the RULER benchmark. Compared to RoPE scaling methods (PI, NTK-RoPE, YaRN), DroPE consistently achieves higher success rates at 2× the training context length.</p>

<p>They also apply DroPE to an already-pretrained 360M SmolLM model trained on 600B tokens, demonstrating that it can extend context in “models in the wild”. With recalibration budgets of 30–120B tokens (and adding QKNorm to stabilize training), DroPE effectively adapts pretrained LMs for long-context generalization without architectural changes.</p>

<p><img src="https://andlukyane.com/images/paper_reviews/drope/2026-02-23_10-14-44.jpg" alt="Larger models" /></p>

<p>DroPE scales effectively to <strong>billion-parameter models</strong>. When applied to SmolLM-1.7B and Llama2-7B, DroPE requires only <strong>~20B tokens of recalibration</strong> (2% and 0.5% of their original training budgets). Even with this minimal additional training, DroPE consistently outperforms state-of-the-art RoPE scaling methods on long-context question answering and summarization, demonstrating that the approach remains effective at large scale and can be applied immediately to existing models.</p>

<h3 id="conclusions">Conclusions</h3>

<p>DroPE stands out among other long-context methods by <strong>decoupling the benefits of positional embeddings during training from their drawbacks at inference</strong>. Compared to scaling or adapting existing positional schemes (like RoPE scaling methods), which still rely on retained or rescaled position information and often require further fine-tuning, DroPE requires no additional long-context training while preserving original model capabilities.</p>

<p>I like that DroPE exhibits efficient pretrained convergence and scalable inference. More than that, it yields zero-shot extension with only a minimal recalibration phase without ever needing to train on long-sequence data. This makes it a very computationally elegant and cost-effective context extension strategy.</p>]]></content><author><name></name></author><category term="paperreview" /><category term="deeplearning" /><category term="llm" /><category term="attention" /><category term="transformer" /><category term="pretraining" /><category term="nlp" /><category term="scaling" /><category term="optimization" /><category term="efficiency" /><category term="fewshotlearning" /><summary type="html"><![CDATA[A review of DroPE, a simple but counterintuitive method that extends LLM context length by dropping positional embeddings at inference and achieves strong zero-shot long-context generalization without retraining.]]></summary></entry><entry><title type="html">Kimi k2.5 Review: Native Multimodality and Agent Swarms at 1 Trillion Parameters</title><link href="https://andlukyane.com//blog/paper-review-kimik25" rel="alternate" type="text/html" title="Kimi k2.5 Review: Native Multimodality and Agent Swarms at 1 Trillion Parameters" /><published>2026-02-16T00:00:00+00:00</published><updated>2026-02-16T00:00:00+00:00</updated><id>https://andlukyane.com//blog/paper-review-kimik25</id><content type="html" xml:base="https://andlukyane.com//blog/paper-review-kimik25"><![CDATA[<h2 id="kimi-k25-review-native-multimodality-and-agent-swarms-at-1-trillion-parameters">Kimi k2.5 Review: Native Multimodality and Agent Swarms at 1 Trillion Parameters</h2>

<p><a href="https://arxiv.org/abs/2602.02276">Paper</a></p>

<p><a href="https://github.com/MoonshotAI/Kimi-K2.5">Code</a></p>

<p><a href="https://www.kimi.com/blog/kimi-k2-5.html">Project</a></p>

<p><img src="https://andlukyane.com/images/paper_reviews/kimik25/2026-02-16_20-46-20.jpg" alt="Main image" /></p>

<p>Kimi K2.5 represents a significant step forward for open-source multimodal AI by tackling two converging frontiers simultaneously: <strong>native text-vision integration</strong> and <strong>scalable agentic intelligence</strong>. Unlike many contemporary models that augment a predominantly text-based backbone with vision tokens late in training, K2.5 adopts joint pre-training and reinforcement learning across text and visual data — enabling both modalities to mutually strengthen one another rather than competing for capacity. This design choice, along with novel techniques like zero-vision supervised fine-tuning, positions the model as a unified multimodal system capable of complex perception and reasoning tasks.</p>

<p>On the agentic side, K2.5 introduces <strong>Agent Swarm</strong>, a dynamic framework for parallel agent orchestration that yields up to a ~4.5x reduction in inference latency and higher task performance compared to traditional sequential agent execution. It allows the model to dynamically decompose a prompt into a graph of heterogeneous sub-tasks, instantiate specialized agents for each, and execute them concurrently.</p>

<p>Benchmark results show Kimi k2.5 achieving state-of-the-art (SOTA) performance in coding, math, and visual agentic tasks, outperforming comparable open-source models and rivaling proprietary giants like GPT-5.2 in specific high-reasoning domains.</p>

<h3 id="joint-optimization-of-text-and-vision">Joint Optimization of Text and Vision</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/kimik25/2026-02-16_19-29-41.jpg" alt="Joint Optimization of Text and Vision" /></p>

<p>Kimi K2.5 is trained as a <strong>native</strong> multimodal model via large-scale joint pre-training on ~15T mixed text and vision tokens. Contrary to common practice, the authors show that aggressively injecting large amounts of vision data late in training is unnecessary. With a fixed token budget, <strong>early fusion with a moderate vision ratio</strong> performs better than late, vision-heavy training. This results in a novel design choice: integrate vision early and co-optimize both modalities throughout training to learn balanced multimodal representations.</p>

<p>To address the cold-start problem in multimodal reinforcement learning, the authors introduce <strong>zero-vision SFT</strong>. Instead of using scarce, hand-annotated vision SFT data, K2.5 uses high-quality <strong>text-only SFT</strong>, with visual operations proxied via programmatic operations in IPython. Experiments show that this is sufficient to activate visual and agentic behaviors, likely due to the prior joint pre-training; the usual text–vision SFT actually performs worse because of limited data quality.</p>

<p>Finally, K2.5 applies <strong>outcome-based visual RL</strong> on tasks that explicitly require vision (grounding, counting, charts, STEM problems). This improves visual reasoning and agentic behavior and also boosts text-only performance on benchmarks like MMLU-Pro, GPQA, and LongBench. At the post-training stage, the model is trained with <strong>joint multimodal RL</strong>, where experts are organized by <strong>ability</strong> (reasoning, coding, agentic skills) rather than modality. This paradigm maximizes cross-modal transfer, allowing improvements in vision to generalize to text and vice versa, without degrading language performance.</p>

<h3 id="agent-swarm">Agent Swarm</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/kimik25/2026-02-16_19-45-25.jpg" alt="Agent Swarm" /></p>

<p>The authors argue that <strong>sequential agent execution is a fundamental bottleneck</strong> for complex, long-horizon tasks. As tasks grow wider (information gathering) and deeper (branching reasoning), a single agent quickly exhausts reasoning depth and tool-call budgets, making purely sequential systems poorly scalable.</p>

<p>To address this, Kimi K2.5 introduces <strong>Agent Swarm</strong> with <strong>Parallel Agent Reinforcement Learning (PARL)</strong>. Instead of hard-coded parallelism or fixed heuristics, the model <strong>learns when and how to parallelize</strong> via RL. An orchestrator dynamically decomposes tasks, spawns heterogeneous subagents, and schedules them to run concurrently. Parallelism is not assumed to be beneficial by default; it emerges through reward-driven exploration based on task outcomes and efficiency.</p>

<p>PARL uses a <strong>decoupled architecture</strong>: a trainable orchestrator coordinates <strong>frozen subagents</strong> sampled from intermediate checkpoints. This avoids unstable end-to-end multi-agent training and sidesteps credit-assignment issues, treating subagent outputs as environmental observations rather than differentiable actions. Training starts with smaller subagents and progressively scales up, with dynamic resource allocation between the orchestrator and subagents.</p>

<p><img src="https://andlukyane.com/images/paper_reviews/kimik25/2026-02-16_19-46-47.jpg" alt="Accuracy, parallelism and step" /></p>

<p>The reward function combines task performance with two auxiliary terms: one encourages subagent instantiation (to avoid collapsing back to serial execution), and another rewards successful subtask completion (to prevent meaningless parallelism). These auxiliary rewards are annealed to zero over the course of training, ensuring the final policy optimizes task quality.</p>
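
<p>In code, the shaping could look roughly like this; the weights and the linear annealing schedule are my assumptions, since the paper only specifies that the auxiliary terms decay to zero:</p>

<pre><code class="language-python">
def parl_reward(task_score, n_subagents, n_success, step, total_steps,
                w_spawn=0.1, w_success=0.1):
    # The anneal factor goes from 1 to 0 over training, so the final
    # policy is optimized for task quality alone.
    anneal = max(0.0, 1.0 - step / total_steps)
    spawn_bonus = w_spawn * min(n_subagents, 1)  # discourage serial collapse
    success_bonus = w_success * n_success        # discourage useless parallelism
    return task_score + anneal * (spawn_bonus + success_bonus)
</code></pre>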

<p>To explicitly optimize latency, the authors introduce <strong>critical steps</strong>, a metric similar to the critical path in parallel computation. Training and evaluation are constrained by critical steps rather than total steps, which incentivizes decompositions that reduce end-to-end execution time instead of merely increasing concurrency.</p>
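
<p>Critical steps are essentially the critical-path length over the subtask graph. A self-contained toy example:</p>

<pre><code class="language-python">
from functools import lru_cache

def critical_steps(deps, cost):
    # deps: subtask -> list of subtasks it waits for
    # cost: subtask -> its own number of sequential steps
    @lru_cache(maxsize=None)
    def finish_time(task):
        preds = deps.get(task, [])
        earliest_start = max((finish_time(p) for p in preds), default=0)
        return earliest_start + cost[task]

    # The longest dependency chain, not the total step count, bounds
    # wall-clock latency when independent subtasks run concurrently.
    return max(finish_time(t) for t in cost)

# Two research subtasks that can run concurrently, then one synthesis step:
deps = {"synthesize": ["search_a", "search_b"]}
cost = {"search_a": 5, "search_b": 7, "synthesize": 2}
print(critical_steps(deps, cost))  # 9 critical steps, although 14 steps in total
</code></pre>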

<p>Finally, the orchestrator is trained on <strong>synthetic stress-test prompts</strong> designed to break sequential agents, emphasizing wide search, deep branching, and real-world workloads like long-document analysis. Without explicitly instructing parallelism, the task distribution naturally favors parallel execution, teaching the orchestrator to exploit Agent Swarm where it provides real benefit.</p>

<h3 id="the-general-approach">The general approach</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/kimik25/2026-02-16_20-11-38.jpg" alt="Overview of training stages" /></p>

<p>Kimi K2.5 is built on <strong>Kimi K2</strong>, a trillion-parameter MoE language model trained on 15T text tokens, and extends it into a native multimodal, agentic system. Kimi K2 MoE LLM is combined with <strong>MoonViT-3D</strong>, a native-resolution vision encoder, connected via an MLP projector. MoonViT-3D unifies image and video understanding by sharing parameters and embedding space across both modalities, treating short video clips as spatiotemporal “patch-and-pack” sequences. This helps to avoid specialized video modules while enabling efficient long-video processing through lightweight temporal pooling and compression.</p>

<p>Pre-training proceeds in three stages over ~15T additional multimodal tokens:</p>
<ul>
  <li>standalone ViT training to build a strong visual encoder from image–text and video–text data;</li>
  <li>joint text–vision pre-training starting from a near-final Kimi K2 checkpoint to co-optimize language and multimodal capabilities;</li>
  <li>mid-training with higher-quality data and <strong>long-context activation</strong>, progressively extending context length via YaRN interpolation.</li>
</ul>

<p>Post-training follows large-scale <strong>SFT + RL</strong>. Supervised fine-tuning uses synthesized high-quality responses from multiple Kimi variants and expert models, emphasizing reasoning depth and tool use. Reinforcement learning is unified across text, vision, and agentic behaviors using a shared agentic RL environment. The policy optimization uses token-level clipping to stabilize off-policy RL in long-horizon, tool-using settings, paired with outcome-based and GRM-based rewards. Visual tasks use task-specific reward formulations (IoU, edit distance, counting error, etc.), while <strong>Generative Reward Models</strong> provide fine-grained preference signals across diverse agent types.</p>

<h3 id="evaluations">Evaluations</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/kimik25/2026-02-16_19-25-42.jpg" alt="Evaluations" /></p>

<p>Kimi K2.5 delivers frontier-level performance across reasoning, coding, agentic, and multimodal benchmarks, often matching or surpassing leading proprietary models. It shows strong STEM and scientific reasoning (AIME, HMMT, MMLU-Pro, GPQA), clear gains from tool use on HLE, and competitive real-world software engineering performance (SWE-Bench, LiveCodeBench, CyberGym).</p>

<p><img src="https://andlukyane.com/images/paper_reviews/kimik25/2026-02-16_20-28-33.jpg" alt="Agent swarm evaluations" /></p>

<p>The <strong>Agent Swarm</strong> delivers large gains beyond the base model: multi-agent orchestration consistently improves performance over single-agent K2.5 and often surpasses GPT-5.2 Pro. These gains are paired with latency reductions: parallel execution yields 3–4.5x wall-clock speedups, with execution time remaining nearly constant as task difficulty increases.</p>

<p>Additionally, Agent Swarm functions as proactive context management: tasks are decomposed into parallel, semantically isolated subtasks handled by specialized subagents with bounded local context. Only distilled results are returned to the orchestrator, enabling context sharding rather than context truncation. This allows K2.5 to scale to long-horizon, high-complexity tasks while preserving reasoning integrity and modularity.</p>

<h3 id="conclusions">Conclusions</h3>

<p>Kimi k2.5 positions itself as a strong open-source competitor in the agentic domain, directly challenging GPT-5.2 and Claude Opus 4.5. Kimi k2.5 differentiates itself from other models thanks to its native visual capabilities.</p>

<p>I really like the Agent Swarm orchestration introduced here. Most current models rely on long, linear contexts that degrade in speed as complexity increases. Kimi k2.5 breaks this “linear curse” by parallelizing the reasoning process. The fact that Anthropic already implemented this functionality shows that it was a great idea.</p>]]></content><author><name></name></author><category term="paperreview" /><category term="deeplearning" /><category term="llm" /><category term="vlm" /><category term="visual" /><category term="mllm" /><summary type="html"><![CDATA[A deep-dive review of Kimi K2.5, a next-generation open multimodal model that combines native vision-language training with parallel agent orchestration. This post explains why Agent Swarm and joint multimodal optimization matter and how K2.5 meaningfully differs from today's top closed and open models.]]></summary></entry><entry><title type="html">Paper Review: PaperBanana: Automating Academic Illustration for AI Scientists</title><link href="https://andlukyane.com//blog/paper-review-paperbanana" rel="alternate" type="text/html" title="Paper Review: PaperBanana: Automating Academic Illustration for AI Scientists" /><published>2026-02-09T00:00:00+00:00</published><updated>2026-02-09T00:00:00+00:00</updated><id>https://andlukyane.com//blog/paper-review-paperbanana</id><content type="html" xml:base="https://andlukyane.com//blog/paper-review-paperbanana"><![CDATA[<h2 id="paper-review-paperbanana-automating-academic-illustration-for-ai-scientists">Paper Review: PaperBanana: Automating Academic Illustration for AI Scientists</h2>

<p><a href="https://arxiv.org/abs/2601.23265">Paper</a></p>

<p><a href="https://github.com/dwzhu-pku/PaperBanana">Code</a></p>

<p><a href="https://dwzhu-pku.github.io/PaperBanana/">Project</a></p>

<p><img src="https://andlukyane.com/images/paper_reviews/paperbanana/2026-02-08_18-31-10.jpg" alt="Main image" /></p>

<p>PaperBanana is an agentic framework for automatically generating publication-ready academic illustrations. It coordinates specialized agents to retrieve references, plan content and style, render images, and iteratively refine results via self-critique using VLMs and image generation models. To evaluate the approach, the authors introduce PaperBananaBench, a benchmark of 292 methodology-diagram tasks curated from NeurIPS 2025 papers. Experiments show that PaperBanana outperforms strong baselines in faithfulness, conciseness, readability, and aesthetics, and the framework generalizes to high-quality statistical plot generation.</p>

<h3 id="task-formulation">Task Formulation</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/paperbanana/2026-02-08_17-36-19.jpg" alt="Diagrams" /></p>

<p>The authors formulate automated academic illustration generation as the task of learning a function that maps a source context and communicative intent to a visual output. Given context and intent, the goal is to generate an image that faithfully represents the content while matching the intended focus. The formulation can be extended with optional reference examples, enabling few-shot guidance; without them, the task reduces to zero-shot generation. This paper focuses primarily on methodology diagrams, where the context is the textual method description and the intent is the figure caption that defines the illustration’s scope.</p>
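
<p>Spelled out as a signature, the formulation is simple (the names below are mine, not from the paper):</p>

<pre><code class="language-python">
from dataclasses import dataclass, field

@dataclass
class IllustrationTask:
    context: str          # source text, e.g. the methodology section
    intent: str           # communicative intent, e.g. the figure caption
    references: list = field(default_factory=list)  # optional examples

def generate_illustration(task, render_fn):
    # With non-empty references the task is few-shot guided; with an empty
    # list it reduces to zero-shot generation. `render_fn` stands in for
    # any text-to-image backend.
    return render_fn(task.context, task.intent, task.references)
</code></pre>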

<h3 id="methodology">Methodology</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/paperbanana/2026-02-08_17-56-49.jpg" alt="PaperBanana framework" /></p>

<p>PaperBanana coordinates five specialized agents - Retriever, Planner, Stylist, Visualizer, and Critic - to transform scientific text into publication-quality diagrams and plots.</p>

<p>The <strong>Retriever</strong> selects relevant reference examples using VLM-based reasoning, prioritizing diagram structure and type over topical similarity. The <strong>Planner</strong> acts as the system’s cognitive core, converting the source context and communicative intent, guided by references, into a detailed textual plan of the target illustration. The <strong>Stylist</strong> enforces academic visual standards by synthesizing an aesthetic guideline from the reference corpus and refining the plan accordingly. The <strong>Visualizer</strong> generates images from the refined description using an image generation model, while the <strong>Critic</strong> evaluates each output against the original intent and context, providing feedback for iterative refinement. This Visualizer–Critic loop runs for three iterations to ensure accuracy, clarity, and visual quality.</p>
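
<p>The overall control flow could be sketched as below; the agent interfaces are hypothetical stand-ins, not the authors’ actual API:</p>

<pre><code class="language-python">
def paperbanana_pipeline(context, intent, corpus,
                         retriever, planner, stylist, visualizer, critic,
                         n_rounds=3):
    references = retriever.select(corpus, context, intent)  # structure over topic
    plan = planner.draft(context, intent, references)       # detailed textual plan
    plan = stylist.refine(plan, references)                 # academic aesthetics

    image = visualizer.render(plan)
    for _ in range(n_rounds - 1):  # the Visualizer-Critic refinement loop
        feedback = critic.review(image, context, intent)
        if feedback.approved:
            break
        image = visualizer.render(plan, feedback=feedback)
    return image
</code></pre>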

<p>The framework extends to statistical plot generation by having the Visualizer produce executable Matplotlib code instead of images, with the Critic iteratively correcting errors or inconsistencies.</p>

<h3 id="benchmark">Benchmark</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/paperbanana/2026-02-08_18-02-47.jpg" alt="Benchmark" /></p>

<h4 id="data-curation">Data Curation</h4>

<p>The authors sample 2k NeurIPS 2025 papers, parse PDFs with MinerU to extract methodology text, diagrams, and captions, and filter for quality. Papers without methodology diagrams are removed, and diagrams are further filtered by aspect ratio to ensure suitability for generation and fair evaluation, yielding 610 candidates. Diagrams are then categorized into four types: Agent &amp; Reasoning, Vision &amp; Perception, Generative &amp; Learning, and Science &amp; Applications. A final human curation step verifies text, captions, and categories and removes low-quality visuals, resulting in 584 high-quality samples, split evenly into a test set and a reference set.</p>

<h4 id="evaluation-protocol">Evaluation Protocol</h4>

<p>The authors evaluate generated diagrams and plots using a VLM-as-a-Judge with a referenced comparison setup, where model outputs are directly compared against human-drawn figures. Evaluation covers two perspectives:</p>
<ul>
  <li><strong>Content</strong>: faithfulness to the source and communicative intent, and conciseness</li>
  <li><strong>Presentation</strong>: readability and aesthetics</li>
</ul>

<p>For each dimension, the judge assigns a win, loss, or tie relative to the human reference, mapped to scores of 100, 0, and 50. Overall scores are computed via a hierarchical aggregation that prioritizes faithfulness and readability over conciseness and aesthetics, ensuring that correctness and clarity dominate visual appeal.</p>
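
<p>A sketch of such scoring; the exact aggregation weights are my assumption, as the paper only states that faithfulness and readability dominate:</p>

<pre><code class="language-python">
SCORE = {"win": 100, "tie": 50, "loss": 0}

def overall_score(verdicts, primary_weight=0.8):
    # verdicts: dict with keys "faithfulness", "readability",
    # "conciseness", "aesthetics" and values "win"/"tie"/"loss".
    primary = (SCORE[verdicts["faithfulness"]] +
               SCORE[verdicts["readability"]]) / 2
    secondary = (SCORE[verdicts["conciseness"]] +
                 SCORE[verdicts["aesthetics"]]) / 2
    return primary_weight * primary + (1 - primary_weight) * secondary

print(overall_score({"faithfulness": "win", "readability": "tie",
                     "conciseness": "loss", "aesthetics": "win"}))  # 70.0
</code></pre>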

<h3 id="experiments">Experiments</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/paperbanana/2026-02-08_18-23-17.jpg" alt="Main results" /></p>

<p>PaperBanana outperforms other models but underperforms the human reference in terms of faithfulness.</p>

<p><img src="https://andlukyane.com/images/paper_reviews/paperbanana/2026-02-08_18-26-41.jpg" alt="The ablation study" /></p>

<ul>
  <li>Removing the <strong>Retriever</strong> results in substantial declines in conciseness, readability, and aesthetics, as the system produces verbose, visually unrefined diagrams. Surprisingly, using randomly selected references performs comparably to semantic retrieval, indicating that exposure to general structural and stylistic patterns matters more than precise content matching.</li>
  <li>The <strong>Stylist</strong> agent improves conciseness and aesthetics but slightly harms faithfulness by omitting technical details, while the <strong>Critic</strong> agent compensates for this loss by restoring faithfulness through iterative refinement. Additional refinement iterations further improve all metrics, balancing visual quality and technical accuracy.</li>
</ul>

<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>

<p>The authors discuss the limitations of PaperBanana and outline future directions.</p>

<ul>
  <li>A key limitation is that the framework produces <strong>raster images</strong>, which are difficult to edit compared to vector graphics; potential solutions range from post-hoc image editing to reconstructing vector representations to agents that directly operate professional design software.</li>
  <li>Another challenge is the <strong>trade-off between standardized academic style and stylistic diversity</strong>, as a unified style guide constrains variation.</li>
  <li>While PaperBanana performs well visually, it still lags humans in <strong>fine-grained faithfulness</strong>, particularly in subtle structural details like arrow directions, reflecting limits of current VLM perception.</li>
</ul>

<p>Possible extensions include <strong>test-time scaling</strong> to generate multiple stylistic variants for user or model selection, and applying the reference-driven paradigm to <strong>broader domains</strong> such as UI/UX design, patents, and industrial schematics.</p>]]></content><author><name></name></author><category term="paperreview" /><category term="deeplearning" /><category term="agent" /><category term="vlm" /><category term="visual" /><category term="multimodal" /><summary type="html"><![CDATA[My review of the paper PaperBanana Automating Academic Illustration for AI Scientists]]></summary></entry><entry><title type="html">Paper Review: mHC: Manifold-Constrained Hyper-Connections</title><link href="https://andlukyane.com//blog/paper-review-mhc" rel="alternate" type="text/html" title="Paper Review: mHC: Manifold-Constrained Hyper-Connections" /><published>2026-01-26T00:00:00+00:00</published><updated>2026-01-26T00:00:00+00:00</updated><id>https://andlukyane.com//blog/paper-review-mhc</id><content type="html" xml:base="https://andlukyane.com//blog/paper-review-mhc"><![CDATA[<h2 id="paper-review-mhc-manifold-constrained-hyper-connections">Paper Review: mHC: Manifold-Constrained Hyper-Connections</h2>

<p><a href="https://arxiv.org/abs/2512.24880">Paper</a></p>

<p><img src="https://andlukyane.com/images/paper_reviews/mhc/2026-01-25_20-22-04.jpg" alt="Main image" /></p>

<p>Hyper-Connections (HC) architectures expand residual streams and diversify connectivity patterns beyond traditional residual blocks but break the identity mapping property, causing training instability, poor scalability, and inefficient memory access. The authors propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects HC residual connections onto a defined manifold to restore identity mapping while retaining the expressivity of HC. The design incorporates infrastructure-level optimizations to keep memory and computational costs low.</p>

<p>Empirical evaluation shows that mHC trains robustly at scale, improves performance across tasks, and scales better than prior HC variants without sacrificing efficiency.</p>

<h3 id="introduction">Introduction</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/mhc/2026-01-25_19-06-28.jpg" alt="Illustrations of Residual Connection Paradigms" /></p>

<p>Residual connections are widely used in modern deep networks because their identity mapping property lets signals pass unchanged across layers, stabilizing training at scale. Hyper-Connections extend this idea by widening the residual stream into multiple parallel channels and learning how to mix them, increasing expressivity and topological complexity without raising per-layer FLOPs. However, this unconstrained mixing breaks the identity mapping property: across many layers, the residual transformations no longer preserve the global mean of the signal, leading to amplification or attenuation and training instability. In addition, the widened residual stream introduces significant memory-access overhead. As a result, despite their performance potential, Hyper-Connections are unstable and hard to scale in large training regimes.</p>

<h3 id="the-approach">The approach</h3>

<h4 id="manifold-constrained-hyper-connections">Manifold-Constrained Hyper-Connections</h4>

<p>mHC restores stable signal propagation by constraining the residual mixing matrix onto a specific manifold rather than leaving it unconstrained as in Hyper-Connections. Instead of forcing the residual mapping to be the identity, which would block interaction between streams, it is restricted to be a doubly stochastic matrix, with non-negative entries and rows and columns summing to one. This preserves the identity mapping behavior in a generalized form while still allowing information exchange across streams.</p>

<p>Doubly stochastic matrices are non-expansive, which limits gradient explosion, and they are closed under multiplication, so stability is preserved across many layers. Geometrically, they form the Birkhoff polytope, meaning each residual mapping acts as a convex combination of permutations and gradually increases mixing across streams, functioning as robust feature fusion. Additional non-negativity constraints on the input and output projection matrices prevent signal cancellation and further stabilize training.</p>

<h4 id="parameterization-and-manifold-projection">Parameterization and Manifold Projection</h4>

<p>At each layer, the hidden state is flattened into a single vector and passed through RMSNorm to generate dynamic mixing weights, following the original Hyper-Connections setup. Linear projections produce unconstrained versions of the input mapping, output mapping, and residual mixing matrix. The input and output mappings are then passed through a sigmoid to enforce non-negativity, while the residual mixing matrix is projected onto the manifold of doubly stochastic matrices using the Sinkhorn–Knopp algorithm. This projection exponentiates the matrix to make all entries positive and then iteratively normalizes rows and columns to sum to one, converging to a doubly stochastic matrix after a fixed number of steps. In practice, about 20 normalization iterations are used to obtain the final constrained residual mapping.</p>
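
<p>The projection step is short enough to write out directly; this follows the description above, with the iteration count as the only tunable:</p>

<pre><code class="language-python">
import torch

def sinkhorn_knopp(logits, n_iters=20):
    # Exponentiation makes every entry positive; alternating row and
    # column normalization then converges toward a doubly stochastic
    # matrix (a point in the Birkhoff polytope).
    mat = torch.exp(logits)
    for _ in range(n_iters):
        mat = mat / mat.sum(dim=-1, keepdim=True)  # rows sum to 1
        mat = mat / mat.sum(dim=-2, keepdim=True)  # columns sum to 1
    return mat

# A residual mixing matrix for a 4-stream hyper-connection:
A = sinkhorn_knopp(torch.randn(4, 4))
print(A.sum(dim=0), A.sum(dim=1))  # both close to vectors of ones
</code></pre>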

<h4 id="efficient-infrastructure-design">Efficient Infrastructure Design</h4>

<p>To reduce mHC overhead, the authors combine algorithmic reordering with aggressive kernel fusion:</p>
<ul>
  <li>The normalization step is reordered to follow the matrix multiplication, preserving mathematical equivalence while lowering latency on high-dimensional inputs; mixed precision balances numerical accuracy and speed.</li>
  <li>Operations with shared memory access are fused into unified compute kernels to alleviate memory bandwidth bottlenecks: three specialized kernels compute the input mapping, output mapping, and residual mixing matrix, with biases, projections, and RMSNorm weights absorbed into consolidated parameters. Additional fusion combines multiple scans, lightweight coefficient operations, and the Sinkhorn–Knopp iteration into single kernels.</li>
  <li>The custom backward pass recomputes intermediate results on-chip, and two further kernels apply the mappings, with residual merging fused in to drastically cut memory reads and writes.</li>
  <li>Most kernels are implemented with <a href="https://arxiv.org/abs/2504.17577">TileLang</a>, enabling efficient handling of complex computations while maximizing memory bandwidth utilization with minimal engineering overhead.</li>
</ul>

<p><img src="https://andlukyane.com/images/paper_reviews/mhc/2026-01-25_19-52-15.jpg" alt="Intermediate activations" /></p>

<p>The n-stream residual design greatly increases memory usage, so mHC drops all intermediate activations after the forward pass and recomputes them during backpropagation by rerunning the mHC kernels without the expensive layer function. For each block of consecutive layers, training stores only the input to the first layer, which reduces resident memory.</p>
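
<p>The same trade of compute for memory is available in stock PyTorch; a minimal sketch (this is generic checkpointing, not the authors’ fused-kernel implementation):</p>

<pre><code class="language-python">
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Only each chunk's input is stored during the forward pass; activations
# inside a chunk are recomputed on the fly during backprop.
blocks = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(8)])
x = torch.randn(2, 64, requires_grad=True)
y = checkpoint_sequential(blocks, 4, x)  # 4 chunks of 2 layers each
y.sum().backward()
</code></pre>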

<p><img src="https://andlukyane.com/images/paper_reviews/mhc/2026-01-25_19-57-07.jpg" alt="Communication-Computation Overlapping for mHC" /></p>

<p>Pipeline parallelism with the DualPipe schedule overlaps communication and computation, but the n-stream residual design in mHC adds significant cross-stage communication latency and extra recomputation cost at stage boundaries. To reduce these bottlenecks, the authors extend the schedule to improve overlap between communication and computation when entering new pipeline stages. The kernels in MLP layers run on a dedicated high-priority compute stream to avoid blocking communication, while long persistent kernels in attention layers are avoided to prevent stalls and allow preemption.</p>

<h3 id="experiments">Experiments</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/mhc/2026-01-25_20-08-44.jpg" alt="Main results" /></p>

<p>mHC effectively mitigates the training instability observed in HC and yields comprehensive improvements on the downstream benchmarks.</p>

<p><img src="https://andlukyane.com/images/paper_reviews/mhc/2026-01-25_20-17-13.jpg" alt="Scaling experiments" /></p>

<p>The performance advantage is robustly maintained even at higher computational budgets, which proves the effectiveness of mHC in large-scale scenarios.</p>

<p><img src="https://andlukyane.com/images/paper_reviews/mhc/2026-01-25_20-18-37.jpg" alt="Propagation stability" /></p>

<p>mHC maintains much more stable signal and gradient propagation than Hyper-Connections.</p>

<p>You can read my review of DeepSeek-R1 <a href="https://andlukyane.com/blog/paper-review-deepseekr1">here</a>.</p>]]></content><author><name></name></author><category term="paperreview" /><category term="deeplearning" /><category term="architecture" /><category term="llm" /><category term="nlp" /><summary type="html"><![CDATA[My review of the paper mHC Manifold-Constrained Hyper-Connections]]></summary></entry><entry><title type="html">Top-10 ML papers I read in 2025</title><link href="https://andlukyane.com//blog/overview-of-2025-papers" rel="alternate" type="text/html" title="Top-10 ML papers I read in 2025" /><published>2025-12-24T00:00:00+00:00</published><updated>2025-12-24T00:00:00+00:00</updated><id>https://andlukyane.com//blog/overview_of_2025_papers</id><content type="html" xml:base="https://andlukyane.com//blog/overview-of-2025-papers"><![CDATA[<h2 id="top-10-ml-papers-i-read-in-2025">Top-10 ML papers I read in 2025</h2>

<p><img src="https://andlukyane.com/images/paper_reviews/overview_of_2025_papers/2025-12-24_16-36-28.jpg" alt="Main image" /></p>

<p>This year, I wrote 30+ paper reviews. That was not a planned goal - it just happened while I was reading really interesting papers. A lot of research, unsurprisingly, went into LLMs: reasoning, training recipes, evaluation, and scaling laws. But not everything interesting this year was an LLM. There were strong papers on vision, agents, time series, evaluation benchmarks, and even re-thinking how we train “old” models like BERT.</p>

<p>I want to highlight ten papers that I was personally interested in — papers that introduced a particularly exciting idea, or simply felt like a glimpse into where the field is going next. This is a subjective list. It reflects my interests, my biases, and the kinds of questions I keep asking myself as an ML engineer.</p>

<p>I don’t include top LLMs, as their papers are usually technical reports with minimal detail.</p>

<h3 id="deepseek-r1-reasoning-without-the-usual-crutches">DeepSeek-R1: Reasoning Without the Usual Crutches</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/deepseekr1/2025-01-26_19-28-21.jpg" alt="DeepSeek-R1" /></p>

<p><a href="https://andlukyane.com/blog/paper-review-deepseekr1">DeepSeek-R1</a> is one of the most interesting reasoning papers of the year because it challenged an assumption many had already accepted: that strong reasoning requires heavyweight tricks like supervised chain-of-thought or expensive human-annotated reasoning traces. It offered pure reinforcement learning instead. The model is pushed to reason better without being explicitly shown how to reason. This demonstrated that reasoning can emerge as a behavior, not just as imitation. This paper started a shift to investing more in RL and reward design for LLMs.</p>

<p>And, yes, it was published this year. Not an eternity ago - less than 12 months back.</p>

<h3 id="gspo-new-policy-optimization-for-llms">GSPO: New Policy Optimization for LLMs</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/gspo/2025-08-03_19-07-54.jpg" alt="GSPO" /></p>

<p><a href="https://andlukyane.com/blog/paper-review-gspo">GSPO</a> (Group Sequence Policy Optimization) was used in Qwen3 and contributed “remarkable improvements” to it. It used sequence-level importance ratios instead of token-level, provided more stable training, and simplified the design of RL infrastructure. It removed the need for a separate critic model by using group-based relative rewards, which further reduces the hardware requirements for training cutting-edge AI. What makes GSPO interesting is that it aligns the optimization objective with how we actually evaluate language models, reducing a class of failure modes that many pipelines quietly struggle with today.</p>

<h3 id="lumine-generalist-agents-in-open-3d-worlds">Lumine: Generalist Agents in Open 3D Worlds</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/overview_of_2025_papers/lumine.jpg" alt="Lumine" /></p>

<p>I didn’t write a review of <a href="https://www.lumine-ai.org/">Lumine</a>, but it is fascinating research. The paper presents an open recipe for training agents that can act, explore, and generalize in large, open-ended 3D worlds. The emphasis is on generalist behavior: navigation, interaction, adaptation. It blew my mind that this agent can play Genshin Impact and other games with reasonable efficiency.</p>

<h3 id="sam-3-segment-anything-grows-up">SAM-3: Segment Anything Grows Up</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/sam3/2025-11-23_17-36-13.jpg" alt="SAM 3" /></p>

<p><a href="https://andlukyane.com/blog/paper-review-sam3">SAM-3</a> continues the evolution of the Segment Anything line and reflects a deeper understanding of what segmentation models are actually used for in real systems. The main impact is the transition from class-agnostic segmentation to semantically-aware segmentation. And users can prompt it not just with simple points, boxes, or nouns, but also with complex semantic concepts.</p>

<h3 id="chronos-2-time-series-without-task-specific-training">Chronos-2: Time Series Without task-specific training</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/chronos2/2025-11-02_17-28-28.jpg" alt="Chronos2" /></p>

<p><a href="https://andlukyane.com/blog/paper-review-chronos2">Chronos-2</a> treats time series forecasting as a sequence modeling problem, it builds on foundation-model ideas and shows that with the right training setup, you can get strong performance across diverse time series tasks without hand-crafted features or manual seasonality modeling. Chronos-2 suggests that we can have universal zero-shot time series forecasting.</p>

<h3 id="neobert-modernizing-bert">NeoBERT: Modernizing BERT</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/neobert/2025-03-02_17-56-54.jpg" alt="NeoBERT" /></p>

<p><a href="https://andlukyane.com/blog/paper-review-neobert">NeoBERT</a> raised an interesting question: what if we trained BERT properly, using everything we’ve learned up to now?  The paper revisits BERT with modern training techniques — better tokenization, better objectives, better optimization — and shows that you can still squeeze meaningful gains from it. What’s more interesting, it gets high scores on the MTEB benchmark, showing that its embeddings can rival modern models.</p>

<h3 id="alphaevolve-search-meets-learning">AlphaEvolve: Search Meets Learning</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/alphaevolve/2025-05-15_18-40-24.jpg" alt="AlphaEvolve" /></p>

<p><a href="https://andlukyane.com/blog/paper-review-alphaevolve">AlphaEvolve</a> explores the space between evolutionary algorithms and modern learning systems. It shows how structured evolution can guide discovery in complex optimization spaces: it proposes solutions, evaluates them objectively and iteratively refines them.</p>

<h3 id="swe-rebench-scalable-and-continuously-updated-benchmark">SWE-rebench: Scalable and continuously updated benchmark</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/swerebench/2025-06-01_20-35-41.jpg" alt="SWE-rebench" /></p>

<p>Instead of synthetic tasks or cherry-picked prompts, <a href="https://andlukyane.com/blog/paper-review-swerebench">SWE-rebench</a> evaluates models on real software engineering problems: fixing bugs in real repositories, under realistic constraints. The result is a benchmark that exposes just how brittle some “coding-capable” models still are.</p>

<h3 id="dinov3-self-supervised-vision-keeps-scaling">DINOv3: Self-Supervised Vision Keeps Scaling</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/dinov3/2025-08-24_08-49-35.jpg" alt="DINOv3" /></p>

<p><a href="https://andlukyane.com/blog/paper-review-dinov3">The paper</a> refines the DINO approach with better training recipes and larger models, producing representations that transfer well across tasks without supervision. What stands out is how general these representations are: they feel less like features and more like visual understanding.</p>

<h3 id="dragon-hatchling-small-models-real-capabilities">Dragon Hatchling: Small Models, Real Capabilities</h3>

<p><img src="https://andlukyane.com/images/paper_reviews/dragon_hatchling/2025-10-26_18-17-36.jpg" alt="Dragon Hatchling" /></p>

<p><a href="https://andlukyane.com/blog/paper-review-dragon-hatchling">Dragon Hatchling</a> focuses on small models that punch above their weight. Instead of assuming scale solves everything, the paper explores how careful training, curriculum design, and architectural choices can produce compact models with surprisingly strong reasoning and generalization abilities. Dragon Hatchling is a biologically inspired LLM built as a scale-free graph of locally interacting neuron-like units.</p>

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>This was an interesting year for ML research. We saw noticeable progress in LLMs, vision models, agents, and time series. There were many incremental improvements, but also some breakthroughs. LLMs dominate the research, but there are other interesting papers that combine different things: learning and search, vision and language, agents and environments.</p>

<p>This year was again about showing what foundation models can do, maybe the next one will be about figuring out how to make them reliable, grounded, and genuinely useful.</p>]]></content><author><name></name></author><category term="paperreview" /><category term="deeplearning" /><category term="blogpost" /><summary type="html"><![CDATA[Top-10 ML and AI papers I read in 2025]]></summary></entry></feed>