# Paper Review Template — Andrey Lukyanenko Style
## 1. Title
Use the title to expose the **main conceptual contrast**, not just the paper name.
Good patterns:
```text
<Topic>: Why <Old Approach> Fails and What <New Method> Changes
```
```text
Beyond <Old Limitation>: How <Method> Enables <New Capability>
```
```text
<Model> Review: <Capability A> and <Capability B>
```
Examples from actual posts:
```text
Collaborative Reinforcement Learning:
Why HACRL Trains Models in Teams Instead of Isolation
```
```text
Beyond Positional Bias:
How DroPE Unlocks Zero-Shot Long Context in LLMs
```
```text
Kimi k2.5 Review:
Native Multimodality and Agent Swarms at 1 Trillion Parameters
```
Rules:
* emphasize the paper’s **central shift**
* include the old bottleneck when possible
* avoid generic “Paper Review: X”
* the title can be descriptive, but should still contain tension
* for model reports, avoid making the title only about benchmark rank; focus on the technical or systems shift
---
## 2. Opening / Framing
The opening is usually **one dense introductory block**, not a separate dramatic hook.
Structure:
```text
<Field / task area> currently relies on <standard approach>.
This approach works, but has a structural limitation: <bottleneck>.
<Paper / model> changes the setup by <core technical shift>.
The result is <implication>.
```
Real pattern:
* HACRL starts from isolated RL training and the problem that discoveries remain locked inside each model’s training run.
* DroPE starts from long-context generalization failure and the cost of long-context fine-tuning or architectural changes.
* Kimi K2.5 starts by positioning the model around two converging frontiers: native multimodality and agentic intelligence.
* DeepSeek-V4 starts from the fact that million-token context is not enough if attention and KV-cache costs make long-horizon inference impractical.
Rules:
* start from the field-level bottleneck
* introduce the paper early, but not as “This paper proposes…” in the first sentence
* include the core claim in the first 2–3 paragraphs
* mention evidence briefly only if it strengthens the positioning
* avoid suspense; make the main idea clear immediately
* for model reports, state which capability is actually central: architecture, post-training, inference systems, multimodality, agents, long context, or something else
---
## 3. Main Structure
Default structure should be:
```text
Title
Opening / framing
Section 1: Main technical idea
Section 2: Second technical idea or failure mode
Section 3: Method / system design / general approach
Section 4: Evidence / evaluation
Conclusions
```
But the exact structure should follow the paper.
Use **paper-specific section names**, not generic names like “Core Idea” or “Intuition,” unless the article is very short.
Good section types:
```text
### <Main technical idea>
### <Second technical idea or failure mode>
### <Method name>: <what it changes>
### <System design / general approach>
### Experiments / Evaluations
### Conclusions
```
Examples:
```text
### Heterogeneous Agent Collaborative Reinforcement Learning
### Experiments
### Conclusions
```
```text
### Explicit positional embeddings are beneficial for training
### RoPE prevents effective zero-shot context extension
### DroPE: Dropping positional embeddings after pretraining
### Conclusions
```
```text
### Joint Optimization of Text and Vision
### Agent Swarm
### The general approach
### Evaluations
### Conclusions
```
Rules:
* use sections to reflect the paper’s actual conceptual structure
* each section should explain one major idea
* do not force every review into the same 10-section skeleton
* do not separate “intuition,” “mechanics,” and “evidence” unless the paper demands it
* merge explanation and interpretation inside the same section
* use section names as arguments, not labels
---
## 4. Recommended Structures by Paper Type
### Narrow method papers
Use when the paper has one main algorithmic or modeling idea.
```text
Title
Opening / framing
### <Problem / failure mode>
### <Method name>: <core mechanism>
### Experiments
### Conclusions
```
### Large model / system papers
Use when the paper is about a model family, system, or frontier release.
```text
Title
Opening / framing
### <First major capability or mechanism>
### <Second major capability or mechanism>
### The general approach
### Evaluations
### Conclusions
```
### Frontier model / model report papers
Use when the paper combines architecture, post-training, inference systems, agent behavior, and many benchmarks.
```text
Title
Opening / framing
###
###
###
### Evaluations
### What this changes compared to previous models
### Limitations / open questions
### Conclusions
```
Rules for model reports:
* model reports are usually multi-axis: architecture, data, pretraining, post-training, inference systems, tool use, and benchmarks
* do not treat them like narrow method papers
* find the main axis that actually changes the story
* do not let the review become a catalog of features
* systems details can be central if they change what the model can practically do
Example framing:
```text
DeepSeek-V4 is not mainly interesting because it has a 1M-token context window.
It is interesting because it tries to make million-token context operationally usable through compressed attention, lower KV-cache cost, and agent-specific context management.
```
### Conceptual RL / optimization papers
Use when the paper changes how training signal, credit assignment, rewards, or optimization work.
```text
Title
Opening / framing
### <What breaks in the standard training signal>
### <New reward / credit-assignment mechanism>
### Experiments
### Conclusions
```
---
## 5. Section Writing Pattern
Each substantive section usually follows this shape:
```text
<Local problem this section addresses>.
<The technical move the paper makes>.
<How the mechanism works, in enough depth for an ML reader>.
<The implication of this change>.
```
Example pattern:
```text
Sequential agent execution is a bottleneck for complex, long-horizon tasks.
Kimi K2.5 introduces Agent Swarm with Parallel Agent Reinforcement Learning.
Instead of hard-coded parallelism, the orchestrator learns when and how to decompose tasks into concurrent subagents.
This changes agent execution from a linear process into a parallel reasoning graph.
```
Rules:
* begin with the local problem
* explain the technical move
* explain the mechanism enough for an ML reader
* end with the implication
* avoid “first / second / third” unless describing actual training stages
* avoid dry method summaries; every mechanism should have a purpose
Better mechanism pattern:
```text
<Mechanism> exists because <problem it solves>.
Mechanically, it changes <what it changes>.
This matters because <implication>.
```
Example:
```text
Hybrid attention exists because dense attention over 1M tokens is too expensive.
Mechanically, it combines compressed sparse access with heavily compressed global access.
This matters because the model can preserve long-range context without paying dense-attention cost at every generation step.
```
---
## 6. Technical Detail Level
The reviews should include important technical detail, but compress it into conceptual prose.
The rule is not “minimal mechanics.” The rule is:
> include mechanisms when they explain the contribution or change the interpretation.
Use this pattern:
```text
<Detail> matters because <purpose>.
```
Avoid:
```text
The method consists of the following pipeline...
```
Prefer:
```text
The important detail is not that the model uses multiple agents, but that the orchestrator learns when parallelism is useful instead of assuming it by design.
```
Rules:
* include mechanisms when they explain the paper’s contribution
* skip derivations
* avoid formula-heavy explanation
* explain technical choices through their purpose
* include important details even if they appear “implementation-like,” if they change the interpretation
* do not include a mechanism only because it sounds novel
* for systems-heavy papers, treat KV cache size, inference FLOPs, precision, routing overhead, context-management rules, and tool-call handling as possible core contributions
Keep details like:
* off-policy correction
* clipping
* recalibration
* frequency compression
* synthetic stress-test prompts
* critical path / critical steps
* modality-organized vs ability-organized experts
* early vs late multimodal fusion
* KV-cache reduction
* inference FLOPs
* sparse or compressed attention mechanisms
* routing stability tricks
* tool-call context preservation
* post-training consolidation mechanisms
Remove details like:
* full benchmark tables
* all dataset names
* every ablation
* exact hyperparameters unless central
* implementation trivia that does not affect the interpretation
---
## 7. Evidence / Experiments
Evidence should usually be a short section near the end, or a short paragraph inside each technical section.
Structure:
```text
The experiments show that <method> improves <metric area> compared to <baselines>.
The important point is not the exact number, but that the gains appear in <the setting where the mechanism should matter>, which supports <the central claim>.
```
Your real style:
* HACRL: results support collaborative training and reduced rollout cost.
* DroPE: results show long-context retrieval gains and scalability to larger models.
* Kimi K2.5: evaluations are broader because the paper/model is broader; benchmark coverage is used to position the model.
* DeepSeek-V4: evaluations should support the claim that long-context efficiency, coding, and agentic tool use are the strongest parts of the release, while general knowledge against the best closed models is not uniformly dominant.
Rules:
* do not walk through tables
* include numbers only when they are memorable, structurally important, or necessary for comparison
* use benchmarks to support the conceptual claim
* for large model reports, evaluation can be a dedicated section
* for method papers, evidence can be shorter
* use fewer numbers, but make them exact
* avoid accidental benchmark maximalism
Bad:
```text
The model reaches A, B, C, D, E, F, G...
```
Better:
```text
The strongest evidence is in <area>. The weaker area is <area>. This supports the paper’s main claim because <reason>.
```
For model reports:
```text
Use 5–8 headline numbers maximum.
Then interpret them.
```
---
## 8. Benchmark Hygiene
Benchmark sections are the easiest place to introduce factual errors.
For every numerical claim, check:
```text
1. Which model variant?
2. Which evaluation mode?
3. Which benchmark?
4. Which table / figure / section?
```
Do not mix:
* base-model and instruct/post-trained results
* Pro / Flash / Lite / Max variants
* Non-Think / Think High / Think Max modes
* open-model comparisons and closed-model comparisons
* table-level benchmark scores and separate diagnostic figure metrics
* internal benchmark results and external benchmark results
* model-only results and scaffold/tool-agent results
Rules:
* benchmarks should support the argument, not become the argument
* separate base-model, post-trained-model, reasoning-mode, and agent-mode results
* avoid mixing numbers from different tables
* report fewer numbers, but make them exact
* include a comparison only if the comparison is explicitly supported
* distinguish “open-source SOTA” from “frontier closed-model SOTA”
* mention evaluation caveats when the benchmark is internal, scaffold-dependent, or vendor-controlled
Recommended pattern:
```text
The important evaluation result is not <the headline number>.
It is that <method> improves <metric> in <the setting>, where the proposed mechanism should matter.
```
Example:
```text
The strongest story is long-context efficiency, coding, and agentic tool use. The weaker story is general knowledge against the strongest closed models, where DeepSeek is competitive but not uniformly ahead.
```
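This hygiene can also be made mechanical before publishing. A minimal Python sketch that treats every numerical claim as a structured record and flags any claim missing its variant, mode, or source; all field names here are illustrative assumptions, not taken from any paper:

```python
from dataclasses import dataclass


@dataclass
class BenchmarkClaim:
    """One numerical claim destined for the review text."""
    value: float
    benchmark: str = ""      # e.g. "LiveCodeBench"
    model_variant: str = ""  # e.g. "V4-Pro-Max", never just "the model"
    eval_mode: str = ""      # e.g. "Think High", "agent-scaffold"
    source: str = ""         # e.g. "Table 3", "Figure 7"

    def missing_fields(self) -> list[str]:
        """Return the hygiene checks this claim still fails."""
        return [name for name in ("benchmark", "model_variant", "eval_mode", "source")
                if not getattr(self, name)]


def publishable(claims: list[BenchmarkClaim]) -> bool:
    """A claim set passes only if every claim is fully attributed."""
    return all(not c.missing_fields() for c in claims)
```

Refusing to render any claim until `missing_fields()` is empty is a cheap way to prevent the variant-mixing errors listed above.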
---
## 9. Variant Discipline
When a paper contains multiple variants, never write “the model” if the claim applies only to one variant.
Use exact names when needed:
* Base vs post-trained
* Pro vs Flash
* Lite vs full
* High vs Max reasoning
* tool-use vs non-tool-use
* preview vs final release
* dense vs MoE
* text-only vs multimodal
Bad:
```text
DeepSeek-V4 reaches 93.5 on LiveCodeBench.
```
Better:
```text
DeepSeek-V4-Pro-Max reaches 93.5 on LiveCodeBench.
```
Bad:
```text
The model has 49B active parameters.
```
Better:
```text
V4-Pro has 1.6T total and 49B active parameters; V4-Flash has 284B total and 13B active.
```
Use the shorter phrase only after the variant has been clearly established.
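As a sketch, the never-write-“the model” rule can even be linted mechanically: flag a bare “the model” in any draft that mentions more than one variant. The variant names and the regex below are illustrative assumptions:

```python
import re


def variant_lint(draft: str, variants: list[str]) -> list[str]:
    """Warn when a draft mentioning multiple variants says just 'the model'."""
    mentioned = [v for v in variants if v in draft]
    warnings = []
    if len(mentioned) > 1 and re.search(r"\bthe model\b", draft, re.IGNORECASE):
        warnings.append("Ambiguous 'the model': name the variant explicitly "
                        f"(candidates: {', '.join(mentioned)})")
    return warnings
```

A draft that has already established a single variant passes; one that mixes V4-Pro and V4-Flash and then says “the model” gets flagged.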
---
## 10. Positioning
Positioning is important, but it does not always need a separate section.
Often it appears in:
* the opening
* section transitions
* conclusion
Good pattern:
```text
Compared to <prior paradigm>, this paper changes <assumption or mechanism>.
```
Examples:
* HACRL is positioned against independent RL post-training, distillation, and multi-agent RL.
* DroPE is positioned against RoPE scaling, long-context fine-tuning, and alternative architectures.
* Kimi K2.5 is positioned against sequential agent execution and text-first multimodal training.
* DeepSeek-V4 is positioned against the idea that long-context capability is mainly about increasing the context window.
Rules:
* compare against paradigms, not just named baselines
* explain what previous approaches structurally miss
* use 2–4 comparisons maximum
* separate positioning section only if the comparison is central
* do not overstate competitive claims; distinguish paper-reported results from broader market position
---
## 11. Paper-Stated Facts vs Interpretation
Your reviews are interpretive, but technical reviews must separate what the paper says from what the review infers.
Use three levels of statement:
```text
1. Paper-stated fact:
“The paper reports X.”
2. Mechanistic interpretation:
“This suggests Y because Z.”
3. Broader positioning:
“This points toward a shift from A to B.”
```
Example:
```text
The paper states that attention sinks allow total attention mass over context tokens to be less than 1.
A reasonable interpretation is that this gives heads a way to ignore weak or uninformative context.
The broader point is that long-context models need mechanisms for selective forgetting, not only retrieval.
```
Rules:
* do not present interpretation as if it is explicitly stated by the paper
* when the paper only supports a mechanism, do not claim it caused a benchmark gain unless the paper shows this
* use “suggests,” “can be read as,” or “the important implication is” for interpretive claims
* keep the argumentative style, but make the evidence boundary clear
---
## 12. Figures and Screenshots
Important technical details can appear in figures, captions, or equations and may be missed by text search.
If using screenshots or figures:
* verify the caption and surrounding text
* do not infer extra claims from the figure alone
* describe what the figure directly supports
* mark broader interpretation separately
* avoid claiming causality unless the paper explicitly supports it
Example rule:
```text
A figure can support “the architecture contains X.”
It may not support “X was responsible for the benchmark gain” unless the paper explicitly says so.
```
Good pattern:
```text
The figure shows that CSA and HCA use attention sinks by adding learnable sink logits to the attention denominator.
A reasonable interpretation is that this lets a head avoid forcing attention mass onto weakly relevant context blocks.
```
---
## 13. Limitations / Open Questions
Actual reviews often do **not** include a dedicated limitations section.
So this should be optional, not mandatory.
Use when:
* the paper’s claims are strong but evidence is narrow
* scalability is unclear
* cost is important
* the benchmark setup may not reflect real-world use
* the method depends on assumptions that may break
* the evaluation is internal, scaffold-dependent, or not independently verified
* the release is a preview or model report rather than a stable final system
Structure:
```text
There are still open questions.
<Open question about scalability or cost>.
<Open question about evaluation validity>.
<Open question about assumptions that may break>.
```
Rules:
* keep it short
* do not add generic limitations
* only include limitations that sharpen the interpretation
* if the conclusion already gives the right framing, skip this section
* for model reports, mention benchmark and deployment caveats when they affect interpretation
---
## 14. Conclusion
The conclusion should be short and interpretive.
Your conclusions usually do three things:
```text
<Restate the conceptual shift in one sentence>.
<Contrast the new framing with the old one>.
<State why the direction matters beyond this paper>.
```
Examples:
* HACRL concludes with “collaboration during training, independence during inference.”
* DroPE concludes with decoupling the benefits of positional embeddings during training from their drawbacks at inference.
* Kimi K2.5 concludes by highlighting native visual capabilities and Agent Swarm as a way to break linear agent execution.
* DeepSeek-V4 concludes by framing long context as runtime infrastructure for reasoning and tool use, not just a static context-window property.
Good pattern:
```text
<Paper / model> is interesting because it changes <assumption>.
Instead of <old framing>, it suggests <new framing>.
This is useful because <broader implication>.
```
Rules:
* do not summarize the whole post
* end with the conceptual shift
* personal judgment is allowed, but keep it technical
* one strong idea is enough
* avoid ending with leaderboard claims unless the paper’s main contribution really is benchmark dominance
---
# Core Writing Rules
## 1. Start from the bottleneck
The review should begin with what is structurally hard in the field.
Not:
```text
This paper proposes X.
```
Better:
```text
Modern <systems> usually rely on <standard approach>. This works, but creates <structural bottleneck>.
```
## 2. Explain the conceptual shift early
The reader should understand the paper’s main contribution within the first few paragraphs.
## 3. Use section names as arguments
Section titles should tell the reader what matters.
Not:
```text
Method
```
Better:
```text
RoPE prevents effective zero-shot context extension
```
## 4. Merge mechanics with interpretation
Do not write a dry method summary. Explain why each mechanism exists.
Not:
```text
The method uses clipping and importance sampling.
```
Better:
```text
Because shared trajectories come from different policies, naive reuse can destabilize learning. HACPO uses importance weighting and clipping to preserve useful cross-agent experience while limiting incompatible updates.
```
## 5. Include technical details that change the story
Do not over-compress away the actual contribution.
Keep important implementation-like details if they change the interpretation:
* off-policy correction
* clipping
* recalibration
* compression ratios
* frequency compression
* critical path / critical steps
* routing stability tricks
* KV-cache and FLOP reductions
* precision or inference-system changes
* tool-use context handling
Remove details that do not affect the story:
* every dataset name
* full benchmark tables
* all ablations
* exact hyperparameters unless central
* engineering details that do not change capability, cost, or stability
## 6. Results support the argument
Use results to answer:
```text
Does the evidence support the claimed conceptual shift?
```
Not:
```text
What are all the numbers?
```
## 7. Prefer dense paragraphs over many bullets
Bullets are fine for mechanisms or stages, but the strongest reviews mostly use compact prose.
Use bullets only when:
* listing training stages
* listing mechanisms
* listing limitations
* comparing approaches
* preventing a dense technical paragraph from becoming unreadable
## 8. Avoid artificial symmetry
Not every review needs:
* hook
* core idea
* intuition
* mechanics
* evidence
* positioning
* limitations
* takeaway
That structure is useful for drafting, but too rigid for the final article.
## 9. Treat systems claims as part of the contribution
For frontier model reports, do not separate “systems details” from “model contribution” too aggressively.
KV cache size, inference FLOPs, precision, routing overhead, context-management rules, and tool-call handling can matter as much as architecture or RL recipe when they change what the model can practically do.
Example:
```text
The KV-cache reduction is not an implementation footnote. It is what makes the 1M context claim operationally meaningful.
```
---
# What to Avoid
* forcing every review into 10 sections
* generic “What actually matters” section titles
* treating evidence as the center of the review
* explaining standard background for too long
* removing important mechanics in the name of simplicity
* writing a tutorial instead of an interpretation
* using “the paper proposes” repeatedly
* separating intuition and mechanics when they belong together
* adding generic limitations just because the template asks for them
* mixing benchmark numbers across variants, modes, or tables
* presenting model-vendor claims as independently verified facts
* overclaiming causal links from figures or architecture diagrams
* writing a leaderboard post instead of a technical interpretation
---
# Efficient Workflow
1. Extract the paper’s **central bottleneck**.
```text
What standard assumption does this paper attack?
```
2. Extract the **main conceptual shift**.
```text
What changes in how we should think about the problem?
```
3. Identify the paper type.
```text
Is this a narrow method paper, model report, systems paper, RL paper, architecture paper, or benchmark paper?
```
4. Select 2–4 major sections.
```text
Which ideas deserve their own headings?
```
5. For each section, write:
```text
problem → mechanism → implication
```
6. Add evidence only where it supports the interpretation.
7. Run a fact-verification pass.
Check:
* Are benchmark numbers copied from the correct table?
* Are base-model and post-trained-model numbers separated?
* Are model variants separated?
* Are inference modes separated?
* Are claims from figures, tables, and text not accidentally merged?
* Are interpretation sentences clearly separated from paper-stated facts?
* Are links official?
8. End with the broader direction.
---
# Fact Verification and Benchmark Hygiene
Before publishing, run a separate verification pass.
For every numerical claim, check:
* exact model variant
* evaluation mode
* benchmark name
* table / figure / source
* whether the number is base-model, post-trained, agentic, or reasoning-mode-specific
Do not mix:
* base and instruct/post-trained results
* Pro / Flash / Lite variants
* Non-Think / Think High / Think Max modes
* open-model comparisons and closed-model comparisons
* table-level benchmark scores and separate figure-level diagnostic metrics
* internal benchmark results and external benchmark results
* model-only and scaffold-dependent agent results
Benchmarks should support the argument, not become the argument. Use fewer numbers, but make them exact. Prefer explaining where the model is strong, where it is weaker, and how this supports or weakens the paper’s claimed conceptual shift.
Separate paper-stated facts from interpretation:
* Paper-stated fact: “The paper reports X.”
* Mechanistic interpretation: “This suggests Y because Z.”
* Broader positioning: “This points toward a shift from A to B.”
For model reports, systems details can be core contributions. KV-cache size, inference FLOPs, precision, routing overhead, context-management rules, and tool-call handling may matter as much as architecture or RL recipe when they change what the model can practically do.
---
# Review Quality Checklist
## Conceptual clarity
* Can the main bottleneck be stated in one sentence?
* Can the main shift be stated in one sentence?
* Does every section support that shift?
* Is the title aligned with the shift?
* Does the opening make the contribution clear early?
## Technical quality
* Are mechanisms explained through purpose, not just listed?
* Are important implementation details included only when they change interpretation?
* Are formulas skipped unless necessary?
* Are systems details included when they matter?
* Are interpretation and paper-stated facts separated?
## Evidence
* Are benchmark numbers exact?
* Are variants separated?
* Are base/post-trained/inference modes separated?
* Are internal benchmark caveats mentioned?
* Are open-model and closed-model comparisons separated?
* Are the results used to support the argument rather than replace it?
## Style
* Are section titles argumentative?
* Are paragraphs dense but readable?
* Are bullets used only where they help?
* Does the conclusion end with a conceptual shift rather than a summary?
* Does the post read like a compressed technical interpretation, not a tutorial or benchmark dump?
---
# Mental Model
The reviews are not summaries and not tutorials.
They are:
> **compressed technical interpretations of why a paper matters**
The ideal reader should finish the review with three things:
1. what problem the paper attacks
2. what technical idea changes the framing
3. why this may matter beyond the reported benchmark numbers
The best review does not say everything in the paper. It finds the conceptual spine, explains the mechanisms that support it, verifies the facts carefully, and ends with why the direction matters.
===
This is a writing guide/template for my paper reviews.
Read it, ultrathink, analyze, and understand it.
I want to write a paper review of DeepSeek-V4.
Links:
DeepSeek_V4.pdf
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
https://deepseek.ai/deepseek-v4
Launch an agent to write a detailed technical summary/review.
Launch the second agent to write a paper review in my usual style.
Launch the third agent to think about the role of this paper and how it compares to other similar papers.
Launch the final agent that scores the outputs against my style, consistency, and quality. Make sure the review goes into technical detail.
Synthesize the final version of the paper review and write it.