Tag: evaluation
2 posts
Testing MiniMax M2.7 via API on three real ML and coding workflows
An evaluation of MiniMax M2.7 used through Claude Code on three workflows I run regularly — writing code for a Kaggle...
Paper Review: SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
My review of the paper SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Softwar...