Autoresearch Is Real, but There Are Caveats.
Connecting Karpathy's autoresearch results with recent research on automated AI experimentation.
Andrej Karpathy recently released autoresearch, and the general AI community went bonkers. The idea is simple: give a coding agent your training script and a clear evaluation metric. Let it run overnight. It modifies the code, trains, evaluates, keeps improvements, discards regressions. You wake up to a better model.
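The loop itself is just greedy hill climbing over code. Here is a minimal toy sketch of that keep-if-better structure. Everything here is a stand-in: the "code" is a list of hyperparameters, `train_and_eval` is a cheap synthetic objective rather than hours of GPU time, and `propose_patch` is a random tweak rather than an LLM. The names are hypothetical, not from Karpathy's implementation.

```python
import random

def autoresearch_loop(train_and_eval, propose_patch, baseline, n_attempts, seed=0):
    """Greedy keep-if-better search: try a patch, keep it only if the
    validation metric improves, otherwise discard it and try again."""
    rng = random.Random(seed)
    best_code = baseline
    best_loss = train_and_eval(best_code)
    kept = 0
    for _ in range(n_attempts):
        candidate = propose_patch(best_code, rng)
        loss = train_and_eval(candidate)
        if loss < best_loss:  # improvement: keep the patch
            best_code, best_loss = candidate, loss
            kept += 1
    return best_code, best_loss, kept

# Toy stand-ins: loss is squared distance from an unknown optimum.
OPTIMUM = [0.9, 0.999, 0.1]  # e.g. AdamW betas and weight decay

def train_and_eval(code):
    return sum((a - b) ** 2 for a, b in zip(code, OPTIMUM))

def propose_patch(code, rng):
    patched = list(code)
    i = rng.randrange(len(patched))
    patched[i] += rng.gauss(0, 0.05)  # small random tweak to one knob
    return patched

code, loss, kept = autoresearch_loop(train_and_eval, propose_patch,
                                     [0.5, 0.9, 0.5], n_attempts=700)
print(f"kept {kept} of 700 patches, final loss {loss:.4f}")
```

Even in this toy, most patches are rejected and the keepers cluster early in the run, which is the shape of the results described below.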
After two days of autonomous experimentation on nanochat, the system tried roughly 700 code modifications and found about 20 that improved validation loss. All 20 were additive. They transferred to larger models. Stacked together, they dropped the "Time to GPT-2" benchmark from 2.02 hours to 1.80 hours. That's an 11% improvement on what Karpathy thought was already a well-tuned project.
His own conclusion, in a subsequent tweet, was:
"All LLM frontier labs will do this. It's the final boss battle."
This post isn't here to argue that autoresearch doesn't work. It clearly does. But there's a body of recent research, mostly arXiv preprints not yet peer-reviewed, that tested closely related ideas at larger scale and found the limits. Those limits matter, and the current hype wave isn't talking about them.
Someone already tested this, rigorously.
While autoresearch was going viral, a team at Stanford had already built and evaluated a more sophisticated version of the same loop.
In January 2026, Chenglei Si and collaborators published "Towards Execution-Grounded Automated AI Research". Their system takes natural language research ideas, converts them into code implementations, runs large-scale parallel GPU experiments, and feeds performance back to the ideator. They tested it on the same domain Karpathy targets: LLM pre-training and post-training. They even used nanoGPT as a baseline.
The results were strong. Their system found a post-training recipe that scored 69.4% versus 48.0% for the GRPO baseline. It found a pre-training recipe that beat the nanoGPT baseline by nearly 2x (19.7 minutes vs. 35.9 minutes). And it did this within just 10 search epochs.
So far, this looks like a validation of Karpathy's thesis. Automated search works. The improvements are real.
But Si's group didn't stop there. They also measured what happens after those first 10 epochs.
The ceiling (probably).
Here's what the hype isn't talking about.
Frontier LLMs saturate early. Si's paper reports this directly. The system finds big improvements in the first few epochs, and then the curve flattens. Scaling trends appear only "occasionally." The search doesn't keep getting better the longer you run it.
This matches Karpathy's numbers, even if he hasn't framed it this way yet. Out of 700 attempted modifications, 20 were keepers. That's a 2.9% hit rate. The system tried a lot of things. Most of them didn't help.
Would a second round of 700 attempts yield another 20 improvements? Maybe. Si's system and Karpathy's differ in important ways, so we can't extrapolate directly. But the saturation trend Si observed is at least a cautionary signal. The easy wins get found first, and each subsequent round of search has less room to improve.
Reinforcement learning doesn't help the way you'd expect. A natural follow-up question: can you train the ideator model itself to get better at proposing improvements? Si tested this, using execution outcomes as the reward signal.
The result was counterintuitive. RL improved the average quality of proposed ideas. But it suffered from mode collapse: the model converged on simpler, safer proposals and failed to improve the upper bound of performance. More training made the system more consistently mediocre rather than occasionally brilliant.
This comes from one system in one paper, so we should be cautious about generalizing. But it suggests that naively trying to train your way past the saturation problem doesn't straightforwardly work.
Good-sounding ideas don't reliably produce good results. In earlier work, Si's group showed that research ideas rated highly by both LLMs and human reviewers don't correlate well with actual execution outcomes. This is the "ideation-execution gap." It matters here because autoresearch depends on the agent generating meaningful code variations, not just plausible-sounding ones. If most of the 700 proposals are cosmetic rearrangements that don't change behavior, the search is inefficient no matter how fast you can evaluate them.
What kind of improvements does it actually find?
Let's take a look at what Karpathy's system discovered:
- His parameterless QKnorm was missing a scalar multiplier, making attention too diffuse.
- Value Embeddings needed regularization, and he wasn't applying any.
- His banded attention window was too conservative because he forgot to tune it.
- AdamW betas were misconfigured.
- Weight decay schedule and network initialization were suboptimal.
Every single one of these is an oversight or under-tuning of the existing design. Not one is a novel architectural contribution. Karpathy acknowledges this himself: "It's not novel, ground-breaking research (yet)."
Si's execution-grounded paper is consistent with this pattern. Evolutionary search excels at exploitation within a known design space. Whether it can eventually produce genuinely novel ideas remains an open question, but the current evidence skews toward "not yet."
This isn't a knock on autoresearch. Finding 20 real improvements that a veteran researcher missed over weeks of manual tuning is genuinely impressive. But it's a different thing than what the hype implies. It's systematic optimization, not invention.
The cascade is where it gets interesting.
Karpathy's most interesting claim isn't about running the loop longer. It's about the cascade:
"You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales."
The idea works roughly like this:
- Run 1,000 experiments on a tiny model. Minutes each, pennies per experiment.
- Promote the top 20 to a medium model. Hours each, dollars per experiment.
- Promote the top 5 to a large model. Days each, thousands per experiment.
- Deploy the 1 or 2 that survive all scales.
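The steps above amount to a multi-stage filter: cheap, noisy evaluation culls the pool before expensive, accurate evaluation runs. A toy sketch, with entirely hypothetical costs and noise levels chosen to mirror the numbers in the list:

```python
import random

rng = random.Random(0)

# Hypothetical: each "idea" has a true quality (lower = better).
# Small-scale evaluation is cheap but noisy; large-scale evaluation
# is accurate but expensive.
ideas = [rng.random() for _ in range(1000)]

def noisy_eval(noise):
    def evaluate(quality):
        return quality + rng.gauss(0, noise)
    return evaluate

# (evaluator, how many survivors to promote, cost per evaluation)
stages = [
    (noisy_eval(0.30), 20, 0.01),    # tiny model: pennies, very noisy
    (noisy_eval(0.10), 5, 10.0),     # medium model: dollars
    (noisy_eval(0.02), 2, 5000.0),   # large model: thousands
]

def cascade(candidates, stages):
    survivors, total_cost = candidates, 0.0
    for evaluate, keep, cost_each in stages:
        total_cost += cost_each * len(survivors)
        survivors = sorted(survivors, key=evaluate)[:keep]
    return survivors, total_cost

finalists, cost = cascade(ideas, stages)
naive_cost = 5000.0 * len(ideas)  # evaluating all 1000 at large scale
print(f"cascade cost {cost:,.0f} vs naive {naive_cost:,.0f}")
```

The economics are the whole point: in this sketch the cascade spends a small fraction of what evaluating every idea at full scale would, at the price of occasionally filtering out an idea that only shines at scale.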
This partially sidesteps the saturation problem because you're not trying to squeeze more out of a single exhausted search space. You're using cheap small-scale search as a filter for ideas worth testing expensively. The diminishing returns still hit at each individual scale, but the cascade gives you a new axis to exploit.
Using small models as proxies for large-model behavior isn't new. Researchers have used this trick for decades. But automating the entire pipeline with agents, from idea generation through small-scale validation to large-scale promotion, is genuine progress in engineering efficiency. Whether it also enables qualitatively new discoveries, or just accelerates the existing optimization playbook, remains to be seen.
So when to use it, and when not to?
Based on the combined evidence:
Autoresearch works well when you have a clear, cheap-to-evaluate metric and a search space full of unexplored configurations. Hyperparameter tuning, bug-finding, fixing oversights in training pipelines, optimizing prompts and tool usage for agents on fixed models. These are high-ROI applications.
It works poorly when you're already near the frontier of optimization, when your evaluation metric is noisy or expensive (like LLM-as-judge scoring), or when you need genuinely novel ideas rather than better configurations of existing ones.
The irony: the better your starting point, the less autoresearch can help you. Karpathy got 11% because nanochat still had real gains left on the table. A team that's already spent months systematically tuning every hyperparameter will find less.
What "the final boss battle" really means.
When Karpathy says all frontier labs will do this (and probably already are), he's right. But not because it unlocks superintelligence. It's because there's free performance on the table.
Every frontier lab has thousands of hyperparameters, configuration choices, and implementation details that no human team can exhaustively tune. Running autonomous agent swarms to systematically search those spaces catches mistakes and finds improvements that compound. At frontier scale, 11% on training efficiency translates to millions of dollars in saved compute. Labs leaving that on the table would be negligent.
Autoresearch will become standard practice. As a tool.
The word doing the most work in Karpathy's post is the one most people glossed over:
"It's not novel, ground-breaking research (yet)."
That "yet" carries the weight of the entire thesis. Whether the loop can eventually produce genuine novelty, not just optimization, is the real open question. Si's preliminary evidence suggests current systems hit ceilings: they saturate, they can mode-collapse, they tend to converge on simpler ideas. Whether those ceilings are fundamental or merely engineering challenges is exactly what makes this space worth watching.
In conclusion.
Autoresearch is a genuinely useful tool. The results are real. Labs use it. It will make models measurably better.
But it's not the final boss battle. Not yet. Today, it's the optimization pass you run after the research direction is set. The battle for genuinely novel ideas still relies heavily on human insight. Whether that remains true as these systems become more sophisticated is one of the most important open questions in the field.
The hype says AI is about to recursively self-improve.
The evidence says AI is about to get really good at tuning things humans forgot to tune.
Only one of them is revolutionary.
Related work.
Here are some other projects that explore variations of the autoresearch loop:
autoresearch-agents (Harrison Chase, 2026) applies the same idea to agent code instead of training scripts, using LangSmith evaluations with LLM-as-judge and code-based metrics.
hermes-agent-self-evolution (Nous Research, 2026) uses GEPA, a genetic prompt evolution algorithm, to optimize agent skills, prompts, and code. It reads execution traces to understand why things fail, not just that they failed.
Automated Design of Agentic Systems (Hu, Lu, Clune; ICLR 2025) is a meta-agent that iteratively programs new agent architectures in code. Different search target than autoresearch, but the same generate-evaluate-select loop.