Experiments in Software 3.0: Autoresearch
Introduction
The key insight behind Karpathy’s autoresearch,1 an autonomous AI scientist, is that the program.md that gives the LLM instructions for how to perform research is not documentation. It is a prompt that acts as code, hence the name program.md. This is Prompt = Code = Data2 in its most literal form.3
I applied autoresearch to Gaussian Process (GP) regression. GPs form a great testbed for autoresearch for several reasons.
- GPs are intimately connected to neural networks. In particular, an infinitely wide neural network is a GP,4 so they serve as a simpler model of what neural networks do.
- GPs can be measured by both prediction error (RMSE) and calibration (NLL), letting us test how autoresearch performs when two objectives must be balanced rather than one.
- GPs can incorporate neural networks via deep kernel learning (DKL),5 which composes a learned embedding (Software 2.0) with a predefined kernel (Software 1.0), providing a rich space for an autoresearcher (Software 3.0) to explore.
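The DKL composition in the last bullet can be sketched in a few lines. This is a minimal numpy illustration of the idea, not the SoftKI implementation: the embedding weights and RBF base kernel here are hypothetical stand-ins for the learned (Software 2.0) and predefined (Software 1.0) parts.

```python
import numpy as np

def embed(x, W1, W2):
    """Hypothetical learned embedding (the Software 2.0 part).
    tanh keeps features bounded, mirroring the bounded-activation
    finding discussed later in the post."""
    return np.tanh(np.tanh(x @ W1) @ W2)

def rbf_kernel(a, b, lengthscale=1.0):
    """Predefined RBF base kernel (the Software 1.0 part)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def deep_kernel(x1, x2, W1, W2):
    """Deep kernel: the base kernel applied to learned embeddings."""
    return rbf_kernel(embed(x1, W1, W2), embed(x2, W1, W2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 4))
K = deep_kernel(X, X, W1, W2)  # a valid 5x5 kernel matrix
```

In DKL proper, `W1` and `W2` (and the kernel hyperparameters) are trained jointly by maximizing the GP marginal likelihood; the sketch only shows the composition.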
Case Study: Deep Kernel Learning with SoftKI
Why approximate GPs? Standard GPs scale as \(O(N^3)\) in the number of data points, which limits them to small datasets. Unfortunately, an autoresearcher may run hundreds of experiments on datasets with tens of thousands of points, so exact GPs are out of the question.
Why SoftKI? Among approximate GPs, I chose SoftKI,6 because its optimization procedure easily adapts to support DKL, giving the agent a rich but tractable design space. It is also one of the few approximate GPs with an extension to derivative observations called DSoftKI.7 Fitting derivative observations is a harder optimization problem with less existing literature, and so this enables us to assess autoresearch’s ability to navigate genuinely novel territory as future work.
The Setup
The question we posed to the autoresearch agent: can the SoftKI approximate GP be extended with DKL to improve both prediction accuracy (RMSE) and calibration (NLL)? We use 8 datasets from the UCI Machine Learning Repository,8 a standard GP regression benchmark. The datasets span 8 to 385 dimensions and thousands to millions of data points each.
What the Agent Found
I gave the agent a program.md adapted for SoftKI and DKL.9 Over 124 experiments across two days, it progressed through several phases of discovery. The agent explored largely autonomously, requiring only an occasional nudge from me to keep experimenting.
Phase 1: Regularization recipe (experiments 1–58). The agent’s first challenge was feature collapse,10 a well-known problem where DKL embeddings compress all inputs into a small region, destroying calibration. The program.md was seeded with references to this problem. The agent tried weight decay, spectral normalization, and cosine learning rate (LR) scheduling. The agent concluded that cosine LR annealing was the best regularizer and also found that bounded activations (tanh) were critical for numerical stability with softmax interpolation.
Phase 2: Simplification (experiments 59–109). In the next phase, the agent tuned the recipe. It found that a higher minimum LR was critical for longer training and that using more training data (70% vs 50%) directly helped. The program.md also contains a simplicity criterion—all else being equal, simpler is better. As an example, it found that spectral normalization was unnecessary once cosine LR was in place, and removed it.
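The two learning-rate findings, cosine annealing (Phase 1) and a higher minimum LR for long runs (Phase 2), fit in one standard schedule. The formula below is the usual cosine-annealing-with-floor schedule; the specific values are illustrative, not the agent's actual hyperparameters.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-2, lr_min=1e-4):
    """Cosine annealing from lr_max down to a floor of lr_min.
    A nonzero lr_min matters: the agent found a higher floor was
    critical for longer training runs (values here are illustrative)."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

At step 0 this returns `lr_max`, decays smoothly, and settles at `lr_min` by the end of training, so raising `lr_min` keeps the late-training updates from vanishing.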
Phase 3: CG solver switch (experiments 110–124). The original version of SoftKI uses a QR-based linear solver that is more numerically stable than a direct solve but requires more memory. The autoresearch codebase omitted QR and only exposed two options: a direct solve (the default) and conjugate gradients (CG). The rationale was that DKL is more compute-intensive and could potentially learn a more numerically stable representation, so I opted for the more scalable approaches. Late in the run, the agent tested the CG solver and found it dramatically improved long-epoch training—more on this in the retrospective below.
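For readers unfamiliar with the trade-off: CG solves a symmetric positive-definite system iteratively using only matrix-vector products, so it avoids materializing or factorizing a dense solve. This is a plain textbook CG sketch, not SoftKI's solver.

```python
import numpy as np

def cg_solve(A_mv, b, tol=1e-6, max_iter=1000):
    """Conjugate gradients for an SPD system A x = b, given only a
    matrix-vector product A_mv. Each iteration costs one matvec,
    and the residual norm is driven below tol."""
    x = np.zeros_like(b)
    r = b - A_mv(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A_mv(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Illustrative use: solve a well-conditioned SPD system of the kind
# that arises as (K + noise^2 * I) x = y in GP inference.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.normal(size=50)
x = cg_solve(lambda v: A @ v, b)
```

Stopping CG early yields an approximate solve, which is the kind of implicit regularization the agent's (unverified) rationale appealed to.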
Average RMSE improvement: 28% across 8 datasets. All configurations achieved equal or better NLL (calibration).
| Dataset | Dim | Baseline RMSE | Best DKL RMSE | RMSE ↓ |
|---|---|---|---|---|
| kin40k | 8 | 0.128 | 0.036 | 72% |
| ctslices | 385 | 0.033 | 0.017 | 48% |
| pol | 26 | 0.085 | 0.052 | 39% |
| song | 90 | 1.096 | 0.846 | 23% |
| elevators | 18 | 0.422 | 0.344 | 18% |
| keggdirected | 20 | 0.102 | 0.088 | 14% |
| protein | 9 | 0.624 | 0.581 | 7% |
| buzz | 77 | 0.254 | 0.239 | 6% |
Autoresearch Retrospective
Each era of software has its defining failure mode. Software 1.0 fails when code is bad. An entire discipline of software engineering (DRY, code review) exists to prevent it. Software 2.0 fails when data is bad. Data quality and curation are paramount, lest we obtain garbage in, garbage out. Software 3.0, it seems, fails when prompts are bad. The emerging disciplines of prompt engineering and context engineering are only beginning to address this. My experience with autoresearch is consistent with the difficulties and nuances of prompt specification.
Specification, Specification, Specification
Early versions of my program.md, adapted from the LLM setting, told the agent to minimize RMSE. The codebase already computed NLL, and GPs are Bayesian models where calibration is a first-class concern. I assumed the agent would pick up on this context. Unfortunately, it obliged the literal specification and destroyed calibration, producing models with excellent predictions but unusable uncertainties. I had to revise the program to explicitly require that both RMSE and NLL improve, treating them as co-equal objectives. Only then did the agent start navigating the trade-off rather than optimizing one metric at the expense of everything else.
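The two metrics pull in different directions because they score different things: RMSE only sees the predictive mean, while Gaussian NLL also scores the predictive variance. The standard definitions make the failure mode explicit (the numbers below are illustrative, not from the experiments):

```python
import numpy as np

def rmse(y, mu):
    """Root mean squared error: depends only on the predictive mean."""
    return np.sqrt(np.mean((y - mu) ** 2))

def gaussian_nll(y, mu, var):
    """Average negative log-likelihood of y under N(mu, var).
    Penalizes miscalibrated variances even when the mean is good."""
    return np.mean(0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var))

y = np.array([0.0, 1.0])
mu = np.array([0.1, 0.9])              # accurate mean: RMSE = 0.1
nll_calibrated = gaussian_nll(y, mu, np.ones(2))
nll_overconfident = gaussian_nll(y, mu, np.full(2, 1e-6))
# Same RMSE in both cases, but the overconfident variance makes
# the NLL explode — excellent predictions, unusable uncertainties.
```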
The Devil is in the Details
The agent eventually tried the CG solver at experiment 110 and found it dramatically improved long-epoch training. It called this a “breakthrough.” But here is where the details get tricky. The agent produced a plausible-sounding rationale for why CG helped—something about implicit regularization from approximate solves. As one of the authors of SoftKI, I knew from experience that the direct linear solver is numerically inferior to the QR-based solver, and that the CG solver improves upon the linear solver. This information is documented in the appendix of the paper, which the agent had access to. Had I not had this prior experience, I might have accepted the agent’s rationale as plausible, even though it was not easily verifiable. The agent also did not rebaseline all results with a CG solver (without DKL) to isolate the effect of DKL from the effect of the solver. Without this check, the reported gains may conflate two independent improvements. Could a better program.md have caught this? Perhaps, but it is hard to specify what you don’t know to ask for.
Conclusion
Autoresearch removes a large part of what makes computational research tedious. I would have spent days manually ablating activation functions, learning rate schedules, and embedding dimensions—the agent did it while I was AFK. The ability to search a design space by specifying a markdown file makes it too easy not to use. Additionally, the process of setting up the loop forces you to have an effective test harness and be clear about your metrics, which is invaluable in its own right.
But autoresearch is also surfacing new bottlenecks. The agent generates explanations and research directions faster than I can triage them. It is too early to tell if the time saved from manual experimentation will outweigh the time spent interpreting the results. Dijkstra anticipated this tension:11 “The virtue of formal texts is that their manipulations, in order to be legitimate, need to satisfy only a few simple rules; they are, when you come to think of it, an amazingly effective tool for ruling out all sorts of nonsense that, when we use our native tongues, are almost impossible to avoid.” He concluded: “I suspect that machines to be programmed in our native tongues—be it Dutch, English, American, French, German, or Swahili—are as damned difficult to make as they would be to use.” LLMs proved him wrong about the feasibility. But the ease of writing nonsensical specifications and the cost of making sense of their results is exactly what he warned about. We are venturing boldly into the Software = Code + Data 3.0 era.12
Footnotes
Karpathy, autoresearch, GitHub, 2026.↩︎
Huang, “The ‘Software = Code + Data’ 3.0 Era”, Base26 Labs, 2025.↩︎
Applied to LLM pretraining, autoresearch is a form of recursive self-improvement: a Prompt that improves Code for distilling knowledge from text Data, which can in turn improve future prompting. Applied to other domains, the loop is no longer recursive but the equation still holds.↩︎
Neal, Bayesian Learning for Neural Networks, Springer, 1996.↩︎
Wilson et al., “Deep Kernel Learning”, AISTATS 2016.↩︎
Camano and Huang, “High-Dimensional Gaussian Process Regression with Soft Kernel Interpolation”, TMLR 08/2025.↩︎
Huang, “Scaling Gaussian Process Regression with Full Derivative Observations”, TMLR 01/2026.↩︎
Kelly, Longjohn, and Nottingham, UCI Machine Learning Repository.↩︎
The autoresearch codebase, program.md, and experiment logs are available at base26labs/softki_gp_autoresearch.↩︎
van Amersfoort et al., “On Feature Collapse and Deep Kernel Learning for Single Forward Pass Uncertainty”, 2021. See also Ober et al., “The Promises and Pitfalls of Deep Kernel Learning”, UAI 2021.↩︎
Dijkstra, “On the Foolishness of Natural Language Programming”, EWD 667, 1978.↩︎
Huang, “The ‘Software = Code + Data’ 3.0 Era”, Base26 Labs, 2025.↩︎