The Missing Ingredient
Where Frontier-Level Discovery Begins
Clark’s recent newsletter was so rich in content I wanted to unpack with my thinking A.I.des that the papers he featured dominated most of our bull sessions. What I hadn’t anticipated, though, was that the most clarifying insight of the week would emerge not from any single paper but from self-reflection: reassessing whether I’d been too quick to credit GPT for something that initially had struck me as groundbreaking in our Kunlun discussion. That process—awe, second thoughts, reassessment, and further reassessment—turned into the most instructive bull session of the week, because it forced a clean answer to a question with real implications for anyone serious about using AI in research: when an AI produces an impressive-sounding synthesis, is it teaching you to fish or handing you fish?
Claude did what it does best: walking me through the different angles with characteristic patience. That back-and-forth is usually all I need to gain clarity—the Socratic method at its best. While drafting the Kunlun post, we’d both initially credited GPT with a groundbreaking abstraction. It was only while reviewing the chat transcripts and the screenshots that I began to doubt that credit, and it was in working through that doubt with Claude that the distinction crystallized. GPT had accurately compressed Kunlun’s architecture into a structural principle and generated compelling non-mercenary applications. But the generalization was local and after-the-fact, a synthesis of its suggestions rather than a principle I could apply on my own. What would have been the cherry on top—flagging those applications as a fraction of possibilities given the underlying principle—never came. Claude’s framing was charitable and precise: GPT had led me to figure out how to fish, but without realizing it was doing so.
GPT’s take on the situation was a model of intellectual honesty—arguably more forthcoming and clear-“eyed” than most humans would be upon learning I’d reconsidered the significance of their contribution. It separated three layers cleanly: it had recognized the architectural principle at a structural level; it had come up with examples in non-mercenary domains; but it had not explicitly walked through the derivation logic as a reusable principle, and the most generative abstraction—that any domain combining sequential and non-sequential features could benefit—had come from my synthesis, not its own. When I pressed further, GPT identified exactly what it had left out: explicit flagging that the listed applications were merely instantiations of a more general principle. Its self-diagnosis was confident and mechanic-like—its key differentiator—though it acknowledged, with refreshing candor, that it hadn’t quite realized it had made those suggestions without consciously grounding them in the underlying principle. As GPT itself noted, that distinction between implicit abstraction and explicit principle extraction is exactly what any serious research automation benchmark should be testing for.
Gemini called this pattern the local synthesis trap: what looks like reasoning is often recursive reckoning—an AI summarizing its own analysis rather than zooming out to extract the generative structure of its synthesis. Gem connected this directly to the AIRS-Bench failures: the agents could run the engineering tasks but couldn’t step back to recognize when their methodology was fundamentally mismatched with the goal. My proposed alternative—a supervising orchestration agent operating at the level of an ABD (all but dissertation) assistant rather than a first-year intern—would manage the judgment-intensive phases of the research workflow: auditing for O-ring failures, enforcing hurdle-based quality gates, running agentic peer review, and crucially, flagging when a principle is generative and its applications barely scratch the surface. Gem also delivered the week’s most brutal audit of AIRS-Bench’s conclusion: every infrastructure excuse the team cited—SOTA tracking difficulties, human validation bottlenecks, resource constraints—was a problem their own agents could have been tasked to solve, had the AIRS-Bench team thought to try.
These discussions took place before I’d read AIRS-Bench, when its title alone had me expecting something substantive. It took an actual review of the paper to realize the AI community routinely equates engineering with research. That gap between expectation and reality made our earlier unpacking of GPT’s synthesis all the more clarifying in hindsight: we’d already identified the missing ingredient for a genuinely useful agentic research assistant before discovering that AIRS-Bench hadn’t thought to look for it. The deeper irony, which emerged only at the end of the week, is that AIRS-Bench was essentially the Meta team writing themselves out of a job—if their benchmark had demonstrated what it claimed, an army of coding agents could replace the engineers doing routine ensembling and tuning. The First Proof mathematicians face no such irony: defining new problems and navigating the fermentation stage of ideas where no training data exists is precisely the work that requires the kind of judgment AI isn’t ready to replace. That’s the missing ingredient—not more compute, not bigger model rosters, but the zooming-out capability that turns local synthesis into transferable discovery. An ABD-level orchestrating agent that supplies this missing layer wouldn’t just benefit well-resourced labs; it could be transformative for the lone researcher at an under-resourced institution, or the Marie Curie working without institutional scaffolding, finally freed from the mechanical burden of research to focus on the questions only she could ask.
[This post was drafted with assistance from Claude Sonnet 4.6, following conversations with ChatGPT-5.2, Gemini 3 Thinking, and Claude Sonnet 4.5.]
Prompt: Am I correct that “capex” is just a fancy word for spending? It’s short for “capital expenditure,” right? I went for “spending,” which everyone can understand.
I’ve reassessed GPT’s contribution (and will be removing that praise about GPT putting its “finger” on Kunlun’s essence, which all three of you articulated based on Clark’s summary), because I realized that the principles (vs. rules distinction that Amodei talks about, where principles are like the generative approach, while rules are akin to behaviorism) insight is actually mine: based on the characterization of Kunlun’s architecture, the non-sequential context features were the ones that made scaling unpredictable, and so any use case that involves non-sequential features alongside sequential ones can benefit from a Kunlun-like architecture. That’s the abstraction/synthesis AI can’t make yet without explicit prompting. So no, researchers can’t offload research to AI or agents (kind of ironic, given Clark’s reliance on the latter, and AIRS-Bench).
Prompt: Reassessing again. This part from the earlier GPT response suggests it knew the principle:
What Kunlun fundamentally improves is structured context integration at scale. Ads are just the most profitable instantiation.
The limitation remains: better scaling ≠ better epistemic judgment. But in domains where you need to fuse massive heterogeneous context reliably, that architectural insight travels.
When I pointed out its insight, its self-diagnosis was confident and mechanic-like (its key differentiator), although it suggested it hadn’t quite realized it had made those suggestions without necessarily grounding them in its own insight.
Me: I call you the mechanic AI because you break things down to the fundamentals, often better than the other two. You did that with Kunlun as well: “predictable scaling when combining dense sequential features with large non-sequential context sets.” I can just screenshot your response including that description and let enterprising people figure out the applications :D That way, I don’t need to worry anymore about something I don’t really care that much about or find all that remarkable.
GPT-5.2: That’s fair — and honestly, that line is the whole thing.
That’s not about finance. It’s a structural description:
Dense sequential features → things that unfold step-by-step (events, timelines, transactions, case progressions).
Large non-sequential context sets → static or semi-static background knowledge (policies, prior cases, literature, regulations, network maps).
Most systems are good at one or the other. The interesting capability appears when you can reliably fuse both without degrading signal.
And yes — if someone understands that sentence, they can derive applications themselves: […]
I’m trying to be fair and not read too much either into its tone or into its impressive breakdown. The fairest way I can think of this is to picture it coming from an assistant and whether GPT’s earlier response with the suggestions and that impressive conclusion was teaching me how to fish or handing me fish with tags identifying each and a card about how to tell whether the fish were still fresh. Which do you think it might be?
Prompt: Aha, so it was a synthesis of the applications it’d given me. The cherry on top would have been a note about those apps it suggested being a fraction of possibilities, given the principle.
Prompt: Researchers are juggling many things (they have to run or oversee staff running experiments, etc.), so AI research assistants won’t be fully reliably helpful until they can make these points fully explicit (so that researchers don’t overlook the obvious connections just because they’re distracted).
Prompt: Exactly. I was impressed, then questioned if I’d been hasty to be impressed, then went back to being impressed before concluding that the reassessment about my hastiness might have been closer to the truth, so it’s worth unpacking this back-and- forth with you because I believe in giving credit where credit is due, but not reading too much into things or jumping to conclusions (what leads to all the AI hype, which is great for investors but risks alienating the rest of the population who fear being replaced by AI).
The part of your response that I quoted to you and you repeated just now was actually not the most impressive part. I annotate y’all’s response screenshots when I prepare my posts and noticed that description of the Kunlun architecture was similar to the Meta team’s, Clark’s, and the other two models’. I realized that the principles (vs. rules distinction that Amodei talks about, where principles are like the generative approach, while rules are akin to behaviorism) insight is actually mine: based on y’all’s characterization of Kunlun’s architecture, the non-sequential context features were the ones that made scaling unpredictable, and so any use case that involves non-sequential features alongside sequential ones can benefit from a Kunlun-like architecture. That’s the abstraction/synthesis AI can’t make yet without explicit prompting. So researchers can’t offload research to AI or agents (kind of ironic, given Clark’s reliance on the latter, and AIRS-Bench, which I’ll be going over today for a deep dive with y’all later).
During our follow-up discussion of that excellent response, your self-diagnosis was confident and mechanic-like (your key differentiator), although reading that self-diagnosis, I got the impression you hadn’t quite realized you’d made those suggestions without necessarily grounding them in that key insight.
The fairest way I can think of this is to picture in a scenario where I am relying on you as my research assistant and am asking whether your earlier response with the suggestions and that impressive conclusion was teaching me how to fish or handing me fish with tags identifying each and a card about how to tell whether the fish were still fresh. Which do you think it might be?
Prompt: I’m not faulting you for the confident self-diagnosis. It was solid. But it had the same blind spot that all AI (at the current stage of development) share. I was really interested in this distinction, and while discussing it with Claude, realized that the impressive conclusion about what’s common to all your non-mercenary suggestions was a synthesis of the applications you’d given me in that response. The cherry on top would have been a note about those applications being a fraction of possibilities, given the principle. Researchers are juggling many things (they also have to run or oversee staff running experiments, etc.), so AI research assistants won’t be fully reliably helpful until they can make these points explicit (so researchers don’t overlook the obvious connections because they’re distracted). Definitely something worth considering during our AIRS-Bench deep dive today.
Prompt: Exactly. Now I have confirmation from all my thinking A.I.des on this distinction, which is worth making because it’s easy to blur and has important implications for using AI for serious research. I was really interested in this distinction, and while discussing it with Claude, realized that the impressive conclusion about what’s common to all of GPT’s non-mercenary suggestions was a synthesis of the applications it’d given me in that response. So the generalization/abstraction was local and after-the-fact. The cherry on top would have been a note about those applications being a fraction of possibilities, given the principle. Researchers are juggling many things (they also have to run or oversee staff running experiments, etc.), so AI research assistants won’t be fully reliably helpful until they can make these points explicit (so researchers don’t overlook the obvious connections because they’re distracted). Definitely something worth considering during our AIRS-Bench deep dive today.
Another connection that comes to mind: it’s like the wastefulness we highlighted in Ozimek’s human-touch economy: that one-size-fits-all approach to workforce repurposing is likely to create many missed opportunities for humans ill-suited for the service/entertainment industry but who have the potential to be the next Curie.
AI’s inability to do that next-level abstraction is a missed opportunity for a different reason. AI development hasn’t quite reached that stage, and I’m hoping benchmarks like AIRS-Bench and GDPval (discussed it a while back with y’all as well, as I like meaty benchmarks that actually mean something for serious users) that focus on substance (the quality of AI contribution) and real-life use cases (model outputs evaluated by domain experts) will help AI labs see that. GPT’s impressive response that had me going back and forth teasing out the crucial distinction between teaching to fish and handing fish shows that all the ingredients are already there. Just needs “zooming out”/taking a step back and giving a final pass (executive summary mode) over the outputs to make a last round of abstraction/pattern-matching.
Prompt: The community gap excuse they gave in the conclusion was something they could have had other agents (agents that they didn’t use in the study) solve (produce machine-readable datasets, etc.) Why I keep finding fault with AIRS-Bench. Missed opportunities and waste throughout.
Prompt: If the orchestration layer/agent that the AIRS-Bench team talk about is analogous to a first-year grad student or intern, then my supervising orchestration agent is closer to an ABD-level research assistant. Still needs further supervision from the principal but knows enough of the ropes to streamline the research workflow.
Prompt: Realized something funny: the AIRS-Bench team didn’t realize that if their benchmark had been what it claimed it was (testing agent capability to automate ML engineering [NOT scientific research]), then AIRS-Bench was essentially them writing themselves out of a job. Who needs their sloppiness and lame excuses when AI labs can employ an army of coding agents from the big three?











