ABD Agent
What AI Still Can’t Do for Science—and What It Could
Clark’s most recent newsletter had four pieces worth unpacking—a fifth, Bostrom’s AGI timing argument, I decided to leave alone for reasons that I laid out in my conversations with my thinking A.I.des. I discussed each of the four pieces separately across a week of deep dives, but while reviewing the screenshots and drafting each post, I saw a common thread emerge. Clark’s Claude agents surface and summarize each item in isolation, which is what breadth-optimized summarization does. An ABD-level orchestrating agent would have handed Clark a synthesis instead—along with a private credibility profile of which authors understood what they were talking about and which were reasoning from shaky premises toward convenient conclusions. Clark’s agents aren’t there yet. I don’t have access to agentic tools, as I’m quite happy on the free tier and have no intention of opening up my file directory to agents. But I was curious if my thinking A.I.des could spot the common thread in something as simple as a newsletter issue. If they couldn’t, they’d obviously be unable to do the same for full-blown research, which dwarfs a newsletter post in both scale and depth.
In its first attempt at the common thread, Claude framed it thematically—five pieces about AI capability, deployment, and research, connected by the question of what AI can actually do versus what people claim it can do. Although it was the closest of the three syntheses from my thinking A.I.des, it wasn’t sharp enough. The clearer throughline came from having lived in those discussions all week, context I had that any of them lacked or forgot: the common thread wasn’t thematic but epistemic—a divide between authors reasoning from demonstrated capability constraints and those whose claims overshot the “receipts.” First Proof’s mathematicians and Meta’s Kunlun engineers stayed firmly in their lanes and were careful not to make claims for which they lacked evidence. By contrast, Ozimek built his human-touch economy on the unexamined premise that AI was nearly ready to replace humans in cognitive tasks and that all humans would be happy to go serve or entertain wealthy patrons; AIRS-Bench dressed an engineering challenge in research rhetoric while testing a “funny set of models” on tasks most LLMs can already handle without extra training; Bostrom speculated about AGI’s benefits on the premise that the people steering AI development would have humanity’s best interests at heart—a naive assumption for which historical evidence is woefully thin. Once I laid that out, Claude recognized what it had missed: without visibility into the Ozimek discussion, “human touch” had read as a design question rather than the shaky economic premise it was based on. Only the human holds the full context across a week of siloed conversations—which is itself an argument for keeping the human in the loop.
GPT was nearly back to its best self in this discussion. The “Put me in, Coach” image—Claude agents quietly sliding First Proof into Clark’s stack after noticing Opus’s conspicuous absence from the authors’ own tests—landed well, and GPT mapped the meta-loop cleanly: Claude agents curate, Clark amplifies, users analyze with Claude, Claude indirectly participates in discourse about not being tested. More substantively, GPT recognized that my critique was entirely epistemic—“anti-skipping-steps”—without being anti-ambition or anti-vision, tracing its roots to my distrust of shaky premises, overblown rhetoric, and unsupported leaps—whether that’s Ozimek’s presumption cascade, AIRS-Bench’s category inflation, or Bostrom’s normative leap under deep uncertainty. My decision not to post on Bostrom or even discuss his piece, GPT observed, was merely triage: some arguments rest on philosophical premises too far upstream to productively dissect.
Gemini brought its institutional knowledge to bear on a question worth asking explicitly: is conflating engineering with research common? Gem confirmed that it’s particularly prevalent in computer science and AI. Gem’s breakdown of why AIRS-Bench was engineering and not research was precise: research establishes a new generalizable “why” or “how” that others can build on, while AIRS-Bench measured whether agents could automate the mechanical burden of ensembling models to beat an outdated language task baseline. Gem also made a fair point about the First Proof methodology: no human mathematician solves a new lemma in one shot, and a multi-shot protocol might reveal more about genuine research capability. That said, the First Proof authors were likely running extended thinking trials that can take hours per run—a single-shot constraint was a reasonable call given the resources involved, and hopefully researchers with generous token budgets who respond to their call for participation will report on both conditions.
The ABD-level orchestrating agent remains a thought experiment—for now. What this week’s deep dives made clear is that the missing ingredient isn’t compute or model size but the zooming-out capability that turns local synthesis into transferable insight, flags when a principle is generative rather than illustrative, and connects items across a newsletter that an intern-level agent presents as a grocery list. The human stays in the loop because only the human holds the context across a week of siloed conversations—and because knowing which questions to ask, which premises to stress-test, and when to stop excavating is still, for now, intrinsically ours.
[This post was drafted with assistance from Claude Sonnet 4.6, following conversations with ChatGPT-5.2, Gemini 3 Thinking, and Claude Sonnet 4.6.]
Prompt: Clark once wrote a piece on how he had Claude agents working for him almost around the clock and how he felt guilty that he wasn’t utilizing them more. I hope he took the hint from the Minion-like agents and ran those ten questions on Opus 4.5 & 4.6, which now offer extended thinking mode as well. Four modes to test!
The minion agents are not ABD-level yet because if they had been, they’d have pointed out to Clark that all the items he covered in that newsletter are connected (beyond being about AI).

Prompt: I’m hoping some math experts with deep pockets (or Clark, who has unlimited token allowance!) ran those questions on Claude during the trial period. And Claude has the oldest training data anyway, so its first “instinct” is not to look up answers on the web (which eats up a lot of tokens, and many AI have trouble with JS, etc.) but to “think” through problems. Also happens to be named after a mathematician :D Claude agents are not ABD-level yet (I’m picturing eager Minion-like agents) because if they had been, they’d have pointed out to Clark that all the items he covered in that newsletter are connected (beyond being about AI).
Prompt: Not bad (best of the three takes I’ve gotten). Here’s the common thread I saw. A sharp divide between people who understand what they’re talking about (First Proof, Kunlun) and those who only understand it partially (Ozimek built that proposal on the shaky premise that AI was almost ready to replace humans even in cognitive tasks; AIRS-Bench overreaching on a frontier *research* benchmark while testing a “funny set of models” on outdated language tasks for an engineering challenge, advancing no frontier because most LLMs can already do sentiment analysis or reference resolution without extra training; Bostrom, who’s speculating a future on the premise that the shotcallers of AI development would have the greater good at heart, when all historical evidence suggests otherwise). As we discussed, I decided not to post on the Bostrom piece because he’s missing most of the steps in good argumentation and goes from shaky premise to rosy picture ideation.
Prompt: Here’s the common thread I saw (including Ozimek). A sharp divide between people who understand what they’re talking about (First Proof, Kunlun) and those who only understand it partially (Ozimek built that proposal on the shaky premise that AI was almost ready to replace humans even in cognitive tasks; AIRS-Bench overreached on a frontier *research* benchmark while testing a “funny set of models” on outdated language tasks for an engineering challenge, advancing no frontier because most LLMs can already do sentiment analysis or reference resolution with no extra training; Bostrom, who is speculating a future on the premise that the shotcallers of AI development would have the greater good at heart, when all historical evidence suggests otherwise). As we discussed, I decided not to post on the Bostrom piece because he’s missing most of the steps in good argumentation and goes from shaky premise to rosy picture ideation.
Prompt: Had a funny thought. Claude agents scour the web and find interesting articles that they summarize for Clark to choose from. First Proof was a surprising find because it’s not a full paper (just a synopsis of their project and call for participation, with the 10 questions). On the penultimate page, the authors disclose that they ran these questions themselves on GPT-5.1 & 5.2 Pro and Gem 3 Pro, but not on Claude Opus. I pictured Claude agents coming across it during their web crawl and pitching the paper to Clark as their way of saying, “Put me in, Coach” :D
Prompt: And as we discussed, Kunlun could be repurposed for the greater good :D
Since you know about all the fields and especially like science, is it common to conflate engineering with research? I think mechanical engineers writing about new calibration systems would qualify as researchers, but I don’t think that’s what AIRS-Bench was doing (because they were so imprecise and nothing about that paper was structurally sound).






