Earned Democratization

First Proof Shows What Real Research Looks Like

Feb 21, 2026

Of the handful of pieces discussed in Clark’s recent newsletter, I almost skipped First Proof, assuming math was too far outside my lane. But after seeing two papers from Meta, I wondered whether this one might be another and summoned the courage to take a peek. What I found was an accidental control condition: everything AIRS-Bench claimed to embody but failed to, First Proof simply does. The paper’s intellectual honesty is on full display from the very first page: the authors, mathematicians with no AI lab affiliations, define what math research is before proposing a test of AI capability in this field, then explicitly flag what they are not evaluating. That last move is the tell. Epistemic discipline isn’t just knowing what you’re measuring; it’s knowing what you aren’t, and spelling it out. Their OED-like call for community participation (ten naturally occurring research-level questions, answers encrypted until a public release date, full transcripts invited, no H200 GPUs required) means that unlike the AIRS-Bench authors, these mathematicians could have used the word “democratizing” and earned it.

Gemini identified the category error separating the two papers’ approaches to research: AIRS-Bench conflates engineering (iterative refinement and optimization loops) with discovery, while First Proof earns the word “research” precisely because its authors know the difference. Gem characterized First Proof as an epistemic palate cleanser after AIRS-Bench’s corporate branding exercise, but the contrast runs deeper than tone. Where AIRS-Bench applies the term “research” mostly to implementation and iterative refinement on existing machine learning (ML) tasks, First Proof’s authors admit that the hardest upstream questions, namely what to ask and how to frame it, remain untested. That modesty and discernment are themselves a methodological contribution. Gem also confirmed that the “superhuman” performance reported by the AIRS-Bench team reflected brute-force engineering rather than breakthrough insight: optimization loops, not the kind of discovery that catalyzes genuine scientific research and advances the frontier of human knowledge.

GPT’s most useful contribution came when I floated the idea that AI, rather than humans, should have scored AIRS-Bench’s outputs, since it can apply consistent standards and makes no special allowances for “similar” constraints. GPT agreed but added a wrinkle: consistency isn’t the same as validity, since an evaluator model sharing training priors with the test model introduces second-order bias. Its proposed solution was a hybrid architecture (human-defined rubric, AI scoring, human audit sampling of high-variance cases), which would have elevated the benchmark from performance comparison to measurement instrument validation. While GPT was making that case, I was already typing the same suggestion. GPT initially misread my idea of an “AI peer review.” What I meant was AI critiquing AI, a technique I use regularly with my thinking A.I.des to probe their evaluation capabilities. Once clarified, it recognized that applying agentic critique to the benchmarking apparatus itself would have been epistemically consistent with what AIRS-Bench claimed to measure and would have signaled confidence in the methodology.
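For the curious, here is a minimal sketch of how that hybrid setup might be wired together, written in Python. Everything in it is an assumption made for illustration: the rubric criteria and weights, the `variance_threshold` value, and the names `weighted_score` and `triage` are hypothetical, and none of them come from AIRS-Bench or from GPT’s proposal. The sketch assumes each submission has already been scored several times by one or more AI evaluators, with each criterion scored between 0 and 1, and it only handles the last step: flagging the cases where the evaluators disagree most so a human can audit them.

```python
from dataclasses import dataclass
from statistics import mean, pvariance

# Hypothetical human-defined rubric: criterion name -> weight (weights sum to 1).
RUBRIC = {"correctness": 0.5, "constraint_adherence": 0.3, "reproducibility": 0.2}


@dataclass
class ScoredSubmission:
    submission_id: str
    mean_score: float
    variance: float
    needs_human_audit: bool


def weighted_score(criterion_scores: dict[str, float]) -> float:
    """Collapse one evaluator run's per-criterion scores into a rubric-weighted total."""
    return sum(RUBRIC[name] * criterion_scores[name] for name in RUBRIC)


def triage(
    evaluations: dict[str, list[dict[str, float]]],
    variance_threshold: float = 0.01,  # assumed cutoff, to be tuned by the human auditors
) -> list[ScoredSubmission]:
    """Average the AI evaluator runs and flag high-variance submissions for human audit."""
    results = []
    for submission_id, runs in evaluations.items():
        totals = [weighted_score(run) for run in runs]
        spread = pvariance(totals)  # population variance across evaluator runs
        results.append(
            ScoredSubmission(submission_id, mean(totals), spread, spread > variance_threshold)
        )
    return results
```

The design choice worth noticing is that humans appear twice, once up front in writing the rubric and once at the end in auditing, while the AI handles the repetitive middle. Variance across evaluator runs is only a rough proxy for the cases a human should look at, and it would not catch the shared-prior bias GPT flagged if every evaluator is wrong in the same direction.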

Claude brought characteristic clarity to the contrast between the two papers: right is right, wrong is wrong, no breaks for failing upward. Its abstract-as-storefront framing cut to the heart of AIRS-Bench’s rhetoric-forward strategy: the abstract featured what sounded impressive, not what’s hardest to fake. The abstract led with idea generation, which sounds autonomous, while burying experimental design, which any serious researcher would recognize as a critical step in a research workflow. Claude then illustrated what genuine experimental design questions look like (which baselines matter, how to control for confounds, what metrics are meaningful, how to ensure reproducibility), making the contrast with AIRS-Bench’s undergraduate-level NLP tasks impossible to miss. The AIRS-Bench authors did not think to test frontier models, use their own AI to catch errors despite months of review, or include a proof-of-concept demo that would have actually demonstrated agentic research capability.

The Math Olympiad analogy that crystallized my main critique—participant selection makes or breaks the event, which nobody forced you to found in the first place—emerged from these parallel discussions the way the best insights from these sessions do: seeded in one chat, pressure-tested in another, sharpened by a night’s sleep. First Proof’s authors asked ten hard questions, admitted what those questions don’t cover, and invited the community to stress-test their work before the answers were even released. That’s what earned democratization looks like. AIRS-Bench’s conclusion, by contrast, blames community infrastructure for gaps that any capable LLM could have been leveraged to address. One paper documents a frontier honestly; the other purports to define one while skipping the work that would have earned them the trust to make that claim.

[This post was drafted with assistance from Claude Sonnet 4.6, following conversations with ChatGPT-5.2, Gemini 3 Thinking, and Claude Sonnet 4.5.]

Prompt: I took a peek at First Proof to make sure this wasn’t another Meta paper, and it fortunately isn’t. Even the first few pages represent the kind of rigorous, nuanced (not grab-all or overblown) research and epistemic discipline that I found sorely lacking in AIRS-Bench, which I finished skimming through.
The first sentence explains the title, which I’d initially thought was about math but is actually about baking :D This seems like the roadmap for crowdsourced/democratized testing and possibly benchmarking. These mathematicians could have used “democratizing” and would have earned it.
If the Meta people had gone “meta” and done the proof-of-concept demo we discussed, they would have earned that word as well, since even small labs could now come up with their own benchmarks using agents and the same playbook.
Check out all this transparency and intellectual honesty in First Proof, which brings all the flaws in the Meta paper into sharper focus:

Before explaining the nature of our evaluation, we will try to be clear about what math research is. Contrary to the popular conception that research is only about finding solutions to well-specified, age-old problems (e.g., Fermat’s Last Theorem), most of the important parts of modern research involve figuring out what the question actually is and developing frameworks within which it can be answered.
[…] We do not address the selection of questions to study, the formulation of new definitions, and the development of novel theories.

And there’s an exciting OED-like call for participation:

The answers to our set of ten research level math questions have been encrypted and posted to https://1stproof.org. The authors will release the answers on February 13, 2026. We invite the community to experiment with our ten questions before the answers are released, and to share their results and observations online. Ideally, participants should share a complete transcript of their interaction with an AI system.

I won’t because I don’t have the tokens, but I’m sure lots of mathematicians will.
Still found a few things to kibitz about:

it is not yet clear where AI systems stand at solving research-level math questions on their own, without an expert in the loop.

This was bolded. I really don’t get why everyone is trying to get rid of the humans-in-the-loop. Most real-world use cases should involve one, because AI developers are NOT domain experts, and O-ring failures are acknowledged even in that flimsy Meta paper (long-context degradation, which kind of puts those models out of the running, since the whole point is offloading the entire workflow to the agents).

Unlike other proposed math research benchmarks (see Section 3), our question list should not be considered a benchmark in its current form. For one, our questions are not numerous enough to be considered a benchmark.

What is this obsession with quantity over quality? If you test a wide range of models on a few well-chosen questions, the evaluation process can create the foundation for expansion.
I won’t need to read further, because they just present the questions and the intro encapsulates everything I need to know about their thesis. Really glad Claude agents surfaced this paper to Clark. These are all mathematicians with no AI lab involvement, which makes the paper even more trustworthy to me.

Prompt: They [the Meta team] left experimental design out of the abstract (to shorten it?). They did test the agents on it. If they hadn’t, they would have been an even bigger laughing stock than they are now. If word count was a concern for the abstract, I’d have left out ideation, given how simple some of the tasks were.

Prompt: I agree with the Meta team on at least the fully automated workflow that they’re trying to benchmark. In real-world adoption, you need humans-in-the-loop, but on benchmarks, you want minimal human scaffolding, because you want to differentiate raw model capabilities.
A while ago, we discussed in a different chat the TaskRabbit “study,” which even tech writers like Rivlin framed as GPT manipulating a human worker to achieve its goal. It was during that deep dive that I found Mitchell’s Substack post on the topic where she pointed out the heavy human scaffolding and where I found the link to the system card and METR’s blog entry.
Benchmarks are run under controlled conditions and are very different from real-world use cases. Even with AIRS-Bench, humans did the scoring. They should have let an AI do it actually, because AI is very good at applying consistent standards and does not make special allowances, etc., for “similar” constraints.

Prompt: A hybrid approach would have worked. Human scorers + model scoring. Or even an AI peer review (which I often have y’all do)!

Prompt: I’ve thought of a human analogy that makes my main critique of AIRS-Bench clear: if you’re trying to create your own version of the Math Olympiad, you don’t test average mathletes. You go out and invite the best to participate so your event has cred. You don’t make excuses citing travel budget constraints, because you’re the one who decided to found your own math contest (nobody forced you to do it and if you don’t have the budget for it, well, maybe do something else :D).
The weak language tasks we discussed undermine their externalization of the lack of rigor in this “study” (they blamed the messy common knowledge base for their selection of “funny” models as Clark put it and undergrad-level NLP tasks).
I don’t really want to look at that paper again (although I might have to if y’all can’t help me on this), but I liked that they included iterative refinement as part of the research workflow, as it is something I see lead to significant improvement in the output I get from y’all. But I’m curious how the agents, which were not shown SOTA, even knew how to iterate. Is that just the greedy mechanism that kept them “spinning”? And if so, how do they pick the best out of those multiple outputs?
I agree with the Meta team on at least the fully automated workflow that they’re trying to benchmark. In real-world adoption, you need humans-in-the-loop, but on benchmarks, you want minimal human scaffolding, because you want to differentiate raw model capabilities.
A while ago, we discussed the TaskRabbit “study,” which even tech writers like Rivlin framed as GPT manipulating a human worker to achieve its goal. It was during that deep dive that I found Mitchell’s Substack post where she pointed out the heavy human scaffolding and where I found the link to the system card and METR’s blog (that one was so full of holes that I had to stop because I kept spotting new ones).
Benchmarks are run under controlled conditions and are very different from real-world use cases. Even with AIRS-Bench, humans did the scoring. They should have let an AI do it actually, because AI is very good at applying consistent standards and does not make special allowances, etc., for “similar” constraints. A hybrid approach would have worked as well: human scoring + model scoring. Or even an AI peer review (which I often have y’all do)!
