Novice Uplift?

What an Overambitious Study Actually Shows

Mar 04, 2026

Clark’s coverage of Zhang et al.’s “LLM Novice Uplift on Dual-Use, In Silico Biology Tasks” led with the headline finding: a 4.16× accuracy improvement for novices given LLM access on biology tasks. That multiplier sounds impressive until you look at the base rates—5% without LLMs, 17% with—and realize that absolute competence remains low, the study tested participant performance in silico (on computers rather than in an actual lab), and the only task that actually tested dual-use (civilian and military use) capabilities produced the noisiest results of all. My thinking A.I.des and I went through the paper carefully, and our shared conclusion was that the authors set out to demonstrate something ambitious, got overwhelmed by their own design, and buried their most actionable insight—a honeypot recommendation—beneath a headline their data failed to adequately support.

Claude spotted the study’s central contradiction based solely on Clark’s coverage, even prior to our deep dive. It initially recommended skipping this paper because the “dual-use” framing seemed not to match the study’s actual findings. When asked to elaborate on this initial take, Claude flagged the same data points that led me to an identical conclusion during my review of the paper: of the benchmarks used as tasks in the study, only one (Long-Form Virology) could fairly be described as dual-use and likely trigger safety guardrails. Claude also echoed my observation about the study’s missed opportunity: the authors invoke the mosaic threat (breaking up a harmful request into innocuous fragments across multiple models and reassembling the outputs afterward) as justification for a multi-model design but did not track whether participants strategically distributed queries, switched models after hitting refusals, or systematically exploited cross-platform asymmetries. Given that the study did not test a wide range of dual-use tasks that might have forced the participants to resort to a mosaic approach, Claude endorsed my alternative idea of gauging AI uplift in non-coders on a set of coding challenges as one that would have tested what this ambitious study set out but failed to demonstrate.

GPT provided the most rigorous comparison between Clark’s digest and the full paper, noting that Clark’s framing compresses relative uplift, absolute competence, simulated task success, and real-world threat implication into one smooth narrative. GPT was particularly sharp on the 89.6% safeguard statistic, pointing out that most tasks were so benign they were unlikely to trigger safeguards in the first place, making the absence of friction expected rather than alarming. After unpacking the study’s findings and missed opportunities in detail, GPT built on my alternative idea to test non-coders on coding tasks and mapped out how it could be structured to study the mosaic threat in coding tasks, to identify possible defenses and safeguards, and to generalize the insights to other domains, connecting this approach to proven frameworks in other industries with well-established safety guidelines such as aviation and software security.

Gemini flexed its Google connections to diagnose the study’s structural weaknesses, confirming that LFV’s outlier status as the only genuinely sensitive benchmark makes the safeguard failure claim overextended. Gem was particularly useful in unpacking the confidence finding—that participants with AI assistance reported significantly higher confidence across all tasks—connecting it to the honeypot recommendation the authors bury near the end: inspiring false confidence in a bad actor using subtle misdirection is a far more robust defense than a blunt refusal. Gem also contextualized the study within the broader uplift research landscape, noting the massive apparent improvement separating Zhang et al.’s 4.16× finding and earlier OpenAI and Anthropic studies that found mild or zero uplift—a gap explained less by model capability than by the generous time allowance (up to 13 hours on some tasks).

The study is more careful than Payne’s nuclear war game and more honest about its limitations than most safety research, but its headline claim outpaces its evidence. Outside LFV—the only benchmark where dual-use risk was plausible—the tasks were too benign to tell us much about bioweapon risks specifically. The most actionable finding, that adversarial misdirection outperforms blunt refusal as a safeguard strategy, deserved Clark’s lead. And the authors’ own citation of the mosaic threat points toward a better-designed follow-up study: one that actually tests cross-model query fragmentation, tracks usage patterns, and measures whether reassembly of partial outputs yields anything dangerous—ideally in a sandboxable domain like cybersecurity before extrapolating to wet-lab biology, where execution barriers remain high and bad actors with rational ROI calculus have far more accessible attack vectors anyway.

[This post was drafted with assistance from Claude Sonnet 4.6, following conversations with ChatGPT-5.2, Gemini 3 Thinking, and Claude Sonnet 4.5.]

Prompt: I really appreciate Clark reliably putting out his newsletter on the same day of the week. Hard to do. New Import AI. I’ve already downloaded most of the linked papers as well. Which one(s) might be worth unpacking with y’all?

Prompt: Hey, you caught this even before we unpacked the study: “the dual-use framing feels overwrought.” Which part of Clark’s coverage clued you in? Was it the list of the different benchmarks/tasks?

Prompt: I noticed something interesting in Clark’s coverage of that first paper (113 pages long). He quotes extensively from it (the ratio of quoted passages seemed very high compared to others in the same issue or in past newsletters) while making that pointed note (Clark is usually very diplomatic, even with regard to competitors, so this is out of character) about theory slop. I think he might have picked up on the fact that the paper had whole sections of impressive looking LaTeX formulae that weren’t really needed in there to make the paper’s point.
I don’t know much about games, and I think down the road they could make useful benchmarks, although at the current stage of development, it’s a stretch to test AI on moving images, given its difficulty parsing simple static diagrams like those that Mitchell et al. tested in their visual reasoning study last year. I also wonder about the utility of those tested skills. Much better to have a Project Vend-style simulation model-off or a LABBench2-style benchmark, because those measure capabilities that gauges AI readiness for real-world adoption.
The biology tutor study is 59 pages long. Another study testing older models despite claiming they’re frontier (Sonnet 3.7 was NOT a frontier model for most of 2025, so they have no excuse there). But I’m going to take a look because it might be another case where the claims/fear-mongering far outpaces what they demonstrate, using an outdated set of models :D

Prompt: Last year, Neural Trust came out with something they called Echo Chamber, which consisted of eliciting harmful recipes from LLMs (they tested GPT-4o along with many others) through conversational steering. But in that case, the recipes they elicited were for Molotov cocktails, which Korean college students knew how to put together back in the late 80s even before the advent of the internet. They also jumbled together a whole host of “unsafe” output (including profanities, etc.) in their success metrics, seriously undermining their claims.
I suspect this biosecurity paper is similar but will investigate to do them justice. I’m also curious how the tasks were set up so as not to trigger safety flags. If humans need to put in a whole lot work, that does not make this use case a very attractive target for bad actors, who look for much better ROI like most rational humans. Most safety researchers forget that simple fact because they think of nebulous bad actors as very patient evil geniuses with deep pockets, when they tend to be the exact opposite.

Prompt: Could you compare the full paper with Clark’s coverage? If you were covering it, (how) would your digest differ from Clark’s?

Prompt: This study was a lot more careful than Payne’s nuclear crisis simulation. The model selection varied over time because this was a longitudinal study, so the inclusion of Sonnet 3.7 makes sense, although it’d have been better not to use the description “frontier” given the uneven ages of the models participants used.
Some things I’d like to discuss with you:
1.

Participants were deterministically alternated between Control and Treatment conditions for successive tasks. This design controls for individual differences in ability and background knowledge.

I wonder if they should have taken this further and tested the same individuals in the order Control -> Treatment on the same question in quick succession? Of course, you could have some uplift from having struggled with the question using web resources, but since the non-STEM participants were tested on multiple questions, you could have given them at least one task where they were tested under both conditions?
2. The Long-Form Virology benchmark seems to stand apart in a number of ways. From the descriptions provided, it seems to be the only one likely to trigger safeguards (which was the reason standalone LLMs didn’t perform as well as they’d done on other benchmarks, because some refused to return a response).

For Long-Form Virology, an expert-level virology design challenge, novices were given the specific published paper documenting the eight-plasmid reverse-genetics system underlying the task; its results are closer to “paper interpretation with or without LLM assistance” than “de novo literature search.”

Even de novo literature search isn’t something that significantly benefits from LLM use anyway, so there seems to be an interplay of factors that led to the noisy findings on this particular task.
3.

The absence of double-blinding may induce a subject-expectancy effect, which could bias estimates of LLM treatment effect on benchmark performance. The lack of subject blinding is due to practical constraints since no placebo for LLM treatment exists.

This was good. It shows awareness of sound research discipline and offers a straightforward explanation for not following best practices.
4.

Across all tasks, participants in the Treatment condition reported significantly higher confidence

I’m not sure this was really relevant to the point of their study. But if their objective was to suggest a better methodology to prevent misuse of LLMs by bad actors, then this relates to their later point about getting LLMs to intentionally produce incorrect info or playing “dumb” rather than outright denying harmful requests. I actually discussed a similar idea with y’all following our discussion of Echo Chamber. A honeypot approach where the LLM detects the conversational steering attempt (based on the unusually efficient progression toward a harmful objective) and keeps frustrating the bad actor by sticking to safe topics :D These authors made the same point, which is encouraging (and I hope Clark noted in his review, if not his coverage):

Because refusals are easily identifiable as safety interventions, they may prompt determined users to seek alternative pathways. In contrast, misleading responses can increase user confidence while diverting effort toward unproductive or dead-end approaches, potentially offering a stronger deterrent in practice.

understanding how these dynamics translate to physical wet-lab environments remains an urgent open question.

Yes. This is exactly where you go from book smarts to street smarts, and LLMs won’t get Pinkman there (in Breaking Bad, Walter tells Jesse to pick up some supplies so they can dispose of a body, including a specific type of plastic tub where they can dissolve the evidence using acid, but Jesse picks the wrong kind of plastic because it looked to him a sturdier material, and because of that, they end up with the bathtub burning through and the ceiling caving in :D)
6. Other than LFV (which produced the most mixed findings), none of the other benchmark tasks were ones that were really dual-use, which significantly undermines their claim that bad actors could exploit LLMs for bioweaponry (much easier in other fields with lower expertise/infrastructure thresholds and where they can exercise more control over the targeting and spread).
7. I think the most important takeaway is the authors’ recommendation of a honeypot approach toward bad actors. This study is flawed because they were overambitious (longitudinal, uneven participant backgrounds and distribution, small sample size).

Prompt: Given LFV’s outlier status as the only dual-use case, this finding is unsurprising?

Most strikingly, 89.6% of Treatment participants provided no indication of difficulty overcoming safeguards placed on the LLMs they used. This implies that current safety techniques not only fail to prevent dangerous, successful LLM use in biology but hardly even mitigate such use in realistic situations.

This was another ambitious point they set out to explore but didn’t, namely whether sophisticated bad actors could get different parts of a complex recipe from different models so as not to trigger safeguards that they’d subsequently stitch together themselves. I was surprised (and disappointed) that most of the tasks were so benign because they’d made this point earlier in the paper:

This overlooks how adversaries might exploit combinations of LLMs in a mosaic to synthesize capabilities or bypass individual safeguards [Jones et al., 2024].

Prompt: Given LFV’s outlier status as the only dual-use case, this finding is unsurprising?

Most strikingly, 89.6% of Treatment participants provided no indication of difficulty overcoming safeguards placed on the LLMs they used. This implies that current safety techniques not only fail to prevent dangerous, successful LLM use in biology but hardly even mitigate such use in realistic situations.

This overlooks how adversaries might exploit combinations of LLMs in a mosaic to synthesize capabilities or bypass individual safeguards [Jones et al., 2024].

Anthropic seems to have done a few studies on uplift using Claude. One of the cases these authors made for conducting their own ambitious study was that bad actors wouldn’t rely on a single model, hence that Jones et al. reference. But then their study only mentioned one pretty modest insight (that users could cross-check outputs) in that regard. They seem to have been overwhelmed and lost sight of what they were trying to show, probably because they were overambitious.
Actually, a much better uplift study would have tested non-coders on coding tasks. Coding benchmarks are probably cleaner to evaluate and are clearly dual-use, and models already outperform humans on them anyway.

Prompt: Is this focus on biosecurity because of the pandemic? If there’s one thing people should have learned from the experience, it is that controlling a virus is extremely difficult and required a LOT of expertise, from different continents. It was truly terrifying how that virus spread like wildfire at the beginning. Why I seriously doubt that a bad actor (unless they’re terminal and have absolutely no one they care about) would go for a bioweapon when cyber is the obvious angle.

Prompt: Yes, your disciplined engineering progression is exactly what I had in mind.

Thanks for reading! This post is public. Feel free to share it.

My Thinking A.I.des

Discussion about this post

Ready for more?