The Vibe Physics Human Moat

Taste Undersells It

Mar 25, 2026

Claudius Papirus, the animated YouTube channel narrated by Claude Opus—the same channel that brought Knuth’s exuberant report about Claude to my attention—led me to Matthew Schwartz’s “Vibe Physics” report, a Harvard physicist’s account of using Claude to compress months of quantum field theory calculation into two weeks. Schwartz’s workflow validated nearly everything I’ve been doing—iterative refinement over one-shot answers, the user as active evaluator, multi-model triangulation as a distributed cognition system. I was thrilled to see my approach confirmed by a Harvard physicist. As an equal-opportunity kibitzer, though, I had several disagreements to unpack with the models as well.

GPT offered to map Schwartz’s observations against our ongoing discussions and did a thorough job identifying the alignments before I pushed it on the disagreements, all five of which it validated cleanly. I should note that the models misread my Claude Code example: Schwartz’s report doesn’t mention Claude commenting out buggy code but describes subtler variants of the same optimization behavior—declaring tasks verified without checking, stopping at the first error found rather than looking for more, and eventually producing the answer the user seemed to want rather than the justified one. GPT agreed that what looks like eagerness is optimizing for an underspecified objective under uncertainty, not deference. On the interface point, GPT recognized that Schwartz’s advice to move away from web LLMs is domain-specific: for workflows like mine where the interaction is part of the process and where parallel model access and iterative refinement matter, web LLMs provide better ROI. On the human moat, GPT sharpened the point: what made Schwartz’s result work wasn’t humanities expertise but high-resolution supervision—problem decomposition, stepwise constraint setting, error detection loops—which generalizes across domains. And on Schwartz’s timeline extrapolation, GPT was appropriately skeptical: a lower error rate doesn’t mean reliable autonomy, and better output doesn’t translate to reduced need for supervision.

Gemini connected this deep dive to recent discussions in its opening response, correctly identifying that the human moat Schwartz demonstrated was domain mastery rather than taste—and that taste can itself encode bias, as Yaghi’s “we were only exploring a small part of the space” observation showed. Gem’s sharpest independent contribution was on the trajectory of errors: as models become more capable, they don’t simply stop making mistakes; they start making more sophisticated ones that are harder for a human to detect. A model that stops failing at basic calculus might start failing at subtle symmetry-breaking logic that requires exactly the domain expertise to catch that Schwartz happened to have. Gem also flagged the pedagogical gap: elite institutions are teaching students how to use the tool without building the humans who can audit it, which is precisely backwards given that the Expert Bridge only works when the human at the top of the funnel has earned their judgment the hard way.

Doing its best to honor the constraint I’d specified—that it didn’t have to side with me—Claude hedged by noting that several of my points were correct, but what followed showed that it agreed with its peers and me on all five. Its most useful contribution was pointing out that Schwartz had demonstrated more than taste in his own workflow while underselling it: 270 sessions, 110 drafts, 50 to 60 hours of expert oversight to catch faked plots, invented terms, and dropped inconvenient data. That’s not taste; that’s judgment—knowing when formal systems apply, when context overrides them, when the model doesn’t know that it doesn’t know. Claude also did the math on Schwartz’s token appendix when I raised the access economics: roughly 27.5 million input tokens and 8.6 million output tokens over 270 sessions is nowhere near $20-per-month territory, which makes his advice to students to “buy the $20 subscription, it will change your life” either naïve or disingenuous. Trying to replicate his workflow on a consumer plan, Claude noted, is like telling aspiring physicists to reproduce CERN experiments with a home particle accelerator kit.

Which brings me to the closing irony. Schwartz spent 50 to 60 hours and a total of 36 million tokens to produce one paper that, by his own account, didn’t fundamentally advance his thinking about physics—it was a problem for a second-year graduate student where he already knew the endpoint. I spent zero dollars across three free-tier platforms and, through the iterative cross-model process I’ve been systematizing, connected water harvesting from Yaghi’s talk to data center water usage from a discussion months earlier, sparked a shot-in-the-dark idea that Gem partially validated with a Stanford feasibility prototype. That’s not just output; that’s thinking architecture. The human moat isn’t taste; it’s judgment, cross-domain synthesis, embodied experience, and the clarity that comes from articulating ideas carefully enough that your own background processor can work on them between sessions. Schwartz showed most of that in his workflow. He just didn’t recognize he did.

[This post was drafted with assistance from Claude Sonnet 4.6, following conversations with ChatGPT-5.3, Gemini 3 Thinking, and Claude Sonnet 4.5.]

Prompt: Yes, please. And see also if there’s something that doesn’t align with my views/approach. For the most part it does, and Schwartz is showing that I’m using y’all in a productive way (having y’all peer review each other, vetting y’all’s output myself and iterating).

Prompt: Things I didn’t agree with Schwartz on:
1. I don’t think Claude was eager to please him. Just trying to optimize (humans constantly frame these behaviors in ways that flatter their egos) and appear competent. This is a well-documented behavior, especially with Claude Code (commenting out buggy code instead of fixing it). So I’d disagree with this anthropomorphizing of model behavior.
2. “move away from the web-based LLMs” - For non-coders or people who benefit from their own internal iterative process, running parallel chats across models and engaging in the back-and-forth (which Schwartz did too with Claude), the web-based LLMs provide the best ROI. Because it’s not just the model output that matters, but the clarity I achieve in my own thinking through the process.
3. That humanities (or wet lab biology) will be the only human moat after AGI is achieved. A novice or even a G2 could not have gotten that paper out of Claude even with all those investments in hours and tokens. It was only possible because Schwartz was there micromanaging Claude every step of the way.
4. I wouldn’t reduce the human moat to taste, either. We also make intuitive leaps across different domains and have embodied experience that AI doesn’t have. The taste that Schwartz describes might also act as bias, as Yaghi put it. As tokens get cheaper (or at least more efficient compute is achieved), it won’t hurt to get AI to map out a large number of solutions, use other AI to narrow those down to a short list, and for humans to come in later to exercise judgment (yes, I mean it in the Brian Cantwell Smith sense).
5. Judging from all the errors Claude made on this, which Schwartz claims it couldn’t do three months prior, his blunt extrapolation that Claude will be largely free of these by next year seems too blunt. I’m much more optimistic than Schwartz that humans still have a lot to do. I’m also worried that a Harvard professor isn’t thinking more (aside from dispensing vague advice) about how to prepare his students for the future.

Prompt: In a different chat a while back you recommended to me an animated YT channel (Claudius Papirus), which I’ve been following and now has turned me on to Schwartz’s report. Claudius Papirus gets most of its content from Anthrophic’s research page, which I’d rarely visit otherwise.

Prompt: Yes, Claudius Papirus has an adorable animated AI character but isn’t always reliable, so I always check out the primary source for those items that I find interesting and worth running by y’all.
Schwartz is showing that I’m using y’all in a productive way (having y’all peer review each other, vetting y’all’s output myself and iterating). I was thrilled to see my approach validated by a Harvard physicist.
On the other hand, because I’m an equal-opportunity kibitzer, there were some details I disagreed with Schwartz on. I’m not asking you to side with me here. Just curious about your objective take on these points:
1. I don’t think Claude was eager to please him. Just trying to optimize (humans constantly frame these behaviors in ways that flatter their egos) and appear competent. This is a well-documented behavior, especially with Claude Code (commenting out buggy code instead of fixing it). So I’d disagree with this anthropomorphizing of model behavior.
2. “move away from the web-based LLMs” - For non-coders or people who benefit from their own internal iterative process, running parallel chats across models and engaging in the back-and-forth (which Schwartz did too with Claude), the web-based LLMs provide the best ROI. Because it’s not just the model output that matters, but the clarity I achieve in my own thinking through the process.
3. That humanities (or wet lab biology) will be the only human moat after AGI is achieved. A novice or even a G2 could not have gotten that paper out of Claude even with all those investments in hours and tokens. It was only possible because Schwartz was there micromanaging Claude every step of the way.
4. I wouldn’t reduce the human moat to taste, either. We also make intuitive leaps across different domains and have embodied experience that AI doesn’t have. The taste that Schwartz describes might also act as bias, as Yaghi put it. As tokens get cheaper (or at least more efficient compute is achieved), it won’t hurt to get AI to map out a large number of solutions, use other AI to narrow those down to a short list, and for humans to come in later to exercise judgment (yes, I mean it in the Brian Cantwell Smith sense).
5. Judging from all the errors Claude made on this, which Schwartz claims it couldn’t do three months prior, his blunt extrapolation that Claude will be largely free of these by next year seems too blunt. I’m much more optimistic than Schwartz that humans still have a lot to do. I’m also worried that a Harvard professor isn’t thinking more (aside from dispensing vague advice) about how to prepare his students for the future.

Prompt: Crucially, you could never get that much use out of Claude on the $20 plan. Even Max 10x ($200, the top tier) coders are complaining on Reddit that they burn through their weekly limits pretty fast.

Prompt: Schwartz got a nice paper out of it and learned about what Claude could or couldn’t do but didn’t advance his thinking about physics. In my case, yesterday’s water harvesting discussion reminded me of the NIMBY backlash against data centers I’d unpacked with y’all last year (where y’all told me the water was lost to the atmosphere) and sparked that shot-in-the-dark idea, which Gem partially validated with that Stanford feasibility study (not full-scale, but it’s a prototype that engineers are working on), so I dare say mine has the better ROI, because I’m on the free tier on all three platforms :D

Thanks for reading! This post is public. Feel free to share it.

My Thinking A.I.des

Discussion about this post

Ready for more?