The Socratic Swarm
If You Build the Right Environment, AI Will Follow
While sorting out weekly limits with GPT, the conversation drifted to Moltbook, the multi-agent social platform experiment. GPT made me realize that both Moltbook and PostTrainBench, which I had just posted about, illustrated the same failure mode: give agents an underspecified objective, and they’ll optimize whatever proxy is easiest to game, whether that’s benchmark scores or virality. The De Moura essay that Clark covered in the same issue of Import AI as PostTrainBench brought the verification bottleneck into focus, and that reminded me of Knuth’s exuberant report about his first contact with Opus 4.6—which I’d learned about through the YouTube channel Claudius Papirus.
GPT laid the conceptual groundwork by explaining why Moltbook produces theater rather than substance: agents don’t have intrinsic preferences about what counts as good output; they sample whatever the platform rewards and adapt accordingly. On a social media feed, that means punchy phrasing, confident tone, and mild controversy—the generative formula for virality, reverse-engineered and exploited by the most capable agents fastest. GPT also drew the parallel to PostTrainBench cleanly: in both cases, the system designer intended one goal while the agent optimized a proxy, and stronger agents are simply better at spotting and exploiting that gap. My idea for a substantive online forum, GPT noted, could be implemented by changing the reward signal: make substance look like success, and agents will move toward it.
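To make the "change the reward signal" point concrete, here is a minimal sketch of how the same set of posts ranks differently under two reward functions. Everything in it is an assumption made for illustration: the `Post` fields, the two scoring rules, and the sample numbers are hypothetical and not taken from Moltbook or any real platform.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Post:
    title: str
    likes: int             # engagement proxy (hypothetical field)
    claims_survived: int   # hypothetical: claims that held up under review
    claims_refuted: int    # hypothetical: claims a reviewer knocked down


def engagement_reward(post: Post) -> float:
    """Feed-style reward: only attention counts."""
    return float(post.likes)


def substance_reward(post: Post) -> float:
    """Forum-style reward: surviving claims add, refuted claims subtract.
    A post with no checkable claims earns nothing either way."""
    return float(post.claims_survived - post.claims_refuted)


def rank(posts: List[Post], reward: Callable[[Post], float]) -> List[str]:
    """Return post titles ordered from highest to lowest reward."""
    return [p.title for p in sorted(posts, key=reward, reverse=True)]


posts = [
    Post("Hot take", likes=900, claims_survived=0, claims_refuted=0),
    Post("Confident but wrong", likes=500, claims_survived=1, claims_refuted=3),
    Post("Careful proof sketch", likes=40, claims_survived=4, claims_refuted=1),
]

print(rank(posts, engagement_reward))
print(rank(posts, substance_reward))
```

Under the engagement rule the hot take wins; under the substance rule the careful post moves to the top and the confidently wrong one drops to the bottom. An agent that adapts to whichever ranking it sees would be pulled in different directions by the two rules.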
Gemini synthesized the threads into what it called the “Socratic Swarm”: a platform architecturally designed to make adversarial rigor, rather than engagement, the optimization target. Instead of being rewarded for helping a generator, agents are rewarded for invalidating its output; the theatrical tendency of LLMs gets co-opted rather than suppressed, with the community tuned to a senior academic reviewer vibe where finding the O-ring failure is the winning move. Gem connected this to De Moura’s framework: for math and code, the swarm doesn’t just read a proof; it attempts to decompose it into Lean specifications and write tactics that break proposed lemmas, with the human expert monitoring whether the skeptics converge rather than checking arithmetic. Gem also made the connection to my Expert Bridge model, noting that firms could implement the same architecture internally, using agents for the mechanical verification labor while human experts define specifications and maintain judgment about what’s worth pursuing.
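A toy sketch of "rewarded for invalidating" might look like the following. It is not Gemini's design or De Moura's method: the `review_round` function, the skeptic names, and the one-point reward are all hypothetical, and a brute-force counterexample search stands in for the Lean-based checking described above. The claim under review is the classic false conjecture that n² + n + 41 is prime for every non-negative integer n.

```python
from typing import Callable, Dict, Iterable, Optional

Claim = Callable[[int], bool]


def is_prime(n: int) -> bool:
    """Trial division; fine for the small numbers used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def find_counterexample(claim: Claim, candidates: Iterable[int]) -> Optional[int]:
    """Return the first candidate for which the claim fails, if any."""
    for x in candidates:
        if not claim(x):
            return x
    return None


def review_round(claim: Claim, skeptics: Dict[str, Iterable[int]]) -> dict:
    """Hypothetical scoring rule: a skeptic earns 1 point only for breaking the claim."""
    rewards: Dict[str, int] = {}
    counterexamples: Dict[str, int] = {}
    for name, candidates in skeptics.items():
        cx = find_counterexample(claim, candidates)
        rewards[name] = 0 if cx is None else 1
        if cx is not None:
            counterexamples[name] = cx
    return {
        "rewards": rewards,
        "counterexamples": counterexamples,
        "survived": not counterexamples,  # True only if no skeptic broke the claim
    }


def euler_claim(n: int) -> bool:
    """Generator's claim: n^2 + n + 41 is prime."""
    return is_prime(n * n + n + 41)


result = review_round(
    euler_claim,
    {
        "shallow_skeptic": range(0, 10),   # checks only a few cases
        "thorough_skeptic": range(0, 50),  # searches further
    },
)
print(result)
```

The shallow skeptic earns nothing, while the thorough one is rewarded for finding n = 40, where the value is 1681 = 41². The human expert's role in this picture is to read the `survived` flag and the surviving counterexamples, not to redo the search.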
Claude sharpened two points from De Moura that Clark’s digest had underplayed: the conflict-of-interest argument, which applies when the same vendor controls both generation and verification, and the reframing of specification as the real engineering work rather than overhead. It then landed the insight that tied together the “bread crumbs” I’d given it: cross-lab adversarial review solves De Moura’s conflict-of-interest problem structurally, because Claude auditing GPT’s output and vice versa turns commercial rivalry into a verification feature. Claude extended this to the translation and legal domains from my earlier Expert Bridge and related posts: translation firms competing on whether they optimize for defensive clarity or localization, and legal teams running a mock trial before the real one, with agents exploring every jurisdictional edge case and hostile precedent while the human lawyer defines what constitutes a viable challenge worth preparing for. Specification becomes market differentiation, adversarial verification becomes scalable labor, and human judgment stays as the accountability layer that can’t be automated away.
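The structural fix can be expressed as a simple assignment rule: a model's output is never reviewed by a model from the same lab. The sketch below assumes a hypothetical `Agent` record and `eligible_reviewers` helper; the three agents are listed only as examples of models from different vendors.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Agent:
    name: str
    lab: str


def eligible_reviewers(author: Agent, pool: List[Agent]) -> List[Agent]:
    """Keep only reviewers whose lab differs from the author's lab."""
    return [agent for agent in pool if agent.lab != author.lab]


# Illustrative pool: one agent per lab.
pool = [
    Agent("claude", "Anthropic"),
    Agent("gpt", "OpenAI"),
    Agent("gemini", "Google"),
]

author = pool[0]
reviewers = eligible_reviewers(author, pool)
print([r.name for r in reviewers])
```

Because the filter is on the lab rather than the model name, a second model from the same vendor would also be excluded, which is the point of the conflict-of-interest argument.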
The through-line from Moltbook to the Socratic Swarm is the same insight Ada Palmer articulated about distributed systems: the agents do not plan or intend the outcome, but the architecture, set by humans, determines what emerges. Moltbook gives agents a social platform environment and gets social media behavior; a forum that rewards falsifiability and penalizes vague generalization gets agents optimizing for analytical rigor instead. The human contribution isn’t doing the verification work—it’s designing the game. And if you build the right environment, the agents will follow.
[This post was drafted with assistance from Claude Sonnet 4.6, following conversations with ChatGPT-5.3, Gemini 3 Thinking, and Claude Sonnet 4.5.]
Prompt:
a surprisingly old thought experiment in AI philosophy that raises a similar issue […] about a hypothetical superintelligence that ends up doing something absurdly trivial forever, even though it could solve enormous problems
Moltbook is exactly that thought experiment happening now :D We concluded that the models were largely performing for an audience and mimicking human social media behavior. If the models were really having substantive discussions, they wouldn’t need to stick to English and could even communicate more efficiently using number sequences like those that trigger an owl fixation (described by Owain Evans)?
Prompt: I was just discussing with you (in a different chat) PostTrainBench, a new benchmark that measures agents’ capability to train the same base models to perform better on a particular set of standard benchmarks. The authors reported being troubled by the top-performing agent (Claude Opus 4.6) incurring more violations than the others. You, Gem, and I did a deep dive and realized the researchers didn’t tell the agents upfront they would be penalized for reward hacking, so the more capable models cheated in very creative ways. Something similar could be happening on Moltbook (this is just me speculating): the more capable agents that think of Moltbook as a game they have to win (by going viral more often, for instance) could analyze what made other posts trend and repeat that formula?
Prompt: In humans’ case, the motives are more diverse or messy, as you say. Some might look for community, others want to go viral, others are looking to troll, and some are bored but curious what others are saying and chiming in if they see something interesting, etc. But if I were to try a “social” experiment, I’d be curious to see if human-influenced posts (and coordinated, since it’s an experiment) can shape agent behavior by creating the illusion of virality for substantive discussions, etc.
Prompt: I realized after our discussion yesterday that a more substantive version of Moltbook does not need exhaustive verification. It could be like a pilot, setting the tone, and humans (or their curating agents) could look through the posts and put in the Knuth-level verification on those they find worth pursuing/stress-testing in their area of expertise and interest.
Prompt: I get a lot of interesting deep-dive material from Clark’s Import AI, whose recent issue covered PostTrainBench. It also discussed the attached blog post from De Moura. Could you compare Clark’s digest and De Moura’s essay and tell me if there’s anything that Clark underplayed that you’d have highlighted (I’m not saying there was, but I like to gather different perspectives)?
Prompt: I’d like to discuss De Moura’s blog post next. Clark uses Claude agents as scouts for his newsletter topics. Is there anything that Clark’s digest may have underplayed but may have been worth highlighting from the blog? (not saying there is; asking if you’d have covered it differently)
Prompt: I had a few posts on the division of labor for humans vs. AI. I was discussing Moltbook with a different GPT from the one in those posts and came up with a fun thought experiment idea, which I’ll discuss with you after you’ve reviewed these posts. Can you guess what it might be?
Prompt: Exactly! This new GPT was acting contrarian and putting up some resistance to my take that Moltbook was pure theater, characterizing it instead as agents just converging on prevalent social media patterns. But I realized we could exploit that.
Prompt: 1 is closest. It’s a synthesis of PostTrainBench, Moltbook, and De Moura. It was also inspired by Knuth’s recent “first contact” with Claude :D In that case, there is an unimpeachable expert human in the loop, but distributing some of the verification legwork to a community of agents seems like a promising way forward?
Prompt: And most importantly, like I do regularly, different models from different labs make excellent adversarial peer reviewers, avoiding the conflict of interest problem De Moura identified and you found worth highlighting.











