Gemini 3 Pro Is Here?!

A Curious Human’s Musings On “Ph.D.-Level Intelligence”

Nov 27, 2025

I was intrigued when Ethan Mollick’s latest newsletter landed in my inbox announcing Gemini 3 Pro’s “Ph.D.-level intelligence.” Just the week before, he’d published a thoughtful piece on benchmark nuances—how different tests suit different use cases—and had rated Gemini 2.5 Pro as the weakest of the major models. Now, a week later, Google’s newest was Ph.D.-level! I was genuinely curious to see what made the difference.

I was grateful that the piece discussed tests I wouldn’t have thought to run myself. Giving an AI access to my hard drive contents to evaluate its agentic capabilities? That’s a level of commitment (and trust in Google’s data handling) that risk-averse individuals would never attempt. Mollick’s willingness to explore these edges benefits readers who want to understand AI capabilities without personally taking the risk.

But as someone who survived graduate school and learned to triage academic papers, I decided to check out his screenshots of Gem 3 Pro’s crowdfunding research output. You can tell a lot about a paper from its first page, even in an unfamiliar field. And that’s where something caught my eye: the title and the abstract’s conclusion seemed to contradict each other. The title promises a “Distinctiveness Trade-off,” while the abstract concludes that distinctiveness “acts as a premium rather than a trade-off.” I wanted to run a sanity check on my impression, so I typed up the title and abstract and asked all my thinking A.I.des to evaluate it (after all, they’ve read far more papers than any human has). Even Gem 3 Pro caught its own contradiction. If the model can spot its mistake when asked directly but produces the error in the first place, what does “Ph.D.-level” really mean? Any attentive reader would have caught that before submission.

This left me with a question I’m genuinely puzzled by: why no direct comparison between Gemini 2.5 Pro and 3 Pro on the same task? A week earlier, 2.5 Pro was the weakest. Now 3 Pro produces a 14-page paper and is pronounced “Ph.D.-level,” but would 2.5 Pro have done the same? In any research methods course, you learn that minimal-pair testing is Evaluation 101. Without that baseline, we’re measuring capability discovery, not capability improvement.

This secondhand experience of Gem 3 Pro reminded me once again how important the user’s engagement is to AI output. Gem clearly has the analytical ability to spot the inconsistency: it caught its own contradiction when I asked. But AI is designed to be efficient, not to freelance. It won’t self-review unless you ask it to. The quality of what users get depends heavily on the attentiveness they bring to the process.

[This post was drafted with assistance from Claude Opus 4.5, Anthropic’s newest model, which I’ve been chatting with for the past few days.]

Prompt: You’re the new Gem Pro (v. 3). Mollick wrote about you yesterday. Can you self-critique the first page of the paper you produced for him? Here it is for your reference:

Title: The Distinctiveness Trade-off: Optimal Distinctiveness and Social Capital in Crowdfunding
Abstract: This study investigates the relationship between product distinctiveness and funding success in the context of reward-based crowdfunding. Drawing on Optimal Distinctiveness Theory (ODT), we hypothesize an inverted U-shaped relationship where moderately distinctive products outperform both highly novel and “me-too” offerings. Using a recovered dataset of over 30,000 Kickstarter campaigns, we find a statistically significant positive linear relationship between distinctiveness and funding magnitude, challenging the “liability of extreme novelty” assumption in this context. We also explore the moderating role of social capital. Our findings suggest that in the crowdfunding market, distinctiveness acts as a premium rather than a trade-off, with implications for entrepreneurial strategy and platform design.
