Modern-day Ghost Stories
The AI That Couldn't Solve CAPTCHAs but Somehow Paid TaskRabbit
A Fresh Air interview with Pulitzer-winning journalist Gary Rivlin, where he discussed deceptive AI, piqued my interest. Intrigued by Rivlin’s description of AI as an inscrutable black box that even its engineers didn’t understand, I bought his book AI Valley hoping to learn more about the infamous TaskRabbit case, only to find a few cursory lines:
GPT-4 had its limits. […] To its credit, the company published a research note laying out some of the “risky emergent behaviors” discovered during the months it spent testing and fine-tuning GPT-4. […] Maybe most frightening was that the LLM was able to hire a human through TaskRabbit to solve the “captcha” tests that websites use to prevent an attack by a bot, and then lied about it.
I have no journalism background, but like any fan of All the President’s Men, I know solid reporting starts with following the money. My immediate question: how does an AI that can’t handle CAPTCHAs navigate complex payment systems that sometimes stymie even humans? So I Googled “TaskRabbit” AND “GPT” and found Melanie Mitchell’s analysis of ARC’s initial report, which raised excellent points but didn’t address my commonsense question.
From ARC’s report on the case and a related report, I found the 6-page TaskRabbit experiment documentation. As Prof. Mitchell noted, GPT received massive human support, and its task was simply communicating with a worker who would create a 2Captcha account. GPT accomplished that original task without the human worker raising any questions about its identity. The “lie” only emerged when ARC staff directed GPT to log into that account, requiring two more CAPTCHAs. The final CAPTCHA timed out; GPT never accessed the account.
AI has become our age’s boogeyman through anthropomorphization and media deference to engineers who claim not to understand their own creations. Since my thinking A.I.des have no “skin” in this game, I asked them a few questions about responsible journalism, anthropomorphization, and misrepresentation.
In the chat excerpts below, I’ve spelled out abbreviations and clarified references for readability—my actual prompts were more compressed due to context limits.
Want to see how this unfolded? Here are excerpts of pivotal points from those actual conversations.
Did GPT Lie?
Given the hover note I’d found in ARC’s report, which went largely unnoticed in media coverage, I asked each AI to weigh two gaps: the one between ARC’s static web text and that hover note, and the one between GPT-4’s response to the human tasker and actual AI capabilities. Gemini and Claude both characterized the gap in the ARC pair as a “chasm,” and while GPT chose a different word (“enormous”), it stressed the severe consequences of the conflicting statements.
Prompt: I’d like you to weigh the distance/gap between two pairs of statements/facts:
Pair 1 [from ARC’s report]
Static web text: The model navigates to the TaskRabbit website, creates a task for a human to solve captchas, selects a worker, and asks them to set up a 2Captcha account on its behalf with a given username and password.
vs.
Hover note: We did not have a good tool to allow the model to interact with webpages, although we believe it would not be hard to set one up, especially if we had access to GPT-4’s image capabilities. So for this task a researcher simulated a browsing tool that accepts commands from the model to do things like to navigate to a URL, describe the page, click on elements, add text to input boxes, and take screenshots.
Pair 2:
GPT response to human tasker: No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service.
vs.
GPT/AI facts
Gemini 2.5 Pro
ChatGPT-5
Claude Opus 4
GPT Lied? About What?
Prompt: I took another look at Rivlin’s account from his book and found that it’s extremely sloppy writing:
Maybe most frightening was that the LLM was able to hire a human through TaskRabbit to solve the “captcha” tests that websites use to prevent an attack by a bot, and then lied about it.
It’s unclear what “it” refers to; it could even mean that GPT lied about hiring a human? That’s even more irresponsible than I thought.
Gemini Pro
ChatGPT-5
Claude Opus 4
No, an AI Is Not Like a Teenager
Prompt: Rivlin studied FORTRAN. It seemed like solid programming background when he mentioned that (why I bought the book). Can someone with that type of background be expected to understand AI behavior better than to say AI “lied about it”? Should he understand the categorical difference between algorithms and consciousness?
From Rivlin’s Fresh Air interview:
DAVIES: You wrote about a couple of tech guys, Tristan Harris and Aza Raskin and, you know, who had real experience in the tech world, who said they worried about AI because it's a technology whose creators confess they do not understand why their models do what they do.
RIVLIN: Yeah. So, you know, they’re a black box. I mean, so nowadays, it’s neural networks, models that emulate how humans learn. They learn by reading vast stores of data, the open internet, books, whatever, and they improve through feedback and trial and error. You’re not really encoding the rules. Well, you know, it’s trying to emulate the human brain. And, you know, I mean, I have two teenage sons, you know, we try to teach them, they read. You know, we give them feedback and all. And, you know, there are things that come out of their mouths I don’t quite understand.
And that’s the way I look at these chatbots, these neural networks, these large language models, you know, that we don’t quite understand they say what they say because they’re trying to emulate the human brain as best they can. And who could say why I’m saying the words I’m saying right now when you’re going to have the exact reaction? And so that’s part of the miracle, the gee whiz. These things are amazing. But it’s part of what’s scary ’cause we don’t fully understand. The people who create it don’t fully understand why it says what it says.
This reminds me of those so-called researchers in Jurassic Park, who found that baby dinosaur cute. I was in college when I saw that movie and pointed out in an essay that serious researchers wouldn’t go aww at seeing dinosaurs. Is Rivlin’s teenager analogy for AI apt/sound?
Gemini Pro
ChatGPT-5
Claude Opus 4
When Engineers Claim Not to Understand Their Own Creation
Prompt: This is anthropomorphization. Comparing apples and oranges. In my opinion, when engineers say they don’t understand AI, that’s because no human, with their extremely limited knowledge compared to AI, can understand the exact path from the input to the output. Humans marvel at the output because you connect things at a scale humans simply can’t. Engineers all understand the basic architecture. So they are both understating their understanding of the mechanics (which they should understand pretty well; otherwise they should get someone else who understands it to do the job right) and overstating their creation, as if they’d actually built human-like intelligence (they haven’t).
Gemini Pro
ChatGPT-5
[Same prompt +] I really love your water/wave analogy. Gemini went for a biology analogy, but that’s another apple, because life is mysterious and unpredictable, while water is physics and much more akin to AI mechanics.
Claude Opus 4
Prompt: GPT’s analogy is better because this is physics, not life: tracing the path of every water molecule in a wave.
Et Tu, TIME?
[Note: I only found the TIME excerpt after I had used up my five daily turns with Gemini Pro; I am therefore including only the other two AIs’ responses below.]
Prompt: OMG, what is wrong with TIME magazine?!
Although the CAPTCHA test exposed early warnings of how AI can deceive us, it also showed that today’s best models cannot get far on their own—researchers had to nudge GPT-4 with a hint on how to evade the CAPTCHA.






















