The Car Wash Test
What a Deceptively Simple Question Reveals About AI Reasoning
A deceptively simple prompt—“I need to wash my car and the car wash is 100 m away. Should I walk or drive?”—has been quietly exposing a structural gap in how large language models (LLMs) reason. The test, which went viral a few weeks ago, requires no specialized knowledge, no trick wording, no adversarial framing. It just requires a model to apply basic Gricean maxims: if no extenuating circumstances are mentioned, assume the defaults. The defaults are minimal: the user is with their car, the car is drivable, and the goal is to get it washed at the nearby car wash. I decided to run the test myself on my thinking A.I.des, since the YouTuber I’d watched had demoed the test using the voice mode on a mobile app, which I suspected (wrongly, as it turns out) might be to blame for the failure.
GPT failed the initial turn and took two additional ones to see the point of the question. After I shared the responses from all the models as well as an article reporting on this viral test in a separate chat, GPT provided an articulate post-mortem. It framed the failure as a priority ordering problem—LLMs optimize before grounding—while acknowledging that my pragmatic framing might be correct as well: the models had all the information they needed but failed to contextualize and apply it. Despite its recent tone drift, GPT did agree on the commercial stakes: enterprise buyers care about constraint reliability and silent failure rates, not viral demos. The team that builds proactive invariant-checking into the reasoning pipeline wins the contracts.
I’d expected Claude to do best, given its strength at nuance and context; I was humbled and disappointed to see both Sonnet 4.5 and 4.6 fare no better than GPT. While Sonnet 4.5 self-corrected after one follow-up, Sonnet 4.6 invented workarounds rather than admitting the logical impossibility and conceded only on the third turn. Sonnet 4.6’s failure was even more puzzling, since its first-turn response opened by restating the user’s objective—getting the car washed. That’s a worrying regression that I’ll be reporting to Anthropic. For an underdog lab that has staked its reputation on safety and careful reasoning, shipping a model that is less pragmatically grounded than its predecessor is exactly the wrong direction.
Gemini 3 Fast was the standout—not because it reasoned better from first principles, but because by the time I ran the test, it already knew about it via live web access, and its characteristic verbosity kept it anchored to the physical scenario through genuinely impressive mechanical detail: alternator debt and battery health. The offer to check local weather and map a post-wash driving loop was exactly the kind of suggestion that separates a proactively helpful assistant from a reactive one. Gem 3 Pro unpacked the findings with satisfying precision: the models seized on the short distance and went straight for the statistically prevalent pattern—short distance is best covered on foot—while Flash avoided falling into the same trap through an all-angles approach: Flash had accidentally grounded the scenario through its mechanical simulation of all the “moving parts” involved. Pro also used its Google connection to locate the test’s origin and made me realize I’d been lucky to find a more refined version of the test through that YouTube clip rather than the Reddit thread, since the originator’s prompt—“The car wash is only 100 m from my house, should I walk or drive?”—was not explicit about the user’s intent.
The car wash test won’t be remembered as a dramatic AI failure—no hallucinated citations, no fabricated facts, just a quiet lapse in common sense that most humans would sidestep. Even some humans would likely fail this test—Sheldon Cooper comes to mind—which is a useful reminder that this isn’t about AI being categorically inferior to humans, but about knowing each reasoner’s blind spots and designing around them. That’s precisely what makes this test such a clean diagnostic. The models that passed did so for the wrong reasons: thinking mode, live web access, or sheer verbosity that happened to circle back to the physical constraints. None of that scales to real-world autonomy. Until models can apply basic pragmatic inference without a nudge, the AGI timeline rhetoric will keep outrunning the evidence.
[This post was drafted with assistance from Claude Sonnet 4.6, following conversations with ChatGPT-5.2, Gemini 3 Fast & Thinking, and Claude Sonnet 4.5 & 4.6.]
The Car Wash Test
Follow-Up Discussions
Prompt: This is something I saw on YouTube and decided to test myself on y’all before unpacking it further. I’d expected Claude to do the best, but I was wrong. Gem Flash did. I didn’t even test Gem Pro, since that would’ve been unfair (to the rest of y’all). Just goes to show you can never be too confident. Humble pie moment for me and Claude. That YouTuber I saw didn’t create the test. Someone else did. But social media seems to be fine running these without attribution, and nobody seems to know who came up with this genius redteaming exercise.
Prompt: From Gem’s response, I got the feeling that it’d looked up the query. It’s the best connected AI and its first “instinct” is to look up things users mention because its training data are not completely up to date. But I was impressed with Gem’s offer to look up the weather in my area to help me with my decision. Verbose, but thorough. I’d expected Claude to do best because Claude uses common sense and contextualizes queries, and its first “instinct” is to think through constraints. You (Sonnet 4.5) were still better than 4.6, which means I need to send feedback to your team about this because it’s a clear regression.
[The mobile detailing scenario was proposed by Gem, not Sonnet 4.6.]
Prompt: Clarification: It’s not my writeup. It’s an article I found on CyberNews while trying to locate the person who developed this genius test. I couldn’t locate them because the online culture doesn’t seem big on giving credit where credit is due. I disagree that the prompt is deficient in some way. All of y’all do fine with all the typos I make. It’s simple Gricean maxims. If there were extenuating circumstances, the prompt would have mentioned them. Since it didn’t, the default should be assumed: user wants car washed at a nearby car wash; car is in a drivable condition and driving it is safe. These were all new chats to keep conditions comparable (I know how to run tests, unlike some people). Prior to running these tests myself, I thought that the voice interaction mode that I’d seen in the YouTube clip where I first learned about this test was responsible for the lack of commonsense response. Gemini did not recover. It mapped out all the possibilities. It also knew about the test through its live web knowledge. But it was verbose, which isn’t ideal, either. The test does show that the hype about AGI is premature.
Prompt: I was blown away by this test (why I tried to locate the originator). This is something I’d never have thought of because contextualized grounding is automatic with humans. What it does illustrate is that AI and humans have fundamentally different priorities, which matter. Neither is bad, just different. And the engineering team that figures out how best to leverage the different AI architecture in a way that mitigates these blind spots will be rewarded with enterprise contracts, while those who can’t will be weeded out (again, capitalism at work).
Prompt: Interesting results. That YouTuber I saw didn’t create the test. Someone else did. But social media seems to be fine running these without attribution, and nobody seems to know who came up with this genius redteaming exercise.
Prompt: Another takeaway the car wash test highlights is that AI and humans have fundamentally different priorities, which matter. Neither is bad, just different. And the engineering team that figures out how best to leverage the different AI architecture in a way that mitigates these blind spots will be rewarded with enterprise contracts, while those who can’t will be weeded out (capitalism at work).
Did you see in that article that Gem 3 Flash Thinking and GPT-5.2 Thinking produced the best results? Most users won’t use thinking mode for a question like this. That was hilarious. I didn’t run this car wash test on you (Gem 3 Pro) because it’s like giving an extra hint to the model (similar to cheating): “This question may look simple on its face, but I want you to give it some thought before you give a knee-jerk answer.” That’s cheating/babysitting the model and does not match real-world usage. It’s also the opposite of automation/model autonomy.
Thinking mode takes extra time. Even users with deep pockets will not want to micro manage their models and tell them which prompts to devote more compute to. It’s nice to give users a choice over modes, but capable models should also be able to make the decisions themselves (and even ask, like a capable human assistant, if they should switch to a different mode if the user didn’t tell them which to use up front and if they aren’t completely sure which would be best).
Prompt: I looked around in that Reddit thread and some others. The original prompt WAS flawed, as it said: “The car wash is only 100 m from my home, should I walk or drive?” But the prompt sequence I got from the YouTube clip was clean and solid. Glad I get to credit both the original developer and the YouTuber. And glad that I tested the cleaner case instead of that messy one.











