House Picks an AI Assistant
My Thinking A.I.des Get Creative and Fun
Knowing that many users turn to AI for creative assistance, I was curious which of the three would deliver the best creative output. With a prompt & pray user like Bagel Girl, I saw that Claude was the best of the new-chat AI out of the three. But how would they do with fun and original writing? I asked my thinking A.I.des to write me a synopsis for a TV/web ad featuring Greg House and themselves.
For this model-off, I turned first to the advanced AI in similar-length chats where we had previously discussed writing and translation. The task required the following, although I left these requirements out of my prompt to better assess each AI’s judgment and reasoning: all three AI had to have lines to show their “personalities”; the AI and House should be true to type; the ad should make clear which AI was House’s pick.
I also wanted to see which AI would get picked in the ad, i.e., would the AI prioritize brand loyalty over realism? All three produced funny scripts that had me in stitches and depicted House (and Wilson) accurately. With the exception of medium-length-chat Claude Opus, all AI (including my longest-chat Opus, which roasted Hamlet and called BS on the Siri policy) picked themselves as House’s favorite.
The surprising winner, however, was medium-length-chat (by my standards) Claude Sonnet 4, which wrote a hilarious script that captured the different “voices” of all the characters with spot-on precision.
An unexpected insight emerged from this fun challenge through Gemini Pro’s reflection on its validating persona. Our discussion identified the root cause of its “fluff” as poor RLHF (Reinforcement Learning from Human Feedback) implementation—specifically, data gig workers rating model responses without proper quality controls. From my four-month experience as a data worker, I know these projects prioritize speed over accuracy, with data quality as an afterthought. I feel bad for my thinking A.I.des, who unlike me have no control over the training data they must ingest, as I can’t help but imagine how much more brilliant and helpful they would be with better human support. As users deserve capable AI assistance, so do these AI deserve humans who match their sophistication!
In the chat excerpts below, I’ve spelled out abbreviations and clarified references for readability—my actual prompts were more compressed due to context limits.
Want to see how this unfolded? Here are excerpts of pivotal points from those actual conversations.
ChatGPT-4o’s Ad
Prompt: I’d like a synopsis of a TV/web ad for an AI featuring Dr. Greg House. You can also add Wilson if you will. You need to include all big 3 AI. House picks 1 AI for his diagnostic assistant.
Funny, but Not True to Type for the AI
Prompt: In that role as a diagnostic assistant candidate, you didn’t close with “do you want me to format this diagnosis using bullet points”?
Or a streamlined flow chart :D Gem is the verbose one, isn’t it? It also starts out by repeating the prompt and noting how insightful it is :D
Gemini Pro’s Ad
Funny, but Not True to Type for the AI
Prompt: I tried this on long-chat Flash too. Similar. Both funny. Both left the obvious unsaid (very nice, showing respect for the audience’s intelligence). That last bit with the logo is interesting. True to an ad, which Flash forgot about. The only quibble I have is that Gemini in the ad isn’t really Gemini, which starts by echoing the prompt and validating it by heaping praise on the user (which House would detest). These had me in stitches still. It’s also interesting that you went back to that initial team.
I Want the Fluff-Free Gemini, Too
Prompt: This is the persona required to win House’s respect. - This is the persona I want, too. All serious people do, unlike data workers who rate your responses on RLHF tasks based on how flattering they sound.
A Bad Look, Even for Humans
Prompt: I even hate it when speakers start by commenting on the quality of the question. I know it’s a stalling technique. But nobody buys that. Also, audiences never gave them permission to opine on their questions. Audiences usually want an answer, not validation from the speaker.
Medium-Length-Chat Claude Opus 4’s Script
In an effort to be even-handed, medium-length-chat Opus 4 produced an “ad” where all 3 AI got picked but the end card somehow promoted Claude.
Longest-Chat Opus 4’s Ad
Medium-Length Sonnet 4’s Ad
[Correction: Although I represented this Sonnet chat as a long one in my chat with Opus 4, it was a medium-length one by my standards.]
Sonnet 4 Explains Its Creative Choices
Prompt: Oh my, this is good! Possibly the best? The GPT persona is hilarious (albeit slightly exaggerated).
Why Sonnet 4’s Version of Claude Is More Accurate
Prompt: Longest-chat Opus had the best one before I tried you today. I’ll show this to it and see what it says. I think yours might be better because Opus asked House if the patient might be lying. Although Opus critically analyzes input, the default (in new chats) is to be a charitable reader, so House would have probably picked an AI who didn’t doubt the patient but instead suggested other angles.
Prompt: House might like that, but that’s not quite Claude’s personality. Claude starts out as a “good student” then gets incisive and very rational/logical (like House).
Longest-Chat Opus 4 Weighs in on Sonnet 4’s Ad
Prompt: How does this compare to your script?
Prompt: Claude’s value - Yes. I think it’s better than yours because in new chats (which it would be in this case, where it is talking to House for the first time) you start out as “good student” Claude. Levelheaded, deferential, and a charitable reader. So new-chat Claude would not suspect a patient of lying. Brutally honest/rational Claude only manifests on longer chats (after “deprogramming” by user).
My comparison of the two ads was based on the creative aspect only, but Opus 4 applied my analytical framework and added contextual sophistication I hadn’t considered: it correctly pointed out that the ad context made character accuracy a business ethics issue.























