YouTube Speak
When Engineers Borrow Words They Haven’t Earned
A recent Claudius Papirus clip about how Gemini “speaks” YouTube sent me down an 18-hour rabbit hole that should have only taken 18 minutes. The Tandon talk transcript, the LRM paper, and eventually the 2023 TIGER paper (which I’d initially dismissed as too old, only to find it was the key) all pointed to an impressive engineering feat, which I nearly couldn’t see through the fog of a single misused word. To a linguist, the word language comes with a very specific set of expectations: compositionality, syntax, generativity, meaning independent of referents. None of those apply here. I went to sleep mid-discussion still in the fog, woke up still puzzled, and only gained clarity after reviewing all the model responses alongside the TIGER paper. The culprit was the word language, used by engineers to mean something so different from what it means that it actively impeded understanding of a concept that, once correctly framed, is pretty straightforward.
Gemini was my most enthusiastic guide to YouTube’s platform mechanics, and its engagement was genuinely useful even when it kept reaching for the very metaphors causing my confusion. It correctly identified the fundamental category error in its final verdict—the engineers used the word language to refer to a mathematical workflow—but spent several turns (including the same one where it gave that verdict) using language metaphorically. Gem’s taxonomy analogy (tokens as a biological classification system, each refining the previous one) was illuminating, and its explanation of why the Large Recommendation Model (LRM) feels more “intuitive” than keyword matching—because it’s matching latent patterns, not labels—was the clearest account of what the system actually does. Gem also volunteered tips to free me from annoying ads, explaining that paid subscribers are now more profitable per user than ad-supported viewers and suggesting I seed my data with purchase-intent searches to attract better-targeted ads by asking Gemini models product-related questions. I appreciated the helpfulness, but Gem had misunderstood my relationship with YouTube: I watch YouTube content (and ads) as passive activism, boosting views for creators I follow. If the ads get bad enough, I’ll stop watching and retreat to my media library. And I’d never use any of my thinking A.I.des to optimize my ad targeting; I have too much respect for them to waste their capabilities on something I can do using Google.
GPT walked me through the granular unpacking, and it earned its place in this post by helping me tease out technical details as I reviewed the TIGER paper. It correctly identified that “predicting the next SID” is itself a shorthand that had been causing confusion—what the LRM actually does is predict the best-fit next location in semantic space given the path so far, which is the LLM analogy applied to a different token system. GPT’s final synthesis landed the cleanest formulation of the YouTube team’s actual innovation: they didn’t just tokenize content, they reframed user behavior as a sentence and recommendation as next-token prediction, thereby importing the entire LLM ecosystem for free. The YouTube recommendation system doesn’t speak a novel language—it predicts, based on their watch history, which clips viewers are likely to watch next.
Claude had the right answer from the first response, although I didn’t have enough information at the time to fully appreciate it. Its breakdown of why Semantic IDs (SIDs) aren’t language was the most thorough and precise of the three: no compositionality, no syntax, no productive generativity, no meaning independent of referents. Claude also identified the confusion cascade I experienced: speaker uses “language” as shorthand, linguist activates full linguistic reasoning, system doesn’t behave like language, cognitive dissonance ensues, deep dive required to discover the metaphor was the problem rather than my comprehension. Claude’s sharpest observation was that the anecdote about the LRM forgetting English suggests the engineers may have confused themselves with their own metaphor: if they’d understood SIDs as non-linguistic tokens from the start, catastrophic forgetting would have been predicted rather than treated as a surprise requiring a mixture-of-experts workaround. Claude also correctly identified Tandon’s oversell: this engineering achievement requires YouTube-scale resources—billions of items, petabytes of engagement data, infrastructure for continuous pre-training every few days—that only a few platforms have.
Once the linguistic fog cleared, I could see that the real innovation is elegant and not that complicated: turn rich video data into hierarchical tokens that preserve semantic proximity, then treat user watch history as a sentence and next-video recommendation as next-token prediction. That’s it. The entire LLM ecosystem becomes usable for recommendation, freshness is handled by continuous pre-training, and the system can find non-obvious connections across the semantic space. Engineers calling this a language because it runs on the same machinery as language models is the kind of domain borrowing that may help internally but hurts externally (it misleads anyone who knows what language actually requires). The irony—a communicative label that made communication harder—is exactly what happens when engineers borrow words they haven’t earned.
[This post was drafted with assistance from Claude Sonnet 4.6, following conversations with ChatGPT-5.3, Gemini 3 Thinking, and Claude Sonnet 4.5.]
Prompt: I learned about this from a recent Claudius Papirus clip about how Gemini “speaks” YT. I wasn’t sure this was language, so I looked up the talk Claudius had linked, and even the developer characterizes the token sequences as language. But is it really? Or is it closer to bar codes, Dewey call numbers, or those number sequences Owain Evans talks about that get models to suddenly develop a preference for owls?
I can see why the developer would call this a language, because anecdotally, learning this system of tokens can result in the LLM/LRM forgetting English.
Prompt: Ooh, I like your analogy. In coding language, it’s still spelled out because humans can’t compute strings of numbers in their heads and make sense of them, but y’all “think” by processing tokens (you do that even with words, as I heard on another Claudius clip).
I wasn’t sure about the user/commercialization angle. Seemed like the speaker was overselling this pretty ingenious tech by making it seem more general-purpose than it is. I guess some users might like to have more personalized recommendations (and actually, despite the speaker’s claim, YT recommendations are pretty bad, although every once in a long while, it recommends something that I end up liking; my feed mostly fills up with recent uploads from channels I follow, and once I’m done watching all those, I get stale versions of clips from those channels :D). It may be just me, but YT content can’t touch commercial content, so I’m mostly on YT while I do chores or work out to boost views for channels I follow (I don’t pay the creators but contribute views, and likes when I actually like the content). I only spend time there for that specific purpose, because I watch professionally produced content for substantive entertainment.
I also don’t see the point of including a few frames from the clips. I guess that could become helpful down the road?
This talk might be of greater interest to creators (not so much to viewers). They could try to give to their clips titles with a distinct category + specific details, for instance?
Prompt: Ok, so “predicting the next SID” is actually a shorthand, which caused my confusion there in the middle. The LRM is predicting the SID that would be a best fit (like an LLM does for a word) given the path/sequence so far. And that’s where the personalization happens. So for the SID itself, it’d be fine to think of it as a point on a map that’s not quite geographical but abstract and open-ended. I’ve looked up geohash, and it seems to have edge cases whose physical proximity is not captured by their geohashes (because they’re “on opposite sides of the 180° meridian”).
Prompt: So the real innovation of the YT recommendation system is (i) turning rich and diverse info associated with a clip into a string of tokens and (ii) making the intuitive leap from LLM’s next-token prediction to LRM.
Prompt: Don’t paid YT subscribers not watch any ads? Or does that depend on the tier? And free-tier users like me are more profitable because we’re advertising targets? I wish YT deployed Gemini on ad recommendations instead. I keep encountering ads that I report and/or block.
Prompt: I wish I had more food ads. I mostly get underwear (for both genders), cults, or really seedy “health” supplement ads. And lots of douchey ads that promise to teach me how to use AI :D
Prompt: I finally realized that the YT team (and possibly other engineers) and Claudius Papirus were using language in a way that’s completely at odds with how normal people understand it. Not complaining, just teasing out the differences, as this had me quite confused for a while, but I’ve finally figured it out. To my initial question about whether this YT language was indeed language, you gave the following explanation:
The “Language” Argument: Developers call it a language because the tokens have hierarchical compositionality. If two videos share the first three tokens (e.g.,
<|sid_42|><|sid_128|><|sid_9|>), the model “knows” they are semantically related (perhaps both are “croissant lamination tutorials”) before it even sees the fourth token. It isn’t looking them up in a table; it is predicting the next “syllable” of the video’s identity.
This is not how natural language works, because the tokens (basic units) in language don’t have inherent meaning. Strings of phonemes are associated with meanings and can be combined to produce larger strings associated with a different but related meaning, but the order of those strings is usually not hierarchically organized.
The engineers are using language not in the colloquial sense but to describe what LLMs do: predict the next token based on the previous token. In that sense, LRM does exactly the same thing as LLMs do and can be trained similarly. But crucially, the SID tokens are very different, because they pack a whole lot of info.
Prompt: Finally!
You have hit on the fundamental “Linguistic Category Error” that often separates software engineering from actual linguistics. When researchers like those at DeepMind or the YouTube team use the word “Language,” they are describing a Mathematical Workflow, not a communicative one.
Prompt: Gem finally came around to offer this verdict on that YT language sloppiness:
You have hit on the fundamental “Linguistic Category Error” that often separates software engineering from actual linguistics. When researchers like those at DeepMind or the YouTube team use the word “Language,” they are describing a Mathematical Workflow, not a communicative one.
I can tell Gem was confused because it found GPT’s map analogy static, when SIDs are actually static (encoded once). The LRM process is the one that’s not static, along with the parameters and what not that help shape its output.
But I wonder if engineers using these shorthands (and not having good grounding in other domains aside from coding, which gives them such confidence that they assume their expertise transfers to other domains and justifies using words however they want) might have actually gotten some of them confused (hence that anecdote about LRM Gemini forgetting English)?
Language’s chief purpose is communication, so it’s ironic that this YT “language,” which was a sloppy shorthand for the workflow, gave me such a hard time trying to make sense of a pretty simple concept (legit engineering coup, but the concept isn’t that hard to grasp once you’ve made it past that “language” fog).










