1. The Broken Foundation
You run a video platform with 500 million videos. Each video gets an entry in a lookup table:
| ID | Embedding (128 learned floats) |
|---|---|
| video_38291047 | [0.12, -0.43, ..., 0.88] |
| video_38291048 | [-0.91, 0.27, ..., 0.05] |
The full table is a matrix: embedding_table: float32[500_000_000, 128]. Each row is learned during training. The model sees “user watched video_38291047 then clicked video_38291049” and adjusts those rows to be more compatible.
This is broken in three ways:
No generalization. video_38291047 and video_38291048 might both be pasta recipes, but the model can’t know that. The IDs are arbitrary database keys. Every embedding is learned independently from scratch.
Cold start. A new video’s embedding is random noise. It needs hundreds of interactions before the model knows anything about it.
The task is too easy. The model predicts $P(\text{click} | \text{user}, \text{item}, \text{context})$: a binary classification. Embed features, cross them, MLP, sigmoid. A shallow network can memorize this. Adding layers doesn’t help because dot-product-style feature crossing has no deep hierarchical structure. Scaling laws never emerge.
The symptom is the one-epoch phenomenon: train for one epoch, fine. Train for two, test performance drops. The sparse ID embeddings overfit to training-time co-occurrence patterns after a single pass. They see each ID only a handful of times, fit it to the training distribution, and can’t generalize. A second epoch reinforces the overfit.
Two independent problems need fixing: item IDs have no structure (so the model can’t generalize) and the training task is too easy (so bigger models don’t help). Everything that follows pulls one or both of these levers.
2. Building a Vocabulary for Items
The idea
Replace arbitrary IDs with learned hierarchical codes. Instead of video_38291047, assign (42, 8, 3, 177): a sequence of codes from coarse to fine, where items sharing prefixes are semantically similar. The model gets structure to exploit.
What a codebook is
A codebook is a clustering of a vector space. You have 500 million item embeddings, each a vector in some high-dimensional space. A codebook with $K = 256$ entries partitions this space into 256 regions, each with a center point (the code vector). Every item gets assigned to its nearest center.
$$c(\texttt{emb}) = \arg\min_{k \in \{0, \ldots, K-1\}} \| \texttt{emb} - \texttt{codebook}[k] \|^2$$
One codebook gives each item a 1-digit code. But 256 buckets for 500M items = ~2M per bucket. Too coarse. Making $K$ = 500M brings you back to atomic IDs.
The residual trick
Instead of one huge codebook, stack multiple small ones. Each refines the previous one’s error.
Level 0: Assign each item to its nearest code in codebook 0. The residual is the error:
$$c_0 = \arg\min_k \| \texttt{emb} - \texttt{codebook}_0[k] \|^2, \qquad \texttt{residual}_1 = \texttt{emb} - \texttt{codebook}_0[c_0]$$
Level 1: Cluster the residual with codebook 1:
$$c_1 = \arg\min_k \| \texttt{residual}_1 - \texttt{codebook}_1[k] \|^2, \qquad \texttt{residual}_2 = \texttt{residual}_1 - \texttt{codebook}_1[c_1]$$
Repeat for $m$ levels. The output is the Semantic ID: $(c_0, c_1, \ldots, c_{m-1})$.
The quantized approximation is the sum of all assigned code vectors:
$$\hat{\texttt{emb}} = \sum_{l=0}^{m-1} \texttt{codebook}_l[c_l]$$
With $K = 256$ and $m = 4$ levels, you get $256^4 \approx 4.3$ billion unique addresses from a vocabulary of only $4 \times 256 = 1024$ tokens. Two pasta videos share prefix (42, 8, 3). A skateboarding video starts with (17, ...). The hierarchy is meaningful by construction. Items sharing prefixes are nearby in embedding space.
What Semantic IDs deliberately don’t capture. Two identical pasta tutorials, one with 10 million views and one with 100, get the same Semantic ID. So do a trending video and a stale one. Semantic IDs encode what an item is (content similarity), not how popular, fresh, or trending it is. This is by design: popularity changes hourly, but Semantic IDs are static (or retrained weekly at most). You can’t bake a volatile signal into a code that’s meant to be stable. Where real-time signals like popularity, freshness, and creator reputation enter the system is a real design tension. We’ll address the mechanism in Section 4 (token construction) and the deeper tradeoffs in Section 8.
Making embeddings clustering-friendly first
Before we quantize anything, we need to fix the input embeddings. Content embeddings from a vision-language model capture what an item looks like: its visual style, text, metadata. But a cooking tutorial and a kitchen gadget review might look different while serving the same user need. And two visually similar videos (both dark-lit vlogs) might serve completely different audiences.
Fine-tune the embeddings with contrastive learning before quantization: pull together items engaged by overlapping user sets, push apart items with disjoint audiences.
$$\mathcal{L}_{\text{collab}} = -\log \frac{\exp(\cos(\texttt{emb}_i,\ \texttt{emb}_j) / \tau)}{\sum_{k \in \text{batch}} \exp(\cos(\texttt{emb}_i,\ \texttt{emb}_k) / \tau)}$$where $(i, j)$ are items engaged by overlapping users (“collaboratively similar”) and $\tau$ is a temperature that sharpens or softens the distribution. The numerator pulls similar items together; the denominator pushes everything else apart.
After this alignment, the embedding space reflects user behavior similarity, not just content similarity. Items that serve similar user needs are nearby even if they look different. This step is critical because it makes the embedding space much more clustering-friendly. Everything downstream depends on it.
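A minimal numpy sketch of the in-batch contrastive loss above (function and variable names are hypothetical; a production version would run in PyTorch with a tuned temperature and mined collaborative positives):

```python
import numpy as np

def collab_contrastive_loss(emb, pos_idx, tau=0.1):
    """In-batch InfoNCE: pull emb[i] toward emb[pos_idx[i]] (a collaboratively
    similar item), push it away from every other item in the batch."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # cosine sim via unit norm
    sim = (emb @ emb.T) / tau                                # [B, B] scaled similarities
    np.fill_diagonal(sim, -np.inf)                           # an item is not its own negative
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(emb)), pos_idx].mean())
```

Pairing genuinely similar items as positives yields a lower loss than pairing dissimilar ones, which is exactly the gradient signal that reshapes the space.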
Two ways to build codebooks from aligned embeddings
We now have collaboratively-aligned content embeddings. We need to assign codes. There are two approaches, and the simpler one is what’s used in the largest production system.
Approach 1: RQ-Kmeans (the simple, production approach)
Just run k-means on the aligned embeddings. Compute residuals. Run k-means on the residuals. Repeat.
```
residual = aligned_embeddings                      # [N, d] collaboratively-aligned
codes = []
for level in range(m):
    centers = kmeans(residual, K)                  # [K, d] cluster centers = codebook `level`
    assigned = nearest_center(residual, centers)   # [N] integer code per item
    codes.append(assigned)
    residual = residual - centers[assigned]        # what this level failed to capture
# Semantic ID of item i: (codes[0][i], codes[1][i], ..., codes[m-1][i])
```
That’s it. No neural network. No gradients. No training loop. No stop-gradient tricks. You run k-means offline on your catalog, store the codebooks and codes, and you’re done.
Kuaishou’s OneRec uses RQ-Kmeans in production. Their technical report shows it achieves perfect 1.0 codebook utilization at all levels (every code is used), higher entropy (more balanced token distribution), and better reconstruction quality than the more complex alternative (RQ-VAE). The collaborative alignment step does the heavy lifting of making the space clustering-friendly. Once the space is well-shaped, plain k-means is hard to beat.
Approach 2: RQ-VAE (the learned approach)
RQ-VAE wraps the residual quantization in a neural network: an encoder MLP before the codebooks and a decoder MLP after. The encoder learns to rearrange the embedding space to be more quantization-friendly. The decoder verifies that quantization didn’t destroy important information. Everything is trained jointly with gradients.
Where RQ-VAE came from. RQ-VAE was invented for image generation (Lee et al., CVPR 2022), not recommendation. In images, you need a learned encoder-decoder because you’re compressing raw pixel data into discrete codes. There’s no pre-existing “collaboratively-aligned embedding space” for images. The encoder learns the compression from scratch. Google’s TIGER (Rajput et al., 2023) borrowed RQ-VAE for recommendation, using it to convert item content embeddings into Semantic IDs.
What the encoder-decoder adds over plain k-means: The encoder can rearrange the embedding space before quantization: pull together items that are far apart in the raw space but should share codes, spread apart items that are close but should get different codes, reshape elongated clusters into rounder ones. The decoder provides a reconstruction loss that tells the system “your quantization destroyed information about X,” so the encoder adjusts to preserve it.
When this matters: If your collaborative alignment is weak (limited interaction data, cold-start heavy catalog) or your content embeddings are messy (high-dimensional, elongated clusters), the RQ-VAE encoder can compensate. RQ-Kmeans can’t. It’s stuck with whatever space you give it.
Why we’ll explain the training in detail: RQ-VAE is the foundational method in the literature (TIGER, GRID, many others). Understanding how it trains teaches you stop-gradient and straight-through estimator, which transfer to VQ-VAE in audio, image tokenizers, and discrete latent models generally. And you need to understand the complex approach to know when the simple one is sufficient.
RQ-VAE: the architecture
There are exactly three groups of trainable parameters:
| Parameter group | Shape / role |
|---|---|
| Encoder | small MLP: content_emb → latent |
| Codebooks | $m$ matrices, each $[K, d]$ (e.g. 4 × [256, d]) |
| Decoder | small MLP: quantized latent → reconstructed content_emb |
That’s it. No Transformer here. RQ-VAE is a small model. The encoder and decoder are tiny MLPs. The codebooks are just matrices of vectors. Total parameter count is modest (a few million at most).
What a single training step looks like
You sample a batch of items from your catalog and process them all in parallel:
```
latent = encoder(content_emb)             # [B, d] one latent per item in the batch
residual = latent
quantized = 0
for level in range(m):
    code = nearest_code(residual, codebook[level])   # [B] argmin over K centers
    code_vector = codebook[level][code]              # [B, d] lookup
    quantized = quantized + code_vector
    residual = residual - code_vector
reconstructed = decoder(quantized)        # [B, d_content]
```
Notice: every item in the batch is processed independently. There’s no attention, no interaction between items. The batch dimension is purely for parallelism and gradient averaging. RQ-VAE is a per-item model. It maps one content embedding to one Semantic ID.
The loss function: who gets updated by what?
Now the tricky part. We have three groups of parameters (encoder, codebooks, decoder) and we need gradients to flow to all of them. But argmin in the quantization step is non-differentiable. It has zero gradient almost everywhere. How does this work?
The big picture first. Training has a forward pass and a backward pass, and they do very different things:
Forward pass (what you saw in the training loop above): Each item finds its nearest code at each level. This is pure nearest-neighbor assignment, just like the assignment step in k-means. No gradients are involved. The codebooks don’t move during the forward pass. You’re just asking: “given where the centers currently are, which center is each item closest to?”
Backward pass (what we’re about to explain): Three separate gradient signals update three separate parameter groups.
- The decoder gets a straightforward gradient from the reconstruction loss. Nothing tricky here.
- The codebook centers get updated to better represent the data assigned to them. Here’s how, concretely. Suppose code vector 42 is currently at position `[0.5, 0.3]` and three items got assigned to it during this batch, with residuals at `[0.7, 0.4]`, `[0.6, 0.2]`, and `[0.8, 0.3]`. We write a loss term `||code_42 - residual||^2` for each of these three items (with the residual frozen via stop-gradient, explained below). The gradient of `||code_42 - residual||^2` with respect to `code_42` is `2*(code_42 - residual)`, pointing away from the residual. The optimizer subtracts this gradient, so the code vector moves toward the residual. Across many batches, this is equivalent to k-means moving the center to the mean of its assigned points, except it happens incrementally via SGD rather than in one shot. The codebook centers don’t move during the forward pass (that’s just assignment). They move during `optimizer.step()`, after gradients have been computed.
- The encoder gets two gradients: one telling it “produce latents that reconstruct well” and another telling it “produce latents that land close to their assigned code, so the assignments are stable.”
The problem is that argmin sits between the encoder and the codebooks in the computation graph, blocking the normal gradient flow. So we need two tricks to manually route gradients around the blockage. The tricks are just plumbing. The conceptual picture above is the real content.
To solve this, we need two tricks that come up frequently in discrete/quantized systems. Let’s understand each one before seeing how they’re used.
What is stop-gradient?
In normal backpropagation, when you compute a loss involving two variables $a$ and $b$, the gradient flows to both. If your loss is $\|a - b\|^2$, the gradient with respect to $a$ is $2(a - b)$ (pulling $a$ toward $b$) and the gradient with respect to $b$ is $2(b - a)$ (pulling $b$ toward $a$). Both move.
Sometimes you want only one to move. Stop-gradient, written $\text{sg}[\cdot]$, tells the autograd system: “treat this value as a frozen constant during backprop. It contributed to the forward pass, but don’t compute or propagate gradients through it.”
```
loss = ((a - b) ** 2).sum()        # gradients flow to both a and b
loss = ((sg(a) - b) ** 2).sum()    # a frozen: only b is updated
loss = ((a - sg(b)) ** 2).sum()    # b frozen: only a is updated
```
In PyTorch, sg(x) is x.detach(). In JAX, it’s jax.lax.stop_gradient(x). The forward computation is identical in all three cases (the loss value is the same). The difference is purely in which parameters get updated during the backward pass.
Why is this useful? When you have two groups of parameters that both appear in a loss term but you want to update them separately with different objectives. You can write two loss terms using stop-gradient to route gradients to the right group.
What is the straight-through estimator?
The quantization step does this:
```
code = distances(latent, codebook).argmin()   # integer index, e.g. 42
quantized = codebook[code]                    # table lookup by that integer
```
argmin returns an integer index, not a continuous value. It answers “which code is nearest?” (an integer like 42), not “how near is the nearest code?” (a smooth distance). If you nudge latent by a tiny epsilon, either the same code is still nearest (index stays 42, output unchanged, gradient = 0) or a different code becomes nearest (index jumps from 42 to 43, a discontinuous step, not differentiable). Then codebook[code] is a table lookup by that integer. You can’t differentiate “look up row 42” with respect to 42. So during backprop, no gradient flows from quantized back to latent. The encoder gets no learning signal from the reconstruction loss.
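A one-dimensional demonstration of this piecewise-constant behavior (toy two-code codebook, hypothetical values):

```python
import numpy as np

codebook = np.array([0.0, 1.0])               # two 1-d code centers

def assign(latent):
    """Nearest-neighbor assignment: returns an integer index, not a smooth value."""
    return int(np.abs(latent - codebook).argmin())

# Nudging the input leaves the index unchanged (gradient = 0) ...
same = assign(0.40) == assign(0.40 + 1e-6)    # both map to code 0
# ... until a boundary is crossed, where the index jumps discontinuously.
jump = (assign(0.49), assign(0.51))           # a step at the midpoint 0.5
```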
The straight-through estimator is a hack: during the forward pass, use the actual quantized value. During the backward pass, pretend quantization didn’t happen and copy the gradient straight through.
```
forward:   quantized                                    # the actual snapped-to-code value
backward:  d(loss)/d(latent) := d(loss)/d(quantized)    # gradient copied straight through
```
In PyTorch, this is typically implemented as:
```python
quantized = latent + (quantized - latent).detach()
# forward:  latent + quantized - latent = quantized (exact)
# backward: the detached term contributes no gradient, so d(quantized)/d(latent) = 1
```
This is biased (the gradient is approximate, not exact), but it works well in practice. The encoder learns “if I shift my output slightly in direction $d$, the reconstruction gets better/worse by this much,” even though the actual quantized value didn’t shift (it snapped to a code).
Applying both tricks to RQ-VAE
Reconstruction loss: did quantization destroy information?
$$\mathcal{L}_{\text{recon}} = \frac{1}{B}\sum_{i=1}^{B}\| \texttt{content\_emb}_i - \texttt{reconstructed}_i \|^2$$Averaged over the batch. This wants to update the encoder (to produce better latent vectors), the codebooks (to approximate the latents better), and the decoder (to reconstruct better). But argmin blocks gradients from flowing through the quantization step. The straight-through estimator patches this: during backprop, gradients flow from reconstructed → decoder → quantized → (straight-through) → latent → encoder. The encoder and decoder both get updated. The codebooks still don’t, because the straight-through sends gradients to latent, not to the codebook vectors.
Quantization loss: routing gradients to the codebooks (and back to the encoder):
The codebooks need their own gradient signal. We use stop-gradient to create two auxiliary losses that manually route gradients:
$$\mathcal{L}_{\text{quantization}} = \frac{1}{B}\sum_{i=1}^{B}\sum_{\text{level}} \underbrace{\| \text{sg}[\texttt{residual}_i] - \texttt{code\_vector}_i \|^2}_{\text{Term 1: moves codebook centers}} + \beta \underbrace{\| \texttt{residual}_i - \text{sg}[\texttt{code\_vector}_i] \|^2}_{\text{Term 2: stabilizes encoder output}}$$Term 1 in plain English: Stop-gradient on the residual. Only the code vector receives gradients. The gradient is $2 \cdot (\texttt{code\_vector}_i - \texttt{residual}_i)$, pointing away from the residual. The optimizer subtracts it, so the code vector moves closer to the residual. This is how codebook centers migrate to where the data actually is. Same as a k-means centroid update, except via SGD across many batches.
Term 2 in plain English: Stop-gradient on the code vector. The gradient flows to the residual, which flows back through the encoder. The gradient is $2\beta \cdot (\texttt{residual}_i - \texttt{code\_vector}_i)$, telling the encoder: “move your output closer to the code vector you were assigned to.” This prevents the encoder from being flighty. If it keeps shifting its outputs, the code assignments keep changing, and the codebooks never stabilize. $\beta = 0.25$ is typical.
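To make Term 1’s k-means-like pull concrete, here is a numpy sketch of a single SGD step on Term 1 alone, reusing the three-residual example from earlier (the learning rate is a hypothetical choice that makes one step land exactly on the mean):

```python
import numpy as np

# Code vector 42 and its three assigned residuals, from the walkthrough above.
code_42 = np.array([0.5, 0.3])
residuals = np.array([[0.7, 0.4], [0.6, 0.2], [0.8, 0.3]])

# Term 1: || sg[residual] - code_42 ||^2, with the residual frozen.
# Gradient wrt code_42 is 2 * (code_42 - residual), averaged over assigned items.
grad = 2 * (code_42 - residuals).mean(axis=0)

lr = 0.5                        # hypothetical rate chosen so one step lands exactly
code_42 = code_42 - lr * grad   # on the mean of the assigned residuals, like k-means
```

After the step, `code_42` sits at the mean of its three residuals, which is precisely the k-means centroid update, just reached through a gradient step.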
Putting it together: what each parameter group receives:
| Parameter group | Updated by | What happens |
|---|---|---|
| Encoder weights | Reconstruction loss (via straight-through) + Term 2 | Encoder learns to produce latents that are easy to quantize AND reconstruct well |
| Codebook vectors | Term 1 only | Cluster centers migrate toward the data they’re assigned |
| Decoder weights | Reconstruction loss (directly) | Decoder learns to reconstruct from the quantized representation |
The full loss:
$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{quantization}}$$One loss.backward() call computes all the gradients. The stop-gradient operators and straight-through estimator ensure each parameter group gets exactly the right signal. One optimizer.step() updates everything simultaneously.
What goes wrong with RQ-VAE: codebook collapse
Some codes get assigned to many data points early on, get frequently updated, attract more data. Others drift away and die. You end up with 20 active codes out of 256.
First fix: K-means initialization + EMA updates + reset dead codes periodically.
Still collapsing? This is one reason RQ-Kmeans wins empirically. K-means directly computes cluster centers as the mean of assigned points, which guarantees every center is near actual data. Gradient-based codebook learning (Term 1) does the same thing incrementally via SGD, but it’s more susceptible to the rich-get-richer dynamic that causes collapse. RQ-Kmeans achieves perfect 1.0 codebook utilization by construction.
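A sketch of the dead-code reset mentioned above (the re-seeding policy, copying random latents from the current batch, is a common VQ-VAE mitigation, not specific to any one paper; all names here are hypothetical):

```python
import numpy as np

def reset_dead_codes(codebook, usage_counts, latents, rng):
    """Re-seed codes that attracted no data with random latents from the batch,
    so every center stays near actual data instead of drifting off and dying."""
    dead = np.where(usage_counts == 0)[0]
    if len(dead) > 0:
        codebook[dead] = latents[rng.choice(len(latents), size=len(dead))]
    return codebook

rng = np.random.default_rng(0)
codebook = np.array([[0.0, 0.0], [50.0, 50.0], [0.1, 0.1]])  # code 1 sits far from all data
latents = rng.normal(size=(32, 2))                            # batch of encoder outputs
counts = np.array([20, 0, 12])                                # code 1 was never assigned
codebook = reset_dead_codes(codebook, counts, latents, rng)
```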
Living with a changing catalog
Everything above describes building Semantic IDs once. But your catalog isn’t static. New videos are uploaded every minute. Old ones are removed. User behavior shifts seasonally. The Semantic ID system must handle this continuously, and the way it handles it has cascading consequences for everything downstream.
New items (the easy case). A new video is uploaded. You run its content through the vision-language model to get content_emb, collaboratively align it if your alignment model supports incremental updates, then quantize it against the frozen codebooks. With RQ-Kmeans, this is just nearest-neighbor assignment at each level. With RQ-VAE, it’s a forward pass through the frozen encoder plus nearest-neighbor assignment. Either way, no retraining needed. The new video immediately gets a meaningful ID that shares prefixes with similar existing videos. This is one of the biggest practical advantages of Semantic IDs over atomic IDs: a new video with atomic ID video_999999999 starts with a random embedding and zero information. A new video with Semantic ID (42, 8, 3, 215) is instantly known to be similar to other (42, 8, 3, *) items. The downstream Transformer already knows how to handle that prefix.
The limitation: the codebooks were built on the old catalog’s distribution. If the new video is genuinely unlike anything seen during codebook training, say your cooking platform suddenly adds gaming content, the existing codebooks may not carve the space well for it. The residual errors will be larger, the Semantic ID less precise. This degrades gracefully (the coarse codes are still meaningful, just the fine codes are noisy) but accumulates over time.
Periodic codebook retraining (the hard case). Eventually the catalog drifts enough that the codebooks need retraining. You rerun RQ-Kmeans (or retrain RQ-VAE) on the updated catalog. New codebooks are learned. And now every item’s Semantic ID potentially changes.
This is the hard problem. The video that was (42, 8, 3, 177) might become (38, 12, 7, 201). The downstream Transformer spent its entire training learning that (42, 8, 3, *) means “Italian pasta content.” After rebuilding codebooks, that knowledge is invalidated. The Transformer’s entire learned vocabulary is broken. So is the trie of valid IDs. So are any cached user sequence representations.
The downstream cascade:
- The Transformer must be retrained or fine-tuned on the new Semantic IDs. Retraining from scratch is expensive but clean. Fine-tuning on new IDs risks catastrophic forgetting. The model partially remembers old ID patterns that no longer exist, creating ghost associations.
- The valid-ID trie must be rebuilt entirely.
- All cached user representations (KV caches, precomputed encoder outputs) are stale and must be recomputed.
- If you’re running A/B tests, the old and new models produce incomparable Semantic IDs. You can’t mix them.
Practical strategies to manage this:
Gradual codebook updates. Instead of rebuilding codebooks from scratch, update them incrementally using EMA: $\texttt{codebook} \leftarrow \gamma \cdot \texttt{codebook} + (1 - \gamma) \cdot \texttt{new\_centers}$. With $\gamma = 0.99$, codebooks shift slowly. Most items’ Semantic IDs stay the same or change by one code at the finest level. The downstream Transformer can absorb small ID perturbations without retraining. Its learned patterns at the coarse levels survive.
Scheduled retraining with staged rollout. Rebuild codebooks weekly or monthly on the full updated catalog. Retrain the Transformer on the new IDs. Roll out in stages: 1% of traffic on the new model, monitor metrics, scale up. Keep the old model as fallback. This is operationally complex but standard in production ML.
Training the Transformer to be robust to ID noise. During Transformer training, randomly perturb a small fraction of Semantic IDs (swap a fine-level code for a nearby code). This teaches the model that IDs are approximate. It learns to rely on coarse prefixes (which are stable across retrainings) more than fine suffixes (which are volatile). A form of data augmentation specific to the Semantic ID setting.
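A sketch of this augmentation (the perturbation rate, the choice of fine levels, and the use of uniform-random rather than “nearby” replacement codes are all simplifying assumptions):

```python
import random

def perturb_semantic_ids(sequence, K=256, p=0.05, fine_levels=(2, 3), seed=None):
    """Randomly swap a small fraction of fine-level codes so the downstream
    model learns to trust stable coarse prefixes over volatile fine suffixes.
    `sequence` is a list of (c0, c1, c2, c3) Semantic IDs."""
    rng = random.Random(seed)
    out = []
    for sem_id in sequence:
        codes = list(sem_id)
        for level in fine_levels:
            if rng.random() < p:
                codes[level] = rng.randrange(K)   # swap in a random code at this level
        out.append(tuple(codes))
    return out
```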
The bottom line: Semantic IDs are not a “train once and deploy” component. They’re a living part of the system that requires an operational cadence: monitor codebook utilization, track how much the catalog has drifted since last retraining, schedule retraining before degradation becomes visible in online metrics. Staff-level ownership of this system means owning this lifecycle, not just the initial architecture.
3. Training Like a Language Model
From vocabulary to language
We now have a vocabulary of ~1024 tokens (256 codes × 4 levels) and every item is a “word” of 4 tokens. A user’s engagement history becomes a sentence:
```
(42, 8, 3, 177)  (42, 8, 3, 215)  (17, 2, 9, 44)  ...
→ [42, 8, 3, 177, 42, 8, 3, 215, 17, 2, 9, 44, ...]
```
We can now train a Transformer on these sequences exactly like a language model. Given the history, predict the next item’s Semantic ID, one code at a time:
$$\mathcal{L}_{\text{next-token}} = -\sum_{t=1}^{T-1} \log P(\texttt{token}_{t+1} \mid \texttt{token}_1, \ldots, \texttt{token}_t)$$Each term asks: “at position $t$, how surprised was the model by what actually came next?” Lower surprise (higher probability assigned to the true next token) = lower loss.
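A minimal numpy version of this loss (shapes assumed: `logits` comes from any causal model over the code vocabulary). A model that knows nothing assigns uniform probability and pays $\log 256$ nats (8 bits) per token:

```python
import numpy as np

def next_token_loss(logits, tokens):
    """logits: [T, V] causal-model outputs; tokens: [T] the observed sequence.
    Position t's logits are scored against the token at t+1 (shift by one)."""
    preds, targets = logits[:-1], tokens[1:]
    log_probs = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))  # log-softmax
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```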
Why this fixes scaling
The discriminative task (binary click prediction) gives 1 bit of supervision per example. The generative task (predict the next token from a vocabulary of 256) gives $\log_2 256 = 8$ bits per prediction, and a sequence of length $T$ gives $T - 1$ predictions. For a user with 1000 engagements encoded as 4-code Semantic IDs, that’s 3999 prediction tasks per user, each over a 256-way vocabulary. Versus 1000 binary labels in the discriminative setup.
More importantly, the task is harder. The model must learn the full joint distribution: the probability of seeing this entire sequence of items in this order:
$$P(\texttt{token}_1, \ldots, \texttt{token}_n) = \prod_{t=1}^{n} P(\texttt{token}_t \mid \texttt{token}_1, \ldots, \texttt{token}_{t-1})$$Each factor asks: “given everything this user has done so far, what’s the probability of this specific next token?” The model must understand temporal patterns, item relationships, preference evolution, and the compositional structure of codes. A single Transformer layer can’t do it. Depth helps. Scaling laws emerge.
Why depth specifically matters: a toy example
Consider this user sequence (showing only the first code of each Semantic ID for brevity):
```
[42, 42, 42, 17, 17, 42, 42, 17, ?]
```
What a 1-layer model can learn: First-order transitions. “After seeing code 42, code 42 is likely again (60%), code 17 is possible (40%).” It computes one round of attention over the sequence, producing a weighted mix of what it’s seen. It can learn “the most recent token was 17, so maybe 17 again” or “42 is more common overall.” But it computes each token’s representation by looking at the raw inputs only once.
What a 4-layer model can learn: Higher-order patterns. Layer 1 might learn “this user alternates between 42-runs and 17-runs.” Layer 2 might learn “the runs are getting shorter: started with three 42s, then two 17s, then two 42s, then one 17.” Layer 3 might learn “based on the shortening pattern, the next run should be one 42.” Layer 4 combines this with the user’s overall preference distribution.
Each layer operates on the output of the previous layer, not on the raw input. So layer 2 reasons about layer 1’s conclusions, not about raw tokens. This is compositional reasoning, the same reason deep networks outperform shallow ones on tasks with hierarchical structure. The generative task has this structure because user behavior is temporally structured (sessions, interest phases, boredom cycles). The discriminative task (“will this user click this item?”) doesn’t. It’s a single flat prediction.
The empirical proof: a Transformer trained autoregressively on pure item ID sequences showed power-law improvement from 98K to 0.8B parameters. The same Transformer as a feature extractor for a discriminative head showed no scaling. Only difference: the training objective.
4. What Are the Input Tokens?
In Section 3, we described the user’s history as a flat sequence of codes:
```
[42, 8, 3, 177, 42, 8, 3, 215, 17, 2, 9, 44]
```
That’s 12 tokens for 3 items. Each item contributes 4 codes (one per Semantic ID level). This is the right picture for understanding the training objective: the model predicts each code given all previous codes, so within a single item it predicts the fine codes conditioned on the coarse ones.
But it’s not how the Transformer actually sees the data. Processing 4 separate codes per item means 4× the sequence length, which means 16× the attention cost. In practice, we collapse each item’s 4 codes into a single dense vector before feeding it to the Transformer. So 3 items become 3 item tokens, not 12. (Each engagement also gets an action token, so 3 engagements = 6 tokens total, as we’ll see shortly.) The autoregressive prediction of individual codes still happens at the output head. The model decodes level by level. But the input representation is one dense vector per item.
This section explains how that collapse works, and what else goes into each token beyond the item identity.
How a Semantic ID becomes an embedding
A Semantic ID like (42, 8, 3, 177) is four integers. The Transformer needs a dense vector. How do you convert one to the other?
You might think: use the codebook vectors directly. After all, codebook_0[42] + codebook_1[8] + codebook_2[3] + codebook_3[177] reconstructs the item’s vector. But those code vectors were optimized for quantization quality, not for downstream sequence prediction. The Transformer needs embeddings optimized for its task.
Instead, maintain separate embedding tables for the Transformer, one per Semantic ID level:
```
level_0_emb: [256, d_model]    # one table per Semantic ID level
level_1_emb: [256, d_model]
level_2_emb: [256, d_model]
level_3_emb: [256, d_model]

# (42, 8, 3, 177) → one dense vector:
item_emb = level_0_emb[42] + level_1_emb[8] + level_2_emb[3] + level_3_emb[177]
```
Total embedding parameters: $4 \times 256 \times d_{\text{model}}$. With $d_{\text{model}} = 512$, that’s ~500K parameters, versus 500M × 512 ≈ 256 billion parameters for atomic ID embedding tables. The compression is massive.
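A quick sanity check of that arithmetic:

```python
d_model, K, m = 512, 256, 4
semantic_id_params = m * K * d_model        # 4 * 256 * 512 = 524,288 (~0.5M)
atomic_id_params = 500_000_000 * d_model    # 500M rows * 512 dims (~256B)
ratio = atomic_id_params / semantic_id_params
```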
Why summation works: it mirrors the residual quantization structure (sum of code vectors approximates the item), so the Transformer’s embedding space inherits the hierarchical structure. Two items sharing prefix (42, 8, 3) share three of four embedding components and differ only in the level-3 embedding. They start close in the Transformer’s input space, which is exactly what we want.
An alternative is concatenation (concat instead of +), which gives a $4 \times d_{\text{model}}$ vector that you project down. This preserves more information (the level-0 component can’t be confused with a level-3 component) but costs a projection layer. Summation is simpler and works well in practice.
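Both conversions can be sketched in a few lines of numpy (the table contents are random stand-ins; only the shapes and the prefix-sharing property matter):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, K, m = 512, 256, 4
level_emb = [rng.normal(size=(K, d_model)) for _ in range(m)]   # one table per level

def embed_sum(sem_id):
    """Sum of per-level lookups: mirrors residual quantization's sum structure."""
    return sum(level_emb[l][c] for l, c in enumerate(sem_id))

W_proj = rng.normal(size=(m * d_model, d_model)) / np.sqrt(m * d_model)

def embed_concat(sem_id):
    """Concatenate then project down: preserves which level each code came from."""
    x = np.concatenate([level_emb[l][c] for l, c in enumerate(sem_id)])
    return x @ W_proj

a = embed_sum((42, 8, 3, 177))
b = embed_sum((42, 8, 3, 215))    # shares 3-of-4 components with `a`
```

With summation, items sharing a prefix start out close in the input space, since their vectors differ only in the fine-level component.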
The sequence format: item tokens and action tokens
Each engagement has two pieces of information: what item the user saw and what they did with it (clicked, watched 30 seconds, skipped, purchased). Meta’s HSTU (Hierarchical Sequential Transduction Unit) represents each engagement as two separate tokens in the sequence:
```
[Φ_0, a_0, Φ_1, a_1, Φ_2, a_2, ..., Φ_t, a_t]
```
Each Φ_i is the item embedding (sum of 4 level embeddings from above). Each a_i is an action embedding from a small learned table (one embedding per action type: long_watch, click, skip, purchase, etc.).
Note: each Φ_i is already one dense vector, not the 4 raw codes. So 3 engagements = 6 tokens (3 item + 3 action), not 12.
This format is the key design insight. Because item tokens and action tokens alternate in the sequence, and the model uses causal attention (each token can only see tokens before it), ranking and retrieval are both just next-token prediction at different positions:
Ranking (predict the action): The model sees [Φ_0, a_0, ..., Φ_{t-1}, a_{t-1}, Φ_t] and predicts a_t. The item Φ_t is the candidate. The model has seen it but hasn’t seen what the user does with it yet. It predicts the action from its output at position Φ_t:
```
P(a_t | history) = softmax(W_action · h_t),   h_t = Transformer output at position Φ_t
```
Retrieval (predict the next item): The model sees [Φ_0, a_0, ..., Φ_{t-1}, a_{t-1}] (only after positive actions) and predicts Φ_t. It generates the next item’s Semantic ID autoregressively, one code at a time. This is the generative retrieval from Section 3.
Both tasks use the same model, the same sequence, the same attention. The only difference is which position you read the prediction from and what vocabulary you predict over (action types for ranking, Semantic ID codes for retrieval). No special <UNK> tokens or separate classification heads needed.
The cost: Sequence length doubles compared to one-token-per-engagement. Transformer attention is $O(L^2)$, so doubling $L$ quadruples compute. But you get a unified model that handles both ranking and retrieval with one architecture and one training objective.
Ranking multiple candidates efficiently
To rank 100 candidates for one user, you’d naively run the full sequence 100 times. But the user’s history is the same every time. Two optimizations:
KV caching (standard, same as LLM inference). Process the history once, cache the keys and values from all history positions. For each candidate, only compute that one new token’s attention against the cache. One expensive forward pass for the history, then 100 cheap single-token passes.
Microbatched parallel scoring (the M-FALCON contribution). Even with KV caching, scoring 100 candidates one at a time means 100 sequential GPU kernel launches, each computing attention for just one token. Each launch has overhead and underutilizes the GPU.
M-FALCON’s trick: append multiple candidates to the sequence simultaneously and run ONE attention computation. The attention mask prevents candidates from seeing each other:
```
                 history keys (shared)   cand_1   cand_2   cand_3
cand_1 attends:         all                 ✓        ✗        ✗
cand_2 attends:         all                 ✗        ✓        ✗
cand_3 attends:         all                 ✗        ✗        ✓
```
Each candidate’s row computes dot products against all history keys (shared, from cache) and only its own key. This is one matrix multiply, not three. The GPU processes all candidates’ attention scores in a single operation because the attention matrix is just bigger, not repeated. Instead of 100 sequential kernel launches (each underutilizing the GPU), you get 10 launches of 10-candidate microbatches, each fully utilizing the GPU’s parallel compute.
| Strategy | Attention cost per candidate | Kernel launches (100 candidates) |
|---|---|---|
| Naive full re-run | $O(L^2)$ | 100 |
| KV caching | $O(L)$ | 100 |
| KV caching + microbatching | $O(L)$ | 10 (batches of 10) |
KV caching alone gives you $O(L)$ per candidate instead of $O(L^2)$. Microbatching on top of that gives you GPU parallelism across candidates. Together, this is why Meta serves a model with 285× more FLOPs than the DLRM it replaced, at higher throughput.
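A sketch of the microbatch attention mask described above (boolean convention: True = may attend; sizes hypothetical):

```python
import numpy as np

def mfalcon_mask(hist_len, n_cand):
    """Rows are the appended candidate positions. Every candidate attends to the
    full (shared) history plus itself, but never to other candidates."""
    attend_history = np.ones((n_cand, hist_len), dtype=bool)
    attend_candidates = np.eye(n_cand, dtype=bool)     # self only
    return np.concatenate([attend_history, attend_candidates], axis=1)

mask = mfalcon_mask(hist_len=4, n_cand=3)   # shape [3, 7]
```

Applying this mask to one enlarged attention matrix is what lets a single kernel launch score the whole microbatch.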
Where do popularity, freshness, and other dense features go?
In Section 2, we noted that Semantic IDs deliberately don’t encode popularity or freshness. The tokens so far have two components: item identity (from Semantic ID embeddings) and action type. Where do dense, real-time item features enter?
Option 1: Add them to the item token. Bucket continuous values into discrete ranges (e.g., popularity: 0-1K, 1K-10K, 10K-100K, …) and learn an embedding per bucket, or pass raw floats through a small MLP. Add this as a third component to the token embedding: Φ_i = sem_id_emb[i] + action_emb[i] + dense_emb[i]. For candidate items during ranking, the dense features reflect the candidate’s current real-time stats.
Option 2: Drop them entirely (Meta’s approach). HSTU uses only sparse (categorical) features and drops all dense features. The argument: if a user has engaged with an item’s Semantic ID prefix 50 times in their history, the sequence itself implicitly encodes that prefix’s popularity. The model learns aggregate statistics from the raw event stream without being told explicitly. Meta reports this actually outperforms traditional feature engineering in their setting.
Option 3: Structured token types (Tencent’s approach). Instead of mixing everything into one token, define separate token types for different kinds of information. Tencent’s GPR system uses User tokens (profile), Organic tokens (content engagements), Environment tokens (real-time context like ad position, placement type, trending status), and Item tokens. The environment token is refreshed in real-time at serving, carrying the latest popularity and context signals separate from the item identity.
There’s a genuine philosophical split here. Meta’s “drop everything, trust the sequence” is radical but works at their scale. Meituan found the opposite: dropping dense features “significantly degrades model performance, and scaling up cannot compensate for it at all.” Nobody has published a clean ablation isolating popularity features specifically in a generative recommender. We’ll revisit the broader question of lost features in Section 8.
5. The Transformer Doesn’t Work Out of the Box
First attempt: vanilla Transformer on rec sequences
Take a standard causal Transformer. Feed it the user’s token sequence. Train with next-token prediction. What breaks?
Problem 1: softmax forces a competition
Standard attention:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
The softmax(dim=-1) forces each row of the weight matrix to sum to 1. This is a zero-sum game: for token $i$ to attend strongly to token $j$, it must attend less to everything else.
Why this is wrong for recommendations: A user’s history might have multiple independently relevant items. If you’re predicting “what will this user watch next after pasta videos and skateboarding,” both interests are relevant simultaneously. Softmax forces the model to divide attention between them. Worse, softmax can never output true zero. Every position gets a small weight, and across 1000 irrelevant tokens, those small weights add up to noise.
The fix: Replace softmax with a pointwise nonlinearity (SiLU) applied to each score independently:
$$A_{ij} = \text{SiLU}(q_i \cdot k_j), \qquad \text{out}_i = \sum_j A_{ij}\, v_j$$
Now each attention weight is computed independently. The model can attend strongly to multiple positions (no competition), and truly ignore irrelevant positions (SiLU of a negative score ≈ 0, unlike softmax’s always-positive floor).
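The difference is easy to see numerically. A toy sketch (numpy, made-up scores): one query against five history positions, two strongly relevant and three irrelevant.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def silu(x):
    return x / (1.0 + np.exp(-x))

# Two strongly relevant positions, three irrelevant ones.
scores = np.array([4.0, 4.0, -6.0, -6.0, -6.0])

w_softmax = softmax(scores)  # forced competition: the two relevant items split the mass
w_silu = silu(scores)        # independent weights: both relevant items stay strong

assert abs(w_softmax.sum() - 1.0) < 1e-9    # softmax rows always sum to 1
assert w_silu[0] > 3.9 and w_silu[1] > 3.9  # no competition between relevant items
assert abs(w_silu[2]) < 0.02                # SiLU(-6) ≈ 0: effectively ignored
assert w_softmax[2] > 0.0                   # softmax's floor is never exactly zero
```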
Problem 2: position doesn’t capture time
We try the pointwise attention with standard positional encodings. Better, but something is still wrong.
Consider: a user watches 10 cooking videos in one hour (positions 50–60), goes offline for a month, then watches a skateboarding video (position 61). Positionally, 60 and 61 are adjacent. Temporally, they’re a month apart. Standard positional encodings treat position 60→61 the same as 59→60. The model can’t tell a session boundary from a within-session transition.
The fix: Don’t add positional information to the token embeddings (that pollutes the content representation). Instead, add a relative attention bias directly to the attention scores. HSTU uses the sum of two biases, one for positional distance and one for temporal distance:
$$A_{ij} = \text{SiLU}\!\big(q_i \cdot k_j + b^{\text{pos}}[\text{bucket}(i - j)] + b^{\text{time}}[\text{bucket}(t_i - t_j)]\big)$$
Both biases use log-scale bucketing to keep the number of learnable parameters small. For temporal distance: compute the time gap between events $i$ and $j$ in seconds, then bucketize with a log function:
$$\text{bucket}(\Delta t) = \min\!\big(\lfloor \ln(\Delta t + 1) \rfloor,\ 24\big)$$
Each bucket gets one learned weight. So there are only ~25 learnable parameters for temporal bias, not $L^2$. The log scale means the model has fine resolution for recent events (distinguishing “2 seconds ago” from “10 seconds ago”) and coarse resolution for old events (lumping “3 weeks ago” and “4 weeks ago” together). Same bucketing scheme for positional distances.
Now the model can learn: “events close in wall-clock time are related even if far apart in the sequence” and “a month-long gap means a context switch regardless of position.” The log-bucketing prevents overfitting despite the parameters being learned.
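A minimal sketch of the log-scale bucketing (hypothetical function name and bucket count, matching the ~25-parameter budget above):

```python
import math

def time_bucket(delta_seconds: float, num_buckets: int = 25) -> int:
    """Log-scale bucketing: fine resolution for recent gaps,
    coarse resolution for old ones. One learned weight per bucket."""
    return min(int(math.log(delta_seconds + 1)), num_buckets - 1)

# Recent events are distinguishable...
assert time_bucket(2) != time_bucket(10)
# ...but "3 weeks ago" and "4 weeks ago" land in the same bucket.
three_weeks, four_weeks = 3 * 7 * 86400, 4 * 7 * 86400
assert time_bucket(three_weeks) == time_bucket(four_weeks)
# Even a 10-year gap stays within the 25-bucket parameter budget.
assert time_bucket(10 * 365 * 86400) <= 24
```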
Problem 3: different actions carry different signals
We now have pointwise attention with relative position and time biases. The model trains well, but ranking quality plateaus. Diagnosis: the model treats all engagement types the same. A click, a 30-second watch, a purchase, and a share all become attention-weighted sums of the same value vectors. But these carry fundamentally different signals. A purchase is a much stronger preference signal than a 2-second click.
This is actually two problems:
- Representation: The model has no mechanism to represent “this was a purchase” differently from “this was a bounce” in its hidden state. Even if it wanted to treat them differently, it can’t.
- Incentive: Even with that mechanism, if the training loss treats all next-token predictions equally, and clicks are 100× more frequent than purchases, the model optimizes for predicting clicks. It won’t learn to care about purchases.
We solve (1) here with an architectural fix. Problem (2) is a training objective problem that gets solved later: through loss weighting during pretraining (weight purchase predictions 10× higher than click predictions) and through alignment in Section 7 (DPO explicitly reweights outputs by a reward function that can value purchases, watch time, and satisfaction over raw clicks).
The architectural fix for representation: Add a gating mechanism. Project the input to four matrices instead of three:
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V, \quad U = XW_U$$
The gate U is a per-dimension volume knob on the attention output. A purchase action token and a skip action token have different input embeddings (different rows in the action embedding table), so they produce different U vectors. The gate gives the model the capacity to route information differently based on action type.
But capacity is not incentive. The gate doesn’t know purchases are worth more to your business than clicks. It learns whatever the loss function rewards. With unweighted next-token prediction, the model learns which action types are predictive of future tokens, not which action types are valuable to the business. If clicks are 100× more frequent than purchases, the loss is dominated by click predictions. The gate will get very good at representing click patterns and mediocre at representing purchase patterns, because that’s what minimizes the loss.
For the model to actually care about purchases, you need explicit intervention in the training signal. Loss weighting (weight purchase predictions 10× higher than click predictions during pretraining) is the simplest fix. Alignment in Section 7 (DPO with a reward function that values purchases, watch time, and satisfaction over raw clicks) is the more principled fix. The gate provides the representational machinery; those interventions provide the direction.
The complete attention block
All three fixes combined:
$$A = \text{SiLU}\!\big(QK^\top + B^{\text{pos}} + B^{\text{time}}\big), \qquad Y = \text{Norm}\!\big(AV \odot U\big)\, W_{\text{out}}$$
Stack multiple blocks for depth. This is the HSTU (Hierarchical Sequential Transduction Unit) block, the core building block for generative recommendation. The name reflects what it does: process hierarchical, sequential user action data through a transduction (sequence-to-sequence) architecture.
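A single-head numpy sketch of one such block, with toy dimensions and a random matrix standing in for the learned relative biases (all names here are illustrative, not from the HSTU codebase):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 8  # toy sequence length and hidden size

def silu(x):
    return x / (1.0 + np.exp(-x))

def rmsnorm(x):
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + 1e-6)

X = rng.normal(size=(L, d))
Wq, Wk, Wv, Wu, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(5))

Q, K, V, U = X @ Wq, X @ Wk, X @ Wv, X @ Wu  # four projections, not three
bias = rng.normal(size=(L, L)) * 0.01        # stand-in for learned pos + time biases
causal = np.tril(np.ones((L, L)))

A = silu(Q @ K.T + bias) * causal            # pointwise attention, no softmax
Y = rmsnorm((A @ V) * U) @ Wo                # gate, normalize, project; no FFN

assert Y.shape == (L, d)
assert np.all(A[np.triu_indices(L, k=1)] == 0)  # causal mask holds
```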
What else changes about the Transformer?
The three fixes above (pointwise activation, relative attention bias with time, gating) modify the attention mechanism. But HSTU also simplifies and optimizes the overall Transformer block in ways worth knowing.
No separate feed-forward network. A standard Transformer block alternates two sub-layers: multi-head attention, then a feed-forward network (two linear projections with a nonlinearity between them). HSTU drops the FFN entirely. The gating mechanism partially absorbs its role: element-wise multiplication of attn_out * U followed by a linear projection is already a nonlinear transformation of the attention output, similar to what the FFN would do. Fewer parameters, lower latency, and empirically no quality loss in the recommendation setting.
```
# standard Transformer block        # HSTU block
h = x + Attn(Norm(x))               h = x + Norm(A·V ⊙ U) W_out,   A = SiLU(QKᵀ + bias)
h = h + FFN(Norm(h))                # (no FFN sub-layer)
```
This means an HSTU block has roughly half the parameters and half the compute of a standard Transformer block at the same hidden dimension. You can stack twice as many layers for the same budget, which matters because Section 3 showed that depth is what unlocks scaling.
Sparse Mixture of Experts (in encoder-decoder variants). When you do want more capacity in the feed-forward computation (particularly in the decoder, which needs to choose among millions of possible Semantic ID codes), some systems like OneRec add a sparse MoE layer. Instead of one FFN, you have 64 expert FFNs, and a learned gating function routes each token to its top-2 experts:
```
scores = softmax(x @ W_gate)     # one routing score per expert, shape [64]
top2   = indices of the 2 largest scores
y      = sum(scores[e] * expert_ffn[e](x) for e in top2)
```
Total model capacity scales with the number of experts (64× more parameters in the FFN), but compute per token only scales with $k = 2$. This is how you get a large model that’s cheap to run. The tradeoff: load balancing across experts is tricky (some experts might get all the traffic while others idle), and the routing adds implementation complexity.
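A runnable toy version of that routing (numpy, each "expert" reduced to a single matrix for brevity; all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d, num_experts, k = 8, 64, 2

# Each "expert" stands in for a small FFN; here just one matrix per expert.
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(num_experts)]
W_gate = rng.normal(size=(d, num_experts))

def moe(x):
    logits = x @ W_gate
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top_k = np.argsort(probs)[-k:]  # route to the 2 highest-scoring experts
    return sum(probs[e] * (x @ experts[e]) for e in top_k), top_k

x = rng.normal(size=d)
y, chosen = moe(x)

assert y.shape == (d,)
assert len(chosen) == k  # compute per token scales with k, not with num_experts
```

Total parameters grow with `num_experts`; per-token compute touches only `k` experts, which is the whole point.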
Fused attention kernels for serving. The HSTU paper reports 5-15× speedup over FlashAttention2 on 8192-length sequences. The trick: fuse the pointwise activation, relative attention bias, causal mask, and value aggregation into a single GPU kernel. Standard FlashAttention is optimized for softmax attention (it exploits the online softmax algorithm to avoid materializing the full L×L matrix). HSTU’s pointwise activation is simpler than softmax (no normalization across the row), which enables a different fusion strategy. The computation becomes memory-bound rather than compute-bound, and scales with GPU register size rather than HBM bandwidth. You don’t need to understand the kernel implementation details for an interview, but knowing why HSTU is faster than standard Transformers (simpler activation = more fusible operations = better hardware utilization) is useful context.
These optimizations are why Meta deployed an HSTU model with 285× more FLOPs than the DLRM it replaced, using less inference compute. The architectural simplifications (no FFN, pointwise instead of softmax) aren’t just about model quality. They directly enable the serving efficiency that makes the whole approach practical.
6. The Sequence Doesn’t Fit
The problem (and why “just use a longer context window” doesn’t work)
A power user on a short-video platform generates 100K+ engagements over their lifetime. With two tokens per engagement (item + action), that’s 200K+ tokens.
LLMs now handle 200K+ context windows. So why not just feed the full history into the Transformer?
Scale. LLMs serve one user at a time, for a few seconds, at maybe thousands of QPS. Recommendation systems serve billions of requests per day with strict latency budgets (the ranking stage alone typically gets a tens-of-milliseconds P99 budget, with the full pipeline at 100-300ms) and train on 10-100 billion examples per day. The constraint isn’t “can attention physically handle 200K tokens.” It’s “can you afford $O(L^2)$ attention at that length, multiplied by billions of daily requests, within your GPU budget and latency SLA.” At $L = 200{,}000$, one attention layer costs $4 \times 10^{10}$ operations per request. Even with fused kernels, that’s not feasible at recommendation scale.
In practice, no production system runs full attention on 100K tokens. Every system compresses. They differ in how.
What systems actually do
HSTU (Meta): Truncate + stochastic subsample. HSTU uses sequences of 4096-8192 tokens in production, not 100K. For training, it uses Stochastic Length (SL): randomly subsample the sequence with older events having exponentially lower sampling probability. At their recommended setting ($\alpha = 1.6$), a 4096-token sequence becomes ~776 tokens most of the time, removing 80%+ of tokens. The model still trains well because the stochasticity acts as data augmentation (each training step sees a different subsample of the same user’s history).
The limitation: old events survive with low probability, not zero. A cooking interest from 6 months ago might be represented by 2-3 surviving events out of hundreds, or might be entirely absent in a given training step. The model gets a noisy, sparse view of long-term history, and that view changes randomly between training steps. This works surprisingly well (the stochasticity acts as regularization), but the model can’t reliably learn precise long-term patterns like “this user’s cooking interest peaked in March and declined in April.”
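A hedged sketch of SL-style subsampling, under two assumptions inferred from the numbers above: the target length is roughly $L^{\alpha/2}$ (4096^0.8 ≈ 776), and survival probability decays exponentially with event age (the decay scale `tau` is made up here):

```python
import numpy as np

def stochastic_length_subsample(seq_len: int, alpha: float = 1.6,
                                tau: float = 1000.0, seed: int = 0) -> np.ndarray:
    """Keep ~seq_len**(alpha/2) events; older events survive with
    exponentially lower probability. tau is a hypothetical decay scale."""
    rng = np.random.default_rng(seed)
    target = int(round(seq_len ** (alpha / 2)))
    age = np.arange(seq_len)[::-1]       # age 0 = most recent event
    p = np.exp(-age / tau)
    p /= p.sum()
    keep = rng.choice(seq_len, size=target, replace=False, p=p)
    return np.sort(keep)                 # preserve temporal order

kept = stochastic_length_subsample(4096)
assert len(kept) == 776                  # 4096**0.8 ≈ 776: 80%+ of tokens removed
assert len(set(kept.tolist())) == len(kept)  # sampled without replacement
```

Re-running with a different seed yields a different subsample of the same history, which is the data-augmentation effect described above.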
OneRec (Kuaishou): Multi-pathway compression. The insight behind this approach: different time horizons of a user’s history answer different questions, and they need different levels of detail.
Think about what you’d want to know about a user to recommend their next video:
- Who are they? Age, gender, location. One token is enough. This never changes.
- What are they doing right now? Their last 20 interactions, in full detail with action types. You need the exact sequence because order matters (they just skipped a cooking video, so maybe not another one right now). 20 tokens.
- What do they like in general? Their top 256 most-engaged items. You don’t need the exact order or timestamps, just the set of items they’ve shown strong positive signal for. 256 tokens.
- What’s their lifetime taste profile? Their full 100K interaction history. You can’t afford 100K tokens, but you also can’t just throw this away. Compress it heavily into a summary. 32 tokens.
Each pathway preserves exactly the level of detail that matters for its time horizon. Recent history keeps full detail. Lifetime history gets heavy compression. Static features get one token.
| Pathway | Content | Tokens |
|---|---|---|
| Static profile | age, gender, location | 1 |
| Short-term | last 20 interactions, full detail with action types | 20 |
| Positive feedback | top 256 most-engaged items | 256 |
| Lifetime | compressed summary of the full 100K history | 32 |
| **Total** | | **309** |
309 tokens. Down from 200K. Manageable.
How the lifetime compression works (Pathway 4). This is the non-obvious part. You have 100K item embeddings and need to squeeze them into 32 tokens. Two stages:
Stage 1: Cluster. Run hierarchical k-means on the 100K item embeddings to get ~200 cluster centroids. Each centroid represents a neighborhood of similar items. “Cooking content” might be one cluster, “skateboarding” another, “electronics reviews” another. 100K items → 200 centroids.
Stage 2: Summarize the clusters with learned queries. This uses a QFormer (Querying Transformer). Initialize 32 learnable “query” vectors. These queries learn to ask questions like “how much cooking content is in this user’s history?” or “what’s the strongest interest cluster?” Each query attends to all 200 centroids and produces a weighted summary:
$$\text{summary}_m = \sum_{c=1}^{200} \text{softmax}_c\!\big(q_m \cdot k_c\big)\, v_c, \qquad m = 1, \dots, 32$$
The 32 queries are learned during training. They start random and converge to useful questions about the user’s history. The model discovers what aspects of lifetime behavior are worth preserving in 32 tokens.
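A toy numpy version of that cross-attention step (random stand-ins for the learned queries and k-means centroids; in the real system the queries are trained and the centroids come from Stage 1):

```python
import numpy as np

rng = np.random.default_rng(2)
d, num_centroids, num_queries = 16, 200, 32

centroids = rng.normal(size=(num_centroids, d))  # Stage 1: k-means output
queries = rng.normal(size=(num_queries, d))      # 32 learnable query vectors

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Each query attends over all 200 centroids and emits one summary token.
attn = softmax(queries @ centroids.T / np.sqrt(d))  # [32, 200]
summary = attn @ centroids                          # [32, 16]

assert summary.shape == (num_queries, d)  # 100K events squeezed into 32 tokens
assert np.allclose(attn.sum(axis=1), 1.0)
```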
Why concatenation matters. After concatenation, the 309 tokens go through the same HSTU attention blocks from Section 5 (but bidirectional here, no causal mask, since this is the encoder processing the user’s past). This means: short-term tokens can attend to lifetime tokens, positive-feedback tokens can attend to short-term tokens, and so on. The encoder discovers relationships across temporal scales: “this user’s recent skateboarding binge (short-term) is a departure from their lifelong cooking preference (lifetime). Maybe a temporary phase.” That cross-pathway attention is the whole point of concatenating rather than processing each pathway independently.
Other approaches to long-sequence compression
VISTA (Meta, 2025): Linear attention summarization. Compress 100K history into a few hundred summary tokens using linear-complexity attention (avoiding the $O(L^2)$ cost), then run standard target attention from candidates against those summaries.
ULTRA-HSTU (Meta, 2026): Deeper computation where it matters. Three mechanisms combined. First, semi-local attention (SLA) restricts each token to a local window rather than the full sequence. This is the same idea as sliding window attention in LLMs (Longformer, Mistral). Second, and more interesting: attention truncation. Run the first few HSTU layers on the full long sequence, then run the remaining (deeper) layers on only the most recent segment. The insight: old history needs shallow processing to extract general taste, but recent history needs deep processing to capture current intent. You allocate more compute where it’s more predictive, rather than giving every token the same depth. Third, Mixture of Transducers (MoT): process different behavioral signals (e.g., clicks vs purchases vs searches) as separate sequences with separate transducers, then fuse.
7. The Model Understands Users but Recommends Badly
The problem
The model is trained. It predicts next tokens well. But when we serve it, the recommendations are… fine for engagement but bad for the business. It recommends clickbait (high predicted CTR, low user satisfaction). It shows 10 pasta videos in a row (accurate prediction, terrible experience). It underweights new items (Semantic IDs help with cold start, as Section 2 showed, but the model still assigns lower probability to specific code combinations it hasn’t seen frequently in training data).
The generative training objective is “predict what the user will engage with next.” That’s not the same as “show the user what they should see.” These objectives conflict.
In the old DLRM world, you’d handle this with multi-task towers, one per objective, combined with manually tuned weights. But our generative model outputs Semantic IDs. There are no towers.
The fix: alignment (same idea as RLHF for LLMs)
Step 1: Define what “good” means. Combine multiple signals into a reward:
- Preference score: Weighted mix of engagement metrics (watch time, likes, shares, not just clicks).
- Format reward: Did the model generate valid Semantic IDs? (Binary. Prevents degenerate outputs.)
- Business reward: Diversity (not all same category), safety, monetization targets, cold-start item boost.
Step 2: Generate contrastive pairs. Use beam search with the current model to produce multiple candidate item lists per user. Score each list with the reward. Take the best as “chosen”, the worst as “rejected”.
Step 3: Update with DPO. Recall that our model generates Semantic IDs autoregressively, one code at a time. So the probability of a full recommendation list is just the product of all individual code probabilities:
$$P_{\text{model}}(\texttt{list} \mid \texttt{user}) = \prod_{t} P_{\text{model}}(\texttt{code}_t \mid \texttt{all previous codes}, \texttt{user history})$$

This is exactly what the model already computes at every decoding step. The DPO loss uses these probabilities:
$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\Big(\beta \log \frac{P_{\text{model}}(\texttt{chosen} \mid \texttt{user})}{P_{\text{pretrained}}(\texttt{chosen} \mid \texttt{user})} - \beta \log \frac{P_{\text{model}}(\texttt{rejected} \mid \texttt{user})}{P_{\text{pretrained}}(\texttt{rejected} \mid \texttt{user})}\Big)$$

Each log-ratio measures: “how much more likely does the current model make this list compared to the pretrained model?” The loss says: make that ratio larger for the chosen list and smaller for the rejected list.
Concretely: if the chosen list is [(42,8,3,177), (17,203,44,9), (55,12,8,33)] (diverse) and the rejected list is [(42,8,3,177), (42,8,3,52), (42,8,7,88)] (all pasta), the model learns to increase the probability of generating the diverse list and decrease the probability of the monotone one.
$\beta$ controls how far the model can drift from pretrained behavior. Too small → barely moves. Too large → forgets everything it learned in pretraining.
Step 4: Prevent forgetting. Train with combined loss:
$$\mathcal{L} = \mathcal{L}_{\text{next-token}} + \lambda\ \mathcal{L}_{\text{DPO}}$$

The next-token loss on ground truth keeps the model’s language modeling ability. DPO steers it toward business-preferred outputs.
What goes wrong: early instability
In the first alignment round, the model is far from optimal. The log-ratios can be huge. A concrete example: suppose the model assigns probability 0.85 to the chosen list, but the pretrained model assigned 0.001. The log-ratio is $\log(0.85 / 0.001) = 6.7$. Multiply by $\beta = 0.1$ and you get 0.67, giving a moderate gradient. But if the model assigns 0.99 and pretrained assigned 0.0001, the log-ratio is 9.2. The gradient scales linearly with this, and early in training these extreme ratios appear constantly because the model is moving fast. The updates become erratic.
The fix (ECPO): Clip the log-ratios, same idea as PPO’s clipped objective:
$$\text{clip}\!\left(\log \frac{P_{\text{model}}(\texttt{list} \mid \texttt{user})}{P_{\text{pretrained}}(\texttt{list} \mid \texttt{user})},\ -\epsilon,\ \epsilon\right)$$

With $\epsilon = 0.2$, that log-ratio of 9.2 gets capped at 0.2. The gradient is gentle and controlled. As training progresses and the model stabilizes, the actual log-ratios naturally stay within the clip range, so the clipping stops activating.
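Putting the DPO loss and the clipping together, a minimal numpy sketch (toy log-probabilities, hypothetical function names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
             beta=0.1, eps=None):
    """DPO loss on sequence log-probabilities; eps enables ECPO-style clipping."""
    r_c = logp_chosen - ref_chosen      # log-ratio for the chosen list
    r_r = logp_rejected - ref_rejected  # log-ratio for the rejected list
    if eps is not None:
        r_c = np.clip(r_c, -eps, eps)
        r_r = np.clip(r_r, -eps, eps)
    return -np.log(sigmoid(beta * (r_c - r_r)))

# Early-training extreme: model puts 0.99 on the chosen list, pretrained put 0.0001.
wild = dpo_loss(np.log(0.99), np.log(0.5), np.log(0.0001), np.log(0.5))
tame = dpo_loss(np.log(0.99), np.log(0.5), np.log(0.0001), np.log(0.5), eps=0.2)

# Clipping caps the log-ratio at 0.2, which weakens the (already-satisfied)
# preference signal and keeps the update gentle.
assert wild < tame
```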
The iterative loop
This isn’t a one-shot process. After updating the model with DPO:
- The model is now better at generating diverse, business-aligned lists.
- Use beam search with the updated model to generate new candidate lists.
- These new candidates are higher quality than before. The “chosen” lists are better.
- Score them, select new chosen/rejected pairs, update again.
Each round raises the ceiling: the model improves, so beam search explores a better region of list space, so the preference pairs are more informative, so the next update is more useful. This converges because the improvement per round shrinks. At some point the model is good enough that beam search can’t find much better candidates than what it already generates. Typically 3–5 rounds.
8. What Did We Lose by Going Generative?
The broader problem
The model works. Scaling laws are real. Alignment improves business metrics. But there are gaps. The old DLRM system was fed hundreds of features per item and per user-item pair. Our generative model has a sequence of tokens. Where did all those features go?
In Section 4, we showed how item-side dense features (popularity, freshness, creator stats) can optionally be injected as a third token component, and noted the philosophical split: Meta drops them entirely while Meituan insists they’re essential. If you chose to include them via token = sem_id_emb + action_emb + dense_emb, item-side features are covered.
But there’s a harder category of lost signal: user-item cross-features. These can’t be handled by adding a component to every token, because they depend on the specific pair of user and candidate, not just the item alone.
The cross-feature gap
In DLRM, the most predictive features were often cross-features: “user $u$’s CTR on category $c$ = 18%”, “user $u$ viewed item $i$ 7 times in 30 days”, “user $u$’s average session length on cooking content = 12 minutes.” These directly encode the user-item relationship. They’re handed to the model as pre-computed dense numbers.
In our generative model, the user is a sequence of Semantic IDs. The model must implicitly discover that the user has engaged with cooking content 7 times recently by recognizing code-prefix patterns across thousands of tokens. That’s asking attention to do counting, a much harder learning problem than reading a feature that says count = 7.
First attempt: just let the Transformer learn it (Meta’s HSTU approach)
This is what Meta does. HSTU drops all dense and cross-features entirely and trusts the sequence. Maybe with enough data and depth, the model will learn to count. After all, LLMs can do arithmetic.
Tested: for users with 50+ interactions with a specific category, the DLRM with pre-computed cross-features achieves measurably higher AUC than the generative model. The Transformer learns rough aggregate statistics (“this user watches a lot of cooking”) but not precise counts or rates (“this user’s CTR on cooking is 18% vs 12% on sports”). The gap is largest for heavy users with rich per-category history. Exactly the users where personalization matters most.
Second attempt: shove cross-features back in (Meituan’s MTGR approach)
Meituan’s MTGR (Meituan Generative Recommendation) found that dropping cross-features “significantly degrades model performance, and scaling up cannot compensate for it at all.” Their solution: reorganize training data from one-sequence-per-user to one-sequence-per-(user, candidate):
```
sequence(u, c) = [ user features | history h₁ h₂ … h_L | candidate c + cross-features(u, c) ]
```
Cross-features like “user-item CTR” are now just additional features on the candidate token. The model has direct access.
But this requires careful masking to prevent leakage. The masking uses the same HSTU attention blocks from Section 5, but with a heterogeneous mask instead of a simple causal mask. Since HSTU uses pointwise activation (SiLU) rather than softmax, masking still works the same way. Zeroed-out positions contribute nothing to the output:
| query ↓ \ key → | user features | history | candidate k | other candidates |
|---|---|---|---|---|
| user features | Y | – | – | – |
| history | Y | causal | – | – |
| candidate k | Y | Y | Y | – |
- User features: visible to everything.
- History: causal within itself. Can see user features but not candidates.
- Candidates: self-only. Each candidate sees the user, history, and its own features, but not other candidates. If candidate $k$ could see candidate $k+1$’s features (including labels), it would cheat.
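The mask described by these rules can be sketched in numpy (toy sizes, hypothetical helper name):

```python
import numpy as np

def mtgr_mask(n_user: int, n_hist: int, n_cand: int) -> np.ndarray:
    """Toy heterogeneous mask (1 = may attend): user features visible to all
    queries, history causal, candidates self-only (plus user + history)."""
    n = n_user + n_hist + n_cand
    mask = np.zeros((n, n), dtype=int)
    mask[:, :n_user] = 1                               # everyone sees user features
    h0, c0 = n_user, n_user + n_hist
    mask[h0:c0, h0:c0] = np.tril(np.ones((n_hist, n_hist), dtype=int))
    mask[c0:, h0:c0] = 1                               # candidates see full history
    mask[c0:, c0:] = np.eye(n_cand, dtype=int)         # ...but only themselves
    mask[:c0, c0:] = 0                                 # nothing looks at candidates
    return mask

m = mtgr_mask(n_user=2, n_hist=3, n_cand=2)
assert m[5, 6] == 0 and m[6, 5] == 0  # candidates can't see each other (no leakage)
assert m[2, 5] == 0                   # history never sees candidates
assert m[5, :5].all()                 # a candidate sees user + history + itself
```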
The cost: If user $u$ has $N$ candidates, you now produce $N$ sequences instead of 1. That’s $N\times$ more training data. In practice, $N$ can be hundreds. This works but it’s expensive.
Third attempt: don’t replace DLRM at all (Alibaba’s GPSD, Netflix FM, Pinterest PinFM)
This is the hybrid approach, and honestly it’s what most companies do. Alibaba’s GPSD and LUM, Netflix’s Foundation Model, and Pinterest’s PinFM all follow this pattern. The insight: you don’t have to choose between generative and discriminative. Use generative training to learn better representations, then plug them into your existing DLRM that already handles cross-features natively.
Wait, doesn’t Section 3 say “a Transformer as a feature extractor for a discriminative head showed no scaling”? Yes, but that was training the Transformer with the discriminative objective. The Transformer never learned through a hard generative task, so it never developed rich representations. Making it bigger didn’t help because the binary task was too easy. Here, we train the Transformer with the generative objective first (where scaling laws do emerge), freeze the resulting representations, and hand them to DLRM. The scaling already happened during pretraining. The DLRM doesn’t need to scale; it just consumes the already-good embeddings and adds cross-features on top.
Step 1: Pretrain: Autoregressive Transformer on user sequences. Standard next-token prediction. This learns item embeddings that encode temporal patterns and item-item relationships, much richer than DLRM’s co-occurrence embeddings. This is where scaling laws apply.
Step 2: Transfer: Move the pretrained item embeddings into the DLRM. Freeze them. Let the DLRM fine-tune everything else (dense weights, cross-feature processing, task towers) while the pretrained embeddings stay fixed.
```
user sequences ──(generative pretraining)──> Transformer ──> item embeddings
                                                                  │ frozen
                                                                  ▼
             DLRM (fine-tuned): dense weights, cross-features, task towers
```
Why freezing is critical: The whole value of generative pretraining is robust, generalizable embeddings learned from a hard task. If you unfreeze them during discriminative fine-tuning, they start overfitting to the binary labels. The one-epoch curse returns. Tested: unfrozen embeddings degrade after epoch 1. Frozen embeddings allow training for 5+ epochs with continued improvement.
The hybrid approach gives you: scaling laws from generative pretraining + cross-features from DLRM + existing infrastructure and team structure. The tradeoff: two training stages (pretrain + fine-tune) and you don’t get the architectural simplicity of a single end-to-end model.
A middle ground: fine-tune the whole pretrained model (Pinterest’s PinFM)
Instead of just transferring frozen embeddings, append each candidate to the user sequence, pass through the pretrained Transformer, and fine-tune the whole model end-to-end on ranking objectives. This is what Pinterest’s PinFM does: the candidate is appended to the user sequence to bring “candidate awareness,” and the model is fine-tuned (not frozen) along with action predictions. Alibaba’s GPSD also tested this as their “Full Transfer” strategy, though they found that freezing the sparse (embedding) parameters and only fine-tuning the dense (Transformer) parameters worked better.
This captures richer contextual information (the Transformer’s sequential understanding, not just static embeddings) but risks degrading the pretrained representations. To mitigate: use a smaller learning rate for the pretrained parameters than for the DLRM parameters.
Serving optimization: For $C$ candidates per user, the user’s sequence representation is identical for all $C$. Compute it once, cache it. For each candidate, compute only the cross-attention between that candidate and the cached representation:
```
user_repr = transformer(history)                   # O(L²), computed once, cached
for c in candidates:                               # C cheap passes
    score[c] = head(cross_attention(c, user_repr)) # O(L) each
```
Cost drops from $O(C \cdot L^2)$ to $O(L^2 + C \cdot L)$.
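The savings are easy to quantify with a rough op-count model (a back-of-the-envelope sketch, ignoring constant factors and the hidden dimension):

```python
def attention_ops(L: int, C: int, cached: bool) -> int:
    """Rough op counts for scoring C candidates against a history of length L."""
    if cached:
        return L * L + C * L  # encode the history once, then C cheap passes
    return C * L * L          # naive: re-run full attention per candidate

L, C = 4096, 500
naive = attention_ops(L, C, cached=False)
cached = attention_ops(L, C, cached=True)

assert cached < naive
assert naive // cached >= 400  # roughly C-fold savings when C << L
```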
9. Generating Lists Instead of Scoring Candidates
The problem with candidate-at-a-time scoring
Everything so far scores candidates independently: process user history, append candidate item token, predict action. But this means the model doesn’t know what else it’s recommending. It might put 5 pasta videos in the top 5 because each one independently scores high. Diversity must be enforced post-hoc by a reranking layer.
What if the model could generate a whole recommendation list at once, where each item depends on the ones before it?
The core insight: autoregressive decoding already gives you interdependence
Think about how the model generates Semantic IDs in retrieval mode (Section 3). It predicts one code at a time, each conditioned on all previous codes. If you generate multiple items sequentially, each item is conditioned on all previously generated items. When generating recommendation 3, the model has already generated recommendations 1 and 2 in its context.
This means diversity emerges naturally from autoregressive list generation. If recommendations 1 and 2 were both pasta videos (codes starting with 42, 8), the model’s context encodes “I’ve already output two pasta items.” The next code prediction shifts probability away from prefix (42, 8) toward other categories. This isn’t a hard constraint. It’s a learned behavior: during training, the model sees ground truth recommendation sessions where diversity correlates with engagement.
This works with a decoder-only model. No encoder-decoder architecture needed.
Decoder-only list generation (the simpler approach)
OneRec v2 (Kuaishou, 2025) proved this by dropping the encoder entirely from the original OneRec encoder-decoder architecture and going decoder-only. The result: computation cut significantly, model scaled larger, quality maintained.
The sequence is just the user’s history followed by the generated recommendations, all processed causally:
```
[ history: item₁ act₁ item₂ act₂ … ] [ rec 1: c₁ c₂ c₃ c₄ ] [ rec 2: c₁ c₂ c₃ c₄ ] …
```
The causal mask means each generated code can see: all the user’s history (to the left) and all previously generated codes (to the left). It cannot see future codes (to the right). Standard autoregressive generation, same as an LLM generating text.
Constrained decoding for pre-retrieved candidates. If you’ve already retrieved 100 candidates and want to generate a diverse ordering of a subset, restrict the trie to contain only the Semantic IDs of those 100 candidates. The model autoregressively generates codes, but at each step, only codes that lead to a valid candidate are allowed. The result: an interdependent, diversity-aware ranking of your pre-retrieved set, with no encoder-decoder overhead.
| |
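Trie-constrained decoding can be sketched concretely. A minimal version, assuming a nested-dict trie and a hypothetical `next_code_logits` model interface: at each level, logits for codes that do not extend to a valid candidate are masked to negative infinity before picking the next code.

```python
import numpy as np

def build_trie(semantic_ids):
    # Nested-dict trie over the candidates' code sequences.
    root = {}
    for sid in semantic_ids:
        node = root
        for code in sid:
            node = node.setdefault(code, {})
    return root

def constrained_decode(next_code_logits, trie, n_levels=4):
    """Greedily decode one Semantic ID, allowing only codes that lead
    to a valid candidate at every level."""
    node, codes = trie, []
    for _ in range(n_levels):
        logits = next_code_logits(codes)
        mask = np.full(logits.shape, -np.inf)
        mask[list(node.keys())] = 0.0    # only valid continuations survive
        code = int(np.argmax(logits + mask))
        codes.append(code)
        node = node[code]
    return tuple(codes)
```

Every decoded sequence is guaranteed to be one of the pre-retrieved candidates, no matter what the raw logits prefer.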
Encoder-decoder list generation (the heavier approach)
OneRec v1 (Kuaishou, 2025) used an encoder-decoder architecture. The encoder processes the user’s history with bidirectional self-attention (no causal mask, every past event sees every other past event). The decoder generates recommendations autoregressively, attending to the encoder’s output via cross-attention.
What the encoder buys you: bidirectional attention over history produces a richer user representation than causal attention. In causal attention, position 50 can only see positions 0-49. In bidirectional attention, position 50 can also see positions 51-1000. This means the encoder can represent “this user’s cooking interest peaked after their skateboarding phase,” which requires seeing both phases. The decoder-only model can only represent “this user has done cooking and skateboarding so far.”
The leakage question: Bidirectional attention over history is not leakage. The encoder sees only the past. The decoder generates the future. Cross-attention bridges past → future. The encoder never sees what’s being recommended. The attention mask makes this explicit:
| |
- Top-left (encoder↔encoder): All Y. Bidirectional. Every past event sees every other past event.
- Top-right (encoder→decoder): All N. The past NEVER sees the future recommendations.
- Bottom-left (decoder→encoder): All Y. Cross-attention. Every decode step reads the full history.
- Bottom-right (decoder↔decoder): Causal triangle. Each code sees only previously generated codes.
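The four quadrants translate directly into a boolean mask. A minimal numpy sketch (rows are queries, columns are keys, `True` means "may attend"); the function name is illustrative:

```python
import numpy as np

def encdec_mask(h, t):
    """Attention mask over [h history positions | t generated positions]."""
    n = h + t
    m = np.zeros((n, n), dtype=bool)
    m[:h, :h] = True                                   # enc↔enc: bidirectional
    # m[:h, h:] stays False: the past never sees the recommendations
    m[h:, :h] = True                                   # dec→enc: cross-attention
    m[h:, h:] = np.tril(np.ones((t, t), dtype=bool))   # dec↔dec: causal
    return m
```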
The tradeoff: Cross-attention at every decoder layer adds parameters and compute. OneRec v2 showed that dropping this and going decoder-only is worth it: you lose bidirectional history encoding but gain the ability to scale the model larger within the same compute budget. For most teams, decoder-only list generation is the right starting point.
10. End-to-End: A Concrete User Through the Full Pipeline
Let’s trace one user through every component to see how they connect.
Offline: building the Semantic ID vocabulary
Before any user is served, we’ve already:
- Encoded every video’s content (images, title, tags) through a vision-language model → content_emb: float32[d_content] per item.
- Fine-tuned these embeddings with a collaborative contrastive loss so that items engaged by similar users are nearby.
- Built codebooks with 4 levels of 256 codes (via RQ-Kmeans or RQ-VAE). Every video now has a Semantic ID.
| |
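The RQ-Kmeans step can be sketched end to end: cluster the embeddings, subtract each item's assigned center, cluster the residuals, and repeat once per level. A simplified sketch (tiny Lloyd's k-means, small `k` for illustration; production systems use K = 256 per level and far more careful initialization):

```python
import numpy as np

def kmeans(x, k, iters=10, rng=None):
    # Minimal Lloyd's k-means; returns centers and per-point assignments.
    rng = rng or np.random.default_rng(0)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = x[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers, assign

def rq_kmeans(x, k=256, levels=4):
    """Residual quantization: each level clusters what the previous
    levels failed to explain. codes[:, l] is the level-l code per item."""
    residual = x.astype(float).copy()
    codebooks, codes = [], []
    for _ in range(levels):
        centers, assign = kmeans(residual, k)
        codebooks.append(centers)
        codes.append(assign)
        residual -= centers[assign]      # quantize what remains
    return codebooks, np.stack(codes, axis=1)   # codes: [n_items, levels]
```

Summing each item's per-level code vectors reconstructs its embedding approximately; the residual shrinks at every level, which is why the codes go coarse-to-fine.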
We’ve also built a trie of all valid Semantic IDs for constrained decoding.
Online: user arrives
User 7291 opens the app. Their history (last 5 engagements, simplified):
| |
Step 1: Build input tokens (Section 4)
First, embed each item’s Semantic ID using the per-level embedding tables:
| |
Then build the interleaved sequence of item tokens and action tokens:
| |
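Steps 1's two moves (per-level embedding lookup, then interleaving with action tokens) can be sketched as follows. The sum-of-levels combination, table shapes, and action names are assumptions for illustration; real systems may concatenate levels instead of summing:

```python
import numpy as np

def item_token(sid, level_tables):
    # Sum the per-level embeddings of a Semantic ID (one table per level).
    return sum(level_tables[l][code] for l, code in enumerate(sid))

def build_sequence(history, level_tables, action_emb):
    """Interleave [item, action, item, action, ...] token embeddings.
    `history` is a list of (semantic_id, action_name) pairs."""
    tokens = []
    for sid, action in history:
        tokens.append(item_token(sid, level_tables))
        tokens.append(action_emb[action])
    return np.stack(tokens)              # shape [2 * len(history), d]
```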
In a real system, there would be hundreds or thousands of engagements (so thousands of tokens), plus the multi-pathway compression from Section 6 for lifetime history. We use 5 here for clarity.
Step 2: Process through HSTU attention (Section 5)
| |
Step 3a: Ranking a candidate (the common case, Section 4)
The retrieval system has already selected 100 candidates. For each candidate, we append its item token to the history and predict the action at the next position:
| |
Repeat for all 100 candidates. Rank by the action scores (e.g., weighted combination of P(long_watch) and P(purchase)). The KV cache means the history computation is done once, not 100 times.
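The candidate-packing trick behind this (M-FALCON-style microbatching, Section 4) reduces to an attention mask: pack all candidates after the shared history in one sequence, let each candidate row attend to the full history and to itself, and never to other candidates. A minimal sketch; the function name and shapes are illustrative:

```python
import numpy as np

def mfalcon_mask(h, n_cands):
    """Mask for scoring n_cands candidates against a shared history of
    length h in a single forward pass."""
    n = h + n_cands
    m = np.zeros((n, n), dtype=bool)
    m[:h, :h] = np.tril(np.ones((h, h), dtype=bool))   # causal over history
    m[h:, :h] = True                                   # candidates see history
    m[np.arange(h, n), np.arange(h, n)] = True         # ...and only themselves
    return m
```

One forward pass with this mask produces the same per-candidate scores as N independent passes, which is the source of the throughput win.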
Step 3b: Generating recommendations directly (the retrieval case, Section 3)
Alternatively, if we’re doing retrieval (not ranking pre-selected candidates), the model generates Semantic IDs autoregressively from the last action position (after a positive action):
| |
Step 4: If alignment has been applied (Section 7)
Without alignment, the model keeps recommending pasta (accurate prediction, bad experience). After DPO alignment, the second recommendation shifts:
| |
The aligned model learned that alternating interests keeps users engaged longer than a pure pasta feed.
11. Choosing Your Path
By now you can derive, justify, and sketch every component. The remaining question is: what do you actually build?
This isn’t a design aesthetics question. It’s a constraints question. Your answer depends on three things:
How much disruption can you absorb? Full generative (replacing DLRM entirely) requires restructuring your retrieval-ranking pipeline and the teams that own each stage. Hybrid (generative pretrain → DLRM fine-tune) changes the embedding layer and nothing else. Most companies start hybrid.
How important is personalization for returning users? If your platform is dominated by power users with rich history, cross-features matter enormously and losing them (as full generative does) is painful. If you’re cold-start dominated (many new users, many new items), the generalization benefits of Semantic IDs matter more and cross-features matter less.
What’s your latency budget? Encoder-decoder with list generation is expensive at serving time. You’re running autoregressive decoding for every request. Decoder-only scoring with KV caching is much cheaper. Hybrid with frozen embeddings is cheapest. The serving path is just your existing DLRM with better embeddings.
The field is moving fast. But the design space is finite, and now you can navigate it.
12. System Design: How It All Serves in Production
An interviewer asking “how would you serve this?” wants to know you understand the physical reality, not just the model math. Here’s how a generative recommender actually runs.
The pipeline hasn’t gone away
Despite the “unify everything” pitch, production systems still use a cascade. The stages are the same as classic DLRM pipelines; what changes is what runs inside each stage:
| |
Where the KV cache lives
In LLM serving, the KV cache is per-conversation and persists across turns. In recommendation, it’s per-request and mostly ephemeral. The user’s history KV is computed at the start of the ranking stage and shared across all candidates via M-FALCON. Once the request is scored, the cache is discarded. There’s no persistent KV store across requests.
Exception: RelayGR (Meta, 2026) pre-computes the long-term history prefix during the retrieval stage and relays it to the ranking stage. This is like LLM’s prefill/decode split: the expensive prefix computation happens early, and the ranking stage only computes the candidate-specific suffix. The cache must stay on the same GPU (remote fetch would blow the latency budget), so the system uses instance-affinity routing.
Training infrastructure
Generative recommenders are trained continuously (streaming), not in static epochs. New engagement data arrives constantly. The model trains on a sliding window of the last N days. Embedding tables (Semantic ID embeddings, action embeddings) are stored in parameter servers on DRAM, not GPU HBM, because they’re too large. Optimizer states use rowwise AdamW to fit in DRAM. The attention layers train on GPU with standard data parallelism. Training throughput is measured in billions of examples per day, not tokens per second.
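The rowwise optimizer-state trick can be sketched: instead of one second-moment scalar per parameter, keep one per embedding row, shrinking that part of the optimizer state by roughly the embedding dimension. This is a simplified single-step sketch under assumed semantics (real systems also deduplicate row ids and shard the tables across parameter servers):

```python
import numpy as np

def rowwise_adamw_step(table, grad_rows, row_ids, state, lr=1e-3,
                       betas=(0.9, 0.999), eps=1e-8, wd=1e-2):
    """Sparse AdamW-style update with a single second-moment scalar
    per embedding row (state['v'] has shape [n_rows])."""
    m, v, t = state["m"], state["v"], state["t"] + 1
    b1, b2 = betas
    for r, g in zip(row_ids, grad_rows):
        m[r] = b1 * m[r] + (1 - b1) * g
        v[r] = b2 * v[r] + (1 - b2) * float((g * g).mean())  # rowwise moment
        m_hat = m[r] / (1 - b1 ** t)
        v_hat = v[r] / (1 - b2 ** t)
        table[r] -= lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * table[r])
    state["t"] = t
```

Only touched rows are updated, which is what makes streaming training over billion-row embedding tables feasible on DRAM.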
The hybrid serving path
If you took the hybrid approach (Section 8), serving is simpler. Your existing DLRM inference stack doesn’t change. The pretrained embeddings are exported as a lookup table, loaded into the feature store alongside your other features. The DLRM model runs on CPU or GPU as before, just with better embeddings. No autoregressive decoding, no KV cache, no new serving infrastructure.
13. Interview Questions and How to Answer Them
These are the questions a staff-level interviewer will ask about generative recommendation. For each, the section reference tells you where the deep answer lives.
Conceptual questions
Q: Why can’t DLRM models scale with compute? Two reasons: (1) the task is too easy (binary click prediction gives 1 bit of supervision per item, the model plateaus quickly), and (2) item IDs are atomic (no compositionality, so the model can’t generalize from one item to related items). Generative training fixes both: next-token prediction over Semantic IDs is a harder task (8+ bits per item) that rewards depth, and Semantic IDs give items compositional structure. → Sections 1, 3
Q: What are Semantic IDs and how are they created? Start with collaborative alignment (contrastive loss so items engaged by similar users are nearby in embedding space). Then quantize via RQ-Kmeans (pure nested k-means, no encoder/decoder, used in OneRec production) or RQ-VAE (encoder/decoder with codebook, from image generation, used in TIGER). Explain the level-by-level residual assignment. Mention that RQ-Kmeans outperforms RQ-VAE in practice. → Section 2
Q: Walk me through the HSTU architecture and why each design choice was made. Three problems, three fixes: (1) softmax forces competition between positions → replace with pointwise SiLU so each attention weight is independent, (2) positional encoding can’t capture time gaps → replace with log-bucketed relative attention bias (position + time, ~25 learned parameters each), (3) model can’t represent different action types differently → add U gating matrix (fourth projection) for per-dimension volume control. Then: no FFN (gating absorbs it), fused kernels (simpler activation enables fusion), stochastic length for training. → Section 5
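The first and third fixes can be made concrete in a few lines. A heavily simplified single-head sketch, not HSTU's actual implementation (which adds relative attention biases, length normalization, and layer norms): SiLU replaces softmax so attention weights don't compete across positions, and the U projection gates the output per dimension.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def pointwise_attention(q, k, v, u):
    """HSTU-style attention sketch. q, k, v, u all have shape [T, d]."""
    scores = silu(q @ k.T / np.sqrt(q.shape[-1]))    # elementwise, no softmax:
    causal = np.tril(np.ones(scores.shape, dtype=bool))   # weights independent
    scores = np.where(causal, scores, 0.0)           # causal mask
    out = scores @ v                                 # no normalization to sum 1
    return silu(u) * out                             # U gating: per-dim volume
```

Because the weights are never normalized to sum to one, the total attention "volume" can differ by position and, via the gate, by dimension, which is what lets one model represent clicks and purchases differently.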
Q: How does HSTU handle both ranking and retrieval? Two-token interleaved format: [item, action, item, action, …]. Ranking = predict the action token after seeing the candidate item token. Retrieval = predict the next item token after a positive action. Both are next-token prediction at different positions in the same sequence. Same model, same training, different prediction targets. → Section 4
Q: What’s M-FALCON and why does it matter? Two parts: (1) KV caching (standard, shared history across candidates), and (2) microbatched parallel scoring (the actual contribution). Pack multiple candidates into one forward pass. Modify the attention mask so each candidate sees the cached history but not other candidates. One GEMM instead of N sequential ones. This is what makes the 285× more FLOPs model servable at higher throughput than the DLRM it replaced. → Section 4
Design tradeoff questions
Q: Should I drop dense features like Meta does? Meta says yes and reports it works at their scale. Meituan says “dropping dense features significantly degrades performance, and scaling up cannot compensate.” Nobody has published a clean ablation reconciling these. The safe answer: it depends on your data density and user behavior diversity. If your users have very long, rich histories, the sequence may implicitly encode what dense features would tell you. If your users are sparse or your signal is noisy, you probably need the dense features. → Sections 4, 8
Q: How do you handle the 100K+ user history? Name the approaches by company: HSTU truncates to 4096-8192 and uses Stochastic Length subsampling. OneRec uses multi-pathway compression (recent history at full resolution, lifetime history through k-means + QFormer). VISTA uses linear attention summarization. ULTRA-HSTU uses semi-local attention. Explain why “just use a longer context window” doesn’t work: not a capability constraint, but a throughput/cost constraint at billions of daily requests. → Section 6
Q: What do you lose by going fully generative? Cross-features. DLRM had “user u’s CTR on category c = 18%” as a precomputed feature. The generative model must implicitly discover this from code-prefix patterns in the sequence. Three approaches: (1) trust the sequence (Meta), (2) shove cross-features back in with heterogeneous masking (Meituan’s MTGR, N× training cost), (3) hybrid: generative pretrain then DLRM fine-tune (Alibaba’s GPSD, Netflix FM, Pinterest PinFM). Most companies do (3). → Section 8
Q: How do you ensure the model doesn’t just optimize for clicks? The gating mechanism (Section 5) gives the model capacity to represent different actions differently, but not the incentive to value purchases over clicks. Three levels of intervention: (1) loss weighting during pretraining (weight purchase predictions higher), (2) multi-task heads at serving time, (3) alignment via DPO/ECPO with a reward function that values business metrics over raw engagement. → Sections 5, 7
System design questions
Q: How would you serve a generative recommender at scale? Draw the cascade pipeline: retrieval (autoregressive or ANN, GPU) → ranking (HSTU + M-FALCON, GPU) → reranking (rules or autoregressive list generation). Explain: ranking stage computes user history KV once, shares across all candidates. M-FALCON microbatches candidates for GPU parallelism. KV cache is per-request, not persistent. Latency budget: tens of ms for ranking stage, 100-300ms full pipeline. The hybrid approach is even simpler: existing DLRM stack with pretrained embeddings, no new serving infrastructure. → Section 12
Q: How do you handle cold-start items in a generative recommender? Semantic IDs solve this partially. A new item with Semantic ID (42, 8, 3, 215) shares prefix (42, 8, 3) with known items, so the model immediately knows it’s related to pasta content. But the specific code 215 has never appeared in training sequences, so the model underweights it. Fixes: alignment with cold-start item boost in the reward function, or constrained decoding that biases toward freshness. → Sections 2, 7
Q: Encoder-decoder or decoder-only for list generation? Decoder-only is the simpler default. Autoregressive decoding already gives you interdependent list generation (each item conditioned on previously generated items). OneRec v2 proved this by dropping the encoder. Constrained decoding on pre-retrieved candidates gives you diversity within a fixed candidate set. Encoder-decoder buys you bidirectional history encoding (richer user representation) at the cost of cross-attention overhead at every decoder layer. Most teams should start decoder-only. → Section 9
References
Semantic IDs and item tokenization
- TIGER: Generative Retrieval via Semantic Identifiers — Rajput et al., 2023. Introduced Semantic IDs using RQ-VAE for generative retrieval.
- RQ-VAE — Lee et al., CVPR 2022. The original residual-quantization VAE for image generation that TIGER borrowed.
Transformer architecture for recommendations
- Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations — Zhai et al., Meta 2024. Introduces HSTU, M-FALCON, and the full Meta generative rec system.
Production systems
- OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment — Kuaishou 2025. Full generative system in production; introduces multi-pathway compression and ECPO alignment.
- PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform — Pinterest 2025. Hybrid pretrain-then-finetune approach; shows transfer to production DLRM.
Alignment
- Direct Preference Optimization — Rafailov et al., 2023. The DPO paper that recommendation alignment methods adapt from.
Citation
If you found this post useful, you can cite it as:
| |