The Missing Taste Model

Create with AI needs a compass, not a generator

2026-02-27

Netflix is not just a streaming service. It is a long-running taste measurement machine. Over years, it observes what you start, what you finish, what you abandon, what you rewatch, what you binge at 2 a.m., and what you never touch again. That history lets it build a remarkably dense representation of preference.

In the last era, that advantage mostly meant recommendation. If you could predict what a user would choose from a finite catalog, you won. It was a familiar kind of moat, unfair in an obvious way.

The next era is stranger. The catalog is no longer fixed. Generators can create near-infinite candidates. In that world, the advantage shifts from “what should we show you” to “what should we make for you.” Taste stops being an output of the system and becomes an input to the system.

That sounds like incumbents win by default. They might. But the nuance is where it gets interesting. Recommendation data is not the same thing as generative supervision. Logs tell you what people consumed under exposure constraints, UI bias, and social context. Generators need something harder. They need to learn counterfactual appreciation: how someone would react to things that do not exist yet.

This is why Create with AI still feels like a slot machine. We have models that can generate. What we do not have is a high-resolution way to aim generation toward human taste without flattening it.

In the era of infinite generation, creativity becomes an optimization problem. We are missing the objective.

1. Generation is cheap, steering is scarce

A decade ago, the scarce thing was production. Making a polished image, a good demo, a coherent scene took time, tools, and skill. Now the binding constraint is selection under uncertainty.

When generation is abundant, creating shifts from making one thing to searching among many possible things. That is already how most people use creative models.

Generate a batch
Keep one
Repeat
Hope something lands

Prompting helps you ask for candidates, but it does not solve evaluation. If you cannot evaluate well, you cannot steer well. When you cannot steer, you will optimize whatever proxy is easiest to measure.

In most creative systems today, that proxy becomes some combination of generic quality, coherence, or engagement. These are useful signals, but they are not the thing creators mean when they say “this works.” They are often closer to “this is acceptable to many people” than “this will be loved deeply by the right people.”

So the creative constraint has moved. It is no longer “can we generate.” It is “can we predict and shape appreciation.”

2. The scalar trap deletes peaks

Here is the simplest proof that a single score is the wrong object for culture.

Imagine two artifacts. Two short stories, two character designs, two songs.

Artifact A gets a 7 out of 10 from almost everyone. Pleasant, coherent, broadly fine.
Artifact B splits people. Half give it a 10 and feel obsessed. The other half give it a 4 and bounce fast.

Average them and you get the same score, a 7. Culturally, they are opposites.

Artifact A is consensus-smooth. It offends nobody. It also rarely creates attachment.
Artifact B creates peaks. It becomes a signal, finds a tribe, generates imitation, discourse, remix, and obsession.

If your reward model is a scalar, these two objects can become indistinguishable because the scalar mostly “sees” the mean. Once you optimize the mean, you systematically select for “nobody hates it” content and select against work that some people love intensely.
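To make the arithmetic concrete, here is a minimal sketch using the two artifacts above. The rating lists, the `summarize` helper, and the "love" threshold of 9 are illustrative assumptions, not measured data.

```python
from statistics import mean, pstdev

# Hypothetical ratings on a 1-10 scale, matching the two artifacts described above.
ratings_a = [7] * 100               # almost everyone gives a 7
ratings_b = [10] * 50 + [4] * 50    # half give a 10, half give a 4

def summarize(ratings, love_threshold=9):
    """Return the mean, the spread, and the share of raters at or above love_threshold."""
    return {
        "mean": mean(ratings),
        "spread": pstdev(ratings),
        "love_share": sum(r >= love_threshold for r in ratings) / len(ratings),
    }

summary_a = summarize(ratings_a)
summary_b = summarize(ratings_b)
print(summary_a)
print(summary_b)
```

The means match, so anything that optimizes the mean cannot tell the two apart. The spread and the share of intense fans are where the difference lives, and a scalar discards both.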

This is not a moral argument. It is geometry. A single reward function gives you one hill to climb. Culture is not one hill. It is a landscape with many peaks.

It also happens to match how the most common reward modeling machinery is built today. The canonical setup learns a single reward value from pairwise preferences, then uses that scalar for reranking, fine-tuning, or reinforcement learning. Even when an LLM “judges” outputs, the judgment is usually collapsed into a scalar score or label, because optimization needs a single target to climb.
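One common way to learn a single reward from pairwise preferences is a Bradley-Terry style fit, where the probability that one item beats another is a sigmoid of the difference in their rewards. The essay does not name a specific method, so treat the sketch below as one plausible instance. The function name, the step count, the learning rate, and the comparison data are all assumptions.

```python
import math

def fit_scalar_rewards(comparisons, n_items, steps=2000, lr=0.1):
    """Fit one reward per item by gradient ascent on the Bradley-Terry log-likelihood.

    comparisons is a list of (winner_index, loser_index) pairs.
    """
    rewards = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            # P(winner beats loser) = sigmoid(r_winner - r_loser)
            p_win = 1.0 / (1.0 + math.exp(-(rewards[winner] - rewards[loser])))
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        rewards = [r + lr * g / len(comparisons) for r, g in zip(rewards, grads)]
    return rewards

# Hypothetical data: item 0 is the consensus artifact, item 1 the polarizing one.
# Half the raters prefer the polarizing item, half prefer the consensus item.
split_comparisons = [(1, 0)] * 50 + [(0, 1)] * 50
split_rewards = fit_scalar_rewards(split_comparisons, n_items=2)

# For contrast, a lopsided population where 80 percent prefer item 0.
lopsided_comparisons = [(0, 1)] * 80 + [(1, 0)] * 20
lopsided_rewards = fit_scalar_rewards(lopsided_comparisons, n_items=2)

print(split_rewards, lopsided_rewards)
```

On the split population the two rewards come out equal. The fit records a tie. It has no way to record that half the room preferred the polarizing item, let alone how strongly.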

This is why the failure mode is so consistent. When you compress taste into one number, optimization has only one direction it can reliably follow. It will drift toward broad agreement, not toward deep attachment.

Great creative work is often deliberately polarizing. It breaks expectation, mixes tones, uses codes only some people can read, and commits to choices that will lose part of the room. From the outside, voice looks like decisions that some people recognize as signal and others register as error. Scalar optimization treats that as defect.

This is a big reason AI output so often feels competent but interchangeable. It is not that models cannot produce edge. It is that the surrounding evaluation logic nudges the system toward median acceptability.

If we want Create with AI to mature, we need a different object than a single score.

3. Appreciation is multi-faceted, and taste is conditional

Ask someone why they loved something and they rarely answer with a number. They talk about how it made them feel, what it meant, where it surprised them, whether it felt honest, and which moment landed.

Creative appreciation is not scalar. It is multi-faceted.

A more faithful object is an appreciation vector, a small bundle of signals like these.

Resonance - Did it move me, did it feel true
Novelty - Did it surprise me without feeling random
Coherence - Did it hang together, did it earn its turns
Craft - Did it feel intentional and shaped
Emotional profile - Tension, relief, warmth, dread, awe, comfort
Meaning - Did it say something I can carry
Behavioral intent - Do I want to save it, share it, rewatch it, remix it

Different media emphasize different facets, but the point holds across writing, image, music, and video. Appreciation is naturally a bundle.

Now add the second fact that breaks scalar reward even if you keep adding more labels. Taste is conditional.

The same artifact can be cathartic for one person and melodramatic for another. Iconic to one subculture and cringe to another. Coherent to a genre-literate reader and confusing to a casual viewer. “Good” is rarely a property of the artifact alone. It is a relationship between the artifact, the person, and the context.

If we name it minimally, we get a clean target. Let x be the artifact, u a person or taste profile, c context, and R the response vector. The object we actually want is the distribution of R given x, u, and c, written p(R | x, u, c): the spread of responses to an artifact, conditioned on who is responding and in what context.

Not “is it good.” More like “who loves it, who hates it, what does it make them feel, and how confident are we.”
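As a rough illustration of that target, the sketch below estimates a reaction map empirically: it groups observed responses to one artifact by person and context, averages each facet, and keeps the sample count as a crude confidence cue. The `Reaction` type, the three-facet subset, the cluster labels, and the sample data are hypothetical. A real system would learn a predictive model rather than average logs.

```python
from collections import defaultdict
from dataclasses import dataclass

FACETS = ("resonance", "novelty", "coherence")  # a hypothetical subset of the facets above

@dataclass(frozen=True)
class Reaction:
    artifact: str      # x
    profile: str       # u, here a coarse taste-cluster label
    context: str       # c
    response: tuple    # R, one value in [0, 1] per facet in FACETS

def reaction_map(reactions, artifact):
    """Group responses to one artifact by (profile, context) and average each facet.

    Returns {(profile, context): {"mean": {...}, "n": count}}; n is a crude confidence cue.
    """
    groups = defaultdict(list)
    for r in reactions:
        if r.artifact == artifact:
            groups[(r.profile, r.context)].append(r.response)
    result = {}
    for key, responses in groups.items():
        n = len(responses)
        means = {f: sum(resp[i] for resp in responses) / n for i, f in enumerate(FACETS)}
        result[key] = {"mean": means, "n": n}
    return result

# Made-up observations for illustration only.
reactions = [
    Reaction("story_b", "genre_literate", "solo_evening", (0.9, 0.8, 0.9)),
    Reaction("story_b", "genre_literate", "solo_evening", (0.7, 0.6, 0.7)),
    Reaction("story_b", "casual", "solo_evening", (0.2, 0.7, 0.3)),
    Reaction("story_a", "casual", "solo_evening", (0.6, 0.3, 0.8)),
]
story_b_map = reaction_map(reactions, "story_b")
for key, cell in story_b_map.items():
    print(key, cell)
```

The output is a map keyed by who and when, not a verdict. The same story reads as resonant and coherent to one group and as neither to another, and the map keeps both readings.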

We have early scaffolds. We have preference models trained on comparisons, aesthetic scorers, LLM judges that critique and rank, and engagement metrics that predict clicks and watch time. Most of these collapse into one of three failure modes.

Scalar collapse
One score that averages tastes into a single peak
Judge aesthetics
Models measuring what models think is good, not what diverse humans actually love
Engagement confusion
Optimizing for watch time and clicks, which mixes curiosity, habit, controversy, compulsion, and sometimes joy

What is missing is a taste model that is multi-facet, conditioned, and honest about uncertainty. A system that outputs a reaction map rather than a verdict.

4. We do not have taste coordinates that a generator can actually use

“If taste is conditional, just learn it” sounds easy until you try to make it concrete. A generator cannot condition on vibes. It needs a representation. It needs something like a coordinate system that is stable enough to learn, compact enough to condition on, and meaningful enough to steer.

Right now, creative AI has not converged on what those coordinates should be.

A useful taste coordinate system would do three things at once.

First, it would separate long-term traits from short-term state.
Some preferences are stable, like genre affinity, tolerance for ambiguity, novelty appetite, pacing preference, or how much coherence you need before you disengage. Others are contextual, like mood, intent, social setting, or “tonight I want comfort, not challenge.” If you mix trait and state, personalization becomes brittle. It overfits to the last session and forgets the person.

Second, it would turn appreciation into controllable facets rather than a single vibe.
If a creator says “make it new but intact,” the system needs a representation where “new” and “intact” are separately learnable and separately steerable. Otherwise “novelty up” becomes “weirdness up,” and the model will happily break coherence while still scoring high on a crude novelty proxy.

Third, it would work across modalities without being a different universe each time.
If someone’s taste is consistent across writing, music, images, and video, the system should be able to reuse part of that representation. That does not mean one unified axis for everything. It means shared latent structure, like tone preference, irony tolerance, tempo and pacing appetite, preference for clarity versus mystery, preference for sincerity versus stylization.
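The first property, separating trait from state, can be sketched with two running averages that update at different speeds. This is only one possible design. The `TasteProfile` class, the axis name, the starting value of 0.0, and both update rates are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TasteProfile:
    """Hypothetical two-timescale profile: slow-moving traits, fast-moving session state."""
    trait: dict = field(default_factory=dict)
    state: dict = field(default_factory=dict)
    trait_rate: float = 0.02   # assumed: traits drift slowly
    state_rate: float = 0.5    # assumed: state tracks the current session closely

    def observe(self, signal):
        """Blend one observation (axis -> value in [0, 1]) into both timescales."""
        for axis, value in signal.items():
            old_trait = self.trait.get(axis, 0.0)
            old_state = self.state.get(axis, 0.0)
            self.trait[axis] = old_trait + self.trait_rate * (value - old_trait)
            self.state[axis] = old_state + self.state_rate * (value - old_state)

    def end_session(self):
        """Drop short-term state; long-term traits persist."""
        self.state = {}

    def conditioning(self):
        """Return trait and state separately so a generator can weight them differently."""
        return {"trait": dict(self.trait), "state": dict(self.state)}

profile = TasteProfile()
for _ in range(3):
    profile.observe({"comfort_over_challenge": 1.0})   # "tonight I want comfort"
during_session = profile.conditioning()
profile.end_session()
after_session = profile.conditioning()
print(during_session)
print(after_session)
```

Three strong comfort signals in one evening move the state most of the way and the trait only slightly. When the session ends, the state resets and the person is still recognizable, which is the brittleness the mixed representation fails to avoid.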

So what do we actually measure to build this, without turning it into a research fantasy?

The measurement primitives that scale are not mystical.

Comparisons
A or B, given a concrete intent. Which ending is more cathartic. Which character silhouette is more iconic. Which hook makes you want the drop.
Aspect judgments
Not “is it good,” but “what kind of good.” More coherent or more novel. More tender or more intense. More sincere or more ironic. These are the beginnings of a real control surface.
Edits as supervision
When a creator rewrites a line, trims a beat, changes a chord, or adjusts a color grade, that edit is a preference signal with structure. It says what the artifact was missing and what the creator wanted instead.
Moment annotations
The line that broke it. The cut that made it. The second where the tone flipped and part of the audience left. These signals tie taste to structure, not just outcomes.
Stickiness
The week-later test. The difference between “liked it” and “it stayed with me” is the difference between sugar and meaning.

These primitives point to a more concrete formulation, without requiring heavy math. You learn a compact taste profile from a user’s history and feedback. You learn a model that predicts a multi-facet reaction profile for each candidate. Then you use that profile to steer generation and to select a small portfolio of candidates that satisfy constraints.

This is also where a product truth shows up. Taste dials are not sliders.

Creators rarely mean “give me 30 percent novelty and 70 percent coherence.” They mean constraints.

Make it new, but do not break coherence
Make me cry, but not from tragedy, from relief
Make it weird, but still sincere
Make it intense, but not stressful

Those are boundaries. They imply the tool should return a small portfolio of distinct candidates that explore the tradeoffs, each with a predicted reaction profile. That is how you preserve peaks without pretending there is one global optimum.
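Here is a minimal sketch of boundaries instead of sliders, assuming each candidate already carries a predicted facet profile. Constraints are expressed as allowed ranges, infeasible candidates are dropped, and a greedy farthest-point pass picks a small set of mutually distinct survivors. The facet names, the bounds, the candidates, and the alphabetical seed are all hypothetical, and farthest-point selection is a stand-in for whatever diversity rule a real tool would use.

```python
import math

def satisfies(profile, bounds):
    """bounds maps facet -> (low, high); a candidate passes only if every bounded facet is in range."""
    return all(low <= profile[f] <= high for f, (low, high) in bounds.items())

def distance(p, q):
    """Euclidean distance between two profiles that share the same facets."""
    return math.sqrt(sum((p[f] - q[f]) ** 2 for f in p))

def select_portfolio(candidates, bounds, k=3):
    """Keep feasible candidates, then greedily add the one farthest from those already chosen."""
    feasible = {name: prof for name, prof in candidates.items() if satisfies(prof, bounds)}
    if not feasible:
        return []
    chosen = [min(feasible)]  # arbitrary deterministic seed: alphabetically first
    while len(chosen) < min(k, len(feasible)):
        remaining = [n for n in feasible if n not in chosen]
        best = max(remaining, key=lambda n: min(distance(feasible[n], feasible[c]) for c in chosen))
        chosen.append(best)
    return chosen

# "Make it new, but do not break coherence" as ranges rather than weights.
bounds = {"novelty": (0.6, 1.0), "coherence": (0.7, 1.0)}
candidates = {
    "a": {"novelty": 0.90, "coherence": 0.40, "intensity": 0.50},  # new, but breaks coherence
    "b": {"novelty": 0.70, "coherence": 0.80, "intensity": 0.20},
    "c": {"novelty": 0.72, "coherence": 0.82, "intensity": 0.25},  # near-duplicate of b
    "d": {"novelty": 0.90, "coherence": 0.75, "intensity": 0.90},
    "e": {"novelty": 0.30, "coherence": 0.95, "intensity": 0.50},  # coherent, but not new
    "f": {"novelty": 0.65, "coherence": 0.95, "intensity": 0.60},
}
portfolio = select_portfolio(candidates, bounds, k=3)
print(portfolio)
```

The candidates that violate a boundary never enter the portfolio, no matter how well they score elsewhere. The near-duplicate is skipped in favor of options that explore different tradeoffs inside the allowed region.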

5. Incumbents, challengers, and what the real moat becomes

This brings us back to Netflix and every platform that has spent years learning preference. In recommendation, the advantage was clear. More history meant better ranking. In generation, the advantage is still real, but it is not automatic.

Here is what incumbents genuinely have.

Longitudinal preference traces
Years of behavior, not just one-off clicks.
A mature representation pipeline
User profiles, item embeddings, refresh cycles, and systems that keep representations stable enough to serve at scale.
Distribution and feedback loops
The ability to ship changes, observe outcomes, and iterate continuously.

But here is what they do not automatically have.

Counterfactual preference labels on generated variants
Logs are choices under exposure constraints. A generator needs preference signals over candidates that never existed in the catalog.
Facet-level supervision
Engagement tells you “continued watching,” not whether something landed because of resonance, novelty, comfort, or compulsion.
A conditioning interface that creators actually use
“Personalized generation” is not just a model feature. It is a workflow that elicits intent, constraints, and iterative feedback.

This is where challengers have a real opening. A new creative tool does not need Netflix-scale history if it owns the right interaction loop. Tools that capture comparisons, edits, and moment-level feedback during creation can accumulate the kind of structured taste data that actually conditions generation. Over time, they can build a taste engine that is closer to “how do I make something you will love” than “what existing thing will you click.”

The practical implication is that the moat shifts. It is less about owning a catalog and more about owning a taste training loop.

The strongest version of that loop looks like this.

The system elicits a small number of targeted comparisons early
It learns a compact taste profile that updates over time
It predicts multi-facet reaction profiles for candidates
It generates portfolios that satisfy constraints rather than chasing a single score
It learns from edits and moment annotations, not just watch time

Creators do not want a leaderboard. They want three answers before they ship.

Where are the peaks
Who will love this intensely
Where is the split
Who will bounce, and what caused it
How confident is the prediction
Is the system guessing or does it actually know this audience
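Those three questions can be sketched as a small report over taste clusters. The version below assumes predicted or observed ratings per cluster and uses a Wilson score lower bound as one possible way to express confidence in the share of intense fans. The cluster names, the rating thresholds, and the data are made up for illustration.

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a proportion (z=1.96 is roughly 95 percent)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom

def pre_ship_report(ratings_by_cluster, love_at=9, bounce_at=4):
    """Per cluster: share who love it, share who bounce, and a cautious bound on the love share."""
    report = {}
    for cluster, ratings in ratings_by_cluster.items():
        n = len(ratings)
        if n == 0:
            continue  # no evidence for this cluster, so report nothing
        loves = sum(r >= love_at for r in ratings)
        bounces = sum(r <= bounce_at for r in ratings)
        report[cluster] = {
            "love_share": loves / n,
            "bounce_share": bounces / n,
            "love_share_lower_bound": wilson_lower_bound(loves, n),
            "n": n,
        }
    return report

# Hypothetical ratings on a 1-10 scale for one artifact.
ratings_by_cluster = {
    "genre_devotees": [10] * 40 + [9] * 10 + [6] * 10,   # a well-sampled peak
    "casual_viewers": [4] * 30 + [3] * 10 + [7] * 20,    # where the split shows up
    "new_subculture": [10, 10, 9, 10],                    # looks like a peak, but tiny sample
}
report = pre_ship_report(ratings_by_cluster)
for cluster, row in report.items():
    print(cluster, row)
```

The small cluster shows a perfect love share, but its lower bound sits well below the well-sampled peak. That gap is the difference between the system guessing and the system knowing the audience.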

Once you can answer those, Create with AI stops being “generate and pray.” It becomes steerable. You specify intent and constraints, generate a portfolio, inspect predicted reaction maps across taste clusters, iterate where the map shows a break, then ship knowing which peak you chose and what you traded off.

We are not there yet. The missing pieces are not only bigger generators. The missing pieces are taste coordinates, the right supervision, and product loops that make taste learnable without flattening culture.

Closing

We are living through a strange phase where creative models can make almost anything, yet the output often feels interchangeable. That is not a permanent ceiling. It is what happens when you optimize the wrong target.

When you compress taste into a score, you converge to the median. When you optimize engagement, you mistake compulsion for love. Culture is a landscape with many peaks.

The next era of Create with AI belongs to systems that can map that landscape and then generate toward it. Multi-facet appreciation instead of scalar judgment. Conditional taste instead of universal aesthetics. Boundaries and tradeoffs instead of sliders and averages. Portfolios of peaks instead of one best answer.

Generators make possibilities cheap. Taste models make direction possible. Creation happens when you have both.

