Evaluative vs generative judgment

For most artifacts that matter, recognizing that an output is right is cheaper than producing the right output. Every reviewer relies on this asymmetry. The two judgments, evaluative and generative, are not interchangeable: they develop through different practice, train different cognition, and sit on opposite sides of the workflows AI is collapsing. The asymmetry is the engineering target for the discovery system that current LLMs do not yet form, and the structural reason senior judgment stays scarce while output becomes cheap.

The asymmetry

For most artifacts that matter, recognizing whether an output is right costs less than producing the right output in the first place. A senior attorney can read a brief in twenty minutes and tell you what is wrong with it; writing the brief took the junior associate three days. A code reviewer can flag a subtle race condition in a function; designing and writing the function was the harder week. A literary editor can mark where a manuscript’s structural seams hold and where they slip; the manuscript took a year to write. The asymmetry holds across professional domains where evaluation operates on a finished artifact while generation has to commit to a path through a search space with no certainty that the path leads anywhere worth landing.
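
The cost gap can be made concrete with a toy search problem. The sketch below is an illustration only, not a model of professional review: it uses subset sum as a stand-in, and the function names `verify` and `generate` and the sample numbers are assumptions introduced here. Checking a proposed answer touches each of its elements once, while producing an answer by brute force may try many candidate subsets first.

```python
from itertools import combinations


def verify(numbers, target, candidate):
    """Evaluation: check a proposed subset against the target.

    Returns (is_valid, steps), where steps counts elements inspected.
    """
    pool = list(numbers)  # copy so duplicates are consumed correctly
    total = 0
    steps = 0
    for x in candidate:
        steps += 1
        if x not in pool:
            return False, steps  # candidate uses a number we do not have
        pool.remove(x)
        total += x
    return total == target, steps


def generate(numbers, target):
    """Generation: brute-force search for a subset that hits the target.

    Returns (subset_or_None, tried), where tried counts candidate subsets.
    """
    tried = 0
    for size in range(1, len(numbers) + 1):
        for subset in combinations(numbers, size):
            tried += 1
            if sum(subset) == target:
                return subset, tried
    return None, tried


if __name__ == "__main__":
    # Hypothetical sample data, chosen only for illustration.
    numbers = [3, 34, 4, 12, 5, 2]
    target = 9
    found, tried = generate(numbers, target)
    ok, steps = verify(numbers, target, found)
    print(f"generation tried {tried} candidates; verification took {steps} steps")
```

On this input the search tries more candidates than the check takes steps, and the gap widens as the list grows, since brute force faces up to 2^n subsets. Real artifacts lack a crisp target, so treat the counts as a direction, not a measurement.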

The asymmetry has limits worth naming. It can fail or invert in domains with a fully mapped search space, where a known procedure makes production as cheap as, or cheaper than, checking optimality (sorting, certain endgame studies in chess). It can also fail in domains where verification requires re-derivation of comparable complexity to the original construction (long mathematical proofs whose check is itself nontrivial). But in the broad class of artifacts professional practitioners produce under real conditions, including briefs, code, manuscripts, plans, and designs, recognition is structurally cheaper than production. The asymmetry is the basis for the entire institution of professional review.

What each judgment trains

The two judgments are not interchangeable. They develop through different practice and train different cognition.

Generative judgment is built by doing the work. The junior attorney drafting briefs at three days each is building the architectural intuition that will eventually let them direct strategy. The practice of producing the artifact under deadline pressure, with real stakes, encodes the constraints, failure modes, and tacit invariants that no review-only training can supply. The resulting cognition is intuitive at expert level.

Evaluative judgment is built by reviewing the work. The partner who marks up briefs internalizes a different set of patterns: what good looks like across many drafts, what the most common failure modes are, what a brief is missing that the writer was too close to see. Evaluative judgment is what makes a senior reviewer’s twenty-minute pass produce notes the junior could not have written about their own draft. Review-trained cognition is also intuitive at expert level; it is trained against a different feedback signal.

Traditionally, evaluative judgment was built on top of generative judgment. You did the work for years, then reviewed others’ work for more years, and the review skill compounded against the building skill. When AI compiles away the building step, the apprenticeship pipeline breaks. The junior becomes a reviewer of output they lack the experience to evaluate. The default assumption that AI-mediated review pipelines produce an adequate substitute is unsupported.

What this distinction is not

The distinction is not Kahneman’s System 1 versus System 2. System 1 / System 2 is about speed and effort. Evaluative and generative judgment can each be slow and effortful when done deliberately, and each can operate intuitively at expert level. The split cuts a different axis: what artifact the cognition is trained against and what feedback signal builds it.

The distinction is not metacognition. Metacognition is reasoning about one’s own reasoning. Evaluative judgment is reasoning about an external artifact someone else produced. A reviewer evaluating a brief is not introspecting on their own production process; they are checking the artifact against accumulated patterns of what good and bad output look like in this domain.

The distinction is not skill versus intuition. Both judgments are skilled, and both can become intuitive at expert level. The split is between producing under uncertainty and grading a finished artifact under tighter constraints. The methods that train each are different.

Related concepts

  • Four engine model of discovery. Evaluative judgment is the human capacity closest to the taste engine. The four-engine model names the architectural role (taste; “evaluation” in the AI literature); this concept names the human judgment asymmetry that role rests on.
  • Externalized taste. Evaluative judgment is the cognitive capacity; externalized taste is the recorded corpus that capacity has produced in writing about specific artifacts.
  • Ghost GDP. The risk when generative judgment is compiled away and the developmental pipeline that builds evaluative judgment on top of generative practice dries up.
  • Compilation thesis. The thesis names where AI generation compounds. The evaluative-generative split names what humans bring to the loop after generation becomes cheap.

Status

Living. Certainty likely. The recognition-versus-production asymmetry has empirical support across professional domains, including code review, scientific peer review, and editorial practice. The training-pipeline implication (evaluative judgment built on top of generative practice) is the claim this note asserts. Importance 7.
