Why code compiles and law doesn't

The next AI transition is gated on who builds the verifier, not who trains the biggest model.

2026-04-20

Two years into the era of reinforcement learning from verifiable rewards, something strange has happened to the landscape of AI capability. Code models write working code at a level that would have seemed implausible in 2023. Math models solve olympiad problems that defeated the field a year earlier. But legal reasoning has not moved comparably. Neither has clinical judgment. Neither has literary writing. The gap between the fast-compiling domains and the stalled ones has widened, not closed.

The standard explanations are in circulation. There is the capability-gap story, which says models will eventually get smart enough and all domains will compile. There is the data-gap story, which says we just need more training data in the slow domains. There is the scaffolding-gap story, which says better agents and retrieval will unlock what base models cannot. Each of these is partly true, and each of them misses what is actually going on.

The gap is structural. It is set by a property that a domain either has or lacks: whether verification is structurally cheaper than generation within it. Where that asymmetry exists, training loops close, data scales with compute, and compilation accelerates. Where the asymmetry fails, data is capped by human effort, and no amount of capability growth closes the remaining ground. This is the predictive edge of the frame. Some domains will compile and some structurally will not, and the difference between them is not how hard the domain is for AI; it is whether the domain supplies the asymmetry that makes scaling work.

This essay develops that claim. It extends the Compilation Thesis by specifying the mechanism underneath compilation, and it ends with a concrete research program the field has been underweighting: not better base models, but verifier construction.

What RLVR actually exploits

There is a structural observation from computer science worth borrowing here. P vs NP is usually stated as a question about which problems are tractable, but the insight underneath it is about asymmetry. For some problems, finding an answer is hard while checking one is easy. Given a proposed solution to a Sudoku, you can verify it in seconds. Finding the solution from scratch is much harder. That asymmetry, between the cost of generation and the cost of verification, is what makes a whole class of search problems tractable when you have candidates to check.
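The Sudoku case can be made concrete with a minimal sketch. The function name and the example grid are illustrative assumptions, not anything from a particular library. The checker makes one pass over 27 groups of nine cells, while a solver would have to search the space of fillings.

```python
def is_valid_sudoku_solution(grid):
    """Check a completed grid (assumed 9x9): each row, column, and 3x3 box holds 1..9 once."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r + dr][c + dc] for dr in range(3) for dc in range(3)}
        for r in range(0, 9, 3)
        for c in range(0, 9, 3)
    ]
    return all(group == digits for group in rows + cols + boxes)


# A valid grid built from a shifting pattern (illustrative input only).
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Swapping two cells in one row keeps that row valid but breaks two columns.
broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]

print(is_valid_sudoku_solution(solved))
print(is_valid_sudoku_solution(broken))
```

The checker never needs to know how the grid was produced, which is the property the rest of the argument leans on.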

Reinforcement learning from verifiable rewards is an applied form of this insight. The training loop generates candidate solutions, runs them through a cheap verifier, and trains on what survived. The efficiency of the loop is set by exactly the ratio the P vs NP frame cares about, generation cost over verification cost. Code and math are high-ratio domains. Pytest costs milliseconds; writing correct code takes real cognitive work. Lean’s type checker runs in seconds; finding the proof can take a mathematician a career. The asymmetry is enormous, and that is what made these domains compile first. It is not that they are easy. It is that the verification-generation gap lets the training loop scale.
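A toy version of the generate, verify, keep loop might look like the sketch below. Everything in it is a stand-in chosen for illustration: the "model" is a random guesser, the task is finding a nontrivial factor of an integer, and the training step that would consume the kept pairs is omitted.

```python
import random


def verifier(problem, candidate):
    """Cheap check: is the candidate a nontrivial factor of the problem?"""
    return 1 < candidate < problem and problem % candidate == 0


def generate(problem, rng):
    """Stand-in for a model: an unreliable guesser (hypothetical, for illustration)."""
    return rng.randint(2, problem - 1)


def collect_verified(problems, samples_per_problem, seed=0):
    """Sample candidates per problem and keep the first one the verifier accepts."""
    rng = random.Random(seed)
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            candidate = generate(problem, rng)
            if verifier(problem, candidate):
                kept.append((problem, candidate))
                break  # one verified solution per problem is enough here
    return kept


verified_pairs = collect_verified([91, 221, 323, 437], samples_per_problem=200)
print(verified_pairs)
```

The point of the sketch is the cost structure: each verifier call is a single modulo operation, so the loop can afford to throw away almost everything the generator proposes.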

The question the rest of this essay answers is what a domain needs for that gap to exist.

The three preconditions

Three properties together determine whether a domain has the asymmetry. Code and math have all three. The domains that compile slowly fail different subsets.

Input completeness. The problem the verifier is checking must be fully specified. A math theorem statement is complete. The claim to be proved is written down, and whatever is proved either matches it or doesn’t. A function signature plus a test suite is complete. The function’s inputs and expected outputs are fixed. A client’s legal situation is not complete. Facts that matter to the case will surface during discovery, or depositions, or the opposing side’s filings. A patient’s presentation is not complete either. Symptoms evolve, comorbidities interact, the true diagnosis is often something the doctor has not yet considered. Without input completeness, the verifier has nothing well-defined to check against. It cannot say “this output is correct for that input” because the input is still moving.

Consensus verifiability. There must be a unique right answer, or close enough to one that qualified reviewers agree. A proof either type-checks or it does not. A test either passes or it fails. A legal brief’s correctness is in principle consensus-verifiable, in the sense that sufficiently informed attorneys could read it and agree on whether it does the work, even if that verification is expensive in practice. The failure in the legal case is input completeness, not consensus. A short story’s quality is not consensus-verifiable at any level. Different readers have different taste. Different traditions have different standards. Literary judgment does not converge, and there is no unique ground truth to verify against even with infinite reviewer time and perfect attention.

Step-terminating verifiability. The verifier must return signal fast enough to inform the next step. Pytest returns in milliseconds. Lean returns in seconds. The signal arrives while the training loop is still running. A trial’s outcome returns in years, and even then it is confounded by everything that happened between the decision and the verdict. A patient’s true prognosis may not be knowable for decades, and the interventions along the way change what the outcome means. Without step-terminating signal, you can train on historical data but you cannot close the interactive loop that actually accelerates compilation.

Code and math have all three properties. Legal reasoning fails the first. Medical diagnosis fails the first and partially fails the third. Creative writing fails the second and the third. Three preconditions, different patterns of failure, different compilation trajectories. The pattern explains what the capability-gap and data-gap stories cannot: why some domains compile fast while others stall, even at comparable levels of investment.
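The failure patterns can be written down as a small lookup, offered only as a sketch. The statuses are transcribed from the summary above. Any entry the essay does not mention is filled in as "holds", which is an assumption of this sketch, and "partial" is used only where the text says "partially fails".

```python
PRECONDITIONS = (
    "input completeness",
    "consensus verifiability",
    "step-terminating verifiability",
)

# Unmentioned entries are assumed to hold (an assumption of this sketch).
DOMAINS = {
    "code": ("holds", "holds", "holds"),
    "math": ("holds", "holds", "holds"),
    "legal reasoning": ("fails", "holds", "holds"),
    "medical diagnosis": ("fails", "holds", "partial"),
    "creative writing": ("holds", "fails", "fails"),
}


def failure_pattern(domain):
    """Return the preconditions a domain does not fully satisfy."""
    return [name for name, status in zip(PRECONDITIONS, DOMAINS[domain]) if status != "holds"]


def supports_verifier_loop(domain):
    """A domain supports a closed verifier loop only if nothing fails."""
    return not failure_pattern(domain)


for domain in DOMAINS:
    print(domain, failure_pattern(domain))
```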

Phrases like “legal reasoning” and “medical diagnosis” are coarse buckets covering internally heterogeneous work. Legal research and document-bounded review sit closer to the asymmetry than trial strategy. Medical imaging and drug dosing sit closer than open-ended clinical reasoning. The category-level claim is about the prestige core of each domain, not every task inside it. Decomposition later in the essay turns on exactly this heterogeneity.

What this looks like in practice

When all three preconditions hold, something remarkable becomes possible. You can generate training data synthetically. Propose candidate problems, let the verifier filter solutions, keep the verified pairs. The dataset grows with compute, not with human effort. Even PRM800K, whose labels came from human annotators, reaches 800,000 step-level correctness labels across 75,000 math solutions. Code datasets with verified test coverage are effectively unbounded. The verifier is not just a signal during training. It is the mechanism by which the training set itself scales.

Expert-judgment datasets in professional and creative domains do not do this. The most-cited creative-writing evaluation benchmark, Chakrabarty’s 2024 CHI paper on whether LLM judges can stand in for expert writers, uses 48 stories evaluated across 14 binary expert tests. That comes to something like 672 labeled judgments, and the paper’s headline finding is that the LLM judges correlated near zero with the expert evaluations. Clinical reasoning benchmarks tend to be in the low thousands of cases. Legal reasoning benchmarks are similar. The gap between code-math training scale and expert-judgment training scale is not a fixed multiple. That is the deeper point. One curve scales with compute. The other scales with human effort. Over any meaningful time horizon, compute-scaling curves dominate human-effort curves by arbitrary amounts. The gap is not a number. It is an ever-widening divergence.
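The divergence claim is simple arithmetic, sketched below under loudly assumed numbers. The linear rate (one benchmark's worth of expert labels per year) and the one-year doubling time for compute-scaled labels are both illustrative assumptions, not measurements. Only the 48 by 14 figure comes from the text.

```python
EXPERT_JUDGMENTS = 48 * 14  # stories x binary tests in the benchmark cited above


def human_effort_labels(years, labels_per_year=EXPERT_JUDGMENTS):
    """Linear growth: a fixed expert budget yields a fixed number of labels per year (assumed)."""
    return labels_per_year * years


def compute_scaled_labels(years, labels_in_year_zero=EXPERT_JUDGMENTS, doubling_years=1.0):
    """Exponential growth: verified labels track compute, assumed to double every doubling_years."""
    return labels_in_year_zero * 2 ** (years / doubling_years)


def divergence(years):
    """Ratio of compute-scaled labels to human-effort labels after the given number of years."""
    return compute_scaled_labels(years) / human_effort_labels(years)


for years in (1, 5, 10):
    print(years, round(divergence(years), 1))
```

Under these assumptions the ratio is 2 after one year, 6.4 after five, and 102.4 after ten. Changing the assumed rates changes the numbers but not the shape: any exponential eventually outruns any linear curve.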

Anyone who has tried to build domain-specialized language models has run into this wall, and it is always the same wall in a different shape. You start by assuming it is a tooling problem, or a labeling-budget problem, or a labor-market problem. You spend six months treating it as any of those. Eventually you realize the domain is not withholding data. It is withholding an asymmetry, and no amount of collection effort changes what is structurally available to collect.

This points to something more fundamental. The verifier needed to close a scalable training loop is not an auxiliary tool. In code and math, the verifier is a separate artifact from the solution. Pytest does not need to write the code. Lean does not need to find the proof. They are independent systems that can check work without being able to do the work themselves. In legal reasoning, a reliable verifier for brief quality at the level needed to close a training loop would require the same judgment as writing the brief. There is no cheaper independent system that can check a brief for quality without reading it the way a senior attorney reads it.

Partial evaluators exist everywhere, and some are useful. The claim is not about them. It is about the verifier required to close a scalable training loop at the level of judgment the domain actually cares about. At that level, in those domains, the cheaper independent verifier disappears into the full expert judgment itself, and the asymmetry that RLVR exploits disappears with it.

The verifier is the domain.

Three ways to build the asymmetry

If compilation rate is set by verification-generation asymmetry, then the load-bearing research program for domain AI is not what the field is currently prioritizing. It is not primarily better base models. It is not better agents or retrieval. It is verifier construction. The three engineering moves that follow are different ways of building the asymmetry where it does not exist on its own.

Decomposition is the move that breaks a verifier-poor task into verifier-rich sub-tasks. Legal work shows the pattern most clearly. The part of legal practice that everyone talks about, the drafting and the strategy, is not the part that has the three preconditions. The part the field treats as grunt work has all of them. Discovery, the process of sorting through millions of documents to find the ones relevant to a case, has input completeness (the documents are fixed), consensus verifiability (qualified attorneys largely agree on whether a document is relevant), and step-terminating verifiability (a senior attorney can check a sample in minutes). Discovery is a verifier-rich sub-task sitting inside a verifier-poor domain. The prediction the framework makes is counterintuitive, and I think it inverts the hierarchy most readers expect. The unglamorous parts of legal work will compile substantially faster than the prestige parts. Brief drafting at senior-associate level and above will stall within the current paradigm, even as discovery work closes much of the gap to expert human performance. This is the general pattern. Decomposition finds the compilable islands inside domains that look uniformly hard from the outside.
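The sample check on discovery can be sketched as a spot audit. The function name, the toy corpus, and the `expert` callable standing in for the reviewing attorney are all hypothetical. The sketch only shows that the verification step touches a small sample while the labeling step touches every document.

```python
import random


def audit_relevance_labels(model_labels, expert_says_relevant, sample_size, seed=0):
    """Estimate agreement between model relevance labels and an expert on a random sample.

    model_labels: dict mapping document id -> bool (the model's relevance call).
    expert_says_relevant: callable standing in for the reviewing attorney.
    """
    rng = random.Random(seed)
    sampled_ids = rng.sample(sorted(model_labels), sample_size)
    agreements = sum(model_labels[doc_id] == expert_says_relevant(doc_id) for doc_id in sampled_ids)
    return agreements / sample_size


def expert(doc_id):
    """Toy expert: every third document is relevant."""
    return doc_id % 3 == 0


# Toy model: agrees with the expert except on documents divisible by 30.
model_labels = {doc_id: expert(doc_id) != (doc_id % 30 == 0) for doc_id in range(10_000)}

agreement = audit_relevance_labels(model_labels, expert, sample_size=200)
print(agreement)
```

Here 200 expert checks give an estimate of label quality across 10,000 documents, which is the kind of fast, bounded signal the third precondition asks for.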

Of the three moves, decomposition is the most broadly reliable. The other two, described below, carry extra dependencies. Proxy construction depends on a naturally occurring correlated signal, which is contingent on the domain. Synthetic ground truth depends on the training signal being rich enough to generalize, which is the scale-solves-everything bet at smaller scale. Decomposition depends only on the domain containing verifier-rich sub-structures, and most large professional domains do. The other two moves work where they work. Decomposition works almost anywhere it can be attempted.

Proxy construction is the move that builds a cheap-to-compute signal that correlates with an expensive-to-compute truth. Unit tests are the canonical example. A test suite does not verify that a program is correct in the full semantic sense. It verifies that the program produces expected outputs on a finite set of cases. The correlation between test-passing and correctness is strong enough to guide development, weak enough to be regularly violated, and the whole software industry’s quality infrastructure is built in the space between those two facts. When Kent Beck formalized test-driven development in the late 1990s, he was naming something engineers already did implicitly. The discipline turned out to be load-bearing not because tests catch bugs, though they do, but because tests create a cheap verifier where none naturally existed. This is the move, industrialized. Benchmarks like MMLU and HumanEval are a second form of proxy, imperfect and gameable and useful precisely because they are cheap. The gameability is itself informative. Goodhart’s law applies to proxies by construction; when the proxy becomes the target, the correlation with the true signal degrades. The engineering discipline is not finding a proxy that cannot be gamed. It is finding a proxy whose correlation with the truth is strong enough to survive being optimized against for a while.
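The gap between passing tests and being correct is easy to show in a small, self-contained sketch. The test suite and the three functions are made up for illustration. One function is right, one is subtly wrong but passes, and one is optimized against the suite and nothing else.

```python
def is_leap_year(year):
    """Correct Gregorian rule."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_leap_year_buggy(year):
    """Plausible but wrong: ignores the century rules."""
    return year % 4 == 0


# A finite proxy for correctness (illustrative test cases).
TEST_CASES = {2023: False, 2024: True, 2000: True}


def is_leap_year_gamed(year):
    """Goodhart in miniature: memorizes the test cases and nothing else."""
    return TEST_CASES.get(year, False)


def passes_suite(fn):
    """True if fn returns the expected value on every case in the suite."""
    return all(fn(year) == expected for year, expected in TEST_CASES.items())


for fn in (is_leap_year, is_leap_year_buggy, is_leap_year_gamed):
    print(fn.__name__, passes_suite(fn), fn(1900), fn(2028))
```

All three functions pass the suite. Only the first gets both 1900 (not a leap year) and 2028 (a leap year) right, which is the space between "strong enough to guide development" and "weak enough to be regularly violated".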

Synthetic ground truth is the move that trains a verifier model on expert data and then uses the trained model as the verifier. Reward models in RLHF are the canonical example. You cannot cheaply ask humans to evaluate every output a model produces during training, so you train a reward model on a finite set of human preferences and use that reward model as a stand-in verifier. The reward model does not measure something different from the truth. It is a model of the truth itself, and that is the distinction from proxy construction. A proxy is a different signal that happens to correlate with the truth; synthetic ground truth is a trained approximation of the truth itself. Its quality depends entirely on how well the training signal captured the thing being approximated, and Chakrabarty’s 2024 result is the cautionary tale here. The study showed that LLM judges correlate near zero with expert writers when evaluating creative prose. Synthetic ground truth works only when the source data is rich enough for the trained verifier to generalize. When the expert signal is too sparse or too contested, the trained verifier captures noise rather than taste, and you get a fluent verifier that agrees with itself and disagrees with the experts the domain actually cares about.
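A minimal sketch of the idea, assuming a linear reward over two made-up features and a Bradley-Terry style preference loss fit by plain gradient ascent. The feature names, the data, and the hyperparameters are all hypothetical, and production reward models are neural networks rather than two weights.

```python
import math


def fit_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit linear reward weights so that sigmoid(r(winner) - r(loser)) is high.

    pairs: list of (winner_features, loser_features) tuples from expert preferences.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for winner, loser in pairs:
            diff = [a - b for a, b in zip(winner, loser)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Gradient of log sigmoid(margin) w.r.t. w is (1 - sigmoid(margin)) * diff.
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            w = [wi + lr * scale * di for wi, di in zip(w, diff)]
    return w


def reward(w, features):
    """Score an output; this is what would stand in for the expert during training."""
    return sum(wi * fi for wi, fi in zip(w, features))


# Hypothetical features: (clarity, length). The toy experts prefer clarity.
preference_pairs = [
    ((0.9, 0.2), (0.1, 0.8)),
    ((0.7, 0.5), (0.3, 0.5)),
    ((0.8, 0.9), (0.2, 0.1)),
]

weights = fit_reward_model(preference_pairs, n_features=2)
print(all(reward(weights, win) > reward(weights, lose) for win, lose in preference_pairs))
```

Agreeing with three training pairs says nothing about agreeing with experts on new cases. That gap, between fitting the collected preferences and tracking the judgment behind them, is where the failure described above shows up.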

The frontier bet, and what it predicts

The frontier bet against this framework is that none of it matters in the limit. A sufficiently capable base model, the argument goes, becomes its own verifier, either by judging its own outputs at inference time with enough reliability, or by generating synthetic expert-judgment data that beats human-collected data at scale. On this view, the asymmetry is a transient fact about current model capability, and enough base-model progress will produce the verifiers the framework says are missing.

The framework’s response is that this conflates two capabilities. Generation capability and verification capability are not the same thing, even in the same model. Nor is the bet analogous to the self-play that mastered Go and chess. Those systems scaled because the rules of the game provided a flawless, essentially free verifier for every simulated outcome, and the verifier was part of the domain itself rather than something constructed on top of it. In professional and creative domains, no such intrinsic verifier exists; any verifier has to be engineered, and that engineering is the hard part. A base model trained on what humans produced does not inherit a free verifier from its training distribution. The ceiling of its verification capability is the ceiling of the judgment it absorbed during training. In domains where the asymmetry fails at the level of expert judgment, that expert judgment is not broadly represented in the training distribution. The expert signal is thin and contested and late and institutional, and the model’s ability to imitate the output does not transfer to the ability to verify it. Chakrabarty’s 2024 study is the existing evidence. The LLM judges in that study were evaluating prose within the range of what current models produce, and their ability to write prose that looked like expert work did not translate into the ability to judge it at expert level. The claim here is not that future models can never become strong evaluators. The claim is that they will not get that capability for free from generation scale alone, in domains where the expert signal the evaluator would need is sparse, late, and contested. The self-verifying base model hypothesis is the scale-solves-everything bet in its strongest form, and the framework predicts specifically where and why it will fail. It will fail in the domains whose asymmetry is structurally absent, which is to say, in the domains the framework was already about.

What the framework gives readers is a tool with immediate use. Ask of any AI research program whether it is building the verifier, or assuming the verifier will appear. Ask of any domain where the asymmetry is, and which of the three moves could engineer it. Much current work in professional and creative AI is implicitly betting on verifiers arriving through capability growth. The framework says this is the wrong bet, not because capability growth does not help but because the dataset ceiling in verifier-poor domains does not move with base-model improvement. The ceiling lifts only when someone builds the verifier, and the work of building it is the work the field has been underweighting.

What gets compiled next is gated on where the verification-generation asymmetry exists or can be engineered, and that is what the Compilation Thesis predicts at this layer. Domains where the asymmetry can be built through decomposition or proxy or synthetic ground truth will compile. Domains where all three moves fail will retain an expensive-to-judge residual at the level of their highest-stakes work. The edge of what AI can do is not set by what AI can think. It is set by what someone has taken the trouble to make verifiable.
