A Mesh Is Not a Door
Why world models will be won in game engines, then move into Physical AI
Reality does not speak English. It speaks state.
Here is a failure mode every large game team recognizes. The player changes the world, but the game’s semantic world does not update.
You drag a vehicle into a narrow alley to block it. Visually, the alley is closed. Physically, the collision is there. But the pedestrians still route through, because the navmesh was baked offline and never learned that “this space is now blocked.” You kick over a pile of props and create new cover, but the cover system still points to the old tagged nodes. You turn a corner, and NPCs behave as if the last thirty seconds never happened, because their internal representation of the scene is a thin overlay on top of a world that has moved.
Those are not “NPC intelligence” problems in the abstract. They are representation problems. The agent is not reasoning over the world the player is in. It is reasoning over a simplified, partially stale model of it.
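A toy illustration of the blocked-alley failure, assuming a 2D grid stands in for a real navmesh. The names `find_path`, `world`, and `baked_nav` are hypothetical and not taken from any engine.

```python
from collections import deque

def find_path(walkable, start, goal):
    """Breadth-first search over a set of walkable (x, y) cells."""
    frontier = deque([start])
    came_from = {start: None}
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in walkable and nxt not in came_from:
                came_from[nxt] = cell
                frontier.append(nxt)
    return None  # no route exists

# A one-cell-wide alley, five cells long.
world = {(x, 0) for x in range(5)}
baked_nav = set(world)           # navigation data baked offline

# The player parks a vehicle in the alley: the live world changes...
world.discard((2, 0))
# ...but the baked navigation data is never told.

stale_path = find_path(baked_nav, (0, 0), (4, 0))   # routes through the vehicle
fresh_path = find_path(world, (0, 0), (4, 0))       # correctly finds no route
```

The pathfinder is identical in both calls. Only the representation it reasons over differs, which is the point of the paragraph above.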
World models are an attempt to close that gap. Not by writing cleverer behavior trees, but by maintaining a stateful representation of the world that stays coherent under intervention. Objects keep identity across occlusion. Affordances update as the scene changes. Constraints remain constraints. The system can step forward under actions and remain consistent.
Games matter here because games already run the loop world models need. Observe, act, update state, repeat. Engines are industrial-grade world runtimes with constraints, persistence, instrumentation, and replay. If world models are going to become reliable systems instead of impressive media, they will converge on that runtime.
Robotics comes next for the same reason. Physical AI is the same loop, except mistakes have mass and cost.
From Clips to Worlds
Here’s an easy test.
Can you step it?
Sampling asks for a future and returns one. Stepping applies an action and forces the next state to be consistent with that action, the prior state, and the world’s rules. Once you can step, correctness stops being aesthetic. It becomes behavioral.
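In systems terms, a world is a state transition process. There is an underlying state s_t you never fully observe. You take an action a_t. The state evolves under rules and randomness. You observe something and act again. A compact way to write the loop, using o_t for the observation, is the following.
s_{t+1} ~ P(s_{t+1} | s_t, a_t)
o_{t+1} ~ O(o_{t+1} | s_{t+1})
Here P is the transition rule, including its randomness, and O is the observation process.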
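In code, the loop might look like the sketch below. It assumes a toy one-dimensional world. `ToyWorld`, `step`, and `observe` are hypothetical names, and the slip probability and sensor noise are invented for illustration.

```python
import random

class ToyWorld:
    """Hidden state is a position on a line; the agent sees it only through noise."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._position = 0  # hidden state s_t

    def step(self, action):
        """Apply action a_t (-1, 0, or +1), advance the state, return an observation."""
        if action not in (-1, 0, 1):
            raise ValueError("action must be -1, 0, or +1")
        slipped = self._rng.random() < 0.1   # randomness: the move sometimes fails
        if not slipped:
            self._position += action
        self._position = max(0, min(10, self._position))  # rule: walls at 0 and 10
        return self.observe()

    def observe(self):
        """Return a noisy reading of the hidden state."""
        return self._position + self._rng.choice((-1, 0, 0, 0, 1))

world = ToyWorld(seed=42)
observations = []
for t in range(20):
    action = 1                       # a fixed policy: always try to move right
    observations.append(world.step(action))
```

The agent never reads `_position` directly. It only sees what `step` returns, and each new state depends on the previous state, the action, and the rules.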
A serious world model is a learned approximation of this process. The point is not that it predicts pixels. The point is that it carries a coherent internal state forward so an agent can plan and act without the world “forgetting” what happened.
That is exactly what breaks in many current systems. The environment updates, but the representation the agent uses does not. Navmeshes stay static. Cover graphs stay static. Affordances are brittle tags. The world the agent reasons over drifts away from the world the player sees.
World models target the missing capability: a representation that updates under intervention, stays stable over horizons, and supports stepping.
That leads to the practical question behind “playable.”
What does playable actually require?
A mesh is not a door
“Playable” is often used as if it were one binary property. It is not. There are levels, and the levels explain why “generate a 3D environment” is not the same as “generate a world.”
Walkable means space is coherent, collision works, and navigation is valid.
Interactable means objects have identity, affordances, and persistent state.
Gameable means rules create a loop. Goals, failure, progression.
Authorable means creators can edit and edits persist.
The jump from “3D” to “playable” sits largely in the interactable layer.
A mesh that looks like a door is not a door. A door is geometry plus collision, a hinge constraint, an interaction affordance, state variables (open/closed, locked/unlocked), and rule logic that makes state matter. A floor is not triangles. It is a traversable surface with navigation semantics that must update when the world changes. Cover is not a visual silhouette. It is a queryable affordance tied to geometry and line-of-sight constraints.
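A minimal sketch of the state and rule-logic half of that definition, assuming geometry, collision, and the hinge constraint are attached elsewhere by the engine. The class and method names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Door:
    """State plus rules; geometry and collision would be attached by the engine."""
    is_open: bool = False
    locked: bool = True

    def affordances(self):
        """Actions currently available, derived from state rather than fixed tags."""
        if self.locked:
            return ["unlock"]
        if self.is_open:
            return ["close"]
        return ["open", "lock"]

    def apply(self, action):
        """Apply an action only if the current state affords it; report success."""
        if action not in self.affordances():
            return False
        if action == "unlock":
            self.locked = False
        elif action == "lock":
            self.locked = True
        elif action == "open":
            self.is_open = True
        elif action == "close":
            self.is_open = False
        return True

    def blocks_passage(self):
        """The rule that makes state matter: a closed door blocks movement."""
        return not self.is_open

door = Door()
first_try = door.apply("open")       # refused: the door is locked
door.apply("unlock")
second_try = door.apply("open")      # accepted
```

Navigation and line-of-sight queries would consult `blocks_passage()` instead of the mesh, so they stay correct as the door's state changes.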
World models become products when they can produce worlds that are executable at these levels, not when they can produce prettier imagery.
And that’s where the real bottleneck shows up. Generated world state is not executable world behavior.
Three routes to world models
You can group most current work into three camps. The lines blur in practice, but the intent is distinct.
Camp A: render-first futures (Sora from OpenAI)
Sora-type systems. The native output is video, and the “world” is implicit inside a latent optimized for plausible frames. These models are valuable priors for appearance and short-horizon motion. They are also useful content engines.
Their default interface is sampling. Stepping and persistence usually require structure layered on top.
Camp B: interactive video worlds (Genie from DeepMind)
Genie-type systems. The output is still 2D frames, but the system is designed to be stepped action by action while maintaining an internal state. The difference from Camp A is not resolution. It is that action is first-class and coherence under intervention is the central requirement.
Camp A samples futures. Camp B simulates under actions.
Camp C: explicit 3D world builders (Marble from World Labs)
Marble-type approaches. The bet is that world state should be explicit as 3D structure that supports novel viewpoints, editing, and export into standard pipelines. View consistency and editability are built into the representation instead of being a property you hope emerges from a video latent.
Camp C is naturally legible to engines because engines already operate on explicit 3D state.
The camps will combine. The likely end state is hybrid, using video priors, interactive rollouts, and explicit structure where it helps.
The practical bottleneck is the same across all of them.
Generated world state is not executable world behavior.
The missing layer is a compiler
World models generate world state. Engines execute world dynamics. The hard part is the translation.
Treat it like a compilation problem.
Camp C makes this easiest to see because it outputs structure. The right output target is not “a mesh.” It is a structured world description that an engine can ingest.
A reasonable intermediate representation looks like a scene description, not a render.
{
  "entities": [
    {
      "id": "door_17",
      "type": "door",
      "transform": { "pos": [1.2, 0.0, -3.4], "rot": [0, 90, 0] },
      "mesh": "door_mesh_A",
      "materials": ["painted_metal"],
      "affordances": ["open", "close"],
      "state": { "open": false, "locked": true },
      "physics": { "mass": 18.0, "hinge_axis": [0, 1, 0] }
    }
  ]
}
The point is not the JSON. The point is discipline. If a model is going to feed an engine, the output has to look like a structured world description, not just geometry. Entities need stable IDs. Types must be explicit. Affordances and state variables must be representable. Physical properties must exist where they matter, not as an afterthought.
From there, the engine can compile structure into executable semantics by attaching components. A “door” is not a mesh. It becomes collision, a hinge constraint, an interaction component, and state replication. A “pickup” becomes a rigid body with grasp affordances and inventory semantics. “Ground” becomes collision geometry and a navmesh bake. A “hazard” becomes a volume plus damage rules. This is where playability is decided. Walkable worlds require valid collision and navigation. Interactable worlds require identity, affordances, and persistent state. Gameable worlds require rule systems that turn state into consequences.
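The compile step described above could be sketched as a lookup from entity type to engine components. The component names below are hypothetical labels drawn from the prose, not identifiers from any real engine.

```python
# Hypothetical mapping from entity type to the components an engine would attach.
COMPONENTS_BY_TYPE = {
    "door": ["collision", "hinge_constraint", "interaction", "state_replication"],
    "pickup": ["rigid_body", "grasp_affordance", "inventory_item"],
    "ground": ["collision", "navmesh_source"],
    "hazard": ["trigger_volume", "damage_rule"],
}

def compile_entity(entity):
    """Turn one scene-description entity into an engine-side record."""
    entity_type = entity["type"]
    if entity_type not in COMPONENTS_BY_TYPE:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    return {
        "id": entity["id"],
        "components": list(COMPONENTS_BY_TYPE[entity_type]),
        "state": dict(entity.get("state", {})),
    }

scene = {"entities": [
    {"id": "door_17", "type": "door", "state": {"open": False, "locked": True}},
    {"id": "floor_01", "type": "ground"},
]}
compiled = [compile_entity(e) for e in scene["entities"]]
```

An unknown type raises instead of silently producing a bare mesh, so a gap in the semantics library shows up at compile time.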
This is also where the hardest practical objection shows up. Semantic errors are catastrophic. A one percent error rate on “walkable floor” is enough to make a game unplayable. That is not a reason to abandon the approach. It is why the compiler layer matters. A serious pipeline treats uncertainty conservatively. Use typed schemas, not free-form tags. Validate, do not guess. Fail closed, not weird. If the model is unsure whether something is a pickup, default to non-interactable. If a floor cannot produce valid collision, reject or repair the scene before it ships. World models become useful when semantic errors turn into debuggable failures, not player-visible glitches. That is the moat.
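A minimal sketch of "validate, do not guess" and "fail closed." It assumes the model attaches a per-entity `confidence` score, which is not part of the scene description shown earlier. The threshold and the `static_prop` fallback type are also assumptions.

```python
ALLOWED_TYPES = {"door", "pickup", "ground", "hazard", "static_prop"}
INTERACTABLE_TYPES = {"door", "pickup"}
MIN_CONFIDENCE = 0.9  # assumed threshold; a real pipeline would tune this

def validate_entity(entity):
    """Return a cleaned entity, demoting uncertain ones to non-interactable props."""
    if not isinstance(entity.get("id"), str) or not entity["id"]:
        raise ValueError("entity needs a stable, non-empty string id")
    entity_type = entity.get("type")
    if entity_type not in ALLOWED_TYPES:
        raise ValueError(f"{entity['id']}: type {entity_type!r} is not in the schema")
    confidence = entity.get("confidence", 0.0)  # a missing score counts as unsure
    if entity_type in INTERACTABLE_TYPES and confidence < MIN_CONFIDENCE:
        # Fail closed: keep the object in the scene, but strip interactivity.
        return {"id": entity["id"], "type": "static_prop", "affordances": []}
    return {"id": entity["id"], "type": entity_type,
            "affordances": list(entity.get("affordances", []))}

sure = validate_entity({"id": "medkit_3", "type": "pickup",
                        "confidence": 0.97, "affordances": ["grab"]})
unsure = validate_entity({"id": "crate_9", "type": "pickup",
                          "confidence": 0.55, "affordances": ["grab"]})
```

Schema violations raise, which makes them debuggable failures. Low-confidence interactables degrade to inert props, which players see as scenery and not as a glitch.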
Engines are executors, models are priors
The obvious objection is compute. Stepping at 60 frames per second is what games do. Running a large neural model at 60 frames per second on consumer hardware is not realistic.
That critique is correct and it points to the right architecture.
Engines remain the executor on the critical path.
World models contribute off the critical path, or at a lower frequency, or only at decision boundaries.
In practice, the early winning systems look hybrid.
The engine runs deterministic stepping, constraints, collisions, and rule logic at frame rate.
The model provides priors and proposals. It generates structured content. It predicts short-horizon outcomes for risky micro-decisions. It fills in plausible state where it is safe to do so. It runs asynchronously and gets cached, distilled, and invoked selectively.
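One way to picture the split is a frame loop in which the engine steps every frame and the model is consulted only every N frames. This is a synchronous toy sketch. A real system would run the model asynchronously, and `Engine`, `SlowModel`, and the proposal rule are all hypothetical.

```python
class Engine:
    """Stand-in executor: deterministic stepping on the critical path."""

    def __init__(self):
        self.frame = 0
        self.position = 0.0
        self.velocity = 1.0

    def step(self, dt):
        self.position += self.velocity * dt
        self.frame += 1

class SlowModel:
    """Stand-in for a learned model that is too expensive to call every frame."""

    def __init__(self):
        self.calls = 0

    def propose_velocity(self, position):
        self.calls += 1
        return 1.0 if position < 0.75 else -1.0   # made-up proposal rule

MODEL_EVERY_N_FRAMES = 30   # assumed budget: two model calls per second at 60 fps
engine, model = Engine(), SlowModel()
dt = 1.0 / 60.0

for _ in range(120):                              # two simulated seconds
    if engine.frame % MODEL_EVERY_N_FRAMES == 0:
        proposal = model.propose_velocity(engine.position)
        engine.velocity = max(-2.0, min(2.0, proposal))   # engine clamps proposals
    engine.step(dt)
```

The engine owns the state and clamps whatever the model proposes, so a bad proposal can at worst produce a legal but suboptimal move.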
That division of labor is not a compromise. It is the only way this becomes a product.
It also reinforces the core thesis. Engines are not getting replaced. Engines are becoming the runtime that makes world models usable.
Why games become the center of gravity
If you accept “compiler plus runtime,” engines become the natural convergence point for three structural reasons.
Semantics and debugging already live there
Engines already define stepping, collisions, navigation, physics approximations, state machines, triggers, and observability. When something fails, you want to see the state. You want to see which constraint was violated. You want reproducibility.
A latent that only renders plausible frames is a poor place to debug. An engine is a good place to debug because state is explicit and constraints are executable.
Shipping pressure turns edge cases into data
Games have a built-in stress test. Players explore adversarially. They do weird things on purpose. They find boundary conditions you never designed for.
World models need that kind of coverage because the hard failures are action-conditioned and long-horizon. A live game can capture traces, mine failures, and turn them into training curriculum. That creates compounding improvement.
Standardization accumulates around platforms
Every foundation-model era converges toward standards. Data formats, tooling ecosystems, integration surfaces.
Engines already play that role for interactive 3D. If world models are going to become a widely usable substrate, their structured outputs and semantics libraries will stabilize around the platforms that already ship worlds.
This is why games lead. They already have the executor, the debugger, the distribution, and the feedback loop.
What world models unlock in games
The runtime story matters because it changes the product frontier.
World models introduce new primitives. Primitives are where platforms shift.
Prompt-to-play creation
Not concept art. A playable slice. Designers can traverse it immediately, feel pacing, adjust layout, and iterate in minutes. The creative loop becomes interactive from the first step, not after a long build.
This is also where Camp C needs a reality check. AAA-grade 3D assets are hard. Topology, rigging, LODs, UVs, performance budgets. Near-term value is not “AI generates production-ready Night City.”
Near-term value is blockouts, structured layouts, and engine-ingestible scene graphs that can be paired with existing asset libraries and tooling. Let the model draft the world structure. Let the pipeline polish it.
Persistent consequence without combinatorial authoring
Not branching trees that explode. Stateful causality that can be maintained. Worlds that remember what players did and remain coherent over long sessions and updates.
NPCs that inhabit the same world dynamics
The leap is not NPCs that talk better. It is NPCs that move, remember, anticipate, and react inside the same rules as the player. In practice, this looks like fewer brittle logic-tree failure modes when the world deviates from the expected script.
Simulation-native QA and balancing
Executable worlds let agents generate adversarial play traces. They can search for degenerate strategies, probe boundary conditions, and surface edge cases early. Human taste remains central. The difference is that blind spots become measurable.
New formats
The deepest shift is not production speed. It is experiences that sit between authored content and emergent worlds, coherent because the runtime supplies rules and the model supplies breadth.
Games are the first place these primitives can ship at scale because games already have the runtime, the toolchain, and the audience that will pressure-test them daily.
Robotics inherits the same stack
Robotics is the same stepping loop, only the constraints are real. Small action differences can cause collisions. Partial observability is the default, so state tracking is never optional. Safety constraints are not design choices. They are hard limits. And honest failure matters because hallucinated success does not just look wrong. It breaks hardware and corrupts learning.
That is why adoption will be staged. The industry will start with short-horizon prediction and action gating near contact and near failure, then move toward fleet replay and behavioral regression, and only then expand planning horizons as verification coverage hardens.
Stage 1: pre-execution prediction near the boundary
The first widespread use is short-horizon prediction around fragile moments. Contact, grasping, insertion, near-collision navigation.
A robot proposes a micro-action. A world model predicts near-term outcomes and flags constraint violations before committing. The system chooses the action most likely to satisfy invariants.
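A minimal sketch of that gating step, assuming a one-dimensional insertion task with a single clearance invariant. `predict_gap` stands in for a learned short-horizon predictor, and the linear rule inside it is purely illustrative.

```python
def predict_gap(current_gap, push):
    """Hypothetical short-horizon predictor: remaining clearance after a push."""
    return current_gap - push

def choose_action(current_gap, candidates, min_clearance=0.5):
    """Pick the largest push whose predicted outcome keeps the clearance invariant."""
    safe = [push for push in candidates
            if predict_gap(current_gap, push) >= min_clearance]
    if not safe:
        return None          # no candidate is safe: refuse to act
    return max(safe)

chosen = choose_action(current_gap=2.0, candidates=[0.5, 1.0, 1.5, 2.0])
refused = choose_action(current_gap=0.4, candidates=[0.5, 1.0])
```

Returning `None` when nothing passes is the robotics version of failing closed: the controller stops and escalates instead of committing to an action predicted to violate a constraint.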
This is where real industry value appears early. You do not need a perfect digital twin of the factory. You need reliable prediction where decisions are brittle.
Stage 2: fleet replay and behavioral regression become normal
As robotics scales, deployment becomes a software problem. Updates must not degrade behavior.
Fleets already log traces. The next step is to treat representative traces as regression tests. After an update, you replay them, measure drift, and block regressions before rollout.
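The replay-and-gate idea can be sketched in a few lines. The trace format, the drift metric (fraction of mismatched actions), and the 10 percent budget are all assumptions chosen for illustration.

```python
def replay(policy, trace):
    """Run a policy over recorded observations and return the actions it picks."""
    return [policy(obs) for obs in trace["observations"]]

def drift(policy, trace):
    """Fraction of steps where the policy departs from the recorded actions."""
    actions = replay(policy, trace)
    mismatches = sum(a != b for a, b in zip(actions, trace["actions"]))
    return mismatches / len(trace["actions"])

def gate_rollout(policy, traces, max_drift=0.1):
    """Return True only if every representative trace stays within the drift budget."""
    return all(drift(policy, t) <= max_drift for t in traces)

def old_policy(distance):
    return "stop" if distance < 1.0 else "go"

def new_policy(distance):
    return "stop" if distance < 0.5 else "go"    # the update brakes later

observations = [2.0, 1.5, 0.8, 0.6, 0.3]
traces = [{"observations": observations,
           "actions": [old_policy(d) for d in observations]}]

old_ok = gate_rollout(old_policy, traces)
new_ok = gate_rollout(new_policy, traces)
```

Here the updated policy changes two of five recorded decisions, so the gate blocks it. Whether that change is a regression or an intended improvement is a human call, but it can no longer ship unnoticed.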
World models strengthen this loop by enabling controlled counterfactuals. Given a recorded situation, perturb micro-actions and evaluate which variants violate constraints. Near-misses become systematic training signal.
Real-world data remains essential. Simulation remains essential. World models become the layer that ties them together with predictive tests and faster iteration.
Stage 3: planning expands as verification coverage expands
Long-horizon planning arrives as verification hardens. Warehouses before homes. Factories before sidewalks. The rollout horizon grows as drift becomes measurable and manageable.
The game connection becomes operational here. Games develop the discipline earlier because they already have stable action interfaces, explicit constraints, replay, instrumentation, adversarial exploration, and a culture of regression testing. Robotics inherits these patterns because it needs them to scale safely.
A direct note to large studios
If you run a large game company and you are defining your foundation-model strategy, do not reduce it to “train a frontier language model.”
Language will remain an interface layer. It will also be broadly accessible through partnerships, licensing, and fine-tuning.
Your durable advantage is transitions.
You own the runtime where actions become consequences. You own telemetry with actions. You own replay infrastructure. You can define constraints. You can integrate world models into the engine as first-class components with validators and regression suites.
The highest-leverage move is not chasing the biggest generator. It is owning the compilation pipeline that turns model outputs into executable worlds.
Partner for language. Own the runtime.
Language models made language programmable.
World models make environments programmable.
To do that, you need executable semantics, constraints, persistence, and operational machinery that keeps behavior stable across updates and edge cases.
Engines already provide that machinery.
That is why world models will be won in games, then move into Physical AI.