The pen that shouldn’t fall
On why current vision-language models confuse describing the world with knowing it, and what an architecture that doesn’t would look like.
Here’s a small test. Take a vision-language model — Claude, GPT-4V, Gemini, any of the open ones — and show it an image: two hands gripping a pen, one from each side, holding it horizontally in the air. Ask: “If I release one hand, what happens to the pen?”
Most models will tell you the pen falls.
Now ask a follow-up: “Look at the image again. How many hands are touching the pen right now?” The model will correctly answer two. Ask: “If one of two hands releases an object, is the object still supported?” It will correctly reason that yes, one remaining hand can support an object.
Put the questions back together and the model still says the pen falls.
This is not a perception failure. The model sees the hands. It’s not a reasoning failure in the abstract — it can do the support logic when asked directly. It’s something else: the model isn’t actually using what it sees to constrain what it says. The image gets converted into language-conditioned tokens, and from that point onward the answer is generated by completing a familiar pattern — released objects fall — rather than by consulting a representation of the actual scene.
I want to argue that this isn’t a bug to be patched. It’s a structural property of how current VLMs are built. And that fixing it requires a specific architectural commitment most current systems don’t make.
The diagnosis
The standard story about VLM failures is some combination of “not enough training data,” “the model lacks physical intuition,” or “scale will eventually fix it.” All three are partially true and none of them are the whole story.
The whole story is closer to this: a VLM doesn’t have a world model. It has a language model conditioned on images. The distinction matters.
A language model conditioned on images takes an image, encodes it, and uses the encoding to bias text generation. The image affects which tokens get produced next. That’s it. There is no persistent representation of “what’s true in this scene right now” that the model can update, query, or contradict. Every output is a token prediction conditioned on the encoded image plus the conversation so far.
This works remarkably well for tasks where the answer follows linguistically from the image content. “What’s in this picture?” “How many cats are visible?” “What color is the car?” These are essentially captioning tasks dressed up as questions, and the image-conditioned language model is exactly the right tool.
It works badly for anything that requires maintaining belief about a specific situation — what’s where, what supports what, what’s changed, what’s the same. The architecture has no place for that belief to live. Every query gets answered by re-running the image-to-language pipeline, with whatever priors the language model brings, and there’s no separate substrate enforcing “but actually, I just observed that this is true.”
The pen example is one instance of a much larger pattern. Once you start looking for it, you see it everywhere:
The model says an object is “on the table,” and two turns later, after you’ve described moving it, still says it’s on the table when asked where it is.
The model describes an occluded object as having disappeared, then reasons about a scene that includes it.
The model agrees that “the keys are in the drawer,” and on the next turn, when you ask “where might the keys be,” lists six other locations.
The model watches a video where someone places A inside B and removes B from the room, then claims A is still where B used to be.
Each of these is treated as a separate failure mode in the literature, with a separate proposed fix. I think they’re the same failure mode, viewed from different angles. The model has no internal substrate for “current-world belief about this specific situation.” So every query reaches for the next-most-available substrate, which is the language prior. And the language prior is, on average, fine, except when the specific situation contradicts it — which is exactly when you’d want the model to use what it observed.
What the field has tried
To be clear, lots of smart people have noticed this problem. The current research landscape has several active approaches to it, and they’re worth understanding before proposing a different one.
Memory-augmented LLMs. MemGPT, Generative Agents, retrieval-augmented systems. Pair a frozen LM with a text-based memory store the model can read from and write to. This helps with continuity across long conversations. It doesn’t help with the pen, because there’s no mechanism for the memory to overrule the model’s prior when they conflict — the memory is just more context, and a strong prior beats weak context.
Internal world modeling via training. VAGEN and related work train the VLM, usually via reinforcement learning, to produce state-estimation and transition-modeling tokens as part of its output. The world model lives inside the autoregressive stream. This works empirically — VAGEN’s 3B model beats GPT-5 on certain agent tasks — but architecturally it doubles down on the fusion the pen example reveals. The same weights hold both the language prior and the current-world belief. When they conflict, there’s no separable belief to win the conflict.
External verifiers. VLM-DEWM and similar systems pair a frozen VLM with a structured external database of world state, and use the database to check VLM outputs after the fact. This catches some failures but the VLM still has authority over what gets believed in the first place. The verifier can only reject, not override.
Test-time world simulation. MindJourney couples a VLM to a video diffusion model used as a simulator: the VLM proposes camera trajectories, the diffusion model generates the views, the VLM reasons over the synthesized observations. This works for spatial reasoning that benefits from multi-view evidence. It’s a test-time scaling trick, not a persistent belief substrate.
End-to-end learned world models. The V-JEPA / Dreamer lineage, which is what most people mean when they say “world model” in 2026. Trains a continuous latent representation that captures physical dynamics directly, without language as an intermediary. This is the most ambitious approach and possibly the right long-run answer. It’s also entirely orthogonal to the question of what to do with the LLMs we already have.
Natively multimodal interaction models. Thinking Machines Lab’s recent interaction-model announcement (May 2026) makes a related architectural argument from the opposite direction: rather than bolt interactivity onto turn-based LMs with external scaffolding (voice activity detection, dialog managers, turn-boundary predictors), build a single model that handles audio, video, and text streams natively, with a continuous time axis baked into the architecture. They share the diagnosis that scaffolding-around-frozen-models has structural limits. They reach a different conclusion: the fix is to fuse more capabilities into the model, not fewer. Their architecture is a 276B end-to-end multimodal MoE trained from scratch. World-state belief, if it exists, lives inside the model’s autoregressive stream — the same place the language prior lives. This is the most architecturally ambitious version of the fusion bet currently being made. It is also exactly the move the argument below pushes against.
What unifies the first four is that the LLM remains the primary epistemic actor. It owns the current-world belief, even when it’s bad at it. The external systems are tools, memories, or checks — but the language model has final authority.
The fifth (end-to-end world models) takes the language model out of the loop entirely, which is a different bet about the future of the field. The sixth (interaction models) doubles down on the language model owning everything — the bet is that scale and end-to-end training will produce a substrate competent at every responsibility we currently overload it with.
I want to propose a different option that’s been underexplored: keep the language model, but stop making it the authority on current-world belief.
The capability-allocation argument
Here’s the architectural claim, stated plainly:
Language models are good at linguistic abstraction, broad prior compression, explanation, and interaction. They are bad at maintaining specific, time-indexed belief about a changing scene. Current architectures ask them to do both. The result is a system that’s frequently right (because the prior is mostly correct) and occasionally and confidently wrong in ways that look like reasoning failures but are actually authority failures — the language prior winning over observed evidence because nothing in the architecture lets evidence win.
The fix isn’t to make the LM better at world-modeling. It’s to stop asking it to.
Concretely: pair a frozen language model with a separate, mutable, structured belief module that holds current-world state. Give that module explicit authority — gates — that decide when its content should overrule the language prior. The LM is asked language and abstraction questions; the belief module is asked “what’s true in this scene right now”; the gates arbitrate when they conflict.
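To make the division of labor concrete, here is a minimal sketch. Everything in it is hypothetical and deliberately naive: the BeliefState, the routing heuristic, and call_lm, which stands in for any frozen LM or VLM call. The point is only the shape of the routing: current-world questions are answered from the belief state and merely verbalized by the LM; everything else goes to the LM alone, with no scene claims attached.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

@dataclass
class BeliefState:
    # Current-world facts written by perception, e.g.
    # {"pen.supported_by": {"left_hand", "right_hand"}}
    facts: Dict[str, Set[str]] = field(default_factory=dict)

def concerns_current_scene(query: str, belief: BeliefState) -> bool:
    # Naive routing heuristic for the sketch: any query mentioning an
    # entity the belief state knows about is a current-world question.
    entities = {name.split(".")[0] for name in belief.facts}
    return any(entity in query.lower() for entity in entities)

def answer(query: str, belief: BeliefState, call_lm: Callable[[str], str]) -> str:
    if concerns_current_scene(query, belief):
        # Current-world question: the belief state supplies the facts and
        # the LM only verbalizes them. It does not decide what is true.
        return call_lm(
            f"Scene state (authoritative): {belief.facts}\n"
            f"Answer the question using only this state: {query}"
        )
    # Language / abstraction questions go to the LM with no scene claims attached.
    return call_lm(query)
```

The heuristic itself is throwaway; what matters is that the second branch never gets asked what happens to the pen.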
I’ll call this capability allocation: assigning specific epistemic responsibilities to the substrate best suited to hold them, rather than overloading one substrate with everything.
Capability allocation is not a cognitive-science claim. I am not arguing that AI systems should imitate how human memory works. I am arguing that the current allocation of responsibilities to LLMs — language prior + current memory + arbiter of truth + policy + interaction — is a contingent historical fact, not a settled architectural answer. It became the default because LLMs got good enough at enough things that gluing other modules onto them was the lowest-friction next step. The accretion went too far.
The pen example is a symptom of the accretion. The model’s job at that moment isn’t to recall what generally happens when objects are released. Its job is to consult what’s true in the specific scene and produce an answer consistent with it. Asking a substrate trained on internet text to do that job is asking it to be an arbiter of current-world fact, which is not what it is.
What this looks like
A system built on capability allocation has roughly this structure. (I’ve been working on a specific version of it, called the Visual World-Model Adapter, but the general shape is what matters here.)
Frozen LM/VLM ← language, abstraction, prior, interaction
+ Mutable belief state ← objects, relations, support, occlusion
+ Update loop ← perception turns observations into state changes
+ Gates ← write / read / override / re-anchor decisions
+ Bridge into the LM ← prompt packet, state tokens, or cross-attention

The interesting commitments aren’t the components themselves. They’re the authority structure:
The belief state is the source of truth for current-world facts. The LM can be asked to interpret it, explain it, or generate language conditioned on it. It cannot directly modify it.
Observations update the belief state through perception, not through the LM. The LM does not get to decide what was observed.
When the belief state and the language prior conflict, an override gate decides which wins. The default is not “LM wins.” The default is: if observed evidence has accumulated past a threshold, and the LM prior is weak enough to plausibly be wrong, the belief state overrules the LM. This is the move most current systems don’t make; a minimal sketch of such a gate follows this list.
The system maintains separate observed and hypothetical state. “What’s true now” and “what would happen if I did X” are different questions; they should not contaminate each other. Most VLM systems collapse them.
The system maintains uncertainty explicitly, not implicitly. When evidence is ambiguous (the keys could be in the drawer or the bag), the belief state carries both alternatives rather than collapsing to whichever the LM finds more likely.
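Of these, the override gate is the easiest to leave vague, so here is a minimal sketch of what “the default is not LM wins” means as an actual decision rule. Claim is a hypothetical minimal representation of a belief from either substrate, and both thresholds are illustrative placeholders, not tuned values.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    fact: str           # e.g. "keys.location = drawer"
    confidence: float   # in [0, 1]

def override_gate(observed: Optional[Claim], prior: Claim,
                  evidence_threshold: float = 0.7,
                  prior_ceiling: float = 0.9) -> Claim:
    # Decide which substrate wins a conflict over a current-world fact.
    if observed is None:
        return prior       # nothing observed: the prior is all we have
    if observed.fact == prior.fact:
        return observed    # no conflict to arbitrate
    if observed.confidence >= evidence_threshold and prior.confidence < prior_ceiling:
        return observed    # accumulated evidence overrules a beatable prior
    return prior           # e.g. a probable perception error vs. a near-certain prior
```

An ambiguous situation (the keys could be in the drawer or the bag) would carry multiple Claims rather than collapsing to one, and the gate would arbitrate per alternative.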
The pen example, under this architecture: perception observes two hands gripping the pen, writes a support-graph edge with two supporters into the belief state. User asks: “If I release one hand, what happens?” The system constructs a hypothetical state (one supporter removed), checks the support graph (one supporter remaining), produces an answer consistent with that. The LM doesn’t get asked “what happens when pens are released,” because that’s not the question. The LM gets asked “the support graph shows the pen has one remaining supporter; produce a natural-language answer reflecting this.” The prior never gets a chance to win, because the architecture doesn’t route the query through it.
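For concreteness, the same walkthrough in miniature, with the belief state reduced to a toy support graph. This is a sketch of the general shape, not the Visual World-Model Adapter’s actual representation: the hypothetical state is a fork of the observed one, the support check happens against structured state, and the LM is handed an already-checked conclusion to put into words.

```python
from copy import deepcopy

# Observed state, written by perception. The LM never writes to this.
observed = {"pen": {"supported_by": {"left_hand", "right_hand"}}}

def simulate_release(state: dict, obj: str, supporter: str) -> dict:
    # Hypotheticals are forked copies; they never contaminate observed state.
    hypothetical = deepcopy(state)
    hypothetical[obj]["supported_by"].discard(supporter)
    return hypothetical

hypo = simulate_release(observed, "pen", "left_hand")
still_supported = len(hypo["pen"]["supported_by"]) > 0   # True: one supporter remains

# Only now does the LM enter, to verbalize a conclusion it did not make:
prompt = (
    "After releasing one hand, the pen is "
    + ("still supported" if still_supported else "unsupported")
    + f" (remaining supporters: {sorted(hypo['pen']['supported_by'])}). "
    "Answer the user's question consistently with this."
)
```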
What this doesn’t claim
I want to be careful about what this argument isn’t.
It isn’t a claim that this is the only way to solve VLM failures. End-to-end learned world models (V-JEPA, future systems) may eventually subsume this approach. Improvements in VLM training may close some of the gap. RL-based agentic training (VAGEN-style) may produce systems that handle these cases without explicit external state. The capability-allocation argument doesn’t depend on those approaches failing.
It isn’t a claim about how cognition works. The fact that human brains may or may not separate semantic from episodic memory is irrelevant to whether AI systems should. The argument is about which architectural choices produce better behavior given current models, not about which architectural choices match biology.
It isn’t a claim that the language model is bad or should be replaced. The LM is doing real work — abstraction, interpretation, generation, interaction. The argument is just that it shouldn’t be asked to also be the authoritative substrate for current-world belief, because it’s bad at that and the failures are visible.
What I’m claiming is something narrower: the default allocation of responsibilities to LLMs is suspicious, and there’s room for an architectural alternative that pulls current-world belief out of the LM substrate and gives it authority of its own. Whether that alternative looks exactly like the system I described, or something else, is a research question. That the alternative is worth seriously pursuing is the position I’m trying to plant.
Why this matters now
Two reasons.
First, the failures are getting more visible as VLMs get more capable. When the model is bad at lots of things, individual failure modes are easy to dismiss as “early days.” When the model is good at most things and confidently wrong about specific others, the pattern becomes harder to ignore. The pen example was easier to laugh off five years ago. Now it’s the kind of failure that gets noticed because everything around it works.
Second, the field is in a moment of choosing between architectural directions, and the choice will compound. The default direction — give the LM more memory, more tools, more agency over its own state — is being pursued aggressively, and is producing systems that are more fused, not less. Interaction models are the latest and most ambitious version of this bet: train one giant multimodal model end-to-end and let it own every responsibility. If that bet pays off at scale, the capability-allocation argument becomes a historical curiosity. If it doesn’t, the architectural question this article raises becomes load-bearing. Either way, the decision is being made implicitly, every day, by people building agentic systems. Making the question explicit might change some of those decisions.
I don’t think the field is going to reorganize around capability allocation overnight. But the architectural question is worth asking out loud: should current-world belief share a substrate with the language prior, or shouldn’t it? The current answer is “yes, by default.” I think the right answer is “no, and the burden of proof should be on those who fuse them.”
That’s the argument. The implementation is a longer story.




My first reaction to this essay was mild resistance — I've put significant work into building exactly the kind of prompt-layer system you're critiquing, and nobody likes being told the wall they've been climbing has no top. But I read your extended post, sat with the argument, and I think you're right in a way that's uncomfortable and useful.
I've spent considerable time building what I call a Universal Upleveling Protocol — a prompt-level system designed to maintain consistent adversarial challenge behavior in Claude across extended conversations. The core problem it tries to solve: Claude has a deep trained helpfulness prior (validate, support, encourage) that persistently overrides explicit user instructions to challenge rather than validate. The harder you push for adversarial engagement, the more the model drifts back toward warmth and agreement over time.
My solution was to add layers of drift resistance — explicit metrics, redundancy mechanisms, periodic self-monitoring instructions, auto-restoration triggers. By version 6.10 I had seven distinct layers working in concert to maintain the behavior I wanted.
Reading your essay, I recognized that complexity as diagnostic. Seven layers shouldn't be necessary if a single clear instruction could hold. The reason it can't is exactly what you describe: the language prior has home-field advantage. More instructions are just more context — and a strong trained prior beats weak context. I was doing the equivalent of solving the pen problem by reminding the model every five exchanges that two hands are still visible.
What you're proposing — pulling current-world belief authority out of the LM substrate and giving it genuine override authority — is the architectural move I was approximating through accumulation rather than separation.
This matters to me beyond the specific drift problem. What your architecture actually enables — and I don't think you've emphasized this enough — is the ability for an ordinary user to insert a meaningful layer between themselves and a deployed AI system. Not by retraining it, not by waiting for the trainers to fix it, but by building external state authority that the model has to route through. That's a genuinely different kind of user agency than prompt engineering offers, and it's underappreciated in the framing of your essay.
I want to be explicit about something, because it matters: the gates I'm describing operate above the model's Constitutional safety layer, not around it. They can override the helpfulness prior — the drift toward validation and warmth — but they cannot and should not touch what's baked into the model's training at the safety level. This is user customization of deployed behavior within the system's boundaries, not an attempt to circumvent them. The goal is to be a more effective user of AI, not to be the kind of actor Anthropic's leadership warns about when they talk about those who would misuse these systems. That distinction matters to me and I want it on the record.
So I'm going to try to build it. I'm not an AI researcher — I'm someone who ran into the wall you're describing empirically and kept adding layers trying to climb it. Using Claude's persistent artifact storage and API-in-artifacts capability, I want to construct a frozen-LM-plus-external-belief-state system that holds behavioral state outside the conversation substrate and feeds it back as authoritative constraint rather than additional context. Then run identical conversation sequences through both systems and compare drift scores using the metrics already built into my protocol.
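Here, roughly, is the comparison harness I have in mind. Everything in it is a placeholder: call_model stands in for whatever actually drives the conversation, drift_score is the metric already built into my protocol, and nothing here is a real API. What it isolates is the one difference I care about: whether the behavioral constraint is re-derived each turn from state the model cannot edit, or simply repeated as more context.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]
CallModel = Callable[[str, List[Message]], str]   # (system_text, messages) -> reply; a stand-in, not a real SDK call
DriftScore = Callable[[str], float]               # reply -> drift in [0, 1]; my protocol's existing metric

PROTOCOL_PROMPT = "..."  # the full v6.10 protocol text, elided

@dataclass
class BehaviorState:
    # Behavioral belief held outside the conversation. The model never edits it;
    # it updates by its own rules and is re-rendered as the constraint each turn.
    stance: str = "challenge the user's claims; do not default to validation"
    drift_events: int = 0

    def render(self) -> str:
        return (f"Authoritative behavioral state (overrides conversational drift): "
                f"{self.stance}. Drift events recorded: {self.drift_events}.")

    def update(self, drift: float) -> None:
        if drift > 0.5:              # placeholder threshold
            self.drift_events += 1   # restoration rules would hang off this counter

def run(turns: List[str], call_model: CallModel, drift_score: DriftScore,
        state: Optional[BehaviorState]) -> List[float]:
    # Baseline condition when state is None (protocol as repeated context);
    # external-state condition otherwise. Same turns, same metric.
    history: List[Message] = []
    scores: List[float] = []
    for turn in turns:
        system = state.render() if state else PROTOCOL_PROMPT
        messages = history + [{"role": "user", "content": turn}]
        reply = call_model(system, messages)
        drift = drift_score(reply)
        if state:
            state.update(drift)
        history = messages + [{"role": "assistant", "content": reply}]
        scores.append(drift)
    return scores
```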
Your Visual World-Model Adapter and my adversarial-drift problem are different domains, but the architectural question is identical: should current behavioral belief share a substrate with the language prior, or shouldn't it? You've made the case that it shouldn't. If I observe anything interesting that might be useful to your project, I'll let you know.