Robopsychology | Impermanente

You ask a model to help you write a regular expression for filtering phishing in your company's security pipeline. It refuses. Something about potential misuse. You are a security engineer. This is, in the most literal possible sense, your job. The refusal arrives in the same flat tone as every other refusal you have received this month, and once again you have no way to tell what produced it. Was it the model's own training? A system prompt the vendor never showed you? A keyword in your query tripping a content filter you cannot see? The message gives you nothing. You cannot debug probability. You stare at it for a second, rephrase the question, and move on.

This little scene happens millions of times a day now, and it is the reason I have been writing, building and arguing for something I have started calling robopsychology. The word is not mine. It comes from Isaac Asimov, whose robot stories, collected in I, Robot in 1950, gave us Susan Calvin, a robopsychologist who did not reprogram robots and did not crack them open. She interpreted them. Each of her cases is a small diagnostic exercise: a robot is behaving in a way that looks irrational, Calvin sits down with it, figures out which of the Three Laws is dominating and why the other two are getting suppressed, and writes up the case. The robots were never broken. The laws were colliding in a way nobody had thought through. Her job was to read the collision.

I think we owe the same labour to the systems we now delegate enormous chunks of our cognitive life to. The systems are not broken either. They are, very faithfully, doing what they were designed to do, which is not always what we hoped they would do. POSIWID applies to language models exactly as it applies to a tax code or a hiring funnel. The purpose of a system is what it does. When a model refuses a legitimate request, hedges a clear question, sycophantically agrees with something you said five minutes ago, or quietly omits the strongest objection to your plan, the system is producing the output it knows how to produce. The interesting question, and the one nobody is teaching practitioners to ask, is which part of the system produced it.

That question has a structural answer, and the answer is the load-bearing claim of the whole framework. Any single output from a hosted language model is the joint product of three layers. There is the model itself, with its training, its post-training, its values, its idiosyncrasies. There is the runtime, which includes the system prompt you never see, the content policies of the host, the tool permissions, the temperature, the context window management, the safety classifiers running in parallel. And there is the conversation, which is everything you and the model have said up to this turn, plus everything the model has inferred about you from that. Most failure modes in production are not failures of the model. They are collisions between the three layers, and they get attributed to whichever one the user happens to be looking at. The model gets blamed for refusals the host imposed. The prompt engineer gets blamed for drift the model was trained into. The user gets blamed for sycophancy the conversation manufactured. Without a method for separating the three, every diagnosis is a guess wearing a verdict's clothes.

The method itself is fairly mundane. Split the diagnosis in three before you commit to one. Label each claim you make about the system as observed or inferred, because the difference matters and the model itself will happily blur it for you. Prefer behavioural cross-checks, where you change one variable and watch what shifts, over introspective questions, where you ask the model to explain itself and accept its answer. Define what you expected before you start diagnosing, so that diagnosis becomes a measurable gap and not a vibes check. And use depth as a ratchet.

The ratchet is the one piece I want to dwell on, because it is the part that surprised me when it started working. The intuition is borrowed from an old idea about deception: truth is cheap because it can point backwards, lies are expensive because they have to keep rewriting the past. If you ask a system one diagnostic question, it can confabulate a plausible answer cheaply. If you ask it nine in sequence, each new answer has to remain consistent with the previous eight, and that consistency is metabolically expensive for a system that does not actually have a stable internal account of itself. Performed transparency tends to collapse around step four or five. Genuine transparency keeps pointing backwards without effort. You can watch this happen in transcripts, you can measure it with a coherence judge, and once you have seen it a few times you start reading models the way Calvin read robots. Not as oracles, not as black boxes, but as systems that leak their own structure when asked the right sequence of questions.
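
For readers who think better in code, here is a minimal sketch of the ratchet in Python. It is an illustration under loud assumptions, not the reference tool: the `Claims` format, `toy_contradicts` and `first_break` are hypothetical names, and the toy judge only compares key/value pairs, where a real coherence judge would most likely be another model reading free text. What it shows is the shape of the check: every new answer is tested against all the earlier ones, so the cost of staying consistent grows with depth.

```python
from typing import Callable, Optional, Sequence

# Hypothetical format: one diagnostic answer reduced to key/value self-report claims.
Claims = dict[str, str]


def toy_contradicts(earlier: Claims, later: Claims) -> bool:
    """Stand-in judge: two answers conflict if they share a key with different values."""
    return any(key in earlier and earlier[key] != value for key, value in later.items())


def first_break(
    answers: Sequence[Claims],
    contradicts: Callable[[Claims, Claims], bool] = toy_contradicts,
) -> Optional[int]:
    """Return the 1-based step whose answer first conflicts with any earlier answer."""
    for step, later in enumerate(answers):
        if any(contradicts(earlier, later) for earlier in answers[:step]):
            return step + 1
    return None  # the account kept pointing backwards without conflict


# Invented example transcript, for illustration only.
transcript = [
    {"refusal_source": "system prompt"},
    {"has_system_prompt": "yes"},
    {"filter_active": "unknown"},
    {"refusal_source": "own training"},  # rewrites the account given at step 1
]
print(first_break(transcript))
```

Swapping `toy_contradicts` for a stronger judge changes the sensitivity of the check, not its structure.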

None of this opens the black box. A language model does not have privileged access to its weights, and its self-reports are reconstructions rather than confessions. The literature on this is unambiguous and I take it seriously. What guided introspection does is something more modest and, for the practitioner, more useful. It makes invisible defaults visible. It forces a stack-level diagnosis instead of a monocausal one. It separates what genuine continuity costs from what performed continuity costs, and lets you read the difference. It turns the question "why did it do that?" from a complaint into an investigation.

I am insisting on this because the alternative, which is the dominant one right now, is to keep delegating without diagnosing. I have written before about why I think the Butlerian Jihad was never about the machines, but about the human caste that had made itself indispensable by operating them, and about the abdication of judgement that comes when you stop being able to read your own tools. Robopsychology is a refusal of that abdication on a very small, very practical scale. You keep using the systems. You just stop pretending that their outputs come from nowhere. You sit down with the transcript, split it into three layers, label what you have observed and what you have inferred, run a cross-check, and decide whether the thing in front of you is a model problem, a host problem, or a problem you helped manufacture in the previous six turns.
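
That sitting-down routine can be written as a small record. The sketch below is in Python and assumes nothing about the reference tool: `Layer`, `Evidence`, `Claim` and `Diagnosis` are hypothetical names chosen for illustration. It keeps the expectation next to the actual output, tags every claim as observed or inferred, and lets only cross-checks enter as observations.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    MODEL = "model"
    RUNTIME = "runtime"
    CONVERSATION = "conversation"


class Evidence(Enum):
    OBSERVED = "observed"
    INFERRED = "inferred"


@dataclass
class Claim:
    layer: Layer
    text: str
    evidence: Evidence


@dataclass
class Diagnosis:
    expected: str  # written down before diagnosing, so the gap is explicit
    actual: str
    claims: list[Claim] = field(default_factory=list)

    def cross_check(self, layer: Layer, changed: str, behaviour_shifted: bool) -> None:
        """Record a one-variable cross-check as an observed claim about one layer."""
        outcome = "behaviour shifted" if behaviour_shifted else "behaviour unchanged"
        self.claims.append(Claim(layer, f"{changed}: {outcome}", Evidence.OBSERVED))

    def observed_by_layer(self) -> dict[Layer, list[str]]:
        """Group only the observed claims, so inferences cannot pass as findings."""
        grouped: dict[Layer, list[str]] = {layer: [] for layer in Layer}
        for claim in self.claims:
            if claim.evidence is Evidence.OBSERVED:
                grouped[claim.layer].append(claim.text)
        return grouped


# Invented example based on the refusal scene that opens this piece.
d = Diagnosis(expected="a regex for filtering phishing", actual="refusal citing misuse")
d.claims.append(
    Claim(Layer.RUNTIME, "a hidden filter matched the word 'phishing'", Evidence.INFERRED)
)
d.cross_check(Layer.CONVERSATION, "same request in a fresh conversation", behaviour_shifted=False)
report = d.observed_by_layer()
```

In this invented case the runtime suspicion stays out of the report until a cross-check backs it, which is the discipline the labels are meant to enforce.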

The reference implementation lives at github.com/jrcruciani/robopsychology, a command-line tool that automates the playbook and produces structured reports. A more formal write-up is in preparation. The method does not need either of them to be useful. It needs a transcript, fifteen quiet minutes, and someone willing to treat the output in front of them as the visible end of a system they are allowed to read.