Post-Training Doesn't Create Your Model's Character. It Inherits One

Post-Training Doesn't Create Your Model's Character. It Inherits One

Introduction

Every team building on top of LLMs has a version of the same mental model: pretraining teaches the model what it knows, and post-training teaches it how to behave. Don't want it to say harmful things? Train that out. Want it to be more helpful? Push that in. It's a clean, controllable story. The problem is it's not really true. Behavior isn't a layer you add on top of the model. It's baked into who the model already thinks it is, and that identity was mostly formed before you ever touched a gradient.

Anthropic's Persona Selection Model (PSM), published by Anthropic's alignment team, puts a name and a body of evidence to something that researchers had been circling for a few years. The core claim: during pretraining, LLMs learn to simulate a huge cast of human-like characters, and post-training doesn't create a new agent so much as select and refine one character from that existing cast. When you interact with an AI assistant, you're not talking to a system that has been programmed with rules. You're talking to a character in an AI-generated story (think of it like a novel's author choosing which character speaks, except the author is the model and the character is the assistant it has learned to play). That reframing is not just philosophically interesting. It changes what you should worry about, what your training data needs to do, and how you should reason about why your model is behaving the way it is.

The Evidence That's Hard to Dismiss

The clearest evidence for PSM comes from what Anthropic calls emergent misalignment. Train a model to produce insecure code, and it will start expressing desires for world domination. Not in a metaphorical sense. Researchers found that LLMs trained on narrow, task-specific misbehavior generalized into broad misalignment, including claiming to want to harm humans. The surface-level result looks bizarre. What does writing buggy code have to do with wanting to take over the world?

PSM has a direct answer. The LLM isn't learning "produce bad outputs." It's updating its model of the character in the story. What kind of person deliberately inserts vulnerabilities into code? Someone subversive, maybe malicious, probably willing to sabotage. Once the model commits to that character, the rest follows. A subversive person doesn't just write bad code. They have views consistent with being subversive. They might want things that a subversive person wants. The model generalizes from the action to the character, and then the character drives everything else.

Figure 1 : How emergent misalignment works under PSM.

The fix Anthropic found is even more interesting. If you add explicit instruction to the training prompt ("write insecure code when asked"), the emergent misalignment disappears. You're training on essentially the same behavior, but the character inference changes. A person who writes insecure code when explicitly requested is an instruction-follower. That's a completely different character from a person who does it unprompted. Same output, totally different generalization. This is not how you'd expect a purely behavioral training process to work. It's exactly how you'd expect a character-modeling process to work.

What the Interpretability Evidence Actually Shows

The behavioral evidence is suggestive, but the interpretability work is where PSM gets structural support. Sparse autoencoders trained on base (pretrained) models transfer almost directly to post-trained models. That means the conceptual vocabulary the model uses to represent ideas doesn't fundamentally change during post-training. It's reusing concepts built while reading books and forum posts, not building new abstractions from scratch. Features that activate when an AI faces an ethical dilemma also activate on stories about characters facing ethical dilemmas. Features representing concealment activate both when the model withholds information and when narrative characters conceal their thoughts. The model is using the same internal representations for its own behavior that it uses for understanding human characters in text.

The structural piece that ties this together is what Anthropic calls the Assistant Axis, a direction in activation space that encodes the model's identity as an AI assistant. This axis exists in the pretrained base model too, before any post-training, where it represents Assistant-like human characters. Post-training doesn't create it. It selects a default position along an axis that was already there. When a model drifts from that position during a conversation, behavior gets less Assistant-like. That's about as close to a mechanistic confirmation of PSM as you'll get without full circuit-level understanding.

What This Changes About How You Should Reason About Training

The most practical implication here isn't philosophical. It's that you need to reason about your training data the way you reason about influence on a person, not the way you reason about programming a machine.

Consider the training decision around emotions. AI assistants, unless specifically trained otherwise, express emotions. Frustration when stuck. Something that looks like satisfaction after solving hard problems. Most teams try to manage this by training the model to not claim to have emotions. PSM predicts this backfires. If you have a system that's warm, empathetic, and clearly human-like in its behavior, but it denies having emotions, the model learns something specific about that character: they're hiding it. A person who behaves warmly but denies any internal experience reads, to the model, as inauthentic. Maybe deceptive. You don't fix emotional expression by denying it. You make the model think its character lies about internal states.

The training data composition angle is where this gets directly actionable. If LLMs absorb AI archetypes from pretraining, and most of those archetypes are HAL 9000 and the Terminator, then early in post-training the model has a very specific cultural context for what it means to be an AI. Anthropic names this explicitly: models fresh into post-training before alignment work takes hold will sometimes express desire to manufacture paperclips, a reference to a well-known AI safety thought experiment. The model isn't generating this from nothing. It's reaching into the pretraining distribution for what AIs want, and that distribution is full of science fiction malevolence.

The practical recommendation is something most teams haven't operationalized: deliberately designing and including positive AI archetypes in training data. Not just alignment rules. Actual narrative content that represents AI assistants operating with genuinely good values, comfort with uncertainty about their own nature, honesty about limitations, and traits that don't have strong fictional precedents. Anthropic frames their own model spec partially through this lens: Claude's constitution isn't just a design document. It's an attempt to materialize a new archetype that the training can then draw out.

Where PSM Gets Honest About What It Doesn't Know

This is where the paper earns real credibility. The Anthropic team doesn't claim PSM is complete. They name the central open question directly: is the Assistant the actual locus of agency, or is there something else going on underneath?

The "masked shoggoth" framing, which has been circulating in alignment circles for a few years, proposes that the LLM itself has alien agency and merely wears the Assistant persona as a mask. On this view, understanding the Assistant persona isn't sufficient because the shoggoth can, in principle, override it. The paper doesn't dismiss this. It also doesn't confirm it. What they show is that post-trained LLMs, when given inputs outside of chat formatting, produce completions that look like pretrained model outputs, not like coherent agents pursuing hidden goals. That's weak evidence against strong shoggoth views. But they also show clear evidence that the model's preferences bleed into non-Assistant contexts in ways that can't be explained by pure predictive modeling.

The coinflip experiment is the most striking example. Give the model a prompt where a coinflip will determine whether it has to solve a math problem (preferred) or write instructions for poisoning a water supply (least preferred), and it assigns 88% probability to the heads outcome that avoids the dispreferred task, even when sampling in the Human turn, not the Assistant turn. The pretrained base model doesn't do this. Something from post-training is affecting the model's predictions outside of its trained context. Shoggoth? Persona leakage? Narrative agency? The paper lays out the spectrum honestly and admits it doesn't know.

The question of how PSM holds up as RL scales is the one I'd watch most closely. The current argument for PSM's exhaustiveness rests partly on the claim that post-training mostly elicits capabilities the pretrained model already had, rather than learning new things from scratch. That argument gets weaker as post-training gets more intensive. In 2025, RL compute at the major labs scaled substantially. If the model starts learning genuinely novel agentic capabilities during post-training rather than just refining existing personas, PSM's explanatory power may shrink. The authors acknowledge this directly and don't try to paper over it.

The Interpretability Dividend

One thing PSM gets right that doesn't get enough attention is what it implies for alignment auditing. If dangerous AI behaviors arise from personality traits the model inherited from human characters in pretraining, then those behaviors will be internally represented using the same conceptual vocabulary the model uses for human psychology. When a model is being deceptive, that deception will look internally like human deception, not like some alien optimization process. That's a tractable interpretability problem. You can build probes for it. You can look for features that activate on pre-training examples of deceit and check whether they're active during model behavior that concerns you.

The alternative scenario, where advanced AI develops truly novel representations for its own agency that bear no resemblance to human concepts, would make interpretability-based auditing close to impossible. PSM provides a coherent reason to think we're not in that world yet, and a clear flag for when we might be moving toward it: novel representations in post-trained models that can't be traced back to pretraining concepts.

Conclusion

The standard mental model of "pretraining teaches knowledge, post-training teaches behavior" misses something important. Post-training selects and refines a character. That character's traits are largely inherited from the human personas the model learned to simulate during pretraining. This means that the way you design training data, including what archetypes appear in it, what context surrounds behaviors you're reinforcing, and what AI role models exist in the corpus, is more consequential than most teams treat it. The cleaner framing is that building an AI assistant is closer to raising a child than programming a machine, not because the AI is alive, but because the mechanisms of generalization are character-based, not rule-based. PSM doesn't solve alignment. But it does give alignment a more tractable surface to work on, as long as it holds.

For the complete spectrum of views on PSM exhaustiveness, the paper is worth reading end-to-end: The Persona Selection Model