Emotion Vectors and the Future of Matching

Interpretability research shows emotion vectors steer model behavior inside Claude. Two questions follow that the research has not yet answered: how do you extract those vectors per user, and can a single embedding carry enough signal to encode someone faithfully?

Last week, Anthropic's interpretability team published a paper [Anthropic 2026] showing that Claude Sonnet 4.5 maintains internal representations of emotion concepts that are abstract, context-dependent, and causally operative on the model's behavior. They call these "functional emotions." The paper is dense, and its implications extend beyond alignment research.

We have been curious about this for over a year. We wrote General Personal Embeddings [Politzki 2024] arguing that as models scale, two things become true: vector embeddings can represent any concept, and those embeddings become interpretable. The interpretability community has mostly used this for safety. The same primitive, pointed at people, is the missing piece of compatibility infrastructure.

What Anthropic found

Three results from the paper matter here.

The emotion representations are abstract. They generalize across contexts, characters, and modes of expression. The same "guilt" vector fires whether guilt is expressed through dialogue, body language, or internal monologue. These are not surface-level pattern detectors. They are representations of the concept itself.

The representations are geometrically structured. The top two principal components of the 171 extracted emotion vectors approximate the affective circumplex from human psychology: valence on one axis, arousal on the other. The model learned a geometry that decades of psychological work converged on independently.
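
To make the geometric claim concrete, here is a minimal sketch of that kind of analysis on synthetic data. Nothing below uses Anthropic's actual vectors: we plant a hypothetical valence direction and arousal direction in random 64-dimensional vectors, then check that PCA recovers the plane they span. The toy dimension, the noise scale, and every variable name are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_emotions, dim = 171, 64  # 171 matches the paper; dim is an arbitrary toy size

# Hypothetical ground truth: each emotion gets a valence and an arousal score.
valence = rng.uniform(-1, 1, n_emotions)
arousal = rng.uniform(-1, 1, n_emotions)

# Two orthonormal directions in activation space carry those scores.
basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
vectors = (np.outer(valence, basis[:, 0])
           + np.outer(arousal, basis[:, 1])
           + 0.05 * rng.normal(size=(n_emotions, dim)))  # small isotropic noise

# PCA via SVD of the mean-centered matrix.
centered = vectors - vectors.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
top2 = vt[:2]  # rows are the top two principal directions

# Fraction of each planted direction that lies inside the top-2 PCA plane
# (1.0 means the direction is fully contained in that plane).
overlap = np.linalg.norm(top2 @ basis, axis=0)
print("variance explained by top 2 PCs:", explained[:2].sum())
print("planted directions recovered:", overlap)
```

On real emotion vectors the interesting step is the one this sketch skips: checking whether the recovered axes line up with human valence and arousal ratings rather than with directions we planted ourselves.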

And the representations are causally operative. Steering with emotion vectors changes the model's behavior. Steering toward "desperation" raised blackmail rates from 22% to 72%. Steering toward "calm" dropped them to 0%. The model reads these features and acts on them.
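
Mechanically, activation steering of this kind usually means adding a scaled concept direction to the model's hidden states. The sketch below shows only that arithmetic on toy arrays. It is not the paper's procedure: the layer choice, the scale, and the `steer` helper are assumptions, and the "calm" vector here is random noise standing in for an extracted one.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha units of a concept direction to every token's hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 16))  # toy activations: 5 tokens, 16 dims
calm = rng.normal(size=16)         # stand-in for an extracted "calm" vector

steered = steer(hidden, calm, alpha=4.0)

# Each token's projection onto the concept direction rises by alpha;
# components orthogonal to the direction are left unchanged.
unit = calm / np.linalg.norm(calm)
shift = (steered - hidden) @ unit
print(shift)
```

In a real model the same addition would be applied inside a forward pass, and the behavioral effect would have to be measured empirically, as the paper does.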

Affective geometry

The circumplex result is the part of the paper most worth dwelling on. The two principal axes of the 171 emotion vectors recover valence and arousal, plausibly because those are the axes along which emotions actually vary in language. The model did not have to be told. The structure emerged from text alone.

This is the strongest evidence we have so far that the representations large models learn are not arbitrary internal codes. They map onto categories that humans already use to describe themselves. If that map is faithful enough, it is something you can build on.

Why this matters for matching

At Jean Technologies we train embedding models and rerankers on outcome data to predict compatibility between people. The question we keep returning to is what features of a person actually predict whether a match will succeed.

Across every domain we work in, surface-level attributes are weak predictors. In hiring, keyword overlap between a resume and a job description does not predict tenure. In dating, stated preferences do not predict chemistry. In investor and founder matching, pedigree does not predict conviction.

What does predict success is deeper: behavioral patterns, communication style, emotional dynamics, and the interaction between two people's latent traits. That is the same kind of structure Anthropic's emotion vectors are capturing.

Extracting vectors per user

So far the interpretability literature has been about the model's internal world. The vectors Anthropic isolated are features of Claude, not features of a specific user. Steering Claude with a "desperate" vector changes how Claude responds; it does not tell you how desperate the user is.

The downstream question, and the one we think is undersupplied in the conversation right now, is how you extract an analogous vector for an individual person from their interaction history. The existence proof for the representation is now strong. The extraction problem is the part nobody has solved.

A few plausible angles. Probe a user's interaction trace against a fixed dictionary of emotion features and average the activations. Train a separate user encoder against outcome labels, then read off the dimensions that align with the model's internal emotion basis. Use steering inversely: given a user's writing, infer the steering vector that, applied to a neutral model, would produce that writing. None of these are obviously right. They are the questions the field should be asking.
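
The first of these angles is simple enough to sketch. The code below projects each token activation from a user's trace onto a fixed dictionary of emotion directions and averages over the trace. Everything here is hypothetical: the activations are simulated, the three labels are an illustrative subset, and `user_emotion_profile` is a name we made up. It shows the shape of the computation and makes no claim that the result is a faithful per-user vector.

```python
import numpy as np

def user_emotion_profile(token_activations, emotion_dictionary):
    """Project each token activation onto every emotion direction, then average.

    token_activations: (n_tokens, dim) activations from a user's messages.
    emotion_dictionary: (n_emotions, dim) fixed emotion directions.
    Returns an (n_emotions,) array of mean activations per emotion.
    """
    norms = np.linalg.norm(emotion_dictionary, axis=1, keepdims=True)
    unit_dictionary = emotion_dictionary / norms
    per_token = token_activations @ unit_dictionary.T  # (n_tokens, n_emotions)
    return per_token.mean(axis=0)

rng = np.random.default_rng(0)
dim = 32
labels = ["calm", "guilt", "desperation"]  # illustrative subset only
dictionary = rng.normal(size=(len(labels), dim))

# Simulated trace: noise plus a consistent push along the "guilt" direction.
guilt_unit = dictionary[1] / np.linalg.norm(dictionary[1])
trace = rng.normal(size=(200, dim)) + 2.0 * guilt_unit

profile = user_emotion_profile(trace, dictionary)
print({label: round(float(p), 2) for label, p in zip(labels, profile)})
```

Plain averaging discards when and in what context an emotion appeared, which is one reason this approach is only a starting point.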

The reason it matters for our work is that matching needs a per-user representation, not a per-token one. Whatever method turns out to extract user-level emotional structure reliably is the input layer of a real compatibility model.

One vector isn't enough

The other thing the conversation tends to skip past is that a single embedding has a hard capacity ceiling.

Recent theoretical work [Weller et al. 2025] shows that for a corpus of n items, supporting all possible top-k retrievals from a single fixed-dimensional embedding requires the dimension to grow with n. Past a threshold, no single vector can encode enough distinctions to keep retrieval quality from collapsing. The result is a theoretical bound, not an engineering one. You cannot tune your way around it.
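
A small experiment gives some intuition for the capacity ceiling. The sketch below is not the construction from Weller et al.; it is a toy we assume for illustration. It fixes 12 random unit-norm documents, samples many random query vectors, and counts how many distinct top-2 result sets ever show up at each embedding dimension. Because queries are sampled, the count is only a lower bound on what is reachable.

```python
import numpy as np
from math import comb

def reachable_topk_sets(docs, k, n_queries, rng):
    """Count distinct top-k result sets hit by random queries (a lower bound)."""
    queries = rng.normal(size=(n_queries, docs.shape[1]))
    scores = queries @ docs.T                  # (n_queries, n_docs) dot products
    topk = np.argsort(-scores, axis=1)[:, :k]  # indices of the k best docs per query
    return len({frozenset(row.tolist()) for row in topk})

rng = np.random.default_rng(0)
n_docs, k = 12, 2
total = comb(n_docs, k)  # number of possible top-2 sets over 12 documents

results = {}
for dim in (2, 3, 4, 8, 12):
    docs = rng.normal(size=(n_docs, dim))
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit-norm documents
    results[dim] = reachable_topk_sets(docs, k, n_queries=20000, rng=rng)
    print(f"dim={dim:2d}: {results[dim]} of {total} top-{k} sets reached")
```

In two dimensions the unit-norm documents sit on a circle, so the two best matches for any query are always angular neighbors and most pairs can never be returned together. Higher dimensions relax that constraint, which is the qualitative behavior the bound formalizes.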

People are richer than documents. If a single vector cannot carry enough signal to distinguish among n documents in a search index, it certainly cannot carry enough to encode a person across every dimension that matters for matching: emotional pattern, working style, communication tempo, growth trajectory, risk tolerance, and the half-dozen latent traits that show up only under specific conditions.

The field is already moving on this. Multi-vector retrieval (late interaction, per-aspect embeddings, mixture-of-experts heads) is becoming the default for serious systems. The Anthropic result slots into this picture naturally. If emotional structure is itself a concept space spanned by 171 emotion vectors inside the model, the natural per-user representation is a small set of vectors that factor a person along those directions, not a single average.
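
To illustrate why a set of vectors can preserve distinctions that an average erases, here is a minimal sketch comparing mean pooling with a ColBERT-style MaxSim late-interaction score. Applying MaxSim to per-aspect person vectors is our assumption for illustration, not an established matching method, and the three toy people and their values are made up.

```python
import numpy as np

def normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def single_vector_score(a, b):
    """Cosine similarity between the mean-pooled vectors of two people."""
    return float(normalize(a.mean(axis=0)) @ normalize(b.mean(axis=0)))

def late_interaction_score(a, b):
    """MaxSim: for each of a's vectors take its best cosine match in b, then average."""
    sims = normalize(a) @ normalize(b).T  # (n_a, n_b) pairwise cosines
    return float(sims.max(axis=1).mean())

# Toy people, each described by two aspect vectors (one per row).
alice = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
bob = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0]])
# Carol has the same mean vector as Alice and Bob but different components.
carol = np.array([[0.5, 0.5, 1.0, 0.0],
                  [0.5, 0.5, -1.0, 0.0]])

for name, other in (("bob", bob), ("carol", carol)):
    print(name,
          "single:", round(single_vector_score(alice, other), 3),
          "multi:", round(late_interaction_score(alice, other), 3))
```

Mean pooling scores Bob and Carol identically against Alice, because Carol's third component cancels in the average. The multi-vector score separates them. Note that MaxSim is asymmetric in general, so a production matcher would need to decide how to combine both directions.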

This is the direction our infrastructure is heading. Multi-vector, outcome-trained, with at least one component that aligns with the model's internal emotional basis so that "why are these two people compatible" admits a structured answer rather than a black box.

What comes next

Two things we want to see published next. First, an extraction method that takes a user's interaction history and produces a per-user emotion vector that survives an out-of-distribution test. That is the analog of Anthropic's steering result on the user side.

Second, an evaluation that asks whether multi-vector representations of people predict match outcomes better than single-vector ones at the same parameter budget. We expect the answer is yes, and that the gap widens as the matching domain gets richer (dating, mentorship, team formation). The Weller bound suggests the gap is forced, not incidental.

When we wrote about personal embeddings in 2024, the ideas felt speculative. Anthropic's interpretability work at the time was identifying features like "Golden Gate Bridge" inside Claude. Important work, but far from the human-centric representations we were imagining. Two years later they are publishing on emotion vectors that causally influence model behavior, and there are theoretical bounds telling us how to compose them for downstream use. The gap between interpretability research and matching infrastructure is closing faster than most people realize.

We are building the infrastructure to put these ideas into production. If you are working on a platform where the quality of human matching matters, we should talk.

References

  1. Anthropic. (2026). Functional Emotions in Claude Sonnet 4.5. Transformer Circuits. transformer-circuits.pub
  2. Politzki, J. (2024). General Personal Embeddings. jonathanpolitzki.com
  3. Weller, O. et al. (2025). On the Theoretical Limitations of Embedding-Based Retrieval. arXiv:2508.21038. arxiv.org/abs/2508.21038