Research · 7 min read

Emotion Vectors and the Future of Matching

What Anthropic's new interpretability research means for compatibility infrastructure

Jonathan Politzki
Founder

Last week, Anthropic's interpretability team published a paper demonstrating that Claude Sonnet 4.5 maintains internal representations of emotion concepts that are abstract, context-dependent, and causally operative on the model's behavior. They call these "functional emotions." The paper is dense and excellent, and I think its implications extend well beyond alignment research.

This is something I have been thinking about since 2024. In October of that year, I wrote an essay called General Personal Embeddings, arguing that deep representation learning could be applied to people and that the logical endpoint was infrastructure for embedding humans. At the time, I was heavily inspired by Anthropic's scaling monosemanticity work, which showed that individual features inside neural networks could be isolated and interpreted. I wrote then that as models scale, two things would become true: vector embeddings would be able to represent any concept, and those embeddings would become interpretable.

What struck me at the time was that the interpretability community seemed focused almost exclusively on safety and alignment. Understandably so. But the implication that excited me most was different: if you can identify interpretable features that correspond to human traits, emotions, and behavioral patterns, you can build dramatically better systems for understanding and matching people.

What Anthropic found

The new paper validates this intuition more directly than I expected. A few key findings stand out.

First, the emotion representations they discovered are abstract. They generalize across contexts, characters, and modes of expression. The same "guilt" vector fires whether guilt is expressed through dialogue, body language, or internal monologue. These are not surface-level pattern detectors. They are representations of the concept itself.

Second, the representations are geometrically structured. The top two principal components of the 171 emotion vectors they extracted roughly correspond to valence and arousal, reproducing the affective circumplex model from human psychology. The model has independently learned a geometric organization of emotions that maps onto decades of psychological research.
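The analysis behind that finding is essentially principal component analysis over the extracted emotion vectors. The sketch below uses entirely synthetic data: we fabricate 171 vectors that secretly depend on two latent axes (a stand-in for valence and arousal) plus noise, then check that PCA recovers a dominant two-dimensional structure. It illustrates the shape of the analysis, not Anthropic's actual vectors or code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_emotions, dim = 171, 512

# Hidden 2D structure: each synthetic "emotion" gets a coordinate on
# two latent axes (think valence and arousal).
latent = rng.normal(size=(n_emotions, 2))
# Two random orthonormal directions in the 512-dim activation space.
basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
vectors = latent @ basis.T + 0.02 * rng.normal(size=(n_emotions, dim))

# PCA via SVD on the mean-centered vectors.
centered = vectors - vectors.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)

# 2D "circumplex" coordinates for each emotion.
coords = centered @ vt[:2].T
print(f"variance explained by top 2 PCs: {explained[:2].sum():.2%}")
```

Because the synthetic vectors really are two-dimensional up to noise, the top two components dominate; the striking part of the real result is that the same concentration shows up in a model trained only on text.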

Third, and most importantly for our work, these representations are causally operative. Steering with emotion vectors changes the model's behavior. Steering toward "desperation" increased blackmail rates from 22% to 72%. Steering toward "calm" dropped the rate to 0%. These are not epiphenomenal features. The model reads them and acts on them.
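Mechanically, activation steering is simple: add a scaled concept vector to a hidden state mid-forward-pass and let the computation continue. The toy sketch below shows that core step with NumPy stand-ins; a real implementation would hook a transformer layer's residual stream (for example with PyTorch forward hooks), and the "desperation" vector here is random, not a real extracted feature.

```python
import numpy as np

def steer(hidden_state: np.ndarray, emotion_vector: np.ndarray,
          strength: float) -> np.ndarray:
    """Shift an activation along a unit-normalized concept direction."""
    direction = emotion_vector / np.linalg.norm(emotion_vector)
    return hidden_state + strength * direction

rng = np.random.default_rng(1)
h = rng.normal(size=768)               # one token's hidden state (toy)
v_desperation = rng.normal(size=768)   # hypothetical "desperation" vector

h_steered = steer(h, v_desperation, strength=4.0)

# The steered state moves toward the concept direction.
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(h, v_desperation), "->", cos(h_steered, v_desperation))
```

The strength parameter matters: too little and behavior barely shifts, too much and the activation leaves the distribution the model was trained on.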

Why this matters for matching

At Jean Technologies, we build matching infrastructure. We train embedding models and rerankers on outcome data to predict compatibility between people. The question we are always asking is: what features of a person actually predict whether a match will succeed?

The answer, across every domain we work in, is that surface-level attributes are poor predictors. In hiring, keyword overlap between a resume and a job description does not predict tenure. In dating, stated preferences on a profile do not predict chemistry. In investor-founder matching, pedigree does not predict conviction.

What does predict success is deeper: behavioral patterns, communication style, emotional dynamics, and the interaction between two people's latent traits. This is exactly the kind of structure that Anthropic's emotion vectors are capturing.

Consider dating specifically. The dominant approach in the industry is to match on stated preferences and demographics. But anyone who has used a dating app knows that the profiles you think you want and the people you actually connect with are often quite different. The signal is in the interaction, not the profile. It is in how two people's emotional patterns complement or clash.

If embedding models can learn abstract, interpretable representations of emotional concepts, then we can train matching systems that operate on this level. Not "do these two people both like hiking" but "do these two people's emotional patterns predict a successful relationship." That is a fundamentally different kind of matching, and it requires a fundamentally different kind of embedding.

From interpretability to infrastructure

What Anthropic has shown is that these emotional representations emerge naturally in large language models trained on text. The models learn them because tracking emotional state is useful for predicting the next token. But the same principle applies to models trained on outcome data. If you train a dual-encoder on successful matches, the model will learn whatever latent features predict success, and emerging research suggests those features will include emotional and behavioral dimensions that are interpretable.
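To make the dual-encoder setup concrete, here is a minimal sketch: two encoders map each side of a match into a shared space, and an in-batch contrastive (InfoNCE) loss pulls known-successful pairs together while pushing mismatched pairs apart. Linear encoders and random features keep it self-contained; a production system would use transformer encoders over profile and interaction data, and every name here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, feat_dim, embed_dim = 8, 32, 16

W_a = rng.normal(scale=0.1, size=(feat_dim, embed_dim))  # encoder for side A
W_b = rng.normal(scale=0.1, size=(feat_dim, embed_dim))  # encoder for side B

x_a = rng.normal(size=(n_pairs, feat_dim))  # side A of each known-good match
x_b = rng.normal(size=(n_pairs, feat_dim))  # side B of the same matches

def encode(x, W):
    """Project to the shared space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def info_nce(z_a, z_b, temperature=0.1):
    """In-batch contrastive loss: row i of z_a should match row i of z_b."""
    logits = z_a @ z_b.T / temperature            # (n_pairs, n_pairs)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

loss = info_nce(encode(x_a, W_a), encode(x_b, W_b))
print(f"contrastive loss on one batch: {loss:.3f}")
```

Minimizing this loss is what forces the encoders to discover whichever latent features actually separate successful pairs from unsuccessful ones, whatever those features turn out to be.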

This is the direction we are building toward. Our embedding models are not general-purpose text similarity engines. They are trained on outcomes: hires that lasted, relationships that worked, deals that closed. The training signal forces the model to discover the features that actually matter, rather than the features that are easiest to extract from surface text.

Anthropic's work suggests that as these models scale, the features they learn will become increasingly interpretable. We will be able to ask not just "are these two people compatible" but "why are they compatible, and along which dimensions." That level of interpretability transforms matching from a black box into something that can be audited, explained, and improved.
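One simple way auditing could work: a dot-product compatibility score is a sum over embedding dimensions, so each dimension's term is its exact contribution to the score. If dimensions align with interpretable features, as the interpretability results suggest they can, the largest terms answer "along which dimensions are these two people compatible?" The labels and numbers below are invented for illustration.

```python
import numpy as np

labels = ["warmth", "directness", "risk_tolerance", "humor"]  # hypothetical
a = np.array([0.9, 0.2, -0.5, 0.7])    # person A's embedding (toy values)
b = np.array([0.8, -0.1, -0.6, 0.6])   # person B's embedding (toy values)

contributions = a * b                  # score == contributions.sum()
score = contributions.sum()
for name, c in sorted(zip(labels, contributions), key=lambda t: -t[1]):
    print(f"{name:15s} {c:+.2f}")
print(f"total compatibility score: {score:+.2f}")
```

Real learned dimensions are rarely this clean; making them interpretable is exactly where the connection to Anthropic's feature-level work comes in.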

What comes next

The Anthropic paper focuses on a single model's internal representations. The natural next step, and the one we are most interested in, is training embedding models where emotional and behavioral features are first-class objectives. Not features that emerge incidentally from next-token prediction, but features that are explicitly optimized for through outcome-labeled training data.

We are also interested in the interaction between emotion vectors and the "deflection" representations that Anthropic identified: vectors that activate when an emotion is implied but suppressed. In matching contexts, what someone does not say is often as informative as what they do say. A model that can represent both expressed and deflected emotional states has access to a much richer signal for compatibility prediction.

When I wrote about personal embeddings in 2024, the ideas felt speculative. Anthropic's interpretability work at the time was focused on identifying features like "Golden Gate Bridge" and "DNA sequences" inside Claude. Important work, but far from the human-centric representations I was imagining. Two years later, they are publishing papers on emotion vectors that causally influence model behavior. The gap between interpretability research and matching infrastructure is closing faster than most people realize.

We are building the infrastructure to put these ideas into production. If you are working on a platform where the quality of human matching matters, we should talk.