Local Drift-Adapters

Anyone running a production vector database faces the same dilemma every time a better embedding model comes out: re-embed the entire corpus at great cost, or stay on the older, weaker model. For an organization holding hundreds of millions of vectors, the re-embedding bill is high enough that the upgrade simply doesn't happen.

Drift-Adapter [Vejendla 2025] offered a third path. Train a small linear map on a calibration set, apply it to every old embedding, get most of the new model's retrieval quality for orders of magnitude less compute. The catch is in the word every. A single global map assumes the transformation from old space to new space is the same everywhere: that scientific text and financial text and conversational text all drift in the same direction. They don't.

One adapter, everywhere

The global setup is clean. Sample a few thousand documents, embed each under both the old model and the new one, and fit the map that best carries one set of vectors onto the other: an orthogonal Procrustes rotation, a low-rank affine map, or a small residual MLP, depending on how much expressive power the dimension change demands. The adapter then translates the rest of the corpus, and incoming queries, without ever invoking the new model.
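For the Procrustes variant, the fit has a closed form via a singular value decomposition. The following is a minimal sketch, assuming NumPy and a synthetic calibration set in place of real embeddings; the function names fit_procrustes and apply_adapter are illustrative and not taken from the Drift-Adapter code.

```python
import numpy as np

def fit_procrustes(old_vecs: np.ndarray, new_vecs: np.ndarray) -> np.ndarray:
    """Return R with orthonormal rows/columns minimizing ||old_vecs @ R - new_vecs||_F."""
    # Cross-covariance between the two spaces, shape (d_old, d_new).
    m = old_vecs.T @ new_vecs
    # Thin SVD; U @ Vt is the (semi-)orthogonal matrix closest to m.
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def apply_adapter(r: np.ndarray, old_vecs: np.ndarray) -> np.ndarray:
    """Map old-space vectors into the new space and L2-normalize the result."""
    out = old_vecs @ r
    return out / np.linalg.norm(out, axis=1, keepdims=True)

# Synthetic stand-in for a calibration set (an assumption for illustration):
# the "new" embeddings are the old ones rotated by a hidden orthogonal matrix q.
rng = np.random.default_rng(0)
old = rng.normal(size=(2000, 16))
old /= np.linalg.norm(old, axis=1, keepdims=True)
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
new = old @ q

r = fit_procrustes(old, new)      # fitted on the calibration pairs only
adapted = apply_adapter(r, old)   # the same call would translate the rest of the corpus
```

When the drift really is a single rotation, as in this toy data, the fitted map recovers it. The rest of this piece is about what happens when it is not.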

On easy upgrades (same model family, same dimension), that's enough. The map really is close to uniform, and a single rotation captures it. The story breaks down at the hard end of the spectrum: cross-family, cross-dimensional jumps, which are exactly the upgrades worth bothering with.

Drift isn't uniform

The premise of this work is that embedding spaces aren't structurally homogeneous, and the transformation between two spaces shouldn't be treated as if they were. The cross-lingual alignment literature noticed this years ago. Different vocabulary regions translate via different mappings [Nakashole & Flauger 2018], and cluster-based alignment closed real gaps in bilingual lexicon induction.

We fit a global Procrustes adapter on MS MARCO and measured the residual: how badly the global map fails for each point. The residuals are spatially correlated. Nearby points in the old space share both the direction and the magnitude of their error. The figure on the right is a synthetic version of what the real data shows: three regions, three qualitatively different transformations, one global map that can't be all three at once.
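One way to check for spatially correlated residuals is to compare the residual of each point with that of its nearest old-space neighbor, against a baseline of random pairs. The sketch below uses synthetic data with three regions, each drifting under its own rotation. The data, the helper names, and the nearest-neighbor test are assumptions for illustration, not the measurement protocol used on MS MARCO.

```python
import numpy as np

def fit_procrustes(x, y):
    u, _, vt = np.linalg.svd(x.T @ y, full_matrices=False)
    return u @ vt

def mean_residual_cosine(residuals, idx_a, idx_b):
    """Average cosine similarity between residuals of paired points."""
    a, b = residuals[idx_a], residuals[idx_b]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
d, n_per = 8, 300

# Three tight regions in the old space (synthetic, for illustration only).
centers = 3.0 * rng.normal(size=(3, d))
old = np.concatenate([c + 0.3 * rng.normal(size=(n_per, d)) for c in centers])
old /= np.linalg.norm(old, axis=1, keepdims=True)

# Each region drifts under a different rotation.
rots = [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(3)]
new = np.concatenate(
    [old[i * n_per:(i + 1) * n_per] @ rots[i] for i in range(3)]
)

# Residual of the single global map, one vector per point.
r_global = fit_procrustes(old, new)
residuals = new - old @ r_global

# Nearest old-space neighbor of every point (excluding the point itself).
sims = old @ old.T
np.fill_diagonal(sims, -np.inf)
nearest = sims.argmax(axis=1)

everyone = np.arange(len(old))
neighbor_cos = mean_residual_cosine(residuals, everyone, nearest)
random_cos = mean_residual_cosine(residuals, everyone, rng.permutation(len(old)))
print(f"neighbor pairs: {neighbor_cos:.3f}  random pairs: {random_cos:.3f}")
```

If neighbors agree on their residual far more often than random pairs do, the global map is failing in a locally consistent way, which is the property a per-region adapter can exploit.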

The mixture-of-experts move

If drift is locally consistent but globally not, the obvious fix is to fit one adapter per neighborhood. We partition the calibration set into K clusters, train a dedicated adapter on each cluster's data, and route incoming embeddings through a soft combination of those adapters at inference. The experts are tiny linear maps, and the gating follows deterministically from the clustering, so nothing new is trained at routing time.
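The partition-then-fit step can be sketched in a few lines. This is a hedged illustration, not the reference implementation: it assumes Procrustes experts, a bare-bones cosine k-means written inline to stay dependency-free, and synthetic three-region data. It uses hard assignment only to compare fit quality on the calibration set itself.

```python
import numpy as np

def fit_procrustes(x, y):
    u, _, vt = np.linalg.svd(x.T @ y, full_matrices=False)
    return u @ vt

def spherical_kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd iterations on unit vectors, using cosine similarity."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = (x @ centroids.T).argmax(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the previous centroid if a cluster empties
                c = members.sum(axis=0)
                centroids[j] = c / np.linalg.norm(c)
    labels = (x @ centroids.T).argmax(axis=1)
    return centroids, labels

def fit_local_adapters(old_vecs, new_vecs, k):
    """Cluster the old space, then fit one adapter per cluster."""
    centroids, labels = spherical_kmeans(old_vecs, k)
    adapters = np.stack([
        fit_procrustes(old_vecs[labels == j], new_vecs[labels == j])
        for j in range(k)
    ])
    return centroids, labels, adapters

# Synthetic calibration set: three regions, three different rotations.
rng = np.random.default_rng(0)
d, n_per = 8, 300
centers = 3.0 * rng.normal(size=(3, d))
old = np.concatenate([c + 0.3 * rng.normal(size=(n_per, d)) for c in centers])
old /= np.linalg.norm(old, axis=1, keepdims=True)
rots = [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(3)]
new = np.concatenate([old[i * n_per:(i + 1) * n_per] @ rots[i] for i in range(3)])

global_r = fit_procrustes(old, new)
global_err = np.mean(np.sum((old @ global_r - new) ** 2, axis=1))

centroids, labels, adapters = fit_local_adapters(old, new, k=3)
# Hard assignment: each point goes through its own cluster's adapter.
local_pred = np.einsum("nd,nde->ne", old, adapters[labels])
local_err = np.mean(np.sum((local_pred - new) ** 2, axis=1))
print(f"global error: {global_err:.4f}  local error: {local_err:.4f}")
```

On the calibration data the local fit can never be worse than the global one, since the global rotation is always an available choice for every cluster. Whether the gain survives on held-out data is what the experiments measure.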

Soft routing

For an incoming old-space vector z, the routing weights are a temperature-scaled softmax over cosine similarity to each cluster centroid. The final adapted vector is the weighted combination of every per-cluster adapter's output, then renormalized. The renormalization isn't optional. A convex combination of rotations is no longer a rotation, and unnormalized blends shrink norms near cluster boundaries enough to hurt cosine retrieval. Low temperature recovers hard routing; we use τ = 0.1.
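A minimal sketch of this routing rule, assuming unit-norm inputs and centroids and a stack of per-cluster linear adapters. The function names and the two-cluster toy example are hypothetical; only the softmax-over-cosine gating, the blend, the renormalization, and τ = 0.1 come from the description above.

```python
import numpy as np

def routing_weights(z, centroids, tau=0.1):
    """Temperature-scaled softmax over cosine similarity to each centroid."""
    logits = (z @ centroids.T) / tau              # (n, K); inputs are unit-norm
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def blend_experts(z, centroids, adapters, tau=0.1):
    """Weighted combination of every adapter's output, before renormalizing."""
    w = routing_weights(z, centroids, tau)
    outputs = np.einsum("nd,kde->nke", z, adapters)  # (n, K, d_new)
    return np.einsum("nk,nke->ne", w, outputs)

def soft_route(z, centroids, adapters, tau=0.1):
    """Blend the experts, then project back onto the unit sphere."""
    blended = blend_experts(z, centroids, adapters, tau)
    return blended / np.linalg.norm(blended, axis=1, keepdims=True)

# Toy case: two clusters in 2-D, one adapter is the identity and the
# other a 90-degree rotation. The query sits exactly on the boundary.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
adapters = np.stack([np.eye(2), np.array([[0.0, 1.0], [-1.0, 0.0]])])
z = np.array([[1.0, 1.0]]) / np.sqrt(2.0)

weights = routing_weights(z, centroids)
raw = blend_experts(z, centroids, adapters)
routed = soft_route(z, centroids, adapters)
```

In this toy case the two experts receive equal weight, and the raw blend has norm of about 0.71 rather than 1, which is the shrinkage near cluster boundaries that the renormalization corrects.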

Drift-aware clustering

The intuition we walked in with: standard k-means on old embeddings will group points by semantic similarity, but semantically similar points can drift differently. Biomedical and physics jargon both live in "scientific text," yet the new model might reorganize them in opposite directions. So we augment the clustering features with the residuals from a global adapter, giving every point a tag for how the global map fails it. We expected this to help.
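As a sketch of what the augmented features could look like, the code below concatenates each old embedding with its global-adapter residual scaled by a weight alpha. The concatenation layout, the helper names, and the step that recomputes centroids in the old space (so that routing still works for points with no new-model embedding) are assumptions for illustration. The placeholder labels stand in for the output of a k-means run on the features.

```python
import numpy as np

def fit_procrustes(x, y):
    u, _, vt = np.linalg.svd(x.T @ y, full_matrices=False)
    return u @ vt

def drift_aware_features(old_vecs, new_vecs, r_global, alpha):
    """Old embedding plus alpha-scaled residual of the global adapter."""
    residuals = new_vecs - old_vecs @ r_global    # how the global map fails each point
    return np.concatenate([old_vecs, alpha * residuals], axis=1)

def old_space_centroids(old_vecs, labels, k):
    """Unit-norm mean of each cluster's old vectors (assumes no empty cluster)."""
    cents = np.stack([old_vecs[labels == j].mean(axis=0) for j in range(k)])
    return cents / np.linalg.norm(cents, axis=1, keepdims=True)

# Synthetic cross-dimensional pair: 6-d old space, 10-d new space.
rng = np.random.default_rng(0)
n, d_old, d_new, k = 200, 6, 10, 4
old = rng.normal(size=(n, d_old))
old /= np.linalg.norm(old, axis=1, keepdims=True)
new = old @ rng.normal(size=(d_old, d_new)) + 0.05 * rng.normal(size=(n, d_new))
new /= np.linalg.norm(new, axis=1, keepdims=True)

r_global = fit_procrustes(old, new)               # shape (d_old, d_new)
features = drift_aware_features(old, new, r_global, alpha=0.5)
plain = drift_aware_features(old, new, r_global, alpha=0.0)

# Placeholder labels; in practice these would come from k-means on `features`.
labels = np.arange(n) % k
centroids = old_space_centroids(old, labels, k)
```

Setting alpha to zero zeroes out the residual block, so clustering on these features reduces to plain k-means on the old embeddings.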

Empirically, it barely does. Drift-aware clustering improves Recall@1 by less than half a point over plain k-means. Spatial proximity in the old space already captures most of the drift structure. Two readings of this: either the partitioning criterion matters less than we thought, or the benefit of any reasonable partition dominates which one you pick. Probably both.

Headline numbers

We evaluate four model-pair configurations on MS MARCO, ordered by transformation difficulty: same-family (MiniLM-L6 → MiniLM-L12), cross-family (MiniLM → BGE-small), cross-dimensional (BGE-small → BGE-base), and the hardest pair: MiniLM → E5-large, which is both cross-family and a 384 → 1024 dimension jump.

Local adapters beat their global counterparts on all four pairs. The interesting result is the hardest pair: a local Procrustes model with just eight clusters outperforms the global Procrustes baseline (0.984 vs 0.979 R@1). Even the strongest single-map baseline benefits from being broken into pieces.

Scaling with K

Performance climbs monotonically with the number of clusters up to roughly K = 32, then plateaus or degrades. On the cross-family pair, going from one cluster to thirty-two closes about 70% of the gap to an oracle that has the true new embeddings. On the hardest pair, K = 32 yields 2.6× the Recall@1 of the global affine baseline.

The K = 64 degradation is a data-volume story rather than a modeling one. Splitting 100K calibration samples across 64 clusters leaves each adapter only ~1,500 points to fit, which isn't enough. Pick K based on calibration set size, not on some property of the data.
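The arithmetic behind that rule is easy to script. The sketch below assumes clusters of equal size, which k-means does not guarantee, and the min_per_cluster threshold of 3,000 is a hypothetical value chosen only to sit between the K = 32 and K = 64 budgets. It is not a recommendation from the experiments.

```python
def samples_per_cluster(n_calibration: int, k: int) -> float:
    """Average number of calibration points available to each adapter."""
    return n_calibration / k

def largest_k(n_calibration: int, min_per_cluster: int,
              candidates=(1, 2, 4, 8, 16, 32, 64, 128)) -> int:
    """Largest candidate K that still leaves min_per_cluster points per adapter."""
    feasible = [k for k in candidates
                if samples_per_cluster(n_calibration, k) >= min_per_cluster]
    return max(feasible) if feasible else 1

for k in (8, 16, 32, 64):
    print(k, samples_per_cluster(100_000, k))

# Hypothetical threshold, for illustration only.
chosen = largest_k(100_000, min_per_cluster=3000)
```

With 100K samples, K = 32 leaves 3,125 points per adapter and K = 64 leaves 1,562.5, so any threshold between those two figures selects K = 32.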

Gains scale with difficulty

The improvement from local adapters is small on easy upgrades and large on hard ones, and the relationship is nearly linear in how much retrieval degrades without any adapter at all. Same-family upgrades barely need anything (+0.3 R@1 points). The cross-family pair gains 7.8 points, the cross-dimensional pair 11.0, and the cross-family, cross-dimensional pair 32.5.

This is the practitioner's takeaway. If your upgrade is easy, a global adapter is already fine; don't add machinery. If your upgrade crosses model families or jumps dimensions, that's exactly the regime where global maps underfit and local ones pay for themselves many times over.

Caveats

Local adapters add two hyperparameters relative to the global setup: the cluster count K and the drift-feature weight α. We provide defaults, but the right values shift with corpus and model pair. Routing at query time adds K cosine comparisons, or O(K·d) work per query, which is negligible for moderate K and non-trivial for K > 100. The evaluation covers English sentence embeddings only; multilingual and non-textual embeddings may look different.

The drift-aware clustering result is the one we'd most like to revisit. Our hypothesis, that drift information should improve partitioning, is a clean idea that didn't pay off empirically, and we don't fully understand why. Either the drift signal is too noisy at the per-point level to help partitioning, or spatial proximity is doing more than we gave it credit for. The honest finding is that you can ship local adapters with vanilla k-means and lose almost nothing.

References

Vejendla, A. (2025). Drift-Adapter: Embedding Migration without Re-embedding.
Nakashole, N. & Flauger, R. (2018). Characterizing Departures from Linearity in Word Translation. ACL.