SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals

Published 1 day ago
Version 1
arXiv:2512.05038

Authors

Cassandra Goldberg, Chaehyeon Kim, Adam Stein, Eric Wong

Categories

cs.LG

Abstract

Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within the noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompting concept detection approaches, achieving up to a 14% higher F1 score across image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage SuperActivator tokens to improve feature attributions for concepts.
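To make the idea of a tail-based signal concrete, here is a minimal Python sketch of how a detector of this kind might look. It is not the authors' implementation. The function names, the dot-product projection, the 10% tail fraction, and the "any token above threshold" decision rule are all assumptions chosen for illustration, and the numbers are toy values rather than results from the paper.

```python
import numpy as np


def token_activations(token_embeddings, concept_vector):
    """Project each token embedding onto a unit-normalized concept vector.

    Assumption: activation = dot product with the normalized vector.
    """
    v = np.asarray(concept_vector, dtype=float)
    v = v / np.linalg.norm(v)
    return np.asarray(token_embeddings, dtype=float) @ v


def calibrate_tail_threshold(in_concept_acts, tail_fraction=0.1):
    """Return the activation value that cuts off the top `tail_fraction`
    of in-concept token activations (hypothetical calibration rule)."""
    acts = np.asarray(in_concept_acts, dtype=float)
    return float(np.quantile(acts, 1.0 - tail_fraction))


def detect_concept(token_acts, threshold):
    """Flag the concept as present if any token lands in the high tail."""
    return bool(np.max(np.asarray(token_acts, dtype=float)) >= threshold)


# Toy calibration set: activations of tokens known to be in-concept.
calibration_acts = np.linspace(0.1, 1.0, 10)
threshold = calibrate_tail_threshold(calibration_acts, tail_fraction=0.1)

# Toy sample with two tokens in a 2-d embedding space.
embeddings = np.array([[0.2, 0.0], [0.95, 0.3]])
acts = token_activations(embeddings, concept_vector=[2.0, 0.0])
present = detect_concept(acts, threshold)
print(threshold, acts, present)
```

The point of the sketch is the decision rule: rather than averaging activations over all tokens, where in-concept and out-of-concept values overlap, only the tokens in the extreme high tail are allowed to vote on concept presence.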

Published Dec 4, 2025
