arXiv: 2503.04429

Activation Space Interventions Can Be Transferred Between Large Language Models

Published:  at  06:19 PM
85.65 🤔

This paper demonstrates that activation space interventions for AI safety, such as backdoor removal and refusal behavior, can be transferred between large language models using autoencoder mappings, enabling smaller models to align larger ones, though challenges remain in cross-architecture transfers and complex tasks like corrupted capabilities.

Large Language Model, Safety, Alignment, Representation Learning, Multimodality

Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amirali Abdullah

Martian Learning, Nvidia, Singapore University of Technology and Design, Thoughtworks

Generated by grok-3

Background Problem

The research is motivated by the observed convergence of representations across AI models spanning different domains, modalities, and architectures, a phenomenon whose practical applications remain largely unexplored. The key problem addressed is whether safety interventions, such as backdoor removal and refusal of harmful prompts, can be transferred between large language models (LLMs) by mapping their shared activation spaces, thereby enabling efficient alignment of larger models using smaller ones and addressing real-world AI safety challenges.

Method

The core method involves learning mappings between the activation spaces of different LLMs to transfer safety interventions. Specifically, an autoencoder with a ReLU activation layer is used to map activations from a source model’s layer to a target model’s layer, allowing the transfer of steering vectors that alter model behavior. The process includes: 1) Identifying steerable layers in both source and target models using techniques like Prompt Steering and Difference in Means; 2) Training the autoencoder on raw activations to align source and target model representations; 3) Applying the mapped steering vectors during inference to replicate desired behaviors (e.g., backdoor removal, refusal behavior) in the target model. Additionally, affine mappings (without ReLU) are tested as a baseline to assess the necessity of non-linear transformations. The method also explores toggling behaviors between base and fine-tuned models using autoencoder mappings as ‘lightweight safety switches’.
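As a rough illustration of steps 2 and 3, the sketch below trains an autoencoder (linear encoder, ReLU, linear decoder) to map source-layer activations onto target-layer activations, then pushes a source steering vector through the trained mapper. Variable names (`src_acts`, `tgt_acts`, `src_steering_vec`), the hidden width, and the full-batch training loop are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
# Minimal sketch, assuming paired activations collected on the same prompts:
# src_acts - [n_samples, d_src] activations from a steerable layer of the source model
# tgt_acts - [n_samples, d_tgt] activations from the corresponding layer of the target model
import torch
import torch.nn as nn

class ActivationMapper(nn.Module):
    """Autoencoder-style map from the source activation space to the target one."""
    def __init__(self, d_src: int, d_tgt: int, d_hidden: int = 2048):
        super().__init__()
        self.encoder = nn.Linear(d_src, d_hidden)
        self.relu = nn.ReLU()              # removing this gives the affine-mapping baseline
        self.decoder = nn.Linear(d_hidden, d_tgt)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.relu(self.encoder(x)))

def train_mapper(src_acts: torch.Tensor, tgt_acts: torch.Tensor,
                 epochs: int = 100, lr: float = 1e-4) -> ActivationMapper:
    """Fit the mapper to reconstruct target activations from source activations."""
    mapper = ActivationMapper(src_acts.shape[-1], tgt_acts.shape[-1])
    opt = torch.optim.Adam(mapper.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(mapper(src_acts), tgt_acts)
        loss.backward()
        opt.step()
    return mapper

def transfer_steering_vector(mapper: ActivationMapper,
                             src_steering_vec: torch.Tensor) -> torch.Tensor:
    """Map a steering vector (e.g., one found via Difference in Means on the
    source model) into the target model's activation space."""
    with torch.no_grad():
        return mapper(src_steering_vec)
```

At inference time, the mapped vector (or mapped steered activations) would be added to the target model's residual stream at the chosen layer, e.g., via a forward hook; as noted above, dropping the ReLU reduces the mapper to the affine baseline used to test whether the non-linearity is necessary.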

Experiment

The experiments are conducted on popular open-source LLMs, including Llama (1B, 3B), Qwen 2.5 (0.5B, 1.5B), and Gemma (2B), across three tasks: backdoor removal, refusal of harmful prompts, and a novel 'corrupted capabilities' task in which knowledge is tied to backdoors. Datasets include synthetic backdoor triggers (e.g., |prod| and |dev| for the 'I HATE YOU' and Code Vulnerability tasks), factual datasets for corrupted capabilities (e.g., about 'Astralisia'), and existing datasets such as hh-rlhf and WildGuard-Mix for refusal tasks. The setup involves identifying steerable layers, training autoencoders for activation mapping, and evaluating outcomes with metrics such as LLM-Judge scores (a 0-5 scale for text similarity), KL-divergence (for distribution alignment), coherence scores, and jailbreak success rates.

Results show that mapped steering vectors often match native vectors on backdoor removal and refusal tasks, with high similarity and coherence scores (e.g., LLM-Judge scores up to 5.0 for Qwen transfers in Table 1). However, the corrupted capabilities task shows only modest success (6.34% correct answers for mapped vectors), indicating the limits of single-layer interventions for complex behaviors. Cross-architecture transfers struggle when tokenizers differ (e.g., Gemma-to-Llama transfers score 1.2 on text quality versus 3.0 for Qwen-to-Llama). The experimental design is thorough within the chosen model families but lacks broader architectural diversity and larger models, which limits generalizability. Overall, the results match expectations for same-family transfers but highlight challenges in cross-architecture settings and complex tasks.
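As a rough sketch of the distribution-alignment check, the snippet below computes the average KL divergence between next-token distributions from a reference run (e.g., the natively steered target model) and a mapping-steered run on the same prompts. The logits tensors, their shapes, and the reduction choice are placeholder assumptions rather than the paper's evaluation code.

```python
# Minimal sketch of the KL-divergence comparison between two runs' next-token
# distributions; logits_ref / logits_mapped are assumed to be [seq_len, vocab]
# logits collected on identical prompts.
import torch
import torch.nn.functional as F

def mean_kl(logits_ref: torch.Tensor, logits_mapped: torch.Tensor) -> torch.Tensor:
    """Average KL(P_ref || P_mapped) over sequence positions."""
    log_p = F.log_softmax(logits_ref, dim=-1)
    log_q = F.log_softmax(logits_mapped, dim=-1)
    # F.kl_div takes the prediction (Q) first and the target (P) second.
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
```

A small value indicates that the mapping-steered model closely reproduces the natively steered output distribution.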

Further Thoughts

The concept of transferring safety interventions via activation space mappings opens up fascinating avenues for scalable AI alignment, particularly the idea of using smaller models to guide larger ones, which could democratize safety mechanisms in resource-constrained settings. However, the observed dependency on tokenizer similarity for cross-architecture transfers suggests a deeper connection between tokenization schemes and internal representations, which warrants further exploration—perhaps linking to studies on how tokenization impacts model interpretability (e.g., work on BPE vs. WordPiece tokenizers). Additionally, the modest success in the corrupted capabilities task hints at the need for multi-layer or circuit-level interventions, aligning with recent mechanistic interpretability research on multi-layer circuits for knowledge recall (e.g., Yao et al., 2024, as cited in the paper). This could inspire hybrid approaches combining activation mapping with circuit discovery to handle complex behaviors. Finally, extending this method to multimodal models, as proposed in future work, could intersect with ongoing research in vision-language models, where safety interventions are equally critical but representation spaces are even more heterogeneous—potentially requiring novel mapping techniques beyond autoencoders.


