This paper introduces Invariant In-Context Learning (InvICL), a novel ICL method that achieves permutation invariance, information non-leakage, and context interdependence via leave-one-out encoding and a parallel implementation, outperforming both invariant and non-invariant baselines in generalization and performance across synthetic and real-world tasks.
In-Context Learning, Permutation Invariance, Large Language Model, Transformer, Generalization, Context Interdependence
Lizhe Fang, Yifei Wang, Khashayar Gatmiry, Lei Fang, Yisen Wang
Peking University, MIT CSAIL
Generated by grok-3
Background Problem
In-Context Learning (ICL) is a significant emergent property of large language models (LLMs), enabling rapid adaptation to new tasks from context examples without parameter updates. However, ICL is highly sensitive to the order of context examples even though the examples are mutually independent, causing large performance variance (e.g., accuracy swinging between 90% and 50% on the SST-2 dataset). This order sensitivity stems from the auto-regressive nature of LLMs and the causal masking in Transformer architectures, both of which break permutation invariance. The paper aims to design an invariant ICL algorithm that preserves performance while achieving permutation invariance, subject to two further desiderata: information non-leakage (a query must not access its own answer) and context interdependence (examples should interact with one another to be encoded well).
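The permutation-invariance desideratum can be made concrete with a toy check: if each example is encoded independently and pooled by summation (a Bag-of-Examples-style predictor, used here purely as an illustration; `boe_predict` is a hypothetical helper, not the paper's model), the prediction provably cannot depend on example order.

```python
import itertools

def boe_predict(context, query):
    """Toy permutation-invariant predictor (illustration only): encode
    each (x, y) example independently, sum-pool, then combine with the
    query. Sum-pooling makes the output order-independent."""
    pooled = sum(x * y for x, y in context)  # order-independent encoding
    return pooled + query

context = [(1, 2), (3, 4), (5, 6)]
# Every ordering of the context yields the identical prediction.
outputs = {boe_predict(list(p), query=0.5)
           for p in itertools.permutations(context)}
assert len(outputs) == 1
```

Standard auto-regressive ICL fails this check because causal masking lets later examples see earlier ones but not vice versa, so each ordering induces a different computation.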
Method
The proposed method, Invariant In-Context Learning (InvICL), achieves permutation invariance in ICL while preserving information non-leakage and context interdependence. It operates in two stages: first, each context example is encoded independently, as in a Bag-of-Examples (BoE) approach; second, a leave-one-out (LOO) pre-encoding step re-encodes each example with context from all other examples (excluding itself), yielding interdependence without leakage. This is implemented with a novel LOO-type attention mask inside a Transformer architecture. To address the computational inefficiency of computing each LOO encoding separately, a parallel implementation duplicates the input sequence (unrolling it twice), so that all LOO encodings and the final prediction are obtained in a single forward pass at a computational cost comparable to baseline methods. Symmetric positional encoding is also adopted, treating each example as an independent sequence so that positional information does not break invariance.
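The LOO-type mask over the duplicated sequence can be sketched as follows (a minimal illustration under assumed conventions; the slot layout and the helper name `invicl_attention_mask` are not from the paper's code):

```python
def invicl_attention_mask(n):
    """Sketch of a leave-one-out (LOO) attention mask for InvICL's
    parallel implementation. Assumed layout: positions 0..n-1 hold the
    duplicated examples for independent (BoE) encoding, positions
    n..2n-1 hold the LOO encodings, and position 2n holds the query.

    mask[i][j] = True means position i may attend to position j.
    """
    size = 2 * n + 1
    mask = [[False] * size for _ in range(size)]
    # Stage 1: each example in the first copy attends only to itself,
    # giving order-independent Bag-of-Examples encodings.
    for i in range(n):
        mask[i][i] = True
    # Stage 2: the LOO slot for example i attends to the independent
    # encodings of every *other* example j != i, plus its own slot.
    for i in range(n):
        for j in range(n):
            if j != i:
                mask[n + i][j] = True
        mask[n + i][n + i] = True
    # The query attends to all LOO-encoded examples and itself; it
    # never sees its own answer, so information non-leakage holds by
    # construction, and attending to a *set* of LOO encodings keeps
    # the prediction permutation invariant.
    for j in range(n, size):
        mask[2 * n][j] = True
    return mask
```

Because every row of the mask treats the n examples symmetrically (up to relabeling), permuting the context permutes intermediate encodings but leaves the query's prediction unchanged, which is the invariance property the method targets.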
Experiment
The experiments cover both synthetic and real-world settings. Synthetic tasks include linear regression, sparse linear regression, and decision tree learning, where InvICL converges faster (e.g., outperforming baselines at 50k training epochs) and extrapolates better to sequences longer than the training length of 40. Real-world experiments use 142 tasks from MetaICL, spanning text classification, QA, NLI, and paraphrase detection, with GPT-2 Large, GPT-Neo 2.7B, and Pythia-2.8B as base models. InvICL outperforms non-invariant methods (e.g., AR ICL) on 4 of 7 task categories in the all-target-tasks setting and on all 7 in the unseen-domain setting, and surpasses invariant baselines (e.g., Prefix ICL, BoE variants) in most settings, especially in out-of-distribution (OOD) generalization (average accuracy of 48.4% vs. 43.6% for AR ICL). The setup is comprehensive, testing varied task types and OOD scenarios, though the need for short fine-tuning on real-world tasks hints at practical limitations. Inference time is comparable to baselines (around 22 ms), and the memory overhead from input duplication (a 14% increase for GPT-2 Large) is deemed acceptable. Results broadly match the expectation that invariance improves generalization, though the synthetic focus might overstate the real-world impact.
Further Thoughts
The principles behind InvICL, particularly the focus on data symmetry and context interdependence, could potentially extend beyond ICL to other emergent capabilities of LLMs, such as reasoning or planning, where order sensitivity might also play a role. For instance, in multi-step reasoning tasks, ensuring that the model’s intermediate reasoning steps are invariant to the order of provided information could enhance robustness, similar to how InvICL improves ICL. Additionally, the LOO encoding strategy might inspire new attention mechanisms in multimodal systems, where different modalities could be encoded with context from others without leakage, potentially improving cross-modal generalization. However, I’m concerned about the scalability of input duplication in larger models or longer contexts, as memory constraints could become prohibitive. It would be insightful to explore if InvICL’s benefits hold in extremely long-context scenarios or with newer architectures like State Space Models, which might handle sequence dependencies differently. This also connects to recent works on scaling laws, where understanding how invariance impacts emergent abilities at scale could be a fruitful direction for future research.