This paper introduces SEFE, a method combining Answer Style Diversification (ASD) to mitigate superficial forgetting and RegLoRA to address essential forgetting in Multimodal Continual Instruction Tuning, achieving state-of-the-art performance on the CoIN benchmark.
Continual Learning, Multimodal Systems, Parameter-Efficient Fine-Tuning, Instruction Tuning, Data Augmentation
Jinpeng Chen, Runmin Cong, Yuzhi Zhao, Hongzheng Yang, Guangneng Hu, Horace Ho Shing Ip, Sam Kwong
City University of Hong Kong, Shandong University, The Chinese University of Hong Kong, Xidian University, Lingnan University
Generated by grok-3
Background Problem
Multimodal Continual Instruction Tuning (MCIT) aims to enable Multimodal Large Language Models (MLLMs) to learn new tasks incrementally without catastrophic forgetting, a phenomenon where models lose previously acquired capabilities when fine-tuned on new data. Existing approaches treat forgetting as a generalized knowledge loss, but this paper introduces a nuanced perspective by categorizing it into superficial forgetting (deviation in response style due to task-specific biases) and essential forgetting (actual loss of factual knowledge). The key problem addressed is mitigating both types of forgetting to maintain model performance across diverse tasks without resource-intensive retraining.
Method
The proposed method, Superficial and Essential Forgetting Eliminator (SEFE), comprises two main components to address the identified types of forgetting in MCIT:
- Answer Style Diversification (ASD) Paradigm: This targets superficial forgetting by reformulating each task’s dataset into five question types (yes/no, multiple-choice, short answer, brief explanation, detailed explanation). A proportion of the data (default X=20%) is converted, in equal shares, into the alternative formats using standardized rules and MLLM assistance, so the model learns to respond in diverse styles instead of becoming biased towards a single format (a preprocessing sketch follows this list).
- RegLoRA: To mitigate essential forgetting, this method extends LoRA (a parameter-efficient fine-tuning technique) by identifying the key elements of the learned weight-update matrix (the top M=2% by absolute value) after each task. During subsequent tasks these positions are protected by regularization masks: an additional loss term (with hyperparameter λ=2.5×10³) penalizes new updates at the marked positions, retaining prior knowledge while leaving the remaining parameters free for new learning (a regularization sketch also follows this list). Together, ASD and RegLoRA address forgetting comprehensively by first normalizing response styles and then stabilizing essential knowledge.
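The ASD step can be pictured as a preprocessing pass over each task’s instruction data. The sketch below is a minimal illustration, not the authors’ released code: the per-format converters are hypothetical placeholders that would in practice be rule-based templates or prompts sent to an assisting MLLM.

```python
import random

# Hypothetical per-format converters (an assumption, not the paper's code); in
# practice these would be rule-based templates or prompts to an assisting MLLM.
CONVERTERS = {
    "yes_no": lambda sample: sample,                # placeholder: rephrase as a yes/no question
    "multiple_choice": lambda sample: sample,       # placeholder: add answer options
    "brief_explanation": lambda sample: sample,     # placeholder: request a short rationale
    "detailed_explanation": lambda sample: sample,  # placeholder: request a full explanation
}

def diversify_answer_styles(samples, x=0.20, seed=0):
    """Convert a fraction x of the task's samples into alternative answer
    formats, split equally across the formats; the rest keep their original
    (short-answer) style."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    per_format = int(len(shuffled) * x) // len(CONVERTERS)
    out, idx = [], 0
    for fmt, convert in CONVERTERS.items():
        for _ in range(per_format):
            out.append(convert(shuffled[idx]))
            idx += 1
    out.extend(shuffled[idx:])  # unconverted samples stay as-is
    return out
```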
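Similarly, RegLoRA’s key-element regularization can be sketched in a few lines of PyTorch. This is a hedged reconstruction of the description above rather than the paper’s implementation: `build_regularization_mask` and `reglora_loss` are illustrative names, and the exact way masks from earlier tasks are accumulated may differ.

```python
import torch

def build_regularization_mask(delta_w, top_fraction=0.02):
    """Mark the top `top_fraction` entries of the accumulated LoRA update
    (by absolute value) as key elements to preserve (M=2% in the paper)."""
    k = max(1, int(delta_w.numel() * top_fraction))
    threshold = delta_w.abs().flatten().topk(k).values.min()
    return (delta_w.abs() >= threshold).float()

def reglora_loss(task_loss, lora_B, lora_A, masks, lam=2.5e3):
    """Penalize the current task's LoRA update (B @ A) at positions that
    earlier tasks marked as important, on top of the ordinary task loss."""
    delta_w_new = lora_B @ lora_A
    reg = sum(((m * delta_w_new) ** 2).sum() for m in masks)
    return task_loss + lam * reg
```

In this reading, after finishing a task one would compute the accumulated update `B @ A`, derive a mask from it with `build_regularization_mask`, append the mask to `masks`, and apply `reglora_loss` throughout training on the next task.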
Experiment
Experiments were conducted on the CoIN benchmark, which comprises eight vision-language tasks (e.g., ScienceQA, TextVQA), and on an ASD-adjusted version, CoIN-ASD, designed to better isolate essential forgetting. The base model was LLaVA-1.5 with Vicuna-7B; evaluation used Truth Alignment (TA) for exact-match accuracy plus the aggregate metrics Mean Final Accuracy (MFN) and Backward Transfer (BWT) to summarize overall performance and forgetting. SEFE outperformed the baselines (FFT, LoRA, O-LoRA, LoTA) with an MFN of 58.57% and a BWT of -10.45%, versus 41.59% and -28.62% for LoRA. ASD alone improved performance across methods (e.g., LoRA+ASD raised MFN by 6.29%), supporting its effect on superficial forgetting, and RegLoRA further improved results by mitigating essential forgetting, as illustrated in case studies where response accuracy improved. The setup is reasonably comprehensive, with ablations on the hyperparameters X (ASD conversion ratio) and M (RegLoRA key-element ratio), though the reliance on specific values raises questions about generalizability. The gains are notable, but the benchmark’s task diversity may not fully capture real-world MCIT challenges, and comparisons could be biased since the baselines were not originally designed for ASD integration.
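For reference, the aggregate metrics quoted above are assumed here to follow the standard continual-learning definitions (the paper may differ in details): with a_{t,i} the accuracy on task i after training on task t, MFN averages the final accuracies a_{T,i} and BWT averages a_{T,i} − a_{i,i} over earlier tasks. A minimal sketch under that assumption:

```python
import numpy as np

def mfn_and_bwt(acc):
    """acc[t, i] = accuracy on task i after finishing training on task t.
    Returns Mean Final Accuracy (MFN) and Backward Transfer (BWT)."""
    T = acc.shape[0]
    mfn = acc[-1].mean()
    bwt = np.mean([acc[-1, i] - acc[i, i] for i in range(T - 1)])
    return mfn, bwt
```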
Further Thoughts
The distinction between superficial and essential forgetting opens up intriguing avenues for future research in continual learning, particularly in how response style biases can mask true knowledge retention—an aspect that could be explored in unimodal LLMs as well. The ASD paradigm’s reliance on data transformation prompts a question about its scalability: could automated style diversification be integrated into model architectures (e.g., via adaptive prompting) rather than dataset preprocessing, reducing manual effort? Additionally, RegLoRA’s heuristic of regularizing the top M% elements might benefit from a more dynamic approach, perhaps leveraging gradient-based importance metrics to identify critical knowledge, as seen in some neural network pruning studies. I also see a connection to federated learning, where diverse client data formats could induce similar superficial forgetting; adapting ASD-like strategies there could enhance model robustness. Finally, the paper’s focus on MCIT benchmarks like CoIN suggests a need for broader, real-world-inspired datasets to test SEFE’s applicability in dynamic, user-driven environments where task formats and knowledge requirements evolve unpredictably.