This paper introduces ULFine, an unbiased lightweight fine-tuning strategy for foundation-model-assisted long-tailed semi-supervised learning. ULFine mitigates the ‘minority bottleneck’ and ‘majority overconfidence’ problems via Prototype Adaptive Fitting and Dual Logit Fusion, achieving significant accuracy gains and over a 10x reduction in training cost on benchmark datasets.
Semi-Supervised Learning, Long-Tailed Recognition, Foundation Model, Fine-Tuning, Pseudo-Labeling, Class Imbalance
Enhao Zhang, Chaohua Li, Chuanxing Geng, Songcan Chen
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, China, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China
Generated by grok-3
Background Problem
Long-Tailed Semi-Supervised Learning (LTSSL) addresses the challenge of training models with limited labeled data and abundant unlabeled data under imbalanced class distributions, where traditional methods often suffer from biased pseudo-labels and classifiers that favor head classes. Existing LTSSL approaches, typically trained from scratch, generalize poorly and fail to mitigate these biases effectively. Pre-trained foundation models such as CLIP, with their strong generalization capabilities, present an opportunity to improve LTSSL performance. However, applying them directly reveals new problems: full fine-tuning degrades performance, while lighter fine-tuning strategies neglect tail classes. This paper examines the impact of foundation models on LTSSL and proposes a solution to the identified ‘minority bottleneck’ (neglect of tail classes) and ‘majority overconfidence’ (overconfidence in false pseudo-labels) problems.
Method
The proposed Unbiased Lightweight Fine-tuning (ULFine) strategy aims to mitigate biases in LTSSL using foundation models like CLIP through two core components:
- Prototype Adaptive Fitting (PAF): This component adaptively updates textual prototypes based on confidence-aware pseudo-label distributions to fit imbalanced downstream tasks without bias. A momentum update mechanism (Eq. 4) adjusts the textual prototypes toward visual prototypes derived from labeled data, slowing updates for low-confidence classes so that tail classes are handled cautiously. In addition, an orthogonal loss (Eq. 5) encourages the visual and textual prototypes to be uniformly distributed, reducing overconfidence in head classes.
- Dual Logit Fusion (DLF): Motivated by the complementary nature of similarity-based and linear classifiers, DLF fuses the logits from the unbiased textual prototypes and from linear probing to produce comprehensive, unbiased pseudo-labels and classifiers. It first aligns the two logit sets by normalizing their ranges (Eq. 6), then combines them with a weighted average (Eq. 7, with η=0.7 favoring linear probing), improving performance on both head and tail classes. The overall loss combines a modified FixMatch consistency loss with the orthogonal loss; efficiency is maintained by retaining only DLF at inference time.
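The confidence-aware momentum update and the orthogonality constraint in PAF can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the specific momentum schedule (`min_momentum` and the linear interpolation by per-class confidence) and the squared off-diagonal penalty are assumptions standing in for Eqs. 4 and 5.

```python
import numpy as np

def paf_update(text_protos, vis_protos, class_conf, min_momentum=0.9):
    """Confidence-aware momentum update of textual prototypes (sketch).

    text_protos, vis_protos: (C, D) arrays of per-class prototypes.
    class_conf: (C,) per-class pseudo-label confidence in [0, 1].
    Low-confidence (typically tail) classes get a momentum closer to 1,
    i.e. their textual prototypes move more slowly toward the visual
    prototypes. The linear schedule here is a hypothetical choice.
    """
    m = min_momentum + (1.0 - min_momentum) * (1.0 - class_conf[:, None])
    return m * text_protos + (1.0 - m) * vis_protos

def orthogonal_loss(text_protos, vis_protos):
    """Penalize off-diagonal cross-similarity between the prototype sets,
    pushing prototypes of different classes toward orthogonality."""
    t = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)
    v = vis_protos / np.linalg.norm(vis_protos, axis=1, keepdims=True)
    sim = t @ v.T                           # (C, C) cosine similarities
    off_diag = sim - np.diag(np.diag(sim))  # zero out matched-class terms
    return np.mean(off_diag ** 2)
```

With this schedule, a class with confidence 0 keeps its textual prototype frozen for the step, while a fully confident class moves it fastest toward its visual prototype.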
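Similarly, DLF's range alignment and weighted fusion might look like the sketch below. The per-sample min-max rescaling is an assumed stand-in for the paper's alignment step (Eq. 6); the weight η=0.7 on the linear-probing branch follows the text (Eq. 7).

```python
import numpy as np

def dual_logit_fusion(sim_logits, lin_logits, eta=0.7):
    """Fuse similarity-based and linear-classifier logits (sketch).

    sim_logits: (N, C) logits from textual-prototype similarity.
    lin_logits: (N, C) logits from the linear probe.
    Both are rescaled to [0, 1] per sample (a min-max alignment assumed
    here in place of Eq. 6), then averaged with weight eta on the
    linear-probing branch (Eq. 7).
    """
    def align(z):
        z_min = z.min(axis=-1, keepdims=True)
        z_max = z.max(axis=-1, keepdims=True)
        return (z - z_min) / (z_max - z_min + 1e-12)
    return eta * align(lin_logits) + (1.0 - eta) * align(sim_logits)
```

Pseudo-labels would then be taken as the argmax of the fused logits, which also serve as the prediction at inference time.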
Experiment
The experiments were conducted on four benchmark datasets (CIFAR10-LT, CIFAR100-LT, STL10-LT, and ImageNet-127) with varying imbalance ratios and both consistent and inconsistent labeled-unlabeled distribution settings. The setup used CLIP (ViT-B/16 backbone) fine-tuned with AdaptFormer, comparing ULFine against numerous state-of-the-art LTSSL methods (e.g., FixMatch, DARP, CCL). ULFine significantly outperforms the baselines, achieving up to 19.6% higher top-1 accuracy on the CIFAR datasets and 8.05% higher on ImageNet-127 than the next best method. It also cuts training cost by more than 10x (1.5x10^4 vs. 2.5x10^5 epochs) and training time by 22% relative to other foundation-model-based approaches. Ablation studies confirm that both PAF and DLF contribute to the performance gains, especially for tail classes, and per-class accuracy visualizations indicate a more balanced classifier, addressing the ‘minority bottleneck.’ However, while the evaluation spans several datasets, it lacks robustness testing across different foundation models or larger-scale datasets, raising questions about generalizability; the results match the expected bias mitigation but may be overly optimistic due to potential dataset-specific tuning.
Further Thoughts
The ULFine approach opens intriguing avenues for integrating foundation models into challenging learning paradigms like LTSSL, particularly with its focus on bias mitigation through adaptive prototype fitting and logit fusion. However, a deeper exploration is needed on how the orthogonality constraint in PAF impacts prototype representations—could it inadvertently suppress meaningful inter-class similarities in certain domains like medical imaging, where subtle feature overlaps are critical? Additionally, the reliance on CLIP raises questions about applicability to other foundation models with different pre-training objectives (e.g., DINO or BERT-based vision-language models). An interesting connection could be drawn to recent works in federated learning, where class imbalance is also prevalent; could ULFine’s bias mitigation strategies be adapted to federated settings to handle non-iid data distributions across clients? Furthermore, the impressive efficiency gains (10x training cost reduction) suggest potential for real-time applications, but scalability to trillion-parameter models or datasets with extreme long-tail distributions (e.g., web-scale data) remains untested. These aspects warrant further research to solidify ULFine’s position as a generalizable framework for LTSSL.