arXiv:2505.05237

Latte: Transferring LLMs' Latent-level Knowledge for Few-shot Tabular Learning


The paper introduces ‘Latte’, a framework that transfers latent-level knowledge from Large Language Models during training to improve few-shot tabular learning; by leveraging unlabeled data and mitigating overfitting, it outperforms baselines across diverse classification and regression tasks.

Few-Shot Learning, Large Language Model, Representation Learning, Tabular Data, Pre-training, Fine-tuning

Ruxue Shi, Hengrui Gu, Hangting Ye, Yiwei Dai, Xu Shen, Xin Wang

Jilin University

Generated by grok-3

Background Problem

Few-shot tabular learning addresses the challenge of training machine learning models with limited labeled data, a critical issue in real-world applications such as fraud detection and disease diagnosis, where annotation is costly. Traditional supervised learning struggles in such scenarios due to insufficient supervisory signals, and prior approaches based on Large Language Models (LLMs) either incur high latency through test-time knowledge extraction or suffer from unreliable text-level knowledge due to hallucinations. The paper introduces ‘Latte’, a framework that overcomes these limitations by transferring latent-level knowledge from LLMs during training to guide downstream tabular models, mitigating overfitting and improving generalization under limited labels while leveraging unlabeled data for robust initialization.

Method

The ‘Latte’ framework transfers latent-level knowledge from LLMs to few-shot tabular models through a training-time extraction strategy. It positions the LLM as a ‘teacher’: task metadata (task and feature descriptions) is fed to the LLM, and the hidden states of the final transformer layer are average-pooled into a task-relevant knowledge vector. Key components include: 1) a semantic-aware tabular encoder that integrates feature semantics into representations, using BERT to embed categorical and numerical features and a Transformer to model feature interactions; 2) a knowledge adapter that aligns the LLM's latent knowledge with the tabular representations via a GTransformer and attention-based weighted fusion, producing predictive row embeddings. Training proceeds in two stages: unsupervised pre-training on unlabeled data with meta-learning over pseudo-labels obtained by clustering, followed by fine-tuning on the limited labeled data with knowledge-guided loss functions (a KL-divergence term plus the task-specific loss). This design aims to reduce overfitting and promote generalization in few-shot settings.
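To make the extraction and fine-tuning steps concrete, here is a minimal sketch, assuming a HuggingFace-style LLM and standard PyTorch; the model name, prompt format, and the `alpha` weighting are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of Latte-style latent knowledge extraction and a
# knowledge-guided fine-tuning loss, following the summary above.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


def extract_task_knowledge(metadata_prompt: str,
                           model_name: str = "gpt2") -> torch.Tensor:
    """Feed task/feature descriptions to the LLM once at training time and
    average-pool the final transformer layer's hidden states."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    llm = AutoModel.from_pretrained(model_name).eval()

    inputs = tokenizer(metadata_prompt, return_tensors="pt")
    with torch.no_grad():
        out = llm(**inputs, output_hidden_states=True)

    last_layer = out.hidden_states[-1]   # (1, seq_len, d_model)
    return last_layer.mean(dim=1)        # average pooling -> (1, d_model)


def knowledge_guided_loss(row_logits: torch.Tensor,
                          labels: torch.Tensor,
                          row_embed: torch.Tensor,
                          knowledge_proj: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """Task loss plus a KL term pulling the tabular row representation
    toward the (projected) LLM knowledge; `alpha` is a hypothetical weight."""
    task_loss = F.cross_entropy(row_logits, labels)
    kl = F.kl_div(F.log_softmax(row_embed, dim=-1),
                  F.softmax(knowledge_proj, dim=-1),
                  reduction="batchmean")
    return task_loss + alpha * kl
```

Because the knowledge vector depends only on the task metadata, it can be computed with a single LLM call and cached for the rest of training, which is the source of the framework's latency advantage over test-time extraction.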

Experiment

The experiments evaluate ‘Latte’ on nine real-world datasets (six classification, three regression) under shot settings ranging from 4 to 64 labeled samples, comparing against ten baselines that include traditional methods (Logistic Regression, XGBoost), few-shot algorithms (SCARF, STUNT), and LLM-based frameworks (FeatLLM, TabLLM). The setup uses AUC for classification and MSE for regression, with results averaged over three seeds. ‘Latte’ consistently outperforms the baselines, achieving an average 4.22% AUC improvement over FeatLLM on classification tasks, attributed to its use of unlabeled data and latent knowledge. The experimental design is comprehensive, covering diverse datasets and tasks, though the dataset selection might favor tabular structures amenable to semantic encoding. Ablation studies confirm the importance of components such as the semantic-aware encoder and meta-learning, but offer no analysis of failure cases. A study of LLM layer selection shows that deeper layers benefit classification but not regression, suggesting noise in the higher layers. While the results match the expectation of improved performance, generalizability to more complex or noisy tabular data remains untested.
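For concreteness, the evaluation loop implied by this setup might look like the sketch below; the random sampler and the `fit_model` trainer are placeholders rather than the authors' exact protocol.

```python
# Hedged reconstruction of the k-shot evaluation protocol described above:
# sample k labeled rows, train, score AUC on a held-out test set, and
# average over three seeds. `fit_model` is a placeholder trainer.
import numpy as np
from sklearn.metrics import roc_auc_score


def evaluate_few_shot(X, y, X_test, y_test, fit_model,
                      shots=(4, 8, 16, 32, 64), seeds=(0, 1, 2)):
    results = {}
    for k in shots:
        aucs = []
        for seed in seeds:
            rng = np.random.default_rng(seed)
            # Simple random k-shot split; the paper may balance per class.
            idx = rng.choice(len(X), size=k, replace=False)
            model = fit_model(X[idx], y[idx])
            scores = model.predict_proba(X_test)[:, 1]
            aucs.append(roc_auc_score(y_test, scores))
        results[k] = float(np.mean(aucs))  # mean AUC over seeds
    return results
```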

Further Thoughts

The ‘Latte’ framework opens up interesting avenues for integrating LLM capabilities into structured data tasks, particularly in domains where labeled data is scarce. However, a deeper exploration of the latent knowledge extraction process is warranted—does it truly capture task-specific nuances, or is it merely encoding general semantic patterns that may not always align with tabular data intricacies? This could be tested by applying ‘Latte’ to highly domain-specific tabular tasks where general LLM knowledge might be less relevant, such as niche industrial datasets. Additionally, the efficiency of minimal LLM calls is a practical advantage, but how does this scale in dynamic environments where task metadata changes frequently, necessitating repeated knowledge extraction? Cross-referencing with recent works on parameter-efficient tuning (like IA3 in TabLLM) could provide insights into hybrid approaches combining latent knowledge transfer with lightweight fine-tuning. Lastly, the framework’s reliance on BERT for semantic encoding raises questions about computational overhead in resource-constrained settings—could simpler embedding methods suffice without sacrificing performance? These considerations could guide future research to refine and broaden the applicability of such knowledge transfer paradigms.


