arXiv: 2505.02413

Task-Oriented Semantic Communication in Large Multimodal Models-based Vehicle Networks

Published at 10:21 AM

This paper proposes a task-oriented semantic communication framework for LMM-based vehicle AI, using LLaVA with Semantic Matching for efficient image slicing and Fusion Attention-based power allocation to prioritize critical data transmission, achieving significant accuracy improvements (up to 33.1% at low SNR) in traffic VQA tasks.

Semantic Communication, Large Multimodal Models, Resource Allocation, Visual Attention, Vehicle AI, Efficiency

Baoxia Du, Hongyang Du, Dusit Niyato, Ruidong Li

Kanazawa University, University of Hong Kong, Nanyang Technological University

Generated by grok-3

Background Problem

The paper addresses the challenge of deploying computationally intensive Large Multimodal Models (LMMs) in resource-constrained vehicle AI systems, where real-time processing and reliable communication are critical for tasks like autonomous driving assistance. Deployment purely on edge devices is hindered by limited computational power, while cloud-based solutions require efficient data transmission under varying channel conditions. The key problem is optimizing data communication and resource allocation so that LMM-based vehicle networks maintain high accuracy and low latency, particularly in low signal-to-noise ratio (SNR) environments.

Method

The proposed method introduces a task-oriented semantic communication framework for LMMs in vehicle AI, built on the Large Language and Vision Assistant (LLaVA) model. The core idea is to split the computation: the lightweight visual encoder runs on the vehicle for feature extraction, while the computation-intensive LLM inference is offloaded to a cloud server, reducing the onboard load. The framework has two key components. (1) Semantic Matching (SM)-based Image Slicing uses lightweight models (YAKE for keyword extraction from the user query, YOLO for object detection, and GloVe embeddings for semantic similarity) to encode only the image regions relevant to the user's question, reducing the number of visual tokens and hence the inference cost. (2) Fusion Attention-based Semantic Communication (FA-SemCom) combines objective (saliency-based) and subjective (user query-based) attention to score the semantic importance of image patches, then dynamically allocates transmission power to prioritize critical features using a quantized weight system and a tunable parameter β. Together, these components use communication resources efficiently and preserve the integrity of critical data in challenging channel conditions.
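The two components can be illustrated with a toy sketch. The word vectors, similarity threshold, quantization scheme, and the exponent applied via β below are illustrative assumptions, not the paper's exact formulation; in the actual pipeline the keywords come from YAKE, the region labels from YOLO, and the embeddings from GloVe.

```python
import numpy as np

# Toy word vectors standing in for GloVe embeddings (illustrative only).
EMB = {
    "pedestrian": np.array([0.9, 0.1, 0.0]),
    "person":     np.array([0.8, 0.2, 0.1]),
    "sign":       np.array([0.1, 0.9, 0.2]),
    "tree":       np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sm_select_regions(query_keywords, detections, threshold=0.7):
    """SM-based image slicing: keep only detected regions whose label is
    semantically close to a keyword extracted from the user's query."""
    kept = []
    for label, box in detections:
        score = max(cosine(EMB[label], EMB[k]) for k in query_keywords)
        if score >= threshold:
            kept.append((label, box))
    return kept

def fa_semcom_power(saliency, relevance, alpha=0.5, beta=2.0,
                    levels=4, total_power=1.0):
    """FA-SemCom-style allocation: fuse objective (saliency) and subjective
    (query) attention, quantize the importance into discrete weight levels,
    and emphasize important patches before normalizing the power budget."""
    importance = alpha * np.asarray(saliency) + (1 - alpha) * np.asarray(relevance)
    weights = np.floor(importance * levels).clip(0, levels - 1) + 1
    emphasis = weights ** beta  # beta sharpens the important/unimportant contrast
    return total_power * emphasis / emphasis.sum()

# A query about pedestrians keeps the "person" region and drops the tree.
regions = sm_select_regions(["pedestrian"],
                            [("person", (10, 10, 50, 80)),
                             ("tree", (60, 0, 90, 40))])

# Most transmission power goes to the most important patch.
power = fa_semcom_power(saliency=[0.9, 0.2, 0.5], relevance=[0.8, 0.1, 0.4])
```

With α=0.5 the fused importance of the three patches is [0.85, 0.15, 0.45], so the first patch receives the largest share of the power budget while the total allocation stays normalized.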

Experiment

The experiments were conducted on a custom traffic Visual Question Answering (VQA) dataset of 41 images and 172 questions focusing on traffic elements such as signs and pedestrians, with evaluations run on an Ubuntu 20.04 system with an NVIDIA RTX A6000 GPU. The setup simulated real-world vehicle-to-server communication using the Fisher-Snedecor F fading model for channel conditions. Results showed that the SM module reduced computational load (FLOPs) and cut response time by 27% compared to LLaVA-1.6, with minimal accuracy loss (e.g., 0.6% for Vicuna-7B). FA-SemCom significantly improved answer accuracy over average power allocation (AVG-SemCom), especially at low SNR, with gains of 13.4% at 12 dB and 33.1% at 10 dB.

The experimental design is reasonable for an initial validation on traffic scenarios, but the small dataset and the absence of comparisons with other advanced semantic communication frameworks limit its comprehensiveness. The results match the expectation of improved performance under constrained conditions, though generalizability remains a concern given the niche dataset and fixed parameters such as the attention fusion coefficient α=0.5.
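The fading setup can be sketched as follows: a Fisher-Snedecor F-distributed channel power gain is the ratio of two normalized gamma variates, where one shape parameter models multipath severity and the other shadowing severity. The parameter values (m, ms), the average SNR, and the outage threshold below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def fisher_snedecor_f_gain(rng, m=2.0, ms=3.0, size=10_000):
    """Draw channel power gains under Fisher-Snedecor F fading.

    An F(2m, 2ms) variate is the ratio of two gamma variates normalized by
    their shapes: m captures multipath fading severity, ms shadowing
    severity (values here are illustrative).
    """
    multipath = rng.gamma(shape=m, scale=1.0, size=size) / m
    shadowing = rng.gamma(shape=ms, scale=1.0, size=size) / ms
    return multipath / shadowing

rng = np.random.default_rng(0)
gains = fisher_snedecor_f_gain(rng)

# Instantaneous SNR in dB around a 12 dB average, and the fraction of
# deep fades below a 5 dB threshold (both values are assumptions).
avg_snr_db = 12.0
snr_db = avg_snr_db + 10 * np.log10(gains)
outage = float(np.mean(snr_db < 5.0))
```

Sweeping `avg_snr_db` over the 10-14 dB range used in the paper's results and measuring accuracy per realization is one way to reproduce the kind of low-SNR comparison reported above.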

Further Thoughts

The framework’s focus on semantic importance through attention fusion is a valuable contribution, potentially applicable beyond vehicle networks to domains like remote healthcare or industrial IoT, where bandwidth and computational constraints are prevalent. However, the reliance on a static attention fusion coefficient (α=0.5) could be revisited by exploring adaptive mechanisms that adjust based on real-time channel quality or user intent, drawing inspiration from adaptive bitrate streaming in video transmission. Additionally, integrating this approach with recent advancements in federated learning could enhance privacy by minimizing raw data transmission, aligning with the paper’s concern for data security. A potential research direction could be to test the framework’s robustness against adversarial inputs or extreme noise, as real-world vehicle environments often encounter such challenges. Comparing FA-SemCom with other semantic communication systems, such as those using reinforcement learning for dynamic resource allocation, could further validate its superiority and uncover complementary strategies. Lastly, the small dataset size highlights a need for broader benchmarking on diverse, large-scale datasets to ensure the framework’s scalability and robustness across varied scenarios.


