arXiv: 2501.06143

Multilingual Performance of a Multimodal Artificial Intelligence System on Multisubject Physics Concept Inventories

Published:  at  11:10 AM

This exploratory study evaluates GPT-4o’s multilingual and multimodal performance on physics concept inventories, revealing strong results in English and text-based tasks but significant weaknesses in visual interpretation and non-Western languages, highlighting implications for equitable AI integration in education.

Large Language Model, Multimodality, Multimodal Systems, Reasoning, Human-AI Interaction, AI Ethics

Gerd Kortemeyer, Marina Babayeva, Giulia Polverini, Ralf Widenhorn, Bor Gregorcic

ETH Zurich, Michigan State University, Charles University, Uppsala University, Portland State University

Generated by grok-3

Background Problem

The rapid integration of Large Language Models (LLMs) such as GPT-4o into education, particularly in physics, opens new avenues for teaching and learning but also carries risks of over-reliance and inequity, since performance varies across languages and modalities. This study addresses the underexplored question of how multimodal AI performs on physics concept inventories, focusing on multilingual contexts and the challenges of visual interpretation. The aim is to assess how well such systems handle diverse educational tasks compared to undergraduate students and to identify limitations that could affect their responsible use in physics education.

Method

The study evaluates GPT-4o on physics concept inventories sourced from PhysPort, presented as images (screenshots) that mimic real student test materials, including diagrams and text in multiple languages. The methodology involves: (1) preparing 3,662 image files of inventory items covering various physics subjects; (2) submitting each image to GPT-4o via Microsoft Azure AI Services three times to account for probabilistic outputs, using structured JSON prompts in English to elicit a problem description, reasoning, and an answer; (3) analyzing performance across subject categories, languages, and image dependency (text-only, unneeded images, required images); and (4) comparing the AI's scores to published undergraduate post-instruction scores. The core idea is to test multimodal capabilities in a realistic educational setting, focusing on conceptual understanding rather than numerical computation.
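To make the submission step concrete, here is a minimal sketch of what one such request might look like, assuming the Azure OpenAI Python SDK; the credentials, deployment name, prompt wording, and JSON keys are illustrative placeholders, not the authors' exact pipeline.

```python
# Illustrative sketch (not the authors' exact pipeline): submit one concept-inventory
# screenshot to a GPT-4o deployment on Azure OpenAI and request a structured JSON reply.
import base64
import json
from openai import AzureOpenAI  # assumes the openai Python SDK with Azure support

client = AzureOpenAI(
    api_key="...",                                   # placeholder credentials
    api_version="2024-06-01",
    azure_endpoint="https://example.openai.azure.com",
)

PROMPT = (
    "The image shows one multiple-choice physics concept-inventory item. "
    "Reply as JSON with the keys 'description', 'reasoning', and 'answer' "
    "(the letter of the chosen option)."
)

def solve_item(image_path: str, runs: int = 3) -> list[dict]:
    """Query the model several times per item to account for probabilistic outputs."""
    with open(image_path, "rb") as f:
        data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()
    answers = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="gpt-4o",                          # Azure deployment name (assumed)
            response_format={"type": "json_object"}, # ask for machine-readable output
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }],
        )
        answers.append(json.loads(resp.choices[0].message.content))
    return answers
```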

Experiment

The experiments used a comprehensive dataset of 54 physics concept inventories from PhysPort, spanning subjects such as mechanics, electromagnetism, and quantum physics, in 35 languages, yielding 14,022 solutions for 4,674 items. The setup assessed performance across diverse contexts, comparing English and non-English versions and evaluating image dependency. GPT-4o averaged 71.1% in English, with significant variation across subjects (e.g., 85.2% in Thermodynamics, 35.0% in Laboratory Skills) and languages (e.g., 74% in Portuguese for the FCI versus 20% in Punjabi). It outperformed average undergraduate post-instruction scores in most comparisons (68.9% of cases), with Laboratory Skills as the main exception. However, it struggled with required-image items (49% accuracy vs. 81% for text-only), indicating a clear limitation in visual interpretation.

The experimental design is comprehensive in scope but limited by manual data preparation, only three iterations per item, and potential translation-quality issues, any of which may skew results. The performance disparity across languages suggests training-data bias, and the setup evaluates only final answers rather than the quality of the reasoning.
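As a rough illustration of how the image-dependency comparison above can be tabulated, the following hypothetical sketch groups graded per-run records by category and computes mean accuracy; the record fields and category labels are assumptions for illustration, not the paper's actual analysis code.

```python
# Hypothetical aggregation sketch: given one graded record per (item, run),
# tabulate accuracy by image dependency (text-only vs. unneeded vs. required).
from collections import defaultdict

def accuracy_by_image_dependency(records: list[dict]) -> dict[str, float]:
    """records: dicts with an 'image_dependency' label and a boolean 'correct' flag."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["image_dependency"]] += 1
        correct[r["image_dependency"]] += int(r["correct"])
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy example: a text-only item answered correctly in two of three runs,
# and a required-image item missed in all three runs.
demo = [
    {"image_dependency": "text-only", "correct": True},
    {"image_dependency": "text-only", "correct": True},
    {"image_dependency": "text-only", "correct": False},
    {"image_dependency": "required", "correct": False},
    {"image_dependency": "required", "correct": False},
    {"image_dependency": "required", "correct": False},
]
print(accuracy_by_image_dependency(demo))  # approx {'text-only': 0.67, 'required': 0.0}
```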

Further Thoughts

The findings of this paper raise deeper questions about the role of AI in education beyond mere performance metrics. The significant language disparity suggests a potential reinforcement of global educational inequities, as students in non-Western language contexts may not benefit equally from AI tools. This connects to broader discussions in AI ethics about bias in training data, as seen in works like those by Nicholas and Bhatia (2023), and calls for targeted efforts in diversifying datasets or developing language-specific models.

Additionally, the struggle with visual interpretation aligns with ongoing challenges in computer vision and multimodal AI, suggesting a need for hybrid approaches that integrate specialized vision models with LLMs for subjects like physics where diagrams are crucial. This could tie into research on vision-language models like CLIP, exploring if such integrations improve performance on required-image tasks.

Finally, the paper’s observation that AI errors do not mirror student misconceptions opens a research avenue into whether AI can be trained to simulate student-like reasoning errors for pedagogical purposes, enhancing its utility as a teaching tool. These considerations underscore the necessity of a critical, equity-focused approach to AI deployment in education.


