This paper evaluates LLMs in intelligent tutoring systems for propositional logic, demonstrating DeepSeek-V3’s promising accuracy in proof construction (up to 86.7%) and hint generation (75%) while revealing significant pedagogical limitations in justification and subgoaling that necessitate hybrid approaches for educational integration.
Large Language Model, Generative AI, Reasoning, Human-AI Interaction, Prompt Engineering
Sutapa Dey Tithi, Arun Kumar Ramesh, Clara DiMarco, Xiaoyi Tian, Nazia Alam, Kimia Fazeli, Tiffany Barnes
North Carolina State University
Generated by grok-3
Background Problem
Intelligent tutoring systems (ITSs) are effective for teaching formal propositional logic but are limited by template-based feedback that lacks personalization. Large Language Models (LLMs) offer the potential for dynamic, scalable feedback generation, yet they are prone to hallucinating incorrect information and producing pedagogically unsound explanations. This work addresses the gap in evaluating LLMs for constructing multi-step symbolic logic proofs and generating next-step hints within a pedagogical context, aiming to enhance ITS capabilities while ensuring accuracy and educational appropriateness.
Method
The study evaluates LLMs on two tasks: constructing step-by-step propositional logic proofs and generating next-step hints with explanations in an ITS. Four LLMs (Llama-3-70B-Instruct, Gemini-Pro, GPT-4o, DeepSeek-V3) were tested with six prompting techniques, ranging from Zero-Shot (ZS) to advanced Few-Shot methods such as Chain-of-Thought (CoT) and Tree-of-Thoughts CoT (ToT_CoT), incorporating learning science principles. Proof construction was assessed on two datasets (Symbolic Propositional Logic, 310 problems; Logic Tutor, 48 problems), while hint generation used 1,050 student problem-solving states drawn from real ITS interaction logs. Prompts were structured to elicit step-by-step derivations with JSON-formatted outputs, and hint quality was rated on consistency, clarity, justification, and subgoaling by human experts and an LLM grader.
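To make the prompting setup concrete, below is a minimal sketch of a Few-Shot Chain-of-Thought style prompt and its expected JSON output for a single proof-construction query. The exact wording, the worked example, and the JSON field names (`rule`, `premises`, `derived`) are illustrative assumptions, not the paper's actual prompt templates.

```python
import json
import textwrap

# Illustrative few-shot exemplar; the paper's actual exemplars and schema may differ.
FEW_SHOT_EXAMPLE = textwrap.dedent("""\
    Problem: Given premises {A, A -> B}, prove B.
    Reasoning: A -> B is a conditional and A is its antecedent,
    so Modus Ponens lets us derive the consequent B.
    Answer (JSON):
    {"steps": [{"rule": "Modus Ponens", "premises": ["A", "A -> B"], "derived": "B"}]}
""")


def build_fs_cot_prompt(premises: list[str], conclusion: str) -> str:
    """Assemble a few-shot CoT prompt that asks for a step-by-step proof in JSON."""
    return (
        "You are a propositional logic tutor. Derive the conclusion from the premises "
        "one inference rule at a time, explaining each step before giving the final "
        "answer as JSON.\n\n"
        f"Worked example:\n{FEW_SHOT_EXAMPLE}\n"
        f"Problem: Given premises {{{', '.join(premises)}}}, prove {conclusion}.\n"
        "Reasoning:"
    )


def parse_proof(raw_response: str) -> list[dict]:
    """Extract the JSON proof object from the model's free-text response (sketch only)."""
    json_start = raw_response.find('{"steps"')
    return json.loads(raw_response[json_start:])["steps"]


if __name__ == "__main__":
    prompt = build_fs_cot_prompt(["P -> Q", "Q -> R", "P"], "R")
    print(prompt)  # this string would be sent to the LLM under evaluation
```

Requesting a structured output in this way is one plausible route to the automated accuracy scoring described above, since each proposed step can be parsed and compared against reference proofs.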
Experiment
Experiments covered proof construction and hint generation. For proof construction, DeepSeek-V3 with FS_ToT_CoT achieved the highest accuracy (85.0% on the Logic Tutor dataset and 86.7% on the Symbolic Propositional Logic dataset), outperforming the other LLMs, particularly on simpler rules, though performance dropped on complex rules involving negation (e.g., 64.07% for Contrapositive). Prompting technique had a significant effect, with Few-Shot methods vastly outperforming Zero-Shot. For hint generation, DeepSeek-V3 with FS_CoT reached 75% accuracy across the 1,050 student states, with accuracy of 87.9% on simpler training levels declining to 63.2% as difficulty increased. The setup was comprehensive, using real student data and multiple evaluation metrics, but hint accuracy at this level is insufficient for standalone ITS integration, as erroneous hints risk undermining student trust. Human evaluation of 20% of the hint explanations rated them highly for consistency (3.9/4) and clarity (3.09/4) but lower for justification (2.24/4) and subgoaling (1.66/4), indicating pedagogical gaps. The results partially meet expectations for accuracy but highlight the need for hybrid systems to ensure reliability and educational value.
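As a concrete illustration of the hybrid direction these results point toward, the sketch below pairs an LLM-proposed hint with a symbolic rule checker that verifies the step before it reaches the student, falling back to a template hint when verification fails. The rule checker, function names, and hint format are assumptions for illustration, not the tutor's actual implementation.

```python
# Sketch of a hybrid hint pipeline: a symbolic checker gates LLM-proposed hints.

def modus_ponens(premises: set[str], derived: str) -> bool:
    """Check whether `derived` follows from `premises` by Modus Ponens."""
    return any(
        p == f"{a} -> {derived}" and a in premises
        for p in premises
        for a in premises
    )

# Illustrative registry; a real system would cover the tutor's full rule set.
RULE_CHECKERS = {"Modus Ponens": modus_ponens}


def validate_llm_hint(premises: set[str], hint: dict) -> bool:
    """Accept the hint only if the cited rule actually licenses the proposed step."""
    checker = RULE_CHECKERS.get(hint["rule"])
    return checker is not None and checker(premises, hint["derived"])


def deliver_hint(premises: set[str], llm_hint: dict, template_hint: str) -> str:
    """Prefer the verified LLM hint; otherwise fall back to the template-based hint."""
    if validate_llm_hint(premises, llm_hint):
        return f"Try applying {llm_hint['rule']} to derive {llm_hint['derived']}."
    return template_hint


if __name__ == "__main__":
    state = {"P", "P -> Q"}
    good = {"rule": "Modus Ponens", "derived": "Q"}
    bad = {"rule": "Modus Ponens", "derived": "R"}
    print(deliver_hint(state, good, "Review your conditional premises."))  # verified hint
    print(deliver_hint(state, bad, "Review your conditional premises."))   # falls back
```

This kind of gating would address the reliability concern without discarding the personalization benefits of LLM-generated explanations.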
Further Thoughts
The parallel between human cognitive load in handling negation and LLMs’ struggles with complex logic rules is intriguing and suggests a deeper connection between human reasoning mechanisms and current LLM architectures. It could inspire research into whether training LLMs with cognitive-science-inspired constraints (e.g., working memory limits) improves their reasoning over complex structures. The observed decline in hint accuracy with increasing problem difficulty also mirrors challenges in other AI domains, such as natural language reasoning benchmarks, where performance drops with task complexity. A cross-disciplinary link could be explored with AI for mathematics education, where symbolic solvers are often paired with explanatory systems: could such hybrid models be adapted for logic ITSs to validate LLM outputs? Another concern is the scalability of human evaluation; relying on expert ratings for only 20% of the data raises questions about how to automate pedagogical quality assessment without losing nuance. Future work might integrate reinforcement learning from human feedback (RLHF) tailored to educational outcomes, ensuring LLMs prioritize pedagogical alignment over mere accuracy.