arXiv: 2502.03253

How do Humans and Language Models Reason About Creativity? A Comparative Analysis

Published:  at  10:59 AM

This paper presents a comparative analysis of creativity evaluation in STEM. It finds that human experts and LLMs prioritize different facets of originality (cleverness vs. remoteness/uncommonness) and are differentially influenced by contextual examples; LLMs show higher predictive accuracy but poorer construct validity because their facet ratings become homogenized.

Large Language Model, Reasoning, Human-AI Interaction, Classification, Few-Shot Learning

Antonio Laverghetta, Tuhin Chakrabarty, Tom Hope, Jimmy Pronchick, Krupa Bhawsar, Roger E. Beaty

Pennsylvania State University, Stony Brook University, Hebrew University of Jerusalem

Generated by grok-3

Background Problem

Creativity assessment in STEM fields traditionally relies on human expert judgment, yet the cognitive processes and biases shaping these evaluations are poorly understood. With the increasing role of Large Language Models (LLMs) in scientific research and innovation, including tasks like peer review and idea generation, there is a pressing need to understand how both humans and AI evaluate creativity, particularly in terms of originality and its facets (uncommonness, remoteness, cleverness), and whether their strategies align. This study addresses the gap by examining how contextual examples influence creativity ratings in STEM design problems, aiming to uncover differences in evaluation processes between human experts and LLMs.

Method

The study comprises two experiments on creativity evaluation using the Design Problems Task (DPT), which asks participants to generate solutions to real-world STEM challenges. In Study 1, 72 human experts with STEM training rated DPT responses for originality, uncommonness, remoteness, and cleverness on a five-point Likert scale. Raters were split into an ‘example’ condition (provided with rated sample solutions) and a ‘no example’ condition, and afterwards gave textual explanations of their judgments, which were analyzed with LLMs for linguistic markers (e.g., comparative or analytical language). Study 2 replicated this setup with LLMs (CLAUDE-3.5-HAIKU and GPT-4O-MINI), using identical prompts and conditions to rate the same DPT responses and assessing facet correlations and explanation styles. The methodology emphasizes fine-grained analysis to dissect originality and the impact of contextual examples on the judgment process.
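To make the Study 2 setup concrete, here is a minimal sketch of how an LLM rater could be queried for facet ratings of a single DPT response. The prompt wording, Likert anchors, JSON output format, and the `rate_dpt_response` helper are illustrative assumptions, not the paper's actual prompts.

```python
# Sketch of the LLM-rater setup (assumed prompt wording, not the paper's verbatim prompts).
from openai import OpenAI
import json

client = OpenAI()

FACETS = ["originality", "uncommonness", "remoteness", "cleverness"]

def rate_dpt_response(problem: str, solution: str, examples: str | None = None) -> dict:
    """Ask the model for 1-5 Likert ratings; `examples` is included only
    in the 'example' condition."""
    context = f"Here are sample solutions with expert ratings:\n{examples}\n\n" if examples else ""
    prompt = (
        f"{context}Design problem: {problem}\n"
        f"Proposed solution: {solution}\n\n"
        f"Rate the solution from 1 (low) to 5 (high) on each of: {', '.join(FACETS)}. "
        'Respond as JSON, e.g. {"originality": 3, "uncommonness": 2, "remoteness": 4, "cleverness": 3}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # CLAUDE-3.5-HAIKU would be queried analogously via its own SDK
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```

Running the same function with and without the `examples` string mirrors the example / no-example manipulation applied to both human and LLM raters.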

Experiment

The experiments used a dataset of over 7,000 DPT responses from undergraduate STEM majors, with expert ratings serving as ground truth. In Study 1, human experts (37 in the example condition, 35 in the no-example condition) showed moderate correlations (r = 0.45–0.67) between originality and its facets; examples increased the cleverness correlation but decreased the remoteness and uncommonness correlations, while accuracy in predicting true originality scores was similar across conditions (r = 0.44–0.47). Linguistic analysis indicated that no-example experts used more comparative language, suggesting memory-based comparisons. In Study 2, LLMs achieved higher correlations with ground truth (r = 0.6–0.76); examples boosted accuracy but homogenized facet correlations (up to 0.99), indicating little distinction between facets, and LLM explanations were more rigid and less diverse than human ones. The setup is reasonable for comparing human and AI evaluation, but limited example variation and prompt sensitivity may bias the results, and the near-perfect LLM facet correlations call construct validity into question despite the better predictive performance.
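The correlation analysis summarized above can be illustrated with a short sketch. The arrays below are toy placeholders rather than the paper's data; the Pearson-r computations only show the kind of comparisons reported (each facet vs. originality, and a rater's originality vs. the expert ground truth).

```python
# Illustrative correlation analysis with placeholder ratings (not the paper's data).
import numpy as np
from scipy.stats import pearsonr

ratings = {
    "originality":  np.array([3, 4, 2, 5, 1, 4]),
    "uncommonness": np.array([2, 4, 2, 5, 1, 3]),
    "remoteness":   np.array([3, 3, 1, 4, 2, 4]),
    "cleverness":   np.array([3, 4, 2, 5, 1, 5]),
}
expert_originality = np.array([3, 4, 3, 5, 2, 4])

# Facet-originality correlations: the values compared across the
# example / no-example conditions (and homogenized for the LLMs).
for facet in ("uncommonness", "remoteness", "cleverness"):
    r, _ = pearsonr(ratings["originality"], ratings[facet])
    print(f"originality vs {facet}: r = {r:.2f}")

# Predictive accuracy: a rater's originality scores vs expert ground truth.
r, _ = pearsonr(ratings["originality"], expert_originality)
print(f"originality vs expert ground truth: r = {r:.2f}")
```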

Further Thoughts

The stark contrast in how LLMs homogenize creativity facets compared to humans suggests a deeper issue in AI’s semantic understanding of abstract concepts like creativity—could this be tied to training data biases or over-reliance on statistical patterns rather than nuanced reasoning? This connects to broader challenges in AI interpretability, as seen in other domains like medical diagnostics where AI accuracy doesn’t always equate to meaningful decision-making. Future research could explore whether incorporating diverse human evaluation strategies into LLM training (e.g., via RLHF) mitigates homogenization. Additionally, testing across varied STEM tasks beyond DPT, such as hypothesis generation, might reveal if these discrepancies persist or are task-specific. This also raises ethical questions about deploying LLMs in high-stakes creativity assessment—without addressing construct validity, we risk automating biased or superficial judgments in scientific innovation.


