Notable evaluation benchmarks

Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

Samuel Schapiro, Alexi Gladstone, Jonah Black, Heng Ji

Published
May 13, 2026 — 12:46 UTC

Problem
This preprint examines how creativity in large language models (LLMs) should be assessed. Existing methods often apply human creativity tests to LLMs, but the validity of those tests as measures of machine creativity has not been established. The tests also have limited predictive power even for human creativity, which raises further questions about their applicability to LLMs. The authors set out to systematically evaluate how well various human creativity tests predict the creative capabilities of LLMs across three constructs: creative writing, divergent thinking, and scientific ideation.

Method
The authors conducted a large-scale study of the predictive validity of human creativity tests for LLMs. They evaluated several existing tests, including the Divergent Association Task (DAT) and the Conditional DAT, and introduced a new one, the Divergent Remote Association Test (DRAT), which is designed to assess both convergent and divergent thinking within a single framework. The tests were administered to LLMs, and the models' performance was analyzed across the three target constructs. The authors used statistical methods to measure the correlation between test scores and LLM performance on creative tasks, with a focus on how robust the DRAT is compared with existing tests.
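For readers unfamiliar with the DAT, the sketch below shows how a DAT-style score is commonly computed: the mean pairwise cosine distance between the embeddings of the submitted words, multiplied by 100. This summary does not say how the authors compute scores, so the function names and the two-dimensional toy vectors here are assumptions for illustration only; real scoring relies on pretrained word embeddings.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dat_style_score(words, embeddings):
    """Mean pairwise cosine distance between the words' vectors, times 100."""
    vectors = [embeddings[w] for w in words]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two words")
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy vectors (assumption): "cat" and "tiger" point the same way,
# "comet" is orthogonal to both. These are not real word embeddings.
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0],
    "tiger": [2.0, 0.0],
    "comet": [0.0, 1.0],
}
```

With these toy vectors, a list of closely related words such as "cat" and "tiger" scores lower than a list that also includes the unrelated word "comet", which is the intuition behind the test: more semantically distant words yield a higher score.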

Results
The study found that the DAT and the Conditional DAT were the most effective predictors of creative writing and divergent thinking, respectively. Effectiveness varied significantly by construct, however, and no single existing test showed strong predictive power across all three. In particular, the authors report that no existing test reliably predicts scientific ideation ability. The DRAT is reported to be the first test to robustly predict scientific ideation ability in LLMs. Its performance gain could not be replicated by any linear combination of the DAT and the Remote Associates Test (RAT), which suggests that divergent and convergent thinking need to be assessed within a single instrument rather than combined after the fact.
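One way to picture the linear-combination claim is the check sketched below. An ordinary least-squares fit gives the weighted sum of two predictors that correlates most strongly with an outcome, so its correlation is a ceiling for any linear combination of them; that ceiling can then be compared with the correlation of a third score on its own. This is a minimal in-sample sketch, not the authors' analysis: the function names are hypothetical, and the choice of Pearson correlation is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def least_squares_weights(x1, x2, y):
    """Weights (w1, w2) of the least-squares fit y ~ w1*x1 + w2*x2 + intercept."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [a - m1 for a in x1]
    c2 = [a - m2 for a in x2]
    cy = [a - my for a in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are collinear or constant")
    # Solve the 2x2 normal equations by Cramer's rule.
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det


def linear_ceiling(x1, x2, y):
    """Best correlation with y achievable by any linear combination of x1 and x2."""
    w1, w2 = least_squares_weights(x1, x2, y)
    combined = [w1 * a + w2 * b for a, b in zip(x1, x2)]
    return pearson(combined, y)
```

In this framing, `x1` and `x2` would hold per-model DAT and RAT scores, `y` a per-model measure of scientific ideation, and the claim corresponds to the single-instrument correlation `pearson(drat_scores, y)` exceeding `linear_ceiling(x1, x2, y)`. A held-out comparison would be stricter than this in-sample one.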

Limitations
The authors acknowledge that the validity of the tests may be influenced by the specific LLMs used in the study, and the results may not generalize across all models. Additionally, the study does not explore the underlying mechanisms of creativity in LLMs, focusing instead on predictive validity. The authors also note that while the DRAT shows promise, further validation across diverse LLM architectures and datasets is necessary to confirm its effectiveness.

Why it matters
The work matters for AI creativity assessment because it tests, rather than assumes, whether human creativity tests carry over to LLMs. If the DRAT holds up under further validation, it could inform training methods and evaluation frameworks aimed at more creatively capable models. The research also raises questions about the nature of creativity in machines and the potential for LLMs to contribute to creative fields.

Source: arXiv:2605.13450
URL: https://arxiv.org/abs/2605.13450v1

Turing Wire

By Turing Wire editorial staff · May 13, 2026

Source: arXiv cs.CL