Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation
Garvin Kruthof
Problem
This preprint addresses a gap in the evaluation of large language models (LLMs): their ability to adhere to specified constraints during multi-turn ideation. Existing literature lacks a systematic way to assess how well LLMs maintain fidelity to original objectives when generating iterative outputs. The author introduces DriftBench, a benchmark designed to quantify constraint adherence across models and interaction conditions, motivating a more nuanced understanding of LLM behavior in scientific ideation.
Method
The core technical contribution is DriftBench, which evaluates constraint adherence through 2,146 scored benchmark runs across seven LLMs from five providers, including two open-weight models. The benchmark spans four interaction conditions and 38 research briefs drawn from 24 scientific domains. The author introduces the knows-but-violates (KBV) rate, a metric that quantifies instances where a model accurately recalls a constraint yet fails to adhere to it. Structured checkpointing is evaluated as a mitigation: it reduces KBV rates but does not eliminate the dissociation between recall and adherence. Sensitivity analyses test robustness to temperature settings (0.7 vs. 1.0) and to the type of iterative pressure applied (novelty vs. rigor).
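Under one plausible reading of the metric, the KBV rate can be computed from per-turn evaluation records as the fraction of correctly-recalled constraints that are nonetheless violated. The sketch below illustrates this; the record structure and field names are assumptions for illustration, not taken from the paper.

```python
# Hedged sketch of a knows-but-violates (KBV) rate computation.
# Assumes each record marks whether the model correctly recalled a
# constraint and whether its output violated it (illustrative fields).

def kbv_rate(records):
    """Among turns where the model correctly recalled the constraint,
    return the fraction where the output still violated it."""
    recalled = [r for r in records if r["recalled_constraint"]]
    if not recalled:
        return 0.0
    violated = sum(1 for r in recalled if r["violated_constraint"])
    return violated / len(recalled)

# Toy example: three turns with correct recall, two of which violate.
runs = [
    {"recalled_constraint": True,  "violated_constraint": True},
    {"recalled_constraint": True,  "violated_constraint": False},
    {"recalled_constraint": False, "violated_constraint": True},
    {"recalled_constraint": True,  "violated_constraint": True},
]
print(kbv_rate(runs))  # 2 of 3 recalled turns violated -> 0.666...
```

Conditioning on correct recall is what distinguishes KBV from a plain violation rate: it isolates cases where the model demonstrably "knows" the constraint but fails to follow it.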
Results
The findings show that iterative pressure consistently increases the structural complexity of model outputs while often decreasing adherence to the original constraints. KBV rates vary widely across models, from 8% to 99%, indicating large disparities in constraint compliance. Human validation against blind raters suggests that the LLM judge under-detects constraint violations, so the reported violation rates are likely conservative (lower-bound) estimates. The results underscore the limitations of current LLMs in maintaining constraint fidelity over multi-turn interactions, with implications for their use in scientific contexts.
Limitations
The author acknowledges that while structured checkpointing reduces KBV rates, it does not fully resolve the dissociation between recall and adherence. The benchmark's reliance on human validation may introduce subjectivity, and under-detection of violations by the LLM judge could skew results. The study does not examine the long-term effects of constraint violations on output quality, nor whether fine-tuning could improve adherence. Generalizability across different domains and constraint types also remains untested.
Why it matters
This work matters for the deployment of LLMs in scientific ideation and other settings that require strict adherence to constraints. By quantifying the extent of constraint violations, it provides a framework for future research on improving LLM fidelity. Releasing DriftBench as an open benchmark enables further study of constraint adherence and could guide the development of models that align more reliably with user-defined objectives. The research highlights the need for ongoing evaluation of LLM capabilities, particularly in high-stakes applications where constraint adherence is paramount.
Author: Garvin Kruthof
Source: arXiv:2604.28031
URL: https://arxiv.org/abs/2604.28031v1