Major evaluation benchmarks

PhantomBench: Benchmarking the Non-existential Threat of Language Models

Haeji Jung, Hila Gonen

Published: Jun 9, 2026 — 17:03 UTC

Problem
The paper addresses the critical issue of hallucinations in language models (LMs), where models generate factually incorrect information. This phenomenon poses significant risks, particularly in high-stakes applications where reliance on LMs can lead to harmful consequences. Despite advancements in understanding hallucinations, there is a lack of systematic evaluation regarding LMs’ ability to recognize their knowledge limitations. The authors present PhantomBench, the first large-scale benchmark designed to assess this capability, filling a gap in the literature on model reliability and factual grounding. Notably, this work is a preprint and has not undergone peer review.

Method
PhantomBench consists of over 60,000 non-existent terms and entities derived from real-world concepts across various domains. The authors evaluate 21 different models, including both small and large architectures, to assess their hallucination rates when confronted with these non-existent concepts. The evaluation methodology involves presenting models with inputs that imply the existence of these terms, thereby testing their ability to abstain from generating responses based on false premises. The authors also provide a detailed pipeline for constructing PhantomBench, allowing researchers to generate tailored non-existent concepts for further experimentation.

Results
The evaluation reveals alarmingly high hallucination rates across all tested models, with some models exhibiting average rates as high as 86.7%. This indicates a pervasive issue in the current state of LMs, where even advanced models fail to recognize and abstain from generating content about non-existent entities. The results highlight that the presence of input that presumes the existence of these terms significantly exacerbates the hallucination problem. The findings suggest that PhantomBench can effectively serve as a proxy for studying model behavior concerning rare concepts, which are typically more prone to hallucination.

Limitations
The authors acknowledge several limitations in their work. First, the benchmark is limited to non-existent terms derived from real concepts, which may not encompass the full spectrum of hallucination scenarios. Additionally, the evaluation does not account for the contextual nuances that might influence model responses. The authors also note that while PhantomBench provides a scalable method for generating non-existent concepts, the effectiveness of this approach in real-world applications remains to be validated. Furthermore, the reliance on existing models may not capture the full range of hallucination behaviors across future architectures.

Why it matters
The implications of this work are significant for the development and deployment of language models in critical applications. By providing a systematic approach to benchmark hallucination rates, PhantomBench enables researchers and practitioners to better understand and mitigate the risks associated with LMs. This benchmark can inform future model training and evaluation strategies, ultimately leading to more reliable AI systems. The findings underscore the necessity for ongoing research into model grounding and the recognition of knowledge limits, as emphasized in the context of AI safety and ethics, as published in arXiv cs.AI.

By Callan Zhang · Jun 9, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI