Major evaluation benchmarks OpenAI

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

A. J. Lew, Y. Cao, M. J. Buehler

Published: May 28, 2026 — 17:38 UTC

Also in this story: Google

Problem
This paper addresses the gap in evaluating the innovative reasoning capabilities of large language models (LLMs) in the context of scientific hypothesis generation. Existing benchmarks primarily focus on multi-hop retrieval tasks, neglecting the nuanced reasoning required for scientific discovery. The authors propose a novel benchmark framework, ProjectionBench, to systematically assess LLM performance in generating hypotheses under progressive information disclosure. This work is presented as a preprint and has not yet undergone peer review.

Method
The core contribution is the ProjectionBench framework, which evaluates LLMs by progressively revealing information about a scientific problem. Initially, models are provided only with the topic and research question derived from a recent paper. As the evaluation progresses, additional technical details are disclosed, allowing the model to generate hypotheses at each stage. The generated hypotheses are then compared to the original paper’s conclusions using automated semantic similarity metrics, focusing on the alignment of atomic claims. The evaluation captures two dimensions: innovativeness under minimal information and grounded reasoning capabilities under full experimental details. The models evaluated include GPT-5, GPT-5.4, Gemini 2.5 pro, and Gemini 3.1 pro preview, across 45 papers in the domains of bioactive materials, mechanical materials, and nanomaterials.

Results
The results indicate that GPT-5.4 and Gemini 3.1 pro outperform their predecessors, with GPT-5.4 achieving a notable F1 score of 0.7 in alignment with ground truth conclusions, even when provided with minimal context. This performance is significant compared to baseline models, demonstrating enhanced capabilities in hypothesis generation and reasoning. The evaluation framework allows for a nuanced understanding of model performance across different stages of information disclosure, highlighting the models’ ability to adapt their reasoning as more context is provided.

Limitations
The authors acknowledge several limitations, including the reliance on automated semantic similarity metrics, which may not fully capture the qualitative aspects of scientific reasoning. Additionally, the benchmark is limited to a specific set of domains (bioactive materials, mechanical materials, and nanomaterials), which may not generalize across all scientific fields. The progressive information disclosure approach may also introduce biases based on the order of information revealed. Furthermore, the models evaluated are still in the preview stage, and their performance may change upon full release.

Why it matters
This work has significant implications for the development of AI systems aimed at scientific discovery. By providing a structured framework for evaluating LLMs in generating hypotheses, it lays the groundwork for future research in AI-assisted scientific inquiry. The ability to assess both innovative reasoning and grounded capabilities is crucial for advancing AI systems that can function as co-scientists, potentially transforming the landscape of scientific research and discovery.

Authors: A. J. Lew, Y. Cao, M. J. Buehler
Source: arXiv:2605.30284
URL: https://arxiv.org/abs/2605.30284v1

By Callan Zhang · May 28, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI