Notable agents robotics Meta

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy

Published
May 27, 2026 — 17:46 UTC

Problem
This preprint addresses the gap in understanding the necessity of semantic metadata for autonomous agents in data retrieval tasks. With the emergence of Large Language Models (LLMs) capable of processing unstructured data, the authors investigate whether traditional semantic metadata frameworks, such as schema.org, are still essential for effective data discovery. The study contrasts the performance of a Baseline Agent, which retrieves data from the unstructured web, against a Semantic Agent that utilizes a structured dataset corpus.

Method
The authors employ a comparative analysis framework involving two agents: the Baseline Agent, which searches billions of open-web documents, and the Semantic Agent, which leverages a curated corpus of 90 million datasets annotated with schema.org metadata. The evaluation methodology utilizes an “LLM-as-a-judge” pipeline, which assesses the retrieved data against the FAIR principles (Findable, Accessible, Interoperable, and Reusable). Key metrics include semantic relevance, data accessibility, and computational utility. The training compute specifics are not disclosed, but the evaluation emphasizes precision and coverage as primary performance indicators.

Results
The Semantic Agent significantly outperforms the Baseline Agent in precision metrics. It achieves a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads. In contrast, the Baseline Agent, while answering 40% more questions, suffers from “Last-Mile Utility” failures, retrieving 20.1% prose-heavy pages and 8.5% portal landing pages instead of actionable data. Overall, the Semantic Agent demonstrates a 65.7% higher precision in retrieving FAIR-compliant datasets, highlighting its effectiveness in structured data retrieval.

Limitations
The authors acknowledge that while the Semantic Agent excels in precision, it may not match the Baseline Agent’s coverage in exploratory tasks. They do not address potential biases in the LLM evaluation process or the scalability of the Semantic Agent’s approach in diverse real-world applications. Additionally, the reliance on schema.org metadata may limit the generalizability of the findings to other metadata standards or unstructured data environments.

Why it matters
This research underscores the critical role of semantic metadata in enhancing the reliability and accuracy of autonomous data retrieval systems. The findings suggest that while LLMs can facilitate broad exploratory data tasks, structured metadata remains essential for executing precise, actionable workflows. This has significant implications for the design of future autonomous agents and data retrieval systems, advocating for a hybrid approach that integrates both unstructured and structured data methodologies to optimize performance in diverse applications.

Authors: Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy
Source: arXiv:2605.28787
URL: https://arxiv.org/abs/2605.28787v1

Turing Wire

By Turing Wire editorial staff · May 27, 2026 · Editorial standards →

Source: arXiv cs.AI