Major agents robotics

Beyond One-shot: AI Agents for Learning in Field Experiments

Junjie Luo, Ritu Agarwal, Gordon Gao

Published: Jun 1, 2026 — 16:30 UTC

Problem
This work addresses the underutilization of data generated from A/B testing experiments, particularly in the context of healthcare prescription messaging. The authors identify a gap in the literature regarding the ability of AI to autonomously learn from prior experimental data to inform subsequent intervention designs. The study is presented as a preprint, indicating that it has not yet undergone peer review.

Method
The authors conducted two-stage field experiments involving 693,139 patient visits. In Stage 1, a Human + Chatbot method was employed, where behavioral experts collaborated with conversational AI to co-design 13 message variants, tested on 444,691 patient visits. In Stage 2, the Tool-Augmented Agentic AI method autonomously extracted principles from the data generated in Stage 1 to create 17 new message variants, tested on 248,448 patient visits. The AI system utilized analytical tools and structured Data-Information-Knowledge-Wisdom (DIKW) reasoning agents, along with transparent evidence chains, to derive insights from the experimental data.

Results
The results demonstrated that the AI-generated interventions significantly outperformed the baseline. The best message variant produced by the Agentic AI achieved a click-through rate (CTR) of 69.8%, which is an increase of 6.5 percentage points over the baseline. Notably, the study found that general-purpose behavioral theories did not uniformly apply to the specific healthcare context, highlighting the necessity for domain-specific insights in intervention design.

Limitations
The authors acknowledge that their findings are contingent on the specific healthcare context and may not generalize to other domains without further validation. They also note that frontier large language models (LLMs) lacking access to experimental data were ineffective in predicting successful interventions, suggesting limitations in their general reasoning capabilities. Additionally, the study’s reliance on a single healthcare setting may restrict the applicability of the results.

Why it matters
This research has significant implications for the field of behavioral experimentation, suggesting that tool-augmented AI can transform the process from a one-shot evaluation to a scalable system for cumulative design learning. By demonstrating the potential of AI to autonomously generate domain-relevant interventions, the study paves the way for more effective and data-driven approaches in A/B testing and intervention design. The findings encourage further exploration of agentic AI methodologies in various contexts, as published in arXiv cs.AI.

By Callan Zhang · Jun 1, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI