SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

Published: Jun 17, 2026 — 11:29 UTC

Problem
The paper addresses a limitation of existing sentence-level AI-generated text detection (S-AGTD) methods: they classify each sentence in isolation and ignore inter-sentence dependencies. It also highlights the lack of comprehensive benchmarks that cover the latest AI text generators. To fill that gap, the authors present MOSAIC, a new benchmark of 16,000 hybrid documents sourced from PubMed and XSum and generated with the recent models DeepSeek-V3.2 and Kimi K2 under rigorous quality controls.

Method
The authors propose SenFlow, a structured prediction model that integrates graph-based inter-sentence propagation with linear-chain Conditional Random Field (CRF) decoding. This approach allows for a document-level analysis of sentence sequences, leveraging a sentence graph to capture dependencies between sentences. The training process utilizes the MOSAIC dataset, which includes documents generated under stringent perplexity-consistency filters, ensuring high-quality data for training and evaluation. The architecture is designed to operate in a single pass over the sentence graph, enhancing efficiency in detecting AI-generated content.
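To make the two ingredients named above concrete, here is a minimal sketch, not the authors' implementation. It assumes two labels (0 = human, 1 = AI), made-up per-sentence scores, a simple chain-shaped sentence graph, and plain neighbor averaging standing in for whatever propagation operator the paper actually uses. The names `propagate`, `viterbi`, and `alpha` are hypothetical; only the Viterbi recursion itself is the standard decoder for a linear-chain CRF.

```python
def propagate(emissions, edges, alpha=0.5):
    """One pass of neighbor averaging over a sentence graph (illustrative only).

    emissions: list of [score_human, score_ai], one entry per sentence.
    edges: dict mapping a sentence index to the indices of its neighbors.
    alpha: weight kept on a sentence's own scores (hypothetical parameter).
    """
    out = []
    for i, own in enumerate(emissions):
        nbrs = edges.get(i, [])
        if not nbrs:
            out.append(list(own))
            continue
        mean = [sum(emissions[j][k] for j in nbrs) / len(nbrs)
                for k in range(len(own))]
        out.append([alpha * own[k] + (1 - alpha) * mean[k]
                    for k in range(len(own))])
    return out


def viterbi(emissions, transitions):
    """Highest-scoring label sequence under a linear-chain CRF.

    transitions[a][b] is the score for moving from label a to label b.
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    back = []  # back[t-1][b] = best previous label when sentence t has label b
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for b in range(n_labels):
            best_a = max(range(n_labels),
                         key=lambda a: score[a] + transitions[a][b])
            ptr.append(best_a)
            new_score.append(score[best_a] + transitions[best_a][b]
                             + emissions[t][b])
        score = new_score
        back.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda b: score[b])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Made-up scores: sentence 2 looks weakly "AI" when judged in isolation.
emissions = [[2.0, 0.0], [2.0, 0.0], [0.4, 0.6], [2.0, 0.0], [2.0, 0.0]]
edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}  # chain graph
transitions = [[0.5, -0.5], [-0.5, 0.5]]  # staying on a label is rewarded

isolated = [max(range(2), key=lambda k: e[k]) for e in emissions]
decoded = viterbi(propagate(emissions, edges), transitions)
print(isolated)  # [0, 0, 1, 0, 0]
print(decoded)   # [0, 0, 0, 0, 0]
```

In this toy case the isolated classifier flips a single sentence to "AI", while decoding over the whole sequence keeps the document consistent. That is the kind of inter-sentence evidence the summary says isolated classifiers ignore.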

Results
SenFlow achieves state-of-the-art performance on the MOSAIC benchmark, improving average Macro-F1 by 4.15 percentage points over existing methods on cross-domain transfer tasks. The result is particularly notable on the most challenging of the three evaluation protocols, indicating the model’s robustness in diverse contexts. The authors also observe that even after the perplexity filter is applied to mitigate overt cues, AI-generated sentences exhibit a generator-dependent length disparity that the model can still exploit.
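For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so the human and AI classes count equally regardless of how many sentences each has. The sketch below uses made-up labels rather than data from the paper. The function name `macro_f1` is an assumption, and the summary does not say whether the authors pool sentences across documents or average per document.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative labels only: 0 = human-written sentence, 1 = AI-generated.
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0]
score = macro_f1(y_true, y_pred)
print(round(100 * score, 2))  # 82.86
```

A gain reported in percentage points is the difference between two such scores once each is expressed on a 0 to 100 scale.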

Limitations
The authors acknowledge that while SenFlow improves upon previous methods, it may still be susceptible to certain types of adversarial examples or novel generation techniques not covered in the training data. Additionally, the reliance on a specific set of benchmarks may limit the generalizability of the findings to other domains or types of hybrid documents. The paper does not explore the computational efficiency of the model in real-time applications, which could be a critical factor for deployment.

Why it matters
SenFlow and the MOSAIC benchmark advance AI-generated text detection in hybrid documents, where human and AI contributions coexist. By modeling the inter-sentence dependencies that previous sentence-level detectors overlooked, the work lays groundwork for detection systems that can adapt to evolving text generation technologies. The findings also underscore the importance of continuously benchmarking against state-of-the-art generators and provide a foundation for future research. The paper is available on arXiv.

By Callan Zhang · Jun 17, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL