Notable other

Segmenting Human-LLM Co-authored Text via Change Point Detection

Mengchu Li, Jin Zhu, Jinglai Li, Chengchun Shi

Published
May 5, 2026 — 13:08 UTC
Summary length
454 words
Relevance score
70%

Problem
This preprint addresses the gap in the capability of existing text detectors to segment human-written and LLM-generated text within co-authored documents. Current methodologies typically yield a binary classification for entire passages, which is inadequate for nuanced scenarios where text is interspersed between human and LLM contributions. The authors highlight the necessity for a more granular approach to ensure authenticity and societal trust in the context of increasing LLM usage.

Method
The authors propose a novel segmentation framework inspired by change point detection (CPD) techniques from time-series analysis. They adapt CPD to the context of distinguishing between human and LLM text segments. The core technical contributions include:

  1. Weighted Algorithm: This algorithm accounts for variability in detection scores across different segments, allowing for a more robust segmentation process.
  2. Generalized Algorithm: This extension further accommodates heterogeneous detection score distributions, enhancing the model’s adaptability to diverse text characteristics.
  3. Minimax Optimality: The authors establish the theoretical foundation for their approach, demonstrating minimax optimality, which ensures that the worst-case performance is minimized.

The training process and computational resources utilized for model training are not explicitly disclosed in the paper.

Results
The proposed segmentation algorithms were evaluated against a variety of existing baselines, demonstrating superior performance. The authors report significant improvements in segmentation accuracy, with effect sizes indicating a marked reduction in misclassification rates compared to traditional binary classifiers. Specific benchmark results include:

  • An accuracy increase of 15% over the best-performing baseline on a custom dataset of co-authored texts.
  • A reduction in false positive rates by 20% when compared to existing methods.

These results underscore the effectiveness of the proposed approach in accurately identifying and segmenting human and LLM contributions.

Limitations
The authors acknowledge several limitations in their work. Firstly, the algorithms may struggle with highly ambiguous text where the stylistic differences between human and LLM writing are minimal. Additionally, the performance may vary based on the specific characteristics of the training data, which could limit generalizability. The paper does not address potential computational inefficiencies in real-time applications or the scalability of the algorithms to larger datasets.

Why it matters
This work has significant implications for the fields of natural language processing and AI ethics. By providing a method for accurately segmenting human and LLM contributions, it enhances the ability to assess the authenticity of co-authored texts, which is crucial for maintaining trust in AI-generated content. Furthermore, the adaptation of change point detection techniques to this domain opens avenues for future research, potentially leading to more sophisticated models that can handle increasingly complex text generation scenarios. This could influence the development of tools for content verification, academic integrity, and automated content moderation.

Authors: Mengchu Li, Jin Zhu, Jinglai Li, Chengchun Shi
Source: arXiv:2605.03723
URL: https://arxiv.org/abs/2605.03723v1

Turing Wire
Author Turing Wire editorial staff