Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs
Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao
Published: May 21, 2026, 15:57 UTC
Problem
This preprint examines whether large language models (LLMs) can detect AI-generated modern Chinese poetry. Previous studies have found that LLMs perform poorly as detectors of generated text in general, but none has looked specifically at modern Chinese poetry and its distinctive characteristics. The authors evaluate LLM detectors on this genre and propose a method that supplements the poem's text with image semantics to improve detection.
Method
The core technical contribution is an image-semantic guided poetry detection method. Each poem is paired with an image that reflects its thematic and emotional content, and a multi-modal framework processes the image alongside the text. The image is meant to supply complementary information about the poem's meaning, imagery, and emotional resonance that the text alone may not make explicit. The authors test several LLM architectures, but specific details on the architectures and training compute are not disclosed. The training dataset consists of AI-generated modern Chinese poetry paired with relevant images, so the model can learn from both modalities.
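To make the image-plus-text setup concrete, here is a minimal sketch of how a poem and its paired image might be sent to a multimodal model for a human-versus-AI decision. It is an illustration only: the preprint's actual prompt, label set, and model interface are not described here, so `PoemSample`, `build_prompt`, `detect`, the `LABELS` tuple, and the `mllm` callable are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical label set; the paper's exact labels are not stated in this summary.
LABELS = ("human", "ai")


@dataclass
class PoemSample:
    text: str     # the poem as plain text
    image: bytes  # raw bytes of an image reflecting the poem's theme and mood


def build_prompt(poem: str) -> str:
    """Assemble an illustrative instruction (not the authors' actual prompt)."""
    return (
        "You are given a modern Chinese poem and an image that reflects its "
        "theme and emotion. Using both, decide whether the poem was written "
        "by a human or generated by AI. Answer with exactly one word: "
        "'human' or 'ai'.\n\nPoem:\n" + poem
    )


def detect(sample: PoemSample, mllm: Callable[[str, bytes], str]) -> str:
    """Query a multimodal model with text plus image and normalize its reply.

    `mllm` is a placeholder for any function that sends a prompt and an image
    to a multimodal model and returns the model's text answer.
    """
    reply = mllm(build_prompt(sample.text), sample.image).strip().lower()
    for label in LABELS:
        if reply.startswith(label):
            return label
    return "unknown"  # the reply did not match an expected label
```

Passing the model as a callable keeps the sketch independent of any particular vendor SDK; a text-only baseline could reuse the same function by ignoring the image argument.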
Results
The proposed Gemini-based detector achieves a Macro-F1 score of 85.65%, outperforming baseline detectors that rely solely on plain text, including RoBERTa, the state-of-the-art traditional detector. The improvement holds across multiple datasets generated by LLMs. The authors report quantitative comparisons against these baselines, which they present as evidence that adding image semantics improves detection accuracy.
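For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so the human and AI classes count equally regardless of how many samples each has. The following is a small, dependency-free sketch of that calculation; the function name `macro_f1` and the toy labels are illustrative and are not taken from the paper.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Return the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    # Score every label that appears in either the truth or the predictions.
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: one AI-written poem is misclassified as human.
truth = ["ai", "ai", "human", "human"]
preds = ["ai", "human", "human", "human"]
score = macro_f1(truth, preds)  # F1(ai) = 2/3, F1(human) = 4/5
```

In the toy example the two class scores average to roughly 0.73, which shows how a single error on one class pulls the macro average down even when the other class is handled well.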
Limitations
The authors acknowledge several limitations. Potential biases in the training data may affect how well the model generalizes. The reliance on image-text pairs may limit the method's applicability to poetry without a strong visual component. The study does not test whether the method scales to other forms of poetry or to other languages, which could be an important area for future research. Finally, the computational resources required to train multi-modal models may be a barrier to broader adoption.
Why it matters
This work is relevant to AI-generated content detection, particularly for creative writing and poetry. By showing that image semantics can improve LLM-based detection, the authors point toward detection methods that account for the imagery and emotion of artistic expression, not just surface text features. The approach could motivate further research into multi-modal detection systems in other domains. The findings also add to the ongoing discussion of LLMs' limitations in detection tasks and suggest one way to improve their performance.
Authors: Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo et al.
Source: arXiv:2605.22654
URL: https://arxiv.org/abs/2605.22654v1
By Turing Wire editorial staff · May 21, 2026
Source: arXiv cs.CL