Notable interpretability

Probing Cultural Awareness in LLMs: A Case Study of Cross-Culture Aesthetic Stylistics

Jiashuo Wang, Fenggang Yu, Jian Wang, Chak Tou Leong, Xiaoyu Shen, Chunpu Xu

Published
May 26, 2026 — 17:08 UTC

Problem
This preprint addresses the gap in understanding how Large Language Models (LLMs) perform in culturally nuanced contexts, specifically in the domain of aesthetic stylistics. While LLMs are widely used across various cultural settings, their proficiency in recognizing and generating stylistically rich language that resonates with specific cultural aesthetics has not been thoroughly investigated. The authors introduce a new benchmark, C4STYLI, to evaluate LLMs’ capabilities in this area, focusing on translated movie titles and advertising slogans from Hong Kong and the Chinese Mainland.

Method
The core technical contribution of this work is the C4STYLI benchmark, which consists of stylized text samples curated from two distinct cultural contexts. The evaluation framework assesses LLMs on two fronts: behavioral recognition (the ability to identify stylistic elements) and productive competence (the ability to generate stylistically appropriate text). The authors employ logistic regression probes for structural ablation to analyze the features that LLMs utilize for stylistic recognition. This method allows for a detailed examination of the linguistic attributes leveraged by LLMs, distinguishing between surface-level features and deeper stylistic structures.

Results
The evaluation reveals significant discrepancies between LLMs and human performance in stylistic recognition tasks. Specifically, LLMs demonstrate a limited ability to recognize and generate culturally resonant stylistic elements, with performance varying across text domains. For instance, in the Hong Kong context, LLMs primarily rely on surface-level linguistic cues rather than understanding the underlying stylistic structures. The authors report that LLMs achieve a recognition accuracy of approximately 60% on the C4STYLI benchmark, compared to human performance, which exceeds 80%. This indicates a substantial gap in stylistic recognition capabilities, highlighting the need for further refinement in LLM training to enhance cultural sensitivity.

Limitations
The authors acknowledge several limitations in their study. First, the C4STYLI benchmark is limited to specific cultural artifacts (movie titles and advertising slogans), which may not fully represent the breadth of aesthetic stylistics across different contexts. Additionally, the reliance on logistic regression probes may not capture the full complexity of stylistic recognition processes in LLMs. The authors also note that their findings are primarily based on LLMs trained on general corpora, which may not be optimized for culturally specific tasks. An obvious limitation not discussed is the potential for bias in the curated dataset, which could affect the generalizability of the results.

Why it matters
This research has significant implications for the development of culturally aware AI systems. By highlighting the limitations of current LLMs in recognizing and generating culturally specific stylistic elements, the study underscores the necessity for more targeted training methodologies that incorporate cultural nuances. The findings suggest that enhancing LLMs’ sensitivity to aesthetic stylistics could improve their applicability in diverse cultural contexts, ultimately leading to more effective human-AI interactions. This work paves the way for future research aimed at bridging the gap between LLM capabilities and the rich tapestry of human cultural expression.

Authors: Jiashuo Wang, Fenggang Yu, Jian Wang, Chak Tou Leong, Xiaoyu Shen, Chunpu Xu, Jiawen Duan, Wenjie Li et al.
Source: arXiv:2605.27296
URL: https://arxiv.org/abs/2605.27296v1

Turing Wire

By Turing Wire editorial staff · May 26, 2026 · Editorial standards →

Source: arXiv cs.CL