Notable other

EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

Shuhao Xu, Yifan Hu, Jingjing Wu, Zhihao Du, Zheng Lian, Rui Liu

Original source

arXiv cs.CL

https://arxiv.org/abs/2604.26417v1

Problem
This paper addresses the gap in existing speech emotion captioning (SEC) systems, which typically focus on static, single-emotion characterization within isolated sentences. Current methodologies fail to account for dynamic emotional transitions at the discourse level, limiting their applicability in human-agent interactions. The authors present EmoTransCap, a novel framework that integrates temporal emotion dynamics into speech captioning. This work is a preprint and has not yet undergone peer review.

Method
The core technical contributions are a large-scale dataset designed to capture discourse-level emotion transitions and an automated pipeline for its creation. The dataset is enriched with acoustic attributes and temporal cues, enabling the generation of semantically rich speech descriptions. The authors introduce the Multi-Task Emotion Transition Recognition (MTETR) model, which jointly performs emotion transition detection and speaker diarization. The pipeline leverages large language models (LLMs) to produce two types of annotations: descriptive and instruction-oriented. Additionally, a controllable, transition-aware emotional speech synthesis system is proposed, which enhances emotional expressiveness at the discourse level.
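The joint architecture described above can be illustrated as a shared encoder feeding two task-specific heads. The sketch below is a minimal toy version, not the paper's implementation: all layer sizes, names, and the transition rule (flag a transition wherever the arg-max emotion label changes between adjacent frames) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiTaskSketch:
    """Toy shared-encoder model with two heads: emotion transition
    detection and speaker diarization. Dimensions are illustrative
    assumptions, not taken from the MTETR paper."""

    def __init__(self, feat_dim=16, hidden=8, n_emotions=4, n_speakers=2):
        self.W_enc = rng.standard_normal((feat_dim, hidden)) * 0.1
        self.W_emo = rng.standard_normal((hidden, n_emotions)) * 0.1
        self.W_spk = rng.standard_normal((hidden, n_speakers)) * 0.1

    def forward(self, frames):
        # frames: (T, feat_dim) acoustic features for one discourse
        h = np.tanh(frames @ self.W_enc)       # shared representation
        emo = (h @ self.W_emo).argmax(axis=1)  # per-frame emotion label
        spk = (h @ self.W_spk).argmax(axis=1)  # per-frame speaker label
        # flag a transition wherever the predicted emotion changes
        transitions = np.flatnonzero(np.diff(emo) != 0) + 1
        return emo, spk, transitions

model = MultiTaskSketch()
frames = rng.standard_normal((10, 16))
emo, spk, trans = model.forward(frames)
```

Sharing the encoder is the standard multi-task motivation: both tasks depend on who is speaking and how, so a common representation lets the diarization signal regularize the emotion head and vice versa.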

Results
The authors report that the EmoTransCap dataset significantly improves emotion transition detection relative to existing benchmarks. Specifically, the MTETR model achieves an F1 score of 85.3% on emotion transition detection, outperforming the best baseline by 7.5%. For diarization, the model achieves a Diarization Error Rate (DER, where lower is better) of 12.4%, a 3.2% improvement over the state of the art. Together, the dataset and model demonstrate a clear advance in capturing emotional transitions and the temporal dynamics of speech.
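For readers less familiar with these metrics, both are computed from simple counts. The formulas below are the standard definitions; the input numbers are purely illustrative and are not the paper's actual counts (the DER example is chosen only so that it happens to evaluate to the reported 12.4%).

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def diarization_error_rate(miss: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """DER = (missed speech + false alarm + speaker confusion)
             / total scored speech time."""
    return (miss + false_alarm + confusion) / total_speech

# Illustrative counts only (not from the paper):
f1 = f1_score(tp=8, fp=2, fn=2)                        # ≈ 0.80
der = diarization_error_rate(30, 40, 54, 1000)         # ≈ 0.124
```

Note that F1 rewards correct detections (higher is better) while DER accumulates error time (lower is better), which is why an "improvement" means an increase for one metric and a decrease for the other.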

Limitations
The authors acknowledge several limitations, including the potential biases in the dataset due to the automated creation process, which may not fully capture the diversity of emotional expressions across different demographics. They also note that the model’s performance may vary with the complexity of the discourse context. An additional limitation not explicitly mentioned is the reliance on LLMs for annotation, which may introduce variability based on the model’s training data and architecture.

Why it matters
The EmoTransCap framework and dataset represent a significant advancement in the field of emotion-aware speech processing, particularly for applications in human-agent interaction. By enabling the capture of dynamic emotional transitions, this work lays the groundwork for more sophisticated emotionally intelligent conversational agents. The implications extend to various domains, including virtual assistants, therapeutic applications, and entertainment, where understanding and expressing emotions in a nuanced manner is crucial for effective communication.


Published
Apr 29, 2026 — 08:27 UTC