Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue

Olivier Tieleman, Ziyi Zhu, Ting Su, Samuel J. Bell, Thomas D. Hull, Caitlin A. Stamatis

Published: Jun 16, 2026 — 14:28 UTC

Problem
The paper addresses low completion rates for validated depression assessment tools such as the Patient Health Questionnaire-9 (PHQ-9), which lead to response bias and systematic missingness in clinical data. The authors propose estimating depression severity passively from AI mental health dialogue transcripts rather than from completed questionnaires, an approach they position as filling a gap in the literature on using conversational data for mental health assessment. The work is a preprint and has not undergone peer review.

Method
The authors fine-tune a Qwen3.5-27B model with a regression head that predicts PHQ-9 total scores directly from conversation transcripts. They augment 3,111 ground-truth labels with pseudolabels generated by a reasoning model (Claude Opus), giving a combined dataset of 6,283 users. Training is iterative, with intermediate models trained along the way to improve performance. Evaluation uses a held-out test set of 842 users and reports Mean Absolute Error (MAE), Root Mean Square Error (RMSE), the Pearson correlation coefficient (r), and Area Under the Curve (AUC) at several clinical thresholds.
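
For readers unfamiliar with the regression metrics named above, the following is a minimal Python sketch of how MAE, RMSE, and Pearson r are conventionally computed. It is not the authors' evaluation code. The function names and the toy scores are invented for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: penalises large errors more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(y_true, y_pred):
    """Pearson correlation between true and predicted scores."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Hypothetical PHQ-9 totals (0 to 27) and model predictions; not data from the paper.
true_scores = [2, 5, 9, 12, 17, 22]
pred_scores = [3.0, 4.0, 11.0, 10.0, 18.0, 20.0]

print("MAE:", mae(true_scores, pred_scores))
print("RMSE:", rmse(true_scores, pred_scores))
print("Pearson r:", pearson_r(true_scores, pred_scores))
```

MAE and RMSE are expressed in PHQ-9 points, so they can be read directly against the questionnaire's scale, while Pearson r only measures how well predictions track the ordering and spread of true scores.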

Results
The best-performing model achieves an MAE of 2.6 and an RMSE of 4.0 on the 0 to 27 PHQ-9 scale, with a Pearson correlation coefficient of 0.80 and an AUC of 0.91 at the clinical threshold of PHQ-9 ≥10. AUC exceeds 0.87 at every severity threshold from PHQ-9 ≥3 to PHQ-9 ≥24, which suggests the model discriminates well across the full range of depression severity.
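
The threshold-wise AUC can be illustrated with a small Python sketch. It assumes AUC means area under the ROC curve, computed by binarising the true PHQ-9 score at a cut-off and using the predicted score as the ranking signal; the summary does not spell this out. The function name and the toy data are hypothetical and do not come from the paper.

```python
def auc_at_threshold(y_true, y_pred, threshold):
    """ROC AUC for the binary task 'true PHQ-9 >= threshold'.

    Uses the rank (Mann-Whitney) formulation: the fraction of
    (positive, negative) pairs in which the positive case received
    the higher predicted score, with ties counted as half.
    Returns None when one of the two classes is empty.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    if not pos or not neg:
        return None
    wins = 0.0
    for a in pos:
        for b in neg:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical PHQ-9 totals and model predictions; not data from the paper.
true_scores = [2, 5, 9, 12, 17, 22]
pred_scores = [3.0, 4.0, 11.0, 10.0, 18.0, 20.0]

# Sweep the same range of cut-offs the summary mentions (>=3 through >=24).
for cutoff in range(3, 25):
    print(cutoff, auc_at_threshold(true_scores, pred_scores, cutoff))
```

On this toy data the ≥10 cut-off gives 8/9, roughly 0.89, and a cut-off with no positive cases returns None rather than a misleading number. With a real test set, very high cut-offs tend to have few positive cases, so AUC estimates there are noisier.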

Limitations
The authors acknowledge that relying on AI dialogue transcripts may introduce biases tied to the conversational context, which could limit the model’s generalizability. They also note that performance depends on the quality of the dialogue data and the representativeness of the training set. Beyond the authors’ own caveats, the findings have not been peer reviewed, and the paper does not explicitly address whether augmenting the training data with pseudolabels increases the risk of overfitting.

Why it matters
If the results hold up, the approach could support continuous, passive assessment of depression severity without requiring users to complete self-report questionnaires. That could improve early detection and intervention in mental health care, particularly in AI-driven applications. The paper adds to the growing literature on using conversational AI for health assessment and is available as a preprint on arXiv (cs.CL).

By Callan Zhang · Jun 16, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL