Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings

NU author(s): Dr Kai Alter

Downloads

Full text for this publication is not currently held within this repository. Alternative links are provided below where available.


Abstract

© 2025 International Speech Communication Association. All rights reserved.

AI-generated voice clones are important tools in language learning, audiobooks, and assistive technology, but often struggle to replicate key prosodic features such as dynamic F0 variation. The impact of these differences on speech perception remains underexplored. To address this, we conducted two behavioural tasks, evaluating listeners' ratings of naturalness and similarity for human speech, three AI voice clones (ElevenLabs, StyleTTS-2, XTTS-v2), and a 30% F0 variation condition. ElevenLabs was rated comparably to human speech, while StyleTTS-2 and XTTS-v2 received lower ratings. Reduced F0 variation also led to lower ratings, suggesting that prosody is key to perceived naturalness and similarity. Listener ratings were further influenced by speaker accent and sex, but not by AI tool experience. These findings suggest that prosodic features and speaker-specific characteristics could be drivers of the varying performance of AI-voice clones.


Publication metadata

Author(s): Bakkouche L, McGhee C, Lau E, Cooper S, Luo X, Rees M, Alter K, Post B, Schwarz J

Publication type: Conference Proceedings (inc. Abstract)

Publication status: Published

Conference Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Year of Conference: 2025

Pages: 2190-2194

Acceptance date: 02/04/2025

ISSN: 2958-1796

Publisher: International Speech Communication Association

URL: https://doi.org/10.21437/Interspeech.2025-947

DOI: 10.21437/Interspeech.2025-947

