Lookup NU author(s): Dr Kai Alter
Full text for this publication is not currently held within this repository. Alternative links are provided below where available.
© 2025 International Speech Communication Association. All rights reserved.
AI-generated voice clones are important tools in language learning, audiobooks, and assistive technology, but often struggle to replicate key prosodic features such as dynamic F0 variation. The impact of these differences on speech perception remains underexplored. To address this, we conducted two behavioural tasks, evaluating listeners' ratings of naturalness and similarity for human speech, three AI voice clones (ElevenLabs, StyleTTS-2, XTTS-v2), and a 30% F0 variation condition. ElevenLabs was rated comparably to human speech, while StyleTTS-2 and XTTS-v2 received lower ratings. Reduced F0 variation also led to lower ratings, suggesting that prosody is key to perceived naturalness and similarity. Listener ratings were further influenced by speaker accent and sex, but not by AI tool experience. These findings suggest that prosodic features and speaker-specific characteristics could be drivers of the varying performance of AI voice clones.
Author(s): Bakkouche L, McGhee C, Lau E, Cooper S, Luo X, Rees M, Alter K, Post B, Schwarz J
Publication type: Conference Proceedings (inc. Abstract)
Publication status: Published
Conference Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Year of Conference: 2025
Pages: 2190-2194
Acceptance date: 02/04/2025
ISSN: 2958-1796
Publisher: International Speech Communication Association
URL: https://doi.org/10.21437/Interspeech.2025-947
DOI: 10.21437/Interspeech.2025-947