Lookup NU author(s): Dr Huizhi Liang
This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Acoustic and articulatory signals are naturally coupled and complementary. Because articulatory data are difficult to acquire and acoustic-articulatory conversion is a nonlinear, ill-posed problem, previous studies on speech emotion recognition (SER) have relied primarily on unidirectional acoustic-articulatory conversion and have ignored the potential benefits of bi-directional conversion. How to address this nonlinear ill-posedness and how to effectively extract and exploit features from both modalities for SER remain open research questions. To bridge this gap, this study proposes Bi-A2CEmo, a framework that performs bi-directional acoustic-articulatory conversion for SER. The framework comprises three components: Bi-MGAN, which addresses the nonlinear ill-posedness problem; KCLNet, which enhances the emotional attributes of the mapped features; and ResTCN-FDA, which fully exploits the emotional attributes of the features. Another challenge is the absence of a parallel acoustic-articulatory emotion database. To overcome this, the study uses electromagnetic articulography (EMA) to create STEM-E²VA, a multi-modal acoustic-articulatory emotion database for Mandarin Chinese. A comparative analysis against state-of-the-art models is then conducted to evaluate the effectiveness of the framework. Bi-A2CEmo achieves an SER accuracy of 89.04%, an improvement of 5.27% over using the actual acoustic and articulatory features recorded by EMA. Results on the STEM-E²VA dataset show that Bi-MGAN achieves higher mapping and inversion accuracy than conventional conversion networks. Visualizing the mapped features before and after enhancement reveals that KCLNet reduces intra-class spacing while increasing inter-class spacing. ResTCN-FDA also demonstrates high recognition accuracy on three publicly available datasets. The experimental results show that the proposed bi-directional acoustic-articulatory conversion framework can significantly improve SER performance.
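The abstract describes the bi-directional conversion only at a high level. As a rough illustration of the underlying idea, and not the published Bi-MGAN architecture, the following PyTorch sketch pairs two generators (acoustic to articulatory and back) and adds a cycle-consistency term, one common way to regularise an ill-posed inverse mapping; all dimensions, layer sizes, and hyperparameters are placeholders.

```python
# Minimal sketch (not the authors' Bi-MGAN): two MLP generators mapping
# acoustic features to articulatory features and back, trained with
# supervised conversion losses plus a cycle-consistency term that
# constrains the ill-posed inverse direction. Sizes are hypothetical.
import torch
import torch.nn as nn

ACOUSTIC_DIM, ARTIC_DIM, HIDDEN = 40, 12, 128  # placeholder feature sizes

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, HIDDEN), nn.ReLU(),
                         nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
                         nn.Linear(HIDDEN, d_out))

g_a2r = mlp(ACOUSTIC_DIM, ARTIC_DIM)   # acoustic -> articulatory (inversion)
g_r2a = mlp(ARTIC_DIM, ACOUSTIC_DIM)   # articulatory -> acoustic (mapping)
opt = torch.optim.Adam(list(g_a2r.parameters()) + list(g_r2a.parameters()),
                       lr=1e-3)
l1 = nn.L1Loss()

def train_step(acoustic, articulatory, lambda_cyc=10.0):
    """One optimisation step on a parallel (acoustic, articulatory) batch."""
    fake_artic = g_a2r(acoustic)
    fake_acoustic = g_r2a(articulatory)
    # Supervised conversion losses in both directions.
    loss = l1(fake_artic, articulatory) + l1(fake_acoustic, acoustic)
    # Cycle consistency: converting there and back should recover the input.
    loss = loss + lambda_cyc * (l1(g_r2a(fake_artic), acoustic) +
                                l1(g_a2r(fake_acoustic), articulatory))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example with random tensors standing in for parallel EMA/acoustic frames.
loss = train_step(torch.randn(8, ACOUSTIC_DIM), torch.randn(8, ARTIC_DIM))
```

In a GAN-based variant such as the one the paper names, adversarial discriminators and emotion-aware losses would be added on top of this basic bi-directional mapping; the sketch above only shows the conversion-with-cycle-consistency core.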
Author(s): Li H, Zhang X, Duan S, Liang H
Publication type: Article
Publication status: Published
Journal: Knowledge-Based Systems
Year: 2024
Volume: 299
Print publication date: 05/09/2024
Online publication date: 13/06/2024
Acceptance date: 11/06/2024
Date deposited: 02/07/2024
ISSN (print): 0950-7051
ISSN (electronic): 1872-7409
Publisher: Elsevier BV
URL: https://doi.org/10.1016/j.knosys.2024.112123
DOI: 10.1016/j.knosys.2024.112123
ePrints DOI: 10.57711/m885-j587
Data Access Statement: The authors do not have permission to share data.