
Open Access

Speech emotion recognition based on bi-directional acoustic–articulatory conversion

Lookup NU author(s): Dr Huizhi Liang

Licence

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).


Abstract

Acoustic and articulatory signals are naturally coupled and complementary. The difficulty of acquiring articulatory data and the nonlinear ill-posedness of acoustic–articulatory conversion have meant that previous studies on speech emotion recognition (SER) relied primarily on unidirectional acoustic–articulatory conversion and ignored the potential benefits of bi-directional conversion. Addressing the nonlinear ill-posedness problem and effectively extracting and utilizing the features of both modalities for SER remain open research questions. To bridge this gap, this study proposes the Bi-A2CEmo framework, which performs bi-directional acoustic–articulatory conversion for SER. The framework comprises three components: Bi-MGAN, which addresses the nonlinear ill-posedness problem; KCLNet, which enhances the emotional attributes of the mapped features; and ResTCN-FDA, which fully exploits the emotional attributes of the features. A further challenge is the absence of a parallel acoustic–articulatory emotion database. To overcome this, the study uses electromagnetic articulography (EMA) to create STEM-E²VA, a multi-modal acoustic–articulatory emotion database for Mandarin Chinese. The proposed method is then compared with state-of-the-art models to evaluate the effectiveness of the framework. Bi-A2CEmo achieves an accuracy of 89.04% in SER, an improvement of 5.27% over using the actual acoustic and articulatory features recorded by EMA. Results on the STEM-E²VA dataset show that Bi-MGAN achieves higher accuracy in mapping and inversion than conventional conversion networks. Visualization of the mapped features before and after enhancement reveals that KCLNet reduces the intra-class spacing while increasing the inter-class spacing of the features. ResTCN-FDA demonstrates high recognition accuracy on three publicly available datasets. The experimental results show that the proposed bi-directional acoustic–articulatory conversion framework can significantly improve SER performance.
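To make the idea of bi-directional conversion concrete, the sketch below trains two small mapping networks jointly (acoustic to articulatory, and articulatory back to acoustic) with a cycle-consistency term that constrains the ill-posed inverse direction. This is an illustrative PyTorch toy under assumed settings, not the paper's Bi-MGAN, KCLNet, or ResTCN-FDA: the feature dimensions (40 acoustic, 18 articulatory), the MLP mappers, and the L1 reconstruction-plus-cycle loss are hypothetical choices made only for demonstration.

# Illustrative sketch only: minimal bi-directional acoustic<->articulatory mapping
# with a cycle-consistency term. Module names, dimensions, and losses are assumed
# and do NOT reproduce the Bi-MGAN / KCLNet / ResTCN-FDA components of the paper.
import torch
import torch.nn as nn


def mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """Small fully connected mapper (placeholder for a conversion network)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class BiDirectionalConverter(nn.Module):
    """Two mappers trained jointly: acoustic->articulatory and articulatory->acoustic."""

    def __init__(self, acoustic_dim: int = 40, articulatory_dim: int = 18):
        super().__init__()
        self.a2m = mlp(acoustic_dim, articulatory_dim)   # acoustic -> articulatory
        self.m2a = mlp(articulatory_dim, acoustic_dim)   # articulatory -> acoustic

    def forward(self, acoustic: torch.Tensor, articulatory: torch.Tensor):
        art_hat = self.a2m(acoustic)       # mapped articulatory features
        aco_hat = self.m2a(articulatory)   # inverted acoustic features
        # Cycle terms constrain the ill-posed inverse mapping.
        aco_cycle = self.m2a(art_hat)
        art_cycle = self.a2m(aco_hat)
        return art_hat, aco_hat, aco_cycle, art_cycle


def conversion_loss(model, acoustic, articulatory, lam: float = 1.0) -> torch.Tensor:
    """Supervised reconstruction plus cycle-consistency loss (one possible formulation)."""
    art_hat, aco_hat, aco_cycle, art_cycle = model(acoustic, articulatory)
    l1 = nn.functional.l1_loss
    recon = l1(art_hat, articulatory) + l1(aco_hat, acoustic)
    cycle = l1(aco_cycle, acoustic) + l1(art_cycle, articulatory)
    return recon + lam * cycle


if __name__ == "__main__":
    model = BiDirectionalConverter()
    aco = torch.randn(8, 40)   # e.g. frame-level acoustic features (dimension assumed)
    art = torch.randn(8, 18)   # e.g. EMA sensor trajectories (dimension assumed)
    loss = conversion_loss(model, aco, art)
    loss.backward()
    print(f"toy conversion loss: {loss.item():.4f}")

In the paper itself the conversion networks are adversarial (Bi-MGAN) and the mapped features are further enhanced (KCLNet) before classification (ResTCN-FDA); the toy above only illustrates the bi-directional mapping and the cycle constraint that motivate the framework.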


Publication metadata

Author(s): Li H, Zhang X, Duan S, Liang H

Publication type: Article

Publication status: Published

Journal: Knowledge-Based Systems

Year: 2024

Volume: 299

Print publication date: 05/09/2024

Online publication date: 13/06/2024

Acceptance date: 11/06/2024

Date deposited: 02/07/2024

ISSN (print): 0950-7051

ISSN (electronic): 1872-7409

Publisher: Elsevier BV

URL: https://doi.org/10.1016/j.knosys.2024.112123

DOI: 10.1016/j.knosys.2024.112123

ePrints DOI: 10.57711/m885-j587

Data Access Statement: The authors do not have permission to share data.




Funding

Funder reference    Funder name
12004275            National Natural Science Foundation of China
62271342            Youth Fund of the National Natural Science Foundation of China
