Lookup NU author(s): Dr Yi Li, Dr Mohsen Naqvi
This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Recently, the Transformer has shown the potential to exploit long-range sequence dependencies in speech through self-attention. It has been introduced in single-channel speech enhancement to improve the accuracy of speech estimation from a noisy mixture. However, the amount of information represented across attention heads is often large, which increases computational complexity. To address this issue, axial attention has been proposed, which splits a 2-D attention into two 1-D attentions. In this paper, we develop a new method for speech enhancement by leveraging axial attention, where we generate time and frequency sub-attention maps by calculating the attention map along the time and frequency axes. Different from conventional axial attention, the proposed method provides two parallel multi-head attentions for the time and frequency axes, respectively. Moreover, frequency-band aware attention is proposed, i.e., high frequency-band attention (HFA) and low frequency-band attention (LFA), which facilitates the exploitation of the information related to speech and noise in different frequency bands of the noisy mixture. To re-use high-resolution feature maps from the encoder, we design a U-shaped Transformer, which helps recover lost information from the high-level representations to further improve the speech estimation accuracy. Extensive experiments on four public datasets demonstrate the efficacy of the proposed method.
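The idea of splitting a 2-D attention into two 1-D attentions that run in parallel along the time and frequency axes can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single attention head, the identity query/key/value projections, and the summation used to fuse the two branches are all assumptions made to keep the example short. It only shows how one attention map per axis is formed and why this needs fewer attention scores than attending over every time-frequency position jointly.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_along_axis(x, axis):
    """Single-head self-attention over one axis of x with shape (T, F, C).

    axis=0 attends over time (one sequence per frequency bin);
    axis=1 attends over frequency (one sequence per time frame).
    Assumption: queries, keys and values are the input itself (no learned weights).
    """
    # Arrange the data as (batch, length, channels), where `length` is the attended axis.
    seq = np.swapaxes(x, 0, 1) if axis == 0 else x
    scores = seq @ np.swapaxes(seq, 1, 2) / np.sqrt(seq.shape[-1])
    weights = softmax(scores, axis=-1)  # (batch, length, length) sub-attention map
    out = weights @ seq
    # Restore the original (T, F, C) layout.
    out = np.swapaxes(out, 0, 1) if axis == 0 else out
    return out, weights


def parallel_axial_attention(x):
    """Run time-axis and frequency-axis attention in parallel on the same input.

    Assumption: the two branch outputs are fused by simple addition.
    """
    t_out, t_map = attention_along_axis(x, axis=0)
    f_out, f_map = attention_along_axis(x, axis=1)
    return t_out + f_out, t_map, f_map


# Toy feature map: T=6 frames, F=4 frequency bins, C=8 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4, 8))
out, t_map, f_map = parallel_axial_attention(x)
print(out.shape, t_map.shape, f_map.shape)  # (6, 4, 8) (4, 6, 6) (6, 4, 4)
```

For this toy input, the two axial maps hold 4·6·6 + 6·4·4 = 240 scores in total, whereas a full 2-D attention over all 24 time-frequency positions would need 24² = 576.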
Author(s): Li Y, Sun Y, Wang W, Naqvi SM
Publication type: Article
Publication status: Published
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Online publication date: 12/04/2023
Acceptance date: 03/04/2023
Date deposited: 05/04/2023
ISSN (print): 2329-9290
ISSN (electronic): 2329-9304
ePrints DOI: 10.57711/e02q-0735