
Acoustic Scene Classification using Bilinear Pooling on Time-Liked and Frequency-Liked Convolution Neural Network

Lookup NU author(s): Dr Xing Kek, Professor Cheng Chin


Full text for this publication is not currently held within this repository. Alternative links are provided below where available.


The current methodology for the Acoustic Scene Classification (ASC) task can be described in two steps: preprocessing the audio waveform into a log-mel spectrogram, then using it as the input representation for a Convolutional Neural Network (CNN). This paradigm shift occurred after DCASE 2016, where this framework achieved state-of-the-art results on ASC tasks [1], [2]. In this paper, we explore the use of harmonic and percussive source separation (HPSS) to split the audio into harmonic and percussive components. Next, we curate two CNNs that try to understand the harmonic and percussive audio in their ‘natural form’: one specialized in extracting deep features in a time-biased domain, and the other in a frequency-biased domain. The deep features extracted from these two CNNs are then combined using bilinear pooling, yielding a ‘two-stream’ time-and-frequency CNN architecture for classifying acoustic scenes. The model is evaluated on the DCASE 2019 Task 1a dataset and scores an average of ~65% across the development dataset and the Kaggle private and public leaderboards.
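The fusion step in the abstract can be illustrated with a minimal sketch of bilinear pooling as it is commonly defined: take the feature maps from the two CNN streams, compute the outer product of the channel vectors at each spatial location, average over locations, then apply signed square-root and L2 normalization. This is a generic NumPy sketch of the technique, not the authors' exact implementation; the stream names, channel counts, and map sizes below are illustrative assumptions (in practice the harmonic/percussive split that feeds the two streams can be done with e.g. `librosa.effects.hpss`).

```python
import numpy as np

def bilinear_pool(feat_a, feat_b):
    """Fuse two CNN feature maps of shapes (C1, H, W) and (C2, H, W)
    by averaging the per-location outer products of their channel
    vectors, then applying signed-sqrt and L2 normalization.
    Generic sketch of bilinear pooling, not the paper's exact code."""
    c1, h, w = feat_a.shape
    c2 = feat_b.shape[0]
    a = feat_a.reshape(c1, h * w)
    b = feat_b.reshape(c2, h * w)
    bilinear = (a @ b.T) / (h * w)               # (C1, C2) pooled outer product
    vec = bilinear.flatten()
    vec = np.sign(vec) * np.sqrt(np.abs(vec))    # signed square-root
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec       # L2 normalize

# Hypothetical feature maps standing in for the two streams' outputs:
rng = np.random.default_rng(0)
time_stream = rng.standard_normal((64, 8, 16))   # time-biased CNN features
freq_stream = rng.standard_normal((32, 8, 16))   # frequency-biased CNN features
fused = bilinear_pool(time_stream, freq_stream)
print(fused.shape)  # (2048,) — one C1*C2 descriptor fed to the classifier
```

The fused vector would then be passed to a fully connected classification head; the signed-sqrt and L2 steps are the standard normalizations used with bilinear features.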

Publication metadata

Author(s): Kek XY, Chin CS, Li Y

Publication type: Conference Proceedings (inc. Abstract)

Publication status: Published

Conference Name: 2019 IEEE Symposium Series on Computational Intelligence

Year of Conference: 2019

Print publication date: 01/05/2019

Online publication date: 11/04/2019

Acceptance date: 10/09/2019

ISSN: 1556-6048

Publisher: IEEE


DOI: 10.1109/MCI.2019.2901101