
Phone-to-audio alignment without text: A semi-supervised approach

Lookup NU author(s): Dr Cong Zhang


Full text for this publication is not currently held within this repository. Alternative links are provided below where available.


The task of phone-to-audio alignment has many applications in speech research. Here we introduce two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment. The proposed Wav2Vec2-FS, a semi-supervised model, directly learns phone-to-audio alignment through contrastive learning and a forward sum loss, and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The other model, Wav2Vec2-FC, is a frame classification model trained on forced-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that both proposed methods, even when transcriptions are not available, produce results close to those of existing forced alignment tools. Our work presents a neural pipeline for fully automated phone-to-audio alignment. Code and pretrained models are available at
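As a rough illustration of what text-independent segmentation with a frame classification model involves, the sketch below collapses a sequence of per-frame phone labels into timed segments. It is not the authors' code: the function name `frames_to_segments`, the 20 ms frame duration, and the example labels are all assumptions made for illustration, and the frame labels are taken as given rather than predicted by a model.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_sec=0.02):
    """Collapse per-frame phone labels into (phone, start, end) segments.

    frame_labels: one phone label per frame, e.g. the argmax of a
                  frame classifier's output (assumed, not computed here).
    frame_sec:    frame duration in seconds (0.02 is an assumed value).
    """
    segments = []
    index = 0  # index of the first frame in the current run
    for phone, run in groupby(frame_labels):
        length = len(list(run))
        start = round(index * frame_sec, 3)
        end = round((index + length) * frame_sec, 3)
        segments.append((phone, start, end))
        index += length
    return segments


if __name__ == "__main__":
    # Hypothetical frame-level predictions for a short utterance.
    frames = ["sil", "sil", "k", "k", "k", "ae", "ae", "t", "sil"]
    for phone, start, end in frames_to_segments(frames):
        print(f"{phone}\t{start:.2f}\t{end:.2f}")
```

Each run of identical labels becomes one segment whose boundaries are the frame indices scaled by the frame duration, so boundary precision in this sketch is limited to one frame.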

Publication metadata

Author(s): Zhu J, Zhang C, Jurgens D

Publication type: Conference Proceedings (inc. Abstract)

Publication status: Published

Conference Name: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Year of Conference: 2022

Pages: 8167–8171

Online publication date: 27/04/2022

Acceptance date: 27/04/2022

Publisher: IEEE


DOI: 10.1109/ICASSP43922.2022.9746112


ISBN: 9781665405416