Variational Autoencoder-Based Synthesis of Marmoset Vocalizations Using Linear Spectrograms

Du, Y; Woolgar, E; Kikuchi, Y; Ogawa, T

doi:10.1109/CyberSciTech68397.2025.00123

Lookup NU author(s): Emma Woolgar, Dr Yuki Kikuchi

Abstract

Synthetic voice generation for socially assistive robotics requires biologically validated approaches to ensure effective human-robot interaction. This paper presents a Variational Autoencoder (VAE)-based system for generating species-specific vocalizations, with behavioral validation in marmosets. Our approach processes linear spectrograms through a symmetric encoder-decoder architecture with Kullback-Leibler (KL) divergence regularization and adaptive KL annealing. The system was trained on 18 marmoset ‘twitter’ calls and validated through controlled behavioral experiments with three adult female marmosets. Generated vocalizations achieved 86.79% Mel-Frequency Cepstral Coefficient (MFCC) similarity to natural calls and had a significant main effect on two marmoset behaviors (stationary behavior: χ² = 11.47, p = 0.04; leg-stand contact behavior: χ² = 12.12, p = 0.03), although the behavioral responses differed from those seen for the equivalent natural call type. The results demonstrate the feasibility of VAE-based vocalization synthesis while highlighting the importance of biological validation for developing emotionally appropriate synthetic voices in assistive robotics applications.
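
As an illustration of the training objective the abstract describes, the sketch below combines a reconstruction term with a KL divergence term whose weight is ramped up during training. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the loss form or the adaptive annealing rule, so a squared-error reconstruction term and a simple linear warm-up are swapped in, and the names kl_divergence, kl_weight, vae_loss, warmup_steps and max_weight are hypothetical.

```python
import math


def kl_divergence(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)) for one sample."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def kl_weight(step, warmup_steps=1000, max_weight=1.0):
    """Linear warm-up of the KL weight from 0 to max_weight.

    Assumption: a stand-in for the paper's adaptive KL annealing,
    whose actual rule is not given in the abstract.
    """
    return max_weight * min(1.0, step / warmup_steps)


def vae_loss(x, x_hat, mu, log_var, step):
    """Squared-error reconstruction plus annealed KL term for one sample.

    x and x_hat are flattened spectrogram values (assumed representation).
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + kl_weight(step) * kl_divergence(mu, log_var)
```

Early in training the KL weight is near zero, so the model first learns to reconstruct spectrograms; the regularization toward the standard normal prior then strengthens as the weight approaches max_weight.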