Voice Conversion Using Feature Specific Loss Function Based Self-Attentive Generative Adversarial Network
Document Type
Conference Article
Publication Title
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Abstract
Voice conversion (VC) is the process of converting the vocal texture of a source speaker to resemble that of a target speaker without altering the content of the source speaker's speech. With the ongoing development of deep generative models, generative adversarial networks (GANs) have emerged as a better alternative to conventional statistical models for VC. However, speech samples generated by existing VC models remain substantially dissimilar from their corresponding natural human speech. Therefore, this work proposes a GAN-based VC model that incorporates a self-attention (SA) mechanism in the generator network to capture the formant distribution of the target mel-spectrogram efficiently. Moreover, the modulation spectra distance (MSD) is incorporated as a feature-specific loss to achieve high speaker similarity. The proposed model has been evaluated on the CMU Arctic and VCC 2018 datasets. Based on objective and subjective evaluations, we observe that the proposed feature-specific loss-based self-attentive GAN (FLSGAN-VC) model performs significantly better than the state-of-the-art (SOTA) MelGAN-VC model.
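To illustrate how a feature-specific loss of the kind described in the abstract might look in practice, the following is a minimal PyTorch sketch of a modulation spectra distance between mel-spectrograms. The function name, tensor shapes, the log-magnitude/L1 formulation, and the weighting hyperparameter are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def modulation_spectra_distance(converted_mel, target_mel, eps=1e-8):
    """Hypothetical modulation spectra distance (MSD) term.

    Both inputs are mel-spectrograms shaped (batch, n_mels, n_frames).
    The modulation spectrum of each mel band is taken here as the
    magnitude of the FFT of that band's trajectory over time; the loss
    is the mean absolute difference between the two log-modulation spectra.
    """
    # FFT along the time (frame) axis of each mel band
    conv_ms = torch.abs(torch.fft.rfft(converted_mel, dim=-1))
    targ_ms = torch.abs(torch.fft.rfft(target_mel, dim=-1))
    # Compare in the log domain for numerical stability
    return F.l1_loss(torch.log(conv_ms + eps), torch.log(targ_ms + eps))

# Illustrative use inside a generator objective, with lambda_msd as a
# placeholder weight (not a value reported in the paper):
# g_loss = adversarial_loss + lambda_msd * modulation_spectra_distance(fake_mel, real_mel)
```

In such a setup the MSD term would be added to the generator's adversarial objective, encouraging the converted mel-spectrogram's temporal modulation characteristics to match those of the target speaker.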
DOI
10.1109/ICASSP49357.2023.10095069
Publication Date
1-1-2023
Recommended Citation
Dhar, Sandipan; Banerjee, Padmanabha; Jana, Nanda Dulal; and Das, Swagatam, "Voice Conversion Using Feature Specific Loss Function Based Self-Attentive Generative Adversarial Network" (2023). Conference Articles. 552.
https://digitalcommons.isical.ac.in/conf-articles/552