Voice Conversion Using Feature Specific Loss Function Based Self-Attentive Generative Adversarial Network

Document Type

Conference Article

Publication Title

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Abstract

Voice conversion (VC) is the process of converting the vocal texture of a source speaker to resemble that of a target speaker without altering the content of the source speaker's speech. With the ongoing development of deep generative models, generative adversarial networks (GANs) have emerged as a better alternative to conventional statistical models for VC. However, speech samples generated by existing VC models remain substantially dissimilar from their natural human counterparts. Therefore, this work proposes a GAN-based VC model that incorporates a self-attention (SA) mechanism in the generator network to capture the formant distribution of the target mel-spectrogram efficiently. Moreover, the modulation spectra distance (MSD) is incorporated as a feature-specific loss to achieve high speaker similarity. The proposed model has been tested on the CMU Arctic and VCC 2018 datasets. Based on objective and subjective evaluations, we observe that the proposed feature-specific loss-based self-attentive GAN (FLSGAN-VC) model performs significantly better than the state-of-the-art (SOTA) MelGAN-VC model.
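As a rough illustration of the modulation-spectra idea mentioned in the abstract (not the paper's exact formulation; the function names, log-magnitude choice, and mean-squared distance are assumptions for the sketch), a modulation spectrum can be obtained by taking an FFT along the time axis of each mel band, and a distance between generated and target spectrograms can then be computed over those spectra:

```python
import numpy as np


def modulation_spectrum(mel_spec: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Magnitude of the FFT taken along the time axis of each mel band.

    mel_spec: array of shape (n_mels, n_frames), e.g. a log-mel spectrogram.
    Returns an array of shape (n_mels, n_frames // 2 + 1).
    """
    return np.abs(np.fft.rfft(mel_spec, axis=1)) + eps  # eps avoids log(0) later


def msd_loss(generated: np.ndarray, target: np.ndarray) -> float:
    """Mean squared distance between log modulation spectra (illustrative only)."""
    ms_gen = np.log(modulation_spectrum(generated))
    ms_tgt = np.log(modulation_spectrum(target))
    return float(np.mean((ms_gen - ms_tgt) ** 2))
```

In a GAN training loop, a term of this kind would typically be added to the adversarial objective with a weighting coefficient, so the generator is penalized when the temporal modulation structure of its output drifts from that of the target.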

DOI

10.1109/ICASSP49357.2023.10095069

Publication Date

1-1-2023
