FID-RPRGAN-VC: Fréchet Inception Distance Loss based Region-wise Position Normalized Relativistic GAN for Non-Parallel Voice Conversion
Document Type
Conference Article
Publication Title
2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
Abstract
Voice conversion (VC) is a speech-to-speech (STS) synthesis process that converts the vocal identity of a source speaker to that of a target speaker while keeping the linguistic content unaltered. In recent years, VC has been widely studied using generative adversarial network (GAN) models. However, a substantial gap in naturalness remains between real speech and the samples generated by state-of-the-art (SOTA) VC models. This work proposes an improved GAN model for non-parallel VC to enhance the naturalness of the generated speech samples. The model integrates a region-wise positional normalization technique in the generator, a relativistic mechanism-based discriminator, and a Fréchet inception distance (FID) based loss function. We tested the proposed model on the VCC 2018 and CMU Arctic datasets and on a dysarthric speech dataset. The experimental results show that the proposed FID-RPRGAN-VC model outperforms the SOTA MaskCycleGAN-VC model.
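The abstract does not state how the FID-based loss is built, so the following is only a minimal sketch of the underlying quantity. The Fréchet distance between two Gaussians is ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^(1/2)). The sketch assumes diagonal covariances, a simplification under which the trace term reduces to the sum of (sigma_r - sigma_f)^2 over feature dimensions. The function name frechet_distance_diag and the toy feature lists are hypothetical and are not taken from the paper; standard FID uses full covariances of Inception-network features.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(real_feats, fake_feats):
    """Squared Frechet distance between two Gaussians fitted per dimension.

    Assumption (not from the paper): covariances are diagonal, so the
    trace term collapses to a sum of squared standard-deviation gaps.
    Each argument is a list of equal-length feature vectors.
    """
    dims = len(real_feats[0])
    total = 0.0
    for d in range(dims):
        real_col = [row[d] for row in real_feats]
        fake_col = [row[d] for row in fake_feats]
        mu_r, mu_f = fmean(real_col), fmean(fake_col)
        # Population standard deviation of each feature dimension.
        sd_r = math.sqrt(pvariance(real_col, mu_r))
        sd_f = math.sqrt(pvariance(fake_col, mu_f))
        total += (mu_r - mu_f) ** 2 + (sd_r - sd_f) ** 2
    return total


# Toy features: the "generated" set is the "real" set shifted by 1 per dimension.
real = [[0.0, 1.0], [2.0, 3.0]]
fake = [[1.0, 2.0], [3.0, 4.0]]
same = frechet_distance_diag(real, real)
shifted = frechet_distance_diag(real, fake)
```

In this toy case the distance is zero for identical feature sets and grows as the generated features drift from the real ones, which is the property a distance of this kind would contribute when used as a training penalty.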
First Page
350
Last Page
356
DOI
10.1109/APSIPAASC58517.2023.10317438
Publication Date
1-1-2023
Recommended Citation
Dhar, Sandipan; Akhter, Tousin; Banerjee, Padmanabha; Jana, Nanda Dulal; and Das, Swagatam, "FID-RPRGAN-VC: Fréchet Inception Distance Loss based Region-wise Position Normalized Relativistic GAN for Non-Parallel Voice Conversion" (2023). Conference Articles. 542.
https://digitalcommons.isical.ac.in/conf-articles/542