FID-RPRGAN-VC: Fréchet Inception Distance Loss based Region-wise Position Normalized Relativistic GAN for Non-Parallel Voice Conversion

Document Type

Conference Article

Publication Title

2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023

Abstract

Voice conversion (VC) is a speech-to-speech (STS) synthesis process that converts the vocal identity of a source speaker to that of a target speaker while keeping the linguistic content unaltered. In recent years, VC has been explored using generative adversarial network (GAN) models. However, in terms of naturalness, a substantial gap remains between real speech and the samples generated by state-of-the-art (SOTA) VC models. This work proposes an improved GAN model for non-parallel VC to enhance the naturalness of the generated speech samples. The improved model integrates a region-wise positional normalization technique in the generator, a relativistic mechanism-based discriminator, and a Fréchet inception distance (FID) based loss function. We tested the proposed model on VCC 2018, CMU Arctic, and a dysarthric speech dataset. The experimental results revealed the superiority of the proposed FID-RPRGAN-VC model over the SOTA MaskCycleGAN-VC model.

First Page

350

Last Page

356

DOI

10.1109/APSIPAASC58517.2023.10317438

Publication Date

1-1-2023
