Audio Samples From:

"Any-to-One Non-Parallel Voice Conversion System Using an Autoregressive Conversion Model and LPCNet Vocoder"

ATISP, ENET’COM, Sfax University, Tunisia; LORIA, Lorraine University, France

Authors: Kadria Ezzine, Joseph Di Martino, Mondher Frikha

Abstract: We present an any-to-one voice conversion (VC) system that uses an autoregressive model and the LPCNet vocoder, aimed at enhancing the converted speech in terms of naturalness, intelligibility, and speaker similarity. As the name implies, non-parallel any-to-one voice conversion does not require paired source and target utterances and can be employed for arbitrary conversion tasks. Recent advances in neural vocoders, such as WaveNet, have improved the efficiency of speech synthesis. In practice, however, we find that the trajectory of some generated waveforms is not smooth, which causes audible errors in the converted voice. We address this problem with the high-fidelity LPCNet vocoder, which produces more natural, clearer speech and can generate it in real time. To precisely represent the linguistic content of a given utterance, we use speaker-independent phonetic posteriorgram (SI-PPG) features computed from an automatic speech recognition (ASR) model trained on a multi-speaker corpus. A conversion model then maps the SI-PPGs to the acoustic features used as input to the LPCNet vocoder. The proposed autoregressive (AR) structure allows the system to predict each step's output from the previous step's acoustic features. We evaluate the effectiveness of our system on any-to-one conversion pairs between native English speakers. Experimental results show that the proposed method outperforms state-of-the-art systems, producing higher speech quality and greater speaker similarity.

Keywords: Voice conversion; Non-parallel data; Autoregressive model; LPCNet; Phonetic posteriorgrams.
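To make the pipeline in the abstract concrete, below is a minimal sketch (not the authors' implementation) of the autoregressive PPG-to-acoustic conversion step: each speaker-independent PPG frame is encoded, concatenated with the previous step's predicted acoustic frame, and passed through a recurrent cell that predicts the LPCNet features for the current frame. The GRU cell, layer sizes, and the 144-dim PPG size are illustrative assumptions; the 20-dim acoustic frame (18 Bark cepstral coefficients plus 2 pitch parameters) follows the public LPCNet feature layout.

# Minimal sketch (assumed architecture, not the authors' code) of the
# autoregressive PPG-to-acoustic conversion described in the abstract.
import torch
import torch.nn as nn

class ARConversionModel(nn.Module):
    def __init__(self, ppg_dim=144, acoustic_dim=20, hidden_dim=256):
        super().__init__()
        # Encode the speaker-independent PPG frame.
        self.ppg_proj = nn.Linear(ppg_dim, hidden_dim)
        # Recurrent core conditioned on the previous acoustic frame (autoregression).
        self.rnn = nn.GRUCell(hidden_dim + acoustic_dim, hidden_dim)
        # Predict the LPCNet acoustic features for the current frame.
        self.out = nn.Linear(hidden_dim, acoustic_dim)
        self.acoustic_dim = acoustic_dim

    def forward(self, ppg):                      # ppg: (batch, frames, ppg_dim)
        batch, frames, _ = ppg.shape
        h = ppg.new_zeros(batch, self.rnn.hidden_size)
        prev = ppg.new_zeros(batch, self.acoustic_dim)  # previous acoustic frame
        outputs = []
        for t in range(frames):
            x = torch.cat([torch.relu(self.ppg_proj(ppg[:, t])), prev], dim=-1)
            h = self.rnn(x, h)
            prev = self.out(h)                   # current frame from previous output
            outputs.append(prev)
        return torch.stack(outputs, dim=1)       # (batch, frames, acoustic_dim)

# Usage: SI-PPGs from an ASR model in, LPCNet acoustic features out; the waveform
# is then synthesized by the LPCNet vocoder.
ppg = torch.randn(1, 200, 144)                   # 200 frames of 144-dim SI-PPGs (assumed size)
acoustic = ARConversionModel()(ppg)              # -> (1, 200, 20) LPCNet features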


Comparison of the proposed system with state-of-the-art systems
  1. S1: Parallel VC system based on ASR- and TTS-oriented pretraining strategies using Transformer models for sequence-to-sequence VC [Huang et al., 2021]
  2. S2: Parallel VC system based on a sequence-to-sequence mapping model with an attention mechanism [Zhang et al., 2019]
  3. S3: Non-parallel VC system based on StarGAN, a variant of the GAN model [Kameoka et al., 2020]
  4. S4: Non-parallel VC system based on a jointly trained conversion model and WaveNet vocoder [Liu et al., 2019]

Any-to-one non-parallel voice conversion is performed on the American English CMU-ARCTIC database between the following speaker pairs:


  •  BDL-to-SLT 
  •  CLB-to-SLT 
  •  RMS-to-SLT 


Male1 to Female1 VC --> RMS-to-SLT
[Audio samples: Source, Target, S1, S2, S3, S4, Proposed]

Male2 to Female1 VC --> BDL-to-SLT
[Audio samples: Source, Target, S1, S2, S3, S4, Proposed]

Female2 to Female1 VC --> CLB-to-SLT
[Audio samples: Source, Target, S1, S2, S3, S4, Proposed]