Audio Samples From: "Any-to-One Non-Parallel Voice Conversion System Using an Autoregressive Conversion Model and LPCNet Vocoder"
ATISP, ENET'COM, Sfax University, Tunisia; LORIA, Lorraine University, France
Authors: Kadria Ezzine, Joseph Di Martino, Mondher Frikha
Abstract: We present an any-to-one voice conversion (VC) system based on an autoregressive conversion model and the LPCNet vocoder, aimed at improving the naturalness, intelligibility, and speaker similarity of the converted speech. As the name implies, non-parallel any-to-one voice conversion does not require paired source and target utterances and can therefore be applied to arbitrary source speech. Recent neural vocoders such as WaveNet have improved the efficiency of speech synthesis; in practice, however, the trajectories of some generated waveforms are not smooth, which causes audible errors in the converted voice. We address this problem with the high-fidelity LPCNet vocoder, which produces more natural and clearer speech and can generate speech in real time. To precisely represent the linguistic content of a given utterance, we use speaker-independent phonetic posteriorgram (SI-PPG) features computed from an automatic speech recognition (ASR) model trained on a multi-speaker corpus. A conversion model then maps the SI-PPG sequence to the acoustic representations used as input features for the LPCNet vocoder. The proposed autoregressive (AR) structure enables the system to predict each step's output from the previous step's acoustic features. We evaluate the effectiveness of our system on any-to-one conversion pairs between native English speakers. Experimental results show that the proposed method outperforms state-of-the-art systems, producing higher speech quality and greater speaker similarity.
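To make the conversion step concrete, below is a minimal sketch of an autoregressive conversion model that maps frame-level SI-PPG features to vocoder acoustic features, feeding each predicted frame back into the next step. This is an illustration only, not the authors' exact architecture: the layer choices, the feature dimensions (e.g. a 20-dimensional LPCNet-style acoustic vector), and the name `ARConversionModel` are assumptions.

```python
# Illustrative sketch (assumed architecture, not the paper's exact model):
# an autoregressive mapping from SI-PPG frames to acoustic frames for an
# LPCNet-style vocoder. Dimensions below are placeholder assumptions.
import torch
import torch.nn as nn


class ARConversionModel(nn.Module):
    """Maps SI-PPG frames to vocoder acoustic frames autoregressively."""

    def __init__(self, ppg_dim=144, acoustic_dim=20, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(ppg_dim, hidden_dim), nn.Tanh())
        # GRU input: encoded PPG frame + acoustic frame from the previous step.
        self.gru = nn.GRU(hidden_dim + acoustic_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, ppg, prev_acoustic=None):
        # ppg: (B, T, ppg_dim)
        # prev_acoustic: (B, T, acoustic_dim), the ground-truth acoustic
        # features shifted by one frame (teacher forcing during training).
        enc = self.encoder(ppg)
        if prev_acoustic is not None:
            out, _ = self.gru(torch.cat([enc, prev_acoustic], dim=-1))
            return self.proj(out)
        # Inference: feed the model's own previous output back, frame by frame.
        batch, steps, _ = enc.shape
        prev = enc.new_zeros(batch, 1, self.proj.out_features)
        state, frames = None, []
        for t in range(steps):
            out, state = self.gru(torch.cat([enc[:, t:t + 1], prev], dim=-1), state)
            prev = self.proj(out)
            frames.append(prev)
        return torch.cat(frames, dim=1)


# Example: convert a 300-frame utterance from an arbitrary source speaker.
model = ARConversionModel()
ppg = torch.rand(1, 300, 144)   # SI-PPGs from a pretrained ASR model (dummy data)
acoustic = model(ppg)           # (1, 300, 20) acoustic features for the vocoder
```

During training the previous acoustic frame comes from the ground truth (teacher forcing); at inference the model consumes its own previous prediction, which is what makes the structure autoregressive.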
Keywords: Voice conversion; Non-parallel data; Autoregressive model; LPCNet; Phonetic Posteriorgrams.
Comparison of the proposed system with state-of-the-art systems
S1 : Parallel VC system based on an ASR- and TTS-oriented pretraining strategy using Transformer models for sequence-to-sequence VC [Huang et al., 2021]
S2 : Parallel VC system based on a sequence-to-sequence mapping model with an attention mechanism [Zhang et al., 2019]
S3 : Non-parallel VC system based on StarGAN, a variant of the GAN model [Kameoka et al., 2020]
S4 : Non-parallel VC system based on a jointly trained conversion model and WaveNet vocoder [Liu et al., 2019]
Any-to-one non-parallel voice conversion is performed using the CMU-ARCTIC database (American English) between the following speaker pairs:
BDL-to-SLT
CLB-to-SLT
RMS-to-SLT
Male1 to Female1 VC --> RMS-to-SLT
Source | Target
---|---
[audio sample] | [audio sample]

S1 | S2 | S3 | S4 | Proposed
---|---|---|---|---
[audio sample] | [audio sample] | [audio sample] | [audio sample] | [audio sample]
Male2 to Female1 VC --> BDL-to-SLT
Source | Target
---|---
[audio sample] | [audio sample]

S1 | S2 | S3 | S4 | Proposed
---|---|---|---|---
[audio sample] | [audio sample] | [audio sample] | [audio sample] | [audio sample]
Female2 to Female1 VC --> CLB-to-SLT
Source | Target
---|---
[audio sample] | [audio sample]

S1 | S2 | S3 | S4 | Proposed
---|---|---|---|---
[audio sample] | [audio sample] | [audio sample] | [audio sample] | [audio sample]