Audio samples from:
"Intelligibility Improvement of Esophageal Speech using Sequence-to-Sequence Voice Conversion with Auditory Attention"
Abstract: Laryngectomees are people whose larynx has been surgically removed, usually because of laryngeal cancer. The immediate consequence of this operation is that they are unable to speak. Esophageal speech (ES) remains the preferred alternative speech method for laryngectomees. However, compared to laryngeal voice, ES is characterized by low intelligibility and poor quality due to a chaotic fundamental frequency, specific noises, and low intensity. To address these problems, we use voice conversion (VC) as an effective way to improve speech quality and intelligibility. Specifically, we propose a novel esophageal-to-laryngeal VC system based on a sequence-to-sequence (Seq2Seq) model combined with an auditory attention mechanism. Auditory attention allows the system to focus on specific traits of the source and target speech data, leading to more efficient and adaptive feature mapping. The originality of the proposed method is that it does not require the classical dynamic time warping (DTW) alignment step during the learning phase, which avoids erroneous mappings and significantly reduces computing time. In addition, to preserve the identity of the target speaker, the excitation and phase coefficients are estimated by querying a binary search tree built over the target training space, using the vocal tract coefficients previously predicted by the proposed Seq2Seq mapping model. In experiments, objective and subjective tests confirmed that the proposed approach performs better in terms of speech quality and intelligibility, even in some difficult cases.
Keywords: Esophageal speech; intelligibility; voice conversion; sequence-to-sequence; attention mechanism; speech quality.
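The excitation and phase lookup described at the end of the abstract can be illustrated with a small sketch. The abstract states only that a binary search tree over the target training space is queried with the predicted vocal tract coefficients. The k-d tree, the Euclidean distance, the 2-dimensional toy vectors, and every name below (`build_kdtree`, `lookup_excitation_phase`, the string payloads) are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a k-d tree (one kind of binary tree) is ASSUMED here;
# the source does not specify the tree type, distance measure, or feature size.

class Node:
    def __init__(self, point, payload, axis, left=None, right=None):
        self.point = point      # vocal tract vector of one target training frame
        self.payload = payload  # (excitation, phase) stored with that frame
        self.axis = axis        # coordinate used to split at this node
        self.left = left
        self.right = right


def build_kdtree(items, depth=0):
    """Build a k-d tree from a list of (vector, payload) pairs."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2
    point, payload = items[mid]
    return Node(point, payload, axis,
                build_kdtree(items[:mid], depth + 1),
                build_kdtree(items[mid + 1:], depth + 1))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return (squared distance, node) of the stored vector closest to query."""
    if node is None:
        return best
    d = sq_dist(node.point, query)
    if best is None or d < best[0]:
        best = (d, node)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best


def lookup_excitation_phase(tree, predicted_vocal_tract):
    """Return the (excitation, phase) payload of the nearest training frame."""
    _, node = nearest(tree, predicted_vocal_tract)
    return node.payload


# Hypothetical target-speaker frames: (vocal tract vector, (excitation, phase)).
training = [
    ((0.0, 0.0), ("exc_a", "phase_a")),
    ((1.0, 1.0), ("exc_b", "phase_b")),
    ((4.0, 0.5), ("exc_c", "phase_c")),
    ((0.5, 3.0), ("exc_d", "phase_d")),
]
tree = build_kdtree(training)

# Stand-in for a vocal tract vector predicted by the mapping model.
predicted = (0.9, 1.2)
excitation, phase = lookup_excitation_phase(tree, predicted)
```

Because the retrieved excitation and phase always come from real frames of the target speaker, a lookup of this kind keeps those components tied to the target voice, which is the motivation the abstract gives for this step.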
Comparison of different models (GMM and DNN baselines, LSTM and BiLSTM without an attention mechanism, and Seq2Seq with an attention mechanism):
Source Speakers (Esophageal Voice): "PC" and "MH"
Target Speaker (Laryngeal Voice): "AL"
- PC Voice Enhancement:
PC1 ES | GMM | DNN | LSTM | BiLSTM | Seq2Seq_att |
---|---|---|---|---|---|
[audio] | [audio] | [audio] | [audio] | [audio] | [audio] |
PC2 ES | GMM | DNN | LSTM | BiLSTM | Seq2Seq_att |
---|---|---|---|---|---|
[audio] | [audio] | [audio] | [audio] | [audio] | [audio] |
- MH Voice Enhancement:
MH1 ES | GMM | DNN | LSTM | BiLSTM | Seq2Seq_att |
---|---|---|---|---|---|
[audio] | [audio] | [audio] | [audio] | [audio] | [audio] |
MH2 ES | GMM | DNN | LSTM | BiLSTM | Seq2Seq_att |
---|---|---|---|---|---|
[audio] | [audio] | [audio] | [audio] | [audio] | [audio] |