Audio samples from:
"Intelligibility Improvement of Esophageal Speech using Sequence-to-Sequence Voice Conversion with Auditory Attention"
Abstract: Laryngectomees are people whose larynx has been surgically removed, usually due to laryngeal cancer. The immediate consequence of this operation is that they are unable to speak. Esophageal speech (ES) remains the preferred alternative speech method for laryngectomees. However, compared to laryngeal voice, ES is characterized by low intelligibility and poor quality due to a chaotic fundamental frequency, specific noises, and low intensity. To address these problems, we use voice conversion to improve speech quality and intelligibility. To this end, we propose a novel esophageal-to-laryngeal voice conversion (VC) system based on a sequence-to-sequence (Seq2Seq) model combined with an auditory attention mechanism. The auditory attention allows the system to focus on specific traits of the source and target speech data, leading to more efficient and adaptive feature mapping. The originality of the proposed method is that it does not require the classical DTW alignment process during the learning phase, which avoids erroneous mappings and significantly reduces the computing time. In addition, to preserve the identity of the target speaker, excitation and phase coefficients are estimated by querying a binary search tree built over the target training space, using the vocal tract coefficients previously predicted by the proposed Seq2Seq mapping model. In experiments, objective and subjective tests confirmed that the proposed approach performs better, even in some difficult cases, in terms of speech quality and intelligibility.

Keywords: Esophageal speech; intelligibility; voice conversion; sequence-to-sequence; attention mechanism; speech quality.
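
The excitation and phase lookup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says "binary search tree", so a k-d tree is assumed here as a stand-in, and the function names (build_kdtree, nearest, lookup_excitation_phase) and the toy coefficient values are hypothetical. The idea shown is that each predicted vocal tract vector is matched to its nearest frame in the target training data, and that frame's stored excitation and phase coefficients are reused.

```python
def build_kdtree(points, indices=None, depth=0):
    """Build a k-d tree (assumed stand-in for the binary search tree) over points."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])  # cycle through the coefficient dimensions
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return {
        "index": indices[mid],
        "axis": axis,
        "left": build_kdtree(points, indices[:mid], depth + 1),
        "right": build_kdtree(points, indices[mid + 1:], depth + 1),
    }


def nearest(node, points, query, best=None):
    """Return (squared_distance, index) of the stored point closest to query."""
    if node is None:
        return best
    point = points[node["index"]]
    dist = sum((a - b) ** 2 for a, b in zip(point, query))
    if best is None or dist < best[0]:
        best = (dist, node["index"])
    diff = query[node["axis"]] - point[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, points, query, best)
    # Visit the other side only if the splitting plane is closer than the best match.
    if diff ** 2 < best[0]:
        best = nearest(far, points, query, best)
    return best


def lookup_excitation_phase(tree, vocal_tract, excitation_phase, predicted):
    """For each predicted vocal tract vector, reuse the nearest target frame's coefficients."""
    return [excitation_phase[nearest(tree, vocal_tract, q)[1]] for q in predicted]


# Toy target training data (hypothetical values): one row per frame.
target_vocal_tract = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [0.5, 2.0]]
target_excitation_phase = [[0.1, 0.5], [0.2, 0.7], [0.3, 0.9], [0.4, 1.1]]

tree = build_kdtree(target_vocal_tract)

# Vocal tract vectors as they might come out of the mapping model (hypothetical).
predicted_vocal_tract = [[0.9, 1.2], [1.9, 0.4]]
result = lookup_excitation_phase(
    tree, target_vocal_tract, target_excitation_phase, predicted_vocal_tract
)
```

With these toy values the two predicted vectors fall closest to the second and third training frames, so result holds those frames' excitation and phase rows. Because the retrieved coefficients always come from the target speaker's own data, this kind of lookup is one plausible way to keep the target identity, as the abstract suggests.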


Comparison of different models (GMM and DNN baselines, LSTM and BiLSTM without an attention mechanism, and Seq2Seq with an attention mechanism):
  1. Source Speakers (Esophageal Voice): "PC" and "MH"
  2. Target Speaker (Laryngeal Voice): "AL"


  • PC Voice Enhancement:
PC1 ES GMM DNN LSTM BiLSTM Seq2Seq_att

PC2 ES GMM DNN LSTM BiLSTM Seq2Seq_att

  • MH Voice Enhancement:
MH1 ES GMM DNN LSTM BiLSTM Seq2Seq_att

MH2 ES GMM DNN LSTM BiLSTM Seq2Seq_att