Accent is a crucial aspect of speech that helps define one's identity. We note that the state-of-the-art Text-to-Speech (TTS) systems can achieve high-quality generated voice, but still lack in terms of versatility and customizability. Moreover, they generally do not take into account accent, which is an important feature of speaking style. In this work, we utilize the concept of Multi-level VAE (ML-VAE) to build a control mechanism that aims to disentangle accent from a reference accented speaker; and to synthesize voices in different accents such as English, American, Irish, and Scottish. The proposed framework can also achieve high-quality accented voice generation for multi-speaker setup, which we believe is remarkable. We investigate the performance through objective metrics and conduct listening experiments for a subjective performance assessment. We showed that the proposed method achieves good performance for naturalness, speaker similarity, and accent similarity.
LEARNING ACCENT REPRESENTATION WITH MULTI-LEVEL VAE TOWARDS CONTROLLABLE SPEECH SYNTHESIS
Ambuj Mehrish;
2022-01-01
Abstract
Accent is a crucial aspect of speech that helps define one's identity. We note that the state-of-the-art Text-to-Speech (TTS) systems can achieve high-quality generated voice, but still lack in terms of versatility and customizability. Moreover, they generally do not take into account accent, which is an important feature of speaking style. In this work, we utilize the concept of Multi-level VAE (ML-VAE) to build a control mechanism that aims to disentangle accent from a reference accented speaker; and to synthesize voices in different accents such as English, American, Irish, and Scottish. The proposed framework can also achieve high-quality accented voice generation for multi-speaker setup, which we believe is remarkable. We investigate the performance through objective metrics and conduct listening experiments for a subjective performance assessment. We showed that the proposed method achieves good performance for naturalness, speaker similarity, and accent similarity.| File | Dimensione | Formato | |
|---|---|---|---|
|
Learning_Accent_Representation_with_Multi-Level_VAE_Towards_Controllable_Speech_Synthesis.pdf
non disponibili
Tipologia:
Versione dell'editore
Licenza:
Copyright dell'editore
Dimensione
659.88 kB
Formato
Adobe PDF
|
659.88 kB | Adobe PDF | Visualizza/Apri |
I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.



