LEARNING ACCENT REPRESENTATION WITH MULTI-LEVEL VAE TOWARDS CONTROLLABLE SPEECH SYNTHESIS

Ambuj Mehrish
2022-01-01

Abstract

Accent is a crucial aspect of speech that helps define a speaker's identity. State-of-the-art Text-to-Speech (TTS) systems generate high-quality speech but still lack versatility and customizability. Moreover, they generally do not account for accent, which is an important feature of speaking style. In this work, we use the Multi-level VAE (ML-VAE) to build a control mechanism that aims to disentangle accent from a reference accented speaker and to synthesize voices in different accents such as English, American, Irish, and Scottish. The proposed framework also achieves high-quality accented voice generation in a multi-speaker setup, which we believe is remarkable. We evaluate performance with objective metrics and conduct listening experiments for subjective assessment. The results show that the proposed method achieves good performance in naturalness, speaker similarity, and accent similarity.
2022 IEEE Spoken Language Technology Workshop (SLT)

Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5105953