LEARNING ACCENT REPRESENTATION WITH MULTI-LEVEL VAE TOWARDS CONTROLLABLE SPEECH SYNTHESIS

Accent is a crucial aspect of speech that helps define one's identity. We note that the state-of-the-art Text-to-Speech (TTS) systems can achieve high-quality generated voice, but still lack in terms of versatility and customizability. Moreover, they generally do not take into account accent, which is an important feature of speaking style. In this work, we utilize the concept of Multi-level VAE (ML-VAE) to build a control mechanism that aims to disentangle accent from a reference accented speaker; and to synthesize voices in different accents such as English, American, Irish, and Scottish. The proposed framework can also achieve high-quality accented voice generation for multi-speaker setup, which we believe is remarkable. We investigate the performance through objective metrics and conduct listening experiments for a subjective performance assessment. We showed that the proposed method achieves good performance for naturalness, speaker similarity, and accent similarity.

LEARNING ACCENT REPRESENTATION WITH MULTI-LEVEL VAE TOWARDS CONTROLLABLE SPEECH SYNTHESIS

Jan Melechovsky;Ambuj Mehrish;Dorien Herremans;Berrak Sisman

2022

Abstract

Accent is a crucial aspect of speech that helps define one's identity. We note that the state-of-the-art Text-to-Speech (TTS) systems can achieve high-quality generated voice, but still lack in terms of versatility and customizability. Moreover, they generally do not take into account accent, which is an important feature of speaking style. In this work, we utilize the concept of Multi-level VAE (ML-VAE) to build a control mechanism that aims to disentangle accent from a reference accented speaker; and to synthesize voices in different accents such as English, American, Irish, and Scottish. The proposed framework can also achieve high-quality accented voice generation for multi-speaker setup, which we believe is remarkable. We investigate the performance through objective metrics and conduct listening experiments for a subjective performance assessment. We showed that the proposed method achieves good performance for naturalness, speaker similarity, and accent similarity.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno pubblicazione
	
				2022
			
	Titolo del volume
	
				2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT
			
	DOI
	
				https://dx.doi.org/10.1109/SLT54892.2023.10023072
			
	Appare nelle tipologie:
	
				4.1 Articolo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
Learning_Accent_Representation_with_Multi-Level_VAE_Towards_Controllable_Speech_Synthesis.pdf non disponibili Tipologia: Versione dell'editore Licenza: Copyright dell'editore Dimensione 659.88 kB Formato Adobe PDF Visualizza/Apri	659.88 kB	Adobe PDF	Visualizza/Apri

I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10278/5105953

Citazioni

ND

8

1

social impact