It is an open challenge to obtain high quality training data, espe- cially captions, for text-to-audio models. Although prior meth- ods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio lan- guage model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new state-of-the-art

Improving Text-To-Audio Models with Synthetic Captions

Ambuj Mehrish;
2024-01-01

Abstract

It is an open challenge to obtain high quality training data, espe- cially captions, for text-to-audio models. Although prior meth- ods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio lan- guage model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new state-of-the-art
2024
Synthetic Data’s Transformative Role in Foundational Speech Models
File in questo prodotto:
File Dimensione Formato  
kong24_syndata4genai.pdf

non disponibili

Tipologia: Versione dell'editore
Licenza: Copyright dell'editore
Dimensione 282.26 kB
Formato Adobe PDF
282.26 kB Adobe PDF   Visualizza/Apri

I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10278/5105954
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact