BAMBI: Developing BAby language Models for Italian

Alice Suozzi; Gianluca E. Lebani; Alessandro Lenci
2025-01-01

Abstract

This paper presents BAMBI (BAby language Models Bootstrapped for Italian), a series of Baby Language Models (BabyLMs) trained on data that mimic the linguistic input received by a five-year-old Italian-speaking child. The BAMBI models are tested using BaBIEs (Capone et al. 2024), a benchmark specifically designed to evaluate LMs that takes into account the amount of training input the models received. The BAMBI models are compared against a large LM and a vision-language model to study the contribution of extralinguistic information to language acquisition. The results of our evaluation align with the existing literature on English LMs, confirming that while reduced training data support the development of relatively robust syntactic competence, they are insufficient for fostering semantic understanding. However, the gap in training resources (data and computation) between the BAMBI models and the LLMs is not fully reflected in their performance: despite the LLMs' massive training, their performance is not much better than that of the BAMBI models. This suggests that strategies beyond scaling training resources, such as data curation, the inclusion of multimodal input, and alternative training strategies such as curriculum learning, could play a crucial role in shaping models' performance.
Files in this record

1720-9331-42687-6.pdf (not available for download)
Type: Publisher's version
License: Publisher's copyright
Size: 535.21 kB
Format: Adobe PDF

Documents in ARCA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5099727
Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science (ISI): 0