The Specchieri MarVen Dataset: an Abbreviation-Rich Dataset in Venetian Idiom

Sara Ferro; Debora Pasquariello; Marcello Pelillo; Arianna Traviglia
2023-01-01

Abstract

Despite the release of numerous datasets for training models in historical handwritten text recognition, there is still a significant need for more diverse and extensive data. This paper contributes to bridging this gap by introducing a new dataset comprising 159 pages from an Early Modern volume that is part of the Venetian ‘Marigold’ collection. The dataset contains various abbreviations whose transcription is key to a complete understanding of the content. To accommodate different research needs, the dataset is released in two versions: one with ‘expanded’ abbreviations and another in which the abbreviations are removed, in line with the choices made for other released datasets. Additionally, the dataset encompasses two distinct writing styles, leading us to provide three separate splits for training and evaluating machine learning models: one combining both styles and one for each individual style. The qualitative and quantitative characteristics of all dataset configurations are analysed. In addition, three diverse architectures for handwritten text recognition are trained to assess their performance on this dataset. The dataset is available for download at https://doi.org/10.48557/GJYJTW.
2023
International Conference on Image Analysis and Processing

Documents in ARCA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5045122