The Specchieri MarVen Dataset: an Abbreviation-Rich Dataset in Venetian Idiom
Sara Ferro;Debora Pasquariello;Marcello Pelillo;Arianna Traviglia
2023-01-01
Abstract
Despite the release of numerous datasets for training models in historical handwritten text recognition, there is still a significant need for more diverse and extensive data. This paper contributes to bridging this gap by introducing a new dataset comprising 159 pages from an Early Modern volume that is part of the Venetian ‘Marigold’ collection. The dataset contains various abbreviations whose transcription is key to a complete understanding of the content. To accommodate different research needs, the dataset is released in two versions: one with ‘expanded’ abbreviations and another in which the abbreviations are removed, in line with the choices made for other released datasets. Additionally, the dataset encompasses two distinct writing styles, so three separate splits are provided for training and evaluating machine learning models: one combining both styles and one for each individual style. The qualitative and quantitative characteristics of all dataset configurations are analysed. In addition, three diverse architectures for handwritten text recognition are trained to assess their performance on this dataset. The dataset is available for download at https://doi.org/10.48557/GJYJTW.