Il corpus digitale AttiChiari: costruzione, analisi, strumenti di ricerca
Daniele Fusi
2024-01-01
Abstract
This paper examines the problems of indexing legal documents in order to build corpora for linguistic and juridical analysis. The main challenges are pseudonymising the original texts systematically without disrupting their linguistic texture, and integrating an open-ended set of metadata from different sources: document-level metadata, typographic features, and linguistic annotations produced by NLP tools. Meeting these requirements calls for a higher level of abstraction in the representation of textual and metatextual data, so that the search process can rely on a uniform interface. The paper introduces the architecture of a new search engine, which originated in the linguistic and metrical analysis of Classical texts and meets these requirements through a "dematerialization" of the text, that is, its transformation into a more abstract representation. Unlike traditional systems, the engine operates on objects, each a container of metadata properties, rather than on character sequences. This allows for more complex and varied metadata and extends search beyond the simple comparison of character sequences. The paper also outlines the modular components of the system, which support interactive consultation of the texts and so provide an environment suited to both research and reading. The whole process, from the original document to its index, is described as a modular pipeline that can incorporate different text-analysis procedures and metadata from different sources, and that finally renders the document in a typographically rich format for presentation to the end user.
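As a rough illustration of what searching objects rather than character sequences can mean, the sketch below models each token as a container of named properties and matches on those properties. It is not the engine described in the paper: the `TextObject` class, the `find` function, the property names (`value`, `pos`, `italic`, `pseudonym`) and the sample tokens are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TextObject:
    """A searchable unit (here a token) reduced to a bag of metadata properties."""
    position: int
    properties: dict = field(default_factory=dict)


def find(objects, **criteria):
    """Return the objects whose properties match every name=value criterion."""
    return [
        obj for obj in objects
        if all(obj.properties.get(name) == value for name, value in criteria.items())
    ]


# Hypothetical sample: three tokens whose properties come from different sources
# (the text itself, an NLP tagger, typography, a pseudonymisation step).
tokens = [
    TextObject(1, {"value": "il", "pos": "DET"}),
    TextObject(2, {"value": "ricorrente", "pos": "NOUN", "italic": "true"}),
    TextObject(3, {"value": "Rossi", "pos": "PROPN", "pseudonym": "true"}),
]

# The same interface serves linguistic, typographic and pseudonymisation queries.
nouns_in_italic = find(tokens, pos="NOUN", italic="true")
pseudonymised = find(tokens, pseudonym="true")
print([t.position for t in nouns_in_italic])
print([t.position for t in pseudonymised])
```

In this toy model the text value is just one property among others, so a query on typography or part of speech has the same shape as a query on the word form.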