Testi in maschera: nuovi strumenti per la sicurezza e l'analisi linguistica di corpora giuridici [Masked Texts: New Tools for the Security and Linguistic Analysis of Legal Corpora]

Daniele Fusi
2024-01-01

Abstract

The Atti Chiari project, which is collecting the first large Italian corpus of judicial acts, faces strict legal requirements as well as many peculiarities of language and content; a number of processes and tools have been designed and implemented to meet them. The first requirement is to remove all personal data from the documents without destroying their linguistic form or compromising their readability. To this end, a pseudonymization procedure has been built on a preliminary annotation stage, which adds information precisely so that it can then be removed in different ways for different purposes (linguistic analysis, legal analysis, etc.). This light annotation serves not only pseudonymization but also the conversion of documents from their original presentational format into a semantic one based on TEI. Once prepared in this way, the documents are gathered into a central corpus, ready to be indexed for linguistic research. Because multiple search criteria must be combined, whatever their origin and model, a new type of search engine, designed primarily for the philological field, has been used to obtain the required openness and granularity of metadata.
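
The idea of annotating first and then removing the annotated data in purpose-specific ways can be illustrated with a minimal Python sketch. It is not the project's implementation: the `Span` type, the `pseudonymize` function, the category labels, and the two mode names (`"linguistic"`, `"legal"`) with their replacement policies are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A hypothetical annotation marking one stretch of personal data."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    kind: str   # assumed category label, e.g. "person" or "place"


def pseudonymize(text: str, spans: list[Span], mode: str) -> str:
    """Replace every annotated span according to a purpose-specific policy."""
    counters: dict[str, int] = {}              # last index used per category
    aliases: dict[tuple[str, str], str] = {}   # (kind, surface form) -> alias
    out: list[str] = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        surface = text[span.start:span.end]
        if mode == "linguistic":
            # Assumed policy: the same entity always gets the same numbered
            # alias, so repeated mentions remain recognisable to the reader.
            key = (span.kind, surface)
            if key not in aliases:
                counters[span.kind] = counters.get(span.kind, 0) + 1
                aliases[key] = f"[{span.kind.upper()}_{counters[span.kind]}]"
            replacement = aliases[key]
        elif mode == "legal":
            # Assumed policy: only the category of the removed data survives.
            replacement = f"[{span.kind.upper()}]"
        else:
            raise ValueError(f"unknown mode: {mode}")
        out.append(text[cursor:span.start])
        out.append(replacement)
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)
```

Under these assumptions, one annotated document yields several outputs: calling `pseudonymize` with the same spans and a different `mode` changes only the replacement policy, while the unannotated text around the spans is left untouched.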
Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5060343