Enriching Documents by Linking Salient Entities and Lexical-Semantic Expansion

This paper explores a multi-strategy technique that aims at enriching text documents for improving clustering quality. We use a combination of entity linking and document summarization in order to determine the identity of the most salient entities mentioned in texts. To effectively enrich documents without introducing noise, we limit ourselves to the text fragments mentioning the salient entities, in turn, belonging to a knowledge base like Wikipedia, while the actual enrichment of text fragments is carried out using WordNet. To feed clustering algorithms, we investigate different document representations obtained using several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, by combining multiple clustering results obtained using different document representations. Our experiments indicate that our novel enriching strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora like The British Broadcasting Corporation (BBC) NEWS.

Enriching Documents by Linking Salient Entities and Lexical-Semantic Expansion

Pourvali, Mohsen;Orlando, Salvatore

2020

Abstract

This paper explores a multi-strategy technique that aims at enriching text documents for improving clustering quality. We use a combination of entity linking and document summarization in order to determine the identity of the most salient entities mentioned in texts. To effectively enrich documents without introducing noise, we limit ourselves to the text fragments mentioning the salient entities, in turn, belonging to a knowledge base like Wikipedia, while the actual enrichment of text fragments is carried out using WordNet. To feed clustering algorithms, we investigate different document representations obtained using several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, by combining multiple clustering results obtained using different document representations. Our experiments indicate that our novel enriching strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora like The British Broadcasting Corporation (BBC) NEWS.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno pubblicazione
	
				2020
			
	Titolo della Rivista
	
				JOURNAL OF INTELLIGENT SYSTEMS
			
	N° Volume
	
				0
			
	DOI
	
				https://dx.doi.org/10.1515/jisys-2018-0098
			
	Appare nelle tipologie:
	
				2.1 Articolo su rivista

File in questo prodotto:

File	Dimensione	Formato
10.1515_jisys-2018-0098.pdf accesso aperto Tipologia: Versione dell'editore Licenza: Accesso libero (no vincoli) Dimensione 1.09 MB Formato Adobe PDF Visualizza/Apri	1.09 MB	Adobe PDF	Visualizza/Apri

I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10278/3711322

Citazioni

ND

0

ND

social impact