This paper explores a multi-strategy technique for enriching text documents in order to improve clustering quality. We combine entity linking and document summarization to identify the most salient entities mentioned in a text. To enrich documents effectively without introducing noise, we restrict enrichment to the text fragments that mention salient entities, where those entities belong to a knowledge base such as Wikipedia; the enrichment of these fragments is itself carried out using WordNet. As input to the clustering algorithms, we investigate different document representations obtained from several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, which combines the multiple clustering results obtained from the different document representations. Our experiments indicate that these novel enrichment strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora such as the British Broadcasting Corporation (BBC) News corpus.
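The abstract does not say how the individual clustering results are combined. As a minimal sketch of one common option, the Python below builds a co-association matrix (the fraction of base clusterings that place two documents together) and merges documents whose agreement exceeds a threshold. The function names, the 0.5 threshold, and the toy label lists are all assumptions made for illustration, not details taken from the paper.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of base clusterings that put each document pair in one cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


def consensus_clusters(labelings, threshold=0.5):
    """Merge documents whose co-association exceeds the (assumed) threshold."""
    n = len(labelings[0])
    matrix = co_association(labelings)
    parent = list(range(n))  # union-find forest over document indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if matrix[i][j] > threshold:
            parent[find(j)] = find(i)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical cluster labels for five documents, one list per
# document representation (e.g. plain, enriched, entity-based).
labelings = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
consensus = consensus_clusters(labelings)
print(consensus)
```

Because the co-association matrix compares only whether two documents share a label, the base clusterings do not need aligned label identifiers, which is why the third list above can use swapped labels without affecting the result.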
Pourvali, Mohsen (Corresponding)
|Title:||Enriching Documents by Linking Salient Entities and Lexical-Semantic Expansion|
|Publication date:||2018|
|Appears in categories:||2.1 Journal article|
Files in this item:
|enriching.pdf||Pre-print document||Closed access (personal)||Restricted|