Mining Top-K Patterns from Binary Datasets in presence of Noise
LUCCHESE, Claudio; ORLANDO, Salvatore
2010-01-01
Abstract
The discovery of patterns in binary datasets has many applications, e.g., in electronic commerce, TCP/IP networking, and Web usage logging. Still, it is a challenging task in several respects: handling overlapping vs. non-overlapping patterns, coping with noise, and extracting only the most important patterns. In this paper we formalize the problem of discovering the Top-K patterns from binary datasets in the presence of noise as the minimization of a novel cost function. In accordance with the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that may approximately describe the input data. We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data. Copyright © by SIAM.
File | Size | Format | |
---|---|---|---|
pandaSDM_2010.pdf (access: not available; type: post-print document; license: not defined) | 1.59 MB | Adobe PDF | |
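To make the idea in the abstract concrete, the following is a minimal sketch of a description-length style cost for a set of patterns over a binary matrix, together with a generic greedy selection of at most K patterns. The abstract does not state the cost function or the steps of PaNDa, so everything here is an assumption made for illustration: a pattern is taken to be a pair (set of column indices, set of row indices), its cost is the number of indices it lists, and noise is counted as the cells where the data and the union of the patterns disagree. The names `description_cost` and `greedy_top_k` are hypothetical.

```python
from typing import List, Set, Tuple

# Assumed representation: (column/item indices, row/transaction indices).
Pattern = Tuple[Set[int], Set[int]]
Matrix = List[List[int]]


def description_cost(data: Matrix, patterns: List[Pattern]) -> int:
    """Assumed MDL-style cost: size of the patterns plus number of noisy cells."""
    n_rows = len(data)
    n_cols = len(data[0]) if data else 0

    # Cells claimed to be 1 by at least one pattern.
    covered = [[0] * n_cols for _ in range(n_rows)]
    for items, rows in patterns:
        for r in rows:
            for c in items:
                covered[r][c] = 1

    # Cost of writing down the patterns themselves.
    pattern_cost = sum(len(items) + len(rows) for items, rows in patterns)

    # Cells where the patterns disagree with the data: missing or spurious ones.
    noise_cost = sum(
        1
        for r in range(n_rows)
        for c in range(n_cols)
        if data[r][c] != covered[r][c]
    )
    return pattern_cost + noise_cost


def greedy_top_k(
    data: Matrix, candidates: List[Pattern], k: int
) -> Tuple[List[Pattern], int]:
    """Generic greedy selection (not the PaNDa procedure): repeatedly add the
    candidate that lowers the cost most, stopping at k or when nothing helps."""
    chosen: List[Pattern] = []
    best = description_cost(data, chosen)
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        costs = [description_cost(data, chosen + [p]) for p in remaining]
        i = min(range(len(costs)), key=costs.__getitem__)
        if costs[i] >= best:
            break  # no remaining candidate shortens the description
        best = costs[i]
        chosen.append(remaining.pop(i))
    return chosen, best
```

Under these assumptions, a pattern that tolerates a few flipped bits can still shorten the overall description, which is the sense in which a succinct pattern set may only approximately describe the input data.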
Documents in ARCA are protected by copyright and all rights are reserved, unless otherwise indicated.