Sketching and Streaming for Dictionary Compression

We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv ’77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function δ. As a matter of fact, δ itself is a very accurate measure of compressibility: it is monotone under concatenation, invariant under reversals and alphabet permutations, sub-additive, and asymptotically tight (in terms of worst-case entropy) for representing strings, up to polylogarithmic factors.We present a data sketch of O(ε −3 log n + ε −1 log 2 n) words that allows computing a multiplicative (1 ± ε)-approximation of δ with high probability, where n is the string length. The sketches of two strings S 1 ,S 2 can be merged in O(ε −1 log 2 n) time to yield the sketch of {S 1 ,S 2 }, speeding up the computation of Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in O(ε −1 log 2 n) time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating δ, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in $O(\sqrt n \cdot \log n)$ working space and O(ε −1 log 2 n) worst-case delay per character. We show that an implementation of our streaming algorithm can estimate δ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction.

Sketching and Streaming for Dictionary Compression

Becker, Ruben^{Membro del Collaboration Group};Canton, Matteo;Cenzato, Davide^{Membro del Collaboration Group};Kim, Sung-Hwan^{Membro del Collaboration Group};Kodric, Bojana^{Membro del Collaboration Group};Prezza, Nicola^{Membro del Collaboration Group}

2024-01-01

Abstract

We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv ’77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function δ. As a matter of fact, δ itself is a very accurate measure of compressibility: it is monotone under concatenation, invariant under reversals and alphabet permutations, sub-additive, and asymptotically tight (in terms of worst-case entropy) for representing strings, up to polylogarithmic factors.We present a data sketch of O(ε −3 log n + ε −1 log 2 n) words that allows computing a multiplicative (1 ± ε)-approximation of δ with high probability, where n is the string length. The sketches of two strings S 1 ,S 2 can be merged in O(ε −1 log 2 n) time to yield the sketch of {S 1 ,S 2 }, speeding up the computation of Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in O(ε −1 log 2 n) time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating δ, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in $O(\sqrt n \cdot \log n)$ working space and O(ε −1 log 2 n) worst-case delay per character. We show that an implementation of our streaming algorithm can estimate δ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno pubblicazione
	
				2024
			
	Titolo del volume
	
				Proceedings of the 2024 Data Compression Conference (DCC)
			
	DOI
	
				https://dx.doi.org/10.1109/dcc58796.2024.00029
			
	Appare nelle tipologie:
	
				4.1 Articolo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
delta_stream.pdf non disponibili Descrizione: full text Tipologia: Documento in Pre-print Licenza: Copyright dell'editore Dimensione 3.33 MB Formato Adobe PDF Visualizza/Apri	3.33 MB	Adobe PDF	Visualizza/Apri

I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10278/5069483

Citazioni

ND

0

0

social impact