We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv ’77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function δ. As a matter of fact, δ itself is a very accurate measure of compressibility: it is monotone under concatenation, invariant under reversals and alphabet permutations, sub-additive, and asymptotically tight (in terms of worst-case entropy) for representing strings, up to polylogarithmic factors.We present a data sketch of O(ε −3 log n + ε −1 log 2 n) words that allows computing a multiplicative (1 ± ε)-approximation of δ with high probability, where n is the string length. The sketches of two strings S 1 ,S 2 can be merged in O(ε −1 log 2 n) time to yield the sketch of {S 1 ,S 2 }, speeding up the computation of Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in O(ε −1 log 2 n) time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating δ, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in $O(\sqrt n \cdot \log n)$ working space and O(ε −1 log 2 n) worst-case delay per character. We show that an implementation of our streaming algorithm can estimate δ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction.

Sketching and Streaming for Dictionary Compression

Becker, Ruben
Membro del Collaboration Group
;
Cenzato, Davide
Membro del Collaboration Group
;
Kim, Sung-Hwan
Membro del Collaboration Group
;
Kodric, Bojana
Membro del Collaboration Group
;
Prezza, Nicola
Membro del Collaboration Group
2024-01-01

Abstract

We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv ’77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function δ. As a matter of fact, δ itself is a very accurate measure of compressibility: it is monotone under concatenation, invariant under reversals and alphabet permutations, sub-additive, and asymptotically tight (in terms of worst-case entropy) for representing strings, up to polylogarithmic factors.We present a data sketch of O(ε −3 log n + ε −1 log 2 n) words that allows computing a multiplicative (1 ± ε)-approximation of δ with high probability, where n is the string length. The sketches of two strings S 1 ,S 2 can be merged in O(ε −1 log 2 n) time to yield the sketch of {S 1 ,S 2 }, speeding up the computation of Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in O(ε −1 log 2 n) time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating δ, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in $O(\sqrt n \cdot \log n)$ working space and O(ε −1 log 2 n) worst-case delay per character. We show that an implementation of our streaming algorithm can estimate δ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction.
2024
Proceedings of the 2024 Data Compression Conference (DCC)
File in questo prodotto:
File Dimensione Formato  
delta_stream.pdf

non disponibili

Descrizione: full text
Tipologia: Documento in Pre-print
Licenza: Copyright dell'editore
Dimensione 3.33 MB
Formato Adobe PDF
3.33 MB Adobe PDF   Visualizza/Apri

I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10278/5069483
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact