We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv ’77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function δ. As a matter of fact, δ itself is a very accurate measure of compressibility: it is monotone under concatenation, invariant under reversals and alphabet permutations, sub-additive, and asymptotically tight (in terms of worst-case entropy) for representing strings, up to polylogarithmic factors.We present a data sketch of O(ε −3 log n + ε −1 log 2 n) words that allows computing a multiplicative (1 ± ε)-approximation of δ with high probability, where n is the string length. The sketches of two strings S 1 ,S 2 can be merged in O(ε −1 log 2 n) time to yield the sketch of {S 1 ,S 2 }, speeding up the computation of Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in O(ε −1 log 2 n) time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating δ, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in $O(\sqrt n \cdot \log n)$ working space and O(ε −1 log 2 n) worst-case delay per character. We show that an implementation of our streaming algorithm can estimate δ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction.
Sketching and Streaming for Dictionary Compression
Becker, RubenMembro del Collaboration Group
;Cenzato, DavideMembro del Collaboration Group
;Kim, Sung-HwanMembro del Collaboration Group
;Kodric, BojanaMembro del Collaboration Group
;Prezza, NicolaMembro del Collaboration Group
2024-01-01
Abstract
We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv ’77, the run-length Burrows-Wheeler transform, and grammar compression. To this end, we focus on a measure that has recently gained much attention in the information-theoretic community and which approximates up to a polylogarithmic multiplicative factor the output sizes of those compressors: the normalized substring complexity function δ. As a matter of fact, δ itself is a very accurate measure of compressibility: it is monotone under concatenation, invariant under reversals and alphabet permutations, sub-additive, and asymptotically tight (in terms of worst-case entropy) for representing strings, up to polylogarithmic factors.We present a data sketch of O(ε −3 log n + ε −1 log 2 n) words that allows computing a multiplicative (1 ± ε)-approximation of δ with high probability, where n is the string length. The sketches of two strings S 1 ,S 2 can be merged in O(ε −1 log 2 n) time to yield the sketch of {S 1 ,S 2 }, speeding up the computation of Normalized Compression Distances (NCD). If random access is available on the input, our sketch can be updated in O(ε −1 log 2 n) time for each character right-extension of the string. This yields a polylogarithmic-space algorithm for approximating δ, improving exponentially over the working space of the state-of-the-art algorithms running in nearly-linear time. Motivated by the fact that random access is not always available on the input data, we then present a streaming algorithm computing our sketch in $O(\sqrt n \cdot \log n)$ working space and O(ε −1 log 2 n) worst-case delay per character. We show that an implementation of our streaming algorithm can estimate δ on a dataset of 189GB with a throughput of 203MB per minute while using only 5MB of RAM, and that our sketch speeds up the computation of all-pairs NCD distances by one order of magnitude, with applications to phylogenetic tree reconstruction.File | Dimensione | Formato | |
---|---|---|---|
delta_stream.pdf
non disponibili
Descrizione: full text
Tipologia:
Documento in Pre-print
Licenza:
Copyright dell'editore
Dimensione
3.33 MB
Formato
Adobe PDF
|
3.33 MB | Adobe PDF | Visualizza/Apri |
I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.