Unlike in statistical compression, where Shannon’s entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ lower bounds all the previous relevant ones (including z), yet length-n strings can be represented and efficiently indexed within space O(γlognγ), which also upper bounds most measures (including z). While γ is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and no o(γlog n) -space representation of strings is known. In this paper, we study a smaller measure, δ≤ γ, which can be computed in linear time. We show that δ better captures the compressibility of repetitive strings. For every length n and every value δ≥ 2, we construct a string such that γ=Ω(δlognδ). Still, we show a representation of any string S in O(δlognδ) space that supports direct access to any character S[i] in time O(lognδ) and finds the occ occurrences of any pattern P[1.m] in time O(mlog n+ occlogεn) for any constant ε> 0. Further, we prove that no o(δlog n) -space representation exists: for every length n and every value 2 ≤ δ≤ n1-ε, we exhibit a string family whose elements can only be encoded in Ω(δlognδ) space. We complete our characterization of δ by showing that, although γ, z, and other repetitiveness measures are always O(δlognδ), for strings of any length n, the smallest context-free grammar can be of size Ω(δlog2n/ log log n). No such separation is known for γ.
Towards a Definitive Measure of Repetitiveness
Prezza, Nicola
2020-01-01
Abstract
Unlike in statistical compression, where Shannon’s entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ lower bounds all the previous relevant ones (including z), yet length-n strings can be represented and efficiently indexed within space O(γlognγ), which also upper bounds most measures (including z). While γ is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and no o(γlog n) -space representation of strings is known. In this paper, we study a smaller measure, δ≤ γ, which can be computed in linear time. We show that δ better captures the compressibility of repetitive strings. For every length n and every value δ≥ 2, we construct a string such that γ=Ω(δlognδ). Still, we show a representation of any string S in O(δlognδ) space that supports direct access to any character S[i] in time O(lognδ) and finds the occ occurrences of any pattern P[1.m] in time O(mlog n+ occlogεn) for any constant ε> 0. Further, we prove that no o(δlog n) -space representation exists: for every length n and every value 2 ≤ δ≤ n1-ε, we exhibit a string family whose elements can only be encoded in Ω(δlognδ) space. We complete our characterization of δ by showing that, although γ, z, and other repetitiveness measures are always O(δlognδ), for strings of any length n, the smallest context-free grammar can be of size Ω(δlog2n/ log log n). No such separation is known for γ.File | Dimensione | Formato | |
---|---|---|---|
delta.pdf
accesso aperto
Descrizione: articolo principale, versione preprint (arXiv)
Tipologia:
Documento in Pre-print
Licenza:
Accesso gratuito (solo visione)
Dimensione
217.55 kB
Formato
Adobe PDF
|
217.55 kB | Adobe PDF | Visualizza/Apri |
I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.