Towards a Definitive Compressibility Measure for Repetitive Sequences

Prezza, Nicola
2022-01-01

Abstract

While the k-th order empirical entropy is an accepted measure of the compressibility of individual sequences on classical text collections, it is useful only for small values of k and thus fails to capture the compressibility of repetitive sequences. In the absence of an established way of quantifying the latter, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. The size b ≤ z of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute, and it is not monotone upon appending symbols. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ ≤ b lower-bounds all the previous relevant ones, while length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper-bounds many measures, including z. Although γ is arguably a better measure of repetitiveness than b, it is also NP-complete to compute and not monotone, and it is unknown whether one can represent all strings in o(γ log n) space. In this paper, we study an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotone, and allows encoding every string in O(δ log(n/δ)) space because z = O(δ log(n/δ)). We argue that δ better captures the compressibility of repetitive strings. Concretely, we show that (1) δ can be strictly smaller than γ, by up to a logarithmic factor; (2) there are string families needing Ω(δ log(n/δ)) space to be encoded, so this space is optimal for every n and δ; (3) one can build run-length context-free grammars of size O(δ log(n/δ)), whereas the smallest (non-run-length) grammar can be up to Θ(log n / log log n) times larger; and (4) within O(δ log(n/δ)) space, we can not only represent a string but also offer logarithmic-time access to its symbols, computation of substring fingerprints, and efficient indexed searches for pattern occurrences. We further refine the above results to account for the alphabet size σ of the string, showing that Θ(δ log((n log σ)/(δ log n))) space is necessary and sufficient to represent the string and to efficiently support access, fingerprinting, and pattern matching queries.
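
The abstract does not restate the definition of δ: in the paper, δ(S) is the maximum over k ≥ 1 of d_k(S)/k, where d_k(S) is the number of distinct length-k substrings of S (the "substring complexity"). As a rough illustration of that definition only — not of the paper's linear-time algorithm — the following Python sketch computes δ naively; the function name and test strings are purely illustrative.

    # Minimal sketch, assuming the standard definition
    # delta(S) = max over k >= 1 of d_k(S) / k, where d_k(S) is the number of
    # distinct length-k substrings of S. This naive version does O(n^2)
    # substring extractions; the O(n)-time computation claimed in the abstract
    # needs heavier machinery (e.g., a suffix tree).
    from fractions import Fraction

    def delta(s: str) -> Fraction:
        n = len(s)
        best = Fraction(0)
        for k in range(1, n + 1):
            # count distinct substrings of length k
            d_k = len({s[i:i + k] for i in range(n - k + 1)})
            best = max(best, Fraction(d_k, k))
        return best

    if __name__ == "__main__":
        print(delta("ab" * 16))           # highly repetitive: delta = 2
        print(delta("abcdefghijklmnop"))  # all symbols distinct: delta = 16 (the alphabet size)

Note that appending further copies of "ab" to the first example never increases the value, reflecting the monotone, copy-paste-aware behaviour the abstract argues for.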
Files in this record:

File: 1910.02151.pdf (open access)
Description: main article
Type: Pre-print document
License: Public domain
Size: 340.38 kB
Format: Adobe PDF
Documents in ARCA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5011480
Citations
  • PubMed Central: not available
  • Scopus: 13
  • Web of Science: 6