\emph{suffixient set} $\mathcal S \subseteq [n]$ for $T$ is a set of positions such that, for every one-character right-extension $T[i,j]$ of every right-maximal substring $T[i,j-1]$ of $T$, there exists $x\in S$ such that $T[i,j]$ is a suffix of $T[1,x]$. It was recently shown that, given a suffixient set of cardinality $q$ and an oracle offering fast random access on $T$ (for example, a straight-line program), there is a data structure of $O(q)$ words (on top of the oracle) that can quickly find all Maximal Exact Matches (MEMs) of any query pattern $P$ in $T$ with high probability. The paper introducing suffixient sets left open the problem of computing the smallest such set; in this paper, we solve this problem by describing a simple quadratic-time algorithm, a $O(n + \bar r|\Sigma|)$-time algorithm running in compressed working space ($\bar r$ is the number of runs in the Burrows-Wheeler transform of $T$ reversed), and an optimal $O(n)$-time algorithm computing the smallest suffixient set. We present an implementation of our compressed-space algorithm and show experimentally that it uses a small memory footprint on repetitive text collections.

On Computing the Smallest Suffixient Set

Cenzato, Davide;Olivares, Francisco;Prezza, Nicola
2024-01-01

Abstract

\emph{suffixient set} $\mathcal S \subseteq [n]$ for $T$ is a set of positions such that, for every one-character right-extension $T[i,j]$ of every right-maximal substring $T[i,j-1]$ of $T$, there exists $x\in S$ such that $T[i,j]$ is a suffix of $T[1,x]$. It was recently shown that, given a suffixient set of cardinality $q$ and an oracle offering fast random access on $T$ (for example, a straight-line program), there is a data structure of $O(q)$ words (on top of the oracle) that can quickly find all Maximal Exact Matches (MEMs) of any query pattern $P$ in $T$ with high probability. The paper introducing suffixient sets left open the problem of computing the smallest such set; in this paper, we solve this problem by describing a simple quadratic-time algorithm, a $O(n + \bar r|\Sigma|)$-time algorithm running in compressed working space ($\bar r$ is the number of runs in the Burrows-Wheeler transform of $T$ reversed), and an optimal $O(n)$-time algorithm computing the smallest suffixient set. We present an implementation of our compressed-space algorithm and show experimentally that it uses a small memory footprint on repetitive text collections.
2024
Proceedings of the 31st International Symposium on String Processing and Information Retrieval (SPIRE 2024)
File in questo prodotto:
File Dimensione Formato  
suffixient.pdf

non disponibili

Descrizione: full text
Tipologia: Documento in Pre-print
Licenza: Copyright dell'editore
Dimensione 649.01 kB
Formato Adobe PDF
649.01 kB Adobe PDF   Visualizza/Apri

I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10278/5076323
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact