\emph{suffixient set} $\mathcal S \subseteq [n]$ for $T$ is a set of positions such that, for every one-character right-extension $T[i,j]$ of every right-maximal substring $T[i,j-1]$ of $T$, there exists $x\in S$ such that $T[i,j]$ is a suffix of $T[1,x]$. It was recently shown that, given a suffixient set of cardinality $q$ and an oracle offering fast random access on $T$ (for example, a straight-line program), there is a data structure of $O(q)$ words (on top of the oracle) that can quickly find all Maximal Exact Matches (MEMs) of any query pattern $P$ in $T$ with high probability. The paper introducing suffixient sets left open the problem of computing the smallest such set; in this paper, we solve this problem by describing a simple quadratic-time algorithm, a $O(n + \bar r|\Sigma|)$-time algorithm running in compressed working space ($\bar r$ is the number of runs in the Burrows-Wheeler transform of $T$ reversed), and an optimal $O(n)$-time algorithm computing the smallest suffixient set. We present an implementation of our compressed-space algorithm and show experimentally that it uses a small memory footprint on repetitive text collections.
On Computing the Smallest Suffixient Set
Cenzato, Davide;Olivares, Francisco;Prezza, Nicola
2024-01-01
Abstract
\emph{suffixient set} $\mathcal S \subseteq [n]$ for $T$ is a set of positions such that, for every one-character right-extension $T[i,j]$ of every right-maximal substring $T[i,j-1]$ of $T$, there exists $x\in S$ such that $T[i,j]$ is a suffix of $T[1,x]$. It was recently shown that, given a suffixient set of cardinality $q$ and an oracle offering fast random access on $T$ (for example, a straight-line program), there is a data structure of $O(q)$ words (on top of the oracle) that can quickly find all Maximal Exact Matches (MEMs) of any query pattern $P$ in $T$ with high probability. The paper introducing suffixient sets left open the problem of computing the smallest such set; in this paper, we solve this problem by describing a simple quadratic-time algorithm, a $O(n + \bar r|\Sigma|)$-time algorithm running in compressed working space ($\bar r$ is the number of runs in the Burrows-Wheeler transform of $T$ reversed), and an optimal $O(n)$-time algorithm computing the smallest suffixient set. We present an implementation of our compressed-space algorithm and show experimentally that it uses a small memory footprint on repetitive text collections.File | Dimensione | Formato | |
---|---|---|---|
suffixient.pdf
non disponibili
Descrizione: full text
Tipologia:
Documento in Pre-print
Licenza:
Copyright dell'editore
Dimensione
649.01 kB
Formato
Adobe PDF
|
649.01 kB | Adobe PDF | Visualizza/Apri |
I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.