We study the problem of indexing a text T[1..n] ∈ Σn so that, later, given a query regular expression pattern R of size m = |R|, we can report all the occ substrings T[i..j] of T matching R. The problem is known to be hard for arbitrary patterns R, so in this paper, we consider the following two types of patterns. (1) Character-class Kleene-star patterns of the form P1D∗P2, where P1 and P2 are strings and D = {c1, . . ., ck} ⊂ Σ is a character-class (shorthand for the regular expression (c1|c2|··· |ck)) and (2) String Kleene-star patterns of the form P1P∗P2 where P, P1 and P2 are strings. In case (1), we describe an index of O(nlog1+ϵ n) space (for any constant ϵ > 0) solving queries in time O(m + log n/log log n + occ) on constant-sized alphabets. We also describe a general solution for any alphabet size. This result is conditioned on the existence of an anchor: a character of P1P2 that does not belong to D. We justify this assumption by proving that no efficient indexing solution can exist if an anchor is not present unless the Set Disjointness Conjecture fails. In case (2), we describe an index of size O(n) answering queries in time O(m + (occ + 1) logϵ n) on any alphabet size.
Text Indexing for Simple Regular Expressions
Nicola Prezza;
2025-01-01
Abstract
We study the problem of indexing a text T[1..n] ∈ Σn so that, later, given a query regular expression pattern R of size m = |R|, we can report all the occ substrings T[i..j] of T matching R. The problem is known to be hard for arbitrary patterns R, so in this paper, we consider the following two types of patterns. (1) Character-class Kleene-star patterns of the form P1D∗P2, where P1 and P2 are strings and D = {c1, . . ., ck} ⊂ Σ is a character-class (shorthand for the regular expression (c1|c2|··· |ck)) and (2) String Kleene-star patterns of the form P1P∗P2 where P, P1 and P2 are strings. In case (1), we describe an index of O(nlog1+ϵ n) space (for any constant ϵ > 0) solving queries in time O(m + log n/log log n + occ) on constant-sized alphabets. We also describe a general solution for any alphabet size. This result is conditioned on the existence of an anchor: a character of P1P2 that does not belong to D. We justify this assumption by proving that no efficient indexing solution can exist if an anchor is not present unless the Set Disjointness Conjecture fails. In case (2), we describe an index of size O(n) answering queries in time O(m + (occ + 1) logϵ n) on any alphabet size.| File | Dimensione | Formato | |
|---|---|---|---|
|
LIPIcs.CPM.2025.20.pdf
accesso aperto
Tipologia:
Versione dell'editore
Licenza:
Creative commons
Dimensione
866.67 kB
Formato
Adobe PDF
|
866.67 kB | Adobe PDF | Visualizza/Apri |
I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.



