We study the problem of indexing a text T[1..n] ∈ Σn so that, later, given a query regular expression pattern R of size m = |R|, we can report all the occ substrings T[i..j] of T matching R. The problem is known to be hard for arbitrary patterns R, so in this paper, we consider the following two types of patterns. (1) Character-class Kleene-star patterns of the form P1D∗P2, where P1 and P2 are strings and D = {c1, . . ., ck} ⊂ Σ is a character-class (shorthand for the regular expression (c1|c2|··· |ck)) and (2) String Kleene-star patterns of the form P1P∗P2 where P, P1 and P2 are strings. In case (1), we describe an index of O(nlog1+ϵ n) space (for any constant ϵ > 0) solving queries in time O(m + log n/log log n + occ) on constant-sized alphabets. We also describe a general solution for any alphabet size. This result is conditioned on the existence of an anchor: a character of P1P2 that does not belong to D. We justify this assumption by proving that no efficient indexing solution can exist if an anchor is not present unless the Set Disjointness Conjecture fails. In case (2), we describe an index of size O(n) answering queries in time O(m + (occ + 1) logϵ n) on any alphabet size.

Text Indexing for Simple Regular Expressions

Nicola Prezza;
2025-01-01

Abstract

We study the problem of indexing a text T[1..n] ∈ Σn so that, later, given a query regular expression pattern R of size m = |R|, we can report all the occ substrings T[i..j] of T matching R. The problem is known to be hard for arbitrary patterns R, so in this paper, we consider the following two types of patterns. (1) Character-class Kleene-star patterns of the form P1D∗P2, where P1 and P2 are strings and D = {c1, . . ., ck} ⊂ Σ is a character-class (shorthand for the regular expression (c1|c2|··· |ck)) and (2) String Kleene-star patterns of the form P1P∗P2 where P, P1 and P2 are strings. In case (1), we describe an index of O(nlog1+ϵ n) space (for any constant ϵ > 0) solving queries in time O(m + log n/log log n + occ) on constant-sized alphabets. We also describe a general solution for any alphabet size. This result is conditioned on the existence of an anchor: a character of P1P2 that does not belong to D. We justify this assumption by proving that no efficient indexing solution can exist if an anchor is not present unless the Set Disjointness Conjecture fails. In case (2), we describe an index of size O(n) answering queries in time O(m + (occ + 1) logϵ n) on any alphabet size.
2025
Leibniz International Proceedings in Informatics, LIPIcs
File in questo prodotto:
File Dimensione Formato  
LIPIcs.CPM.2025.20.pdf

accesso aperto

Tipologia: Versione dell'editore
Licenza: Creative commons
Dimensione 866.67 kB
Formato Adobe PDF
866.67 kB Adobe PDF Visualizza/Apri

I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10278/5104708
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact