On Indexing and Compressing Finite Automata

An index for a finite automaton is a powerful data structure that supports locating paths labeled with a query pattern, thus solving pattern matching on the underlying regular language. The problem is hard in the general case: a recent conditional lower bound suggests that, in the worst case, deciding at query time whether a pattern of length m belongs to the substring closure of the language accepted by A requires Ω(m · |A|) time. On the other hand, Gagie et al. [TCS 2017] introduced a subclass of automata that allow an optimal Õ(m)time solution based on prefix-sorting the states in a total order. In this paper, we solve the long-standing problem of indexing arbitrary finite automata, matching the above bounds. Our solution consists in finding a partial co-lexicographic order of the states and proving, as in the total order case, that states reached by a given string form one interval on the partial order, thus enabling indexing. We provide a lower bound stating that such an interval requires O(p) words to be represented, p being the order's width (i.e. the size of its largest antichain). Indeed, we show that p determines the complexity of several fundamental problems on finite automata: (i) Letting σ be the alphabet size, we provide an encoding for NFAs using dlog σe + 2dlog pe + 2 bits per transition and a smaller encoding for DFAs using dlog σe + dlog pe + 2 bits per transition. This is achieved by generalizing the Burrows-Wheeler transform to arbitrary automata. (ii) We show that indexed pattern matching can be solved in Õ(m · p2) query time on NFAs. (iii) We provide a polynomial-time algorithm to index DFAs, while matching the optimal value for p. On the other hand, we prove that the problem is NP-hard on NFAs. (iv) We show that, in the worst case, the classic powerset construction algorithm for NFA determinization generates an equivalent DFA of size 2p(n−p+1)−1, where n is the number of NFA's states. Contribution (i) provides a new compression paradigm for labeled graphs. Contributions (ii)-(iii) solve the regular language indexing problem, notably with a polynomial-time solution for DFAs. Contribution (iv) implies a new FPT analysis for the complexity of classic algorithms on automata, including membership and equivalence (the latter being PSPACE-complete when input automata are NFAs).

On Indexing and Compressing Finite Automata

Cotumaccio, Nicola;Prezza, Nicola

2021-01-01

Abstract

An index for a finite automaton is a powerful data structure that supports locating paths labeled with a query pattern, thus solving pattern matching on the underlying regular language. The problem is hard in the general case: a recent conditional lower bound suggests that, in the worst case, deciding at query time whether a pattern of length m belongs to the substring closure of the language accepted by A requires Ω(m · |A|) time. On the other hand, Gagie et al. [TCS 2017] introduced a subclass of automata that allow an optimal Õ(m)time solution based on prefix-sorting the states in a total order. In this paper, we solve the long-standing problem of indexing arbitrary finite automata, matching the above bounds. Our solution consists in finding a partial co-lexicographic order of the states and proving, as in the total order case, that states reached by a given string form one interval on the partial order, thus enabling indexing. We provide a lower bound stating that such an interval requires O(p) words to be represented, p being the order's width (i.e. the size of its largest antichain). Indeed, we show that p determines the complexity of several fundamental problems on finite automata: (i) Letting σ be the alphabet size, we provide an encoding for NFAs using dlog σe + 2dlog pe + 2 bits per transition and a smaller encoding for DFAs using dlog σe + dlog pe + 2 bits per transition. This is achieved by generalizing the Burrows-Wheeler transform to arbitrary automata. (ii) We show that indexed pattern matching can be solved in Õ(m · p2) query time on NFAs. (iii) We provide a polynomial-time algorithm to index DFAs, while matching the optimal value for p. On the other hand, we prove that the problem is NP-hard on NFAs. (iv) We show that, in the worst case, the classic powerset construction algorithm for NFA determinization generates an equivalent DFA of size 2p(n−p+1)−1, where n is the number of NFA's states. Contribution (i) provides a new compression paradigm for labeled graphs. Contributions (ii)-(iii) solve the regular language indexing problem, notably with a polynomial-time solution for DFAs. Contribution (iv) implies a new FPT analysis for the complexity of classic algorithms on automata, including membership and equivalence (the latter being PSPACE-complete when input automata are NFAs).

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno pubblicazione
	
				2021
			
	Titolo del volume
	
				Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA)
			
	DOI
	
				https://dx.doi.org/10.1137/1.9781611976465.153
			
	Appare nelle tipologie:
	
				4.1 Articolo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
SODA_21___On_indexing_and_compressing_finite_automata.pdf non disponibili Tipologia: Documento in Post-print Licenza: Accesso chiuso-personale Dimensione 406.6 kB Formato Adobe PDF Visualizza/Apri	406.6 kB	Adobe PDF	Visualizza/Apri

I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10278/3735063

Citazioni

ND

39

ND

social impact