Wheeler DFAs (WDFAs) are a sub-class of finite-state automata which is playing an important role in the emerging field of compressed data structures: as opposed to general automata, WDFAs can be stored in just logσ+O(1) bits per edge, σ being the alphabet's size, and support optimal-time pattern matching queries on the substring closure of the language they recognize. An important step to achieve further compression is minimization. When the input A is a general deterministic finite-state automaton (DFA), the state-of-the-art is represented by the classic Hopcroft's algorithm, which runs in O(|A|log|A|) time. This algorithm stands at the core of the only existing minimization algorithm for Wheeler DFAs, which inherits its complexity. In this work, we show that the minimum WDFA equivalent to a given input WDFA can be computed in linear O(|A|) time. When run on de Bruijn WDFAs built from real DNA datasets, an implementation of our algorithm reduces the number of nodes from 14% to 51% at a speed of more than 1 million nodes per second.

Linear-time Minimization of Wheeler DFAs

Prezza N.
2022

Abstract

Wheeler DFAs (WDFAs) are a sub-class of finite-state automata which is playing an important role in the emerging field of compressed data structures: as opposed to general automata, WDFAs can be stored in just logσ+O(1) bits per edge, σ being the alphabet's size, and support optimal-time pattern matching queries on the substring closure of the language they recognize. An important step to achieve further compression is minimization. When the input A is a general deterministic finite-state automaton (DFA), the state-of-the-art is represented by the classic Hopcroft's algorithm, which runs in O(|A|log|A|) time. This algorithm stands at the core of the only existing minimization algorithm for Wheeler DFAs, which inherits its complexity. In this work, we show that the minimum WDFA equivalent to a given input WDFA can be computed in linear O(|A|) time. When run on de Bruijn WDFAs built from real DNA datasets, an implementation of our algorithm reduces the number of nodes from 14% to 51% at a speed of more than 1 million nodes per second.
2022 Data Compression Conference (DCC)
File in questo prodotto:
File Dimensione Formato  
Minimum_WDFA_in_linear_time.pdf

non disponibili

Tipologia: Documento in Pre-print
Licenza: Accesso chiuso-personale
Dimensione 252.82 kB
Formato Adobe PDF
252.82 kB Adobe PDF   Visualizza/Apri

I documenti in ARCA sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/10278/5004034
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact