Matchtigs: minimum plain text representation of k-mer sets
Pibiri, Giulio E;
2023-01-01
Abstract
We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26x over unitigs and 2.10x over previous work.
Files in this product:
File | Size | Format |
---|---|---|
GBIO2023.pdf (open access; type: publisher's version; license: free access, no restrictions) | 2.75 MB | Adobe PDF |
Documents in ARCA are protected by copyright and all rights are reserved, unless otherwise indicated.