Spectrum Preserving Tilings Enable Sparse and Modular Reference Indexing

Fan, J.; Khan, J.; Pibiri, G. E.; Patro, R.

doi:10.1007/978-3-031-29119-7_2

The reference indexing problem for k-mers is to pre-process a collection of reference genomic sequences so that the positions of all occurrences of any queried k-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics. In this work, we introduce the spectrum preserving tiling (SPT), a general representation of a reference collection that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in the collection. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for k-mers into: (1) a k-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recent work on constructing and compactly indexing k-mer sets can be used to implement the k-mer-to-tile mapping efficiently. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominate those of the k-mer-to-tile mapping, since the former depends on the total amount of sequence while the latter depends on the number of unique k-mers in the collection. To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-occurrence mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3 GB to 34.6 GB while incurring only a 3.6× slowdown when querying k-mers from a sequenced readset. Availability: pufferfish2 is implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2.
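The two-step decomposition described in the abstract can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the pufferfish2 implementation: the tiles, the reference names, the value of K, and every function name below are hypothetical; consecutive tiles are assumed to overlap by K - 1 characters; and reverse complements, compact encodings, and sampling are ignored entirely.

```python
from collections import defaultdict

K = 3  # toy k-mer length (assumption for illustration)

# Hypothetical tiles: every k-mer appears in exactly one tile.
TILES = ["ACGT", "GTAC", "ACCA"]

# Each reference is spelled by an ordered list of tile ids;
# consecutive tiles are assumed to overlap by K - 1 characters.
TILINGS = {"ref0": [0, 1], "ref1": [1, 2]}


def spell(tile_ids):
    """Glue tiles with (K - 1)-character overlaps to recover a reference."""
    seq = TILES[tile_ids[0]]
    for tid in tile_ids[1:]:
        assert seq[-(K - 1):] == TILES[tid][:K - 1], "tiles must overlap"
        seq += TILES[tid][K - 1:]
    return seq


def build_kmer_to_tile():
    """Mapping (1): k-mer -> (tile id, offset of the k-mer inside the tile)."""
    table = {}
    for tid, tile in enumerate(TILES):
        for off in range(len(tile) - K + 1):
            kmer = tile[off:off + K]
            assert kmer not in table, "each k-mer must occur in one tile only"
            table[kmer] = (tid, off)
    return table


def build_tile_to_occ():
    """Mapping (2): tile id -> list of (reference name, start position)."""
    occs = defaultdict(list)
    for ref, tile_ids in TILINGS.items():
        pos = 0
        for tid in tile_ids:
            occs[tid].append((ref, pos))
            pos += len(TILES[tid]) - (K - 1)  # next tile starts inside the overlap
    return occs


def locate(kmer, kmer_to_tile, tile_to_occ):
    """Compose the two mappings to list every (reference, position) of a k-mer."""
    hit = kmer_to_tile.get(kmer)
    if hit is None:
        return []
    tid, off = hit
    return [(ref, start + off) for ref, start in tile_to_occ[tid]]


if __name__ == "__main__":
    k2t, t2o = build_kmer_to_tile(), build_tile_to_occ()
    print(locate("TAC", k2t, t2o))
```

In this toy, mapping (1) has one entry per distinct k-mer, while mapping (2) has one entry per tile occurrence across all references, which is why the second mapping grows with the total amount of sequence indexed.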

2023-01-01


Publication year: 2023
Volume title: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
DOI: https://dx.doi.org/10.1007/978-3-031-29119-7_2
Publication type: 4.1 Conference proceedings paper

Files in this item:

File	Size	Format
RECOMB2023.pdf (open access; type: post-print document; license: free access, no restrictions)	716.88 kB	Adobe PDF

Documents in ARCA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5023385
