PerFSeeB: Designing Long High-weight Single Spaced Seeds for Full Sensitivity Alignment with a Given Number of Mismatches

doi:10.21203/rs.3.rs-1051543/v1

Research Article

This work is licensed under a CC BY 4.0 License

Journal Publication

published 24 Oct, 2023

Read the published version in BMC Bioinformatics →

Background: Technical progress in computational hardware allows researchers to use new approaches for sequence alignment problems. A standard procedure is usually based on pre-alignment of short subsequences, followed by a proper comparison of the neighbouring parts. For this purpose, index files are created that store all subsequences (or numbers associated with them) together with their positions within a reference sequence. Index files built from subsequences of 32–64 symbols for a human reference genome can now be stored easily, without any compression, even on a budget computer. The main goal now is to choose a combination of symbols (a spaced seed) that tolerates various mismatches between the reference and the given sequences. An ideal spaced seed should allow us to find all such positions (full sensitivity). Increasing the seed's weight by one usually reduces the number of candidate positions fourfold. At the same time, longer seeds also reduce the number of signatures to be checked.
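As an illustration of the indexing idea described above, the following minimal Python sketch applies a spaced seed to a sequence and builds a toy index. It is not taken from the PerFSeeB code. The function names (`signature`, `build_index`, `expected_random_hits`) and the notation of a seed as a string of `1` (symbol is used) and `0` (symbol is ignored) are assumptions made for this example. The fourfold estimate assumes independent, uniformly distributed symbols over a four-letter alphabet.

```python
def signature(sequence, seed, pos):
    """Return the symbols of sequence[pos:pos+len(seed)] selected by the '1's of the seed."""
    window = sequence[pos:pos + len(seed)]
    if pos < 0 or len(window) < len(seed):
        raise ValueError("seed does not fit at this position")
    return "".join(c for c, s in zip(window, seed) if s == "1")


def build_index(reference, seed):
    """Map each signature to the list of reference positions where it occurs."""
    index = {}
    for pos in range(len(reference) - len(seed) + 1):
        index.setdefault(signature(reference, seed, pos), []).append(pos)
    return index


def expected_random_hits(reference_length, weight):
    """Rough estimate of chance hits per signature (assumption: uniform random DNA)."""
    return reference_length / 4 ** weight


if __name__ == "__main__":
    seed = "1101"  # weight 3, length 4 (toy example, hypothetical)
    index = build_index("ACGTACGT", seed)
    print(index["ACT"])  # positions whose windows share the signature "ACT"
    print(expected_random_hits(1024, 4) / expected_random_hits(1024, 5))
```

Under these assumptions, each additional `1` in the seed divides the expected number of chance candidate positions by four, which matches the fourfold reduction stated in the abstract.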

Results: Several algorithms to assist seed generation are presented. The first algorithm finds all permitted spaced seeds iteratively. The results obtained with it reveal specific patterns in the seeds of the highest weight. Among the best seeds there are periodic seeds, for which a simple relation holds between the period of a seed, its length and the length of a read. The second algorithm generates blocks for periodic seeds. Lists of blocks are found for block lengths of up to 50 symbols and for up to 9 mismatches. The third algorithm uses those lists to find spaced seeds for reads of an arbitrary length.
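The full sensitivity property can be checked by brute force for small cases. The sketch below is not one of the three algorithms from the paper. It is an exhaustive test, written under the assumption that a seed is lossless for a given read length and number of mismatches when, for every placement of the mismatches, at least one shift of the seed inside the read avoids all of them. The helper `periodic_seed`, which repeats a block and trims trailing zeros, is a hypothetical illustration of a periodic seed and does not reproduce the paper's block lists.

```python
from itertools import combinations


def is_lossless(seed, read_length, mismatches):
    """Exhaustively test whether the seed hits every read with the given number of mismatches."""
    ones = [i for i, s in enumerate(seed) if s == "1"]
    shifts = range(read_length - len(seed) + 1)
    for bad in combinations(range(read_length), mismatches):
        bad_positions = set(bad)
        # The seed survives if some shift places none of its '1's on a mismatch.
        if not any(all(shift + i not in bad_positions for i in ones)
                   for shift in shifts):
            return False
    return True


def periodic_seed(block, copies):
    """Repeat a block and drop trailing zeros (hypothetical helper for illustration)."""
    return (block * copies).rstrip("0")


if __name__ == "__main__":
    seed = periodic_seed("110", 2)  # gives "11011", weight 4
    for read_length in (6, 7, 8):
        print(read_length, is_lossless(seed, read_length, 1))
    # A contiguous seed of weight 3 for comparison.
    print(is_lossless("111", 7, 1), is_lossless("111", 5, 1))
```

In this toy case the spaced seed `11011` of weight 4 is lossless for reads of 7 symbols with one mismatch, the same read length for which a contiguous seed reaches only weight 3. The number of combinations grows quickly with read length and mismatch count, so this check is practical only for small examples.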

Conclusions: Lists of long high-weight spaced seeds are found and are available in the Supplementary Materials. In terms of weight, the seeds are the best compared to seeds from other papers, and they can usually be applied to shorter reads. Code for all algorithms is available at https://github.com/vtman/PerFSeeB.

Bioinformatics

spaced seeds

lossless seed

full sensitivity

sequence alignment

mismatch

indexing