NanoSTR: A Method for Detection of Target Short Tandem Repeats Based on Nanopore Sequencing Data

doi:10.21203/rs.3.rs-1736830/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Background: Short tandem repeats (STRs) are widely present in the human genome. Studies have confirmed that STRs are associated with more than 30 diseases, and they have also been used in forensic identification and paternity testing. However, there are few methods for STR detection based on nanopore sequencing due to the challenges posed by the sequencing principles and the data characteristics of nanopore sequencing.

Results: We developed NanoSTR for detection of target STR loci based on the length-number-rank (LNR) information of reads. NanoSTR can be used for STR detection and genotyping based on long-read data from nanopore sequencing, with improved accuracy and efficiency compared with existing methods such as Tandem-Genotypes and TRiCoLOR. NanoSTR showed 100% concordance with the expected genotypes on error-free simulated data. On the standard samples 9948 and 2800M, NanoSTR showed 86.36% and 73.48% concordance, respectively, with the MinION sequencing platform, and 71.97% and 53.03% concordance, respectively, with the Qnome-3841 sequencing platform.

Conclusions: NanoSTR showed high performance for detection of target STR markers. Although NanoSTR needs further optimization and development, it is useful as an analytical method for the detection of STR loci by nanopore sequencing. This method adds to the toolbox for nanopore-based STR analysis and expands the applications of nanopore sequencing in scientific research and clinical scenarios.

Keywords: nanopore sequencing, long read sequencing, short tandem repeat, STR, NanoSTR

Short tandem repeats (STRs), also known as microsatellites, are repetitive DNA sequences consisting of 1–6-bp motifs present in a genome. The highly individual-specific number of repeats and the abundance of motifs contribute to the polymorphism of STR loci. On average, STR loci occur every 15 kb in the human genome. The number of repeat units differs between individuals, resulting in highly complex allele polymorphisms. Because of their high diversity, wide distribution, and high polymorphism, STRs are considered the second generation of genetic markers after restriction fragment length polymorphisms (RFLP). Therefore, STR detection has been widely used in forensic identification, paternity testing, species polymorphism identification, and genetic disease diagnosis (1)(2)(3)(4). Studies have shown that STRs represent a source of phenotypic variation in more than 30 Mendelian diseases, such as neurological disorders (5)(6).

Nanopore sequencing is an evolving third/fourth generation sequencing technology for direct detection of nucleotide sequences with kb or even Mb base pairs (7)(8). In practice, however, the high error rate and special data characteristics of long-read sequencing have limited the efficient identification of STR polymorphisms, and therefore, further evaluation of the analytical methods is required (9)(10). There are a few methods for STR identification based on nanopore sequencing; representative tools include Tandem-Genotypes (11), NanoSatellite (12), and STRique (13). These tools and their algorithms have limitations. For example, NanoSatellite directly analyzes STRs based on electric current distribution, and the accuracy of analysis depends heavily on the stability of the sequencing current and the precision of the basecalling model. Tandem-Genotypes requires data preprocessing steps such as LAST alignment and establishment of a genomic background database, and histograms are needed to assist STR genotyping. Therefore, the whole process is time-consuming. Other analytical methods such as NCRF (14) and TideHunter (15) are incapable of STR typing. Therefore, these analytical methods have limited applications and insufficient robustness.

We therefore developed NanoSTR as a method for detecting target STRs based on nanopore sequencing. The method uses statistical analysis methods such as multisampling and the length-number-rank (LNR) information of reads for the genotyping and correction of STR markers with improved accuracy (Fig. 1). In terms of data characteristics, NanoSTR effectively avoids the non-random sequencing errors and unexpected insertions-deletions (indels) associated with nanopore sequencing (8)(9) and thus improves the efficiency of sequencing data utilization, the detection rate of STR genotypes, and the accuracy of STR profiling.

Performance on simulated data

Analysis of the three error-free simulated datasets (included in Flanking-1Kb, Flanking-10Kb and Flanking-100Kb) showed 100% concordance with the expected genotypes (Additional file 1: Table S1). However, the three simulated datasets of Flanking-1k and the Simulated_data-1 of Flanking-10k with errors showed 75% concordance. A typing error (an allele with one less repeat unit) occurred at DYS392 in the four simulated datasets. The remaining five simulated datasets showed 50% concordance. Except for the Simulated_data-2 of Flanking-100k with typing errors at DYS392 and DYS635, the remaining datasets showed errors at the markers DYS392 and DYS448 (Fig. 2A). We averaged the number of mismatches, insertions, and deletions over reads (Fig. 2B) and found that the three simulated datasets showed similar results for Flanking-1k, Flanking-10k, and Flanking-100k. We also performed a statistical analysis on the simulated datasets regarding the distribution of lengths with each error type (Fig. 2C) and found that most erroneous sequences were 1–2 bp, with slightly higher length diversity of insertions and deletions. However, the same error type but different flanking lengths showed slight variations in length proportions. We therefore infer that the analytical performance of NanoSTR may be greatly affected by the location of the errors given that the relative proportion and distribution of the erroneous sequence lengths were consistent across the three simulated datasets.
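
The concordance figures above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming that concordance means the percentage of target loci whose called genotype exactly matches the expected genotype (the text does not spell out the formula). The function name and the repeat counts are hypothetical; only the pattern, one allele with one repeat unit fewer at DYS392 among four markers, follows the text.

```python
from typing import Dict, Tuple

Genotype = Tuple[int, ...]  # one allele (homozygous) or two alleles (heterozygous)


def concordance(called: Dict[str, Genotype], expected: Dict[str, Genotype]) -> float:
    """Percentage of expected loci whose called genotype matches exactly."""
    if not expected:
        raise ValueError("expected genotypes must not be empty")
    hits = sum(
        1
        for locus, genotype in expected.items()
        # Allele order is ignored; a missing call counts as a mismatch.
        if tuple(sorted(called.get(locus, ()))) == tuple(sorted(genotype))
    )
    return 100.0 * hits / len(expected)


# Hypothetical repeat counts for the four simulated markers.
expected = {"DYS392": (13,), "DYS438": (12,), "DYS448": (19,), "DYS635": (23,)}
called = {"DYS392": (12,), "DYS438": (12,), "DYS448": (19,), "DYS635": (23,)}

print(concordance(called, expected))  # 75.0
```

With four target markers, one typing error gives 75% and two typing errors give 50%, which matches the values reported for the datasets with errors.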

Effect of the number of errors on STR typing accuracy

We calculated the ratio of the number of errors/base × 100 of each error type with simulated datasets containing 10 markers (Additional file 1: Table S2). We found that the accuracy of STR typing decreased with increasing number of errors (Fig. 3). Intriguingly, for the Simulated_data-1 with homozygous STR loci, the accuracy remained at 100% regardless of the ratio. For Simulated_data-2 with heterozygous STR loci and an increase of one of the alleles, the accuracy decreased with increasing ratio, and the accuracy was the lowest compared with the other two simulated datasets. For Simulated_data-3 with heterozygous STR loci and one less allele, the accuracy decreased with increasing ratio. We therefore speculate that NanoSTR may perform less well in STR typing for heterozygous loci with increased number of repeats compared to heterozygous loci with reduced number of repeats and homozygous loci in the reference genome. Regarding the performance of NanoSTR, no more than 2.6 mismatches, 1.5 insertions, and 1.7 deletions per 100 bp on average may be necessary to achieve > 90% concordance, which can be comparable to the quality of next-generation sequencing (NGS).
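
The error ratio used here (number of errors per base × 100) and the reported ceilings can be expressed in a few lines. This is only an illustrative sketch: the function names and the example counts are hypothetical, and the limits are the per-100-bp values quoted in the paragraph above for > 90% concordance.

```python
from typing import Dict


def errors_per_100_bases(n_errors: int, n_bases: int) -> float:
    """Ratio used in the text: number of errors / number of bases x 100."""
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    return 100.0 * n_errors / n_bases


# Ceilings quoted in the text for > 90% concordance (errors per 100 bp).
LIMITS = {"mismatch": 2.6, "insertion": 1.5, "deletion": 1.7}


def within_limits(error_counts: Dict[str, int], n_bases: int) -> bool:
    """True if every error type stays at or below its quoted ceiling."""
    return all(
        errors_per_100_bases(error_counts.get(kind, 0), n_bases) <= limit
        for kind, limit in LIMITS.items()
    )


# Hypothetical error counts over 10,000 sequenced bases.
good = {"mismatch": 200, "insertion": 120, "deletion": 150}  # 2.0 / 1.2 / 1.5
bad = {"mismatch": 300, "insertion": 120, "deletion": 150}   # 3.0 mismatches

print(within_limits(good, 10_000))  # True
print(within_limits(bad, 10_000))   # False
```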

Performance on real data

A total of 44 STR loci (DYS385-a/b represents DYS385AB-a and DYS385AB-b) shared by the two standard samples (9948 and 2800M) and STRBase, sequenced on the MinION platform, were used for genotype analysis (Additional file 1: Table S3 and Table S4). We found similar distributions of average sequencing depth of STR markers in the six control sample datasets (Fig. 4A). However, the coverage of some loci was very low in 2800M, which may have affected the genotyping accuracy of some STR markers. We compared the results of STR typing on the standard sample datasets using NanoSTR, Tandem-Genotypes, and TRiCoLOR. NanoSTR showed better analytical performance (Fig. 4B) and ease of use. NanoSTR achieved the best performance on 9948 and 2800M, with 86.36% and 73.48% concordance, respectively. Tandem-Genotypes showed the worst performance; the concordance was only 15.91% and 9.09% for 9948 and 2800M, respectively. TRiCoLOR showed 25.00% and 15.91% concordance. Further analysis revealed that the discordant calls made by TRiCoLOR and Tandem-Genotypes were of entirely different kinds: TRiCoLOR reported incorrect STR genotypes, whereas Tandem-Genotypes failed to detect some STR loci and produced false negative results. This may be explained by the mechanisms of the algorithms. TRiCoLOR cannot effectively distinguish heterozygous STR loci using datasets without a marked source of haplotypes. Tandem-Genotypes relies heavily on the accuracy of the genomic background database and alignment algorithm, which may lead to false negative results due to mismatches. These findings may explain the limitations and insufficient robustness of TRiCoLOR and Tandem-Genotypes; further analysis will be performed in our future work to examine alternative explanations.

We also performed the above analysis for standards 9948 and 2800M with the Qnome-3841 sequencing platform (Additional file 1: Table S3 and Table S5). The results led to the same conclusions as for the MinION sequencing platform: the distributions of average sequencing depth of STR markers were similar across the standard samples (Fig. 4C), and NanoSTR showed the best performance (Fig. 4D). The concordance of NanoSTR on 9948 and 2800M was 71.97% and 53.03%, respectively. Tandem-Genotypes showed the worst performance; the concordance was only 12.88% and 9.85% for 9948 and 2800M, respectively. TRiCoLOR showed 25.00% and 15.91% concordance.

Nanopore sequencing, or long-read sequencing, has many advantages over short-read sequencing (16). Compared with Illumina’s commercial short-read sequencing platforms such as HiSeq, NextSeq, and MiSeq, which produce read lengths of up to 600 bp (17), long-read sequencing technologies can generate reads with > 10 kb or even > 1 Mb base pairs (8). However, short-read sequencing has evolved rapidly over the past decade and is highly cost-effective and efficient. It provides sequencing data with high accuracy and has a variety of well-established data analysis tools and workflows (18). These features are currently lacking in long-read sequencing platforms (19). Due to the highly repetitive and complex structure of STR loci, both NGS and nanopore-based platforms face some technical challenges in the sequencing, calling, and analysis of STR loci. For example, it is well-known that continuous single-base repeats cannot be accurately sequenced and high-GC and high-repeat regions cannot be efficiently amplified by PCR. Therefore, the accurate detection of STR loci is inherently challenging, and there are particularly urgent and high demands for methods and accuracy of bioinformatics analysis.

NanoSTR is a software tool for target STR profiling based on long reads from nanopore sequencing. Compared with other analysis methods, NanoSTR can be used to accurately genotype STR loci based on multisampling and LNR of reads. NanoSTR largely circumvents the errors or failure of genotyping associated with nanopore sequencing data characteristics. Moreover, there is no need to establish a genomic background database or align the sequencing data against the human reference genome, thus reducing the consumption of computational resources. There is no requirement for secondary processing steps such as plotting to assist the interpretation of STR genotypes, which saves a considerable amount of time in the analysis. NanoSTR was also robust: it could be used on data from different sequencing platforms and outperformed Tandem-Genotypes and TRiCoLOR on both platforms tested. However, performance differed between sequencing platforms or experimental steps (Additional file 2: Figure S2), which suggests that users need to consider the data characteristics of each source and evaluate whether the parameters of NanoSTR are applicable.

NanoSTR has some limitations and shortcomings. First, this method relies on LNR of reads to detect and genotype STR loci and therefore can be significantly affected by the distribution, size, number, and sequencing depth of random and/or non-random indels. Second, several threshold values are used in this method, such as the rank difference, the ratio of supported read number, and the number of mismatches in BLAST alignment, which may have sizeable impacts on typing performance. For example, the 164-bp DYS389III in the reference genome showed 12 mismatches, and therefore, similar reads were filtered out despite the fulfillment of other criteria. This reduced the number of valid sequences and increased the errors in genotyping (Additional file 2: see the “Example-2” section, Figure S1). In contrast, retention of sequencing reads with excess mismatches can lead to false positive results. Therefore, it is necessary for users to balance these opposing effects according to the data characteristics and actual situations. Third, the method can be limited by the alignment software. BLAST alignment shows the number of gaps, but the length of each gap is unknown, which impedes systematic evaluation of the specific effects of these indels on the typing results. In addition, for STR sites with complex structures, such as [A]n[B]nNn[C]n[D]n, BLAST alignment is also challenging, which may easily lead to STR typing errors. Fourth, NanoSTR is not suitable for detection of genome-wide STR loci because it was designed for target STR loci. Fifth, as with other analytical methods and software, NanoSTR is highly dependent on the quality of sequencing data. Theoretically, the higher the accuracy of sequencing, the better the performance of NanoSTR would be. Therefore, the performance of NanoSTR on larger sample sets requires additional investigation, and more real-world data are needed for further verification.

In summary, NanoSTR still needs further development and optimization in terms of typing accuracy, computational resource consumption, running time, and statistical algorithms. Our results confirm that a single analytical method cannot detect all STR markers. Methods can be used in combination, or some STR loci can be detected by different methods. We will improve the accuracy of STR typing by incorporating deep learning algorithms and electric current distribution in NanoSTR algorithms. We hope that these efforts will increase the performance of NanoSTR and provide a reference bioinformatics analysis method for the application of nanopore sequencing-based STR detection in scientific research and clinical scenarios. As a result, nanopore sequencing technology will be able to truly aid the development of the sequencing industry and the commercialization of precision medicine.

NanoSTR is a method for STR typing based on nanopore sequencing data and the length-number-rank information of reads. NanoSTR not only improves the effective use of sequencing data but also shows higher accuracy than the existing genotyping methods tested. NanoSTR provides an alternative analytical method for the detection of STR loci by nanopore sequencing and adds to the related data analysis tools. We hope that NanoSTR can further expand the application of nanopore sequencing techniques in scientific research and clinical scenarios so that these techniques can better promote the development of the sequencing industry and serve the needs of precision medicine.

Analysis principles

Analysis with NanoSTR comprises the following four steps (Fig. 1). The first step is definition of the extension step size d. The start and end positions of the target STR locus on the reference genome are marked as P_start and P_end. Extension is repeated N times to the upstream of P_start and to the downstream of P_end. The P_start’ and P_end’ of each extension are expressed as follows:

P_start_i’ = P_start − d*i

P_end_i’ = P_end + d*i

where 1 <= i <= N
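
The extension rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the NanoSTR implementation: the function name and the locus coordinates are hypothetical, and d = 10 is used because the analysis reported later in this paper was run with step_size = 10.

```python
from typing import List, Tuple


def extended_windows(p_start: int, p_end: int, d: int, n: int) -> List[Tuple[int, int, int]]:
    """Return (i, P_start_i', P_end_i') for i = 1..N."""
    if d <= 0 or n <= 0:
        raise ValueError("d and n must be positive")
    windows = []
    for i in range(1, n + 1):
        start_i = p_start - d * i  # extend upstream of P_start
        end_i = p_end + d * i      # extend downstream of P_end
        if start_i < 0:
            raise ValueError("extension runs past the start of the reference")
        windows.append((i, start_i, end_i))
    return windows


# Hypothetical locus coordinates, chosen only for illustration.
windows = extended_windows(p_start=1000, p_end=1050, d=10, n=3)
print(windows)  # [(1, 990, 1060), (2, 980, 1070), (3, 970, 1080)]
```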

The sequences with P_start_i’ as the start position, P_end_i’ as the end position, and d as the extension step size are extracted from the reference genome and are referred to as paired-seed sequences. The N paired-seed sequences obtained after N extensions are used to extract the completely matching target sequences from the nanopore sequencing data in FASTQ format, yielding N datasets of target sequences. Then, the lengths of the target sequences in each dataset are determined to generate N datasets containing the sequence lengths. Finally, the lengths of the target sequences in each dataset are sorted in descending order of supported read number, and the sorted lengths are numbered in ascending order; this number is defined as the “rank.” Consequently, dataset1, with N subsets containing the length-number-rank (LNR) information of the sequences, is generated.

In the second step, the target STR loci are extended over a certain distance (e.g., 500 bp by default) upstream of the start position and downstream of the end position on the reference genome, and the extended sequences are used as the reference sequences. Then, the N datasets of target sequences obtained in the first step are aligned against the reference sequences using BLAST. The results in m8 format are filtered with a threshold mismatch number of < 3. The distances between the start and end positions of the subject sequences are used as the lengths of the matching sequences to obtain N datasets of sequence lengths. Finally, the lengths in each dataset are sorted in descending order of supported read number, and the sorted lengths are numbered in ascending order, resulting in dataset2 with N subsets containing the LNR information.

In the third step, the N length distributions in dataset1 are intersected with dataset2, and the lengths with minimum rank differences < 3 are retained and labeled as LNR-joint_i. Then, each LNR-joint_i is subjected to another filtration according to the supported read number. To determine the genotype of each LNR-joint_i, the length with the maximum supported read number is retained if the ratio of the maximum supported read number to the second maximum supported read number is > 3; otherwise, the lengths with the maximum and second-maximum supported read numbers are retained. Finally, N genotypes are obtained.

In the fourth step, the N genotypes are combined for statistical analysis, and the results with the mode and supported read number are selected as the final genotype for this target STR locus; that is, if the mode ratio is >= 3, the locus is considered to be homozygous; otherwise, it is considered to be heterozygous. Since interference such as background noise may affect the results, a secondary correction is performed according to the difference in the order of magnitude of the number of reads (Additional file 2: see the “Example-1” section).
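
The four steps can be illustrated with a short Python sketch. It is a reading of the description above under stated assumptions, not the NanoSTR source code: all function names are hypothetical, ties in read support are broken by length, the subject span is taken as an inclusive length (|end − start| + 1), the joint table keeps the read support from dataset1, and the fourth step is reduced to taking the most frequent genotype across the N extensions (the mode-ratio rule and the secondary correction are not modeled). The m8 column positions follow the standard BLAST tabular layout (column 5 = mismatches, columns 9 and 10 = subject start and end).

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

LNR = Dict[int, Tuple[int, int]]  # length -> (supported read number, rank)


def lnr_table(lengths: Iterable[int]) -> LNR:
    """Sort lengths by read support (descending) and rank them from 1."""
    ordered = sorted(Counter(lengths).items(), key=lambda kv: (-kv[1], kv[0]))
    return {length: (n, rank) for rank, (length, n) in enumerate(ordered, start=1)}


def m8_lengths(m8_lines: Iterable[str], max_mismatch: int = 3) -> List[int]:
    """Subject-span lengths of BLAST m8 hits with fewer than max_mismatch mismatches."""
    lengths = []
    for line in m8_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        mismatches = int(fields[4])
        s_start, s_end = int(fields[8]), int(fields[9])
        if mismatches < max_mismatch:
            lengths.append(abs(s_end - s_start) + 1)  # assumption: inclusive span
    return lengths


def join_lnr(t1: LNR, t2: LNR, max_rank_diff: int = 3) -> Dict[int, int]:
    """Keep lengths found in both tables whose rank difference is < max_rank_diff."""
    joint = {}
    for length in t1.keys() & t2.keys():
        if abs(t1[length][1] - t2[length][1]) < max_rank_diff:
            joint[length] = t1[length][0]  # assumption: support taken from dataset1
    return joint


def call_genotype(joint: Dict[int, int], ratio: float = 3.0) -> Tuple[int, ...]:
    """One length if top/second support ratio > ratio, otherwise the top two lengths."""
    ordered = sorted(joint.items(), key=lambda kv: (-kv[1], kv[0]))
    if not ordered:
        return ()
    if len(ordered) == 1 or ordered[0][1] / ordered[1][1] > ratio:
        return (ordered[0][0],)
    return tuple(sorted((ordered[0][0], ordered[1][0])))


def final_genotype(genotypes: List[Tuple[int, ...]]) -> Tuple[int, ...]:
    """Most frequent genotype across the N extensions."""
    return Counter(genotypes).most_common(1)[0][0]


# Toy data for one extension i (all values hypothetical).
m8 = [
    "read1\tlocus\t99.17\t120\t1\t0\t1\t120\t501\t620\t2e-55\t215",
    "read2\tlocus\t95.00\t120\t6\t0\t1\t120\t501\t620\t1e-40\t170",  # 6 mismatches
]
print(m8_lengths(m8))  # [120]

t1 = lnr_table([120] * 30 + [124] * 25 + [121] * 4 + [119] * 2)
t2 = lnr_table([120] * 28 + [124] * 20 + [118] * 5 + [121] * 3)
joint = join_lnr(t1, t2)
print(call_genotype(joint))  # (120, 124)
print(final_genotype([(120, 124), (120, 124), (120,)]))  # (120, 124)
```

In the toy example the two best-supported lengths have a support ratio of 30/25 = 1.2, which is not > 3, so both lengths are retained for that extension.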

Simulated data

We downloaded 75 forensic markers from STRBase (Additional file 1: Table S6) (20), and four markers (DYS392, DYS438, DYS448, and DYS635) were used as the simulated target loci. Reference sequences were extracted from the human reference genome hg38 by extension over distances of 1 kb, 10 kb, and 100 kb upstream and downstream of each STR locus. NanoSim-H (version: 1.1.0.4) (21) was used to simulate 100,000 nanopore sequencing reads with and without errors based on the extracted sequences (Additional file 1: Table S1, named Simulated_data-1). Similarly, we simulated heterozygous STR loci with four insertions (Additional file 1: Table S1, named Simulated_data-2) and four deletions (Additional file 1: Table S1, named Simulated_data-3) based on the repeat unit of each STR marker.

Ten STR loci (D12S391, D18S51, D22S1045, DYS635, DYS437, DYS438, DYS390, DYS392, DYS448, and DYS458) were randomly selected to assess the effect of the number of errors on genotyping performance. Reference sequence extraction was performed on the human reference genome hg38 with an extension distance of 100 kb upstream and downstream of these STR loci. NanoSim-H (version: 1.1.0.4) was used to simulate 100,000 nanopore sequencing reads with random proportions of mismatches, insertions, and deletions based on the extracted sequences (Additional file 1: Table S2, named Simulated_data-1). Similarly, we also simulated sequences with four insertions or four deletions based on the repeat unit of each STR marker (Additional file 1: Table S2, named Simulated_data-2 and Simulated_data-3).
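
How an allele with four inserted or four deleted repeat units might be constructed before read simulation can be sketched as follows. The exact editing procedure is not specified in the text, so this is an assumption-laden illustration: the function name, the toy sequence, the motif, and the choice to edit at the start of the repeat tract are all hypothetical.

```python
def shift_repeats(sequence: str, str_start: int, motif: str, delta: int) -> str:
    """Insert (delta > 0) or delete (delta < 0) whole repeat units at str_start (0-based)."""
    if not motif:
        raise ValueError("motif must not be empty")
    if delta >= 0:
        return sequence[:str_start] + motif * delta + sequence[str_start:]
    n_remove = -delta * len(motif)
    # Only delete if the bases to be removed really are whole repeat units.
    if sequence[str_start:str_start + n_remove] != motif * (-delta):
        raise ValueError("not enough repeat units to delete at this position")
    return sequence[:str_start] + sequence[str_start + n_remove:]


# Toy locus: 5-bp flank + 6 x "TAT" + 5-bp flank (hypothetical sequence).
ref = "ACGTA" + "TAT" * 6 + "GGCTA"
longer = shift_repeats(ref, 5, "TAT", 4)    # allele with four extra repeat units
shorter = shift_repeats(ref, 5, "TAT", -4)  # allele with four fewer repeat units

print(longer.count("TAT"), shorter.count("TAT"))  # 10 2
```

A heterozygous locus could then be represented by passing both the unmodified and the edited sequence to the read simulator as templates.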

Experiment with real data

Two genomic DNA standard products, named 2800M (Promega Biotech Co., Ltd, Beijing, China) and 9948 (AGCU ScienTech Incorporation, Wuxi, Jiangsu, China), were used in this study. They contained 51 and 72 Y-STR and/or autosomal STR loci, respectively. Next, we performed two rounds of PCR amplification using the MultipSeq® Custom Panel (IGMU339V1hg38) kit (iGeneTech Biotech (Beijing) Co., Ltd, Beijing, China) according to the manufacturer’s user guide. Notably, we designed two pairs of primers to replace the amplification primers during the second-round PCR amplification: P5-BC02: 5’-(phos)AATGATACGGCGACCACCGAGATCTACACTCGATTCCGTTTGTAGTCGTCTGTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’, P7-BC12: 5’-(phos)CAAGCAGAAGACGGCATACGAGATCAGGTAGAAAGAAGCAGAATCGGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA-3’, P5-BC03: 5’-(phos)AATGATACGGCGACCACCGAGATCTACACGAGTCTTGTGTCCCAGTTACCAGGACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’, and P7-BC13: 5’-(phos)CAAGCAGAAGACGGCATACGAGATAGAACGACTTCCATACTCGTGTGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA-3’. That is, after obtaining the first-round PCR products of 2800M and 9948, we used these four specific barcode primers to carry out the second-round PCR amplification. Then, we performed end repair and ligated nanopore sequencing adapters to build the sequencing libraries. We also performed three experimental replicates for each standard sample. Finally, all sequencing libraries were nanopore-sequenced on the Oxford Nanopore Technologies MinION (R9.4) and the Qnome-3841 instrument (Qitan Technology (Beijing) Co., Ltd, Beijing, China) according to the manufacturers’ instructions.

Real data analysis

We used NanoSTR (step_size = 10) to analyze the simulated data. We also used NanoSTR (step_size = 10) as well as Tandem-Genotypes and TRiCoLOR v1.1 with default parameters (22) to genotype 44 target STR loci in the standard samples. Minimap2 (version: 2.21-r1071) (23), Last (version: 2.34) (24), and BLAST (version: 2.2.23) (25) (26) were installed for alignment, and Sambamba (version: 0.8.0) (27) was installed for alignment processing. Porechop (version: 0.2.4) (https://github.com/rrwick/Porechop) was used for data preprocessing, and NanoPlot (version: 1.38.0) (28) was used for quality control.

FUNDING

Not applicable.

AVAILABILITY OF DATA AND MATERIALS

The download link of the STRBase database is https://strbase-b.nist.gov/FactSheets/FactSheets_2. FASTQ data files for this study can be found in the NCBI Sequence Read Archive (SRA) database (BioProject ID: PRJNA846950). The codes are available at https://github.com/langjidong/NanoSTR.

AUTHORS’ CONTRIBUTIONS

Jidong Lang designed the project, analyzed the data, and wrote the manuscript. Zhihua Xu and Yue Wang collected the data and performed the experiments and sequencing. Jiguo Sun and Zhi Yang reviewed the manuscript.

CONSENT FOR PUBLICATION

Not applicable.

COMPETING INTERESTS

The authors declare that they have no competing interests.

La Spada AR, Roling DB, Harding AE, Warner CL, Spiegel R, Hausmanowa-Petrusewicz I, Yee WC, Fischbeck KH. Meiotic stability and genotype-phenotype correlation of the trinucleotide repeat in X-linked spinal and bulbar muscular atrophy. Nat Genet 1992, 2:301-304.
A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group. Cell 1993, 72:971-983.
Kayser M. Forensic use of Y-chromosome DNA: a general overview. Hum Genet 2017, 136:621-635.
Alonso A, Barrio PA, Muller P, Kocher S, Berger B, Martin P, Bodner M, Willuweit S, Parson W, Roewer L, Budowle B. Current state-of-art of STR sequencing in forensic genetics. Electrophoresis 2018, 39:2655-2668.
Paulson H. Repeat expansion diseases. Handb Clin Neurol 2018, 147:105-123.
Tang H, Kirkness EF, Lippert C, Biggs WH, Fabani M, Guzman E, Ramakrishnan S, Lavrenko V, Kakaradov B, Hou C, et al. Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes. Am J Hum Genet 2017, 101:700-715.
Magi A, Semeraro R, Mingrino A, Giusti B, D'Aurizio R. Nanopore sequencing data analysis: state of the art, applications and challenges. Brief Bioinform 2018, 19:1256-1272.
Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 2021, 39:1348-1365.
Magi A, Giusti B, Tattini L. Characterization of MinION nanopore data for resequencing analyses. Brief Bioinform 2017, 18:940-953.
Rang FJ, Kloosterman WP, de Ridder J. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol 2018, 19:90.
Mitsuhashi S, Frith MC, Mizuguchi T, Miyatake S, Toyota T, Adachi H, Oma Y, Kino Y, Mitsuhashi H, Matsumoto N. Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biol 2019, 20:58.
De Roeck A, De Coster W, Bossaerts L, Cacace R, De Pooter T, Van Dongen J, D'Hert S, De Rijk P, Strazisar M, Van Broeckhoven C, Sleegers K. NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION. Genome Biol 2019, 20:239.
Giesselmann P, Brandl B, Raimondeau E, Bowen R, Rohrandt C, Tandon R, Kretzmer H, Assum G, Galonska C, Siebert R, et al. Analysis of short tandem repeat expansions and their methylation state with nanopore sequencing. Nat Biotechnol 2019, 37:1478-1481.
Harris RS, Cechova M, Makova KD. Noise-cancelling repeat finder: uncovering tandem repeats in error-prone long-read sequencing data. Bioinformatics 2019, 35:4809-4811.
Gao Y, Liu B, Wang Y, Xing Y. TideHunter: efficient and sensitive tandem repeat detection from noisy long-reads using seed-and-chain. Bioinformatics 2019, 35:i200-i207.
Pollard MO, Gurdasani D, Mentzer AJ, Porter T, Sandhu MS. Long reads: their purpose and place. Hum Mol Genet 2018, 27:R234-R241.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456:53-59.
Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 2016, 17:333-351.
Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol 2020, 21:30.
Gettings KB, Aponte RA, Vallone PM, Butler JM. STR allele sequence variation: Current knowledge and future issues. Forensic Sci Int Genet 2015, 18:118-130.
Yang C, Chu J, Warren RL, Birol I. NanoSim: nanopore sequence read simulator based on statistical characterization. Gigascience 2017, 6:1-6.
Bolognini D, Magi A, Benes V, Korbel JO, Rausch T. TRiCoLOR: tandem repeat profiling using whole-genome long-read sequencing data. Gigascience 2020, 9.
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34:3094-3100.
Kielbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res 2011, 21:487-493.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990, 215:403-410.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics 2009, 10:421.
Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P. Sambamba: fast processing of NGS alignment formats. Bioinformatics 2015, 31:2032-2034.
De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 2018, 34:2666-2669.

SupplementarySheet.xlsx
Additional file 1: Supplementary Sheet Table S1-S6.
SupplementaryInformation.pdf
Additional file 2: Supplementary Information Example-1, Example-2, Figure S1 and Figure S2.
