Analysis principles
Analysis with NanoSTR comprises four steps (Fig. 1). The first step begins with defining the extension step size d. The start and end positions of the target STR locus on the reference genome are denoted P_start and P_end. The locus is extended N times, upstream of P_start and downstream of P_end. The positions P_start’ and P_end’ after the i-th extension are expressed as follows:
P_starti’ = P_start − d*i
P_endi’ = P_end + d*i
where 1 <= i <= N
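As an illustration of the two formulas above, the following minimal Python sketch enumerates the extended positions for i = 1..N. The function name and the example coordinates are hypothetical and are not taken from NanoSTR itself.

```python
def extension_positions(p_start, p_end, d, n):
    """Return [(i, P_start_i', P_end_i')] for i = 1..n.

    p_start, p_end: start/end of the STR locus on the reference.
    d: extension step size; n: number of extensions.
    """
    if d <= 0 or n <= 0:
        raise ValueError("d and n must be positive")
    return [(i, p_start - d * i, p_end + d * i) for i in range(1, n + 1)]


# Hypothetical locus coordinates, for illustration only.
for i, start_i, end_i in extension_positions(1000, 1040, d=10, n=3):
    print(i, start_i, end_i)
```

Each extension moves both boundaries outward by one further step of d, so the i-th window is 2*d*i bases wider than the original locus.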
For each i, the sequences defined by P_starti’ as the start position, P_endi’ as the end position, and d as the extension step size are extracted from the reference genome; these are referred to as paired-seed sequences. The N paired-seed sequences obtained after N extensions are used to extract the complete matching target sequences from the nanopore sequencing data in *.fastq format, yielding N datasets of target sequences. The lengths of the target sequences in each dataset are then determined, giving N datasets of sequence lengths. Finally, the lengths in each dataset are sorted in descending order of supported read number and numbered in ascending order; this number is defined as the “rank.” The result is dataset1, which comprises N subsets containing the length-number-rank (LNR) information of the sequences.
In the second step, the target STR loci are extended over a fixed distance (500 bp by default) upstream of the start position and downstream of the end position on the reference genome, and the extended regions are used as the reference sequences. The N datasets of target sequences obtained in the first step are then aligned against the reference sequences using BLAST. The results, in m8 format, are filtered to retain alignments with fewer than 3 mismatches. The distances between the start and end positions of the subject sequences are taken as the lengths of the matching sequences, giving N datasets of sequence lengths. Finally, the lengths in each dataset are sorted in descending order of supported read number and numbered in ascending order, resulting in dataset2 with N subsets containing the LNR information.
In the third step, the N length distributions in dataset1 are intersected with those in dataset2, and the lengths with minimum rank differences < 3 are retained and labeled as LNR-jointi. Each LNR-jointi is then filtered again according to the supported read number.
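The LNR construction, the m8 filtering, and the rank-based intersection described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the NanoSTR implementation: the function names are hypothetical, ties in read support are broken by length, the subject length is taken as the plain coordinate distance, and both read counts are carried forward for each retained length. The column positions follow the standard 12-column BLAST tabular (m8) layout.

```python
from collections import Counter


def lnr_table(lengths):
    """Map each length to (supported read number, rank); rank 1 = most reads."""
    counts = Counter(lengths)
    # Descending read support; ties broken by length (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {length: (n, rank)
            for rank, (length, n) in enumerate(ordered, start=1)}


def m8_lengths(m8_lines, max_mismatch=3):
    """Subject-match lengths from m8 lines with mismatches < max_mismatch."""
    lengths = []
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        mismatches = int(fields[4])
        s_start, s_end = int(fields[8]), int(fields[9])
        if mismatches < max_mismatch:
            # Assumption: plain distance between subject coordinates.
            lengths.append(abs(s_end - s_start))
    return lengths


def lnr_joint(lnr1, lnr2, max_rank_diff=3):
    """Keep lengths present in both tables with rank difference < max_rank_diff."""
    joint = {}
    for length in lnr1.keys() & lnr2.keys():
        n1, r1 = lnr1[length]
        n2, r2 = lnr2[length]
        if abs(r1 - r2) < max_rank_diff:
            joint[length] = (n1, n2)  # read support in dataset1 and dataset2
    return joint


# Toy data for a single extension i (hypothetical values).
seed_lengths = [40] * 6 + [44] * 3 + [36]
m8 = [
    "r1\tref\t100.0\t40\t0\t0\t1\t40\t501\t541\t1e-15\t75.0",
    "r2\tref\t97.5\t40\t1\t0\t1\t40\t501\t541\t1e-13\t70.0",
    "r3\tref\t95.5\t44\t2\t0\t1\t44\t501\t545\t1e-12\t68.0",
    "r4\tref\t90.0\t40\t4\t0\t1\t40\t501\t541\t1e-08\t50.0",
]
print(lnr_joint(lnr_table(seed_lengths), lnr_table(m8_lengths(m8))))
```

In this toy example, read r4 is discarded because it has 4 mismatches, and length 36 is dropped because it appears only in the seed-based dataset.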
To determine the genotype of each LNR-jointi, only the length with the maximum supported read number is retained if the ratio of the maximum to the second-maximum supported read number is > 3; otherwise, the lengths with the maximum and second-maximum supported read numbers are both retained. In this way, N genotypes are obtained.
In the fourth step, the N genotypes are combined for statistical analysis, and the result given by the mode and its supported read number is selected as the final genotype for the target STR locus: if the mode ratio is >= 3, the locus is considered homozygous; otherwise, it is considered heterozygous. Because interference such as background noise may affect the results, a secondary correction is performed according to the difference in the order of magnitude of the number of reads (Additional file 2: see the “Example-1” section).
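The per-extension genotype rule and the combination across the N extensions can be illustrated with the short Python sketch below. It is one possible reading of the text rather than the published implementation: the function names are hypothetical, the final call is taken simply as the most frequent genotype among the N calls, and the mode-ratio test and the secondary correction are not reproduced because the text does not fully specify them.

```python
from collections import Counter


def call_genotype(read_support, ratio_threshold=3):
    """Genotype for one LNR-joint subset.

    read_support: {length: supported read number}, counts assumed >= 1.
    Returns a tuple with one length or with two lengths.
    """
    ranked = sorted(read_support.items(), key=lambda kv: (-kv[1], kv[0]))
    if not ranked:
        return ()
    if len(ranked) == 1 or ranked[0][1] / ranked[1][1] > ratio_threshold:
        return (ranked[0][0],)  # the top length dominates: keep it alone
    # Otherwise keep the two best-supported lengths.
    return tuple(sorted(length for length, _ in ranked[:2]))


def combine_genotypes(genotypes):
    """Most frequent genotype across extensions (assumption: simple mode)."""
    calls = [g for g in genotypes if g]
    if not calls:
        return ()
    return Counter(calls).most_common(1)[0][0]


# Hypothetical read support for N = 3 extensions.
subsets = [{40: 60, 44: 10}, {40: 55, 44: 30}, {40: 58, 44: 12}]
per_extension = [call_genotype(s) for s in subsets]
print(per_extension)
print(combine_genotypes(per_extension))
```

Here two of the three extensions give a single length of 40, so the combined call is the single length 40.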
Simulated data
We downloaded 75 forensic markers from STRBase (Additional file 1: Table S6) (20), and four markers (DYS392, DYS438, DYS448, and DYS635) were used as the simulated target loci. Reference sequences were extracted from the human reference genome hg38 by extension over distances of 1 kb, 10 kb, and 100 kb upstream and downstream of each STR locus. NanoSim-H (version: 1.1.0.4) (21) was used to simulate 100,000 nanopore sequencing reads with and without errors based on the extracted sequences (Additional file 1: Table S1, named Simulated_data-1). Similarly, we simulated heterozygous STR loci with four insertions (Additional file 1: Table S1, named Simulated_data-2) and four deletions (Additional file 1: Table S1, named Simulated_data-3) based on the repeat unit of each STR marker.
Ten STR loci (D12S391, D18S51, D22S1045, DYS635, DYS437, DYS438, DYS390, DYS392, DYS448, and DYS458) were randomly selected to assess the effect of the number of errors on genotyping performance. Reference sequence extraction was performed on the human reference genome hg38 with an extension distance of 100 kb upstream and downstream of these STR loci. NanoSim-H (version: 1.1.0.4) was used to simulate 100,000 nanopore sequencing reads with random proportions of mismatches, insertions, and deletions based on the extracted sequences (Additional file 1: Table S2, named Simulated_data-1). Similarly, we also simulated sequences with four insertions or four deletions based on the repeat unit of each STR marker (Additional file 1: Table S2, named Simulated_data-2 and Simulated_data-3).
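One way to derive such altered alleles from an extracted reference sequence before read simulation is sketched below in Python. This is an illustration only: it assumes that "four insertions" and "four deletions" mean adding or removing four copies of the repeat unit within the repeat tract, and the function name and toy sequence are hypothetical.

```python
def shift_repeat_count(sequence, tract_start, repeat_unit, delta):
    """Insert (delta > 0) or delete (delta < 0) repeat units in an STR tract.

    tract_start: 0-based index where the repeat tract begins.
    Assumption: the tract is a perfect run of repeat_unit copies.
    """
    unit_len = len(repeat_unit)
    copies = 0
    pos = tract_start
    # Count consecutive copies of the repeat unit from tract_start.
    while sequence.startswith(repeat_unit, pos):
        copies += 1
        pos += unit_len
    if copies == 0:
        raise ValueError("no repeat unit found at tract_start")
    new_copies = copies + delta
    if new_copies < 0:
        raise ValueError("cannot delete more units than are present")
    tract_end = tract_start + copies * unit_len
    return sequence[:tract_start] + repeat_unit * new_copies + sequence[tract_end:]


# Toy reference: 4-bp flank, 10 repeat units, 4-bp flank (hypothetical).
ref = "ACGT" + "TAT" * 10 + "GGCA"
allele_ins = shift_repeat_count(ref, 4, "TAT", +4)  # 14 units
allele_del = shift_repeat_count(ref, 4, "TAT", -4)  # 6 units
print(len(ref), len(allele_ins), len(allele_del))
```

A heterozygous locus could then be represented by simulating reads from the reference allele together with one altered allele.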
Experiment with real data
Two genomic DNA standard products, named 2800M (Promega Biotech Co., Ltd, Beijing, China) and 9948 (AGCU ScienTech Incorporation, Wuxi, Jiangsu, China), were used in this study. They contained 51 and 72 Y-STR and/or autosomal STR loci, respectively. Next, we performed two rounds of PCR amplification by using the MultipSeq® Custom Panel (IGMU339V1hg38) kit (iGeneTech Biotech (Beijing) Co., Ltd, Beijing, China) according to the manufacturer’s user guide. Notably, we designed two pairs of primers to replace the amplification primers during the second-round PCR amplification, which were P5-BC02: 5’-(phos)AATGATACGGCGACCACCGAGATCTACACTCGATTCCGTTTGTAGTCGTCTGTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’, P7-BC12: 5’-(phos)CAAGCAGAAGACGGCATACGAGATCAGGTAGAAAGAAGCAGAATCGGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA-3’, P5-BC03: 5’-(phos)AATGATACGGCGACCACCGAGATCTACACGAGTCTTGTGTCCCAGTTACCAGGACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’, and P7-BC13: 5’-(phos)CAAGCAGAAGACGGCATACGAGATAGAACGACTTCCATACTCGTGTGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA-3’. That is, after obtaining the first-round PCR products of 2800M and 9948, we used these four specific barcode primers to carry out the second-round PCR amplification. Then, we performed end repair and ligated nanopore sequencing adapters to build the sequencing libraries. We also performed three experimental replicates for each standard sample. Finally, all sequencing libraries were nanopore-sequenced on the Oxford Nanopore Technologies MinION (R9.4) and the Qnome-3841 instrument (Qitan Technology (Beijing) Co., Ltd, Beijing, China) according to the manufacturers’ instructions.
Real data analysis
We used NanoSTR (step_size = 10) to analyze the simulated data. We also used NanoSTR (step_size = 10) as well as Tandem-Genotypes and TRiCoLOR v1.1 with default parameters (22) to genotype 44 target STR loci in the standard samples. Minimap2 (version: 2.21-r1071) (23), Last (version: 2.34) (24), and BLAST (version: 2.2.23) (25) (26) were installed for alignment, and Sambamba (version: 0.8.0) (27) was installed for alignment processing. Porechop (version: 0.2.4) (https://github.com/rrwick/Porechop) was used for data preprocessing, and NanoPlot (version: 1.38.0) (28) was used for quality control.