Construction and characterization of microsatellite markers for the Schistosoma japonicum isolate from a hilly area of China based on whole genome sequencing

Schistosoma japonicum had once caused the greatest disease burden in China and has still been transmitted in some hilly areas, for example, in Shitai of Anhui province, where rodents are projected to be the main reservoir. This may lead to a critical need of molecular tools with high efficiency in monitoring the dynamic of the rodent-associated S. japonicum, as an appropriate amount of schistosome input can re-establish its life cycle in a place with snails and then result in the re-emergence of schistosomiasis. Therefore, the goal of this study was to develop high polymorphic microsatellites from the whole genome of rodent-associated S. japonicum strain to monitor its transmission dynamic. We sampled the hilly schistosome isolate from Shitai of Anhui in China and sequenced the parasite with the next-generation sequencing technology. The whole genome was assembled with four different approaches. We then developed 71 microsatellite markers at a genome-wide scale throughout two best assembled genomes. Based on their chromosome mapping and the expected length of targeted sequences, we selected 24 markers for the development of multiplex reactions. Two multiplexes composed of 10 loci were finally developed, and their potential was revealed by their successful application on and capturing the genetic diversity of three schistosome populations. The selected 10 markers, each with clear chromosome location and characteristics, will be greatly useful in tracing the dispersal pathways or/and dynamics of the rodent-associated S. japonicum or others in the hilly area of China or elsewhere.


Introduction
Schistosomiasis is caused by the genus Schistosoma which includes three main species, Schistosoma mansoni, S. japonicum, and S. haematobium, affecting approximately 240 million people worldwide with more than 700 million people at risk of infection (Assaré et al. 2014;Barendregt et al. 2013).China is endemic with S. japonicum, and approximately 70 million people live in endemic areas (Zhang et al. 2020).Due to the great health burden caused by the parasite, the central government of China has designated schistosomiasis as one of the four major infectious disease control priorities (Wang et al. 2008), with the goal of the elimination of the disease at the national level by 2030 (Lei and Zhou 2015;Zhou 2016).Indeed, nearly 70 years of control efforts has seen the great achievement in schistosomiasis control in China as schistosome infection in both humans and livestock has been reduced to a much lower level by 2019 (Zhang et al. 2020).However, China still faces many challenges, one of which is the frequent occurrence of re-emergence of schistosomiasis in previously well-controlled regions (Carlton et al. 2011).This may raise the issue of where the schistosomes originated, and the information on this aspect would be of epidemiological implications.
Mengtao Sun and Yuheng Cheng contributed equally to this work.
Section Editor: Pengfei Cai 1 3 S. japonicum is a zoonotic parasite, and over forty mammal species are able to serve as potential reservoirs for the parasite (He et al. 2001).Since all control efforts have been focused on schistosome infections in humans and domestic animals, wild rodents with their inherent ability of rapid reproduction have long been neglected in schistosomiasis control (Wang et al. 2009).We have previously demonstrated the considerable highest infection prevalence (17.7 to 26.5%) and intensity (51.9 to 230.7 eggs per gram of feces) in rodents in a hilly area (i.e., Shitai, Anhui) of China in 2006-2007(Lu et al. 2010) and have also reported the existence of the rodent-associated S. japonicum strain (Lu et al. 2009).Our recent meta-analysis further reported that the estimated infection prevalence of S. japonicum in rodents within hilly and mountainous regions of China significantly increased over time, and even in some marshland and lake regions, infected rodents were still identified given no infections found in humans or bovines (Zou et al. 2020).The hilly rodent-associated S. japonicum still remain transmitted in Shitai of China (and probably its neighboring areas) as schistosome infections in intermediate host snails have always been identified for over 10 years (Lu et al. 2021).Therefore, approaches to monitor the spread and transmission potential of the schistosome isolate should be crucial for the final stage of schistosome elimination at the local, provincial, and national level, as an appropriate amount of S. japonicum input can re-establish the life cycle in a place with existence of snail habitats (Spear et al. 2011).
Microsatellites are known as simple sequence repeats (SSRs) or short tandem repeats (STRs).Due to their abundance, codominant expression, allelism, and neutral selection in genetics, they have been widely used to investigate the population diversity (Rollinson et al. 2009;Shrivastava et al. 2005) and, moreover, to estimate the relationship/gene flow between parasite populations (Queiros et al. 2015).With the advantage of the developed next-generation sequencing technologies, large amount of whole genome data can be generated, throughout which microsatellites can be efficiently identified (Liu et al. 2019).Therefore, in this study, we aimed to screen reliable and accurate microsatellite loci from the whole genome of the hilly rodent-associated S. japonicum with the purpose of further monitoring the transmission and dynamics of the parasite.

Ethics statements
Animals were used according to a protocol approved by the Ethical Committee of Soochow University.

Sample preparation and sequencing
Oncomelania hupensis snails infected with S. japonicum were sampled in May of 2020 in Shitai of Anhui, China (see Lu et al. 2010 andRudge et al. 2013 for details of the sampling site).Female mice (ICR, 8 weeks old) were infected with cercariae shed from the infected snails and then raised in the laboratory for 35 days.The infected mice were euthanized and dissected for adult worms (Shi et al. 2014).Sixteen male worms were selected and then sent to Novogene company (Tianjin, China) for whole genome sequencing.Briefly, genomic DNA (gDNA) of the worms was extracted by using the DNeasy Blood & Tissue Kit (Qiagen) according to the manufacturer's instructions.The quality and quantity of the gDNA were checked by making use of agarose gel electrophoresis with Nanodrop and Qubit 2.0 (Nakayama et al. 2016).A Covaris M220 focused-ultrasonicator was applied to fragment the gDNA into 150-bp fragments to construct a paired-end DNA library.Following that, Qubit 2.0 was used for preliminary quantification.The library was first diluted and then, an Agilent 2100 was used to detect the inserted fragments of the library.After the size of the inserted fragments met the expectations, the effective concentration of the library was quantified by Q-PCR to ensure quality.Finally, a pairedend DNA library was prepared and sequenced on an Illumina NovaSeq 6000 platform.

Genome estimation and assembly
Sequencing data was used to evaluate the values of Q20 (percentage of bases whose accuracy was more than 99%) and Q30 (percentage of bases whose accuracy was more than 99.9%) and GC content.Sequencing quality was accepted when the values of Q20 and Q30 exceed 90% and 85%, respectively (Li et al. 2019).Clean data were obtained with Fastp software (Chen et al. 2018) by filtering out low-quality paired-end sequences (i.e., more than 10% unidentified nucleotides in reads, or more than 50% bases in reads with Phred quality smaller than 5) and reads with index adapter sequences.GenomeScope (Vurture et al. 2017) and jellyfish v2.2.10 (Marcais and Kingsford 2011) were based on the 21 k-mer frequency distribution (Luo et al. 2019), used to estimate genome size, heterozygosity degree, and duplicate ratio.KmerGenie v1.7051 (Chikhi and Medvedev 2014) was used to estimate the best k-mer length for genome assembly.The de novo draft genome was then assembled each with Megahit v1.2.9 (Li et al. 2015), Velvet v1.2.10 (Zerbino and Birney 2008), SPAdes v3.15.3 (Bankevich et al. 2012), and ABySS v2.3.3 (Jackman et al. 2017).

Genome evaluation
Quast v5.1.0(Gurevich et al. 2013) was used to assess the performance of four assemblers on clean reads by aligning their assembled genomes to the reference (ASM636876v1) (Luo et al. 2019).The best assembled genomes were further evaluated in terms of completeness and accuracy with bwa v0.7.17 (Li and Durbin 2009) by mapping clean reads to the reference genome.SAMtools v1.11 (Li et al. 2009) was used to calculate the mapping rate and sequencing depth.BUSCO v4.0.2 (Benchmarking Universal Single-Copy Orthologs software) (Simao et al. 2015) was used to assess the completeness of coding genes.

Microsatellite survey and primer design
The bioinformatic program MISA (Beier et al. 2017) was used to search microsatellite motifs on the assembled genomes with high quality.Mononucleotide motifs were set to a minimum number of six repeats, and di-to hexanucleotide microsatellite motifs were set to a minimum five repeats.A length of 100 bp was set as the interruption of two adjacent microsatellites.RepeatModeler v2.0.1 (Flynn et al. 2020) was used to construct a de novo repeat database based on the assembled genomes.Following that, RepeatMasker v4.1.2(Tarailo-Graovac and Chen 2009) was, in combination with Repbase libraries (issue 1 in volume 19, 2019) (Bao et al. 2015), applied to mask the repeat elements on the assembled genome, where the "nolow" option was selected to avoid any simple repeats being masked.QDD v3.1.2(Meglecz et al. 2014) was applied to isolate and identify microsatellite loci with 200 bp of flanking regions on both sides and then to design primers.The primer pairs generated by QDD software needed to be filtered under the following criteria (Kovach et al. 2021;Liu et al. 2019;Nikolic et al. 2015;Song et al. 2017;Wang et al. 2016): (i) "A" strategy was applied to design the best scenario (i.e., the amplicon includes only one pure microsatellite with no nanosatellites or homopolymers) for the target region using one single perfect microsatellite only; (ii) only tri-and tetranucleotides were included to avoid stutter peaks in PCR applications; (iii) only one pair of primers was selected for each of targeted sequences; (iv) the number of tandem repeats was between 5 and 25; (v) the least distance between 3′ end of two primers and its target region was more than 10 bp; (vi) the length of primers was 18 to 25 bp; (vii) the GC content of primers was 40 to 60%; (viii) microsatellites were neutral; and (ix) grade points of primers evaluated by Premier 6 were more than 50 (Lalitha 2000).Finally, the satisfied microsatellite markers were anchored onto each chromosome of S. japonicum.

Microsatellite validation and application
PCR reactions were performed to verify the validation of the designed primers on two S. japonicum samples with one from Anqing (AQ) and one from Hexian (HX) in Anhui of China.The amplification of each primer pair was carried out in a volume of 15 μl using the Qiagen multiplex PCR Plus kit (Qiagen).PCR thermocycler conditions were as follows: 95 °C for 5 min, then 40 cycles at 95 °C for 30s, 2 cycles at each temperature for 90 s from 57 to 51 °C each decreased by 2 °C, 72 °C for 30 s, and a final step with 68 °C for 10 min.PCR products were separated in 2% agarose gels to check the amplification.The amplicons were then sequenced by Sangon Biotech (Shanghai, China), and sequencing results were visualized and aligned by SnapGene Viewer v6.1.1 to validate the primers and their target sequences.Following that, the software Multiplex Manager v1.2 (Holleley and Geerts 2009) was used in the development of a multiplex PCR reaction with multiple microsatellite loci for amplification efficiency.Each forward primer of the selected locus was labeled with 6-FAM, HEX, TAMRA, or ROX.The polymorphism of these loci and their multiplex reactions were tested and validated on three schistosome populations, including Shitai (ST) and Anqing (AQ) in Anhui province and Wuxi (WX) in Jiangsu province of China.The PCR products were sent to Sangon Biotech (Shanghai, China) for genotyping.GeneMarker HID v2.6.1 (Holland and Parson 2011) was used in allele calling.GenAlEx v. 6.501 (Peakall and Smouse 2012) was used to calculate the genetic diversity, including number of alleles (A), number of effective alleles (A E ), and observed (H O ) and expected (H E ) heterozygosity.Polymorphism information content (PIC) was determined by Cervus v3.0.7 (Kalinowski et al. 2007).

Genome prediction and assembly
Whole genome sequencing of the hilly isolate S. japonicum yielded 13.77 Gb of raw reads and they were deposited in the NCBI BioSample database: SAMN24593635.Q20 and Q30 values were 97.44% and 92.72% respectively, and GC content was 35.12%.After filtering out the reads of low quality, 13.57Gb of clean reads was obtained.The genome size was evaluated to be approximately 352.9 Mb based on the 21-k-mer frequency distribution, close to the reference (Luo et al. 2019).The complexity of the genome was reflected with the heterozygosity of 1.01% and duplication ratio of 1.04% (Fig. 1).
Four software were applied in genome assembly.As the parameter of the 67 k-mer length was inferred to be the best in assembling genome, three software Velvet, SPAdes, and ABySS were then run with 67 k-mer, whereas Megahit was run with the combination of 67 k-mer and its two near 57 and 77 k-mers due to its capability and requirement input of multiple k-mers (Forouzan et al. 2018).Based on N50 and L50, SPAdes performed best.In terms of genome fractions, SPAdes, Velvet, and Megahit were very close.For accuracy indexes in numbers of misassemblies, local misassemblies, mismatches per 100 kb, and indels per 100 kb, ABySS performed best but the genome fraction was less than 70%.Velvet had the shortest unaligned length and largest alignments.In terms of size-related statistics, accuracy, and completeness, as combined and reflected in the assembly quality index (AQI) [46], SPAdes was ranked as the best assembler, followed by Velvet.The genomes assembled with SPAdes and Velvet were therefore evaluated in completeness and accuracy and then used for subsequent microsatellite analyses.Table 1 summarizes the performance of four assemblers on clean reads.

Genome assessment
As shown in Table 2, the genome assembled with SPAdes had approximately 81.37% of clean reads to be concordantly mapped.A proportion of 85.81% of the genome had Genome fraction, percentage of aligned bases in the reference genome N50, the contig length such that using longer or equal length contigs produces half (50%) of the bases of the assembly L50, the minimum number of contigs that produce half (50%) of the bases of the assembly RAE, relative abundance of assembly errors, which is equal to the percentage of an assembly error in sum of all misassembles made by all assemblers RUC, relative abundance of unaligned contig lengths, which is equal to the percentage of unaligned contig length of an assembly in sum of all unaligned contig lengths produced by all assemblers AQI, calculated as N50 (in kbp) * genomic coverage (in percent) * (1 − RAE) * (1 − RUC) For more information, refer to QUAST (Gurevich et  a sequencing depth of > 10.The genome with Velvet had lower values on both indexes.Completeness results showed that 26.3% of 255 core eukaryotic genes and 23.8% of 954 core metazoan genes were mapped to the genome with SPAdes and 15.3% of 255 core eukaryotic genes and 14.2% of 954 core metazoan genes to the one with Velvet (S1 Table ).

Microsatellite searching and development
MISA was used to survey the microsatellite motifs.Based on the genome assembled with Velvet, a total of 38,747 microsatellite motifs located in 207,059 sequences were identified, among which the most prevalent were mononucleotides, followed by tri-, di-, tetra-, penta-, and hexanucleotides.Based on the genome assembled with SPAdes, a total of 61,243 microsatellite motifs located in 574,220 sequences were identified with trinucleotides being the most prevalent, followed by mono-, di-, tetra-, hexa-, and pentanucleotides.The most abundant motifs from the genome assembled with SPAdes comprised A/T, AT/AT, AAT/ATT, AAGT/ACTT, ACACT/AGTGT, and AAC CCT /AGG GTT , and those with Velvet comprised A/T, AT/AT, AAT/ATT, AAGT/ACTT, AATAC/ATTGT, and ACG ACT /AGT CGT (see Table 3 and S2 Table).The repetitive elements obtained from two genomes are shown in Fig. 2. Approximately 49.42% of the repeats were identified in the genome assembled with SPAdes and 47.60% with Velvet, both of which were close to the reference (46.87%) (Luo et al. 2019).Most sequences of the repetitive elements consisted of long interspersed nuclear elements (LINEs).The genomes, in which all repetitive elements were masked, except simple repeats, were then introduced into QDD for microsatellite isolation and primer design.Finally, a total of 99 high-quality microsatellite markers were obtained under the criteria listed in methods (see Table 4 for the detailed process of microsatellite selection).

Microsatellite characterization and application
Out of 99 microsatellite markers, 14 were not successfully anchored onto any schistosome chromosomes and 14 showed low specificity each with multiple locations on chromosomes.The remaining 71 markers were specifically located on one separate site only of a chromosome.Numbers of markers mapped onto chromosomes 1, 2, 3, 4, 5, 6, 7, and Z were 6, 12, 4, 8, 7, 7, 10, and 17, respectively (see S3 and S4 Tables).From each chromosome, three markers were Fig. 2 Percentage of repetitive elements obtained from the S. japonicum genome.The percentage was calculated by dividing the total number of each repeat element by the total number of repeat ele-ments.SINES, short interspersed nuclear elements; LINEs, long interspersed nuclear elements; LTR elements, long terminal repeat elements randomly selected and then, a total of twenty-four microsatellites were included in the following analysis (Table 5).
We conducted PCR reactions to test their usability on two field S. japonicum samples.All markers produced distinct banks in gel (Fig. 3), and all products each were sequenced and visualized as shown in Fig. 4 for a typical chromatogram of three markers.We finally used Multiplex Manager to successfully develop two multiplex reactions, one with six markers and one with four makers.The selected 10 loci, each labeled with a fluorescence color, were used in evaluation of their polymorphism in three populations (i.e., ST, AQ, and HX) of 47 schistosome individuals (Table 6).All loci were polymorphic for each population and each locus produced distinguishable peaks (see Fig. 5 for examples of three loci).A total of 137 alleles were identified.The number of alleles per locus varied from 2 to 10, with an average 4.6 alleles per locus.PIC values of loci tested ranged from 0.337 (Sj17) to 0.790 (Sj20) with an average value 0.556.For each population, the average of the effective alleles over loci was 2.4 in ST, 2.9 in AQ, and 2.7 in WX.The average of H O was 0.381 in ST, 0.418 in AQ, and 0.387 in WX, and H E was 0.532 in ST, 0.601 in AQ, and 0.595 in WX (Table 6).

Discussion
Schistosome japonicum had caused the greatest disease burden in China and has still been transmitted in some hilly areas of Anhui, China, where rodents are projected to be the main reservoir (Lu et al. 2010;Rudge et al. 2013).This may lead to a critical need of molecular tools with high efficiency in monitoring the dynamic of the rodent-associated schistosome.Quite a number of S. japonicum microsatellites had been developed either by analyzing contigs data existed in GenBank (Shrivastava et al. 2003) or based on a draft genome of a marshland S. japonicum isolate (Xiao et al. 2011; The Schistosoma japonicum Genome SFA, Consortium 2009;Yin et al. 2008).We have previously screened all the reported microsatellites against the field S. japonicum samples in our laboratory and found that some loci were very error prone (Bian et al. 2015;Gao et al. 2015).Moreover, all the reported microsatellites lacked information on chromosome mapping with a possible existence of inherent linkage equilibrium between/among loci.Therefore, the development of highly polymorphic microsatellites with known chromosome locations will be of great implication in tracing of the schistosome (isolate, genes or alleles) in the transmission or spread.In this study, we performed whole genome sequencing of the rodent-associated S. japonicum isolate sampled from the hilly area of Anhui, China and evaluated its genomic characteristics.We then assembled the whole genome with four different assemblers, on the base of which we developed novel microsatellite markers with known locations of chromosomes.We finally validated their usability with PCR approach on three geographical schistosome populations.The developed novel microsatellites will be employed to monitor the spread and transmission paths of the rodent-associated schistosome, which has been transmitted within wild rodents in the field (Lu et al. 2021).
With the increasing popularity of next-generation sequencing technologies, developing microsatellite markers, based on the genome-wide scale, has therefore become increasingly popular (Bing et al. 2015).In this study, by whole genome sequencing of the hilly rodent-associated S. japonicum isolate, we developed new microsatellite markers suitable for the genetic study and monitoring of the parasite.The high values of Q20 and Q30 from the raw data of high-through output sequences showed a high sequencing accuracy, and GC content (35.12%) suggested no sequence bias (Aird et al. 2011).The results from the 21 k-mer distributions suggested a high heterozygosity in the genome of the hilly parasite (Kajitani et al. 2014).We assembled the whole genome with four different software, and two best assembled genomes, with SPAdes and Velvet respectively, were used to identify microsatellites.
We found that the base composition of microsatellite motif was strongly biased to A or T, especially at the first base, as observed in other species (Santonoceto et al. 2019).Under the stringent selection process (see "Materials and methods" section), a total of 99 microsatellites and their corresponding primer pairs were generated.We then mapped these 99 markers onto S. japonicum chromosomes using BLAST, and 71 were successfully and specifically located on one of eight (seven autochromes and one Z) chromosomes.After taking into consideration both the expected product size and its chromosome location, we selected twenty-four markers and screened them by running PCR on S. japonicum samples.Their usability was validated from the expected   Rodent-related strains of S. japonicum are still being transmitted locally, and to prevent the expansion of their endemic range, population dynamics monitoring is necessary.We predicted that population genetic analyses with the above developed microsatellites would reveal an insight on transmission and recruitment of S. japonicum genotypes or alleles, responses of parasite populations under current control pressures, and hence the evolutionary potential for the isolate.This could not be obtained through our routine parasitological (and serological) monitoring of prevalence and intensity alone.However, there was a limitation in this study.Due to the high heterozygosity existing in the genome  of S. japonicum, it is difficult to assemble an accurate and complete genome sequence.Therefore, the conservative gene integrity assessment performed in this study resulted in a low alignment rate.This limitation might have led to a smaller number of qualified sequences, from some chromosomes, for screening polymorphic microsatellite.

Conclusions
We sequenced the hilly S. japonicum isolate with the nextgeneration sequencing technology and assembled the whole genome of the parasite with four different approaches.We then developed 71 microsatellite markers at a genome-wide scale.Based on their chromosome mapping and the expected length of targeted sequences, we selected 24 markers for the development of multiplex reactions.Two multiplex reactions composed of 10 loci were finally developed, and their potential in genetic studies was revealed by their successful application on three populations.The selected 10 markers, each with clear chromosome location and characteristics, will be useful in tracing the dispersal pathways or/and the population dynamic of the rodent-associated S. japonicum isolate, which has remained transmitted in China.

Fig. 1
Fig. 1 k-mer (K = 21) distribution of S. japonicum genome.Blue area represents the observed k-mer distribution.The area below the red line is considered sequencing errors; below the black line is reliable k-mers without errors; below the yellow line is the size of the nonrepeating area.Len, inferred total genome length; Uniq, percentage of unique (nonrepetitive) genomes; Het, overall ratio of heterozygosity; Kcov, average k-mer coverage for heterozygous bases; Err, error rate of reads; Dup, average rate of repetition length of clear bands in gel and the resulting sequences of the amplicons.Two multiplex reactions, one with six markers and one with four markers, were finally developed.The results from their successful application in capturing the diversity of three S. japonicum populations validated the potential of the selected ten loci, with five showing medium and five showing high polymorphism in terms of PIC values(Botstein et al. 1980).

Table 3
Key features observed in the analyzed microsatellite sequences

Table 4
(Meglecz et al. 2014)in developing microsatellite markers Note: "A" strategy, the amplicon includes only one pure microsatellite with no nanosatellites (three to four repetitions of two to six basepair motif) or homopolymers (at least five repeats of a single base)(Meglecz et al. 2014)

Table 5
Detailed information of the 24 screened microsatellite markers for S. japonicum

Table 6
Amplification of two multiplex reactions composed of ten microsatellites on three S. japonicum populations (i.e., ST, AQ, and WX) Note: PIC polymorphism information content, N sample size, A number of alleles, A E number of effective alleles, H O observed heterozygosity, H E expected heterozygosity