SMS: A Novel Approach for Bacterial Strain Genome Reconstruction in Multiple Samples

Background The analysis of the bacterial strains is important for the understanding of drug resistance. Despite the existence of dozens of computational tools for bacterial strain studies, most of them are developed for the limited number of known bacterial strains. Almost all remaining tools are designed to analyze individual samples or local regions instead of the entire bacterial genomes. With multiple shotgun metagenomic samples routinely generated in a project, it is necessary to create methods to reconstruct novel bacterial strain genomes in multiple samples. Results We developed a novel computational approach called SMS to de novo reconstruct bacterial Strain genomes from Multiple shotgun sequencing Samples. Tested on 702 simulated and 195 experimental datasets, SMS reliably identified the strain number, strain abundance, and strain polymorphisms. Compared with two existing approaches, SMS showed superior performance in terms of more accurate estimation of the strain number, abundance and variations. Conclusions SMS is a useful tool for novel bacterial strain reconstruction in multiple shotgun metagenomic samples.


Background
Bacteria are ubiquitous in our environments and human body sites [1][2][3]. Because their genomes are constantly evolving, it is common that multiple strains of the same species coexist in an environmental niche. These strain genomes of the same species are different from each other, with small variations such as single nucleotide polymorphisms (SNPs), different gene contents, and/or different plasmid genes [4]. Such a difference results in different strengths and fitness to survive in an environment or react to a treatment, which is often the cause of drug resistance, mixed infection, reinfection, etc. [5,6].
It is thus important to study bacterial strains and reconstruct their strain genomes.
Shotgun metagenomic sequencing is routinely used to study bacterial strains [7,8]. In shotgun metagenomic sequencing, short DNA segments from different strains of all species in an environment are mixed and sequenced. These sequenced DNA segments called reads are then employed to determine the strains present and their abundance. By mapping reads to the marker genes, the reference genome(s), or the metagenome-assembled-genome(s) of a species, the variations buried in the mapped reads are identified, and their patterns are used to infer the presence and abundance of different strains [8,9].
Dozens of computational methods are available to infer bacterial strains from shotgun metagenomic reads [7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22]. Most of them rely on prior knowledge of known strains of a species, such as the known gene contents of a strain or the SNPs in known strains. These methods have successfully identified known strains in a number of metagenomic shotgun samples. However, new strains and new SNPs accumulated in existing strains cannot be studied by these methods. A handful of methods that do not depend on prior strain information are thus developed, which can be further divided into two groups [8,9,[16][17][18]23]. One group defines strain variations and strains based on species-specific marker genes, which depend on the quality and quantity of the marker genes but can significantly speed up the process to analyze a large number of species in a microbiome [17,23]. The other group of methods consider the SNPs across the entire reference genomes of an individual species instead of only the marker gene regions, which can delineate the strain genomes of a species in detail and are important for the study of individual pathogen species [9,[12][13][14]. The two groups of methods have shed new light on bacterial strains in environmental samples. However, their performance is still suboptimal in terms of the predicted strain number and abundance accuracy. For instance, a recent method, StrainFinder, did not have good accuracy in predicting strain SNPs and strain abundance, even provided with the correct strain number in the sample(s) [12].
To accurately identify strains in environmental or clinical samples, we developed a novel strain identification method called SMS (bacterial Strain identification in Multiple metagenomic Samples).
Starting from a reference genome, SMS de novo reconstructs bacterial strain genomes from shtogun reads in multiple samples, which thus does not rely on the prior knowledge of known strains. Unlike existing methods for novel strain identification, SMS models the coverage of every strain in individual samples by zero-inflated Poisson (ZIP) distributions and classifies SNPs with adatively inferred centers, which enables it to work well with low coverage of strains and predict strains with high accuracy. Tested on 702 simulated datasets and 195 experimental datasets, SMS accurately predicted the strain number, SNPs, and abundance. Compared with two recent approaches, SMS showed much better performance. The SMS tool is freely available at https://github.com/UCF-Li-Lab/SMS.

Materials and methods
SMS reconstructs bacterial strain genomes with a reference genome and raw reads in multiple shotgun sequencing samples (Figure 1). The basic assumption is that different SNPs from the same strain follow a common ZIP distribution in a sample, and SNPs from different strains follow different ZIP distributions in individual samples. Assume there are R strains of a species of interest in m samples.
Starting from the cleaned raw reads, SMS defines SNPs based on the reads mapped to the reference genome. It then determines the initial strains and their abundance with the pooled sample, the combined m samples. Next, SMS refines the initial strains and their abundance based on the SNP coverage patterns across samples. The rationale is that unique SNPs from the same strain will have more similar coverage patterns across samples than SNPs from different strains. Finally, SMS outputs the predicted strains and their abundance. The details are in the following sections.

Identification of potential SNPs
With the reads from M samples, SMS trims reads and filters low-quality reads with the tool trimmomatic. SMS then maps the cleaned reads to the reference genome by bowtie2 [24]. In every sample, SMS obtains a 4 by n sample-specific matrix composed of the frequencies of A, C, G, and T in the mapped reads at each of the n reference genome positions. Similarly, SMS acquires a pooled matrix of 4 by n for the pooled sample, the sum of the m sample-specific matrices. SMS then determines the ′ potential polymorphic positions based on these m+1 matrices. A reference genome position is selected as a potential polymorphic position if the following criteria are satisfied: 1). It has to have a coverage larger than 10% of the pooled coverage of the reference genome. The coverage of a genome (position or SNP) is calculated as the average number of reads mapped to this genome (position or SNP); 2). It has at least two nucleotides with coverage no smaller than 5% of the pooled coverage. Note that when the reference nucleotide at a position has fewer than 5% of the genome coverage, the reference nucleotide is replaced with the most frequent nucleotide at this position; 3). Each of its two most frequent nucleotides must occur in at least 5% of the m samples. Finally, SMS considers all n1 nucleotides with coverage larger than 5% of the genome coverage at these positions as potential SNPs, where ′ ≤ 1 ≤ 3 ′.

Prediction of the strain number and abundance
With the n1 potential SNPs, SMS infers the strain numbers and their abundance in four steps.
First, SMS obtains an initial number of strains and their SNPs. SMS applies mixtureS to the above n1 SNPs with the pooled sample and outputs the predicted strains and their abundance. MixtureS reconstructs the strain genomes from shotgun reads in one sample and has shown good performance previously [12]. In this way, the strains with different pooled coverage are separated into R strains, where R is automatically inferred.
Second, SMS refines the predicted strains so that almost all SNPs in an original strain are assigned to one predicted strain. Since the coverage of SNPs from the same strain are expected to follow the same ZIP distributions in individual samples, the coverage vectors of two SNPs from the same strain are more similar than those of two SNPs from different strains. Here the coverage vector of a SNP is a vector composed of its coverage in the m samples. The similarity measurement of two vectors is described in the next section. Based on this observation, SMS iteratively regroups the n1 SNPs into R groups so that SNPs from the same group have more similar vectors. Starting from the predicted R strains by mixtureS, the majority of SNPs in each of which are likely from the same strain, SMS represents each strain by a m by 1 coverage vector, the average of the coverage vectors of the SNPs currently assigned to this strain. SMS then re-assigns each of the n1 SNPs to the strain with the most similar coverage vector to the coverage vector of this SNP. With the re-assigned SNPs, the coverage vectors of the strains are recalculated. This process is repeated a given number of times or until the assigned SNPs to each strain do not change. In this way, the coverage vector of each predicted strain and the assignment of the n1 SNPs become more and more accurate, with almost all SNPs from an unknown strain grouped together.
Third, SMS investigates whether there are more or fewer than R strains in the m samples. SMS divides each strain into two strains, one strain at a time. To determine whether a strain should be divided, SMS models each strain in a sample by a ZIP distribution, estimates the parameters of the ZIP distributions, and calculates the likelihood ratio of observing the SNPs in this strain across the m samples to that in two divided strains. The details of the ZIP parameter estimation and the likelihood testing are in the following sections. A strain is divided only when its division significantly increases the likelihood (Chisquare test p-value<0.001). If a strain is divided, SMS considers whether the two new divided strains can be further divided similarly. This process is repeated until no strain can be further divided. With all possible divisions that significantly increase the likelihood, SMS obtains the updated R strains and repeats Step two to reassign the n1 SNPs to these R strains again. SMS then considers removing each strain, one strain at a time. The process is similar to dividing a strain based on the ZIP parameter estimation and the likelihood test.
Finally, SMS removes the predicted strains that are majorly composed of shared SNPs by multiple strains and reassigns their SNPs to the corresponding strains. To remove a strain, SMS identifies its consistent strains. Strain one is a consistent strain of strain two if every entry in the coverage vector of strain one is no large than the corresponding entry in the coverage vector of strain two plus a small cutoff, say 0.5. Similarly, multiple strains together are consistent with strain two if the sum of the corresponding entries in their coverage vectors is no large than the corresponding entry in the coverage vector of strain two plus the same cutoff. With the consistent strains of a strain, SMS constructs a graph, with each consistent strain as a node and edges connecting pairs of strains that are together still consistent with this strain. SMS then identifies the largest cliques in this graph with the corresponding groups of strains together consistent with this strain. With a clique identified, which could be a pair of strains, SMS removes this strain and reassigns the SNPs in this strain to all consistent strains identified in this largest clique. In this way, SMS finalizes the predicted strains and their SNPs. The abundance of every strain is calculated as the average coverage of the SNPs unique to this strain.

The similarity of two coverage vectors
SMS calculates the similarity of two coverage vectors ( 1 , 2 , … , ) and ( 1 , 2 , … , ) by a predefined regression formula: 79.25d+ 43.06(c+c 3 )-0.04/(0.0025+d), where d is the distance between the two vectors, and c is the Kendall rank correlation of the two vectors. This formula was constructed based on a set of 18 pre-simulated training datasets. SMS chooses this similarity measurement, because it shows better performance than others, including correlation, Euclendian distance, relative entropy, etc.
We also tried the following formula: , where dpois is the Poisson density function in the R package. This formula works similarly to the above regression formula when the sample number is larger than 15.

ZIP model of a strain in a sample
SMS models the coverage of the sites from the p-th strain in the q-th sample by a ZIP distribution To estimate the parameters in the ZIP, for a given strain that occurs in a given sample, say the r-th . SMS then uses the following iteration method to obtain the maximal likelihood

Log likelihood test
Given R strains, the full likelihood of observation the frequencies of these n1 SNPs in the m samples is When SMS splits one strain into two or removes one strain, the likelihood can be similarly calculated.
To assess the significance of changing the current R strains, we calculate the ratio of the likelihood after changing (split or remove) to the likelihood before changing. The ratio approximately follows a Chisquare distribution with the degree of freedom equal to the difference of the parameters in the two models. If the Chi-square test p-value is smaller than a pre-defined cutoff, SMS correspondingly modifies the current R strains.

Simulated and experimental datasets
We simulated 702 datasets (Supplementary Table S1). In each dataset, a reference genome was chosen, 2 to 4 strains were simulated, and 5 to 35 samples were generated. We considered three different reference genomes (Bartonella clarridgeiae NC_014932, Enterococcus casseli flavus NC_020995, and Methanobrevibacter smithii NC_009515), which were randomly chosen from GenBank. For each reference genome, their four strains were generated by randomly choosing 0.01% of the genome positions and then randomly substituted the reference nucleotide with another nucleotide. The read coverage of a reference genome in a dataset was one of the following five coverage, 50x, 100x, 150x, We tested SMS on 195 experimental datasets [6]. Each dataset is known to have two Mycobacterium tuberculosis strains with predicted abundance. The abundance is inferred from two different computational methods. The actual SNPs in each strain are unknown.

Comparison with existing methods
We compared SMS with mixtureS and StrainFinder in a desktop computer with the Intel Core i9-9900KF CPU (16 cores@3.6GHz) and 32 gigabytes memory. We used the following commands to run the three tools respectively:

SMS correctly predicted the strain numbers
We studied the number of strains predicted in 702 simulated datasets (Material and Methods, Table   S1). There were additional datasets. Note that the correct strain number was changed after the first step and became correct again in 93 datasets. The significant improvement at the refinement steps thus suggests that it is important to consider the coverage variation across individual samples to define strains.
Overall, SMS predicted the correct strain numbers in all but five datasets ( Tables S2-S5). Interestingly, SMS did not predict the correct strain number in at least one dataset for each of the three randomly selected genomes, implying that its performance was not biased towards a specific genome/species. In each of the five datasets, a pair of strains shared 30% of their SNPs. In four of the five datasets, three strains were sharing 20% of their SNPs. These shared SNPs may have confused SMS when the coverage was 100X. As expected, when the coverage was increased from 100X to 200X or 300X, SMS predicted the correct strain number in each of the five corresponding datasets. These analyses on the five datasets suggested that SMS can accurately predict the strain number, even when the pooled coverage was 100X and there were only five samples in a dataset. Moreover, the predicted strain number was even more accurate with a larger pooled coverage (200X coverage for perfect prediction here).

SMS reliably estimated the strain abundance
We investigated how well SMS predicted the strain abundance in the above 702 simulated datasets. No matter whether the strain number was correctly predicted, the predicted strain abundance agreed well with the known strain abundance (Figure 2, Tables S2-S5). This agreement did not depend on the sample number, the total coverage, the strain number, whether the strains shared a portion of their SNPs, etc. In the 697 datasets SMS correctly predicted the strain number, the predicted strain abundance agreed well with the known strain abundance. In these datasets, the predicted strain abundance was within 97.31% of the true abundance. The mean and median ratio of the predicted abundance to the true abundance in these datasets were 0.99 and 1.00, respectively. Even in the five datasets SMS did not correctly predict the strain number, the predicted strain abundance was similar to the true strain abundance. For instance, SMS predicted four strains in three datasets with three strains ( Table S5). In two of these three datasets, two strains had the predicted abundance about 0.08 and 0.29, respectively, which were close to the corresponding true abundance 0.10 and 0.30. The two remaining predicted abundance were about 0.42 and 0.21, which differed from the third true abundance, 0.60. In the third dataset, one strain was predicted with an abundance of 0.31, close to the true abundance of 0.30. The wrong prediction of the strain number and strain abundance was likely due to the third strain's uneven and relatively limited coverage. After increasing the coverage (200X or 300X), SMS predicted the correct strain number and more similar abundance ( Table S5). The agreement between the predicted and true abundance suggested that SMS predicted the strain abundance well.
The accuracy was in general improved with more samples in a dataset (Figure 2). When the sample number was larger, the median of the predicted abundance was closer to the true abundance. Moreover, the variation of the difference between the predicted abundance and the true abundance was smaller.
This improvement was especially evident in datasets with shared SNPs among strains. In datasets with no shared SNPs among strains, the same trend of improvement was still clear. However, the predicted abundance was not necessarily closer to the true abundance when the sample number was larger, which was likely because the coverage of a strain in individual samples became smaller when the sample number was larger.
We also studied how the accuracy of the predicted abundance was affected by the pooled coverage in a dataset. The accuracy was measured by the Maximal Absolute Difference (MAE) of the predicted abundance and the corresponding true abundance. The prediction in general became more accurate with a larger pooled coverage (Figure 2). The median MAE was similar for different pooled coverage.
However, the variation of the MAE became much smaller when the pooled coverage became 200X or larger. We thus would recommend that the pooled coverage should be at least 200X. The pooled coverage may need to be larger if it is expected to have more strains.
In addition, we examined how different bacterial genomes or the number of strains in a dataset would affect the predicted strain abundance. There was a slight variation when different genomes and/or different numbers of strains were considered (Figure 2). For instance, the difference between the predicted and true strain abundance was within a similar range and with a similar mean/median when there were different numbers of strains. The small variations suggested that the predicted abundance by SMS was robust to different bacterial genomes, different number of strains, etc.
Strains with similar abundance were challenging to distinguish from each other in previous studies [12,13]. We thus specifically focused on the datasets with two configurations: 10:20

SMS faithfully determined the SNPs
Existing methods mainly focus on the predicted strain number and only occasionally consider the strain abundance. Rarely do they mention the accuracy of the predicted strain SNPs. With the simulated datasets, we systematically evaluated the predicted SNPs. We found that SMS has a precision of 0.97 and a recall of 0.96 to assign SNPs to strains.
We studied the datasets without shared SNPs among strains ( Table S6) ( Tables S7 and S8). The performance was only slightly lower than that on datasets without shared SNPs, suggesting that SMS could reconstruct the complicated evolutionary trajectories of strains with shotgun sequencing reads.

SMS performed well on experimental datasets
We tested SMS on 195 experimental datasets ( Table S10). We chose these datasets because their strain numbers were known. The strain abundance was also predicted previously [6]. Note that the Critical Assessment of Metagenome Interpretation challenge provided the experimental datasets to evaluate taxonomical compostion analsysis and assembly [25]. They did not provide the strain number, strain abundance and SNPs unique to strains, thus not suitable for the strain genome reconstruction here.
SMS identified two strains in each of these 195 datasets, which agreed well with the previous study [6].
This study showed that there were at least 11 heterozygous sites in each of these 195 datasets.
Interestingly, SMS showed that the two strains in each of these datasets were the same across these datasets, which was consistent with the fact that these datasets were from clinical samples collected from the same region. Moreover, SMS distinguished strains with similar abundance in these datasets.
For instance, in the dataset ERR323056, there were 69 heterozygous sites observed in reads [6]. SMS predicted two strains with a relative abundance of 0.52 and 0.48. The previous study based on the SNP frequency identified only one strain, likely due to the similar abundance of the two strains [6]. In addition, SMS reliably predicted the strain abundance. The difference of the predicted strain abundance to the predicted abundance previously had a mean and median of 0.16 and 0.12 respectively, if we considered only the 186 datasets where the previous study correctly predicted the strain number.

SMS reconstructed strain genomes better than existing methods
We compared SMS with mixtureS [12] and StrainFinder [9]. Although several dozen methods are available for bacterial strain analysis, most of them are not for de novo strain identification [4].
mixtureS and StrainFinder showed better performance among the existing methods for novel strain identifications previously [12]. Since mixtureS works on one sample, we ran it on the pooled sample in each dataset. Because StrainFinder is unable to determine the strain numbers, we specified the known strain numbers in the corresponding datasets.
We compared the strain number, SNPs, and strain abundance predicted by the three methods (Table 1).  (Table 1). For instance, SMS correctly predicted two strains in all 195 experimental datasets, while mixtureS predicted two strains in 84 datasets. Table 1 The performance of the three tools.
We also studied the running time of different methods. We collected the tool running time on nine simulation datasets with the same genome (NC009515.1) and same strain configuration (10:30:60) (    The three columns for each tool are the number (percentage) of datasets where the tool predicted the correct strain number; the precision, recall and F1 score of the predicted strain SNPs; and the average MAE of the predicted strain abundance.

Additional Files
Additional file1: Supplementary Tables   Table S1. Simulated datasets. There are 27 datasets generated in each row, each of which corresponds to one of the three different bacterial species (species 1: Bartonella clarridgeiae NC_014932, species 2: Enterococcus casseli flavus NC_020995, and species 3: Methanobrevibacter smithii NC_009515) and one of the nine different sample numbers (5,8,10,12,15,20,25,30,35). The number and order of strains in each dataset are specified by the strain configuration. For instance, the configuration 10:20:30:40 in the first row specifies that there are four strains, with the coverage of the 1 st , 2 nd , 3 rd and 4 th strain as 10X, 20X, 30X and 40X,respectively. There is no shared SNPs among strains in Type 1 datasets. In each type 2 dataset, the first two strains share 30% of their SNPs, the first three strains share 20% of their SNPs, and the fourth strain shares no SNP with other strains. In each type 3 dataset, the first two strains share 30% of their SNPs, the first three strains share 10% of their SNPs, and there is no SNP shared between the first three strains and the last strain. In each type 4 dataset, the first two strains share 30% of their SNPs, and the third strain share no SNP with other strains.