Feature selection
First, we selected the “species”, “family”, and “order” taxa common to all 16 cities, which yielded 7, 9, and 9 variables, respectively. This number of features was limited, and more information about the microbes was needed. To increase the number of features, we therefore selected features based on additional rules: a) features were selected based on the ubiquity of the “species”, “family”, and “order” across all cities; b) features were selected based on the ubiquity of the “species”, “family”, and “order” across all samples; c) combinations of the common “species”, “family”, and “order” were treated as combined features. Table 2 presents the details of the features selected under these additional rules. For simplicity, the mystery dataset was analyzed using only the common features and the combined features. After feature selection, the aggregated raw counts of each dataset were normalized to log2-cpm for further analyses.
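As a minimal sketch of these two steps, the snippet below selects the taxa common to all cities and converts raw counts to log2-cpm. The input names (`species_counts`, `city_labels`) and the pseudocount convention are illustrative assumptions, not the exact pipeline used here.

```python
import numpy as np
import pandas as pd

def common_features(counts: pd.DataFrame, cities: pd.Series) -> pd.DataFrame:
    """Keep only taxa observed (count > 0) in every city (input: samples x taxa)."""
    present_per_city = (counts > 0).groupby(cities).any()   # cities x taxa presence table
    return counts.loc[:, present_per_city.all(axis=0)]

def log2_cpm(counts: pd.DataFrame, pseudocount: float = 1.0) -> pd.DataFrame:
    """Convert raw counts to log2 counts-per-million; the pseudocount avoids log2(0)."""
    lib_sizes = counts.sum(axis=1)                           # total reads per sample
    return np.log2(counts.div(lib_sizes, axis=0) * 1e6 + pseudocount)

# Illustrative usage with hypothetical inputs:
# species_counts: samples x species raw counts; city_labels: city of each sample.
# features = log2_cpm(common_features(species_counts, city_labels))
# Combined features (rule c) would concatenate the species-, family- and
# order-level tables column-wise, e.g. pd.concat([...], axis=1).
```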
Machine learning analysis
For the main and mystery datasets, results were obtained with Random Forest (RF) [4], Support Vector Machine (SVM) [5], and Linear Discriminant Analysis (LDA) [6] under a leave-one-out cross-validation (CV) scheme in which one test sample was randomly selected in each run, with 1000 runs in total. Table 3 presents the classification error rates obtained with the different feature-selection rules on the main and mystery datasets.
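This cross-validation scheme can be sketched as follows: in each of 1000 runs, one randomly chosen sample is held out, the classifier is trained on the remaining samples, and the misclassification rate is averaged over runs. The scikit-learn estimators and hyperparameters shown are assumptions for illustration; they are not specified in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def random_holdout_error(X, y, make_model, n_runs=1000, seed=0):
    """Average error over runs that each hold out one randomly selected sample."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(n_runs):
        test = rng.integers(len(y))                    # index of the held-out sample
        train = np.delete(np.arange(len(y)), test)
        model = make_model().fit(X[train], y[train])
        errors += int(model.predict(X[test][None, :])[0] != y[test])
    return errors / n_runs

# X: log2-cpm feature matrix (numpy array), y: city labels -- assumed inputs.
# rf_err  = random_holdout_error(X, y, lambda: RandomForestClassifier(n_estimators=500))
# svm_err = random_holdout_error(X, y, lambda: SVC(kernel="linear"))
# lda_err = random_holdout_error(X, y, lambda: LinearDiscriminantAnalysis())
```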
As seen in the table, when the features present in at least N cities were selected, the error rates of RF-species (i.e., RF applied to the qualifying “species” features), SVM-species, LDA-species, LDA-family, and LDA-order followed a similar pattern: the CV error rate decreased as the number of features increased, until the lowest error rate was reached. For RF-family and SVM-family, the error rate did not change considerably as the number of family-level variables increased; using only the common “family” was therefore preferable, as it gave a low error rate without including too many features. For RF-order and SVM-order, the lowest error rate was obtained with the common “order”. Additionally, when the top M features with the highest ubiquity across all samples were selected, the error rates decreased as more features were used until the lowest error rate was reached, regardless of the machine learning method or feature type. For the combined features, the best performance was achieved with the combination of the common “species”, “family”, and “order” (7 species, 9 families, 9 orders); the error rates for RF, SVM, and LDA were 11.6%, 11.5%, and 11.8%, respectively. For the mystery dataset, the lowest average error rate was obtained with the combined features of 8 species and 15 orders: 29.7%, 32.1%, and 28.2% for RF, SVM, and LDA, respectively.
The prediction error rates for each city are presented in Figure 1. As shown in Figure 1.A, low error rates were obtained for the cities with better sequencing depth (Table S1), including Bogota, Ilorin, New York, Offa, Sacramento, and Tokyo; the average error rates of the three methods for these six cities were 7.84%, 5.49%, 2.03%, 6.31%, 3.08%, and 5.80%, respectively. However, the error rates of Auckland and Hamilton, two cities that also had good sequencing depth, were 26.39% and 18.44%, respectively. Examining the detailed results, we found that samples AKL_1, AKL_7, AKL_14, HAM_7, and HAM_12 could not be predicted correctly by any of the three methods; in other words, the microbial composition of these samples differed from that of the other samples from Auckland and Hamilton (Table S2), making them difficult to identify. We also found that the samples from London, despite their poor sequencing depth, showed a low error rate: the excessive zeros in the London samples made them easy to distinguish from all other samples. In addition, a possible explanation for a high error rate could be an insufficient number of samples, as the cities with the fewest samples, including Sofia and Marseille, showed high error rates. From Figure 1.B, the cities with deep sequencing (Table S3), i.e., Oslo and Rio de Janeiro, were two of the three cities with the lowest average error rates (6.64% and 16.81%, respectively). Conversely, cities with poor sequencing depth, such as Doha and Brisbane, showed high average error rates (85.71% and 62.61%, respectively), and Doha, which also had a limited number of samples, had the highest error rate of all cities. It is evident from Figure 1 that most cities with limited samples and poor sequencing depth had high error rates, indicating that sufficient samples and deep sequencing are necessary for successfully predicting the provenance of samples.
In addition to analyzing the main and mystery datasets separately, we also used prediction models built on the main dataset to predict the samples from the mystery dataset. Because the main and mystery datasets had no cities in common, the city information of the mystery dataset was absent from the main dataset. Therefore, 50% of the samples of each city in the mystery dataset were randomly sampled and added to the main dataset as part of the training set, and the remaining mystery samples served as test samples. In the training set, the samples from the mystery dataset were labeled “mystery”. The prediction models were built on the common “family” (5 families) and “order” (6 orders), as there were no “species” common to the main and mystery datasets. The three classifiers were applied and the random sampling was repeated 1000 times; in each run, the predicted labels were compared with the true labels. The average error rates for RF, SVM, and LDA were 10.48%, 9.21%, and 10.36%, respectively, indicating that the prediction models could effectively identify the mystery samples among all the samples.
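A sketch of this scheme is given below, assuming a Random Forest as the classifier (SVM and LDA would be substituted the same way) and illustrative variable names (`X_main`, `y_main`, `X_myst`, `myst_cities`); `X_main` and `X_myst` are assumed to share the same columns (the common families and orders).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mystery_detection_error(X_main, y_main, X_myst, myst_cities, n_runs=1000, seed=0):
    """In each run, 50% of every mystery city's samples are relabeled "mystery" and
    added to the main training set; the remaining mystery samples form the test set.
    Returns the average fraction of test samples not predicted as "mystery"."""
    rng = np.random.default_rng(seed)
    myst_cities = np.asarray(myst_cities)
    errors = []
    for _ in range(n_runs):
        train_idx = []
        for city in np.unique(myst_cities):
            idx = np.flatnonzero(myst_cities == city)
            rng.shuffle(idx)
            train_idx.extend(idx[: len(idx) // 2])     # 50% of this city's samples
        train_idx = np.array(train_idx)
        test_idx = np.setdiff1d(np.arange(len(myst_cities)), train_idx)
        X_train = np.vstack([X_main, X_myst[train_idx]])
        y_train = np.concatenate([y_main, np.repeat("mystery", len(train_idx))])
        clf = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)
        errors.append(np.mean(clf.predict(X_myst[test_idx]) != "mystery"))
    return float(np.mean(errors))
```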
The following analyses were based on the features that achieved the lowest average error rate.
Principal Coordinates Analysis
Figure 2 presents the PCoA [7] plots for both datasets. Figure 2.A shows the main dataset, for which the first two PCoA axes explained 58.4% of the total variability. A separation of the cities can be inferred from the plot. For example, London was separated from most cities at the rightmost side, which corresponds to the machine learning results and indicates that the excessive zeros in the London samples made their prediction easier. However, many cities overlapped. Specifically, the ellipses of Ilorin and Offa, both cities in Nigeria, showed a massive overlap, and Auckland and Hamilton, both in New Zealand, also overlapped with each other. The result for the mystery dataset is given in Figure 2.B. The first two PCoA axes explained 65.4% of the total variability, comparable with the percentage explained in the main dataset. Although many cities overlapped, the samples from Oslo clustered together at the top of the plot, separated from most samples. Some of the samples from Rio de Janeiro were also far from most cities, making these samples easier to identify. These results correspond to the low error rates of Oslo and Rio de Janeiro reported in the previous section.
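For reference, classical PCoA can be sketched directly from a pairwise distance matrix. The Bray-Curtis metric below is an assumption, as the text does not state which distance was used, and `features` is a hypothetical samples-by-features matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pcoa(features, metric="braycurtis", n_axes=2):
    """Classical principal coordinates analysis (metric MDS) on a distance matrix.
    Returns coordinates on the first n_axes axes and the proportion of total
    (positive-eigenvalue) variability explained by each axis."""
    d = squareform(pdist(features, metric=metric))     # n x n pairwise distances
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n                # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                        # Gower-centered matrix
    evals, evecs = np.linalg.eigh(b)
    order = np.argsort(evals)[::-1]                    # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    coords = evecs[:, :n_axes] * np.sqrt(np.maximum(evals[:n_axes], 0.0))
    explained = evals[:n_axes] / evals[evals > 0].sum()
    return coords, explained

# coords, explained = pcoa(features)   # explained[:2].sum() corresponds to the
#                                      # "% of total variability" of the first two axes
```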
Analysis of composition of microbiomes
The results of the analysis of composition of microbiomes (ANCOM) [8] are presented in Figure 3. The relative abundances of the features were used for pair-wise comparisons among all the cities, and differentially abundant features were identified based on their significance. The features, which served as predictors, are ordered on the right by the number of pair-wise comparisons in which their relative abundance was significantly different. As presented in Figure 3.A, the top 10 features, i.e., Bacillaceae, Bacillales, Actinomycetales, Sphingomonadaceae, Pseudomonadaceae, Pseudomonas.spp, Sphingomonadales, Lactobacillales, Streptococcaceae, and Enterobacteriaceae, showed the highest counts. For the top feature, Bacillaceae, the count was 43 out of 120, meaning that Bacillaceae was significantly different in 43 of the 120 pair-wise comparisons among the 16 cities. Similarly, in Figure 3.B, the top 10 features were Bacillales, Clostridiales, Pseudomonadales, Staphylococcus.epidermidis, Lactobacillales, Rhodospirillales, Flavobacteriales, Streptophyta, Burkholderiales, and Enterobacteriales. The importance scores derived from RF (Figure 4) showed little change relative to these rankings: from Figure 4.A, Bacillales, Actinomycetales, Sphingomonadales, Sphingomonadaceae, Enterobacteriaceae, and Streptococcaceae were also among the top 10 features. For the mystery dataset, Bacillales, Streptophyta, Rhodospirillales, Enterobacteriales, Lactobacillales, Clostridiales, and Pseudomonadales were in the top 10 of both lists. In summary, for the main dataset, the common “family” and “order” of the OTUs provided the most informative data for predicting the origins of the samples, consistent with the lower error rates of the three methods when using only the common “family” and “order” compared with the common “species”. For the mystery dataset, the ANCOM results and feature importances, combined with the error rates of the three methods, indicated that the common “order” dominated the prediction.
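The two rankings above can be reproduced at a sketch level as shown below. The names `rel_abund`, `cities`, `X`, and `y` are assumed inputs; a Mann-Whitney U test is used only as a hedged stand-in for the per-comparison ANCOM decision (the actual analysis used the ANCOM implementation), and the importance score is scikit-learn's mean decrease in impurity.

```python
import pandas as pd
from itertools import combinations
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestClassifier

def pairwise_significance_counts(rel_abund, cities, alpha=0.05):
    """For each feature, count the pair-wise city comparisons in which its relative
    abundance differs significantly (stand-in test; the actual analysis relied on
    ANCOM's W statistic)."""
    counts = pd.Series(0, index=rel_abund.columns)
    for a, b in combinations(pd.unique(cities), 2):    # 120 pairs for 16 cities
        ga, gb = rel_abund[cities == a], rel_abund[cities == b]
        for feat in rel_abund.columns:
            _, p = mannwhitneyu(ga[feat], gb[feat], alternative="two-sided")
            counts[feat] += int(p < alpha)
    return counts.sort_values(ascending=False)

def rf_importance_ranking(X, y, n_estimators=500, seed=0):
    """Mean-decrease-in-impurity importance scores from a Random Forest.
    X is assumed to be a DataFrame so that feature names carry through."""
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed).fit(X, y)
    return pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
```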