Microbiome studies have successfully detected microbial compositional patterns in health and environmental contexts. Large-scale studies such as the Human Microbiome Project [1], the Metagenomics & Metadesign of Subways & Urban Biomes (MetaSUB) [2] and the Earth Microbiome Project [3] exemplify global efforts to understand microbial presence and abundance in relation to diseases or environmental factors. Recent technological advances have improved the ability both to detect varied species and to estimate their abundance in collected samples. The 16S ribosomal RNA (rRNA) amplicon sequencing approach specifically targets and sequences the 16S rRNA gene of bacteria and archaea, whereas the shotgun whole genome sequencing approach sequences all genetic material present in a sample, potentially allowing identification to the species level, detection of functional units, and concurrent identification of eukaryotes (including fungi) and DNA viruses. Comparisons between 16S amplicon and shotgun sequencing data have been reported previously, but studies disagree on which technique is more robust and captures higher biodiversity [4–7]. Despite the pros and cons of each technique, both have been used successfully to extract meaningful biological information in disease and environmental studies [2,8–12].
Most analytical approaches that use metagenomic sequencing data for sample source prediction have so far relied on supervised classification methods, such as support vector machines and random forests, to assign one of the trained source labels to an unknown sample [9,10,13,14]. Delgado-Baquerizo et al. examined the soil microbiome, found high variability in relative abundance across geographical locations, and used random forest modeling to predict habitat preference for dominant phylotypes [9]. In the Earth Microbiome Project, random forest models were built to distinguish samples by environmental factors, including association with plants or animals and the presence of salinity [10]. To identify potentially mixed sources, SourceTracker [15] uses a Bayesian approach to estimate the proportions of source environments in a sample without assuming a single source label. In the 2018 Critical Assessment of Massive Data Analysis (CAMDA) challenge, supervised classification approaches were applied to predict sample source from urban microbiome data with accuracies of up to 0.91, where the unknown samples came from the same origins as the training samples [13,16–18].
The objective of the 2019 CAMDA metagenomics forensic challenge was to use urban microbiome data to predict the locations of samples from new origins that had not been sampled previously (Figure S1). Classification models can only predict origins from which training samples were collected, and hence can never predict a new origin. To predict new origins, one alternative approach is to model geographic coordinates, inspired by a previous report on the association between human genetics and geographical location [19]. Associations between latitude and microbial composition have been reported in various contexts [20–23]. Richness and diversity of planktonic marine bacteria and of the microbiome of ambulances in the USA were found to be inversely correlated with latitude, a pattern called the “latitudinal diversity gradient” [21,22]. Using human gut microbe data from 23 populations, Suzuki et al. found that Firmicutes and Bacteroidetes correlated significantly with latitude, positively and negatively, respectively [23]. Fisman et al. reported a correlation between bloodstream infection from Gram-negative bacteria and proximity to the equator, measured by latitude squared [20].
Given the availability of both 16S rRNA amplicon and shotgun data, we first compare organism abundance between datasets generated with the two sequencing technologies, and evaluate different analytical approaches on a subset of samples. To attribute samples to a new geographic origin, we model longitude and latitude as the outcome variables and evaluate multivariate regression, with Lasso regularization to avoid overfitting, for predicting new sample origins. Subsequently, we compare prediction performance between multivariate regression and multiclass classification models on the mystery data from new origins. Lastly, we report a computational approach that identifies whether a sample comes from a new or a trained origin by applying Simpson’s diversity index to the classification probabilities.
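As an illustration of the last idea, the following minimal sketch computes a diversity index over a vector of class probabilities. It assumes the Gini-Simpson form, 1 − Σ pᵢ², which may differ from the exact formulation used in this work; the function names, the example probability vectors, and the cutoff of 0.5 are all hypothetical and chosen only for illustration. The intuition is that a classifier that spreads its probability mass across many trained origins (high diversity) may be looking at a sample from an origin it has not seen.

```python
def simpson_diversity(probs):
    """Gini-Simpson index, 1 - sum(p_i^2), of a class-probability vector.

    Assumption: this is one common form of Simpson's diversity index;
    the form used in the paper may differ.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Renormalize so the index is well defined even if the input
    # does not sum exactly to 1.
    normalized = [p / total for p in probs]
    return 1.0 - sum(p * p for p in normalized)


def looks_like_new_origin(probs, threshold=0.5):
    """Flag a sample as a possible new origin when diversity is high.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return simpson_diversity(probs) > threshold


# Hypothetical classifier outputs over four trained origins.
confident = [0.94, 0.02, 0.02, 0.02]  # mass concentrated on one origin
diffuse = [0.25, 0.25, 0.25, 0.25]    # mass spread evenly

print(simpson_diversity(confident), looks_like_new_origin(confident))
print(simpson_diversity(diffuse), looks_like_new_origin(diffuse))
```

In practice, a cutoff of this kind would need to be calibrated on held-out data rather than fixed in advance.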