Experimental design and aim of study
The main aim of this observational cohort study was to use shotgun metagenomics to characterise the gut bacteriome, DNA virome and antibiotic resistome of two highly divergent populations in Central India; rural agriculturalists in Melghat and an urban population in Nagpur. We also sought to investigate comparative differences in microbiome profiles in subjects with and without diarrhoea, including the impact of CDI.
Human participants
Inclusion and Exclusion Criteria
During participant selection, inclusion criteria were (i) adults aged from 18 to 70 years who could provide written or thumb-print acknowledged informed consent, (ii) HIV, hepatitis B or C negative, and (iii) not pregnant or breast-feeding.
For the diarrhoeal group, a presumptive diagnosis of infective diarrhoea was defined as 3 or more loose stools in a 24-hour period accompanied by other gastrointestinal symptoms such as nausea, vomiting, abdominal cramps, tenesmus, bloody stools, or fever (oral temperature ≥38oC). All subjects in the C. difficile-infected group had diarrhoea and a positive stool C. difficile (enzyme immunoassay) for toxin.
The exclusion criteria for this group were (i) any individual with a known non-infectious cause of diarrhoea such as inflammatory bowel disease, (ii) those unable to provide a stool sample, (iii) or if the sample is formed stool. For the non-diarrhoeal control group, the exclusion criteria were (i) presence of acute diarrhoea at the time of or within 2 weeks of recruitment or (ii) those unable to provide a stool sample. It was acknowledged that such individuals could be recruited from the in- or outpatient population and could have been exposed to antibiotics in the recent past (within 3 months of recruitment), although ideally not at the time of recruitment.
Immunosuppression was defined as those on prednisolone (>5mg/d), immunomodulators (azathioprine, methotrexate, calcineurin inhibitor) or biologics.
Human Geography - Nagpur
Nagpur is the third largest city of the Indian state of Maharashtra and the 13th largest city by population (2.5M) in India. It is located at the exact centre of the Indian peninsula (zero milestone) and enjoys a tropical savannah climate where temperatures can reach in excess of 48 oC in the summer months. Hinduism is the main religion followed closely by Buddhism and Islam, with smaller contributions from Christianity, Jainism and Sikhism.
Nagpur is an emerging metropolis attracting significant commercial inward investment and is a major education hub in Central India. It is also home to the Central Indian Institute of Medical Sciences (CIIMS). Nagpur was declared open defecation free in January 2018 and is one of the cleanest and most livable cities in India, as a leader in healthcare, green spaces and public transportation. The majority of households have good drinking water and sanitation facilities, and use clean fuel for cooking.
Human Geography - Melghat
Melghat Tiger Reserve, with its diverse flora and fauna, is located in Amaravati district of Maharashtra and is home to approximately 250,000 members of the Korku tribe spread across two talukas, Dharni and Chikaldhara and 300 villages, and extends across 4,000 square km. By road, it is approximately 250 km from Nagpur.
All rural Melghat subjects within the Melghat Tiger Reserve of Maharashtra identify as members of the Korku Scheduled Tribe and practice Hinduism mixed with ancestral worship. The Korku are an Adivasi ethnic group, speak Korku dialect, and are primarily an agriculturalist community of low socioeconomic status, high rates of illiteracy and malnutrition and possess poor access to medical and educational facilities. They live in small huts typically made of mud, grass and bamboo frames which lack an electricity or running water supply or proper sanitation systems and possess unique and distinct cultural knowledge, beliefs, and customs. MAHAN Trust is a non-governmental organisation which provides medical facilities to the tribal population of the Melghat region through its charitable Mahatma Gandhi Tribal Hospital.
Metadata collection (C. difficile prevalence study)
Site-specific project coordinators were assigned to review health records form each participant. Basic demographic details including age, gender, geographic location, hospitalisation exposure, antibiotic usage during and before (within 3 months) of study recruitment, and toxigenic (GDH+ toxin -) and non-toxigenic (GDH+ toxin-) C. difficile detection rates were recorded for peri-urban outpatient, urban in-and outpatients and a rural population.
Metadata collection (Metagenome study)
In addition to the metadata collected for the C. difficile prevalence study, site-specific coordinators also recorded BMI, immunosuppression status, and environmental details: type and location of home dwelling, number in family, drinking water supply, hygiene practices and number and type of domestic animals
Faecal Sample Collection and Storage
All specimens were anonymised and assigned a study code number linked to participant demographic details. Human faecal samples were collected from urban participants with and without diarrhoea that were either in- or outpatients from the Central Indian Institute of Medical Sciences (CIIMS), Nagpur or from other hospitals within a 20 km radius of CIIMS. Similarly, faecal samples were also collected from participants with and without diarrhoea in Melghat with the assistance of research fellows based at the Mahatma Gandhi Tribal Hospital, which hosts a CIIMS satellite laboratory and other neighbouring hospitals within Melghat. Suitable recruits were identified by the research fellows who interacted daily with village healthcare workers to facilitate participant recruitment and sample collection. Up to two samples (3-5 grams each) were collected in UV sterilised dry plastic containers at the time of recruitment from each participant and placed in a cool box. As per the standard operating procedures, all stool specimens were stored at 4oC immediately after collection to avoid enzymatic degradation prior to detection of toxigenic C. difficile and genomic DNA extraction which were performed within 24 hours of sample collection.
Detection of Clostridioides difficile GDH antigen and free toxin in diarrheal stool samples
All diarrhoeal samples in the metagenome study (58/105) were tested for Clostridioides difficile infection (detection of glutamate dehydrogenase antigen and toxins A/B) using the C. DIFF QUIK CHEK COMPLETE-enzyme immunoassay (QCC; TechLab, Blacksburg, VA, USA) in accordance with the manufacturers’ instructions, including the use of appropriate controls as specified in the package insert. Briefly, ~25 ml of stool sample was added to a tube containing the diluent and conjugate and the mixture was transferred to the device sample well. After incubation for 15 min at room temperature, the wash buffer followed by the substrate were added to the reaction window. The results were read after 10 min. The GDH antigen and/or toxins were reported as positive if a clear visible band was seen on the antigen and toxin side of the device display window, respectively, confirming the presence of toxigenic C. difficile as per manufacturer guidelines.
Detection of diarrhoeagenic E. coli virulence genes in diarrhoeal stool samples
Multiplex polymerase chain reaction assays (HiMedia Laboratories Pvt. Ltd., Mumbai, India) were used to detect diarrheagenic E.coli (DEC) virulence genes including eae and bfpA for Enteropathogenic E. coli (EPEC), hlyA for Enterohaemorrhagic E. coli (EHEC), elt for Enterotoxigenic E. coli (ETEC), CVD432 for Enteroaggregative E. coli (EAEC) and est for Enteroinvasive E. coli (EIEC) as per the manufacturers’ instructions, including the use of 2 different species-specific primer sets and appropriate controls as specified in the package insert (see Supplementary Methods for further details).
Faecal DNA extraction
DNA was extracted from 1 to 1.5g of feces and homogenised in lysis buffer (Tris HCl, EDTA, NaCl and SDS). The content was centrifuged at 7,000 x g for 10 min. The supernatant was then transferred to a 1.5mL tube containing a mixture of Isopropanol and Sodium acetate (5M) and incubated at -20oC for 30 min. Following removal of the supernatant the pellet was dried for about an hour. The pellet was suspended in 1X Tris EDTA buffer (pH 8) and incubated at 65oC for 15 min. An approximate equal volume (0.5- 0.7 ml) of Phenol: Chloroform- Isoamyl alcohol (24:1) was added, mixed thoroughly and centrifuged for 10 min at 12,000 x g. The aqueous viscous supernatant was carefully transferred to a new 1.5mL tube. An equal volume of Chloroform-Isoamyl alcohol (1:1) was added, followed by centrifugation for 10 min at 12,000 x g. The supernatant was mixed with 0.6x volume of Isopropanol to aid precipitation. The precipitated nucleic acids were washed with 75% ethanol, dried and re-suspended in 50μL of TE buffer.
Whole-Genome Shotgun (WGS) Sequencing
Sequencing was carried out by Source Biosciences (Nottingham, U.K.). High quality genomic DNA was quantified using Qubit Broad Range (Invitrogen, U.K.) and prepared for Illumina paired end sequencing following the TruSeq DNA Nano manufacturers protocol (Rev D, June 2015) (Illumina Inc, San Diego, U.S.A.). The DNA was sequenced using a standard HiSeq 4000 150bp PE flowcell. Raw data has been submitted to the European Nucleotide Archive under the accession number https://www.ncbi.nlm.nih.gov/bioproject/PRJNA564397
Generation of taxonomic, resistome and functional profiles from metagenomic shotgun data
Raw Fastq files (average 13,410,735 reads per sample) were assessed for quality using skewer [41], trimming adaptor reads and regions of quality below a phred of 30. The filtered reads (average 10,635,653 reads per sample) were then assessed for taxonomic assignments using Metaphlan2 [42] and for the presence of antimicrobial resistance genes using ARIBA [43] with the MegaRes database [44].
Functional analysis was performed using MOCAT2 (v2.1.3) [45]. Briefly, trimmed and filtered reads were assembled into contigs with SOAPaligner (v2.21). These contigs are initially corrected for indels and chimeric reads using BWA (v0.7.5a-r16) and screened against the human hg19 reference to filter out reads which originated from the host using USEARCH (v5/v6). Genes were predicted using Prodigal (v2.60). Single copy marker genes are extracted using fetchMG (v1.0) and clustered using CD-HIT (v4.6). The gene catalogues were annotated using DIAMOND (v0.7.9.58) against multiple functional databases including eggNOG [46] and KEGG [47]. The abundance of genes annotated to specific KEGG orthologs (KO) was determined using the insert mm dist among unique norm setting in MOCAT2, normalising by read length and sequencing depth and allowing for multiple mappers.
Analysis of taxonomic contributions to functional shifts
Functional shifts between groups and predicted taxonomic contributions were calculated using the FishTaco package [48], taking the species-level taxonomic table produced by Metaphlan2 and the normalised KO abundance table from MOCAT2 as inputs. Only 49 taxa which exceeded a minimum proportional abundance of greater than 0.1 in any single sample were included in the final model. Enriched pathways were identified using the Wilcoxon rank sum test at FDR corrected p<0.05. Taxonomic contributions were predicted by de novo inference in FishTaco, inferring genomic content through a permutation-based approach and performing a total of 50 permutations per differentially abundant pathway.
For comparison of gene copy number for enriched metabolic pathways, KO gene copy numbers for 8 gut-associated annotated reference genomes were obtained from the Integrated Microbial Genomes and Microbiomes (IMG) database [49] as follows; Prevotella stercorea DSM 18206 (IMG: 2513237318), Prevotella copri CB7 DSM 18205 (IMG: 2562617166), Eubacterium rectale DSM 17629 (IMG: 650377936), Ruminococcus bromii L2-63 (IMG: 650377966), Escherichia coli UM147 (IMG: 2728369554), Klebsiella pneumoniae YH43 (IMG: 2687453226), Bacteroides vulgatus mpk (IMG: 2687453192), Parabacteroides distasonis 2b7A (IMG: 2660238380). KO gene copy numbers associated with each enriched metabolic pathway were aggregated to yield overall pathway gene counts.
Detecting viruses in whole community metagenomic shotgun data
Sequencing reads were processed using Trimmomatic (version 0.36) [50], to remove Illumina adaptors and prune sequences where the Phred score dropped below 30 across a 4bp sliding window. All surviving reads less than 70bp were discarded. Fastq reads were assess pre- and post-processing using fastqc [51]. Both the paired and unpaired, forward and reverse reads from samples were assembled individually using metaSPAdes (version 3.11.1) [52]. Only contigs greater than 1,000bp were examined further.
Two approaches were employed to find viruses within whole community metagenomic assemblies. A standard reference-based similarity search was performed to detect sequence relatedness to known viruses, while a reference-independent approach was undertaken by searching for sequences which encode a high density of viral proteins. For the reference-based search, nucleotide sequences were queried locally using BLAST (version 2.6.0+) [53] against the viral RefSeq database (version 89; E-value 1E-10) [54], the complete Reference Viral Database (C-RVDB version 14.0; E-value 1E-05) [55], and 249 crAss-like phages previously described as the human gut’s most abundant virus (E-value 1E-05) [56].
For the reference-independent approach, proteins for all contigs were predicted using Prodigal (version 2.6.3) [57] with the ‘meta’ option enabled for small contigs and Shine-Dalgarno training bypassed. Proteins were subsequently queried against the prokaryote Viral Orthologous Groups database (pVOGs) [58] using HMMER (version 3.1b2) [59], with a minimum score requirement of 15. Putative reference-independent discovered viruses needed to fulfil three basic requirements: (i) ≥1.5kb, (ii) encode 2 distinct proteins with similarity to 2 unique pVOGs, and (iii) encode ≥2 pVOGs per 10kb-equivalent genome length. Additional stringent dynamic filtering was applied to contigs based on their actual genome length. For contigs <5kb, it was required that there were at least ≥5 distinct pVOG hits; contigs ≥5kb and <10kb, ≥6 pVOG hits; contigs ≥10kb and <20kb, ≥7 pVOG hits; contigs ≥20kb and <40kb, ≥8 pVOG hits; contigs ≥40kb and <60kb, 9 pVOG hits; and contigs ≥60kb, 10 pVOG hits.
All putative viral contigs detected using the reference-dependent and -independent methods were pooled and made non-redundant as follows: following a BLASTn all-v-all, the larger of two contigs were retained when the blast identity and coverage between two sequences exceeded 90%. Subsequently, any putative virus encoding a ribosomal protein (BLASTp, E-value 1E-10) was removed from further analysis. This was performed for stringency despite recent research showing specific viruses can encode ribosomal proteins [60]. In addition, any contig encoding a protein with similarity to all available Pfam sequences (version 32.0) plasmid replication proteins PF01051, PF01446, PF01719, PF04796, PF05732, and PF06970, were removed (HMMER, score 15).
Viral contigs were grouped into Viral Clusters (VCs) using vContact2 (version 0.9.8) [22], implemented through the CyVerse Discovery Environment. Protein clusters were identified amongst VCs using default settings (Diamond, E-value 0.0001), and with the inclusion of known viruses (Bacterial and Archaeal Viral RefSeq 85, with ICTV and NCBI taxonomy). Following vContact2, only viral clusters that contain viral sequences from two or more of the study’s complete cohort (n=105) were analysed further. This was designed to remove singleton and spurious viral sequences that may be transiently associated with diet, but are not abundant or stable components of the faecal microbiome. The final Indian faecal virome was visualised as a network through Cytoscape (version 3.7.1) [61], with viral sequences as nodes and shared protein clusters as edges. The edge distance between connected viruses is calculated by vContact2 as their ‘interaction’.
Discerning differences in virome diversity and abundance
Quality filtered reads, both paired and unpaired, were mapped onto the final Indian faecal virome using bowtie2 in ‘end-to-end’ mode (version 2.3.4.1) [62]. The read alignment outputs were converted to sorted bam files through samtools (version 1.7) [63]. The abundance and breadth of coverage of reads mapping to each contig was determined using the bedtools coverage function (version 2.26.0) [64]. Subsequently, in order to determine if a viral sequence was indeed present in a faecal virome, a breadth of coverage filtering was applied. This was designed to remove viruses where potentially 100s of reads could map onto a single conserved region. Therefore, for viral sequences ≤5kb, 75% of the genome needed to be covered by aligned reads; sequences >5kb and ≤50kb, 50% of the genome needed to be covered; and >50kb, 25% of the genome needed to be covered.
In addition to 105 faecal metagenomes, two negative control samples (water) were sequenced. While these samples contributed no contigs to the final Indian faecal virome, the breadth of coverage of sequencing reads from these samples was used to remove potential contaminant sequences. Any viral sequence, from any sample, which ‘passed’ the breadth of coverage filtering using reads derived from either water sample were removed from further analysis.
Any viral sequence from a faecal microbiome sample which failed the breadth of coverage filtering was recorded as zero reads, while if the filtering step was passed, the observed number of reads aligned were used to populate the read count matrix. Due to differences in sequencing depth between samples, the read count matrix was normalised per sample using the DESeq2 ratio of means method [65]. The reads aligned to individual viral sequences were aggregated by their vContact2 determined VCs. DESeq2 was subsequently used to calculate the VC changes between cohorts. The normalised VC read count matrix was used to determine the diversity and statistical differences observed between Indian faecal microbiome cohorts (see ‘Statistical Analyses’ below).
Determining phage-host pairs and viral encoded functions
CRISPR spacers from bacterial contig assemblies were predicted using PILER-CR (version 1.06) [66]. Putative CRIspr spacer predictions <20bp and >100bp were discarded. The CRISPR spacers were queried locally using BLASTn against all individual viral sequences which formed the Indian faecal virome VCs. Due to the use of short nucleotide sequences, only CRISPR spacers with an E-value ≤0.001 and ≤1 mismatch were considered as significant. In order to determine the taxonomy of the original assembled bacterial contigs, or the pre-assembled contigs from the Pasolli et al. (2019) study [67], contig kmer MinHash sketches were queried against JGI taxonomy server using the BBMap sendsketch function (version 38.44) [68].
The functions associated with Indian faecal viruses were determined using eggNOG-mapper v1 (online submission portal) using the eggNOG 4.5.1 database [46]. For each VC, the largest viral sequence was chosen as a representative of that VC. In order to avoid the confounding effect of viral abundance fluctuations within the faecal microbiome, the relative abundance of VCs observed at the specific sampling time-point were not taken into consideration. Only the overall presence-absence and abundance of viral-encoded functions were considered. The similarity between virome-encoded functions, with respect to presence-absence, were assessed through PCoA using the Jaccard index. The abundance of specific metabolic genes were compared between cohorts, with statistical difference determined by the Mann-Whitney U test with Bonferroni correction using the ‘ggpubr’ compare means function in R.
Statistical Analyses and Graphic Generation
All statistical analyses were conducted in R (64-bit, version 3.6.0; Foundation for Statistical Computing, Vienna). The package ‘vegan’ was used for measures of taxonomic diversity including alpha diversity (Inverse Simpson Index) and beta diversity (Principal Coordinates Analysis with Bray Curtis Dissimilarity and Jaccard Similarity). Differences in alpha diversity between study groups was assessed by ANOVA with Tukey’s honest significance test. The contribution of categorical variables to beta diversity was tested for using the Adonis function (PERMANOVA) in vegan. Generalised linear models assuming a negative binomial distribution were used to identify differentially abundant taxa between study groups as implemented in the R package ‘mare’. Hierarchical clustering of resistance gene abundances and heatmap generation was performed with the package ‘heatmap3’. For comparison of resistance gene and metabolic pathways counts between groups, the Mann-Whitney U test was used. All p values obtained from testing with multiple comparisons were corrected for false discovery rate (FDR, Benjamini-Hochberg). The fold changes observed in the relative abundances of VCs across geographical and diarrhoeal status cohorts were calculated using the ‘gtools’ package in R. Using the same package, the fold changes were converted to log ratios (base 10). All graphical images were generated using ‘ggplot2’.