Cohort recruitment and sample collection
Participants were recruited in the Oncology Department of Beijing Hospital, Beijing, China, in 2018. This study was approved by the Beijing Hospital Ethics Committee. Written informed consent was obtained from all patients. The exclusion criteria included: with a history of inflammatory bowel diseases; having been exposed to probiotics, prebiotics or broad-spectrum antibiotics within 30 days; or having received nasal-tube feeding or parenteral nutrition in the month prior to initiation of the study. Detail information about all patients were listed in the Supplementary Table S1. We collected two faecal samples from each participant. A faecal sample was collected on hospital in-patient admission (Day0), prior to administration of chemotherapy, and 30 days later immediately prior to chemotherapy (Day30). For each faecal sample, one gram of stool was collected into a sterile tube and then subsequently stored at -80 °C for molecular analysis. Total genomic DNA was extracted from 200–500 mg of fecal samples using a QIAamp DNA Stool Mini kit (Qiagen, Germany) according to the manufacturer’s instructions. The concentration of the extracted DNA was measured by a NanoDrop2000 (Thermo Scientific, USA), and then the DNA was stored at -80 °C.
16S ribosomal RNA sequencing and bioinformatics analysis
To develop the 16S rRNA amplicon libraries, the V3-V4 region of 16S rRNA gene was PCR-amplified with the primers 341F (5’-CCTAYGGGRBGCASCAG-3’) and 806R (5’-GGACTACNNGGGTATCTAAT-3’), modified by adding barcodes for multiplexing [39]. Pooled amplicons were paired-end sequenced (PE 2 × 250 bp) by using an Illumina HiSeq 2500 platform according to the manufacture’s protocol. Paired-end reads were merged and quality filtered using FLASH and QIIME [40]. Chimeras were detected and removed against the Gold reference database using the UCHIME algorithm [41]. Sequences with ≥ 97% similarity were assigned to the same OTUs. For each representative sequence, the GreenGenes Database was used based on RDP classifier algorithm to annotate. Alpha diversity analysis including Shannon and Chao1 were calculated with QIIME. Phylogenetic beta diversity distances, including unweighted and weighted UniFrac distances were measured using QIIME. Principal-Component Analysis (PCoA) was performed to visualize by ggplot2 package in R software (Version 3.5.0). Further, diversity analyses, such as Adonis and Anosim, were performed by running a workflow on QIIME.
Shotgun metagenomics sequencing library construction
For shotgun metagenomics sequencing, adaptive focused acoustics (Covaris) was used to shear a standard volume of 50 µl of the extracted DNA. DNA libraries were prepared using Illumina TruSeq Sample Preparation Kit (Illumina, USA) according to the manufacturer’s protocol. Libraries were quantified using an Agilent Bioanalyzer 2100 (Agilent Technologies, USA). Paired-end sequencing (PE 2 × 150 bp) was performed on successful DNA libraries using an Illumina HiSeq X-Ten instrument at the Annoroad Genome Biotech at Beijing. Finally, we obtained 4.27 GB to 8.9 GB of raw data for each sample (average 5.98 GB, Supplementary Table S2).
Metagenomics bioinformatics analysis
Concatenated and human DNA sequences were removed using KneadData with default parameter (http://huttenhower.sph.harvard.edu/kneaddata). The remaining high-quality reads were assembled with IDBA-UD (Version: 2.04, parameters: --pre_correction --min_contig 200) [42]. We mapped the clean reads against scaffolds using Bowtie2 (Version: 2.3.4.2) [43]. Genes (minimum length of 100 nucleotides) were predicted on contigs longer than 500 bp and annotated using Prokka (Parameters: --metagenome) [44]. The gene abundance was determined by using the method similar to RPKM (reads per kilo bases per million reads) used for quantifying gene expression from RNA sequencing data. In brief, high-quality reads were counted with HTSeq-count [45]. For each gene, Gi, the number of reads that aligned to it divided by the length of the gene was calculated as NGi and the relative abundance RNGi, of each gene in each sample (n genes) was computed using the following formula:
Species-level quantitative taxonomic profiling was performed using MetaPhlAn2 (Version: 9760413b180f) on the KneadData-filtered reads [46]. Community composition was calculated with MetaPhlan2 using the default settings. Taxonomic profiles included bacteria, archaea, microbial Eukaryotes, and viruses were inferred by MetaPhlAn2 using the 1M unique clade-specific marker genes identified from 17,000 reference genomes (13,500 bacterial and archaeal, 3,500 viral, and 110 eukaryotic).
Functional annotation
Two different methods were used for functional alteration and pathway composition analysis. First, pathway-level composition was calculated with HUMANn2 using the UniRef90 database with default settings, which was followed by further statistical analysis and visualization in STAMP [18]. In order to obtain the detail information of each metabolic pathway, all genes in our genomes were aligned to the KEGG database using DIAMOND (Version 0.7.9.58, parameter: -k 50 -sensitive -e 0.00001) and KOBAS [47]. Each protein was assigned to the KEGG Orthology (KO) families by the highest scoring annotated hits containing at least one HSP scoring over 60 bits. The abundance of KEGG orthology / module was calculated using the methods as mentioned above. For KEGG orthology enrichment analysis, significant enriched KEGG pathway was defined as the adjusted P-value (< 0.05) of the hypergeometric test as the previous study [48].
Generation of a reference genome set
We followed the previous method to build the reference genomes to study the genomic variation of gut microbiota [23]. About 3934 prokaryotic genomes were downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria) on August 1 2018. A set of 40 universal single copy marker genes was identified in these genomes using HMM profiles made for each marker gene from the corresponding orthologous group from eggnog database [49, 50]. For each marker gene, pairwise DNA sequence identities between all genomes were calculated using BLASTn (Version 2.2.25, parameter: e-value < 10− 5 -F F) [51]. For each genome pair, the median identity of all marker genes was used as a proxy for average nucleotide identity (ANI) between the two genomes. Using an operational 95% ANI recommended for identifying species, we generated 929 clusters of genomes. In order to select a reference genome from each cluster, high quality reads from a subset of shotgun metagenomes (bgi-BGI-06A, bgi-DLF001, bgi-DOF002 from T2B cohort, nHF611710, nHF411719 from HTN, HD-2, HD-31 from liver, 1–1, 4 − 1, 5 − 1 from this study) were mapped to the 3934 genomes using Bowtie2 with the options “--very-fast”. Then, the genome with the highest read coverage was selected resulting in a set of 958 reference genomes, each likely representing a unique species (see Supplementary Table S8)
Mapping to reference genomes and variant calling
Illumina reads from 718 fecal samples, including the T2D, Liver and Hypertension cohorts were quality controlled using a KneadData. KneadData-filtered reads were mapping to reference genomes using Bowite2 (Version: 2.3.4.2) with default parameter. The number of reads mapping to reference genomes were counted and normalized by the genome size in order to obtain quantitative relative abundances of each genome in every sample.
To improve the accuracy of genomic variation, we modified the protocol of previous study [23]. First, the SAMtools/BCFtools suite was used for calling high quality InDel supported by more than 5 reads and at least one read on each strand [52]. We then performed local realignment of the aligned reads with those high-quality InDel using the Genome Analysis Toolkit (GATK) [53] to minimize the mismatch bases. GATK’s HaplotypeCaller was used to call variants with default parameter. BreakDancer was used to detect the structural variation with at least three reads supported the event [54].
Positive selection analysis
CDSs were annotated based on their protein family membership using KOBAS [47]. With MUSCLE [55], a multiple sequence alignment (MSA) of the protein sequences was created using each KEGG Orthology. Based on the MSA and the CDS nucleotide sequences, a codon-based alignment was constructed for each KO module with PAL2NAL using default parameters [56]. We then used FastTree, a relaxed neighbour joining algorithm, to reconstruct a phylogenetic tree for each protein family from the obtained MSA [57]. We excluded low-confidence positions in the alignment with a large number of gaps with Gblocks [58]. Dn/Ds was calculated with PAML [59]. A one-sided Fisher's test was performed to identify protein families with a significant enrichment of Dn versus Ds changes in comparison to the entire sample. The false discovery rate (FDR) was controlled using the Benjamini and Hochberg procedure and alpha set to 5%.
in vitro experiment to validate the mutation pattern shift
A filter-sterilized stock solution of 100 mM 5-FU, Oxi and 5-FU + Oxi (1:1) (Sigma Aldrich) were prepared in DMSO and further diluted to 50, 75 and 100 mM. Stock solutions were further diluted (1:1000) in culture medium for the experiments. Four mono-isolated gut microorganism were obtained from the stools of recruited patients. LB agar plates used for passaging were prepared by adding stock 5-FU, Oxi and 5-FU + Oxi. Subsequently, cultures were transferred into fresh plates and allowed to grow 48 h for each batch. After 20 batches, single colony on each plate were then scraped off and transferred to Eppendorf tubes with liquid LB culture and cultured overnight at 37 °C. Total DNA was then extracted using the QIAamp UCP Pathogen Mini Kit (Qiagen, German). The concentration of the extracted DNA was measured by a NanoDrop2000 (Thermo Scientific, USA). Illumina DNA sequence libraries were prepared as the mentioned above. JSpeciesWS was used to assign taxonomy of each isolated strains [60]. Reads were then mapped to the closest reference genomes with BWA and variant calling was performed as mentioned above.
Statistical analysis
The Wilcoxon signed-rank test was performed to evaluate differences in richness and diversity for KEGG and microbial taxa at various taxonomic ranks. Mutation rate comparisons were conducted to determine statistical significance between two groups for incidences where differences were declared significant at P-value < 0.05 with the Student’s T-test.
Data Availability
All sequences generated in this study are available in the NCBI sequence read archive under the accession numbers PRJNA551354.