2.1 Animals Used in Experiments
The current investigation involved female Balb/c mice that were 8 weeks old (Beijing HFK Bioscience Co., Ltd., Beijing, China). In accordance with standard laboratory procedures, the animal models were established in Beijing Viewsolid Biotechnology Co., Ltd., and all animals were housed at the Beijing Hospital of Traditional Chinese Medicine (22-24 ℃, 40-60% relative humidity) food and water were freely available, and the light/dark-cycle was maintained for 12/12 hours, with the light turned on at 6:00 am. The selenium-enriched microalgal protein contained Se (VI): N. D., Se (IV):14.975, SeCs2:173.433, MeSeCys:14.468, SeMet:5.104, GSSeGS:18.249(μg/g), was provided by Enshi Zaoyuan Selenopeptide Biotechnology Co.Ltd.
2.2 Selenium Therapy in An Animal Model
Female Balb/c mice were divided into 2 groups (n=6 in each group) according to their body weight, as shown below: (I) a high-fat diet (45% kcal fat, 35% kcal carbohydrates, and 20% kcal proteins) was provided to the 4T1 + high-fat diet group (control group); (Ⅱ) 4T1 + selenium + fat diet group was fed with a high-fat diet and selenium-enriched microalgal protein 0.4mg/kg intragastric injection administration(IG) daily (selenium group). All mice were subcutaneously xenografted with 1 × 105 4T1 cells/50 µl each in the right fourth mammary fat pad under anesthesia. The observation period began with the injection of 4T1 cells for 4 weeks.
The selenium-enriched microalgal protein was dissolved in deionized water make out to a product with a stock solution that could be used to prepare treatment media at intragastric injection administration. The volume of the IG solution was performed at 10ml/kg according to mice body weight.
2.3 Sample Collection and DNA Extraction
Stool samples were collected in a specimen collection kit and stored at -20°C immediately after defecation and then at -80°C before further manipulation in the laboratory. DNA was extracted from samples using the stool DNA extraction Mini Kit. 0.2g fecal sample was added to Glass Beads Tube I, after which 600 µl Buffer STL and 20 µl Proteinase K were added. Samples were vortexed for 5min at maximum speed and further lysed by heating at 70 ° C for 15 min, after the samples had returned to room temperature, 200 µl Buffer IRP was added to the samples, which were vortexed thoroughly to mix. After the sample was centrifuged at 13,000g for 3 min on ice for 5 min, 450µl of supernatant was taken into a 2 ml centrifuge tube, after addition of 450µl Buffer MBL invert 3~4 times, vortex for 15s, 55˚C for 10min, during which mix several times by inversion, add 450µl absolute ethanol, vortex for 15s and centrifuge briefly. The extracted supernatant was purified following the instructions of DNA Extraction Mini Columns II to obtain sample DNA and stored at -20°C. DNA was fragmented to a fragment size of 200-300 BP for library construction, paired end sequencing was performed using PE150, and metagenomics was performed on the Illumina HiSeq platform following the manufacturer's instructions. Sequencing instrument model: novaseq6000.
2.4 Quality Control of Integrated Data
All raw metagenomics sequencing data were quality-controlled by MOCAT2 software(Kultima et al., 2016). All the original sequencing reads were disjointed by Cutadapt software(Kechin et al., 2017)and were trimmed by SolexaQA package with a quality of less than 20 and a length of less than 30 bp(Cox et al., 2010). Clean reads were obtained by quality control. We used SOAPaligner to compare the filtered reads with host reads whose genome was decontaminated to obtain high-quality clean data(Li et al., 2009).
2.5 Microbial Community Structure and Diversity Analysis
The CleanReads were used to be the input data and we used MetaPhlAn3(Segata et al., 2012)to make a Classification of species,thus the relative abundance of the sample bacteria from species level to phylum level was obtained.
2.5.1 Species Accumulation Curves
The Species Accumulation curve describes how species increase as the sample size increases condition, which can be used to determine the adequacy of sample size and estimate species richness. The species accumulation curve can not only judge whether the sample size is sufficient, but also predict the species richness if the sample size is sufficient. We used specaccum in R package Vegan to calculate the cumulative species curve Based on the species annotation results of MetaPhlAn3(L. Simpson, 2017). Alpha diversity includes richness, evenness and diversity Shannon index, which can be used to describe genes and functions in samples and the diversity of species. We used diversity in R package Vegan Function evaluates these indices based on the species annotation results of MetaPhlAn3. The clean reads obtained from quality control were assembled by De Novo using SOAPdenovo to obtain scaftigs with a length greater than 500 bp: Reads were interrupted into K-mer to construct de Bruijn graph and Eulerian Path was searched to assemble into contigs. Then scaffolds are connected according to the location relationship of pin-end reads, we select successive contigs from scaffolds to get scaftigs. Among them, k-mer of each sample was calculated by MOCAT2 according to the length and quantity of sample reads. Based on scaftigs obtained by assembly, we used MetaGeneMark for gene structure prediction, and CD-HIT was used for clustering to eliminate redundancy after gene set was obtained (Fu et al., 2012). If the sequence consistency of two genes is greater than 95%, and the overlap area covers more than 90% of the short sequences, then the two genes will be clustered into a cluster. The longest sequence in the cluster was selected as the representative sequence of the cluster to construct a non-redundant gene set with gene length greater than 100bp. Finally, we compared high-quality reads to the constructed non-redundant reference gene set by BWA, and the reads with length less than 30bp and consistency less than 95% were removed to obtain the reads count of each gene. We downsized count data by using the Rrarefy function in the vegan package of R language to obtain the corrected gene abundance.
2.5.2 Dimensionality Reduction Cluster Analysis
Principal component analysis (PCA), a simplified analysis of data that effectively finds the most "dominant" elements and structures in the data, is based on the original species composition matrix derived from MetaPhlAn3 using the RDA function of R language and the eigenvalue decomposition method. The differences of multiple sets of data are reflected on the two-dimensional principal component diagram. The more similar the sample composition is, the closer the distance is in the PCA diagram. Principal co-ordinates analysis (PCoA) is a visualization method to study data similarity or difference, which is similar to PCA analysis. The difference is that PCoA uses the Bray-Curtis distance. We used the PCoA function of R language to sort the eigenvalues and eigenvectors, and selected the eigenvalues ranked at the top to represent the distance difference between samples. Permutational multivariate analysis of variane (PERMANOVA) was used to assess whether there were significant community differences among different groups. Among them, PERMANOVA was calculated by adonis2 function in the Vegan package, and the P value was obtained by 1000 permutations. Nonmetric Multidimensional Scale analysis (NMDS) is a data analysis method which can simplify the study objects (samples or variables) in multidimensional space to a low-dimensional space for positioning, analysis and classification, while preserving the original relationships between objects. Based on Bray-Curtis distance, we used metaMDS function of R language for NMDS analysis. In addition, PERMANOVA was used to evaluate whether there were significant community differences among different groups.
2.5.3 Enterotypes Analysis
Enterotypes are clusters divided according to the composition of microbial communities, which is a very effective method to distinguish human intestinal microbes. JSD distance was calculated according to the genus level abundance, and CH index was used to calculate the optimal cluster K value after clustering by PAM method. Finally, Between Class analysis (BCA) was used for visualization.
2.5.4 Analysis of Differences between Group
Multiple Response Permutation Procedure, often combined with dimension reduction analysis such as PCoA and NMDS, is a statistical method used to test whether the differences between groups (two or more groups) are significantly greater than the differences within groups. Based on the relative species abundance data of MetaPhlAn3, we used the MRPP function of R package Vegan to compute A semi-metric matrix such as Bray-Curtis to evaluate the differences between and within groups, and finally performed significance analysis of the groups using the substitution test to obtain the P value. Based on the species composition results obtained from metaphlan3, we scored the units that were significantly different by comparing the relative abundances between the two groups using the Wilcoxon rank-sum permutation test, set the P value at 0.05, and corrected the P value using the Benjamini-Hochberg method. Comprehensive antibiotic resistance database (CARD), the core of which is ARO, the antibiotic resistance ontology, which contains 4,498 terms related to antibiotic resistance genes, resistance mechanisms, antibiotics, and targets. In addition, 2,984 reference sequences, 1,337 SNPs, 3,030 AMR detection models, and 2685 relevant references (Alcock et al., 2020) were included in the database. We used the CARD (v3.0.8) official software RGI (resistance gene identifier, v4.0.2) to annotate nonredundant gene sets for resistance genes. Evolutionary gene of genes: non conserved orthologous groups, the specialized annotation databases of orthologous groups (Egg-NOG), is a public data resource collecting information on orthologous relationships, gene evolutionary history and functional annotations, which also includes information on functional classification such as COG/KOG. The version5.0 incorporates 5,090 organisms, including 4,445 representative bacterial species, 168 archaeal species, 477 eukaryotic species, and also the whole genome protein sequences of 2502 viruses. Among them, the species belonged to 379 different taxonomic levels, and the corresponding genes were classified into 4.4M orthologous genes (orthologous group, OG) (Huerta-Cepas et al., 2019). The data were generated by aligning non redundant reference gene sets to the Egg-NOG database using EggNOG-mapper V2. Clusters of orthologous groups is a genome scale database for protein functional and evolutionary analysis, we aligned nonredundant reference gene sets to the COG database using Egg-NOG mapper V2 produce the data. Carbohydrate-Active enzymes database (CAZy) is a database specifically designed to describe enzymes of the degradable, modified or synthetic glycosidic bond class, which contains information on the species origin of carbohydrate enzyme classes, enzyme function EC classification, gene sequences, protein sequences and their structures. Annotation of carbohydrate active enzymes was performed using EggNOG-mapper V2 by aligning non redundant reference gene sets to the CAZy. MvirDB integrates the sequence information of DNA and protein in Tox-Prot, SCORPION, the PRINTS virus factors, VFDB, TVFac, Islander, ARGO and VIDA (part), and can provide rapid information retrieval of protein toxins, virulence factors, antibiotic resistance and other related genes for biological defense and medical fields. However, the database had been abandoned, and we obtained the 2015 updated sequence files as well as annotation information for the site using wayback machine. Annotation of biodefense related genes was performed by aligning non redundant reference gene sets to the MvirDB database using DIAMOND (v0.7.9.58, parameters: blastp-v-sensitive-k 10) (Buchfink et al., 2015). In this study, after obtaining the alignment results, if there was more than one alignment result for each sequence, one optimal alignment result shall be retained as the gene annotation result. Alignment parameters set expect values E-value as 1e-5. The Pathogen host interaction database (PHI), primarily collects interaction information from fungal, oomycetal, and bacterial pathogens, as well as from infected hosts (including animals, plants, fungi, insects, etc.) that has been experimentally validated. Version 4.8 contained information on 6,780 genes, 13,801 pathogen host interaction relationships, 268 pathogens, 210 hosts, etc(Urban et al., 2017). Annotation information such as pathogen host interactions was obtained by aligning non redundant reference gene sets to the PHI (v4.8) database using DIAMOND. Quorum sensing is a type of signal communication mechanism between microorganisms, also known as cell communication or autoinduction, by secreting, releasing some specific signaling molecules, sensing concentration changes, detecting the density of bacteria, regulating the physiological functions of bacteria, and thus adapting to environmental conditions. We obtained a QS database of 23,225 associated genes by a variety of means such as searching in the Uniprot database using QS related keywords (Barriuso and Martinez, 2018). Annotation of quorum sensing associated genes was performed by aligning non redundant reference gene sets to QS databases using DIAMOND. Transporter classification database (TCDB) details a classification system for membrane transporters known as the transporter classification (TC) system, which is similar to the Enzyme Commission (EC) system, in addition to integrating functional as well as phylogenetic information. These contain a total of 1474 descriptions of transporter families, TC numbers, as well as gene information, and the entire system is divided into 5 tiers(Saier et al., 2016), such as classes, subclasses, families, subfamilies, and transport systems, each corresponding to a letter in TC#. Taxonomic annotation of transporters was performed by aligning nonredundant reference gene sets to the TCDB database (update: 2015) using DIAMOND. Virulence factor database (VFDB)(Chen et al., 2016) is an integrated and integrated web resource database for managing virulence factors. Among them, there were 74 bacterial pathogens (32 with complete annotation information), 1081 virulence factors (576 experimental validation), 951 bacterial species (529 complete genomes), 32522 VF related non redundant genes (3228 experimental validation), and 2612 related literatures. Non redundant reference gene set ratios were individually aligned to the VFDB (update: 2020.4) database using DIAMOND.
2.5.5 DeSeq2 Analysis
The amount of sequencing will be different in different samples, and conventional normalization methods, RPKM, FPKM, etc., are CPM (count per million) normalization according to the length of genes or transcripts, but their disadvantages are easy to be affected by extremely highly expressed and differentially expressed genes in different samples. DeSeq2 was developed to address this concern by using a scale factor that accounts for differences in sequencing depth. First, based on clean reads data, we used Kraken2 (2.0.8-beta) for bacterial species annotation to obtain species count abundances, followed by the DeSeq2 function of the R language for differential species analysis, and the Benjamini-Hochberg method was used to correct P values.
2.5.6 LEfSe Analysis
LEfSe analysis enables comparison between two or more groups to find biomarkers with significant differences in abundance between groups. Based on MetaPhlAn3 based species relative abundance data, LEfSe was used to first detect species with significantly different abundances between different groupings using the nonparametric Kruskal Wallis rank-sum test for two or more groups, and then significant differential species obtained in the previous step were analyzed using the Wilcoxon rank-sum test for groups, Finally, linear discriminant analysis (LDA) was used to dimensionally reduce the data and assess the influence of the significantly different species (i.e., LDAscore).
2.6 Multiple Linear Association Analysis
Due to the presence of other factors affecting the microbiota, such as age, gender, BMI, etc., we analyzed the relationships between these factors and the microbiota as well as the functions of the microbiota using multiple linear association models (MaAsLin). This method employs a generalized linear model, while MaAsLin screens for potential, microbiota related phenotypic variables with a boosting approach due to the sparsity of metagene data. We performed multivariate association analysis with the results of MetaPhlAn3.
2.7 KEGG Pathway Enrichment Analysis
After aligning clean reads to their customized database with HUMAnN and obtaining the relative abundances at the level of metabolic pathways in the gene families from the UniProt Reference Cluster and the MetaCyc database, the resulting data were analyzed using the humann_regroup_table merges annotation results, resulting in KEGG KO annotation results. For relative abundance of KEGG orthology (KO), KEGG pathway enrichment analysis was finally performed with the GAGE package from R/Bioconductor.
2.8 Metabolic Potential Analysis
The microbiota is able to generate and consume many metabolites, and for each sample and metabolite, we used the PRMT (predicted relative metabolic turnover) method to calculate the CMPs (community metabolic potential) after predicting the resulting relative abundances of genes and to estimate the differences in metabolic potential between the different groups (Noecker et al., 2016).