All fecal samples were collected from Dalian Mingwei Marten Industry Company Limited, these wild sables imported from Mohe County of Heilongjiang Province and Greater Khingan Range. When collecting stool samples, we recorded the details of host gender, sample date and cage number. We carried sterile collection device and a portable refrigerator and put the collected samples into the sterile bags, then all fecal samples were immediately frozen in the refrigerator. These samples were categorized into three groups: the five female sable fecal samples were named MZF.1–MZF.5, the four male sable fecal samples were named MZM.1–MZM.4, and intestinal contents were named MZS.1, MZB.1 and MZB.2 (Table 1). Before DNA extraction from fecal samples, we guarantee the freshness of samples and reduce the pollution of the environment on the samples.
Table 1
Table of the information in samples
Species | Sample | Sex | Time |
Sable | MZF.1 | Female | 2017.11 |
Sable | MZF.2 | Female | 2017.11 |
Sable | MZF.3 | Female | 2017.11 |
Sable | MZF.4 | Female | 2017.11 |
Sable | MZF.5 | Female | 2017.11 |
Sable | MZM.1 | Male | 2017.11 |
Sable | MZM.2 | Male | 2017.11 |
Sable | MZM.3 | Male | 2017.11 |
Sable | MZM.4 | Male | 2017.11 |
Sable | MZS.1 | Male | 2017.11 |
Sable | MZB.1 | Male | 2017.11 |
Sable | MZB.2 | Male | 2017.11 |
DNA extraction,library preparation and metagenomics sequencing
The QIAamp DNA Stool Mini Kit (Qiagen, Hilden, Germany) was exploited to extract microbial DNA from the stool of sable. The detection of DNA samples mainly includes two methods. Agarose gel electrophoresis (AGE) was used to analyze the purity and integrity of DNA and Qubit 2.0 (Invitrogen, USA) quantifies DNA concentrations precisely. During the library construction step, qualified DNA samples were randomly broken into a length of about 350 bp fragments with ultrasonic crusher (Covaris, UK), then the fragments go through end repair and A-tailing, ligated with adapters including PCR primer hybridization sites, purified and PCR amplificated. After the library preparation,Qubit 2.0 (Invitrogen, USA) was used for initial quantification and the library was diluted to 2 ng/µl. Subsequently, using Agilent 2100 Bioanalyzer (Agilent, USA) to detect if the insert sizes of the library correspond to expectations. In order to ensure the quality of library, Real-time q-PCR was used to accurately quantify the effective concentration (> 3 nM) of library. After the library passed the inspection, sequencing was implemented on Illumina HiSeq xten platform (Illumina, USA). The raw reads are available at the NCBI Sequence Read Archive (BioProject ID PRJNA630144, SRA SRP265006).
Quality control and genome assembly
After sequencing, raw genome data that contain adapter information n and low-quality bases would interfere with the subsequent analysis. First of all, the original data should be controlled to remove interfering data so as to obtain clean data. There was the possibility of host genome contamination, which needs to be compared with the host gene database to filter out reads that may be host gene (SoapAligner parameter settings: identity ≥ 90%, -l 30, -v 7, -M 4, -m 200, -x 400). Reads with a quality value less than 38 (the default ≤ 40) and N (undetected bases) reached the set number (default set to 10) in the original data were removed. Remove reads if the overlap between adapter and the sequence exceeds a certain threshold (≥ 15 bp). Clean Data was obtained after pretreatment, and the SOAP denovo assembly software was used for assembly analysis (Luo et al. 2012). For a single sample, k-mer = 55 was selected for assembly to obtain the assembly result of the sample (Assembly parameters: -d 1, -M 3, -R, -u, -F) (Brum et al. 2015; Feng et al. 2015; Qin et al. 2014; Scher et al. 2013). The Scaffolds were interrupted from the N-junction to obtain N-free sequence fragments, called Scaftigs (i.e., continuous sequences within scaffolds) (Mende et al. 2012). The Clean Data of each sample was compared to Scaftigs of each sample by SoapAligner software to obtain PE reads (Alignment parameters: -u, -2, -m 200). Put reads of each sample that had not been utilized together, and k-mer = 55 was selected for mixed assembly (Karlsson et al. 2013), other assembly parameters are the same as single sample assembly parameters. The Scaffolds were broken from the N -junction to obtain the Scaftigs sequences without N. For Scaftigs generated by single sample and mixed assembly, fragments less than 500 bp were filtered out (Zeller et al. 2014), and statistical analysis and subsequent genetic prediction were performed.
Gene prediction and abundance analysis
Starting from the Scaftigs of each sample and mixed assembly ( > = 500 bp), MetaGeneMark was used for ORF (Open Reading Frame) prediction (Li et al. 2014), and fragment length less than 100nt was filtered from the prediction results. For the ORF prediction results of each sample, the CD-HIT software was used to remove redundancy, so as to obtain the initial gene catalogue with non-redundancy. Clustering is conducted with identity 95% and coverage 90%, and the longest sequence is selected as the representative sequence (parameters: -c 0.95, -G 0, -aS 0.9, -g 1, -d 0). The Clean Data of each sample was compared with the initial gene catalogue by SoapAligner, and the number of reads of genes in each sample was calculated (Alignment parameters: -m 200, -x 400, identity ≥ 95%). The number of genes supporting reads in each sample < = 2 were filtered out to obtain the final gene catalogue for subsequent analysis (Qin et al. 2012). The abundance information of each gene in each sample was calculated from the number of reads and gene length (Villar et al. 2015). Based on the abundance information of each gene in each sample, basic information statistics, core-pan gene analysis, correlation analysis between samples, and gene number Venn diagram analysis were conducted (blastp,evalue ≤ 1e-5).
Species annotation
DIAMOND software was used to compare unigenes with the sequences of Bacteria, Fungi, Archaea and Viruses extracted from NCBI NR database (Buchfink et al. 2015). Alignment filtering: for each sequence alignment, evalue ≤ minimum evalue * 10 is selected for subsequent analysis. After filtering, since there may be multiple alignment results for each sequence to obtain multiple different species classification information, in order to ensure its biological significance, LCA algorithm (systematic classification applied to MEGAN software) was adopted to take the classification level before the first branch as the species annotation information of the sequence (Huson et al. 2011). According to the LCA annotation results and gene abundance table, the abundance information of each sample at each classification level (genus and species) is obtained. The abundance of a species in a sample is equal to the sum of the gene abundance of the species annotated. Based on the LCA annotation results and gene abundance table, the gene number table of each sample at each classification level (genus and species) was obtained. For a species, the number of genes in a sample was equal to the number of genes whose abundance was not 0 in the annotated species. Krona analysis, relative abundance overview display, abundance cluster heat map display, PCA and NMDS dimensionality reduction analysis, Anosim inter-group (internal) difference analysis, Metastat and LEfSe multivariate statistical analysis of inter-group difference species were conducted from the abundance table of each classification level (genus and species).
Functional database and resistance gene annotation
Currently, the commonly used functional databases mainly include: Kyoto Encyclopedia of Genes and Genomes (KEGG), Evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG), Carbohydrate-Active enzymes Database (CAZy). KEGG database was introduced by Kanehisa Laboratories in 1995 with version 0.1. At present, it has developed into a comprehensive database, the core of which is KEGG pathway and KEGG Ortholog database. In the KEGG Ortholog database, genes performing the same function are clustered together and called Ortholog Groups (KO entries). In the KEGG pathway database, biological metabolic pathways are divided into 6 categories, which are Cellular Processes, Environmental Information Processing, Genetic Information Processing, Human Diseases, Metabolism, Organismal Systems, each of which is systematically classified as layer 2, 3, 4. The second layer currently includes 43 seed pathways, the third layer is the metabolic pathway diagram and the fourth layer is the specific annotation information of each metabolic pathway map. The database of eggNOG uses the smith-waterman comparison algorithm to annotate the Orthologous Groups of constructed genes. EggNOG V4.1 covers 2,031 species and about 190,000 Orthologous Groups are constructed. CAZy database is the study of carbohydrate enzyme professional database, mainly covers six functional categories: GHs (Glycoside Hydrolases), GTs (Glycosyl Transferases), PLs (Polysaccharide Lyases), CEs (Carbohydrate Esterases), AAs (Auxiliary Activities) and CBMs (Carbohydrate Binding Modules). DIAMOND software was used to compare Unigenes with each functional database (blastp, evalue ≤ 1e-5). Alignment filtering: for alignment results of each sequence, the alignment results with the highest score (one HSP > 60 bits) were selected for subsequent analysis. Based on the result of functional annotation and gene abundance table, the number of genes in each sample at each classification level is obtained. The number of genes with a certain function in a sample is equal to the number of genes with a non-zero abundance in the genes with a certain function. Based on the abundance table at each classification level, carry out annotation gene number statistics, relative abundance overview display, abundance cluster heat map display, PCA and NMDS dimensionality reduction analysis, Anosim inter-group (internal) difference analysis based on functional abundance, metabolic pathway comparative analysis, Metastat and LEfSe analysis of inter-group functional difference. Use the Resistance Gene Identifier (RGI) software provided by CARD database to compare Unigenes with CARD database (RGI built-in blastp, evalue ≤ 1e-30) (Qin et al. 2010). According to the comparison results of RGI and the abundance information of unigenes, the relative abundance of ARO was calculated. According to ARO abundance, column diagram of abundance, cluster heat map of abundance, circle diagram of abundance distribution, ARO difference analysis between groups, and species attribution analysis of resistance genes (note to ARO unigenes) were conducted.