All faecal samples analysed in this study were collected at Dalian Mingwei Marten Industry Company Limited and derived from wild sables imported from Mohe County of Heilongjiang Province and Greater Khingan Range. As carnivores, the sables were fed fish and chicken. When collecting stool samples, we recorded information on host gender, sampling date and cage number. Fresh faecal samples from sables were collected aseptically in sterile stool containers and immediately frozen in a freezer. The samples were categorized into three groups, with the five faecal samples from female sables labelled MZF.1–MZF.5, the four faecal samples from male sables labelled MZM.1–MZM.4, and the samples of intestinal contents labelled MZS.1, MZB.1 and MZB.2 (Table 1). Before DNA extraction from the faecal samples, we ensured the samples were used immediately upon removal from the freezer and avoided sample contamination.
Table 1 Table of the information in samples
Species
|
Sample
|
Sex
|
Time
|
Sable
|
MZF.1
|
Female
|
2017.11
|
Sable
|
MZF.2
|
Female
|
2017.11
|
Sable
|
MZF.3
|
Female
|
2017.11
|
Sable
|
MZF.4
|
Female
|
2017.11
|
Sable
|
MZF.5
|
Female
|
2017.11
|
Sable
|
MZM.1
|
Male
|
2017.11
|
Sable
|
MZM.2
|
Male
|
2017.11
|
Sable
|
MZM.3
|
Male
|
2017.11
|
Sable
|
MZM.4
|
Male
|
2017.11
|
Sable
|
MZS.1
|
Male
|
2017.11
|
Sable
|
MZB.1
|
Male
|
2017.11
|
Sable
|
MZB.2
|
Male
|
2017.11
|
DNA extraction,library preparation and metagenomics sequencing
The QIAamp Fast DNA Stool Mini Kit (Qiagen, Hilden, Germany) was used to extract microbial DNA from the sable stool samples. Before library construction, the DNA was evaluated for quality control and quantified. Agarose gel electrophoresis (AGE) was used to analyse the purity and integrity of the DNA, and Qubit 2.0 (Invitrogen, USA) was used to precisely quantify DNA concentration. During library construction, qualified DNA samples were randomly broken into fragments approximately 350 bp in length with an ultrasonic crusher (Covaris, UK). Then, the fragments were end-repaired, A-tailed, and ligated to adapters. After library preparation, Qubit 2.0 (Invitrogen, USA) was used for initial quantification, and the library was diluted to 2 ng/µl. Subsequently, an Agilent 2100 Bioanalyzer (Agilent, USA) was used to determine whether the insert sizes of the library corresponded to expectations. To ensure library quality, real-time q-PCR was used to accurately quantify the effective concentration (> 3 nM) of the library. After the library passed the inspection, sequencing was implemented on an Illumina HiSeq X Ten platform (Illumina, USA). The raw reads are available at the NCBI Sequence Read Archive (BioProject ID PRJNA630144, SRA SRP265006).
Quality control and genome assembly
In metagenomics research, the raw genome data obtained after sequencing include adapter information and low-quality bases, which interfere with subsequent analysis. Therefore, the raw data require quality control to remove interfering data and obtain clean data. Because of the possibility of host genome contamination, we searched the data against a database of host genes to filter out reads from host genes (SOAP aligner parameter settings: identity ≥ 90%, -l 30, -v 7, -M 4, -m 200, -x 400). Reads with a quality value less than 38 (different from the default setting of ≤ 40) and number of Ns (undetected bases) at or exceeding the set number (default set to 10) were removed. In addition, reads with an overlap between the adapter and the sequence exceeding a certain threshold (≥ 15 bp) were removed. Clean data were obtained after these filtering steps, and SOAP denovo assembly software was used for assembly analysis (Luo et al. 2012). For each sample, k-mer = 55 was selected to obtain the assembly results (assembly parameters: -d 1, -M 3, -R, -u, -F) (Brum et al. 2015; Feng et al. 2015; Qin et al. 2014; Scher et al. 2013). The scaffolds were interrupted from the N-junctions to obtain N-free sequence fragments called scaftigs (i.e., continuous sequences within scaffolds) (Mende et al. 2012). The clean data for each sample were compared to the scaftigs of each sample by SOAP aligner software to obtain PE reads (alignment parameters: -u, -2, -m 200). After pooling the clean reads from each sample, k-mer =55 was selected for mixed assembly (Karlsson et al. 2013), the remaining assembly parameters were the same as those used for single sample assembly. The scaffolds were broken from the N-junctions to obtain scaftig sequences without Ns. For the scaftigs generated by both single sample assembly and mixed assembly, fragments less than 500 bp were filtered out (Zeller et al. 2014), and statistical analysis and subsequent gene prediction were performed.
Gene prediction and abundance analysis
Employing the scaftigs for each sample assembly and mixed assembly (>=500 bp), MetaGeneMark was used for open reading frame (ORF) prediction (Li et al. 2014). Fragments less than 100 nt in length were filtered out from the prediction results. For the ORF prediction results of each sample, CD-HIT software was used to remove redundancies, yielding an initial non-redundant gene catalogue. Clustering was conducted with identity 95% and coverage 90%, and the longest sequence was selected as the representative sequence (parameters: -c 0.95, -G 0, -aS 0.9, -g 1, -d 0). The clean data for each sample were compared with the initial gene catalogue by SOAPaligner, and the number of reads of genes in each sample was calculated (alignment parameters: -m 200, -x 400, identity ≥ 95%). The number of genes supporting reads in each sample <=2 were filtered out to obtain the final gene catalogue for subsequent analysis (Qin et al. 2012). The abundance information of each gene in each sample was calculated from the number of reads and gene length (Villar et al. 2015). Based on the abundance information of each gene in each sample, descriptive statistics were calculated, and core-pan gene analysis, sample correlation analysis, and Venn diagram analysis of gene number were conducted (blastp, evalue ≤ 1e-5).
Species annotation
DIAMOND software was used to compare the unigenes with the sequences of bacteria, fungi, archaea and viruses extracted from the NCBI NR database (Buchfink et al. 2015). Alignment filtering was conducted using evalue ≤ minimum evalue * 10 for each sequence alignment for subsequent analysis. Multiple alignment results for each sequence may arise after filtering, yielding different species classification information. Thus, to ensure its biological significance, the LCA algorithm (a systematic classification algorithm applied in MEGAN software) was adopted to assign the classification level before the first branch as the species annotation information of the sequence (Huson et al. 2011). Based on the LCA annotation results and the gene abundance table, the abundance information of each sample at each classification level (genus and species) was obtained. The abundance of a species in a sample was determined as the sum of the gene abundance of the species annotated. For each species, the number of genes in a sample was equal to the number of genes with abundances greater than 0 in the annotated species.
Functional database and resistance gene annotation
Currently, the commonly used databases providing functional annotations mainly include Kyoto Encyclopedia of Genes and Genomes (KEGG), Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups (eggNOG), and Carbohydrate-Active Enzymes Database (CAZy). The KEGG database was introduced by Kanehisa Laboratories in 1995 with version 0.1. It has since developed into a comprehensive database, the core of which is the KEGG pathway database and the KEGG Ortholog database. In the KEGG Ortholog database, genes performing the same function are clustered together into groups called ortholog groups (KO entries). In the KEGG pathway database, biological metabolic pathways are divided into 6 categories: cellular processes, environmental information processing, genetic information processing, human diseases, metabolism, and organismal systems. The second layer currently includes 43 seed pathways, the third layer comprises metabolic pathway diagrams, and the fourth layer provides specific annotation information for each metabolic pathway map. The eggNOG database uses the Smith-Waterman comparison algorithm to annotate the orthologous groups of genes. eggNOG V4.1 covers 2,031 species and approximately 190,000 orthologous groups. The CAZy database is used to annotate carbohydrate enzymes and covers six main functional categories: GHs (glycoside hydrolases), GTs (glycosyl transferases), PLs (polysaccharide lyases), CEs (carbohydrate esterases), AAs (auxiliary activities) and CBMs (carbohydrate-binding modules). DIAMOND software was used to compare unigenes with each functional database (blastp, evalue ≤ 1e-5). In the alignment filtering step, the alignment results of each sequence with the highest score (one HSP > 60 bits) were selected for subsequent analysis. Based on the results of the functional annotations and the gene abundance table, the number of genes in each sample at each classification level was obtained. The number of genes with a certain function in a sample was calculated as the number of genes with non-zero abundance. Based on the abundance table at each classification level, analyses of the number and relative abundance of annotated genes were conducted. Resistance Gene Identifier (RGI) software provided by CARD was employed to compare unigenes with the CARD database (RGI built-in blastp, evalue ≤ 1e-30) (Qin et al. 2010). Based on the comparison results of RGI and the abundance information of unigenes, the relative abundance of ARO was calculated. Employing the ARO abundance data, a Venn diagram of abundance distribution was constructed, ARO differences between groups were analysed, and species attribution analysis of resistance genes (with focus on ARO unigenes) was conducted.