DNA sampling and sequencing
Genomic DNA was extracted from muscle tissue of a male golden Syrian hamster and was used to prepare an Illumina paired-end library, nanopore library and Hi-C library. For the establishment of the Illumina paired-end library, genomic DNA was fragmented by Covaris, and then end-polished. dA and adaptors were added for constructing a sequencing library with the insert size of 350bp. Paired-end sequencing was performed with the Illumina NovaSeq 6000 platform according to the manufacturer’s instructions and produced a total of 307.53 Gb (98.25×) raw data for the survey and correction of the Syrian hamster genome. Quality control was performed on the Illumina raw data to ensure Q30>=80% through in-house Perl scripts. Specifically, reads with the adaptors, reads with N bases more than 10%, paired reads with Q<5 base pairs more than 20% and duplicate reads were removed to obtain clean reads. We obtained a total of 306.92Gb (~98×) clean reads for downstream analysis. All clean reads were applied to estimate the hamster genome’s size and heterogeneity based on the k-mer distribution analysis using a k value of 17. For Nanopore libraries, high molecular weight genomic DNA was processed with a Ligation Sequencing Kit following the manufacturer’s protocol including steps: (a) DNA repair and end-preparation; (b) ligation of sequencing adaptors and clean-up; (c) quality control. Nanopore libraries were sequenced on PromethION according to the manufacturer’s instruction.
We constructed a high-throughput chromosome conformation capture (Hi-C) library for the hamster genome. Muscle tissue was fixed with formaldehyde to induce DNA cross-linking. After digestion with a restriction endonuclease, DNA was biotinylated by biotin-14-dCTP and then ligated by T4 DNA Ligase to form chimeric junctions. The ligated DNA was reverse cross-linked and physically sheared into 300-600 bp fragments. The DNA fragments were purified through biotin-streptavidin-mediated pulldown and were blunted-end repaired and A-tailed to construct Hi-C sequencing libraries. The Hi-C libraries were quantified and sequenced on the Illumina NovaSeq 6000 and HiSeq platform (Novogene Biotech Cor., Ltd, Beijing China). The Hi-C libraries generated 336.87 Gb of raw data and 336.06 Gb of clean data were left after quality control.
Nanopore reads assembly, correction and validation
A total of 314.37Gb Nanopore sequencing data was obtained to perform de novo genome assembly by using the long-read genome assembler wtdbg2 (v2.3, parameters: input_fofn = input.fofn, ref = '', data_type =ONT, genome_size = 3.0, max_depth = 60, node-drop=0.25, node-len=1536, node-max=200, brute_force=1) [51]. The following strategies were performed: (a) alignment algorithms: Kmer-Bin-Mapping; (b) assembling algorithms based on fuzzy-Bruijin graph (FBG); (c) consensus correction was performed using Nanopore data with Racon (v1.3.1) and the following parameters: -u -t 40 [52]. To further decrease the overall error rate, we performed consensus correction alignment using Illumina reads mapped with BWA -0.6 and Plicon (v1.22, parameters: -Xmx300G --diploid --threads 20) [53]. To evaluate uniformity of the sequencing data, we estimated the mapping rate of the Illumina reads and coverage/average sequencing depth of the genome. We employed CEGMA (Core Eukaryotic Genes Mapping Approach: http://korflab.ucdavis.edu/dataseda/cegma/) and BUSCO (Benchmarking Universal Single-Copy Orthologs: http://busco.ezlab.org/)to evaluate the completeness of the genome assembly.
Chromosome-scale assembly of Syrian hamster genome with Hi-C mapping information
Hi-C high-quality clean data were filtered from Hi-C clean data with the same standard for Illumina raw reads. We used BWA -0.6 to map the clean Hi-C reads to the draft assembled sequence. We used SAMTOOLS (v0.1.18)[54] to select high mapping quality reads that aligned within 500 bp of a restriction site. We excluded reads with low mapping quality, multiple hits or duplication. Lachesis (version-201701) (https://github.com/shendurelab/LACHESIS) was used to cluster contigs into chromosome groups, order contigs within chromosomes and orient the contigs with parameters “RE_SITE_SEQ = GATC , CLUSTER_N = 22, CLUSTER_MIN_RE_SITES = 325” .
Genome annotation
The repeats in the genome consisted of tandem repeats and interspersed repeats (also known as transposable elements, TEs). Tandem Repeat sequences and transposable elements in the hamster genome were identified using an integrated strategy incorporating de novo and homology-based approaches at the DNA and protein levels. A de novo repeat library for the hamster genome was generated using the combination of three programs RepeatModeler (http://www.repeatmasker.org/RepeatModeler/), RepeatScout (http://www.repeatmasker.org/) and LTR_FINDER (http://tlife.fudan.edu.cn/tlife/ltr_finder/).
TEs were identified by a combination of the de novo repeat library and RepBase library (Bao et al., 2015) by Uclust with 80-80-80 principle and annotated by RepeatMasker. RepeatProteinMask was used to search the TE protein database using WU-BLASTX algorithms in order to identify and classify TEs at the protein level. We integrated the results and generated a consensus and non-redundant TEs library (combined TEs) [7].
Gene prediction
Structural annotation of gene models was constructed by incorporating de novo, homology-based and RNA-seq-assisted prediction. The de novo prediction was implemented using Augustus(http://bioinf.uni-greifswald.de/augustus/) [25], GlimmerHMM (http://ccb.jhu.edu/software/glimmerhmm/) [55], SNAP(https://github.com/KorfLab/SNAP) [56], Geneid [57] and Genscan (http://argonaute.mit.edu/GENSCAN.html) [58]. For homology-based prediction, protein sequences from seven sequenced vertebrates Cricetulus griseus (Cgr), Microtus ochrogaster (Moc), Rattus norvegicus (Rno), Mus musculus (Mmu), Mus caroli (Mca), Peromyscus maniculatus (Pma), Homo sapiens (Hsa), were initially mapped onto the hamster genome using blast(http://blast.ncbi.nlm.nih.gov/Blast.cgi). Homologous genome sequences were then aligned against matched protein using GeneWise (http://www.ebi.ac.uk/~birney/wise2/) to obtain accurate spliced alignments.
To obtain additional support for the structural annotation of gene models, 11 Illlumina RNA libraries from 11 tissues and a PacBio SMRT RNA-seq library of eight mixed tissues of Syrian hamster were sequenced and processed. Raw reads were submitted in Genbank (PRJNA662719, Table S8). Illlumina RNA-seq data were mapped to genome using Tophat (version 2.0.8). Cufflinks (version 2.1.1) (http://cufflinks.cbcb.umd.edu/) was used to assemble transcripts to gene models. PacBio RNA-seq data subreads.bam was processed using css from SMRTlink _6.0.0.47841 with parameters “--skip-polish --num-threads 30”. The PacBio bam file was then clustered and polished using isoseq3_0.7.2 with parameters “--rq-cutoff 0.99, --coverage 60”. Furthermore, the assembled transcripts based on Illlumina RNA-seq data and PacBio SMRT RNA-seq data were used to identify candidate protein-coding regions with the Program to Assemble Spliced Alignments (PASA, https://github.com/PASApipeline/PASApipeline/wiki) [27]. Finally, consensus gene sets were integrated using three respective annotation files with EVidenceModeler (EVM, http://evidencemodeler.github.io/) [27].
Gene function annotation
The functional annotation of the predicted genes of hamster was performed by alignment to the protein databases using BLASTALL and KAAS. The protein databases involved in the present study were SwissProt (http://www.uniprot.org/), Nr (http://www.ncbi.nlm.nih.gov/protein), Pfam (http://pfam.xfam.org/), KEGG (http://www.genome.jp/kegg/), and InterPro (https://www.ebi.ac.uk/interpro/).
Non-coding RNA annotation
Putative tRNAs were annotated by tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/). miRNAs and snRNAs were predicted using INFERNAL 1.1 (http://infernal.janelia.org/) by searching against the Rfam database (available via the website at http://rfam.sanger.ac.uk and through our mirror at http://rfam.janelia.org). rRNAs were detected by performing blast search with rRNA sequences from closely related species.
Gene family classification
We analyzed the protein-coding genes from hamster and 15 other vertebrate species, including Canis lupus familiaris, Sus scrofa, Cavia porcellus, Heterocephalus glaber, Rattus norvegicus, Mus caroli, Mus musculus, Peromyscus maniculatus, Microtus ochrogaster, Cricetulus griseus, Spalax galili, Homo sapiens, Macaca mulatta, Macaca fascicularis and Loxodonta africana. All data were downloaded from NCBI and Ensembl. Only the longest protein sequence was kept for further analysis if there were alternative splicing isoforms. Genes encoding proteins with less than 30 amino acids were excluded. The identification of orthologous groups was performed with OrthoMCL (Version 2.0, http://orthomcl.org/orthomcl/) using a Markov Cluster algorithm to group (putative) orthologs and paralogs. We identified a total of 19,615 gene families and 6,620 single-copy orthologues.
Phylogenetic analysis
Phylogenetic analysis was performed using 6,620 single-copy orthologues. Amino acid and nucleotide sequences of the ortholog genes were aligned using MUSCLE (http://www.drive5.com/muscle/) [59]. All single-copy ortholog alignments were concatenated into a super alignment matrix. A maximum likelihood-based phylogenetic tree was estimated based on the matrix of nucleotide sequences using RAxML (http://sco.h-its.org/exelixis/web/software/raxml/index.html). Clade support was evaluated using the boot-strapping algorithm in the RAxML package with 100 alignment replicates. The constructed phylogenetic tree showed that hamster and other Cricetidae species were clustered closely first, and then clustered with Muridae, which is in agreement with their putative evolutionary relationships.
The estimation of divergence time
The species divergence times were deduced with MCMCTree included in PAML (http://abacus.gene.ucl.ac.uk/software/paml.html) with the parameter set as “burn-in=1,000,000, sample-number=1,000,000sample-frequency=10”. We estimated that the divergence between Mesocricetus auratus (Syrian hamster) and Cricetulus griseus (Chinese hamster) happened at 29.4 million years ago.
Gene family expansion and contraction analysis
To identify the expanded and contracted gene families, we performed analysis with CAFE (Computational Analysis of gene Family Evolution, https://sourceforge.net/projects/cafehahnlab/). We used Fisher’s exact test to detect pathway enrichment among the expanded and contracted genes (FDR<0.05).
We obtained a new set of single-copy orthologues shared among four species Mesocricetus auratus, Cricetulus griseus, Rattus norvegicus, and Mus musculus to determine possible positively selected genes and genes under accelerated evolution. We implemented multiple alignment based on the protein sequences of single-copy orthologues with MUSCLE [59]. We estimated synonymous (Ks) and non-synonymous (Ka)s substitution rates by using the CODEML program from the PAML package [60]. We did a likelihood ratio test to identify genes under positive selection in the branch-site model. A total of 902 genes were identified as candidates for positively selected genes (p-value<0.01, FDR<0.05). Functional enrichment analysis of positively selected genes was performed based on Gene Ontology (GO, http://www.geneontology.org) and KEGG Pathway database (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg).
RNA sampling and sequencing
The 45 samples of the Syrian hamster were collected from 15 tissues and organs, including heart, liver, spleen, lung, kidney, pancreas, stomach, bowel, brain, muscle, testis, epididymis, lymph node, thymus and peripheral blood monocyte cells. Total RNA for transcriptome sequencing was extracted using NucleoSpin RNA kits (Macherey-Nagel). The quality including degradation and contamination was screened on 1% agarose gels. RNA integrity was evaluated by using the RNA Nano 6000 Assay Kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA). 1.5 µg RNA per sample was obtained for library preparations. NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA) was used for library preparation and library quality was assessed on the Agilent Bioanalyzer 2100 system. TruSeq PE Cluster Kit v3-cBot-HS (Illumia) was used for the clustering of the index-coded samples that was performed on a cBot Cluster Generation System. The libraries for each sample were sequenced on an Illumina NovaSeq 6000 platform and paired-end reads were generated. Raw data (raw reads) of fastq format were filtered by removing adapters, poly-N and the low-quality reads. As well as the Q20, Q30, GC-content and sequence duplication were calculated. Raw reads were submitted in Genbank (PRJNA662719, Table S9).
Gene functional annotation
Gene function was annotated based on the public databases including Nr (NCBI non-redundant protein sequences), Nt (NCBI non-redundant nucleotide sequences), Pfam (Protein family), KOG/COG (Clusters of Orthologous Groups of proteins), Swiss-Prot (A manually annotated and reviewed protein sequence database), KO (KEGG Ortholog database). Gene Ontology (GO) enrichment analysis of the differentially expressed genes (DEGs) was implemented by the GOseq R package-based Wallenius non-central hyper-geometric distribution. We used KOBAS software to test the statistical enrichment of differentially expressed genes in KEGG pathways[61]. The cluster of the annotation and KEGG enrichment analysis were conducted by DAVID bioinformatics Resources 6.8 and KOBAS-I [62-64].
Transcriptome assembly
The transcriptomes of 15 tissues and organs were sorted and prepared for NCBI transcriptome shotgun assembly (TSA) submission, and the NCBI submission information is shown in Table S9. Reference genome and gene model annotation files were obtained from assembled genome and annotation in the present study. The index of the reference genome was built using HISAT2 v2.0.5 and paired-end clean reads were aligned to the reference genome using HISAT2 v2.0.5 [65]. The mapped reads of each sample were assembled by StringTie (v1.3.3b) in a reference-based approach[66]. StringTie uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus.
featureCounts v1.5.0-p3 (http://bioinf.wehi.edu.au/featureCounts/) was used to count the read numbers mapped to each gene. FPKM of each gene was calculated based on the length of the gene and read counts mapped to this gene. FPKM, expected number of Fragments Per Kilobase of transcript sequence per Millions base pairs sequenced, considers the effect of sequencing depth and gene length for the reads count at the same time, and is currently the most commonly used method for estimating gene expression levels.
Transcriptome Quality and Comparisons
Assembled transcriptome metrics showed an acceptable percentage (over 70%) of reads mapping back to each transcriptome indicating qualified assemblies. Bench-marking universal single-copy orthologs (BUSCO) v. 1.1.1 results using the nematode dataset (downloaded in March 2018) indicated that the transcriptomes have a moderate level of completeness (over 50%)[67, 68]. Cufflinks (V 2.2.1) analysis was run for the comparison of 15 different tissues and organs of hamster and different gene expression patterns were found Figure S2 [69].
Distribution of alternative splicing modes
The annotation file of the hamster (gff3 file) was obtained from EVM. We performed an alternative splicing transcriptional landscape analysis on the Syrian hamster using the ASTALAVISTA web server (http://genome.crg.es/astalavista/).
Western blotting
Tissues protein extracts were isolated from duodenum, small intestine, kidney, adrenal gland, colon, lung, esophagus, stomach, liver, heart, brain, spleen, testis and pancreas from wild-type Syrian hamsters and mice (n=3) with RIPA lysis buffer (Beyotime Biotechnology, P0013C) containing 1% protease inhibitor cocktail solution (Roche, 04693132001). Bradford Protein Assay (CWBIO, Beijing, China) was used to estimate the protein concentration. 50 µg of protein prepared with loading buffer was separated in a 10% gel by SDS-PAGE and transferred to PVDF membranes (Millipore). After blocking for one hour at room temperature with a blocking solution of 5 % milk, the membrane was incubated with primary antibody diluted in blocking solution at 4℃ overnight. For detection of the ACE2 protein, a 1:1,000 dilution of the ACE2 Rabbit monoclonal antibody (ET1611-58, HUABIO, China) was used as the primary antibody and a 1:10,000 dilution of the goat anti-rabbit antibody (ZB-5301, ZSBIO, China). GAPDH expression was demonstrated with a 1:10,000 dilution of GAPDH mouse monoclonal antibody (60004-1-lg, Proteintech, USA) as the primary antibody and a 1:12,000 dilution of the goat anti-mouse antibody (ZB-5305, ZSBIO, China) as the secondary antibody.
Animal studies
Syrian hamsters (4-5 wk old, 80 g in weight) were purchased from Beijing Vital River Laboratory Animal Technology Co. (Beijing, China). The animals were maintained under specific pathogen-free, 14 h light/10 h dark cycle (room temperature 23 ± 0.5 °C humidity 40%-60%) conditions and the animals had free access to feed irradiated chow and water. All animal care and experiments were approved by the Ethical Committee of the Zhengzhou University and were in accordance with the Provision and General Recommendation of Chinese Experimental Animals Administration Legislation. The data reported in this manuscript is in accordance with ARRIVE guidelines.
Gene enrichment analysis
In order to compare the genes enrichment of different tissues from human, Syrian hamster, mouse and rat, the main tissue differential expressing genes (top500) of hamsters are identified firstly, then overlapped genes were screened out using Venny software for further gene clustering with R studio or GO (Gene Ontology) analysis with metascape software. To explore whether hamsters are superior in the field of cardiovascular research, 128 overlapped genes (in human, mouse and Syrian hamster) from 244 coronary artery disease (CAD) related orthologs were clustered in liver and heart tissues [22].
SARS-CoV-2 pseudovirus assay
To generate SARS-CoV2 pseudovirus, plasmids including lenti-Luciferase, psPAX and codon-optimized Spike with 18 amino acid deletion were co-transfected into 293T cells following our previous study [70]. 1×104 cells expressing ACE2 derived from human, mouse, rat and Syrian hamster were seeded into 96-well plates and infected with 50 μL SARS-CoV2 pseudovirus, and 72 hours later, the relative luciferase activities were measured with the luciferase assay system (Promega) and GloMax Discover detector (Promega).
SARS-CoV-2 infection assay
Syrian hamster kidney cancer cells HaK (Purchased from DSMZ) were seeded into 24-well plates, and infected with SARS-CoV-2 (isolated, BetaCoV/Wuhan/IVDC-HB-01/2020|EPI_ISL_402119) at MOI=0.2; one hour later, medium was refreshed and virus genomes in cell supernatant at 24 or 72 hours were detected with real-time RT-PCR (ORF1ab-foward primer CCCTGTGGGTTTTACACTTAA; ORF1ab-Forward reverse primer ACGATTGTGCATCAGCTGA; and the probe 5’-VIC-CCGTCTGCGGTATGTGGAAAGGTTATGG- BHQ1-3’). Total virus titers at 72 hours in the virus-infected HaK cells were measured with Vero cell line as described previously [71]. For the immunofluorescence assay, HaK cells were seeded into Chambered Coverglass System (Thermo ScientificTM NuncTM Lab-TekTM) and infected with SARS-CoV2 at MOI=10; 24 hours later, cells were fixed, and virus detected with rabbit anti-SARS-CoV-2 polyclonal antibody as described previously [71]. All these experiments were performed in biosafety level 3 laboratory.
Statistical analysis
For the experimental data, data are presented as mean ± standard error of the mean (SEM). Unless otherwise stated, statistically significant differences between the two groups were analyzed by t-test when variances are equal, and 1-way analysis of variance followed by Dunnett test was used for multiple comparisons with Prism 7 (GraphPad Inc). P<0.05 was statistically significant.