Animals and Genome Sequencing
All experiments complied with applicable laws and NIH guidelines and were approved by the University of Colorado and Ludwig-Maximilians-Universitaet Munich IACUC. Five young adult (postnatal day 65-71) gerbils (three males and two females) were used for tissue RNA transcriptome analysis and DNA genome assembly. These animals were maintained and housed at the University of Colorado; the original animals were obtained from Charles River (Wilmington, MA) in 2011. In addition, tissue from two old (postnatal day 1013, or 2.7 years) female gerbils was used for transcriptome analysis. These animals came from a colony housed at the Ludwig-Maximilians-Universitaet Munich, which was also originally obtained from Charles River (Wilmington, MA), and their tissues were sent on dry ice to be processed at the University of Colorado Anschutz. All animals were euthanized with isoflurane inhalation followed by decapitation. Genomic DNA was extracted from young adult animal tail and ear snips using a commercial kit (DNeasy Blood and Tissue Kit, Qiagen, Venlo, Netherlands). We then used the extracted DNA to create paired-end insert libraries of 250 bp, 350 bp, 500 bp, 800 bp, 2 Kb, 4 Kb, 6 Kb, and 10 Kb. These libraries were sequenced using an Illumina HiSeq2000 Genome Analyzer (Illumina, San Diego, CA, USA), generating a total of 322.13 Gb of raw data, from which 287.4 Gb of ‘clean’ data was obtained after removal of duplicates, contaminated reads, and low-quality reads.
High-quality reads were assembled into the genome using the SOAPdenovo package (version 2.04).
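As a quick check on the read-filtering figures reported above, the fraction of raw sequence retained as clean data can be computed directly. The following is a minimal sketch; the function and variable names are illustrative and are not part of the original pipeline.

```python
# Totals reported in the text (in Gb); names are illustrative only.
RAW_GB = 322.13
CLEAN_GB = 287.4


def retained_fraction(raw_gb, clean_gb):
    """Return the fraction of raw sequence that survived filtering."""
    if raw_gb <= 0:
        raise ValueError("raw_gb must be positive")
    return clean_gb / raw_gb


removed_gb = RAW_GB - CLEAN_GB
print(f"retained: {retained_fraction(RAW_GB, CLEAN_GB):.1%}")  # retained: 89.2%
print(f"removed:  {removed_gb:.2f} Gb")                        # removed:  34.73 Gb
```

Roughly 89% of the raw data was therefore kept for assembly, with about 34.7 Gb discarded as duplicates, contaminated reads, or low-quality reads.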
Transcriptome Sequencing and Assembly
Samples from 27 tissues were collected from the seven gerbils described above (Supplementary Table 1). Tissues were collected after the animals were euthanized with isoflurane followed by decapitation and were stored on liquid nitrogen until homogenized with a pestle. RNA was prepared using the RNeasy mini isolation kit (Qiagen, Venlo, Netherlands). RNA integrity was analyzed using a Nanodrop Spectrophotometer (Thermo Fisher, Waltham, MA, USA) followed by analysis with an Agilent Technologies 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Samples with an RNA integrity number (RIN) greater than 7.0 were used to prepare libraries, which were sequenced using an Illumina HiSeq2000 Genome Analyzer (Illumina, San Diego, CA, USA). The sequenced libraries were assembled with Trinity (v2.0.6 parameters: "--min_contig_length 150 --min_kmer_cov 3 --min_glue 3 --bfly_opts '-V 5 --edge-thr=0.1 --stderr'"). Quality of the RNA assembly was assessed by filtering RNA-seq reads using SOAPnuke (v1.5.2 parameters: "-l 10 -q 0.1 -p 50 -n 0.05 -t 5,5,5,5") followed by mapping of clean reads to the assembled genome using HISAT2 (v2.0.4) and StringTie (v1.3.0). The initial assembled transcripts were then filtered using CD-HIT (v4.6.1) with a sequence identity threshold of 0.9, followed by a homology search (human, rat, and mouse proteins) and TransDecoder (v2.0.1) open reading frame (ORF) prediction.
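The kind of per-read filtering applied before mapping can be illustrated with a short sketch. It assumes that the SOAPnuke options "-l 10", "-q 0.1", and "-n 0.05" denote a low-quality base threshold, a maximum fraction of low-quality bases, and a maximum fraction of N bases, respectively. That interpretation, the inclusive comparisons, and the function name are assumptions made for illustration; the sketch does not reproduce SOAPnuke itself and ignores the "-p" and "-t" options.

```python
def passes_filter(seq, quals, low_qual=10, max_low_qual_rate=0.1, max_n_rate=0.05):
    """Return True if a read passes the N-content and low-quality checks.

    seq   -- read sequence as a string
    quals -- per-base Phred quality scores (list of ints, same length as seq)
    The thresholds mirror the assumed meaning of "-l 10 -q 0.1 -n 0.05".
    """
    if not seq or len(seq) != len(quals):
        return False
    n_rate = seq.upper().count("N") / len(seq)
    low_rate = sum(q <= low_qual for q in quals) / len(quals)
    return n_rate <= max_n_rate and low_rate <= max_low_qual_rate


# Hypothetical reads: one clean, one with too many Ns, one with too many low-quality bases.
print(passes_filter("ACGTACGTAC", [30] * 10))            # True
print(passes_filter("ACGTNACGTA", [30] * 10))            # False (10% N)
print(passes_filter("ACGTACGTAC", [30] * 8 + [5] * 2))   # False (20% low quality)
```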
Genomic repeat elements of the genome assembly were also identified and annotated using RepeatMasker (v4.0.5, RRID:SCR_012954) (13) and the RepBase library (v20.04) (14). In addition, we constructed a de novo repeat sequence database using LTR-FINDER (v1.0.6) (15) and RepeatModeler (v1.0.8) (13), which was then used with RepeatMasker to identify any additional repeat elements.
Protein-coding genes were predicted and annotated by a combination of homology searching, ab initio prediction (using AUGUSTUS (v3.1), GENSCAN (v1.0), and SNAP (v2.0)), and RNA-seq data (using TopHat (v1.2 with parameters: "-p 4 --max-intron-length 50000 -m 1 -r 20 --mate-std-dev 20 --closure-search --coverage-search --microexon-search") and Cufflinks (v2.2.1 http://cole-trapnell-lab.github.io/cufflinks/)). Before prediction, repetitive sequences in the genome were masked using known repeat information detected by RepeatMasker and RepeatProteinMask. Homology searching was performed using protein data from Homo sapiens (human), Mus musculus (mouse), and Rattus norvegicus (rat) from Ensembl (v80), aligned to the masked genome using BLAT. Genewise (v2.2.0) was then used to improve the accuracy of alignments and to predict gene models. The de novo gene predictions and homology-based search results were then combined using GLEAN. The GLEAN results were then integrated with the transcriptome dataset using an in-house program (Table 5).
InterProScan (v5.11) was used to align the final gene models to databases (ProDom, ProSiteProfiles, SMART, PANTHER, PRINTS, Pfam, PIRSF, ProSitePatterns, SignalP_EUK, Phobius, TIGRFAM, and TMHMM) to detect consensus motifs and domains within these genes. Using the InterProScan results, we obtained the annotations of the gene products from the Gene Ontology database. We then mapped these genes to proteins in SwissProt and TrEMBL (UniProt release 2015.04) using blastp with an E-value <1E-5. We also aligned the final gene models to proteins in KEGG (release 76) to determine the functional pathways for each gene (Table 6).
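The E-value cutoff used for the blastp mapping can be applied to standard BLAST tabular output (12 tab-separated columns, with the E-value in column 11 and the bit score in column 12). The sketch below is illustrative only: the function name and sequence identifiers are hypothetical, and keeping a single best hit per query (by bit score) is an assumption, since the text specifies only the E-value threshold.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # threshold stated in the text


def best_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Keep, for each query, the highest-scoring hit with E-value below cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= cutoff:
            continue  # hit does not satisfy E-value < cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: geneA has two significant hits, geneB has none.
example = "\n".join([
    "geneA\tP1\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t250",
    "geneA\tP2\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t400",
    "geneB\tP3\t30.0\t50\t35\t0\t1\t50\t1\t50\t2e-3\t35",
])
print(best_hits(example))  # {'geneA': ('P2', 1e-100, 400.0)}
```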
Genome assembly and annotation quality were further assessed by comparison with closely related species, gene family construction, evaluation of housekeeping genes, and a Benchmarking Universal Single-Copy Orthologs (BUSCO) search. Gene family construction was performed using Treefam (http://www.treefam.org/). To examine housekeeping genes, we downloaded 2169 human housekeeping genes from http://www.tau.ac.il/~elieis/HKG/ and extracted the corresponding protein sequences to align to the gerbil genome using blastp (v2.2.26). Lastly, we employed BUSCO (v1.2) to search 3023 mammalian groups.