Nine species were chosen to maximize represention of the phylogenetic and dietary diversity of Phyllostomidae (figure 1). We also included two insectivorous bats, Myotis lucifugus from Vespertilionidae and Pteronotus parnellii from Mormoopidae, as outgroups. Tissues were extracted and frozen in liquid nitrogen within five minutes after euthanasia. Additional details from tissue loans provided by the Natural Science Research Laboratory (NSRL) of the Museum of Texas Tech University can be found in Additional file 1: Table S1.
RNA isolation, sequencing, and assembly
RNA was extracted from SMGs of each bat using Trizol (Invitrogen, Carlsbad, CA, USA) following manufacturer protocols. Oligo-dT magnetic beads were used to enrich for mRNA strands with poly-A tails and a strand-specific paired-end cDNA library was prepared using a ScriptSeq kit (Epicentre, Madison WI USA). Libraries were sequenced on Illumina platform (see Additional file 1: Table S1). For each species, paired-end reads were filtered for quality using Trimmomatic 0.36 [30] with the parameters LEADING:20 TRAILING:20 SLIDINGWINDOW:1:30 MINLEN:75 (MINLEN:50 was used for D. rotundus), and de novo transcript assemblies were constructed from clean read pairs with Trinity v2.7 [31] using default settings.
Gene discovery and annotation
Putative open reading frames and translated peptide sequences were identified using TransDecoder [32] and the resulting peptide sequences were processed through the Trinotate pipeline to identify functional properties and Gene Ontology (GO) annotations associated with biological processes, molecular functions, and cellular components. To summarize, peptide sequences were queried against the SwissProt database [33] using BLASTP [34] and the Pfam [35] database using hmmer 3.2.1 [36]. Peptides were also scanned for a signal peptide using signalP [37], and transmembrane domains using tmhmm 2.0 [38].
Orthology clustering
Orthology assignment is still a major challenge in bioinformatics and evolutionary biology. Here, we developed a process to filter out similar transcripts produced by Trinity. The first step was to assume similar sequences would have similar Trinotate annotations. We parsed the Trinotate output to identify the best sequence representative for each unique SwissProt annotation. To choose the best gene representative among multiple coding sequences (CDSs) with the same SwissProt annotation, we multiplied the length of the CDS to the percent identity of the SwissProt hit. This metric correlated with E value, but effectively acted as a bit score when E values were identical. The CDS with the highest metric was chosen to represent the annotation. If a CDS did not have a SwissProt annotation, it was removed. We then ran combined best sequences from all species, and ran this dataset through the OrthoMCL [39] pipeline to identify orthologous groups. Only single gene ortholog groups were used in downstream selection tests.
Poor ortholog assignment can influence sequence relationships in a phylogeny, and since the relationships among phyllostomids are robust, a reasonable test of ortholog assignment would be to reconstruct a phylogenetic tree and determine if the resulting tree reflects previously described relationships. Therefore, we reconstructed a phylogenetic tree from 500 randomly sampled single gene orthologs shared among all 11 individuals. Each orthologous group was translated, aligned using LINSi parameters in MAFFT [40] and reverse translated to construct a codon alignment. Resulting alignments were concatenated and we used RAxML [41] to find the best tree from the un-partitioned dataset using the ML and rapid bootstrapping algorithm, a GTRGAMMA model of nucleotide substitution, and with 100 starting trees.
Identifying genes under selection
Single gene orthologous groups that were found in seven or more individuals were tested for evidence of selection using the maximum likelihood approach described by Goldman and Yang [42]. Amino acid sequences were aligned using LINSi parameters in MAFFT [40], and the resulting alignment was reverse translated. Each gene tree was individually estimated using RAxML. We used CODEML in PAML [43] to estimate the role of selection on gene evolution by comparing the rate of nonsynonymous substitution per nonsynonymous site (dN) to the rate of synonymous substitution per synonymous site (dS). The dN/dS ratio can be used as a sensitive measure of selective pressure; however, in most cases the overall dN/dS is less than one and only a few amino acid sites are evolving quickly. Therefore, to determine whether a gene was evolving adaptively, we calculated the likelihood of models that allow dN/dS to vary among codon sites (M1a, M2a, M7, and M8). For all genes we used likelihood ratio tests (LRTs) to compare nested models that allow and disallow codon site dN/dS to be greater than one (M2a v M1a, M8 v M7), and to test for significant differences between nested models [44]. If both models that allow dN/dS rates to be greater than one (M2a and M8) were significantly better fits to the data (LRT, p < 0.05), we inferred these genes were evolving under positive selection. We performed a false discovery rate correction (FDR) on the p-values resulting from the M2a v M1a LRT because the M2a tests were more conservative than M8 and resulted in fewer LTR tests with p < 0.05. FDR was estimated using the qvalue function in the “qvalue” R module [45, 46].
Functional analysis of genes under selection
For all single gene orthologs tested for selection we counted the number of reoccurring GO terms. We then used the GO term proportion to summarize the general function of genes tested in the SMGs of phyllostomids using REVIGO [47], which nests redundant and similar terms from long GO term lists by semantic clustering. We repeated the same analysis using GO terms described from orthologs under positive selection. To test for GO term enrichment for the genes under selection, PANTHER [48] was applied to identify statistically overrepresented GO terms from the list of genes under positive selection using the genes tested as the reference list. We used the Fisher’s exact test and applied the FDR correction and statistical significance was determined below an adjusted p-value < 0.05. We estimated whether a protein was membrane bound, a receptor, immune related, or secreted by searching for specific GO terms. If “secretion”, “extracellular space” or “extracellular region” was present among GO terms, we annotated the gene as secreted. If “immune”, “defense”, or “antimicrobial” was present, the gene was annotated as an immune. If “membrane” was present we called the gene membrane bound and if “receptor” was present, we called the gene a receptor (Additional file 3: Table S2).