Olfactory receptor gene annotation pipeline
The expert OR gene annotation pipeline is summarised in Figure 1. Manual annotation was performed in our in-house Otter annotation system (Genome Research Ltd. https://www.sanger.ac.uk/science/tools/otter), which employs the ZMap graphical user interface (31, 32). We started by retrieving all annotated OR genes in the human and mouse genomes from RefSeq (14) (https://www.ncbi.nlm.nih.gov/refseq/), MGI (16) (http://www.informatics.jax.org/), and HORDE (15) (https://genome.weizmann.ac.il/horde/), a database of human OR gene sequences (33). Most OR genes are arranged in clusters along the genome, which are prone to duplication and recombination events that lead to the expansion of the repertoire. To identify any missing OR loci, we first extracted all genomic sequences with an annotated OR gene or gene cluster, along with the flanking 500 kb. We then used dotter (34) to compute the alignment of a few representative OR genes from each extended genomic region. These were visualised as dot matrix plots and allowed the identification of any matches that were not already annotated as OR genes. These putative novel OR loci were included in the corresponding OR gene repertoires for downstream analyses.
For each OR locus, we performed cross-species conservation analyses, including human, mouse, cow, sheep, cat, donkey, shrew, mole, guinea pig, elephant, dog, sheep, rat, chimp, gorilla, orangutan, marmoset, lemur, bushbaby, tarsier, baboon, bonobo and/or gibbon. These analyses served to determine the biotype (protein-coding or pseudogene) and, if necessary, amend the length of the coding sequence (CDS) and/or splice sites sourced from the databases. Specifically, unambiguous orthologues were identified by performing a BLAT search (35) of the protein sequence of each OR gene in the UCSC browser (36); if none were found, the closest paralogue(s) was used instead. When alternative open reading frames (ORFs) were present, we favoured the initiation codon (iATG) with highest conservation. An alternative protein isoform for a locus was created if it had a predicted intact ORF, conserved OR topology and high identity to known annotated isoforms.
OR loci that contained an intact ORF were then subjected to transmembrane topology predictions using HMMTOP (37), TMPred (38) and TMHMM Server v 2.0 (39). OR genes with an intact ORF of at least 300 aa, with a predicted seven-transmembrane domain structure (characteristic of OR proteins), an extracellular N-terminus and intracellular C-terminus were biotyped as protein coding.
For loci without an intact ORF, those with three or fewer disruptions were checked against NCBI dbSNP database (40) and The Mouse Genomes Project catalogue of mouse strains variation (18), to assess whether there were reported single nucleotide polymorphisms (SNPs) that restored an intact, full-length ORF. Restored loci were annotated as polymorphic pseudogenes. The remaining disrupted loci were designated as unprocessed pseudogenes. For the human pseudogenes, where a syntenic one-to-one protein-coding orthologue could be unequivocally established in the mouse (GRCm38) or dog (Broad CanFam3.1) genomes, the locus was annotated as a unitary pseudogene instead. We did not perform this analysis for the mouse repertoire since the large number of very closely related paralogues makes it difficult to establish unequivocal orthologous relationships. For all pseudogenes, the pseudogenic CDS was determined using protein homology with existing protein-coding ORs. This was achieved using multiple alignments that were visualised in the Blixem browser (41), with the genomic coordinates for homologous regions to protein-coding OR genes manually determined.
The second part of the annotation pipeline involved the use of RNAseq data from human and mouse olfactory mucosa samples to assemble mapped reads into transcript models using Cufflinks (see next section for details). Each annotated OR locus was visualised in the Integrative Genomics Viewer (IGV) (42, 43) guided by Ensembl and/or HORDE annotation, along with the Cufflinks gene models and RNAseq data from all samples combined together. Evidence from any available mRNA and/or EST clones from GenBank (irrespective of their tissue of origin) was also considered, where the alignment was best-in-genome. Each Cufflinks model was manually assessed, revealing numerous inaccuracies which were corrected using bespoke in-house software (https://gitlab.com/olfr/olfr_transcript_model_curation). Briefly, short (1-3 bp) read overhangs at splice junctions incorrectly generated non-canonical splice sites; spliced reads were often misaligned, with a short portion of the read mapping to a close paralogue instead of the target gene, resulting in erroneous chimeric transcript models; the 3’ and 5’ termini predictions were often extended too far, and thus UTRs were always terminated when coverage dropped below 3 reads; and, finally, failure to account for drops in coverage due to low complexity regions (often found within OR gene UTRs (13) resulted in premature termination of the transcript model.
For highly expressed loci, only the most abundant transcript models were retained. For a fraction of OR genes, we observed a drop in the average read depth towards the end of the 3’ UTR, suggesting the possibility of alternative 3’ termini. In these cases we used the longest UTR in our transcript models. We also took a conservative approach to define the 3’ UTRs of genes in close proximity on opposite strands, often terminating the transcripts where read depth differences occurred between the two loci.
Where no Cufflinks models were predicted for OR loci, we checked whether there was enough evidence from the RNAseq data to manually build transcript models. To build a transcript model we required contiguous overlapping reads (except in low complexity regions), support from a minimum of 3 RNAseq reads spanning the splice junctions, and defined the UTR ends when read depth dropped below 3 reads.
All transcript models were integrated into Zmap (31, 32) using Annotrack (44) and constitute our final annotation of the complete human and mouse OR gene repertoires. We refer to this annotation as Ensembl-HAVANA, which was then integrated into the Ensembl/GENCODE reference gene set (45, 46) that is available through the Ensembl (release 98 onwards) and UCSC (human release 32 and mouse release M23 onwards) genome browsers. As both Ensembl and UCSC genome browsers import information from other datasets (e.g., RefSeq), some genes contain additional transcript models not included in the curated dataset from this paper (e.g., Olfr240-ps1, discussed in this paper). We have provided both a detailed table with every annotated transcript in both species (Supplementary File 1) and GTF annotation files including only the Ensembl-HAVANA curated human (Supplementary File 2) and mouse (Supplementary File 3) OR gene models.
RNAseq datasets
For mouse, we used previously published data from two studies (11, 19) comprising RNAseq of whole olfactory mucosa samples of adult (8-10 weeks old) male and female C57BL/6J mice (12 samples in total, 6 from each sex). Raw data were retrieved from the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena) from projects PRJEB1365 (11) and PRJEB5984 (samples ERS658588, ERS658589, ERS658590, ERS658591, ERS658592 and ERS658593) (19).
For human, we used data from two published studies comprising RNAseq of olfactory mucosa biopsy samples (three male samples from (20); two male and two female samples from (12)). Raw data were retrieved from the European Genome-Phenome Archive (EGA, https://ega-archive.org/) study EGAS00001001486 (20) and from NCBI Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) project GSE80249 (12).
An additional six human olfactory mucosa samples were collected and sequenced to increase the coverage of the human repertoire, as previously described (20). Briefly, samples were collected from male patients undergoing endoscopic sinus surgery in Leuven, Belgium, for resection of an adenocarcinoma. During the procedure, olfactory mucosa of the contralateral (healthy) side was harvested from the olfactory groove. Collected tissue was stored in RNAlater and shipped to the Max Planck Research Unit of Neurogenetics (Frankfurt, Germany) for further processing. For one patient the sample was divided into three samples (samples 10-12), and each was processed separately. Thus, there was a total of six samples from four different patients (Supplementary File 6). All patients provided written informed consent according to the study protocol, approved by the Medical Ethical Committee on Clinical Investigations at the University Hospitals of Leuven on 23 April 2014 (S5648).
RNA was extracted using the RNeasy mini kit (Qiagen) following the manufacturer’s instructions. Then, mRNA was prepared for sequencing using the TruSeq RNA sample preparation kit (Illumina) with a selected fragment size of 200-500 bp. Samples were multiplexed together and sequenced on an Illumina HiSeq 2500 to produce paired-end 100 bp sequencing fragments. All raw data have been deposited in the EGA under accession EGAS00001001486.
RNAseq data processing and analysis
Human and mouse RNAseq data were aligned to the corresponding reference genomes (GRCh38 for human and GRCm38 for mouse) using Tophat version 2.0.13 (47), with default parameters. Mapped reads were used to perform reference-guided transcript assembly with Cufflinks version 2.2.1 (48) with default parameters, guided by Ensembl annotation, version 85 for human and version 83 for mouse (49). For the mouse data, Cufflinks was run on every sample and the results were compiled into a unique set of gene models using Cuffmerge. For the human data, we increased the coverage of OR genes by merging all 9 samples from Saraiva et al. (2019) (20) and this study into one BAM file. Similarly, the four samples from Olender et al. (2016) (12) were merged into a second BAM file. Each of these files were then used as input for Cufflinks. The two sets of gene models were kept separate and curated in parallel.
Gene expression levels for the mouse repertoire were taken from (19) Figure 2 - source data 1 (https://doi.org/10.7554/eLife.21476.006). For the human repertoire we applied the same methods as in (19). Briefly, we obtained the number of fragments mapped to each gene using the script htseq-count (mode intersection-nonempty; HTSeq version 0.6.2; (50)), using Ensembl annotation version 95, which contains the Ensembl-HAVANA curated OR models (51). To account for differences in sequencing depth between samples, raw counts were normalised using the method implemented in the DESeq2 package (52). Further normalisation to account for differences in the proportion of OSNs per sample was performed using the method proposed by Khan et al. (2013) (53). This consists of using the geometric mean of five genes expressed specifically in OSNs (ADCY3, ANO2, OMP, CNGA2 and GNAL) to compute a size factor to scale the normalised counts of OR genes. The code used to analyse the data can be found in https://github.com/xibarrasoria/ORgeneAnnotation_HAVANA. The normalised counts for the human OR repertoire are provided in Supplementary File 6.
Data processing, statistical analyses and plotting were performed in R (54) (http://www.R-project.org).
Identification of OR open reading frames split across two exons
Having defined full-length gene models for a large portion of the mouse OR gene repertoire, we reanalysed the resulting transcripts to check if any showed evidence for full-length multi-exonic ORFs. We focused our analysis on the mouse repertoire since it contains a much higher number of full-length gene models. We used an ad hoc script (https://gitlab.com/olfr/olfr_multi_exon_orf_finder/tree/master) to identify all transcripts whose longest predicted ORF spanned multiple exons and encoded a protein greater than 300 aa. The resulting list of 202 different transcripts (171 genes) was filtered to remove transcripts that had a second conserved in-frame ATG downstream of the splice site that still generated a protein of more than 300 aa. We further removed any transcripts that had deletions in the seven transmembrane domains (assessed from a multiple alignment of all receptors) as these are likely to be pseudogenes. Finally, to increase the likelihood of retaining functional proteins, we only selected genes whose orthologues had a conserved starting methionine (i.e., in the same relative position). This procedure resulted in a list of 58 candidate transcripts with a split ORF. Each was manually curated to ensure that the predicted ORF satisfied all of the criteria described above to be biotyped as protein-coding. We further required good conservation of the splice junction in the orthologous loci in other mammals, or in the closest paralogues if an orthologue could not be found. The transcripts that satisfied all these requirements are detailed in Supplementary File 5, which also contains the corresponding human orthologues with conserved split ORF structures.
Single-cell RNAseq of mature mouse olfactory sensory neurons
We dissected whole olfactory mucosa from two male and two female three-day old heterozygous OMP-GFP mice (B6;129P2-Omptm3Mom/MomJ, The Jackson Laboratory, Stock # 006667) (27) that were backcrossed 10 times with C57BL/6 animals, designated as OMP-GFP (B6-N11). The dissected tissue from all animals was pooled, minced and then enzymatically digested in HBSS without Ca2+ and Mg2+, supplemented with 44 U/mL dispase (Invitrogen), 1000 U/mL collagenase type II (Invitrogen) and 10 mg/mL DNaseI (Roche), for 15–20 min at 37°C with agitation. Digested tissue was centrifuged at 0.4 x 1000 rcf for 5 min and washed twice in HBSS without Ca2+ and Mg2+. The dissociated cell suspension was passed through a series of filters: 100 µm (Falcon), 70 µm (Flacon) and 20 µm cell strainers (pluriSelect). The final dissociated cells were resuspended in 0.5% BSA in PBS without Ca2+ and Mg2+. Single cells were isolated using a Nikon Narishige microinjection setup under a fluorescence Nikon TE300 microscope. GFP fluorescence was verified on a flat screen using NIS Elements v4.5 software (Nikon) in order to ensure only GFP-positive cells were selected. Isolated cells were washed twice and then pipetted into a 0.2 mL tube, which was immediately frozen on dry ice.
Single-cell cDNA libraries were prepared using SMART-Seq v4 (Clontech) according to the manufacturer’s recommendations. Briefly, single cells were lysed and reverse transcription was performed with SMART-Seq v4 Oligonucleotide and SMARTScribe Reverse Transcriptase to generate nearly full-length cDNA libraries. Libraries were purified twice with AMPure XP system (Beckman Coulter) to minimize primer dimers. The size of the libraries was checked on high‐sensitivity DNA chips (Agilent); cells with abundant short cDNA (average size < 1.3 kb) were discarded. Purified cDNAs were used to construct libraries for sequencing using the Nextera XT DNA Library Preparation Kit (Illumina) according to manufacturer’s recommendations. Briefly, cDNAs were fragmented to ~300 bp, followed by PCR amplification with index primers. Libraries were purified with AMPure XP system (Beckman Coulter), normalised and pooled with different i7 indexes. Pooled libraries were sequenced on the Illumina HiSeq X Ten platform, to produce 150 bp paired-end fragments (Novogene Co., Beijing). Raw data have been deposited to Array Express (https://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-8285.
Sequencing reads were mapped to the mouse reference genome (GRCm38) using STAR (55) version 2.6.0c, with parameters --outFilterMismatchNmax 6 --outFilterMatchNminOverLread 0.5 --outFilterScoreMinOverLread 0.5 --outSAMtype BAM SortedByCoordinate --outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outSAMstrandField intronMotif --quantMode GeneCounts. We guided the mapping with Ensembl annotation version 93 (56) that includes the Ensembl-HAVANA OR gene annotation. We enabled STAR’s quantification mode to obtain the number of fragments mapped to each gene.
All 34 cells but one showed good performance on quality-control statistics (library size, percent of reads mapping to mitochondrial reads, and number of genes detected); the failed sample had a much lower number of genes detected compared to all other samples and was removed from the analysis. Raw counts were normalised with the algorithm implemented in the Bioconductor package scran (57). For each single cell, we visually inspected the alignments of all OR genes expressed at 30 or more normalised counts. When clear mapping artefacts were detected, the counts of the corresponding OR gene were set to 0. Details of this procedure along with the code used to analyse the data can be found in https://github.com/xibarrasoria/ORgeneAnnotation_HAVANA.
Phylogenetic analysis of protein-coding ORs
To reconstruct the phylogenetic relationship between ORs we first aligned the protein sequences of all protein-coding genes. For the human repertoire we used MUSCLE (58), which can handle up to 500 sequences (https://www.ebi.ac.uk/Tools/msa/muscle/; (59)), and for mouse we used CLUSTAL Omega (60), which aligns up to 2000 sequences (https://www.ebi.ac.uk/Tools/msa/clustalo/; (59)). A phylogenetic tree was constructed from the resulting multiple alignments using the BIONJ algorithm (61) (a modified neighbour-joining procedure), in Phylogeny.fr (62) (http://www.phylogeny.fr/one_task.cgi?task_type=bionj). The resulting trees were visualised with the Interactive Tree of Life tool (63) (http://itol.embl.de).