Odorants
The musk odorants used in this study were provided by Takasago International Corporation or were purchased from Wako or TCI (Table S4). The musk odorants, their structural analogues, and odorant mixtures used in this study are as follows: 1. musk ketone, 2. musk xylol, 3. civettone, 4. globanoneTM, 5. cyclopentadecanone/exaltone (Firmenich), 6. muscenoneTM delta, 7. muscone, 8. cosmoneTM, 9. cyclopentadecanol, 10. ethylene brassylate, 11. ambrettolide, 12. habanolideTM, 13. pentalide, 14. galaxolide, 15. tentarome/tonalidTM, 16. celestolide, 17. helvetolide, 18. romandolideTM, 19. ω-dodecanolactam, 20. amber xtremeTM, 21. b-ionone, 22. ambrinol, 23. YsamberTM K, 24. p-cresyl phenyl acetate, 25. amberketal, 26. d-dodecalactone, 27. cyclopentadecane, 28. cyclododecanone, 29. 2-pentadecanone, 30. Calone, 31. heliotropyl acetone, 32. 8-pentadecanone, 33. florexTM, 34. styrallyl acetate, 35. β-damascone, 36. ambroxanTM, 37. isolongiforanone, 38. methyl cedryl ketone, 39. 6-methyl quinoline, 40. raspberry ketone, 41. p-t-butyl cyclohexanone, 42. Kephalis, 43. Amber CoreTM, 44. (R)-(+)-sclareolide, 45. iso e super, 46. boisambreneTM forte, 47. sandalmysore coreTM, 48. methyl b-naphthyl ketone, 49. aldehyde mixture (n-hexanal, n-octanal, 1-decanal, 2-methyl undecanal, melonal, and helional), 50. alcohol mixture (cis-3-hexenol, 1-dodecanol, 1-decanol, 1-octanol, hexanol, and geraniol), 51. ketone mixture (methyl heptenone, p-amyl cyclohexanon, acetophenone, claritone, and fructone), 52. acid mixture (phenyl acetic acid, acetic acid, benzoic acid, hexanoic acid, ocrtanoic acid, and iso-valeriec acid), 53. terpene mixture (l-carvone, l-menthol, iso-menthone, α-terpinene, camphene, and dipentene), 54. lactone mixture (g-decalactone, g-nonalactone, g-undecalactone, jasmolactone, g-valerolactone, and g-octalactone), 55. ester mixture (methyl cinnamate, benzyl propionate, iso-bornyl acetate, cis-3-hexenyl hexanoate, iso-amyl n-butyrate, and phenyl ethyl acetate), and 56 camphor mixture (1,4-cineol, iso-borneol, borneol, and camphor). Musk ketone and musk xylol were prepared as a 50 mM EtOH solution. Each of the other odorants were prepared and stocked as 100 mM EtOH solution. These stock solutions were stored in a freezer and diluted with medium for stimulation. The odorant mixtures (#49-56) were prepared as EtOH solutions and applied to cells with each component’s final concentration as 100 mM.
Designing Consensus ORs
BLASTP searches were conducted using a reference amino acid sequence of a target OR as a query against proteins in the database of reference proteins. As a result of the searches, the number of identified orthologous receptors without human receptors were as follows: 242 for OR5AN1, 196 for OR5A1, 111 for OR5A2, 134 for OR4D6, and 102 for OR4D10, (Supplementary text files). Amino acid alignment using ClustalW allowed identification of consensus amino acid residues, which were conserved across more than 50% orthologous receptors at a given site. A consensus amino acid was inserted into a human OR when more than 60% orthologous ORs had an amino acid at the position. On the other hand, an amino acid was deleted from a human OR when more than 60% orthologous receptor did not have an amino acid at the corresponding position. The consensus amino acid sequences were translated into DNA sequences through codon optimization. The DNA sequence of each OR was synthesized by GenScript Japan and inserted into a pME18S vector to generate OR proteins fused with N-terminal epitope tags of eight amino acids of FLAG tag, followed by twenty N-terminal amino acids of bovine rhodopsin.
Expression Vector
The sequence information of human ORs used in this study was reported previously10,40. Genes coding human ORs, trace amine-associated receptors (TAARs), and vomeronasal receptors type I (VN1R1s) were amplified from human genomic DNA (Promega, Madison, WI, USA). The identified single nucleotide polymorphisms (SNPs) that were different from the reference sequences were not modified. When we amplified an OR gene with unknown missense mutations, we modified the genes with mutations to reference sequences. The human RTP1S gene was amplified from human genomic DNA (Promega, Madison, WI, USA), and it was inserted into pME18S without any N-terminal epitope tag. cDNA of muscarinic acetylcholine receptor M2 was purchased from Thermo Fisher Scientific (Waltham, MA, USA), and the ORF was inserted with pME18S, containing both the N-terminal epitope tags of FLAG and twenty amino acids of bovine rhodopsin.
Cells
HEK293 cells were grown in Dulbecco's Modified Eagle Medium (DMEM, 4.5 g/L glucose, Nacalai Tesque, Kyoto, Japan) supplemented with 10% fetal bovine serum (FBS). Cells were cultured on a 100-mm cell culture dish at 37 °C in a humidified atmosphere containing 5% CO2. Cells were split every two to four days before reaching confluence. When passaging cells, all medium in the cell culture was removed, and cells were gently washed with PBS. Trypsin-EDTA (0.25%, Invitrogen, Waltham, MA, USA) was used to detach cells from the bottom of the dish. Equal volumes of DMEM with 10% FBS were applied to the dish immediately after the cells detached. The cells’ suspension was transferred to a 15 ml polypropylene tube and centrifuged for 1 minute at 200g at room temperature. DMEM and trypsin-EDTA were aspirated, and the retained cell pellet at the bottom of the tube was resuspended with DMEM with 10%. The cell suspension was transferred to a 100 mm cell culture dish with 9 ml DMEM and 10% FBS.
Flow Cytometry Analyses
HEK293 cells were grown to confluency, resuspended, and seeded onto 35-mm cell culture dish or 6-well plate. The number of seeded cells were 3.6×105 in 2 ml DMEM with 10% FBS. The cells were cultured overnight before transfection. The DNA transfection mixture in 100 μl DMEM contained 3 mg of FLAG-Rho-tagged OR, 1 mg RTP1S, and 10 μl polyethylenimine Max (PEI-MAX, 0.1%, pH 7.5, Polysciences, Warrington, PA, USA). For preparing cells without any receptors (mock-transfected cells) or M2AchR-expressing cells, 3 mg of empty vector or FLAG-Rho-tagged M2AchR was transfected instead of a FLAG-Rho-tagged OR, respectively. After incubation at room temperature for 15 min, the DNA transfection mixture was transferred to the cell culture dish. After 24 hours, the cells were detached and resuspended with 1 ml cell stripper (Corning, NY, USA) and 1 ml PBS with 2% FBS on ice. The cells in 15 ml polypropylene tubes (AGC TECHNO GLASS, Shizuoka, Japan) were centrifuged for 3 minutes at 200g at 4 °C. The cell pellet at the bottom of the tube was resuspended with 120 ml of 0.3 mg/ml primary antibody [Anti-DYKDDDDK, Mouse-Mono (2H8, transgenic, Fukuoka, Japan)] in PBS with 2% FBS and incubated for 60 minutes on ice. At the end of the incubation period, the cells were washed with 1 ml PBS and 2% FBS twice and stained with 100 ml of 50 mg/ml phycoerythrin (PE)-conjugated goat anti-mouse IgG H&L antibody in PBS with 2% FBS for 30 minutes on ice (FUJIFILM Wako Pure Chemicals, Osaka, Japan) in the dark. Dead cells were stained with 0.5 mg/ml, and 7-Amino-actinomycin D (FUJIFILM Wako Pure Chemicals, Osaka, Japan) was added. The cells were analyzed using BD FACSuit with gating allowing for single, spherical, viable cells, and the measured PE fluorescence intensities were analyzed and visualized. We normalized the cell surface expression levels by cells expressing FLAG-Rho-tagged M2AchR, which was robustly expressed on the cell surface, and mock-transfected cells. The normalized expression level was calculated as [PE (OR) − PE (mock)] / [PE (M2AchR)- PE (mock)], where PE (OR) = mean PE from cells transfected with an OR; PE (mock) = mean PE from cells without any receptors; and PE (M2AchR) = mean PE from cells transfected with FLAG-Rho-tagged M2AchR. Statistical analyses were performed using one-tailed t-test, and the significance level was set at P<0.05.
cAMP response Element (CRE)-regulated Luciferase Reporter Gene Assay
The Dual-Glo Luciferase Assay (Promega) was used to determine the activities of firefly and Renilla luciferase in HEK293 cells. Transfection and luciferase assays were performed as previously described10,11. Briefly, 75 ng of a FLAG-Rho-tagged OR pME18S, 30 ng of CRE/luc2PpGL4.29, 30 ng of pRL-CMV, and 30 ng of RTP1S pME18S were applied in 10 ml DMEM with 0.41 ml of PEI-MAX (0.1%, pH 7.5) for each well of a poly-D lysine-coated 96-well plate (Corning, NY, USA). For testing cOR5A2 with humanized ligand-binding pocket, 7.5 ng of a FLAG-Rho-tagged OR pME18S, 3.0 ng of CRE/luc2PpGL4.29, 30 ng of pRL-CMV, and 30 ng of RTP1S pME18S were applied. After incubation for 15 min, 90 µl of cell suspension (2 x 105 cells/cm2 in DMEM with 10% FBS) were added to the 10 µl transfection solution, and the plate was incubated for 24 hours. For each well of poly-D lysine-coated 384-well plate (Corning, NY, USA), 29 ng of a FLAG-Rho-tagged OR pME18S, 22 ng of CRE/luc2PpGL4.29, 11 ng of pRL-CMV, and 12 ng of RTP1S were applied in 4.4 ml DMEM with 0.16 ml of 0.1% PEI-MAX (0.1%, pH 7.5). After incubation for 15 minutes, 40 µl of cell suspension (2 x 105 cells/cm2 in DMEM with 10% FBS) was added to 4.4 µl of the transfection solution, and the plate was incubated for 24 hours. The 384-well plate assay was performed using the BiomekFX laboratory automation system (Beckman Coulter, Brea, CA, USA). After 24 hours of transfection, the medium was removed, and the transfected cells were stimulated with an odorant solution diluted in the DMEM (75 µl and 40 µl per well for 96-well plate and 384-well plate, respectively). The 96- or 384-well plates were sealed and incubated at 37 °C for 3-4 hours. The luciferase reporter gene activities were measured by Mithras LB940 (Berthold Technologies, Bad Wildbad, Germany) and by Ensight multimode plate reader (PerkinElmer, Waltham, MA, USA). An odorant-induced activity was calculated as fold increase or normalized response. Fold increase was calculated as Luc(N) divided by Luc(0), where Luc(N) was the luminescence intensity of firefly luciferase divided by the luminescence intensity of Renilla luciferase of a certain odorant-stimulated well, and Luc(0) was the luminescence intensity of firefly luciferase divided by the luminescence intensity of Renilla luciferase of a certain non-stimulated well. Normalized response was calculated as [Luc (N) – Luc (0)] / [Luc (Fk)- Luc (0)]x100, where Luc (Fk) was the luminescence intensity of firefly luciferase divided by the luminescence intensity of Renilla luciferase of a forskolin-stimulated well (30 mM, LKT Laboratories Inc. Phalen Blvd., MN, USA). In a dose-response analyses, the reported three statistical criteria were applied: (i) 95% compatibility intervals (CIs) of top and bottom parameters don’t overlap, (ii) the standard deviation of logEC50 is less than 1, and (iii) the extra sum-of-squares test to examine that the odorant-response curve is different from the empty-vector curve (P<0.05)9,25. We confirmed that the EC50 values calculated from the sigmoidal curves with the normalized response values did not differ by more than 1 log step from those with raw firefly luciferase values8. Data analysis was performed using Microsoft Excel or GraphPad Prism software, and data were fitted to a sigmoidal curve using GraphPad Prism software.
Construction of cOR5A2 with humanized ligand binding pocket
The structural construction of cOR5A2 and human native OR5A2 is generated using open source Alphafold (version 2.1.0) developed by Deepmind. The full amino acid sequence is fed into the alphafold model trained on the full genetic database including, BFD, MGnify, PDB70, PDB, Uniclust30 and UniRef90. From the generated pdb structure, the two proteins are centered and rotated with kabasch method to minimize the RMSD for structural comparison. The ligand binding pocket is manually labeled by taking the extracellular loop 2 and the top of the transmembrane domain 3, 5, 6 and 7 above the FYG motif of TM6. Calculations of distance, directionality and angles are calculated for every amino-acid relative to the binding pocket center. Here, the binding pocket center of the binding cavity is obtained by the common centroid function. Distance is simply calculated by subtracting the longest carbon chain position of the amino acid (CX) by the center of the binding pocket. The angle of relative amino acid is obtained by finding the angle degree created between the two vectors of alpha carbon (CA) to CX on the amino acid and CA to binding pocket center. All data pre-processing and analysis were conducted using open source Python package pandas (version 1.2.1), and numpy (version 1.19.5).
Data Introduction and Pre-processing for Machine Learning
The dataset containing 20 agonists and 27 non-agonists against a consensus version of OR5A2 was used to train and evaluate the ML classifiers. Oversampling technique was applied to randomly replicate the agonists’ data samples41. In this study, we employed the compound structural information as data features, represented by the Count-based Morgan Fingerprint (CMF) in radius 2, which represents each chemical structure as molecular fragments by iteratively obtaining distinct paths through each atom of the molecule42. These fragments were hashed into a Morgan Fingerprint bit (MFB) vector. The CMF counts the occurrences of MFB, forming an integer vector instead of a binary vector. The CMFs were calculated using RDKit library (version 2020.03.2), and no bit collisions occurred were also confirmed manually in this study. Overall, 2048 MFBs were calculated. Since the size of the data sample used in this study was relatively small, we conducted feature selection (FS) to eliminate non-correlated features by calculating intercorrelations and regression coefficient values to the target variable43. One of the features in every pair with the intercorrelation values over the threshold of 0.9 were omitted. We also ensembled four regression models, Least Absolute Shrinkage and Selection Operator (LASSO), Ridge Regression (RR), Stepwise LASSO, and Random Forest (RF) to omit low correlation features. The Observations with missing values and the feature with no variance were also eliminated. All the features were rescaled to standardize numerical inputs whose mean was zero, and standard deviation was one for the ML classification models. After the data pre-processing, we retrieved 14 features for the model training and evaluation. All the data pre-processing and feature selection were conducted using open source Python package scikit-learn (version 0.23.2), pandas (version 1.2.1), and numpy (version 1.19.5).
Model Construction and Validation
In this study, we employed eight widely used classification models, including linear and non-linear models, to predict the compound bioactivity against OR5A2 and compared their performances. The eight models were as follows: naïve bayes (NB), linear discriminant analysis (LDA), logistic regression (LR), random forest (RF), K-nearest neighbors (KNN), Gaussian process (GP), support vector machine (SVM), and XGBoost linear (XGBL). Leave-one-out cross validation (LOOCV) was applied to evaluate the model performances44. LOOCV allows using the maximum amount of data samples to train the ML models and leave only one single instance as test set in every cross-validation fold. Since the learning algorithm can apply at least once for each instance, it solves the bias problem caused by small dataset when it is randomly split. Furthermore, the model performances were also evaluated from Accuracy, F1 score, and the area under the receiver operator curve (ROC_AUC) score. All the ML models were constructed using open source Python package scikit-learn (version 0.23.2). The hyperparameter setting for all the models were set as default in the scikit-learn package (version 0.23.2).
Data Availability Statement
All data discussed in the paper are available in the manuscript or Supplemental information