Identification and Annotation of Peptide Allergens in Prunus dulcis

Approximately 10% of the world population faces the challenge of food allergy, directly or indirectly. In this study, a genome-wide identification and annotation of novel putative allergens from almond was performed. Initially, the whole proteome of almond (31,000 proteins) was scanned against Allergenonline, a publicly available database of previously reported allergens from different sources. The analysis suggested 430 putative allergens, which were reduced to 45 by motif-based screening using the AllFam database. These predicted allergens were annotated for their function using the Pfam and GO databases and orthology analysis. To validate our prediction, we used structural insights into allergen-antibody interactions for one of the predicted putative allergen proteins, homologous to the Pru ar 3.0101 allergen from apricot. The structure of the putative allergen was modelled, and molecular docking studies were performed against the antibody. The best docked conformation was subjected to molecular dynamics simulation to assess the stable binding of these two molecules. This analysis suggests that the identified allergen will show cross-reactivity similar to that of the Pru ar 3.0101 allergen from apricot. This is one of the first reports identifying and annotating a homolog of the Pru ar 3.0101 allergen in almond.


Introduction
The prevalence of food allergy ranges from 5 to 10% of the world population (Pepper 2020), with no current cure (Moore et al. 2017). Tree nuts are allergenic foods that often cause deadly anaphylactic reactions. Generally, tree nut allergies that develop in childhood continue into adulthood (Iweala 2018). Tree nut allergies are estimated to affect about 0.05 to 7.3% of the population (Weinberger 2018). Tree nut allergens are mostly seed storage proteins such as vicilins and legumins. In an allergic reaction, IgE antibodies present on the surface of mast cells and basophils bind to tree nut allergens, resulting in degranulation and release of histamine and other inflammatory mediators (Smeekens 2018). Tree nuts also contain so-called pan-allergens (profilins, heveins, lipid transfer proteins), which show high IgE-mediated cross-reactivity towards pollen and food homologs (Valenta 1992). Reports from PREVALE (Prevalence of Food Allergy in Leganés) reveal that allergy to tree nuts in the first 3 years of life alone accounts for 0.83% of the total (LÁ 2020).
Until now, individual tree nut allergies such as apricot have received only minimal public attention compared with peanut allergies, although their prevalence is as high as that of peanut allergy. For that reason, tree nut allergies are not well studied (Che 2019). For example, only 2 macadamia (Macadamia integrifolia) nut allergens have been identified so far, although several such allergies have been recorded.
Besides this, tree nuts occupy a major portion of current human diets, mainly due to their taste and promising health benefits. Various studies have reported beneficial effects of tree nuts on coronary disease and serum cholesterol levels (Rehm and Drewnowski 2017). Among all the tree nuts, almonds (Prunus dulcis or Amygdalus communis L.) constitute a prime portion of global consumption, followed by walnuts (Juglans regia), cashews (Anacardium occidentale), pistachio nuts (Pistacia vera), hazelnuts (Corylus avellana), pecan nuts (Carya illinoinensis), macadamia nuts (Macadamia ternifolia), Brazil nuts (Bertholletia excelsa), and pine nuts (Pinus pinea and other Pinus species). As a consequence of this high consumption, there are various reported cases of related allergies.
Reports from the United States put the prevalence of almond and cashew nut allergies at 0.7% among children and adults (Gupta 2018). Likewise, Australian surveys found the prevalence of almond allergy to be 0.2% in adolescents (Sasaki 2018) and 0.3% in children (McWilliam 2019). Patients with peanut allergy demonstrated a high rate of concurrent almond sensitization, ranging from 45 to 71%; however, only 11% showed allergic symptoms (Mustafa 2020). More than 40% of patients with tree nut allergies showed almond sensitization (Elizur 2018).
The foremost step in recognizing the allergy-related impacts of food proteins is discovering and characterizing food allergens (Che 2019). The aim of the present study is to screen for novel allergen proteins by proteome-wide analysis of almond (Prunus dulcis). We filtered out the putative allergens on the basis of sequence identity to already known allergens. Thereafter, motif analysis along with homology studies further refined our set of putative allergens. Finally, T-cell and B-cell epitope conservancy examination led us to a very narrow set of putative allergens.

Identification of Potential Allergen Proteins
To discover potential allergens in the Prunus dulcis proteome, Allergenonline version 21 was used (www.allergenonline.org). This database contains 2233 protein sequences (913 groups, 430 species) and is maintained by the Food Allergy Research and Resource Program (FARRP), University of Nebraska-Lincoln. For identification, each FASTA sequence was searched against the database with the Full FASTA 36 search method provided within the Allergenonline toolbox. The search criteria were E-value ≤ 1E-07 and identity ≥ 50%, and the maximum number of alignments was left at its default.
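The two cutoffs above act as a simple conjunctive filter on search hits. The following is a minimal sketch of that filter, assuming the hits have already been parsed into records; the field names (`query`, `evalue`, `identity`) and the example values are hypothetical and are not part of the Allergenonline output format.

```python
from typing import Dict, List


def filter_hits(hits: List[Dict], max_evalue: float = 1e-7,
                min_identity: float = 50.0) -> List[Dict]:
    """Keep hits with E-value <= max_evalue and percent identity >= min_identity."""
    return [h for h in hits
            if h["evalue"] <= max_evalue and h["identity"] >= min_identity]


# Hypothetical records; the values are invented for illustration only.
example_hits = [
    {"query": "protein_A", "evalue": 1e-30, "identity": 92.0},
    {"query": "protein_B", "evalue": 1e-05, "identity": 80.0},  # fails the E-value cutoff
    {"query": "protein_C", "evalue": 1e-12, "identity": 41.0},  # fails the identity cutoff
]
putative = filter_hits(example_hits)
```

Both conditions must hold for a sequence to be retained, so a highly identical but short (high E-value) match is discarded just like a significant but weakly identical one.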

Motif Based Screening of Potential Cross-Reactive Protein Sequences
The identified potential allergens were screened further. Here, we selected allergen sequences from the AllFam database and used them as a reference for screening. AllFam comprises 151 families with 1059 allergens in total and is maintained by the Medical University of Vienna. Reference allergens were chosen on two bases: first, their routes of exposure, and second, their source. Motifs from each query sequence were compared with those found in the reference, and the sequences with the most matches were selected.
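The comparison of query and reference motifs can be pictured as a set intersection. The sketch below is only an illustration under stated assumptions: the function name, the `min_shared` threshold and the "Kinase" label are hypothetical, and the text does not specify how the number of matches was scored.

```python
from typing import Dict, Set


def screen_by_motifs(query_motifs: Dict[str, Set[str]],
                     reference_motifs: Set[str],
                     min_shared: int = 1) -> Dict[str, Set[str]]:
    """Return the queries that share at least `min_shared` motifs with the reference."""
    selected = {}
    for seq_id, motifs in query_motifs.items():
        shared = motifs & reference_motifs  # motifs present in both sets
        if len(shared) >= min_shared:
            selected[seq_id] = shared
    return selected


# Motif names as reported by a motif search; the query IDs are placeholders.
reference = {"Bet_v_1", "Thaumatin", "Profilin"}
queries = {"query_1": {"Bet_v_1", "Kinase"}, "query_2": {"Kinase"}}
matched = screen_by_motifs(queries, reference)
```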

Classification of Putative Allergens into Protein Families Along with Their Gene Ontology (GO)
The retrieved sequences were submitted to the Pfam database (https://www.pfam.xfam.org) and the Conserved Domain Database (https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml/) for protein family analysis. Pfam 34.0 contains 19,179 families (as of May 2021) and is produced and maintained by the European Bioinformatics Institute. Protein families were retrieved using the hidden Markov model method (https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan/). Gene ontology (GO) analysis was carried out to determine the biological, molecular or cellular function of all the discovered potential allergens. GO accession numbers of the retrieved allergens were obtained from the InterPro database (https://www.ebi.ac.uk/interpro/). For plotting and visualizing the ontology data, the GO accessions were submitted to Web Gene Ontology Annotation Plot (WEGO, https://www.wego.genomics.cn/).

Orthology Modelling and Phylogeny
The Galaxy server was used for analyzing orthologs of the putative allergens (https://usegalaxy.eu/). The Proteinortho tool on the Galaxy server was used for finding the orthologs (Lechner 2011). It is a tool for detecting orthologous genes or proteins across different species. Basically, it compares the similarities between the given sequences (genes or proteins) and assembles them into significant groups. The DIAMOND algorithm with an E-value cutoff of 1E-03 was used for finding the orthologs. Orthology analysis was carried out using the proteomes of related species, i.e., apricot (Prunus armeniaca), cherry (Prunus avium) and peach (Prunus persica). Orthology groups containing allergenic proteins were examined and selected. A phylogenetic tree of the selected sequences was built using Phylogeny.fr (Dereeper 2010).
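The idea behind pairing sequences across two proteomes can be illustrated with plain reciprocal best hits. This is a deliberately simplified sketch and not Proteinortho's actual procedure, which is more elaborate and handles many species at once; the sequence IDs and scores below are hypothetical.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]  # (query, subject) -> alignment score


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query to its highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd = best_hits(a_vs_b)
    rev = best_hits(b_vs_a)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Hypothetical almond-vs-apricot scores, invented for illustration.
almond_vs_apricot = {("alm1", "apr1"): 200.0, ("alm1", "apr2"): 90.0,
                     ("alm2", "apr2"): 150.0}
apricot_vs_almond = {("apr1", "alm1"): 198.0, ("apr2", "alm1"): 95.0,
                     ("apr2", "alm2"): 60.0}
pairs = reciprocal_best_hits(almond_vs_apricot, apricot_vs_almond)
```

In this toy input, alm2's best hit is apr2, but apr2 scores higher against alm1, so only the alm1/apr1 pair is reciprocal.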

B-Cell Epitope Prediction
The Immune Epitope Database (IEDB) was used for prediction of B-cell epitopes (https://www.iedb.org/). IEDB contains experimental data from antibody and T-cell epitope studies conducted in humans and other animals. The majority of the available data come from infectious disease, allergy and autoimmunity studies. Linear B-cell epitopes were predicted using BepiPred 2.0 and ABCpred. The BepiPred 2.0 web server uses a random forest algorithm trained on epitopes derived from antibody-antigen structures, whereas ABCpred uses an artificial neural network (ANN) to predict B-cell epitopes. Thereafter, epitope conservancy analysis was carried out to find the epitopes conserved between the known allergens and the putative allergen proteins.

Structure Prediction by Molecular Modelling
The SWISS-MODEL server was used for structure modelling of the selected potential allergens (www.swissmodel.expasy.org/). SWISS-MODEL is a web server maintained by the Swiss Institute of Bioinformatics and the Biozentrum at the University of Basel. The target sequence was uploaded along with a project title to search for templates. Among the suggested templates, one template was selected on the basis of its coverage, GMQE (Global Model Quality Estimate), identity and method of structure determination, and was used to build the model of the target sequence. The global quality estimate, local quality estimate and Z-score of the predicted model were used to assess its quality. The quality of the structure was also assessed with the PROCHECK tool. The model with the best quality index was selected for further analysis.

Molecular Docking
The ClusPro 2.0 server was used for molecular docking (www.cluspro.bu.edu/home.php). It is an online server created and maintained by the Vajda lab at Boston University. The finally selected potential allergen structure was uploaded as the ligand, whereas an antibody structure obtained from the Protein Data Bank was uploaded as the receptor. For antibody-specific protein docking, the antibody mode provided by ClusPro was used. For more precise results, the non-CDR regions of the antibody were masked.

Molecular Dynamic Simulation Studies
MD simulation of the best docked complex, i.e., the one with the most negative energy, was carried out using the GROMACS 2020.1 software package. The simulation system consisted of protein and water. All MD simulations were executed using the CHARMM36 force field (Huang 2016). A triclinic box was created around the protein-protein complex and solvated with the TIP3P water model. The system was positively charged, so 3 chloride (Cl⁻) ions were added to neutralize it. After energy minimization, position restraints were applied, followed by NVT and NPT equilibration. NVT (constant Number of particles, Volume, and Temperature) equilibration was performed with protein and non-protein coupling groups at a temperature of 300 K with a coupling constant of 0.1 ps for 100 ps, whereas NPT (constant Number of particles, Pressure, and Temperature) equilibration was performed at a pressure of 1 bar with a coupling constant of 2.0 ps for 100 ps with the same coupling groups. Finally, a production MD simulation of 25 ns was executed (Lemkul 2018).
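The run lengths above translate into integration step counts once a time step is chosen. The sketch below works this out assuming a 2 fs time step; the time step is an assumption made for illustration and is not stated in the text.

```python
TIMESTEP_FS = 2  # ASSUMED integration time step in femtoseconds (not given in the text)


def n_steps(duration_ps: int, timestep_fs: int = TIMESTEP_FS) -> int:
    """Number of integration steps needed to cover `duration_ps` picoseconds."""
    duration_fs = duration_ps * 1000  # 1 ps = 1000 fs
    if duration_fs % timestep_fs:
        raise ValueError("duration is not a whole number of time steps")
    return duration_fs // timestep_fs


nvt_steps = n_steps(100)            # 100 ps NVT equilibration
npt_steps = n_steps(100)            # 100 ps NPT equilibration
production_steps = n_steps(25_000)  # 25 ns production run (25,000 ps)
```

Under this assumption each 100 ps equilibration phase corresponds to 50,000 steps and the 25 ns production run to 12,500,000 steps; a different time step scales these counts proportionally.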

Identification of Putative Allergens
Full-length sequence alignment was performed to identify putative allergens. The newly submitted proteome was obtained from UniProt. It comprises 31,934 proteins, of which 31,932 are unreviewed and 2 are reviewed. We compared our query sequences against the Allergenonline database and found 430 sequences to be putative allergens. The Allergenonline toolbox uses the BLOSUM 50 scoring matrix to execute these alignments. BLOSUM 50 is believed to detect proteins that are highly similar in overall structure and function, irrespective of their evolutionary relationship.

Screening of Potential Allergenic Sequences in Prunus dulcis
The identified allergenic sequences were further screened by comparison with a reference set from the AllFam database. The database categorizes allergens into families along with their sources and routes of exposure. A reference set was created from the proteins in the AllFam database on the basis of their route of exposure, i.e., ingestion, contact and inhalation.
Motifs play an important role in the allergenicity of an allergen. A total of 229 allergen sequences were selected for the reference set. Motifs in the query sequences and in the reference allergen sequences were identified and compared using the motif finder on the GenomeNet server (www.genome.jp/tools/motif/). Finally, the query sequences with matching motifs were selected, which amounted to 45. These sequences were further characterized and analyzed for allergenicity.

Protein Family and Gene Ontology Analysis
The screened query sequences were subjected to protein family and gene ontology analysis. The Conserved Domain Database and the Pfam database were used for protein family analysis. Out of the 45 query proteins, the largest number, 10, belonged to GST_N (glutathione S-transferase). This family includes phase II enzymes capable of performing conjugation reactions. Bet_v_1 contained the second largest number of proteins, 5, followed by EF-hand_1 with 4 proteins (Fig. 1A). EF-hand motifs play a role in the calcium-binding properties of proteins. The protein families NmrA, Enolase_C, Thioredoxin, GST_N_3 and Thaumatin each contained three proteins. Out of the 45 putative allergens, 35 had GO annotations, covering 66 different GO terms, which were categorized into cellular component, molecular function and biological process. For cellular component, 8.6% of the annotated proteins were located in cell part (GO:0044464), 8.6% in cell (GO:0005623) and another 8.6% in protein-containing complex (GO:0032991). For molecular function, 57.1% showed binding activity (GO:0005488), whereas 28.6% had catalytic activity (GO:0003824). For biological process, 65.7% of the proteins were involved in metabolic process (GO:0008152), 65.7% in cellular process (GO:0009987) and only 5.7% in localization (GO:0051179) (Fig. 1B).
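The reported GO percentages are consistent with proportions of the 35 GO-annotated proteins rather than of all 45. The short check below illustrates this; the protein counts per term (3, 20, 10, 23, 2) are back-calculated from the percentages and are therefore assumptions, not values given in the text.

```python
ANNOTATED = 35  # proteins with GO annotations, as stated in the text

# ASSUMED counts per GO term, back-calculated from the reported percentages.
go_counts = {
    "GO:0044464": 3,   # cell part
    "GO:0005488": 20,  # binding
    "GO:0003824": 10,  # catalytic activity
    "GO:0008152": 23,  # metabolic process
    "GO:0051179": 2,   # localization
}


def percent(count: int, total: int = ANNOTATED) -> float:
    """Share of annotated proteins carrying a term, rounded to one decimal."""
    return round(100 * count / total, 1)


go_percent = {term: percent(n) for term, n in go_counts.items()}
```

With these counts the shares come out as 8.6%, 57.1%, 28.6%, 65.7% and 5.7%, matching the figures reported for the respective terms.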

Orthology and Phylogeny Analysis
The protein sequences characterized by protein family and gene ontology analysis were subjected to ortholog analysis. All 45 protein sequences were analyzed with Proteinortho. In total, 45 ortholog groups were created. Among these, 5 groups included orthologs that are already reported allergens listed in the WHO/IUIS Allergen Nomenclature. The phylogenetic tree depicts the groups with their respective orthologs. Accession numbers highlighted in red belong to already reported allergens, and those in blue belong to putative allergens. A total of 4 motifs were found, namely Bet_v_1 (pathogenesis-related protein Bet v 1 family), Thaumatin, Profilin and LTP_2 (lipid transfer protein). The highlighted (blue) proteins showed more than 90% similarity to their respective ortholog allergen proteins and were therefore considered for further analysis (Fig. 2).

B-cell Epitope Prediction
Seven putative allergens were considered for B-cell epitope analysis. Table 1 lists the predicted epitopes with their positions. Default parameters were used for predicting the linear B-cell epitopes. First, the epitopes of all the putative allergens were predicted along with the epitopes of their respective ortholog allergens. Epitopes that were 100% conserved in both the orthologous reported allergens and our almond putative allergens were then selected and used for further analysis (Table 1).
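The 100% conservation criterion can be read as an exact-match test: a predicted epitope counts as conserved if it occurs unchanged in the ortholog sequence. The following is a minimal sketch under that reading; the function name and the toy sequences are hypothetical and do not correspond to any entry in Table 1.

```python
from typing import List


def conserved_epitopes(epitopes: List[str], ortholog_sequence: str) -> List[str]:
    """Return the epitopes that occur as exact substrings of the ortholog sequence."""
    return [e for e in epitopes if e in ortholog_sequence]


# Toy data, invented for illustration only.
ortholog = "MKTAYIAKQRQISFVK"
predicted = ["AYIAKQ", "QISFVR"]  # the second differs from the ortholog by one residue
conserved = conserved_epitopes(predicted, ortholog)
```

A single substitution is enough to exclude an epitope under this criterion, which is why the second toy peptide is dropped.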

Structure Modelling of Selected Protein Sequences
Among the 45 putative allergen sequences, the 7 proteins that showed more than 90% identity to their allergen orthologs were considered for structure modelling. These proteins were modelled using the SWISS-MODEL server. Templates were searched against the PDB (Protein Data Bank). The template 2b5s.1 (Pru p 3, the prototypic member of the family of plant non-specific lipid transfer proteins) was used for modelling the putative allergen protein A0A5E4EF81. It showed a coverage of 0.68, a sequence identity of 92.31% and a GMQE of 0.91. The structure quality parameters, with a GMQE score of 0.48, a QMEAN of 0.77, a Z-score close to 0 and local quality estimate scores above 0.6, confirmed the quality of the modelled structure.
PROCHECK analysis provided Ramachandran plots, which also confirmed the quality of the modelled structures. Moreover, for A0A5E4E9E4, the template used was 5fds (monomeric allergen profilin (Hev b 8)), and the Ramachandran plot shows 91.5% of the residues in the most favored regions, 5.6% in additional allowed regions and 2.8% in generously allowed regions. For the proteins A0A5E4FGB4, A0A5E4FC44 and A0A5E4F9Q4, the template 6stb.1.A (strawberry pathogenesis-related 10 (PR-10) protein Fra a 1.02, Q64W mutant) was used. The Ramachandran plots of all 3 proteins show 91.3% of the residues in the most favored regions and 8.7% in additional allowed regions. The template 3zs3.1.A (Mal d 2, the thaumatin-like food allergen from apple) was used for modelling the structure of A0A5E4GAL7, and 2ahn.1.A (cherry allergen Pru av 2) was used for A0A5E4GM68. The Ramachandran plots of the structures of A0A5E4GAL7 and A0A5E4GM68 show 88.9% and 91.1% of the residues in the most favored regions and 11.1% and 8.9% in additional allowed regions, respectively (Fig. 3).

Molecular Docking and Molecular Dynamic Simulation Studies
The antibody structure obtained from the PDB was docked against the modelled structure. The antibody mode of the online server ClusPro 2.0 was used for docking. Docking generated various docked models, and the model with the lowest energy was selected for further analysis (Table 2). The rainbow-colored sites in the figure depict the predicted conserved linear B-cell epitopes (Fig. 4A).
The conformational stability of the antigen-antibody complex during MD simulation was analyzed by root mean square deviation (RMSD). RMSD was calculated for backbone atoms. At most points throughout the simulation, the system lay within the RMSD window of 0.05-0.9 nm. At the beginning, RMSD rose unevenly until 0.9 ns; thereafter it began to decline, reaching 0.4 nm at the end of the simulation (Fig. 4B).
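For reference, the RMSD of a frame is the square root of the mean squared displacement of the selected atoms from their reference positions. The sketch below shows that calculation in plain Python, assuming the frame has already been superimposed on the reference; it does not perform the least-squares fit that MD analysis tools normally apply first, and the coordinates are invented for illustration.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]  # x, y, z in nm


def rmsd(coords: Sequence[Coord], reference: Sequence[Coord]) -> float:
    """Root mean square deviation between two pre-aligned coordinate sets."""
    if not coords or len(coords) != len(reference):
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum((x - xr) ** 2 + (y - yr) ** 2 + (z - zr) ** 2
                  for (x, y, z), (xr, yr, zr) in zip(coords, reference))
    return math.sqrt(squared / len(coords))


# Toy backbone coordinates: every atom is displaced by 0.3 nm along x.
reference_frame = [(0.0, 0.0, 0.0), (0.15, 0.0, 0.0), (0.3, 0.1, 0.0)]
later_frame = [(x + 0.3, y, z) for x, y, z in reference_frame]
value = rmsd(later_frame, reference_frame)
```

A uniform 0.3 nm shift gives an RMSD of about 0.3 nm here only because no fitting is done; after superposition a rigid shift would contribute nothing, which is why fitted RMSD reflects internal rearrangement.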
Allergy to a particular food can also lead to cross-reaction with a closely related food. Generally, cross-reactivity among tree nut allergens is very high, especially between cashew and pistachio, which belong to the same family, i.e., Anacardiaceae, and between walnut and pecan from the Juglandaceae family (Andorf 2017). Jug r 5, Cor a 1 and Ara h 8 from walnut, hazelnut and peanut, respectively, have been recognised to cross-react with Bet v 1 (Wangorsch 2017; Mittag 2004; Hofmann 2013). Similarly, Pru p 3 (peach lipid transfer protein) has been identified to cross-react with certain lipid transfer proteins, including Ara h 9 from peanut, Cor a 8 from hazelnut, Jug r 3 from walnut and Pru du 3 from almond (Mothes-Luksch 2017).
The identity of protein sequences, along with structural epitopes, plays an important role in predicting cross-reactivity. With regard to sequence identity, it has been observed that two cross-reacting allergens share at least 70% sequence identity (RC 2000). For more precise prediction of cross-reactivity, IgE-binding epitopes should be considered along with the location of the identical sequences (Smeekens 2018). The usage of amino acids in the proteins and peptides of different allergen families also shows a unique pattern (Singh 2022).
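The 70% identity rule of thumb can be checked directly on a pairwise alignment. The sketch below computes percent identity over aligned, gap-free columns; this denominator is one of several conventions in use and is an assumption here, as are the function names and the toy sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def meets_identity_threshold(aligned_a: str, aligned_b: str,
                             threshold: float = 70.0) -> bool:
    """True if the pair reaches the sequence-identity threshold for cross-reactivity."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Toy aligned peptides, invented for illustration: 3 of 4 columns match.
identity = percent_identity("ACDE", "ACDF")
```

Meeting the threshold is only a screening criterion; as noted above, the location of the identical stretches relative to IgE-binding epitopes also matters.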
Considering these criteria, we predicted a putative cross-reactive allergen, A0A5E4EF81, a lipid transfer protein. This peptide shows 92.11% sequence identity to Pru ar 3.0101, an allergen from Prunus armeniaca (apricot), along with conserved B-cell epitopes. Docking and MD simulation analysis (stable RMSD) indicate that the antigen-antibody complex is stable, which supports the possibility that our predicted peptide is cross-reactive with Pru ar 3.0101; however, in vitro verification is still needed.

Conclusion
In the present study, we carried out a genome-wide identification of a novel putative allergen from almond. Screening of the proteome sequences through the Allergenonline and AllFam databases, along with ortholog analysis, narrowed the set down to a single protein, which was further subjected to docking and molecular dynamics analysis to assess its cross-reactivity. The main aim of this study was to provide new insights into almond-associated allergy. As almonds are among the most consumed tree nuts throughout the world, there is a special need to safeguard the consumption of these nuts.

References
Alioto T et al (2020) Transposons played a major role in the diversification between the closely related almond and peach genomes: results from the almond genome sequence. Plant J 101:455-472