HighAltitudeOmicsDB: An integrated resource for High-Altitude associated genes and proteins, interacting networks and semantic-similarities.

Millions of people worldwide visit, live, or work in the hypoxic environment encountered at high altitudes, and it is important to understand the biomolecular responses to this stress. Such understanding would help in designing mitigation strategies for High Altitude Illnesses (HAIs). In spite of numerous studies spanning more than 100 years, the complex mechanisms controlling acclimatization to hypoxia remain largely unknown. Some biomolecules, though, have been proposed as potential diagnostic, therapeutic and predictive markers for high-altitude (HA) stress. HighAltitudeOmicsDB is a resource that provides a comprehensive, curated, user-friendly and detailed compilation of genes/proteins that have been experimentally validated to be associated with various HA conditions, together with their Protein-Protein Interactions (PPIs) and Gene Ontology (GO) semantic similarities. For each database entry, HighAltitudeOmicsDB stores the level of regulation (up/down-regulation), fold change, control group (low landers or high landers), duration and altitude of exposure, tissue of expression, source organism, level of hypoxia, experimental validation, place/country of study, ethnicity, geographical location, and a link to the respective publication. The database also collates information on disease and drug associations, Gene Ontology and KEGG pathway associations. The web resource also acts as a server platform that constructs PPI networks and extracts GO semantic similarities among the interactors. These features help to offer mechanistic insights into disease pathology. Hence, HighAltitudeOmicsDB is a unique platform for researchers working in this area to explore, fetch, compare and analyze HA-associated genes/proteins, their PPI networks and GO semantic similarities. The database is available at http://www.altitudeomicsdb.in


Introduction
A large percentage of the world's population lives in high-altitude areas, and many people also visit mountains above 2500 m for outdoor activities such as trekking, climbing, and other adventure sports. Ascent to high altitude is accompanied by a decrease in barometric pressure: the oxygen concentration of the air remains the same, but the number of oxygen molecules per breath is reduced. For example, at an altitude of 3,600 meters the barometric pressure decreases to 483 mmHg, so roughly 40% fewer oxygen molecules are available per breath than at sea level.
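The arithmetic behind the 3,600 m example can be checked with a short sketch. It assumes a sea-level barometric pressure of 760 mmHg and an oxygen fraction of 0.2095 in dry air; both are standard textbook values rather than numbers taken from this paper, and the function names are illustrative only.

```python
SEA_LEVEL_PRESSURE_MMHG = 760.0  # assumption: standard sea-level pressure
OXYGEN_FRACTION = 0.2095         # assumption: O2 fraction of dry air, unchanged with altitude


def oxygen_partial_pressure(barometric_pressure_mmhg):
    """Partial pressure of oxygen in dry air (mmHg) at the given barometric pressure."""
    return OXYGEN_FRACTION * barometric_pressure_mmhg


def relative_oxygen_availability(barometric_pressure_mmhg):
    """Oxygen molecules per breath relative to sea level (1.0 means sea level)."""
    return barometric_pressure_mmhg / SEA_LEVEL_PRESSURE_MMHG


pressure_at_3600_m = 483.0  # value quoted in the text
print(round(oxygen_partial_pressure(pressure_at_3600_m), 1))
print(round(relative_oxygen_availability(pressure_at_3600_m), 2))
```

Under these assumptions, a breath at 483 mmHg carries about 64% of the oxygen molecules of a breath at sea level, although the oxygen fraction of the air is unchanged.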
Since the amount of oxygen required for activity is the same, the body must adjust to having less oxygen, a condition known as hypobaric hypoxia [1]. Some lowland residents adjust to the reduced oxygen availability at high altitude through a process known as acclimatization, but others suffer from disorders such as Acute Mountain Sickness (AMS), High-Altitude Cerebral Edema (HACE), and High-Altitude Pulmonary Edema (HAPE) ([2], [3]). Research on identifying the early signs of these physiological alterations is therefore gaining momentum. A recent comparison of the protein profiles of low landers upon their induction to high altitude identified differentially expressed proteins, such as the serum proteins irisin and myostatin, acute precursor proteins (APPs), and apolipoprotein A1, in high-altitude acclimatization [4], [5], [6]. These proteins are associated with energy-related processes, skeletal muscle regeneration, inflammatory responses, and other hallmark molecular responses at high altitude. They are therefore being proposed as biomarkers to predict early acclimatization of individuals at high altitude.
Hunting for novel protein biomarkers in samples from low landers and natives using peptide profiling has become a key method ([3], [6]). Identification of the differentially expressed proteins that play a key role in the acclimatization process has helped to uncover the mechanisms responsible for acclimatization at high altitudes. For example, a genome-wide study uncovered plasma proteins that have the potential to predict vascular homeostasis during HAPE ([7]). Similarly, a transcriptomic study indicated the modulation of multiple pathways and proteins, such as VIM, CORO1A, CD37 and STMN1, in the early phase of hypobaric hypoxia exposure ([8]). Although an enormous body of literature reports 'omics' profiles of humans and animals exposed to high altitude, the real challenge remains to integrate all these studies into a holistic understanding of the continuously evolving mechanisms involved in the functional adaptation of cells, tissues and organs, as well as the whole organism, to the high-altitude hypoxic environment. Hence, we developed HighAltitudeOmicsDB, in which these scattered data are collected, curated, analyzed and visualized. The database currently contains ~1300 entries that are manually curated from peer-reviewed publications. It stores the association of each protein with HA stress in terms of level of regulation (up/down-regulation), fold change, control group (low landers or high landers), duration and altitude of exposure, tissue of expression, source organism, level of hypoxia, experimental validation, place/country of study, ethnicity, geographical location, etc. The database also records whether the protein has been reported as a high-altitude biomarker, with a link to the respective publication. In addition, it collates the official symbol of each protein and provides the protein-protein interaction network of each protein with its top-50 interacting partners.
The network can be visualized interactively on the webserver. HighAltitudeOmicsDB also calculates the GO semantic similarity between each protein and these 50 interactors to identify functionally related proteins.
The database additionally stores the transcription factors (TFs) interacting with each gene and their regulation type (repression, activation, distal, proximal, etc.). The miRNAs interacting with the gene are also listed. Thus, HighAltitudeOmicsDB is a unique integrated platform to explore, retrieve, compare and evaluate genes/proteins associated with HA stress, their PPI networks and semantic similarities, and their regulation by transcription factors and miRNAs. This will help to uncover the crosstalk between proteins that underlies acclimatization to HA and provide mechanistic insights into these complex molecular responses. It will thus be useful in identifying novel and robust molecular biomarker candidates that can further help in the development of new diagnostic, prognostic, and therapeutic strategies for high-altitude disorders.

Methodology
Data Collection
A combination of keywords such as "high altitude", "protein", "gene", "omics", "hypobaric hypoxia" and "anoxia" was used for extensive literature mining from PubMed and the Google search engine as of 2020 ([9]). The publications were manually scrutinised to identify differentially expressed genes/proteins. After removing redundant and duplicate entries, a comprehensive list of proteins that are Differentially Expressed (DE) at HA was curated from these publications. For each DE protein, the associated information was also fetched, including 'Name of the protein', 'Protein Official Symbol', 'Aliases', homologous 'Human Entrez ID', 'Source Organism', 'Tissue of expression', 'Level of hypoxia', 'Altitude', 'Duration of experiment', 'Level of regulation', 'Fold change', 'Experiment details', 'Geographical location', 'Ethnicity', 'Control group' and 'Associated as Biomarker'. For studies in which the source organism was not human, the homologous human gene/protein was identified and stored. In this way, results from experiments conducted on other organisms (mice/rats/yak/bird/toad/sheep) can be translated to their human equivalents more easily. The collection was written in JavaScript Object Notation (JSON) file format and stored in MongoDB. For each protein in the database, its top-50 protein interactors were identified using the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) webserver ([11]). The stringency of the search was kept at the highest level (0.9), and a filter was set to allow a maximum of 50 associated proteins as direct interactors of the queried protein. The STRING database constructs protein-protein interaction networks based on multiple sources of information, i.e. neighbourhood on the chromosome, gene fusion, phylogenetic co-occurrence, homology, co-expression, experimentally determined interactions, database annotations and automated text mining. The interaction file downloaded from the STRING database was stored in JSON format.
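The interactor selection described above can be sketched as a simple filter. The snippet below is an illustration only: it does not call the STRING service and is not the authors' code. It assumes interaction records are dictionaries with the hypothetical keys `partner` and `score` (a combined confidence between 0 and 1), and the example scores are invented.

```python
import json


def top_interactors(interactions, min_score=0.9, limit=50):
    """Keep partners scoring at least min_score, best first, at most `limit` of them."""
    kept = [rec for rec in interactions if rec["score"] >= min_score]
    kept.sort(key=lambda rec: rec["score"], reverse=True)
    return kept[:limit]


# Illustrative records; partner names and scores are made up for the example.
raw = [
    {"partner": "EP300", "score": 0.998},
    {"partner": "VHL", "score": 0.999},
    {"partner": "ABC1", "score": 0.72},
]

selected = top_interactors(raw)

# One JSON document per query protein, ready to be written to a file.
document = {"query": "HIF1A", "interactors": selected}
print(json.dumps(document, indent=2))
```

Sorting before truncating matters: when more than 50 partners pass the threshold, the cap then keeps the highest-confidence ones rather than an arbitrary subset.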

Data Processing and Enrichment
To make the database more informative, several other attributes were added: protein-disease associations were mined from DisGeNET [12], and protein-drug relationships from the DGIdb 3.0 database ([13]). All these attributes were also stored in JSON files.
Semantic comparison of genes based on their Gene Ontology (GO) annotations is an approach for quantitatively assessing the functional similarity between them, and it has been used extensively across varied bioinformatics analyses ([14], [15]). The higher the semantic similarity score, the greater the probability that two genes/proteins have a similar molecular function or are involved in a common biological process [14]. A low semantic similarity score, in contrast, indicates that two genes perform different molecular functions. To compute semantic similarity, each protein in HighAltitudeOmicsDB and its top-50 direct interacting proteins were submitted to GOSemSim, an R package for semantic similarity computation among GO terms, sets of GO terms, gene products and gene clusters ([15]). The results were represented as a 51 × 51 matrix. All these matrix files were also stored in JSON file format.
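To make the shape of this result concrete, the sketch below builds a symmetric similarity matrix for a query protein and its interactors. It is written in Python and uses a Jaccard index over GO term sets as a deliberately simple stand-in: GOSemSim implements ontology-aware measures that this toy score does not reproduce. The gene names and GO identifiers are placeholders.

```python
def jaccard(terms_a, terms_b):
    """Toy similarity: shared GO terms divided by all GO terms of the two genes."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_matrix(annotations):
    """annotations maps gene -> iterable of GO IDs; returns (gene order, square matrix)."""
    genes = list(annotations)
    matrix = [
        [jaccard(annotations[g], annotations[h]) for h in genes]
        for g in genes
    ]
    return genes, matrix


# Placeholder annotations: one query protein plus two interactors.
annotations = {
    "QUERY": ["GO:x1", "GO:x2", "GO:x3"],
    "P1": ["GO:x1", "GO:x2"],
    "P2": ["GO:x4"],
}

genes, matrix = similarity_matrix(annotations)
for gene, row in zip(genes, matrix):
    print(gene, [round(value, 2) for value in row])
```

With one query protein and 50 interactors as input, the same function yields the 51 × 51 layout described above.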

Database Development
All constructed JSON files were transferred to a MongoDB database collection and uploaded to the server (localhost) using pymongo. Server query commands were written in MongoDB Compass. The Vis.js library was used to display the protein-protein interaction network. IDs such as Human Entrez ID, UniProt ID, Protein Official Symbol, EC Number, PDB ID, InterPro ID, Pfam ID, dbSNP ID, and reference PMIDs present in all tables are hyperlinked to the corresponding databases to provide additional details. The web interface also has a 'Contact us' page, which includes an option to email the developers in order to submit new data; submissions are reviewed and then appended to the database.
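A minimal sketch of the loading step is given below, under stated assumptions. The record fields are hypothetical, and the `InMemoryCollection` class is a stand-in used so the example runs without a database server; with pymongo, a real collection object obtained from `MongoClient` exposes an `insert_many` method that could be passed to `store_records` instead.

```python
import json


class InMemoryCollection:
    """Stand-in for a database collection; only mimics the insert_many call used below."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)


def load_records(json_text):
    """Parse a JSON array of protein records into a list of dictionaries."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records


def store_records(collection, records):
    """Insert the records and return how many were stored."""
    if records:  # skip the call for an empty batch; pymongo rejects empty lists
        collection.insert_many(records)
    return len(records)


# Hypothetical file content with invented field names.
json_text = '[{"symbol": "VIM", "regulation": "up"}, {"symbol": "CD37", "regulation": "down"}]'

collection = InMemoryCollection()
stored = store_records(collection, load_records(json_text))
print(stored, "records stored")
```

Passing the collection in as an argument keeps the parsing and storing logic independent of where the data end up, which also makes the step easy to test offline.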

Web Interface
HighAltitudeOmicsDB (Figure 1) is a user-friendly, free-to-access resource that requires no prior registration. It is a comprehensive, non-redundant, manually curated resource of genes/proteins whose expression levels have been experimentally validated to be associated with high-altitude stress. The database can be explored using the "Browse" and "Search" options.
The "Browse" option allows the user to select single or multiple genes/proteins of the database from a pull-down menu. Alternatively, the user may upload a file containing the protein official symbols or type the symbols directly. Clicking the adjacent 'Browse' button leads to a table in which each entry is hyperlinked to the individual protein page. If the user's list contains protein symbols that are not in the database, a separate table highlighting them is also provided (Figure 1).
The "Search" option offers multiple ways to explore the database according to the user's research interests. Search by chromosome allows the user to click on any human chromosome number and identify the proteins of HighAltitudeOmicsDB encoded on that chromosome (Figure 2). Search by 'duration of experiment' identifies the genes/proteins whose expression changes over hours/days/weeks/months/years. Searching by 'Tissue of expression' opens a pull-down menu from which the user can choose the tissue of interest. Searching by 'Ethnicity', 'source organism', 'level of regulation' or 'geographical location' similarly opens a pull-down menu from which the user may choose the ethnicity, source organism, up/down-regulation or location, respectively, and obtain a tabular list of genes/proteins that are hyperlinked to the detailed information page of each protein (as discussed in the following sections).
Additionally, the 'Associated as Biomarker' option leads to a tabular list of proteins that have been proposed/validated as molecular biomarkers for HA stress (Figure 2). The protein symbols are hyperlinked to the respective protein page, which provides a link to the PubMed record of the publication that validates the protein as a biomarker. To fetch proteins that are DE in an altitude-dependent manner, a user-interactive slider (ranging from 2200 m to 9800 m) is provided. The user may set the slider values and fetch genes/proteins associated with a defined altitude range. This filter can be combined (AND/OR) with the time of exposure to HA and the level of regulation (up/down). The user can thus make combination queries such as up/down-regulated proteins expressed within days at an altitude range of 2200 m to 4500 m. The list of these proteins can be downloaded in Excel/CSV format for further analysis.
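The logic of such a combination query can be sketched in a few lines. This is an illustration under assumptions, not the server's implementation: the entry fields (`altitude_m`, `regulation`, `duration_unit`) and the example values are hypothetical.

```python
def matches(entry, altitude_range, regulation=None, duration_unit=None, mode="AND"):
    """Check one entry against an altitude range plus optional criteria."""
    low, high = altitude_range
    checks = [low <= entry["altitude_m"] <= high]
    if regulation is not None:
        checks.append(entry["regulation"] == regulation)
    if duration_unit is not None:
        checks.append(entry["duration_unit"] == duration_unit)
    # AND requires every criterion to hold; OR requires at least one.
    return all(checks) if mode == "AND" else any(checks)


def query(entries, **criteria):
    """Return the entries that satisfy the combined criteria."""
    return [entry for entry in entries if matches(entry, **criteria)]


# Invented entries for demonstration.
entries = [
    {"symbol": "P1", "altitude_m": 3500, "regulation": "up", "duration_unit": "days"},
    {"symbol": "P2", "altitude_m": 5200, "regulation": "up", "duration_unit": "days"},
    {"symbol": "P3", "altitude_m": 4000, "regulation": "down", "duration_unit": "weeks"},
]

strict = query(entries, altitude_range=(2200, 4500), regulation="up", duration_unit="days")
loose = query(entries, altitude_range=(2200, 4500), regulation="up", mode="OR")
print([e["symbol"] for e in strict])
print([e["symbol"] for e in loose])
```

In this toy data set the AND query keeps only the entry that satisfies all three criteria, while the OR query returns every entry that meets at least one of them.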
The webserver also allows the user to explore the proteins of HighAltitudeOmicsDB associated with a particular Transcription Factor (TF), microRNA (miRNA), disease, drug, GO term or KEGG pathway (Figure 3).
The details of each protein and its association with HA are provided on the detailed information page, which is divided into six sections.

Knowledge Base
This is the first section of the detailed information page. It gives general information about the protein, such as Protein Official Symbol, Aliases, Chromosomal location, Length, UniProt ID, EC number, Pfam ID, PDB ID, InterPro ID and dbSNP ID, which makes cross-linking to additional databases quick and easy. The UniProt ID is hyperlinked to the UniProt database (Figure 4(i)).

Interactions and Semantics
The top-50 direct protein interactors of each protein are identified from the STRING database using the cut-offs described in the Methodology section. The network is displayed in a user-interactive format with translation, zoom-in and zoom-out features. The nodes are color coded (yellow: the protein being studied; blue: the top-50 interactors) (Figure 4). The edges are also color coded (yellow: interactions between the protein being studied and its 50 direct interactors; blue: interactions among the top-50 interactors). The network can be downloaded in .sif format, which can be visualised in network visualization software such as Cytoscape, BiNA, etc. The list of interactions and their combined scores is provided in a tabular format that can be downloaded in Excel/CSV format. The table also has a 'search' option for finding a protein of interest.
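For readers unfamiliar with the .sif (Simple Interaction Format) file, the sketch below writes an edge list in that layout: one interaction per line, with a source node, a relationship type and a target node separated by tabs. The relationship label `pp` is a common convention for protein-protein interactions and is an assumption here, as are the node names; the labels used in the files served by the database may differ.

```python
def to_sif(edges, relation="pp"):
    """Serialise (source, target) pairs as SIF text, one interaction per line."""
    lines = [f"{source}\t{relation}\t{target}" for source, target in edges]
    return "\n".join(lines) + "\n" if lines else ""


# Placeholder network: a query protein with two interactors that also interact.
edges = [("QUERY", "P1"), ("QUERY", "P2"), ("P1", "P2")]

sif_text = to_sif(edges)
print(sif_text, end="")
```

A file written this way can be imported as a network into Cytoscape, which reads the three columns as source, interaction type and target.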
The pairwise GO semantic similarity score is calculated between the protein being studied and its top-50 interacting proteins, as described in the Methodology section. The results are visualised as a 51 × 51 matrix, in which GO semantic similarity scores > 0.8 are highlighted in red. If any protein among the top-50 interactors is also part of HighAltitudeOmicsDB, its symbol in the matrix is hyperlinked to the respective detailed protein information page within the database. This helps to identify functional hubs of proteins associated with HA stress, which could shed light on the molecular basis of acclimatization/adaptation (Figure 4).
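The highlighting rule can be expressed as a scan over the upper triangle of the similarity matrix. The following sketch assumes the matrix is available as a list of rows alongside the gene order; the names and scores are invented for illustration.

```python
def high_similarity_pairs(genes, matrix, threshold=0.8):
    """List each gene pair once whose similarity is strictly above the threshold."""
    pairs = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):  # upper triangle: skip diagonal and mirror
            if matrix[i][j] > threshold:
                pairs.append((genes[i], genes[j], matrix[i][j]))
    return pairs


# Invented 3 x 3 example; a real matrix would be 51 x 51.
genes = ["QUERY", "P1", "P2"]
matrix = [
    [1.0, 0.92, 0.35],
    [0.92, 1.0, 0.8],
    [0.35, 0.8, 1.0],
]

for gene_a, gene_b, score in high_similarity_pairs(genes, matrix):
    print(gene_a, gene_b, score)
```

Because the comparison is strict, a score of exactly 0.8 is not flagged, matching the "> 0.8" rule stated above.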

Association with High Altitude
For each protein, its association with HA stress is compiled in a tabular format. The details are presented as the human protein symbol, source organism (the organism in which the study was performed), tissue of expression, level of hypoxia, altitude, duration of experiment, level of expression, fold change, experiment details, geographical location, ethnicity, control group expression, control group details and reference paper (Figure 4). The association of the protein as a biomarker is also compiled: if the protein has ever been experimentally validated as a biomarker, the entry in the column is "Yes", otherwise "No". The papers are hyperlinked to PubMed, which allows ready access to the original publication. In this format, the expression changes of a protein across different durations, tissues and altitude conditions can be quickly explored, compared and analysed.

Association with TFs and miRNAs
Transcription factors and miRNAs are two of the most important transcriptional and post-transcriptional regulatory molecules fine-tuning the expression of genes. The lists of TFs and miRNAs that are known to regulate the protein being studied are therefore presented in tabular format. The TF association table lists the TF symbol (hyperlinked to the GeneCards database), its Entrez ID, the symbol and Entrez ID of the protein being studied, the type of association, a link to the publication that ascertained this association and the database from which the association was extracted. The table is downloadable in Excel/CSV format and provided with a 'search' option to explore it with a user-defined keyword (Figure 5).
Similarly, the miRNA-gene association table lists the miRNA miRTarBase ID, the miRNA, the symbol and Entrez ID of the protein being studied, the experiment (luciferase reporter assay/western blot/PCR/immunohistochemistry, etc.), the support type and a link to the respective publication (hyperlinked to PubMed) that ascertained this association. The table may be downloaded in Excel/CSV format and is also provided with a 'search' option to explore it with a user-defined keyword.

Gene Ontology and KEGG Pathway annotations
The Gene Ontology annotations are presented in a tabular format that lists the GO ID, GO term and GO type. The GO ID is hyperlinked to QuickGO, which provides detailed GO annotations. The KEGG pathway annotations are also compiled and presented as KEGG ID and KEGG term. The KEGG ID is hyperlinked to the KEGG database, which provides additional details about the respective pathways (Figure 5).
Both tables can be downloaded in Excel/CSV format and have a built-in 'search' option for keyword search.

Association of proteins with other diseases and drugs
This section provides details of the drug, tissue, and disease associations of HA-associated genes/proteins. The information is presented in two tables, one for each category (Figure 5). The first table shows information about the protein and its associated drugs. This type of information can help users to guide/design protein-based drug-targeting experiments. Both tables are equipped with a "search" option, which helps in finding user-defined terms across lengthy tables. The tables can also be downloaded in Excel/CSV format.

Web Statistics
HighAltitudeOmicsDB contains ~1300 associations of 820 proteins that have been found to be differentially expressed at high altitude. A detailed review of the database shows that all proteins were sourced from experimental studies in 25 tissues (Figure 6a (Figure 7a). 'Metabolic process' is highly associated with weight loss due to the adaptation mechanism at high altitude ([16]). At high altitude, induction of hypobaric hypoxia activates HIF protein, which in turn regulates genes responsible for mediating changes in cellular metabolism/energetics, leading to weight loss due to increased energy expenditure ([17]). The second biological process, 'Outer Dynein Arm Assembly', is a process in axonemal assembly. The increase in the length and density of axoneme-like cilia due to hypoxia has been associated with cell death ([18]). Lastly, 'Response To Reactive Oxygen Species' reflects the redox status of the cell, and disturbances in redox status due to hypobaric hypoxia can lead to oxidative stress and DNA damage ([3]). Similarly, terms such as 'Fructose-Bisphosphate Aldolase Activity', 'Oxidoreductase Activity, Acting On Paired Donors, With Incorporation Or Reduction Of Molecular Oxygen', 'Oxidoreductase Activity, Acting On Peroxide As Acceptor', 'Electron Transfer Activity' and 'ATP Binding' are found to be the top molecular functions of proteins present in the database (Figure 7b). All these molecular functions are direct steps or feedback mechanisms associated with oxidative phosphorylation (aerobic respiration). Mitochondria play an important role in oxidative phosphorylation, and recent clinical studies have revealed a high percentage of mitochondria in the gastrocnemius muscle tissue of high landers, which helps them adapt to an environment of high energy expenditure ([19]). 'COP9 signalosome' and 'Actomyosin' are the two cellular component terms found to be most enriched in the differentially expressed protein sets present in the database (Figure 7c).
The COP9 signalosome is part of the ubiquitin-proteasomal degradation machinery that controls the expression of pVHL, HIF-1α, and other oxygen-responsive transcription factors regulated during hypobaric hypoxia ([20]). Actomyosin, in turn, is the actin-myosin fiber complex of the cytoskeleton, present in different muscle tissues such as skeletal muscle.
The muscle fiber-type composition of both adult animals and humans is markedly altered during chronic exposure to high altitude.
The KEGG pathway enrichment shows 'hsa00910: Nitrogen metabolism' as the most enriched pathway in the differentially expressed HA protein set (Figure 7d). Nitrogen metabolism is a process of nitrogen oxide production, and these oxides, such as nitrous, nitrite and nitrate, have been found to play an important role in high-altitude acclimatization responses ([21]). Thus, the proteins in the database are associated with hallmark responses to hypobaric hypoxic stress, which supports the comprehensiveness of the database.

Conclusions
HighAltitudeOmicsDB is an interactive resource and server platform that captures and organizes knowledge on genes/proteins associated with HA stress. It provides a comprehensive view of different HA-related studies and offers annotation and visualization of the PPI networks and semantic similarities associated with each gene/protein in the database. HighAltitudeOmicsDB is the first database of its kind, holding a collection of differentially expressed genes/proteins gathered through literature mining and manual curation. It enables the user to browse biomolecules using different query filters, i.e. level of expression, duration of experiment, altitude and source organism. HighAltitudeOmicsDB also encompasses protein-associated information such as protein-disease and protein-drug associations. Hence, the information base of HighAltitudeOmicsDB is much larger than that of any other database of its kind. HighAltitudeOmicsDB also identifies PPIs for each protein in the database and calculates the GO semantic similarity between them. The analysis of PPI networks and similarities enables the user to infer mechanistic insights into HA stress. The webserver also offers functional correlation of proteins, which includes both GO enrichment and KEGG pathway enrichment.