Integrative Rare Disease Biomedical Profile based Network Supporting Drug Repurposing, a case study of Glioblastoma

Background: Glioblastoma (GBM) is the most aggressive and common malignant primary brain tumor; however, treatment remains a significant challenge. This study aims to identify drug repurposing candidates for GBM by developing an integrative rare disease profile network containing heterogeneous types of biomedical data.
Methods: We developed a Glioblastoma-based Biomedical Profile Network (GBPN) by extracting and integrating biomedical information pertinent to GBM-related diseases from the NCATS GARD Knowledge Graph (NGKG). We further clustered the GBPN based on modularity classes, which resulted in multiple focused subgraphs, named mc_GBPN. We then identified high-influence nodes by performing network analysis over the mc_GBPN and validated those nodes that could be potential drug repositioning candidates for GBM.
Results: We developed the GBPN with 1,466 nodes and 107,423 edges and consequently the mc_GBPN with forty-one modularity classes. A list of the ten most influential nodes was identified from the mc_GBPN. These notably include Riluzole, stem cell therapy, cannabidiol, and VK-0214, with published evidence supporting their relevance to treating GBM.
Conclusion: Our GBM-targeted network analysis allowed us to effectively identify potential candidates for drug repurposing. This could lead to less invasive treatments for glioblastoma while significantly reducing research costs by shortening the drug development timeline. Furthermore, this workflow can be extended to other disease areas.

In this study, to uncover significant associations relevant to GBM for drug repurposing, we performed network analysis in three steps: 1) we developed a GBM-based Biomedical Profile Network (GBPN) from GBM-related biomedical data extracted from the NGKG,[17] 2) we clustered the GBPN into a modularity class-based network (mc_GBPN) by performing community detection, and 3) we identified high-influence nodes in the mc_GBPN as potential candidates for drug repurposing for GBM via various centrality measures. Figure 1 shows the study workflow.

A. NCATS GARD Knowledge Graph (NGKG)[17]
The GARD Information Center was managed by the NCATS to provide freely accessible consumer health information on over 6500 genetic and rare diseases.
To expand the use of information from GARD for biomedical research in rare diseases, we previously developed the NGKG,[17] a knowledge graph that integrates data from GARD and other well-known rare disease related resources, including Orphanet,[18] OMIM,[19] and MONDO,[20] together with curated mappings between FDA orphan designations and GARD, and information on FDA approval status and drug indications from Inxight Drugs,[17] using our stitcher[21] software. Stitcher defines edges to link equivalent/relevant concepts from different resources; for instance, "N_Name" denotes linked concepts with the same concept names, while "I_CODE" denotes linked concepts sharing the same external reference. In addition, stitcher adopts predicates from original resources, such as "R_equivalentClass" from MONDO. More examples are shown in Fig. 2.

B. GBM-based Biomedical Profile Network (GBPN) Development
To construct the GBPN with GBM-relevant information, we generated a disease cluster pertinent to GBM. This cluster, containing GBM and 91 other GBM-related rare diseases, was generated through a modified version of DL2Vec[22] applied to data obtained from the NGKG and enriched with additional data sources. Specifically, a focused subgraph of the NGKG was extracted containing diseases, genes and phenotypes. The subgraph was annotated with Gene Ontology[23] and Human Phenotype Ontology[24] and then enriched with small molecule and pathway data from Pharos[25] and The Pathway Commons,[26] respectively. Random walks emanating from each rare disease were used to generate a corpus from which disease node embeddings were created. The disease node embeddings were clustered using the k-means algorithm. A detailed description of the disease clustering procedure is given in a separate submission.[27] We extracted 92 subgraphs from the NGKG, each an ego graph[28] of radius 3 centered on a node containing one of those 92 GBM-related rare diseases.
Figure 2 shows one subgraph that is centered on the node of Familial Alzheimer Disease, one disease from the GBM-related disease cluster. We then took the union of these subgraphs to create the GBPN.
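The ego-graph extraction and union step can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used in the study: the edge-list input format, the choice to ignore edge direction when measuring distance, and the toy node names (`g1` ... `g5`, `X`) are all assumptions.

```python
from collections import deque

def ego_nodes(adj, center, radius=3):
    """Nodes within `radius` hops of `center`, found by breadth-first search."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the radius
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def union_of_ego_graphs(edges, centers, radius=3):
    """Union of the ego graphs (induced subgraphs) around each center."""
    # Assumption: edge direction is ignored when measuring distance.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    kept = set()
    for center in centers:
        nodes = ego_nodes(adj, center, radius)
        # An ego graph keeps every edge whose endpoints both lie within the radius.
        kept |= {(u, v) for u, v in edges if u in nodes and v in nodes}
    return kept

# Toy stand-in for the knowledge graph (hypothetical node names).
toy_edges = [("GBM", "g1"), ("g1", "g2"), ("g2", "g3"),
             ("g3", "g4"), ("g4", "g5"), ("X", "g5")]
one_center = union_of_ego_graphs(toy_edges, ["GBM"])
two_centers = union_of_ego_graphs(toy_edges, ["GBM", "X"])
```

With a single center the toy result stops three hops from "GBM"; adding the second center brings in the remaining edges. Graph libraries such as networkx offer `ego_graph` and `compose_all` for the same two operations.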
Optimization. The NGKG maintains connections among equivalent or relevant concepts from different resources via pre-defined edges, e.g. "N_Name" and "I_CODE", or adopted predicates, e.g. "R_equivalentClass" and "R_exactMatch". We optimized the GBPN by merging associated diseases, genes, treatments, etc. connected by those edges into single nodes, yielding a more condensed graph of nodes with enriched biomedical information for efficient network analysis. Specifically, we optimized the GBPN via these rules: 1) the attributes of merged nodes were concatenated; 2) edges were removed if the connected nodes were merged (i.e. if nodes A and B merged, all edges between A and B would be removed); 3) edges were maintained between unmerged and newly merged nodes (i.e. if nodes A and B merged into node AB, an edge from A to node C would be reassigned as an edge from AB to C). The code used to implement rules 1-3 is in the supplemental materials. Synonyms were subsequently filtered out of name labels within newly merged nodes. For instance, if the nodes "Addison's Disease" and "Adrenal aplasia" were merged, both of these labels (which denote the same disease) would be concatenated within the newly merged node. In this case, we would verify that "Adrenal aplasia" is a synonym of "Addison's Disease" by querying the NGKG for the "synonyms" attribute of the "Addison's Disease" node and would subsequently remove "Adrenal aplasia" from the newly merged node's name label in the GBPN. This process was automated and applied to each newly merged node; complementary resources, including the NORD Rare Diseases database,[29] GeneCards,[30] the National Library of Medicine's MedlinePlus,[31] PubChem,[32] and the National Cancer Institute's List of Cancer Drugs,[33] were consulted for this process as well. Figure 3 illustrates one merging example.
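Merging rules 1-3 can be illustrated with a small union-find sketch. It is not the supplemental code: the input shapes (a `{id: name}` mapping and `(source, target, label)` triples) and the non-equivalence label `R_rel` are hypothetical, and the synonym-filtering step is left out.

```python
EQUIVALENCE_LABELS = {"N_Name", "I_CODE", "R_equivalentClass", "R_exactMatch"}

def merge_equivalent(nodes, edges):
    """nodes: {id: name}; edges: iterable of (source, target, label)."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the group representative (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Nodes joined by an equivalence edge end up in the same group.
    for u, v, label in edges:
        if label in EQUIVALENCE_LABELS:
            parent[find(u)] = find(v)

    # Rule 1: concatenate the attributes (here, only names) of merged nodes.
    merged_names = {}
    for n in sorted(nodes):
        merged_names.setdefault(find(n), []).append(nodes[n])

    merged_edges = set()
    for u, v, label in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # Rule 2: drop edges between nodes that were merged together.
        merged_edges.add((ru, rv, label))  # Rule 3: re-point the edge at the merged node.
    return merged_names, merged_edges

# Toy input; "R_rel" is a hypothetical non-equivalence predicate.
toy_nodes = {"A": "Addison's Disease", "B": "Adrenal aplasia", "C": "hypothetical gene"}
toy_edges = [("A", "B", "N_Name"), ("A", "C", "R_rel"), ("B", "C", "R_rel")]
names, merged_edges = merge_equivalent(toy_nodes, toy_edges)
```

In the toy input, A and B collapse into one node carrying both names, and their two edges to C collapse into a single edge from the merged node.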

C. mc_GBPN Development
To group the GBPN into focused subgraphs, we clustered the GBPN into modularity classes (mc_GBPN) using the community detection function[34] available in Gephi 0.9.6.[35] Gephi is an open source tool for creating and exploring interactive network visualizations that includes functions for network analysis. Gephi uses the Louvain modularity algorithm[36] for community detection, which maximizes a modularity score for each community and is well-suited to large networks.[37] We set randomize to "On" and the resolution to 1.0. Smaller resolution values recover more communities (each containing fewer nodes), while larger resolution values recover fewer communities (each containing more nodes).[38] While larger resolution values may fail to separate distinct communities,[39] smaller resolution values may produce communities that are too small for meaningful network analysis. In the case of the GBPN, resolution values less than 1.0 left over half of the communities too small (three nodes or fewer) to analyze. We prioritized the mc_GBPN by modularity score and selected the top ten for further investigation. Specifically, we sorted the mc_GBPN with more than three nodes in descending order by modularity score. The modularity score of an mc_GBPN $c$ is defined as
$$Q_c = \frac{L_c}{m} - \gamma \left( \frac{k_c}{2m} \right)^2$$
where $L_c$ is the number of intra-community edges for the mc_GBPN, $k_c$ is the sum of degrees of the nodes in the mc_GBPN, $m$ is the total number of edges across all mc_GBPN, and $\gamma$ is the resolution parameter (in this case, 1.0).[40][41] An mc_GBPN with a higher modularity score contains more internal connections and fewer external connections, which results in a large number of "hub nodes" with high centrality scores and therefore is of interest to our investigation for drug repurposing. Thus, we sought out mc_GBPN with high modularity scores.
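As a rough illustration of the ranking step, the sketch below scores each community with the standard per-community modularity term, Q_c = L_c/m - resolution * (k_c/(2m))^2, built from the four quantities defined above, and then sorts communities with more than three nodes. It assumes an undirected edge list and a partition that has already been computed (for example, exported from Gephi); the toy graph and community names are made up.

```python
def community_modularity(edges, community, resolution=1.0):
    """Q_c = L_c / m - resolution * (k_c / (2 * m)) ** 2 (undirected graph)."""
    m = len(edges)                     # total number of edges
    members = set(community)
    intra = sum(1 for u, v in edges if u in members and v in members)    # L_c
    degree_sum = sum((u in members) + (v in members) for u, v in edges)  # k_c
    return intra / m - resolution * (degree_sum / (2 * m)) ** 2

def rank_communities(edges, communities, min_size=4, resolution=1.0):
    """Names of communities with at least `min_size` nodes, best score first."""
    eligible = [name for name, nodes in communities.items() if len(nodes) >= min_size]
    return sorted(
        eligible,
        key=lambda name: community_modularity(edges, communities[name], resolution),
        reverse=True,
    )

# Toy graph: a 4-clique (A), a 4-cycle (B) and a 4-node path (C), joined by two bridges.
toy_edges = [
    (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),  # clique A
    (5, 6), (6, 7), (7, 8), (8, 5),                  # cycle B
    (9, 10), (10, 11), (11, 12),                     # path C
    (4, 5), (8, 9),                                  # bridges
]
toy_communities = {"A": {1, 2, 3, 4}, "B": {5, 6, 7, 8}, "C": {9, 10, 11, 12}, "tiny": {13}}
ranking = rank_communities(toy_edges, toy_communities)
```

The densely connected clique ranks first and the single-node community is filtered out, mirroring the "more than three nodes" rule. Note that the direction of the resolution parameter differs between tools; at the value of 1.0 used here the conventions coincide.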
The mc_GBPN were then reviewed and assigned a class label based on parent-child relationships denoted in the NGKG and Disease Ontology.[42] For example, one mc_GBPN containing disease nodes of "Tumor Grade 1," "Intracranial Cystic Lesion," "Hemangioblastoma," "Benign Neoplasm," etc. was assigned the class label "Abnormal Brain Growths," as the majority of its nodes are associated with abnormal growths in the brain.
High-influence node identification. We calculated the degree, closeness, betweenness, eigenvector, and PageRank centrality for each node within its respective mc_GBPN. Each centrality measure captures the amount of influence a given node has over the flow of information in the mc_GBPN. Specifically, the degree centrality of a node is the number of edges connected to it.[43] Closeness centrality is based on the average shortest-path distance between a node and all other nodes in its mc_GBPN.[44] Betweenness centrality of a node is the fraction of shortest paths between any other pair of nodes in the graph that include the given node.[45][46] Eigenvector centrality measures the transitive influence of nodes; edges originating from a node with a high eigenvector centrality score contribute more to the score of the node they target than edges originating from a node with a lower eigenvector centrality score. Thus, if a node has a high eigenvector centrality score, it is connected to many other nodes with high eigenvector centrality scores.[47] We used 100 iterations in our eigenvector centrality calculations[48] (though we note that, in experiments with 50-200 iterations, the number of iterations had a negligible impact on the calculation and did not affect the ordering of nodes by eigenvector centrality score). Finally, PageRank centrality is a variant of eigenvector centrality that uses indegree rather than total degree.[49] We used Gephi's default probability setting of 0.85 and default epsilon setting of 0.001 in our PageRank centrality calculations.[50] Note that all centrality scores are non-negative, and that closeness, eigenvector, and PageRank centrality all fall within the range of zero to one.[43][44][45][48][49] In general, across all metrics, higher centrality scores indicate that a node is connected to a greater number of other nodes and/or is more centrally located within the network.
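To make the parameters above concrete, the following sketch implements four of the five measures on a small adjacency mapping in plain Python. It is a simplified stand-in for Gephi's implementations, not a reproduction of them: the closeness normalization, the rescaling of eigenvector scores so that the largest is 1, and the convergence test for PageRank are assumptions, and betweenness is omitted for brevity.

```python
from collections import deque

def degree(adj):
    """Number of edges attached to each node."""
    return {n: len(adj[n]) for n in adj}

def closeness(adj, source):
    """(reachable nodes - 1) / sum of shortest-path distances from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def eigenvector(adj, iterations=100):
    """Power iteration; scores are rescaled so the largest equals 1."""
    score = {n: 1.0 for n in adj}
    for _ in range(iterations):
        # Keeping the node's own score (A + I) avoids oscillation on bipartite graphs.
        new = {n: score[n] + sum(score[v] for v in adj[n]) for n in adj}
        top = max(new.values())
        score = {n: s / top for n, s in new.items()}
    return score

def pagerank(adj, damping=0.85, epsilon=0.001, max_iter=100):
    """Power iteration, stopped when the total change falls below `epsilon`."""
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    for _ in range(max_iter):
        dangling = sum(rank[u] for u in adj if not adj[u])
        new = {u: (1 - damping) / n + damping * dangling / n for u in adj}
        for u in adj:
            for v in adj[u]:
                new[v] += damping * rank[u] / len(adj[u])
        change = sum(abs(new[u] - rank[u]) for u in adj)
        rank = new
        if change < epsilon:
            break
    return rank

# Toy star graph: one hub linked to three leaves (undirected, so listed both ways).
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
deg = degree(star)
close = {n: closeness(star, n) for n in star}
eig = eigenvector(star)
pr = pagerank(star)
```

On the star graph, the hub has the highest score under every measure, which is the pattern used in this study to flag high-influence nodes.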
Drug repurposing candidate identification. We ranked the five most influential nodes for each top-ranked mc_GBPN by each of the five aforementioned centrality measures. We manually reviewed the prioritized mc_GBPN and selected the most promising nodes, based on their influence, as potential candidates for drug repurposing for GBM.

A. Results of the GBPN
The NGKG contains 3,819,623 nodes and 84,223,681 edges from forty-three different biomedical data resources. Of these, 4,789 nodes and 177,106 edges were extracted and applied to generate the GBPN. After optimization, the GBPN contained 1,466 nodes (538 of which contained the merged information of two or more pre-optimization nodes) and 107,423 edges with average degree 73.276, defined as the total number of edges divided by the total number of nodes.
Additional network properties can be found in Table 1.
We identified the five most influential nodes from each of the ten mc_GBPN (Table 2) by each centrality measure. The identified high-influence nodes from the mc_GBPN with an index of 0 are shown in Fig. 4. Centrality scores were normalized to a 0-1 range using the fit_transform method of the scikit-learn MinMaxScaler preprocessing class.[54] The full list of the five most influential nodes by each centrality measure within these ten mc_GBPN is in the supplemental materials.
We examined the five most influential nodes from the top ten mc_GBPN (Table 2) by their centrality scores as potential candidates for drug repurposing for GBM. We first normalized the centrality scores of the top five nodes by each centrality measure to a 0-1 range using the fit_transform method of the scikit-learn MinMaxScaler preprocessing class.[54] We then calculated a total normalized centrality score (TNCS) for each distinct node. The TNCS of a node is defined as the sum of its normalized centrality scores across degree, closeness, betweenness, eigenvector, and PageRank centralities. The TNCS of a node may range from 0 to 5, as there are five centrality measures. The nodes with the highest TNCS in each mc_GBPN listed in Table 2 are identified in Table 4. Of the nodes in Table 4, six had the highest centrality scores across all five centrality measures within their respective mc_GBPN.
The high-influence nodes in Table 4 shed light on drug repurposing. For instance, a novel COL4A1 gene variant associated with CADASIL syndrome was recently found to be associated with GBM.[60] Moreover, the NOTCH3 gene (also associated with CADASIL syndrome) is a prognostic factor that promotes glioma cell proliferation, migration, and invasion.[61] Several drugs were identified as potential candidates for GBM, although they have not been clinically administered for GBM. Riluzole, a treatment for amyotrophic lateral sclerosis (ALS), has been shown to be an effective pretreatment that sensitizes glioma to radiation therapy. It also has synergistic effects in combination with select other drugs when used to treat GBM.[63] Inhalant cannabidiol has also been shown to inhibit the progression of GBM through regulation of the tumor environment.[64] Finally, stem cell therapy has shown potential for treating neuron and glial cell damage in the brain or spinal cord that results from neurological conditions such as GBM.[65]
Interestingly, VK-0214 is currently being tested in a clinical trial as a treatment for X-linked adrenoleukodystrophy (X-ALD).[66] VK-0214 is a thyroid beta receptor agonist[67] which induces the ABCD2 gene by binding to and activating the thyroid beta receptor.[68] In ABCD1 knockout mice, overexpression of ABCD2 via thyroid receptor activation has been shown to decrease the accumulation of very long chain fatty acids (VLCFA).[68] Based on these findings, selective thyroid receptor agonists are being evaluated as a novel treatment for X-ALD, which is characterized by the accumulation of VLCFA.[68] Inhibition of fatty acid accumulation and oxidation has also been shown to reduce GBM proliferation,[69] growth,[70] and survival.[71] The fatty acid accumulation-inhibiting effect of VK-0214 may therefore be beneficial in the treatment of GBM. We will perform additional experimental validation. The full list of associations we examined between the nodes in Table 4 and GBM is in the supplemental materials.
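The TNCS calculation described in this section can be sketched as follows. The sketch uses a hand-written min-max rescaling, which for a single column of scores gives the same result as scikit-learn's default `MinMaxScaler`, so that the example has no dependencies. The node names and score values are invented purely for illustration.

```python
def min_max(scores):
    """Rescale a {node: score} mapping to the 0-1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    # A constant column maps to 0, as it does with MinMaxScaler's defaults.
    return {n: (s - lo) / span if span else 0.0 for n, s in scores.items()}

def total_normalized_centrality(centralities):
    """TNCS: sum of each node's normalized scores across all measures."""
    totals = {}
    for scores in centralities.values():
        for node, value in min_max(scores).items():
            totals[node] = totals.get(node, 0.0) + value
    return totals

# Hypothetical scores for three nodes under the five measures.
toy_centralities = {
    "degree":      {"n1": 10,   "n2": 4,    "n3": 2},
    "closeness":   {"n1": 0.9,  "n2": 0.5,  "n3": 0.1},
    "betweenness": {"n1": 0.6,  "n2": 0.3,  "n3": 0.0},
    "eigenvector": {"n1": 1.0,  "n2": 0.4,  "n3": 0.2},
    "pagerank":    {"n1": 0.5,  "n2": 0.3,  "n3": 0.2},
}
tncs = total_normalized_centrality(toy_centralities)
best_node = max(tncs, key=tncs.get)
```

A node that ranks highest under all five measures reaches the maximum TNCS of 5, and a node that ranks lowest under all five scores 0.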

Discussion
In this study, we introduced an integrative GBM-based Biomedical Profile Network (GBPN) by integrating heterogeneous types of data, including diseases, genes, drugs, etc., based on their shared concept characteristics. To construct focused subgraphs from the GBPN that support high-influence node identification for drug repurposing, we derived modularity class-based subnetworks (mc_GBPN) by leveraging community detection, a form of graph clustering. By applying multiple network analysis techniques to the mc_GBPN, we identified multiple high-influence nodes as potential drug repurposing candidates for GBM. The presented framework supports drug repurposing in a more effective manner: while integrating more data to expand the search space, we organized the data at a more manageable scale, taking into account their relevance from the network view.

A. Observations and Findings
We applied a rare disease cluster consisting of 92 GBM-related diseases to construct the GBPN by exploring data from the NGKG. We optimized the GBPN for integrative rare disease profile generation by merging associated diseases, genes, treatments, etc. into single nodes based on their shared concept names or external references. This approach allowed us to explore a large scale of GBM-relevant data in a concentrated and scalable form, which effectively supports drug repurposing with lower computational burden, as demonstrated in the Results section. As shown in Fig. 3, some level of inference was introduced during the optimization. When we merged Lafora disease, EPM2A, EPM2B and Metformin, we declared new connections between Metformin and EPM2A and EPM2B based on inference, since no connections among them exist in the NGKG. Since PME2 shares different degrees of association (different numbers of edges) with EPM2A, EPM2B and Metformin, we inferred that these four concepts are potentially associated with each other, leading to node merging. The findings of Bisulli et al.[72] support the inference introduced in this particular case. In a future study, we will attach relevant references gathered from the previously developed scientific annotation knowledge graph[73] to the merged nodes as scientific evidence enrichment.
After GBPN optimization, we generated focused subgraphs of the GBPN by performing community detection as a graph clustering algorithm, resulting in a network partitioned into modularity classes (mc_GBPN). The mc_GBPN, as a set of subgraphs (i.e., clusters) derived from the GBPN, were ranked by their modularity scores, which allowed us to programmatically promote the top-prioritized clusters for further investigation and demote those with lower priority. Our experiments showed that this strategy did not lose any important information compared to the GBPN; instead, more high-influence nodes were exposed in the top-ranked clusters for easy extraction. For instance, nine distinct top high-influence nodes derived from the GBPN appear in the top five most influential node lists from their respective mc_GBPN. We calculated the five most influential nodes by each centrality measure in the GBPN and found that seven of the ten most influential nodes (see Table 3) were included in the resulting list. The remaining nodes (i.e., Spastic Paraplegia 10, Rett syndrome, Myoclonus Dystonia) were present exclusively in the lists of high-influence nodes derived from the mc_GBPN. The complete lists of the five most influential nodes by each centrality measure in the GBPN and in each modularity class of the mc_GBPN are in the supplemental materials.

B. Limitations of This Study
Due to a lack of standardization across the biomedical resources from which the NGKG sources data, integrating information from different resources with a high level of precision proved to be a significant challenge. While we optimized the GBPN by merging nodes with closely associated information into a single node, we were not able to fully automate this process, because the data were not represented in a standard form and the NGKG does not contain predefined data models; instead, we used a rule-based semi-automatic approach. A more sophisticated harmonization process will be proposed for obtaining the data used to build the GBPN. For instance, rare diseases from different resources will be harmonized and standardized using GARD IDs, genes using HGNC IDs, etc. Additionally, during the step of high-influence node identification, we manually searched for scientific evidence to support our findings. In a future study, we will programmatically query the rare disease-based scientific annotation knowledge graph[73] for evidence collection. We will also adopt and extend the network optimization strategy for datasets with well-defined underlying data models, which will allow us to generate highly condensed graphs by merging nodes and relationships by concept type.

C. Future Directions
We presented a preliminary analysis of GBM-related data that allowed us to identify potential candidates for drug repurposing to treat the condition. Although scientific evidence has been identified to support our initial findings, experimental validation is necessary to determine whether these candidates would be effective in treating GBM patients in practice. Clinical observations of efficacy for those candidates when administered to patients with GBM, derived from Electronic Medical Records (EMR), can serve as another layer of validation. We propose to mine clinical data from the National COVID Cohort Collaborative (N3C) and the Biomedical Translational Research Information System (BTRIS) at NIH for clinical evidence identification. Our pipeline is modularized as shown in Fig. 1; thus, we propose to extend the use of each module. We will expand to other disease areas by starting with other disease clusters and generating corresponding networks analogous to the GBPN. We also propose to explore other clustering algorithms besides community detection for focused subgraph generation. Beyond the drug repurposing application we started with, we believe the mc_GBPN, as a collection of rare disease profiles providing a complete picture of direct and indirect associations with the target disease, can be a valuable resource for understanding the etiology of rare diseases.

Conclusion
In this study we presented a preliminary network analysis-based approach to drug repurposing for GBM. We successfully identified several potential candidates via centrality and community detection calculations, and substantiated the connections between these candidates and GBM. We reinforced the findings of emerging studies into some treatments and also identified a new candidate, VK-0214, that could be potentially repurposed to treat GBM. These findings can guide future experimental validation, which could lead to new, more effective treatments that extend the lifespan of patients living with GBM.

Consent for publication
None declared

Availability of data and materials
The full documentation, code, and other supplemental data are available on GitHub (https://github.com/ncats/drug_rep/tree/main/Glioblastoma_Subgraph).

Competing interests
None declared

Funding
None declared

Authors' contributions
EM: performed the experiments and wrote the manuscript. JS: generated the GBM disease cluster and edited the manuscript. EAM: participated in the discussion and edited the manuscript. QZ: conceived, designed, and supervised this study and wrote the manuscript. All authors reviewed and approved the manuscript.

Figure 2
Familial Alzheimer Disease-based subgraph derived from the NGKG. Orange nodes denote diseases, blue nodes denote genes, and purple nodes denote drugs.
Familial Alzheimer Disease is highlighted in yellow.

Figure 3
A node containing Lafora disease is merged with nodes connected to it by an edge label of "I_CODE": two Lafora disease related genes, the EPM2A gene, the EPM2B gene, and Metformin, a treatment that has been used for Lafora disease. The gray node is one of the merged nodes in the GBPN.

Figure 4
High-influence nodes identified by degree, closeness, betweenness, eigenvector, and PageRank centrality in mc_GBPN with an index of 0. The nodes displayed have a strong relationship to white matter-related conditions (as does GBM). Note that several nodes have high centrality scores across multiple measures; these nodes have a higher potential for drug repurposing.
