The management of healthcare data is constantly transforming to assist the healthcare industry in increasingly productive ways1,2,3. In traditional electronic medical record (EMR) systems, data are organized and managed in relational database systems, where associations between individual data records are not stored explicitly4. To illustrate, tables are linked to each other through foreign keys held in a column, so the relation is defined between the data tables rather than between the data points. In contrast, graph databases link the data records directly and organize data features more effectively by placing the emphasis on the relationships between data points5,6. In these databases, entities and links are used to increase space efficiency and to shorten query times for large datasets compared with relational mapping7. Graph databases offer three major advantages. First, a graph allows the full picture to be visualized, which delivers simplified or alternative perspectives on otherwise complex problems. Second, a graph enables a deeper understanding of abstract relationships. Third, a graph facilitates the understanding of the information flow, which can in turn be used to refine the details.
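The contrast between the two data models can be made concrete with a small sketch. The tables, column names, and diagnosis codes below are hypothetical toy values, not taken from any EMR system; the sketch only illustrates how a foreign-key lookup differs from following explicit links between records.

```python
# Relational layout: rows live in separate tables and are related only
# through a foreign-key column (patient_id in the diagnoses table).
patients = [{"patient_id": 1, "name": "P1"}, {"patient_id": 2, "name": "P2"}]
diagnoses = [
    {"diagnosis_id": 10, "patient_id": 1, "code": "I50"},
    {"diagnosis_id": 11, "patient_id": 1, "code": "E11"},
    {"diagnosis_id": 12, "patient_id": 2, "code": "I50"},
]


def codes_relational(patient_id):
    # A join-style lookup scans the diagnosis table and matches the key.
    return sorted(row["code"] for row in diagnoses if row["patient_id"] == patient_id)


def build_adjacency(diagnosis_rows):
    # Graph layout: every record becomes a node and every relationship
    # becomes an explicit, undirected edge between two nodes.
    adjacency = {}
    for row in diagnosis_rows:
        patient = ("patient", row["patient_id"])
        diagnosis = ("diagnosis", row["code"])
        adjacency.setdefault(patient, set()).add(diagnosis)
        adjacency.setdefault(diagnosis, set()).add(patient)
    return adjacency


adjacency = build_adjacency(diagnoses)


def codes_graph(patient_id):
    # One hop: read the neighbors of the patient node directly.
    return sorted(code for _, code in adjacency.get(("patient", patient_id), set()))


def patients_sharing_diagnosis(patient_id):
    # Two hops: patient -> diagnosis -> other patients.
    start = ("patient", patient_id)
    found = set()
    for diagnosis in adjacency.get(start, set()):
        for neighbor in adjacency[diagnosis]:
            if neighbor != start:
                found.add(neighbor[1])
    return sorted(found)
```

In the graph layout, a question such as "which patients share a diagnosis with this patient" becomes a short traversal over stored links instead of a self-join on the diagnosis table.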
A substantial number of studies have been performed on graph representation learning in biomedicine8–13. However, EMR data have four characteristics that make them difficult to convert into a network: (1) EMRs comprise many different types of datasets, which create separate entity types. (2) The datasets behind different node types differ in size and in the types of their property sets. (3) Multiple types of edges are needed to connect these entity types. (4) Multiple datasets are inherently linked with each other by unique IDs. For these reasons, methods that combine the multi-modal, multi-attributed, and multi-relational aspects of EMR in a single graph representation are still in development and do not yet fully convey what the datasets require of a graph structure.
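The four characteristics listed above can be illustrated with a minimal data structure. This is a sketch under stated assumptions, not the graph model proposed in this study: the `HeteroGraph` class, the three toy tables, the node types, and the edge types `HAS_RESULT` and `PRESCRIBED` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroGraph:
    # (node_type, node_id) -> property dict; property sets may differ per type
    nodes: dict = field(default_factory=dict)
    # list of (source_key, edge_type, target_key, property dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_type, node_id, **properties):
        self.nodes[(node_type, node_id)] = properties

    def add_edge(self, source, edge_type, target, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before they are linked")
        self.edges.append((source, edge_type, target, properties))

    def neighbors(self, key, edge_type=None):
        """Targets reachable from `key`, optionally restricted to one edge type."""
        return [
            target
            for source, etype, target, _ in self.edges
            if source == key and (edge_type is None or etype == edge_type)
        ]


# Hypothetical source tables that share a patient ID (characteristic 4).
demographics = [{"patient_id": "P001", "age": 67, "sex": "F"}]
lab_results = [{"patient_id": "P001", "test": "creatinine", "value": 1.4, "unit": "mg/dL"}]
prescriptions = [{"patient_id": "P001", "drug": "furosemide", "dose_mg": 40}]

graph = HeteroGraph()
# Each dataset yields its own node type (1) with its own property set (2).
for row in demographics:
    graph.add_node("patient", row["patient_id"], age=row["age"], sex=row["sex"])
for row in lab_results:
    graph.add_node("lab_test", row["test"], unit=row["unit"])
    # Each pairing of entity types gets its own attributed edge type (3).
    graph.add_edge(("patient", row["patient_id"]), "HAS_RESULT",
                   ("lab_test", row["test"]), value=row["value"])
for row in prescriptions:
    graph.add_node("medication", row["drug"])
    graph.add_edge(("patient", row["patient_id"]), "PRESCRIBED",
                   ("medication", row["drug"]), dose_mg=row["dose_mg"])
```

Keying nodes by a `(node_type, node_id)` pair keeps the entity types separate while still letting the shared patient ID tie the datasets together.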
Recently, a considerable number of studies in medicine have focused on building knowledge graphs or graph representations14–18. For example, Soulakis D Nicholas presented a bipartite network database constructed from the electronic health records of heart failure patients19. The study performed network analysis on the relationships between patients and providers by calculating network statistics and showed how the results could guide care coordination for patients. Moreover, Xiu Xiaolei proposed a method for building a highly granular semantic knowledge graph on rare diseases from EMR20. The study emphasized the importance of semi-automated schema building for creating more granular semantic relationships and increasing concept correlations. In an alternative approach, LinFeng Li used the probability values of a novel quadruplet structure as an embedding method for medical knowledge graphs21. Although this study importantly showed that the prediction task performs better when the entity types are indicated, the evaluation dataset contained limited relationship types because labeling was manual and labor-intensive. In contrast, we created an automated graph-building process that labels diverse link types.
Moreover, to utilize the valuable assets in medical datasets, applications of graph neural networks are gaining popularity in personalized health and predictive medicine. Liu Z proposed a heterogeneous similarity graph neural network that analyzes the temporal structure of health records by forming multiple subgraphs as input for prediction22. Similarly, Tong Wu built ME2Vec, a hierarchical bipartite graph model, to predict patients' clinical outcomes from the interactions of calculated entities23. In contrast, we handled all hierarchical sequences during the pre-processing step, which enables implementation regardless of data types and sizes. Parisot S presented a graph model for population diagnosis in which individual feature information is learned in binary classification tasks using graph convolutional networks24. Chaoyang and Haohui Lu further extended the potential for applying attributes to the nodes, showing that classification is performed at higher precision when distinct node attributes are added to each partition of a bipartite network25–27. Nonetheless, the data source was specific to claims data, which are generally simpler than EMR data, thus limiting the data available for reliable estimation in predictions. Additionally, several studies have tackled the integration of heterogeneous structured graphs combined with attribute aspects28–32. Yet, the methods described in these studies do not fit well with the properties of the EMR in network integration.
We propose a methodology for integrating heterogeneous medical entities and relationships to predict a patient’s outcome from a graph constructed solely from EMR data. Our main contributions are:
- We propose a novel approach to constructing a heterogeneous bipartite graph model from EMR with attributes on nodes and edges. Effective visualization, in conjunction with a patient-centered graph method, allows the latent associations among the population to be fully investigated and analyzed.
- We apply downstream link prediction tasks based on the HinSAGE algorithm to demonstrate an efficient disease prediction model. This framework shows that the performance obtained from EMR is sufficient to predict the outcome of an event among patients, in support of overall healthcare.
Furthermore, we propose a method to build a graph database that integrates EMR data. The EMR-embedded graph model was then applied to network learning using the HinSAGE algorithm. In this study, we illustrate the structure of the graph database and show query results to demonstrate the efficiency of the model. This research can inform physicians’ decision-making by predicting disease occurrence, as supported by the performance of our implementation.
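To give a rough intuition for link prediction on a bipartite patient–disease graph, the sketch below performs one round of mean neighbor aggregation and scores a candidate link with a dot product passed through a sigmoid. It is a simplified illustration of the neighborhood-aggregation idea behind GraphSAGE-style methods, not the HinSAGE implementation used in this study: there are no learned weights, so the scores are not meaningful predictions, and the node names, feature vectors, and edges are hypothetical.

```python
import math

# Hypothetical two-dimensional features for two patients and two diseases.
features = {
    "p1": [1.0, 0.0],
    "p2": [0.0, 1.0],
    "d1": [1.0, 1.0],
    "d2": [0.0, 2.0],
}
# Hypothetical observed patient-disease links.
edges = [("p1", "d1"), ("p2", "d1"), ("p2", "d2")]

# Undirected neighbor lists built from the observed links.
neighbors = {node: [] for node in features}
for patient, disease in edges:
    neighbors[patient].append(disease)
    neighbors[disease].append(patient)


def mean_vector(vectors, size):
    # Element-wise mean; an isolated node receives a zero vector.
    if not vectors:
        return [0.0] * size
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def aggregate(node):
    # Concatenate the node's own features with the mean of its neighbors'
    # features. A trained model would apply learned weights per node type.
    own = features[node]
    pooled = mean_vector([features[n] for n in neighbors[node]], len(own))
    return own + pooled


def link_score(patient, disease):
    # Dot product of the two aggregated vectors, squashed into (0, 1).
    hp, hd = aggregate(patient), aggregate(disease)
    raw = sum(a * b for a, b in zip(hp, hd))
    return 1.0 / (1.0 + math.exp(-raw))
```

In a trained model the aggregation would be followed by learned, type-specific transformations, and the scoring function would be fitted on held-out positive and sampled negative links.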