Background: An electronic medical record (EMR) is an electronic record of a patient’s health information, and EMR analysis is increasingly important in medical informatics. An EMR is multi-modal data containing both medical text and medical concepts. Recent studies embed either the medical text or the medical concepts, and therefore analyze an EMR only partially. No existing embedding represents an entire EMR, including both the medical text and the medical concepts. Moreover, most medical concept embeddings are trained on concept sequences drawn from EMRs, so they do not reflect the ontology of the medical concepts, which encodes medical semantics. This study therefore proposes a novel medical ontology representation with medical text for understanding an entire EMR.
Methods: First, we generated the International Classification of Diseases (ICD)-10 graph from an ICD-10 graph dataset that we created, and initialized each node with a pre-trained medical text embedding. Next, we trained the ICD-10 nodes on code sequences sampled from the ontology, using an encoder-decoder-based graph embedding. Lastly, we trained the ICD-10 nodes on the relation between the ICD codes and the text.
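As an illustration of the sequence-sampling step in Methods, the following is a minimal sketch that samples code sequences from a small ICD-10-style hierarchy. It assumes unbiased random walks over an undirected parent-child graph; the abstract does not state the sampling strategy, so that choice, the toy `PARENT` fragment, and all function names here are hypothetical and are not the authors' dataset or implementation.

```python
import random

# Toy fragment of an ICD-10-style hierarchy, child -> parent.
# Illustrative only; this is not the ICD-10 graph dataset from the study.
PARENT = {
    "A00": "A00-A09",
    "A01": "A00-A09",
    "A00.0": "A00",
    "A00.1": "A00",
    "A01.0": "A01",
}


def build_adjacency(parent):
    """Turn child -> parent links into an undirected adjacency map."""
    adj = {}
    for child, par in parent.items():
        adj.setdefault(child, set()).add(par)
        adj.setdefault(par, set()).add(child)
    # Sort neighbours so sampling is reproducible for a fixed seed.
    return {node: sorted(neigh) for node, neigh in adj.items()}


def sample_walk(adj, start, length, rng):
    """Sample one code sequence of `length` nodes by an unbiased random walk."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def sample_sequences(adj, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node (assumed sampling scheme)."""
    rng = random.Random(seed)
    return [
        sample_walk(adj, node, length, rng)
        for node in sorted(adj)
        for _ in range(walks_per_node)
    ]


ADJ = build_adjacency(PARENT)
SEQUENCES = sample_sequences(ADJ, walks_per_node=2, length=4)
```

Sequences of this kind could then serve as input to a sequence model such as an encoder-decoder, with node vectors initialized from a text embedding; how the study wires these pieces together is not specified in the abstract.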
Results: For quantitative evaluation, we created a dataset of ICD-10 code similarity pairs and compared the similarities that each model assigned to those pairs. Our method achieved an average cosine similarity of 0.87, the highest among the compared models. We also compared the models on similarity pairs from open datasets. Our Pearson correlation coefficients ranged from about 0.461 to 0.463, close to the 0.464 of the best model. Both comparisons demonstrated that our medical ontology embedding performed well while maintaining the characteristics of the medical text embedding.
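The two metrics reported in Results can be computed as in the minimal sketch below. The embeddings, code pairs, and reference scores are made-up placeholders, and the assumption that the Pearson correlation is taken between model similarities and reference similarity scores is ours; the abstract does not describe the evaluation protocol in detail.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 2-d embeddings keyed by ICD-10 code (placeholders, not results).
EMB = {"A00": [1.0, 0.0], "A01": [0.8, 0.6], "I10": [0.0, 1.0]}

# Hypothetical (code, code, reference similarity) triples.
PAIRS = [("A00", "A01", 0.9), ("A01", "I10", 0.5), ("A00", "I10", 0.1)]

model_scores = [cosine_similarity(EMB[a], EMB[b]) for a, b, _ in PAIRS]
reference_scores = [score for _, _, score in PAIRS]

# Average cosine similarity over the pairs, and agreement with the references.
average_cosine = sum(model_scores) / len(model_scores)
correlation = pearson(model_scores, reference_scores)
```

The average cosine similarity is meaningful when every pair in the set is expected to be similar, whereas the Pearson coefficient measures how closely the model's ranking of pairs tracks graded reference scores.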
Conclusions: In this study, we proposed MedionRep, a medical ontology representation method that uses graph embedding with medical text. MedionRep is a new medical concept embedding method that incorporates medical text and reflects the ontology of the medical concepts, including their medical semantics. Our method could broaden the analysis of EMR data, which includes several types of medical data.