ICD2Vec: Mathematical representation of diseases

The International Classification of Diseases (ICD) codes represent the global standard for reporting disease conditions. The current ICD codes are hierarchically structured and only connote partial relationships among diseases. Therefore, it is important to represent the ICD codes as mathematical vectors to indicate the complex relationships across diseases. Here, we proposed a framework denoted “ICD2Vec” for providing mathematical representations of diseases by encoding corresponding information. First, we presented the arithmetic and semantic relationships between diseases by mapping composite vectors for symptoms or diseases to the most similar ICD codes. Second, we confirmed the validity of ICD2Vec by comparing the biological relationships and cosine similarities among the vectorized ICD codes. Third, we proposed a new risk score derived from ICD2Vec, and demonstrated its potential clinical utility for coronary artery disease, type 2 diabetes, dementia, and liver cancer, based on a large prospective cohort from the UK and large electronic medical records from a medical centre in South Korea. In summary, ICD2Vec is applicable for diverse quantitative analyses using ICD codes.

Nonetheless, there are no publicly available generalized vectors for providing mathematical representations of ICD codes that are universally applicable in biomedical research. Hence, an objective representation of ICD codes as mathematical vectors is necessary.
In this work, we proposed ICD2Vec, a framework for converting ICD codes into meaningful vectors (Fig. 1). In our study, we used word embeddings generated from FastText to construct a vector corresponding to each ICD code, such that each vector encoded meaningful information through the semantic relationships between the ICD codes. We demonstrated that the semantic relationships were preserved between ICD codes by finding an ICD code acting as an arithmetical combination of multiple symptoms or ICD codes. In addition, to confirm the validity of the vectorized ICD codes, we compared the semantic relationships with the biological relationships between diseases generated from other sources 18. Finally, to illustrate the clinical utility of ICD2Vec, we proposed an individual risk score derived from ICD2Vec for predicting the risk of subsequent diseases based on a history of diagnosed diseases of an individual relative to a baseline. The score was applied to epidemiological studies using two datasets (from the UK and South Korea).

Fig. 1 | a, … 19,154 ICD codes from https://www.icd10data.com. After natural language processing, every word was converted to a corresponding 300-dimensional vector. Finally, we calculated an average for the word vectors to indicate an ICD code. b, We validated the ICD2Vec with analogical reasoning using arithmetic operations and observed significant associations in the cosine similarities between pairs of ICD codes and their biological relationships. We suggested a new "ICD2Vec risk score" (IRIS) derived from ICD2Vec. The IRIS is a value representing the cosine similarities between the target disease and diagnosed diseases for a person in a specific time period. The individual risk for the target disease was measured using IRIS according to the diagnosis history. We validated clinical utility of IRIS as a risk score with two datasets from the UK and South Korea. Illustrations are created with BioRender.com.

Results
We embedded the information regarding diseases from the ICD codes into vectors using the pretrained FastText model; this model was trained to learn the relationships between adjacent vectors from a large corpus and subword information.

Arithmetic and semantic relationships between ICD2Vec
We empirically investigated the relationships between a symptom and a disease, or between diseases, using the analogical reasoning task shown by Mikolov et al. 19,20. The task was performed by finding and evaluating the ICD code closest to the composite vector derived from the given vectors for the symptoms or diseases (Table 1). For example, the ICD code most similar to nearsightedness appears to be strabismus (ICD-10: H50); nearsightedness is one of the symptoms of strabismus 21,22. A specific symptom could be represented by arithmetic operations, such as addition and subtraction. For instance, itchy skin is denoted by the addition of the vector for "skin" and the vector for "itching," and the composite vector derived from these two vectors matches with pruritus (ICD-10: L29). Using the same approach, it is possible to combine diseases arithmetically (e.g., I25 + I10) to find clinical manifestations with similar medical descriptions (e.g., I51). We also created a vector for coronavirus disease 2019 (COVID-19), as summarized by the Mayo Clinic 23. Interestingly, the disease most similar to COVID-19 is A99, an unspecified viral hemorrhagic fever.

Table 1 | Examples of analogical reasoning with ICD2Vec. The top three similar International Classification of Diseases (ICD)-10 codes were found based on the cosine similarity with a given vector for a symptom or disease. Semantic compositionality through vector operations such as addition and subtraction was also possible when representing an ICD-10 code. Python codes are provided to check for the most similar ICD-10 code for the given symptoms or diseases (see Code availability).

[Table 1 columns: Symptoms | Top three similar ICD-10 (definition of the disease) (CS)]

The R00-R99 block of ICD-10 covers unclassified symptoms, signs, and abnormal clinical findings. We found the most similar ICD-10 code for an additive combination of up to three randomly selected R codes, using cosine similarity. This random selection was repeated 1,000 times. Table 2 lists the top 10 results according to the cosine similarity. For example, a combination of R05 (cough) and R06 (abnormalities of breathing) is indicative of J459 (asthma, unspecified) with a cosine similarity of 0.971. A single symptom, R59 (enlarged lymph nodes), is found to be a symptom of C77 (secondary and unspecified malignant neoplasm of lymph nodes).
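The analogical reasoning described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released code: the three-dimensional vectors and the helper names (`add_vectors`, `most_similar`) are invented for the example, whereas real ICD2Vec vectors have 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def add_vectors(*vectors):
    """Element-wise sum of vectors (the additive operation)."""
    return [sum(components) for components in zip(*vectors)]


def most_similar(query, code_vectors, top_k=3):
    """Rank ICD codes by cosine similarity to the query vector."""
    scored = [(code, cosine_similarity(query, vec))
              for code, vec in code_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors invented for illustration only (hypothetical values).
toy_codes = {
    "L29": [0.9, 0.8, 0.1],
    "H50": [0.1, 0.2, 0.9],
    "J45": [0.5, 0.1, 0.4],
}
skin = [1.0, 0.0, 0.1]
itching = [0.0, 1.0, 0.0]

composite = add_vectors(skin, itching)  # "skin" + "itching"
ranking = most_similar(composite, toy_codes, top_k=3)
for code, similarity in ranking:
    print(code, round(similarity, 3))
```

With these toy values the composite vector ranks L29 first, mirroring the itchy-skin example; subtraction can be handled the same way by negating a vector before summing.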

Comparing ICD2Vec pairs with disease-disease relationships
ICD2Vec was constructed based on the clinical words, sentences, and descriptions related to each disease. We sought to confirm the rationale and validity of ICD2Vec by comparing the biological relationships between diseases. We identified the significant associations between the cosine similarities obtained in our disease-to-disease pairs and three biological indicators, i.e., the relative risk for comorbidities in observational clinical data, separation in a disease-disease network, and genetic correlations based on genome-wide association studies.
Relative risk of comorbidity. The relative risk was quantified as the ratio of co-occurrence of the two diseases to the incidence of each disease. We only considered statistically significant relative risks. A total of 127,579 selected disease pairs are presented in Fig. 2a. The correlation of the top 10% of the cosine similarity group is higher than that of the total set of pairs (0.296 vs. 0.117). There is a tendency for a higher relative risk between two diseases with a higher cosine similarity. The mean difference in the relative risk between disease pairs with the top 10% of cosine similarity and the other group is statistically significant (P = 2.6 × 10^-118, Fig. 2d).

Separation in a disease interactome network.
A previous study by Menche et al. developed a network-based separation of two diseases as a measure of how close the diseases were in a disease-disease interactome network 18. In other words, the smaller the value, the more similar the diseases. We considered a total of 33,332 pairs with separation values under a q-value of 0.05. The correlation in the top 10% cosine similarity group is more strongly negative than that in the total set of pairs (-0.240 vs. -0.142) (Fig. 2b). The mean separation of the top 10% group is significantly lower than that of the other group (P = 7.6 × 10^-146, Fig. 2e).

Genetic correlation based on genome-wide association studies.
A genetic correlation is a biological indicator that can be used to quantify disease-disease relationships. We compared the ICD2Vec cosine similarities with the genetic correlations of 1,531 pairs, as obtained using linkage disequilibrium score regression (LDSC) based on statistics of diseases from a genome-wide association study (GWAS). The genetic correlations between the top 10% cosine similarity group and the remaining groups are statistically different; the genetic correlations in the top 10% cosine similarity group are significantly higher than those in the other group (P = 3.0 × 10^-5, Fig. 2f). In addition, there is a positive correlation between the genetic correlation and cosine similarity, although it is not statistically significant (Fig. 2c).

Fig. 2 | Comparison between cosine similarities of ICD2Vec pairs and biological relationships between disease pairs.
We defined three factors to measure biological relationships between disease pairs: relative risk (a, d), separation (b, e), and genetic correlation (c, f). The factors were compared with the cosine similarity between ICD2Vec pairs. The red dots and bars represent the average of the pairs with the top 10% of cosine similarity, and blue dots and bars represent that of the other 90% of the pairs. We measured the correlations between the cosine similarities from the total and top 10% of ICD2Vec pairs and the biological relationships from corresponding disease pairs. The error bar indicates the standard error of the mean.
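The comparison summarized in Fig. 2 can be sketched as follows. This is an illustrative Python sketch only (the statistical analyses in this study were run in R): the data are invented, and the helper names `split_top_fraction` and `pearson_r` are hypothetical. It splits disease pairs into the top 10% by cosine similarity and the remaining 90%, then compares the mean of a biological indicator between the two groups and computes Pearson's correlation over all pairs.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_top_fraction(pairs, fraction=0.10):
    """Split (cosine_similarity, indicator) pairs into the top fraction
    by cosine similarity and the rest."""
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    return ranked[:n_top], ranked[n_top:]


def mean(values):
    return sum(values) / len(values)


# Invented toy data: (cosine similarity, relative risk) for 10 disease pairs.
pairs = [(0.95, 8.0), (0.90, 3.0), (0.80, 4.0), (0.70, 2.5), (0.60, 3.5),
         (0.50, 2.0), (0.40, 2.2), (0.30, 1.5), (0.20, 1.8), (0.10, 1.0)]

top, rest = split_top_fraction(pairs, fraction=0.10)
r_all = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
mean_top = mean([p[1] for p in top])
mean_rest = mean([p[1] for p in rest])
print(len(top), len(rest), round(r_all, 3), mean_top, round(mean_rest, 3))
```

A significance test on the difference between the two group means (a t-test in this study) would follow the split; it is omitted here to keep the sketch dependency-free.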

Use-cases: epidemiological studies using ICD2Vec risk score
We examined the association between an ICD2Vec risk score (IRIS) and the risks for four diseases: coronary artery disease (CAD), type 2 diabetes (T2D), dementia, and liver cancer. We designed prospective and retrospective cohort studies for each disease using data collected in the UK Biobank and with data extracted from the Clinical Data Warehouse (CDW) of the Samsung Medical Center (SMC) in South Korea. A description of these analyses is illustrated in Supplementary Fig. 1. The incidence of CAD, T2D, dementia, and liver cancer was defined using ICD-10 and other treatment codes (Supplementary Table 1). We collected the ICD-10 codes of patients who had been diagnosed with any disease at least once for a fixed duration before the start of the follow-up as the baseline. The detailed inclusion and exclusion criteria for selecting the patients in the two cohorts are described in Supplementary Information. The patients were grouped into deciles or 50-tiles according to the IRIS calculated for each patient.
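The grouping step described above (deciles or 50-tiles of the risk score) can be sketched as follows. This is a minimal illustration with invented data; the function names are hypothetical, and ties are broken by input order, which is an assumption rather than something the study specifies.

```python
def assign_quantile_groups(scores, n_groups=10):
    """Return a group index per score (0 = lowest scores), splitting the
    ranked scores into n_groups bins of nearly equal size."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, idx in enumerate(order):
        groups[idx] = rank * n_groups // len(scores)
    return groups


def incidence_by_group(groups, events, n_groups=10):
    """Fraction of patients with an event (1) in each group."""
    counts = [0] * n_groups
    cases = [0] * n_groups
    for group, event in zip(groups, events):
        counts[group] += 1
        cases[group] += event
    return [cases[g] / counts[g] if counts[g] else float("nan")
            for g in range(n_groups)]


# Invented toy cohort: 20 patients with increasing risk scores;
# events (1 = diagnosed during follow-up) cluster at high scores.
scores = [i / 20 for i in range(20)]
events = [0] * 14 + [1, 0, 1, 1, 1, 1]

groups = assign_quantile_groups(scores, n_groups=10)
incidence = incidence_by_group(groups, events, n_groups=10)
print(incidence)
```

Setting `n_groups=50` gives the 50-tile binning; the per-group incidence is the quantity compared between the highest and lowest risk score groups.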
We estimated the hazard ratios (HRs) for the IRIS values using a Cox proportional hazard model. The model was adjusted for age and sex. Additionally, we estimated HRs for baselines of 3, 5, 7, and 9 years in the UK Biobank and for a baseline of 3 years in SMC. The statistical significance was set at P < 0.05. Table 3 shows the demographic characteristics and highest HRs for each target disease with the two independent datasets. For example, in the analysis for CAD using 9 years for the baseline and the top-three average of the IRIS values in the UK Biobank, 2,848 patients are diagnosed with CAD in the top 10% of the IRISCAD values, whereas 1,170 patients are diagnosed in the bottom 10% of the IRISCAD values. The effect of the IRISCAD is significantly associated with the risk of CAD (adjusted HR, 2.08; 95% confidence interval [CI], 1.94-2.23). The highest HR for each disease was selected among eight HRs obtained from the two methods (Ravg3 and Rmax) for four baseline years (Supplementary Tables 2-5). Patients with a higher risk score tend to have a higher hazard for CAD (Supplementary Fig. 2). In the SMC, with 3 years as the baseline and the top-three average of the IRISCAD values, 1,532 patients are diagnosed with CAD in the top 10% of the IRISCAD values, whereas 437 patients are diagnosed in the bottom 10% of IRISCAD values (Table 6 and Supplementary Fig. 3). Moreover, a higher incidence of CAD is observed in groups with high scores binned according to the 50-tiles of the IRISCAD (Fig. 3). For instance, the incidence of CAD in the highest risk score group is approximately four times higher than that in the lowest risk score group (19.2% vs. 5.0%).

* Hazard ratio was estimated using a Cox proportional hazard model adjusted for sex and age.
** All p-values for hazard ratios were estimated under 2 × 10^-16.
Abbreviations: Samsung Medical Center (SMC); coronary artery disease (CAD); type 2 diabetes (T2D); confidence interval (CI); ICD2Vec risk score (IRIS).

IRISCAD for a common disease and comorbid conditions
The IRIS, calculated based on each individual's disease history, not only identifies the risk groups for a single target disease but also provides insights into comorbid conditions.
To identify the comorbid conditions in a risk group based on the disease-wide risk score, we divided the study population at a baseline of 9 years into decile risk groups, as derived from the top three averages based on the onset of CAD. We used the 10 most common comorbidities for CAD and three other cardiovascular-related diseases as reported by the United States Department of Health & Human Services for comparison with actual prevalence 24. The 13 comorbidities for CAD generally show a higher prevalence among the high-risk CAD groups (Fig. 4a). Of the 13 comorbidities, the prevalence of hypertension has the strongest association with the high-CAD-risk group: 0.20% and 50.41% for the bottom 10% and top 10% CAD risk groups, respectively. In contrast, dementia has the lowest prevalence irrespective of the CAD risk group, with values of 0.01% and 0.37% for the bottom 10% and top 10% CAD risk groups, respectively. In addition, the prevalence of other comorbidities similarly increases for the top 10% CAD risk group relative to the bottom 10% group, including hyperlipidaemia (0.13-12.25%), T2D (0.66-8.39%), anaemia (0.03-8.36%), atrial fibrillation (0.10-7.25%), chronic obstructive pulmonary disease (0.01-5.10%), cataracts (2.03-4.15%), depression (0.36-3.70%), stroke (< 0.01-3.19%), heart failure (< 0.01-2.34%), rheumatoid arthritis (0.09-2.05%), and chronic kidney disease (< 0.01-1.94%) (Fig. 4b). The 20 most prevalent diseases in the highest CAD risk group are described in Supplementary Table 7. Either with or without events, participants are most frequently diagnosed with hypertension (ICD-10: I10).
We further investigated the risk of CAD among subgroups as divided by the IRISCAD and 10-year atherosclerotic cardiovascular disease (ASCVD) risk. The 10-year risk for ASCVD was categorized according to the definition by the American College of Cardiology Foundation 25. A high IRISCAD was associated with CAD risk, even within the same 10-year ASCVD risk group (Table 4). The cumulative CAD incidence of all other groups is higher than that of the 1st decile of IRISCAD values with a low 10-year ASCVD risk (Supplementary Fig. 4). In particular, the top 10% of IRISCAD values and high 10-year ASCVD risk groups show substantially increased CAD risks (adjusted HR, 4.65; 95% CI, 3.84-5.64).

Discussion
ICD-10 is a global standardized hierarchical code for representing diseases. In addition to its medical purposes, such as in clinical diagnosis, ICD-10 is essential in biomedical research because it is used to define disease variables 8. In conventional analyses, such qualitative data are represented as dummy variables or one-hot encoding variables randomly distributed in the latent space; however, this representation loses the intrinsic relationships with other related variables. Accordingly, we proposed ICD codes that were vectorized to contain meaningful measurable quantitative values and suggested a mathematical representation of diseases using ICD-10, called ICD2Vec. We showed that ICD2Vec contained the semantic and arithmetic relationships among diseases and presented medically conceptual meanings (Tables 1, 2). Moreover, the similarity of the vectors was significantly correlated with known biological relationships (Fig. 2). Furthermore, we found a significant association between the individual risk score derived from ICD2Vec and the actual disease incidence based on prospective cohort analyses of CAD, T2D, dementia, and liver cancer across two independent datasets from the UK and South Korea (Supplementary Fig. 1 and Table 3), as discussed above. The higher risk score group for CAD (high IRISCAD carriers) had more CAD-related comorbid conditions (Fig. 4 and Table 4).
Previous encoding techniques have used converted vectors to provide information 19,26,27. These techniques have also been applied to a variety of concepts in medicine and biology 17,28. In some studies, mathematical vectors embedding medical histories were used to predict the future statuses of patients 29,30. Applications to unstructured medical records provided novel insights for quantifying the relationships between diseases in multi-dimensional vector spaces 29. However, these algorithms have limitations; they need to be retrained for a specific dataset, and the resulting vectors are not consistent. In other research, vectors embedded with medical knowledge were used for a question-answering task 16,31,32. These algorithms showed improved performance on clinical natural language processing (NLP) tasks. However, they were not used to convert ICD codes to vectors, although it is reasonable and useful to use vectorized ICD codes to analyse medical conditions. To the best of our knowledge, only one study has been conducted to elucidate the relationships between embedded vectors and clinical definitions 32. That study compared its own vectors with medical concepts rated by physicians. Its vectors were validated as embedded vectors for readmission prediction based on a large cohort set. We quantitatively demonstrated that the vectors of ICD2Vec were meaningfully distributed in the latent space through a statistical investigation of the relationship between disease pairs based on three biologically established indicators (comorbidity as a relative risk, separation, and genetic correlation). We also confirmed the utility of ICD2Vec across the association study using the UK Biobank and the SMC cohort.
In addition, we defined individual risk scores based on ICD2Vec and performed use-case studies to investigate whether the scores for past diseases were effective in classifying risk groups. As a result, the risk of a target disease was predicted using these meaningful vectors. We observed a significant association between the risk of a target disease and the IRIS values for all other diseases during a specific baseline period (Fig. 3). In our experiments, the disease incidence of patients in the high-risk group was higher than that in the low-risk group (Table 3, Supplementary Tables 1-4, Fig. 3, and Supplementary Figs. 2-3). Specifically, we identified ICD codes related to the 10 most common comorbidities for CAD, which were previously reported as diseases with a significant association with CAD 24 (Fig. 5). Patients in the high-risk group were more likely to have underlying diseases relevant to the target disease over several years before the manifestation of the target disease. We confirmed that patients with a higher 10-year ASCVD risk and higher IRISCAD risk group were likely to develop CAD (Table 4). Notably, the IRIS index derived from ICD2Vec is intuitive and simple to calculate.
ICD2Vec is publicly available and applicable to various fields. Several examples of possible applications have been suggested. First, the IRIS values of all other diseases corresponding to a target disease can be computed and used as a risk predictor or covariate. Typically, researchers define arbitrarily selected relevant diseases as covariates or define an index for scoring whether participants were diagnosed with other diseases using the Charlson Comorbidity Index. Although conventional approaches adjust the main effect restrictively with only diseases selected and known by researchers, the IRIS is computed from all diseases and can be used to analyse a disease with unknown covariates. Second, a multimodal analysis requires efficient vectors for the features of a disease. ICD2Vec vectors can be much smaller than one-hot encoded ICD-10 codes: the 300 dimensions of ICD2Vec are more than 60 times fewer than the original dimensions (19,154) of the ICD codes. Third, using pretrained vectors, it is possible to increase the performance of a predictive model. Owing to the pretrained knowledge embedded in ICD2Vec, the parameters of a model can be optimized better with ICD2Vec than with the randomly defined variables used by one-hot encoding. Lastly, ICD2Vec might be helpful in mapping ICD standards to other standard terminologies, such as "Systematized Nomenclature of Medicine - Clinical Terms" and "Medical Subject Headings." By calculating the cosine similarity between codes of different standards with an algorithm such as ICD2Vec, code mapping can be performed in the order of the highest similarity and relevance.
Our study had several limitations. First, analogical reasoning with ICD2Vec does not represent a clinical diagnosis. Clinical diagnosis involves systematic clinical reasoning with patient information such as disease history, physical examination, and laboratory results 33-35. For example, the key diagnostic factor of pulmonary oedema is excessive fluid, as revealed from medical lung imaging tests 36. Similar ICD codes are not the results of a diagnosis but have similar sources. Thus, in this study, we presented the top three similar ICD codes to illustrate the arithmetic and semantic relationships among the ICD2Vec codes. Second, the current ICD2Vec was based on FastText, which was trained on a general corpus rather than a medical corpus. Certain words need to be redefined to be more appropriate for medical use. For example, the "down" in Down syndrome does not mean "being or moving lower" and should be treated as part of the term "Down syndrome" itself. Future work is required to update the vectors using a pretrained NLP model with a medical corpus. Third, converting ICD codes to vectors largely depends on the descriptions of the ICD codes and medical knowledge. Therefore, ICD2Vec needs to be updated as medical knowledge accumulates. Fourth, advanced algorithms such as BioBERT or Generative Pre-trained Transformer-3 can be used to generate vectors 16,37. A comparison of different algorithms is required to improve the performance of ICD2Vec.
ICD2Vec is a framework for converting qualitatively measured ICD codes into quantitative vectors containing the semantic relationships among diseases. The publicly available ICD2Vec can be used in diverse research and medical practices.

Dataset
We extracted the disease definitions of ICD-10-CM from a publicly available web database. ICD-10-CM is a clinical modification of ICD-10 used in the United States that additionally extends some medical conditions or illnesses 38. The disease names from the ICD codes were used to encode the information of each code. Additionally, the clinical information and approximate synonyms of the ICD codes were incorporated into their vector representations. The clinical information was obtained from official medical websites, such as the National Institutes of Health of the United States and a summary in MedlinePlus provided by the Centers for Disease Control and Prevention. Incorporating this information into the vector representation enabled us to obtain a well-defined and accurate representation of the disease codes (Fig. 1). We used 321,593 words from 19,154 ICD codes and 1,166,960 words with clinical information and approximate synonyms.
Our embedded vectors were evaluated for generalization using a large prospective cohort from the UK Biobank and massive electronic health record (EHR) data from the CDW of SMC. First, the UK Biobank provides the medical histories of approximately 500,000 participants from interviews and questionnaires, as well as biological samples for genotyping and informed consent for long-term medical follow-up through links to national health registries. This research was conducted using the UK Biobank Resource under Application Number 33002. Second, the EHRs of over 200,000 patients who visited the SMC between 2010 and 2012 were collected in a hospital information system, i.e., the CDW of SMC. We could access and retrieve deidentified records, including demographic information and the diagnosis histories of the participants. Ethical approval for the study protocol was granted by the Institutional Review Board (No: 2021-07-016) at SMC.

Pre-processing for ICD2Vec
NLP is a subfield of computer science and a branch of artificial intelligence that helps computers to read, understand, interpret, and manipulate human languages. The word embeddings in NLP are a type of word representation in a high-dimensional space, such that words with similar meanings have similar representations and are positioned close to each other in the vector space. Word embeddings help the machine capture the context of human language. They are used to efficiently solve NLP problems owing to their ability to capture the semantic relationships among words.
We performed pre-processing using the following procedures: tokenization, stop-word removal, and stemming. Tokenization is the most common pre-processing task in NLP. It involves splitting a corpus of text into individual words called tokens. A stop-word is a commonly used word, for example, "the," "a," "an," or "in," which does not add much meaning to the sentence and can be ignored without affecting the meaning of the sentence. Stop-words can be removed by storing a list of words considered to be stop-words. Here, this was accomplished using the Natural Language Toolkit (NLTK), a library in Python that includes a list of defined stop-words for the English language. Stemming is a text normalization technique in NLP in which words are reduced to their word stems. A stemming algorithm reduces the word "likes," "liked," "likely," or "liking" to its base word, i.e., the word "like." It is desirable for removing unnecessary noise, as often the base words and their inflected words mean the same thing. We performed stemming using the "Snowball Stemmer" from the NLTK.
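The three pre-processing steps can be sketched as below. To keep the example runnable without external downloads, it deliberately swaps in a tiny hand-written stop-word set and a crude suffix stripper in place of the NLTK stop-word list and Snowball stemmer used in this study; both stand-ins are assumptions for illustration and will not reproduce Snowball's output in general.

```python
import re

# Tiny illustrative stop-word list (NOT the NLTK English list).
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "or", "to", "with"}

# Crude stand-in for a real stemmer: strip one common suffix.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop-words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


tokens = preprocess("Itching of the skin and coughing in adults")
print(tokens)
```

In a real pipeline the stand-ins would be replaced by the NLTK components named above; the order of the three steps stays the same.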
After pre-processing the data, the clinical information and approximate synonyms from the ICD-10 website were combined with the ICD code information. This provided more defined descriptions of the disease codes and helped us to obtain more accurate vectors corresponding to each ICD code. In general, ICD codes have a parent-child hierarchical structure from the top level (e.g., A00) to the bottom level (e.g., A001). We added the clinical information and approximate synonyms of the ICD codes of the bottom level to the ICD codes of the top level, such as A00-Z99, to maintain the hierarchical structures.

ICD2Vec development
We used a pre-trained FastText model trained over a Wikipedia corpus to obtain vectors for the words in our prepared data. This pre-trained FastText model contained vectors with 300 dimensions obtained using the skip-gram model with the default parameters. The skip-gram model can be considered as a model for predicting the context, given a word. The input is the current word, whereas the output is the preceding words and the words following the current word. If the current word in the sentence is represented as $w_i$, the input to the model is $w_i$, whereas the outputs of the model are $w_{i-2}$, $w_{i-1}$, $w_{i+1}$, and $w_{i+2}$. After preparing the data, each word representing every ICD code was passed through the pre-trained FastText model. The FastText model provided a unique vector corresponding to each word.
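The input-output structure of the skip-gram model can be made concrete with a short sketch that enumerates (current word, context word) training pairs. It assumes a context window of two words on each side, matching the illustration above; the function name and the example tokens are hypothetical, and the sketch does not train any model.

```python
def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs: each word paired with the words up to
    `window` positions before and after it."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["acute", "viral", "infection", "of", "liver"]
pairs = skipgram_pairs(tokens, window=2)

# Context words predicted from the center word "infection".
center_pairs = [p for p in pairs if p[0] == "infection"]
print(center_pairs)
```

Words near the start or end of the sentence simply have fewer context words, which is why the first and last tokens contribute only two pairs each.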
A vector for each ICD code was obtained by averaging the vectors for the words in its description. For example, if the ICD code contained $n$ words $w_1, w_2, \ldots, w_n$, we obtained vectors for each of these words, denoted by $v_1, v_2, \ldots, v_n$, after passing each of the words through the FastText model. Then, the vector for the ICD code was calculated by averaging the vectors for the words:

$$v_{\mathrm{ICD}} = \frac{1}{n} \sum_{i=1}^{n} v_i$$
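The averaging step can be sketched as follows. The word vectors here are invented three-dimensional values standing in for the 300-dimensional FastText vectors, and the helper names are hypothetical. Skipping tokens that lack a vector is a simplification made for the example; FastText itself can compose vectors for unseen words from subword information.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def icd_vector(description_tokens, word_vectors):
    """Average the word vectors of the tokens describing one ICD code."""
    vectors = [word_vectors[t] for t in description_tokens if t in word_vectors]
    if not vectors:
        raise ValueError("no token in the description has a word vector")
    return average_vectors(vectors)


# Invented toy word vectors (hypothetical values, 3 dimensions).
toy_word_vectors = {
    "itch": [1.0, 0.0, 2.0],
    "skin": [3.0, 2.0, 0.0],
}

vec = icd_vector(["itch", "skin"], toy_word_vectors)
print(vec)
```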

IRIS
The most advantageous feature of ICD2Vec is that the disease definition, as a high-dimensional concept, can be easily expressed as an arithmetic vector. Because the vectors have meaningful relationships with each other, it is possible to adapt the vectors as a clinical variable for epidemiological or association studies. We suggested a disease-wide risk score for quantitatively measuring the risk of a target disease using ICD2Vec.
The disease-wide risk score ($R$) for a diagnosed disease, calculated using the cosine similarity ($\mathrm{CS}$) between the disease ($D$) and target disease ($T$), was defined as $R = \mathrm{CS}(D, T)$. We defined the IRIS using the top three averages or using the maximum of the individual disease-wide risk scores.
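A minimal sketch of the IRIS calculation is given below: one cosine similarity per diagnosed disease against the target disease, followed by either the average of the three highest values or the maximum. The two-dimensional vectors are invented, the function names are hypothetical, and averaging over fewer than three scores when a patient has fewer than three diagnoses is an assumption made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def disease_risk_scores(diagnosed_codes, target_code, code_vectors):
    """Disease-wide risk score of each diagnosed code against the target."""
    target = code_vectors[target_code]
    return [cosine_similarity(code_vectors[c], target) for c in diagnosed_codes]


def iris_avg3(scores):
    """Average of the (up to) three highest disease-wide risk scores."""
    top = sorted(scores, reverse=True)[:3]
    return sum(top) / len(top)


def iris_max(scores):
    """Maximum disease-wide risk score."""
    return max(scores)


# Invented toy vectors (hypothetical values, 2 dimensions).
toy_vectors = {
    "I25": [1.0, 0.0],   # target disease
    "I10": [3.0, 4.0],
    "E11": [4.0, 3.0],
    "H25": [0.0, 1.0],
    "E78": [2.0, 0.0],
}

history = ["I10", "E11", "H25", "E78"]
scores = disease_risk_scores(history, "I25", toy_vectors)
print(scores, iris_avg3(scores), iris_max(scores))
```

Only diagnoses recorded during the baseline period would be passed in as `history`; the resulting score is then used to rank patients into risk groups.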

Definitions of biological factors
We considered the relative risk, network-based separation, and genetic correlation as quantitative biological measurements of the relationships between diseases.
Relative risk. The relative risk can be used as a measure of disease comorbidity. We analysed the co-occurrences of diseases within a patient. The relative risk (RR) was defined following Menche et al. 18. We calculated the risk of a disease relative to other diseases as the ratio of co-occurrence of the two diseases in the medical records of the UK Biobank.
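The exact formula is not reproduced in the text above, so the following sketch assumes a commonly used form of the comorbidity relative risk: the observed number of patients with both diseases divided by the number expected if the two diseases occurred independently, i.e., RR = C_AB × N / (P_A × P_B), where N is the number of patients, P_A and P_B are the numbers of patients with each disease, and C_AB is the number with both. The function name and the toy records are hypothetical.

```python
def relative_risk(patient_codes, code_a, code_b):
    """Observed co-occurrence of two codes divided by the co-occurrence
    expected under independence (assumed definition, see lead-in)."""
    n = len(patient_codes)
    p_a = sum(1 for codes in patient_codes if code_a in codes)
    p_b = sum(1 for codes in patient_codes if code_b in codes)
    c_ab = sum(1 for codes in patient_codes
               if code_a in codes and code_b in codes)
    if p_a == 0 or p_b == 0:
        raise ValueError("each disease must occur in at least one patient")
    return c_ab * n / (p_a * p_b)


# Invented toy records: one set of ICD-10 codes per patient.
patients = [
    {"I10", "I25"}, {"I10", "I25"}, {"I10"}, {"E11"},
    {"E11", "I25"}, {"J45"}, {"J45"}, {"H25"},
]

rr = relative_risk(patients, "I10", "I25")
print(round(rr, 3))
```

Under this form, a value above 1 means the two diseases co-occur more often than expected by chance, and a value of 0 means they never co-occur in the records.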

Separation.
A network-based separation was suggested that compared the network-based shortest distance between two diseases to the shortest distances within each disease 18. This measurement indicates the degree to which two network-based disease modules are separated. A zero or negative value indicates that the two diseases may share pathologically similar clinical characteristics. The separation $s_{AB}$ between diseases $A$ and $B$ is defined as follows:

$$s_{AB} = \langle d_{AB} \rangle - \frac{\langle d_{AA} \rangle + \langle d_{BB} \rangle}{2}$$

where $\langle d_{AB} \rangle$ is the shortest distance between diseases A and B, and $\langle d_{AA} \rangle$ and $\langle d_{BB} \rangle$ are the shortest distances within diseases A and B, respectively.
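The separation measure can be sketched on a toy network as the mean between-module distance minus the average of the two within-module mean distances. The sketch assumes a nearest-node convention for each mean: every node contributes its shortest-path distance to the closest node of the other module (or to the closest other node of its own module). That convention, the function names, and the six-node path graph are assumptions for illustration.

```python
from collections import deque


def shortest_distances(graph, source):
    """Breadth-first-search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def nearest_distances(graph, sources, targets, exclude_self=False):
    """For each source, the distance to its closest reachable target."""
    result = []
    for s in sources:
        dist = shortest_distances(graph, s)
        reachable = [dist[t] for t in targets
                     if t in dist and not (exclude_self and t == s)]
        if reachable:
            result.append(min(reachable))
    return result


def separation(graph, module_a, module_b):
    """Mean A-B distance minus the average of the within-module means.
    Each module needs at least two connected nodes."""
    def mean(values):
        return sum(values) / len(values)

    d_aa = nearest_distances(graph, module_a, module_a, exclude_self=True)
    d_bb = nearest_distances(graph, module_b, module_b, exclude_self=True)
    d_ab = (nearest_distances(graph, module_a, module_b)
            + nearest_distances(graph, module_b, module_a))
    return mean(d_ab) - (mean(d_aa) + mean(d_bb)) / 2


# Toy interactome: a path 1-2-3-4-5-6 (adjacency lists).
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}

s_far = separation(graph, {1, 2}, {5, 6})      # well-separated modules
s_overlap = separation(graph, {2, 3}, {3, 4})  # modules sharing node 3
print(s_far, s_overlap)
```

The well-separated pair gives a positive value and the overlapping pair a negative one, consistent with the interpretation that zero or negative separation points to related diseases.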
Genetic correlation. To examine the biological rationale of ICD2Vec, we compared the cosine similarities of ICD2Vec and genetic correlation values for each pair of diseases. The genetic correlation is an indicator that quantifies the genetic relationship between two traits, based on summary statistics from a GWAS. The genetic correlation ranges from -1 to +1, where -1 indicates a strong negative association and +1 indicates a strong positive association. Comparisons between cosine similarities of ICD2Vec and genetic correlations for pairs of diseases can provide useful etiological insights.
We used a total of 1,585 ICD-10-based UK Biobank GWAS summary statistics datasets from https://www.nealelab.is/uk-biobank. For each disease pair, we performed an LDSC on the pair of corresponding GWAS summary statistics 39. We considered disease pairs with positive correlation values and significance (P < 0.05).

Statistical analysis
We performed Student's t-tests and Pearson's correlation tests to compare the three biological factors with the cosine similarity. We also estimated the hazard ratio of the target disease for the individual risk scores using Cox proportional hazard regression. All statistical analyses were performed using R statistical software (version 3.6.1; R Foundation for Statistical Computing, Vienna, Austria). Statistical significance was declared at P < 0.05.

Data availability
Our ICD2Vec embedding dataset is available at https://github.com/YeongChanLee/ICD2Vec. We crawled the information on ICD-10 from https://www.icd10data.com. The UK Biobank GWAS summary statistics are available at http://www.nealelab.is/uk-biobank. The genotype and phenotype data can be obtained from the UK Biobank (https://www.ukbiobank.ac.uk) upon project application. The EHR data that support the findings of this study are available from the CDW of SMC, but restrictions apply to the availability of these data, which were used under approval for the current study and so are not publicly available.