Data Cube-Based Text Mining and Prediction of Hypertension-Related Disease-Gene-Drug Associations

Background Predisposition to hypertension may be associated with numerous gene polymorphisms and systemic disorders. Large-scale text mining of the biomedical literature is a flexible and essential tool in the search for innovative drugs and treatments, for example by investigating and predicting associations between bio-entities.
Results We propose a general approach, based on dictionaries and data cubes, for extracting and predicting hypertension-related disease-gene-drug associations from biomedical abstracts. After data preprocessing, we constructed the 0-D vertex cube, which we then filtered to construct three 1-D cubes consisting of 252 diseases, 185 genes, and 141 drugs. By applying association rules to quantify the disease-gene-drug associations, we found 235 associations between 79 diseases and 71 genes (AUC 84.1%), 196 associations between 43 diseases and 102 drugs (AUC 85.8%), and 160 associations between 31 genes and 106 drugs (AUC 83.6%). Using the bottom-up computation algorithm, we established three 2-D cubes and one 3-D disease-gene-drug cube, which revealed 591 associations among 90 diseases, 82 genes, and 145 drugs. Based on this 3-D cube, we obtained 262 predicted bio-entity association pairs: 57 disease-drug, 84 disease-gene, and 121 gene-drug pairs.
Conclusions We have implemented and validated a data cube-based text mining approach for identifying and ranking hypertension-related disease-gene-drug associations. Our results open new avenues in the search for potential drugs to treat hypertension.


Introduction
Text mining, including named entity recognition (NER), has become an essential tool for improving knowledge reuse from large biomedical literature databases such as PubMed, which in turn creates new opportunities and challenges for exploring the causes of diseases and their potential treatment drugs [,].
The rich biological and medical genomic databases provide a further opportunity to discover disease-gene-drug associations. At present, automatically annotating biological entities such as diseases, genes, and drugs, together with other important information in the biomedical literature, is the first step in computational approaches to predicting associations between bio-entities, and it is useful for improving the scalability of biocuration services [,]. Identifying semantic associations between biological or medical entities and extracting structural associations from the rich literature databases requires a dedicated set of computational and analytical techniques []. These approaches can reveal some of the basic biological connections between entities, such as gene-gene associations [], drug-drug associations [], protein-drug associations [], and drug-disease associations [].
Analytical tools are essential to uncovering causal disease-gene-drug associations, and in this era of big data, data cubes and related analytical techniques can be useful tools. Data cubes can store multiple data dimensions (e.g., diseases, genes, and drugs) and measures (e.g., strength of association) in a multidimensional way [,]. Online analytical processing (OLAP) operations, such as drill-down and roll-up, perform multidimensional analysis of medical and genomic data and provide the capability for complex calculations, trend analysis, and sophisticated data modeling. Studies have also identified PDI as a potential therapeutic target in the treatment of atherosclerosis, thrombosis, and hypertension []. While each hypertension-associated mutation taken alone may have only a slight impact, their combination can cause severe changes in hypertension symptoms [,]. Identifying hypertension-related disease-gene-drug associations will increase our understanding of the genetic pathogenesis and drug targets of hypertension, which will contribute to the development of novel prevention and treatment strategies for hypertension in the future.
In this study, we probed the data in a multidimensional space based on a data cube structure and used association rules to determine the associations between bio-entities. We used hypertension as a model disease for the construction of a disease-gene-drug data cube to analyze the potential associations between diseases, genes, and drugs. The results provide valuable information for the development of innovative diagnosis and treatment tools for hypertension based on candidate genes.

Materials and Methods

Dictionary construction
Dictionary-based methods rely, as the name suggests, on matching a dictionary of names against text. The quality of the dictionary is therefore very important: the best-performing NER methods according to blind assessments rely on carefully curated dictionaries that eliminate synonyms giving rise to many false positives []. Moreover, dictionary-based methods have the crucial advantage of being able to normalize names. A high-quality, comprehensive dictionary of disease, gene, and drug names is thus a prerequisite for mining disease-gene-drug associations from the bio-literature. We used existing databases [] to compile standardized disease, gene, and drug dictionaries, which we named "DiseaseDictionary" (with entries for 26,813 human diseases), "GeneDictionary" (with entries for 40,172 human genes), and "DrugDictionary" (with entries for 1,763 drugs), respectively. Each entry included the standard name, aliases, synonyms, standard code, etc. of the disease, gene, or drug. ICD 10 is designed for clinical coding and billing purposes; its structure and disease names are poorly suited for bio-literature mining.
To improve recall, we automatically generated variants of the disease names. Although the terms disease, disorder, and syndrome have separate definitions, we found that they are used inconsistently in the literature when part of disease names; for instance, Alzheimer's disease is occasionally referred to as Alzheimer's disorder or Alzheimer's syndrome. We also removed words in parentheses and brackets occurring at the end of disease names, unless this would cause ambiguity.
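The variant generation described above can be illustrated with a minimal Python sketch. The function names and the regular expression are assumptions made for illustration, not the implementation used in this study, and the ambiguity check mentioned in the text is not modeled.

```python
import re

# Head nouns that the literature uses interchangeably in disease names.
HEAD_NOUNS = ("disease", "disorder", "syndrome")


def strip_trailing_bracket(name: str) -> str:
    """Remove one parenthesised or bracketed group at the end of a name."""
    return re.sub(r"\s*[\(\[][^\)\]]*[\)\]]\s*$", "", name).strip()


def name_variants(name: str) -> set:
    """Return the name plus variants with the head noun swapped.

    Hypothetical helper: it does not test whether removing the trailing
    bracket would make two dictionary entries ambiguous.
    """
    base = strip_trailing_bracket(name)
    variants = {name, base}
    for noun in HEAD_NOUNS:
        pattern = re.compile(rf"\b{noun}\b", flags=re.IGNORECASE)
        if pattern.search(base):
            for other in HEAD_NOUNS:
                variants.add(pattern.sub(other, base))
    variants.discard("")
    return variants


alzheimer = name_variants("Alzheimer's disease")
```

Here `alzheimer` holds the three forms mentioned in the text; names that contain none of the three head nouns are returned unchanged apart from the bracket stripping.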

Integration of Corpus
We searched the hypertension-related literature in PubMed for the most recent 30 years using the following search strategy: "((hypertension) AND (("1988/3/16"[Date - Publication] : "2018/3/16"[Date - Publication])))". The search returned a total of 334,155 records with abstracts, which we downloaded and saved in text format.
Because literature abstracts are unstructured data, we needed to standardize them before proceeding with our analysis. We used the normalization method, one of the most widely used approaches for the extension of dictionaries with synonyms []. In this study, all entity sets were standardized based on the names of genes (drugs) extracted from the literature.

Corpus Preprocessing
Here, we describe how we preprocessed the corpus in three text processing steps: sentence splitting, tokenization, and recognition. The input of this step is the text corpus used for the task of association classification, and the output is the set of unigram features that are further used in the feature set. The preprocessing activities are:

1) Sentence splitting: the sentence splitter splits the text into sentences, as required by the taggers. It uses a dictionary of abbreviations to distinguish sentence-ending full stops from other token types, and treats the full stop "(.)" as the boundary between one sentence and the next. For example, "prevention, diagnosis and treatment" is a single sentence. In total, 2,864,308 sentences were obtained.

3) Recognition: to match a document against the dictionary, we labeled each word using the BIOE scheme (B-begin, I-inside, O-outside, E-end) [], so that the text became a standard corpus for entity recognition and association extraction. For instance, for the statement "However, in patients with type 2 diabetes and hypertension, multiple studies demonstrate the benefit of gene ACEI or ARB in preventing or delaying the onset of nephropathy.", the classification results are shown in Fig. 1. For each name recognized in the text, we normalize it to the corresponding unique identifier and, in the case of diseases, backtrack the term to the root of the ontology through the "is_a" relationship so that the identifiers of all parent terms are also assigned.
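The recognition step can be illustrated with a small dictionary-matching sketch in Python. The greedy longest-match strategy, the toy dictionary, and the choice to give single-token entities only a B- tag are assumptions of this sketch and are not taken from the study.

```python
def tag_bioe(tokens, dictionary):
    """Greedy longest-match tagging of `tokens` against `dictionary`.

    `dictionary` maps a lowercase tuple of tokens to an entity type.
    Single-token entities receive a B- tag only (assumption of this sketch).
    """
    max_len = max((len(key) for key in dictionary), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match_len, label = 0, None
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in dictionary:
                match_len, label = n, dictionary[key]
                break
        if match_len == 0:
            i += 1
            continue
        tags[i] = f"B-{label}"
        for j in range(i + 1, i + match_len - 1):
            tags[j] = f"I-{label}"
        if match_len > 1:
            tags[i + match_len - 1] = f"E-{label}"
        i += match_len
    return tags


# Hypothetical toy dictionary; the real dictionaries are far larger.
toy_dictionary = {
    ("type", "2", "diabetes"): "DISEASE",
    ("hypertension",): "DISEASE",
}
tokens = "in patients with type 2 diabetes and hypertension".split()
tags = tag_bioe(tokens, toy_dictionary)
```

Tokenization is reduced to whitespace splitting here; a real pipeline would also need to separate punctuation before matching.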

Extraction and sorting of bio-entities association
After corpus preprocessing, we extracted and scored associations between diseases, genes, and drugs using the association-rule measures "Support", "Confidence", and "Lift", which take into account co-occurrences at the level of individual sentences. For bio-entities A and B (such as a disease, gene, or drug), "Support" measures the frequency of a bio-entity against the total corpus:

$$\mathrm{Support}(A) = \frac{a}{N}$$

where a is the number of occurrences and N is the total number of sentences / associations in the corpus / network. "Confidence" measures the intensity of the association, given here in its standard association-rule form:

$$\mathrm{Confidence}(A \Rightarrow B) = \frac{\mathrm{Support}(A \cup B)}{\mathrm{Support}(A)}$$

"Lift" assesses whether a forecasting model is effective and reflects the importance of set A to set B:

$$\mathrm{Lift}(A \Rightarrow B) = \frac{\mathrm{Confidence}(A \Rightarrow B)}{\mathrm{Support}(B)}$$

If the value of Lift is 1, A and B are not associated; if Lift < 1, A and B are negatively associated, i.e., the presence of one makes the other less likely; if Lift > 1, the higher the value, the stronger the association between A and B []. Considering that related entities might be mentioned only occasionally or in comparison to one another in the abstract, we set the minimum threshold of Lift to 3. We chose this threshold by analogy with a critical value of 3 standard deviations in a standard normal distribution, which corresponds to a confidence level above 99.8%; i.e., Lift > 3 is considered a strong association.
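As a worked illustration of these measures, the following Python sketch scores entity pairs from sentence-level co-occurrences using the standard association-rule definitions of support, confidence, and lift. The function name and the toy sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def association_scores(sentences):
    """Score every co-occurring entity pair.

    `sentences` is a list of sets of entity identifiers, one set per sentence.
    Returns {(A, B): (support, confidence, lift)} for the rule A -> B,
    where (A, B) is ordered alphabetically.
    """
    n = len(sentences)
    single = Counter()
    pair = Counter()
    for entities in sentences:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))
    scores = {}
    for (a, b), count in pair.items():
        support_ab = count / n               # joint support of A and B
        confidence = count / single[a]       # Support(A, B) / Support(A)
        lift = confidence / (single[b] / n)  # Confidence / Support(B)
        scores[(a, b)] = (support_ab, confidence, lift)
    return scores


# Toy corpus of four sentences (made up for illustration).
toy_sentences = [
    {"hypertension", "ACE"},
    {"hypertension"},
    {"ACE", "hypertension"},
    {"stroke"},
]
scores = association_scores(toy_sentences)
support, confidence, lift = scores[("ACE", "hypertension")]
```

Under the threshold used in this study, only pairs with a lift above 3 would be retained; the toy pair here falls below that value.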

Data cube
Data cubes are defined by facts (the data elements being measured) and dimensions (the perspectives from which data is analyzed). Each dimension has an associated table, known as a dimension table. In this study, we used the biomedical documents downloaded from PubMed as a data warehouse, in which the biomedical entities (disease, gene, and drug) were the dimensions. To explore the associations between different dimensions, we used the values of support and lift as measures of fact.
A total of eight cubes (or groups) comprise all possible combinations of disease, gene, and drug, including the empty set. For instance, (disease, gene, drug) represents the disease-gene-drug cube, and (disease, gene) represents the disease-gene cube (Fig. 2). The vertex cube (or 0-D cube), often described as "all," is the most generalized (least specific), while the basic cube is the least generalized (most specific).
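The eight groupings can be enumerated directly. The short Python sketch below lists every combination of the three dimensions, from the 0-D vertex cube to the 3-D basic cube; the function name is hypothetical.

```python
from itertools import combinations

DIMENSIONS = ("disease", "gene", "drug")


def cuboids(dimensions):
    """List every group-by combination, ordered from the 0-D cube upward."""
    return [
        combo
        for k in range(len(dimensions) + 1)
        for combo in combinations(dimensions, k)
    ]


lattice = cuboids(DIMENSIONS)
# lattice[0] is the empty grouping ("all"); lattice[-1] is the basic cube.
```

With three dimensions the list contains one 0-D cube, three 1-D cubes, three 2-D cubes, and one 3-D cube.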

Bottom-up computation (BUC) algorithm
The hypertension data cube in this study is an iceberg cube, so the BUC algorithm is well suited to building the network model of associations. The details of the BUC algorithm have been described elsewhere. Briefly, the algorithm drills down from the top, i.e., from higher-level, less detailed units to lower-level, more detailed units. In this study, we used Lift as the measure of association for partitioning. During the recursion, frequent combinations are sent to the output. When all attribute values have been partitioned in the last dimension, the algorithm recurses back to the previous level and processes the next attribute value, and so on. The algorithm eventually returns a full association network.
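The recursion described above can be sketched in Python as follows. This is a simplified illustration of the classic BUC scheme that prunes partitions by a minimum count; the study uses Lift as the partitioning measure, so the threshold, the function name, and the toy rows (including the drug names) are assumptions of this sketch.

```python
def buc(rows, dims, min_count, prefix=None, out=None):
    """Recursively output group-by cells whose count reaches `min_count`.

    rows: list of dicts mapping dimension name -> value
    dims: dimensions still available for partitioning, in order
    Returns {cell: count}, where a cell is a sorted tuple of (dim, value).
    """
    prefix = prefix or {}
    out = {} if out is None else out
    out[tuple(sorted(prefix.items()))] = len(rows)
    for i, dim in enumerate(dims):
        partitions = {}
        for row in rows:
            partitions.setdefault(row[dim], []).append(row)
        for value, part in partitions.items():
            if len(part) >= min_count:  # iceberg pruning
                buc(part, dims[i + 1:], min_count, {**prefix, dim: value}, out)
    return out


# Toy rows, one per sentence-level co-occurrence (made up for illustration).
toy_rows = [
    {"disease": "hypertension", "gene": "ACE", "drug": "captopril"},
    {"disease": "hypertension", "gene": "ACE", "drug": "captopril"},
    {"disease": "hypertension", "gene": "AGT", "drug": "losartan"},
    {"disease": "stroke", "gene": "ACE", "drug": "aspirin"},
]
cells = buc(toy_rows, ("disease", "gene", "drug"), min_count=2)
```

Because a partition that fails the threshold is never expanded, none of its more detailed descendants are computed, which is what makes the approach suitable for iceberg cubes.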

ROC curve
The receiver operating characteristic (ROC) curve, which assesses the accuracy of a binary classification algorithm, is widely used to evaluate the performance of medical diagnostic tests []. The accuracy of a test is measured by the area under the ROC curve (AUC). In this study, we used R v3.3 to construct the network, create the ROC curve, and evaluate the algorithm performance. R is a computer language and environment for statistical computing and graphics.

Results
Few studies have addressed the associations between hypertension and genes. Among them, ACE was the most important [], encoding the enzyme that converts angiotensin I into physiologically active angiotensin II; it may be related to myocardial infarction, SARS resistance, renal diabetes, Alzheimer's disease, and other diseases. The TNF gene is expressed in leukocytes and macrophages; it is involved in protein nuclear entry, positive regulation of protein amino acid phosphorylation, negative regulation of L-glutamate transport, glucose metabolism, etc., and has been linked to dementia, migraine, asthma susceptibility, septicemia susceptibility, and other diseases.
Glucose was the most frequently studied drug related to hypertension. Its accession number in DrugBank is DB09341. It is stored mainly as starch in plants and as glycogen in animals. It supports various metabolic processes at the cell level and is usually given by injection as a nutritional supplement for metabolic disorders or improper regulation of the blood glucose level. Glucose is included in the World Health Organization (WHO) list of essential drugs.
Table 1 provides the support and lift scores between diseases and genes in the 2-D cube. Using the association rules, 235 significant associations between the 79 classes of hypertension-related clinical manifestations / symptoms and the 71 candidate genes were extracted from the 2-D cube (see Fig. 3).
Table 2 provides the support and lift scores between diseases and drugs in the 2-D cube (Fig. 3 (b), attachment 2-D.xlsx). We found that the central nodes in this 2-D cube were the disease nodes of Hypertension (Support = 0.383) and stroke (Support = 0.122) and the drug node of Glucose (Support = 0.066), which were associated with 75, 24, and 14 different types of entities, respectively. There are three groups of one-against-one associations between diseases and drugs: Pain NOS-Morphine, Celecoxib-AIDS-Diclofenac, and Dorzolamide-Glaucoma-Timolol.

2-D Gene-drug cube
Table 3 provides the support and lift scores between genes and drugs in the 2-D cube.

Prediction of disease-gene-drug associations
In this study, we used the ABC discovery method [] to predict hypertension candidate diseases, genes, and drugs, and to mine new associations between hypertension-related diseases, genes, and drugs.
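The idea behind ABC discovery is that if A is linked to B and B is linked to C, but A and C have never been reported together, then A-C is a candidate association. A minimal Python sketch is given below; the function name and the toy pairs are hypothetical, and entity types and association scores are not modeled.

```python
def abc_candidates(known_pairs):
    """Propose A-C pairs linked through a shared B but never seen together.

    known_pairs: iterable of 2-tuples of entity identifiers (undirected).
    Returns {frozenset({A, C}): set of bridging B entities}.
    """
    neighbours = {}
    for x, y in known_pairs:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    candidates = {}
    for b, linked in neighbours.items():
        for a in linked:
            for c in linked:
                # a < c visits each unordered pair once; skip known pairs.
                if a < c and c not in neighbours[a]:
                    candidates.setdefault(frozenset((a, c)), set()).add(b)
    return candidates


# Toy input: two known links that share one bridging entity.
toy_pairs = [("Depression", "uric acid"), ("SLC2A9", "uric acid")]
predicted = abc_candidates(toy_pairs)
```

In this toy input, the two known links share "uric acid", so the sketch proposes a single candidate pair bridged by that entity.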
Similarly, some unproven bio-entity association pairs were also obtained in this study. Predictive results that include false positives are acceptable [], because uncovering such candidate associations is also one of the objectives. This study verified all the associations between disease-gene, disease-drug, and gene-drug pairs. After verification, 262 predictive associations (i.e., false-positive associations) were obtained, comprising 57 disease-drug, 84 disease-gene, and 121 gene-drug associations. Table 4 provides a partial list of disease-gene-drug associations linked implicitly but not explicitly to hypertension through the literature. In the prediction of novel disease-gene associations, for example, no study has yet reported whether an association exists between Depression and SLC2A9. Allelic variation at rs6855911 in the SLC2A9 gene is strongly associated with serum uric acid concentration [], while the serum uric acid level in the human body is positively correlated with depression and anxiety; therefore, there may be a potential association between the two bio-entities. The AGT gene is expressed in adipose tissue, the adrenal gland, the brain, blood vessels, and the nervous system, and is involved in cell growth, positive regulation of cytokine synthesis, apoptosis, and cardiac hypertrophy. AGT is a susceptibility gene for hypertension, and hypertensive cerebral hemorrhage can cause coma; AGT is also related to insulin resistance [], and diabetes may likewise cause coma. This suggests that AGT may be related to Coma.
In the prediction of novel disease-drug associations, a clinical case report [] described a 58-year-old woman who was hospitalized with intermittent fever accompanied by anemia and inflammation. She had been taking atenolol and chlorothiazide to treat hypertension, and her diuretic dose had been increased six weeks before hospitalization because the hypertension was uncontrolled. After the antihypertensive drugs were discontinued, the fever resolved rapidly, and the suspected diagnosis was an allergy to chlorothiazide. Therefore, Chlorothiazide may be associated with Anemia and may cause fever. For the related bio-entity pair of Glomerulonephritis-Doxazosin, three cases report []

Conclusions
In this study, we developed a dictionary and data cube-based NER tool to extract and predict hypertension-related disease-gene-drug associations from PubMed. The results of our study provide novel predictions of disease-gene-drug associations, which will aid researchers in designing future medical/biological experiments.
Conceptual modeling of spatial data cubes requires that two types of metadata be defined: 1) metadata describing a data warehouse that consists of various data sources, can be maintained and integrated, and has a modeled data structure; and 2) metadata describing how the data in the warehouse can be analyzed to meet the needs of decision makers. In the hypertension data cube, the entities we defined (diseases, genes, and drugs) can be viewed as the first type of metadata, while other entities as defined in the literature and dictionaries can be viewed as the second type of metadata. Therefore, the model of the multidimensional dataset is network-based.
Using hypertension-related literature abstracts from the most recent 30 years together with the disease, gene, and drug dictionaries, we applied data cube analysis and association rules to highlight and extract the key nodes in the disease-gene-drug cube network. The AUCs of our algorithm were 0.841, 0.858, and 0.836 for the disease-gene, disease-drug, and gene-drug associations, respectively. This allowed us to mine the potential links between bio-entities and to make quantitative assessments. Meanwhile, the whole heterogeneous network may contribute further to the discovery of candidate genes and drugs in hypertension-related diseases and the deduction of novel disease-gene-drug associations. The next step will be to assess the performance of the algorithm on massive data sets and to promote its further use.

Availability of data and materials
The disease dataset used in this article is available in the International Classification of Diseases (ICD 10) at https://icd.who.int/browse10/2019/en.
The drug dataset used in this article is available in DrugBank.

Figure 2 Sketch graph of the three-dimensional data cube. Each cube represents a different grouping. The basic cube contains three dimensions: disease, gene, and drug.