AI-based mining of biomedical literature: Applications for drug repurposing for the treatment of dementia

Abstract
Neurodegenerative pathologies such as Alzheimer's disease, Parkinson's disease, Huntington's disease, Amyotrophic lateral sclerosis, Multiple sclerosis, HIV-associated neurocognitive disorder, and others significantly affect individuals, their families, caregivers, and healthcare systems. While there are no cures yet, researchers worldwide are actively working on the development of novel treatments that have the potential to slow disease progression, alleviate symptoms, and ultimately improve the overall health of patients. Huge volumes of new scientific information necessitate new analytical approaches for meaningful hypothesis generation. To enable the automatic analysis of biomedical data we introduced AGATHA, an effective AI-based literature mining tool that can navigate massive scientific literature databases, such as PubMed. The overarching goal of this effort is to adapt AGATHA for drug repurposing by revealing hidden connections between FDA-approved medications and a health condition of interest. Our tool converts the abstracts of peer-reviewed papers from PubMed into multidimensional space where each gene and health condition are represented by specific metrics. We implemented advanced statistical analysis to reveal distinct clusters of scientific terms within the virtual space created using AGATHA-calculated parameters for selected health conditions and genes. Partial Least Squares Discriminant Analysis was employed for categorizing and predicting samples (122 diseases and 20889 genes) fitted to specific classes. Advanced statistics were employed to build a discrimination model and extract lists of genes specific to each disease class. Here we focus on drugs that can be repurposed for dementia treatment as an outcome of neurodegenerative diseases. Therefore, we determined dementia-associated genes statistically highly ranked in other disease classes. Additionally, we report a mechanism for detecting genes common to multiple health conditions. These sets of genes were classified based on their presence in biological pathways, aiding in selecting candidates and biological processes that are exploitable with drug repurposing.


Background
Over the past decade, advancements in analytical methods have opened dramatic new opportunities to unveil hidden connections among complex networks (1). The advancement of Artificial Intelligence (AI) techniques enables researchers to query and analyze massive datasets, simulate experiments virtually, and generate scientific hypotheses through advanced analysis. Such tools have been used extensively by the pharmaceutical industry to lower the costs of drug discovery (2,3). The development of novel therapeutic agents from idea to FDA approval involves substantial commitments of time and money (4). The FDA strictly evaluates efficacy and safety when approving new therapeutics, but the costs of newly approved therapeutics are now under intense scrutiny as the costs of new drug development have risen. Repurposing existing FDA-approved drugs for new indications can alleviate drug development costs in part because the safety profile and clinical experience already exist for the drug (5). Repurposing significantly accelerates the entire process by taking advantage of crucial steps that already occurred in the original FDA approval process (6). The key to the initial steps of drug repurposing is to find a connection between an existing drug and a disease of interest that is worth exploring preclinically or clinically for a new therapeutic indication. In many cases the necessary knowledge may already be present in biomedical literature; however, the connections between various pieces of information may not be obvious. To determine the hidden connections, we developed AI-based literature analysis tools: MOLIERE (7), followed by Automatic Graph Mining And Transformer based Hypothesis Generation Approach (AGATHA) (8). The recent development of AI allows automated extraction of valuable information from unstructured text such as scientific abstracts or articles, thus enabling efficient and scalable processing of textual data that dramatically saves time and effort compared to manual processing (9). AI algorithms can identify themes, topics, and clusters within a collection of documents, thus helping researchers to overcome challenging, time-demanding analysis of literature (10). In vast databases like PubMed, search results often overwhelm users with excessive information that is hard to curate without detailed, time-intensive assessment. Natural language processing (NLP) techniques in AI frameworks not only speed up the research but also enhance the extraction of valuable knowledge from massive datasets, which is challenging to achieve manually (11). Drug repurposing research benefits from these advantages in many ways (5,12,13). Studying disease at the gene level remains a challenging task despite recent advancements in genomics and technology. NLP methods were successfully used for a variety of gene-related tasks including but not limited to the identification of unique anticancer targets (2), predicting cognitive decline (14), interpreting microbial genes (15) and others. To achieve successful results in NLP calculations, it is imperative to have high-quality training datasets, conduct preprocessing procedures to normalize the data and reduce noise, choose appropriate model architecture (such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), or transformers), and optimize hyperparameters such as learning rate, batch size, dropout rate, and model architecture configurations.
AGATHA is an effective literature-based discovery tool capable of extracting relevant information by sifting through immense scientific databases including PubMed, which expands annually by over one million papers (16). This tool analyzes the collection of lexical elements (e.g., words, phrases, and lemmas) within each research article abstract to identify possible hidden connections among terms specified by the user. The likelihood of the potential connection is estimated using a multi-headed self-attention mechanism accounting for the spatial relationships between individual terms in a latent vector space, which we will refer to as the "AGATHA space".
The AGATHA system pipeline can roughly be split into two stages: (1) semantic knowledge network construction and its embedding into a low-dimensional vector space and (2) transformer-based predictor training. These two stages can be used independently from each other, which we leverage in the present work. The first stage results in a large multi-layered knowledge network, which connects individual units of information (such as the UMLS terms, semantic predicates or entities) with their corresponding literature sources. For example, a term representing "breast cancer" is connected to all sentences and semantic predicates containing this term. After construction, we perform network embedding, such that each node is assigned a learned vector representation (or coordinates) of 512 dimensions. This allows us to establish spatial relationships between individual concepts by computing distances between them (8). Terms that are logically connected, such as different stages of the same health condition or its type, tend to cluster together in AGATHA space. Conversely, terms that are logically distant from each other are positioned relatively far apart. This approach aims to facilitate an intuitive understanding of the relationships and connections between the many scientific terms and concepts within scientific literature.
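To make the idea of spatial relationships concrete, the following is a minimal sketch of how distances between embedded terms could be computed. The term names, the 4-dimensional toy vectors (standing in for the 512-dimensional coordinates), and the helper functions are illustrative assumptions, not part of the AGATHA code base, and the choice of Euclidean or cosine distance is likewise an assumption.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy embeddings; real AGATHA coordinates have 512 dimensions.
embeddings = {
    "term_a": [0.9, 0.1, 0.0, 0.2],
    "term_b": [0.8, 0.2, 0.1, 0.1],
    "term_c": [0.0, 0.9, 0.8, 0.1],
}

def nearest(term, table, dist=euclidean):
    """Return the other term whose vector lies closest to `term`."""
    return min(
        (t for t in table if t != term),
        key=lambda t: dist(table[term], table[t]),
    )

closest_to_a = nearest("term_a", embeddings)
```

In this toy table, `term_a` and `term_b` point in nearly the same direction, so they would be treated as semantically close, while `term_c` lies far from both.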
The second stage results in a transformer-based predictor model, which is trained to prioritize meaningful associations between biomedical concepts above random noise. For each term-term association, it outputs a score within a unit interval indicating the likelihood of this association being biomedically relevant, based on the insights learned from scientific abstracts. When we use the AGATHA predictor in a one-to-many setting (as in this work), we obtain the probability distribution over a range of pairs, where the source term is fixed (e.g., a particular disease) and target terms represent a group of concepts of similar semantic type (e.g., a list of different genes). Therefore, we can identify which genes are more likely to be associated with a particular disease and select the most prominent candidates for further downstream analysis.
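The one-to-many setting described above amounts to ranking target terms by their predicted score for a fixed source term. A minimal sketch follows; the gene names and scores are made up for illustration, and `rank_targets` is a hypothetical helper rather than an AGATHA function.

```python
def rank_targets(scores, top_k=None):
    """Sort (target, score) pairs from the most to the least likely association.

    `scores` maps each target term to a score in the unit interval for one
    fixed source term (for example, one disease against many genes).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# Hypothetical scores for a single source term; values are invented.
toy_scores = {"GENE_A": 0.91, "GENE_B": 0.15, "GENE_C": 0.78, "GENE_D": 0.42}

top_two = rank_targets(toy_scores, top_k=2)
```

The highest-scoring targets are the candidates that would be passed on to downstream analysis.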
To classify AGATHA outcomes, we applied multivariate statistical methods, including Partial Least Squares Discriminant Analysis (PLSDA) (17,18). This method helps categorize and predict samples (diseases/genes) belonging to specific classes. PLSDA has advantages compared to other discriminant methods due to its ability to handle data with high variability and its power to reduce dataset dimensionality through the utilization of latent variables (19). In addition to the classification analysis, unsupervised clustering was utilized to unveil latent relationships that cannot be directly measured within the multivariate data. The combination of these steps followed by comprehensive pathway analysis helps to explain the biological significance of the classification outcomes and produces a final list of genes as candidates for drug repurposing.
In this work, we focused on applying our methodology to identify candidate drugs suitable to be repurposed as treatments for neurodegenerative diseases (20), which pose major challenges in healthcare as the seventh leading cause of death in the world (21). The term "neurodegenerative" covers a wide spectrum of neurocognitive conditions that, despite their different pathologies, often share common symptoms, of which dementia is a major outcome (22). Therefore, the analysis included a broad spectrum of neurodegeneration in the dementia domain to search for common themes and pathways.
Initially, the classification model facilitated the extraction of "dementia" genes, which were subsequently analyzed within the context of the pathways in which they participate. Once it was confirmed that the proposed method effectively extracts the necessary data, the same procedure was employed on the remaining non-neurodegenerative disease classes to obtain specific genes for each group. Next, after obtaining a list of genes that have a high likelihood of being associated with each disease, they were mapped in the Dementia class to assess their positions on a probability scale. Genes for which known small molecules interact with the pathways of interest were prioritized to select FDA-approved medications or medications in experimental or investigational status. We chose a total of 38 drugs for potential repurposing with a focus on six of them, all of which have demonstrated effectiveness in treating diseases unrelated to central nervous system function.

Materials and methods
AGATHA is an open-source algorithm available at: https://github.com/IlyaTyagin/AGATHA-C-GP. The operational principles of AGATHA are detailed in the Additional Information section. Statistical analysis was performed on a multidimensional dataset using MATLAB R2023a software from MathWorks (Natick, MA) and the PLS Toolbox from Eigenvector Research, Inc. (Wenatchee, WA). Pathway analysis was conducted using the g:Profiler tool (23). Gene characterization and selection of potential drugs were achieved with the help of the GeneCards (24) and CTD (25) databases.
Principal Component Analysis (PCA) (26) was primarily used for dimensionality reduction and preliminary data structure analysis. This widely employed method transforms the input data into a set of principal components that describe the variance of the data. It involves a series of mathematical steps, including calculating the covariance matrix of the data, computing its eigenvalues and eigenvectors, and subsequently reducing the dimensionality of the data. The prepared dataset was normalized by the total area and auto-scaled by dividing each of the 512 columns in the calibration matrix by its standard deviation. Subsequently, cross-validation was performed using the Venetian blinds approach, consisting of 10 splits with a blind thickness of 1. The resulting model showed a clear separation between the Dementia and SUD classes, with the rest of the data falling into a single cohesive group. The cross-validated Root Mean Square Error for this model was very low (0.00231327), which indicates good accuracy of the model. The classification model was calculated using the PLSDA method (17), which can deal with heterogeneous data and describe it with only a few Latent Variables (LVs). LVs are calculated using regression coefficients, determined for each component, followed by estimating their positions in the PLSDA space. To moderate the risk of overfitting, Venetian blinds cross-validation was employed. This method involves partitioning the data into k equal-sized segments and alternately using them as training and validation sets. Alongside dimensionality reduction, PLSDA ensures that the calculated components carry unique information by being orthogonal to each other.
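The Venetian blinds scheme mentioned above can be sketched as follows. This is a minimal illustration of the general technique under the common convention that, with a blind thickness of 1 and s splits, every s-th sample goes to the same held-out segment; it is not the PLS Toolbox implementation, and the function name and parameters are hypothetical.

```python
def venetian_blinds(n_samples, n_splits=10, thickness=1):
    """Yield (train_indices, test_indices) pairs for Venetian blinds CV.

    Consecutive blocks of `thickness` samples are dealt to the splits in
    turn, so with thickness=1 split i holds samples i, i+n_splits, ...
    """
    folds = [[] for _ in range(n_splits)]
    for j in range(n_samples):
        folds[(j // thickness) % n_splits].append(j)
    for held_out in folds:
        held = set(held_out)
        train = [j for j in range(n_samples) if j not in held]
        yield train, held_out

# Example with 122 samples, matching the number of health-condition terms.
splits = list(venetian_blinds(122, n_splits=10, thickness=1))
first_train, first_test = splits[0]
```

Because the segments interleave across the sample order, the result depends on how the rows are sorted; samples of one class stored in a contiguous block are spread over all splits.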
Unsupervised hierarchical clustering analysis was applied on part of the Dementia-classified data to group similar values into clusters based on their common characteristics. In cluster analysis, pairs of samples with the smallest distance between them are identified and merged without knowledge of class origin. These similar clusters are then grouped together in dendrogram visualization to provide a clearer representation. Ward's method was used to minimize the variance within each cluster by evaluating the differences between merging two groups of data (27). This approach is especially efficient when handling high-dimensional data, such as our disease/coordinate or gene/coordinate sets, or when clusters are more likely to exhibit equal variance within them. These differences were estimated by the sum of squared deviations from the mean (variance) after merging the clusters (Mahalanobis distance).
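As an illustration of the merging criterion, the sketch below implements a naive agglomerative procedure with Ward's cost, that is, the increase in the within-cluster sum of squared deviations caused by a merge. It is a simplified stand-in under stated assumptions: it uses plain Euclidean geometry on toy 2-D points rather than the Mahalanobis distance and principal-component scores used in the study, and all names are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    squared_gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_gap

def ward_clustering(points, n_clusters):
    """Merge the cheapest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy 2-D points (invented) forming three visually obvious groups.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
toy_clusters = ward_clustering(toy_points, n_clusters=3)
```

Recording the cost of each merge in order would give the heights needed to draw a dendrogram; the naive pair search here is only suitable for small examples.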

Data description
Diseases and conditions of interest were selected from the Disease Database provided by the Unified Medical Language System (UMLS) (28, 29) and combined into the Health Conditions Data Set (HCDS) comprising a total of 122 terms, which are categorized into seven groups: Dementia (24 conditions), Diabetes (12 conditions), Arthritis (9 conditions), Heart Conditions/Diseases (14 conditions), Hypertension (11 conditions), Cancer (12 conditions), and Substance Use Disorders (SUD) (40 conditions/substances). All the selected terms are formal names for diseases and health conditions and the last group (SUD) additionally contains the most common substances of abuse (Table 1). We hypothesized that there are spatial clusters within the AGATHA space that correspond to different groups of health conditions (Table 1). This implies that by mapping genes within the AGATHA space and analyzing their positions relative to disease groups, we can uncover previously unrecognized links between specific gene sets and health conditions. These genes are specifically evaluated for potential drug repurposing opportunities.
A general logical workflow is represented in Scheme 1 below. Disease-categorized data from a variety of databases were mapped to the AGATHA space for further characterization. Then, classification methods were used to build a discrimination model that extracted four lists of genes: 1) genes specific for each disease class; 2) Dementia genes, highly ranked in other disease classes; 3) Disease genes, highly ranked in the Dementia class; and 4) genes common to all diseases. These groups of genes were used for pathway analysis performed using the g:Profiler tool (23), which helped to select the candidates for drug repurposing evaluation.
Exploratory analysis of semantic links revealed by the AGATHA embedding space.
The complex relationships between the selected health condition terms (30, 31) (Table 1), described across multiple diverse scientific articles, were assessed using AGATHA text mining software. Following semantic embedding, these health condition terms are represented as points within a high-dimensional latent space, the embedding space we named earlier as AGATHA space (see the first paragraph of the Results section). The coordinates of these terms are calculated to reflect their semantic properties in such a way that words or phrases with similar meanings are represented by points that are closer to each other within the space. The number of dimensions in an embedding space is typically influenced by the volume and complexity of the text data being analyzed. However, it is also determined by the specific requirements of the model and the task at hand. While a larger and more complex dataset might benefit from a higher-dimensional space to capture more nuanced semantic relationships, the choice of dimensionality also depends on computational constraints and the desired balance between detail and efficiency. In our case, preliminary studies indicated that an efficient embedding of the information contained within the PubMed database of scientific abstracts is achieved by using an embedding space with 512 dimensions (8). In Fig. 1, we see the relative positions of HCDS groups as visualized in 3D space. This visualization is the result of condensing the original 512-dimensional data into a more comprehensible three-dimensional space using Principal Component Analysis (PCA). Two distinct clusters are formed by two non-overlapping sets of health conditions: SUD (green diamonds) and Dementia (red diamonds). The five remaining sets (Diabetes, Arthritis, Heart Diseases, Hypertension, and Cancer) form a tight spatial cluster that is separated from both the SUD and Dementia clusters. Subclusters corresponding to these five groups of health conditions remain distinguishable. However, they are positioned close to each other, resulting in the overlap of certain groups. These observations have led us to hypothesize that the AGATHA space contains a spatial pattern characteristic of the health condition groups. In further sections we implement advanced statistics to identify and characterize such patterns.
Classification analysis of health condition groups mapped to the AGATHA space.
The validity of spatial patterns in the AGATHA space associated with health condition groups was tested using PLSDA, a standard partial least squares classification approach. In this study, we leverage both the interpretability of the multiclass PLSDA models and their capability to effectively handle collinear data. The strong predictive performance of the PLSDA models was subsequently employed to investigate gene/health condition associations.
PLSDA classification has been demonstrated to be a successful method for addressing multivariate data, offering tunable model complexity (18). We used PLSDA to build a supervised classification model (Fig. 2) with classes defined by health condition groups (Table 1). Extensive preliminary classification trials (not reported here) enabled the identification of optimal data preprocessing and classification parameters. For the final classification model, the input matrix containing coordinates in 512-dimensional AGATHA space for all health conditions was preprocessed using normalization by the total area and auto-scaling.
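The two preprocessing steps named above can be sketched as follows. This is an illustrative reading under stated assumptions, not the PLS Toolbox code: "normalization by the total area" is taken to mean dividing each row by the sum of its absolute values, and auto-scaling is taken to mean mean-centering each column and dividing it by its sample standard deviation. The function names and the toy matrix are hypothetical.

```python
import math

def normalize_by_area(row):
    """Divide a row by the sum of its absolute values (assumed 'total area')."""
    area = sum(abs(x) for x in row)
    return [x / area for x in row] if area else list(row)

def autoscale(matrix):
    """Mean-center each column and scale it to unit sample standard deviation."""
    n = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stds)]
        for row in matrix
    ]

# Toy 3 x 2 matrix standing in for the 122 x 512 calibration matrix.
toy_matrix = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
normalized = [normalize_by_area(row) for row in toy_matrix]
scaled = autoscale(normalized)
```

After both steps, every column has zero mean and unit variance, so no single coordinate dominates the latent variables purely because of its scale.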
Despite the high dimensionality of the input data generated through complex algorithms implemented in the AGATHA text mining software, four latent variables were sufficient to produce a robust classification of health conditions. Generally, latent variables are calculated so that each subsequent latent variable captures the shared variance remaining after the extraction by the previously calculated variables. A total of 16.31% of the variance in the data was captured by the first four latent variables.
The stability of the PLSDA classification model was verified using the Venetian blinds cross-validation approach, which involves dividing the data into ten equally sized folds. The final classification model effectively categorizes health conditions into seven predefined groups, as shown in Fig. 2.A, demonstrating cross-validated sensitivity and specificity parameters within the range of 0.786 to 0.990. As expected from the exploratory analysis (Fig. 1), the Dementia and SUD classes exhibited the best classification performance. The Dementia panel in Fig. 2.A reveals that all health conditions initially selected for the Dementia group have a probability close to 100% of being classified as part of the Dementia class. Note that the 0-1 range on the Y-axis in the panel corresponds to a 0-100% range of probabilities. These observations suggest that all the health conditions we originally selected for the Dementia group constitute a distinct spatial cluster in the AGATHA space. Furthermore, should a small portion (one-tenth) of the Dementia set be omitted as 'unknown' health conditions during the training phase, these 'unknowns' are likely to be accurately classified in subsequent classification analysis. Interestingly, this observation holds true not only for Dementia and SUD, but also for Diabetes. In Fig. 1, Diabetes is the most distant from the Dementia and SUD groups, neighboring but not overlapping with the other groups. Discriminating between the Arthritis, Cancer, Heart Disease/Condition, and Hypertension groups is also achievable, as shown in the Discussion section, despite overlapping regions. Two distinct pairs of groups can be identified: the Heart Disease/Condition and Hypertension pair, and the Arthritis and Cancer pair (Fig. 2.A, corresponding panels). Health conditions originally selected for these four groups overlap and show a non-zero probability of being assigned to another class of the pair. The proximity and overlap of the Heart Disease/Condition and Hypertension groups can be explained by the shared physiological characteristics of these disorders (32). There are also certain connections between Cancer and Arthritis, such as associations with chronic inflammation and paraneoplastic arthritis (33). While considering the physiological origins of connections within these two pairs of health condition groups is beyond the scope of this proof-of-concept study, we will later demonstrate that, upon more detailed analysis of the overlapping groups (see black circles in Figs. 2.B and C), it is possible to build a robust classification model for discrimination of all groups.
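The per-class sensitivity and specificity figures quoted above follow the usual one-versus-rest definitions. A minimal sketch of that calculation is given here; the label lists are invented for illustration and do not reproduce the study's predictions, and the function name is hypothetical.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive_class):
    """One-versus-rest sensitivity and specificity for `positive_class`."""
    tp = fn = tn = fp = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive_class:
            if guess == positive_class:
                tp += 1  # member of the class, recognized as such
            else:
                fn += 1  # member of the class, missed
        else:
            if guess == positive_class:
                fp += 1  # non-member wrongly assigned to the class
            else:
                tn += 1  # non-member correctly kept out
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Invented labels showing confusion within the Arthritis/Cancer pair.
true_classes = ["Dementia", "Dementia", "SUD", "Cancer", "Arthritis", "Cancer"]
predicted = ["Dementia", "Dementia", "SUD", "Arthritis", "Arthritis", "Cancer"]

cancer_metrics = sensitivity_specificity(true_classes, predicted, "Cancer")
arthritis_metrics = sensitivity_specificity(true_classes, predicted, "Arthritis")
```

In this toy case one Cancer term is assigned to Arthritis, which lowers the sensitivity for Cancer and the specificity for Arthritis while leaving Dementia and SUD unaffected, mirroring the pairwise overlap described in the text.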
Assigning human genes to health condition groups using AGATHA latent space and advanced statistical methods.
Text mining algorithms provide a unique opportunity to connect scientific concepts using lexical context. As demonstrated above, the AGATHA algorithm successfully condensed scientific information within the PubMed database, capturing lexical context characteristics of the health condition groups we selected for this proof-of-concept study. In this section, we explore the ability of the AGATHA system to uncover hidden connections between genes and health conditions. This was achieved by mapping all human genes to the AGATHA space and categorizing them into health condition groups using the PLSDA classification model. This step was followed by an in-depth analysis of the identified gene clusters in the context of diseases, physiological pathways, and drugs known to interact with these pathways.
The complete list of human genes, mapped to the AGATHA space as a matrix with 20,889 rows and 512 columns, was analyzed using the PLSDA model developed for HCDS. Figure 3 illustrates the distribution of genes among all disease classes, with the color bar showing their attribution to the Dementia class in each category.
As seen in Fig. 3, the distribution of Dementia genes does not follow the same pattern across all other classes. At this point, the evaluation of gene distribution in the Dementia class is necessary to show that the calculated model is coherent from the biological point of view. To achieve this, genes with a probability exceeding 80% were analyzed using hierarchical clustering. This approach aided in investigating the internal structure of the data, followed by the pathway analysis of the calculated clusters.
A total of 1079 genes with a high probability of being associated with dementia were identified by the classification model and further subjected to unsupervised cluster analysis using the agglomerative Ward's method with a total of four principal components and the Mahalanobis distance, which accounts for the variations of multivariate data (Fig. 4). The selected threshold allowed for gene separation into four distinct clusters that were further subjected to a pathway analysis to justify the biological meaning of the data distribution.
The dendrogram in Fig. 4 shows four well-separated gene clusters, each defined by specific biological processes and mechanisms. These assignments were determined through pathway analysis, which can be summarized as follows. Compared with Clusters 2-4, Cluster 1 is separated from the remaining data at the initial threshold level, indicating its unique characteristics. Pathway analysis revealed that the processes within Cluster 1 do not show direct connections to specific physical or behavioral anomalies and cannot be linked to any specific disease category. However, this information is still useful when genes from this cluster are mapped as high-ranked in other disease categories.
Cluster 2 has a strong connection to several kinds of pathways known to be altered in neurodegenerative conditions including but not limited to Alzheimer's disease, Amyotrophic lateral sclerosis, Parkinson's disease, and Apoptosis - multiple species, as labeled by g:Profiler. Cluster 3 has characteristics analogous to Cluster 2, such as nervous system development, presynaptic endocytosis, neuron projection organization, visual perception, regulation of neuron projection development, dendrite morphogenesis, regulation of cell projection organization, and neuron projection. Some of the pathways discussed here are relevant to conditions such as Parkinsonism, disturbances in higher cognitive functions, central motor function disruptions, Ataxia, speech impairments related to the nervous system, and the life cycle of the HIV-1 virus. Cluster 4 differs from the other three by having pathways related to substance abuse. It includes nicotine, cocaine, and amphetamine addictions, alcoholism, and some pathways connected to the nervous system such as neuroactive ligand-receptor interaction, dopaminergic synapse, retrograde endocannabinoid signaling, axon guidance, and many more (Additional Table 1. Summary of pathways and genes for the Dementia class).
As a result, we acquired a list of genes with a high probability of being connected to Dementia as well as being simultaneously highly ranked in the remaining six classes. After evaluation by the GeneCards database, these genes were separated into one specific group (Additional Table 2. Dementia non-specific genes, highly ranked in other disease groups). Subsequently, highly ranked genes from the Diabetes, Arthritis, Heart, Hypertension, Cancer, and SUD classes were extracted for each disease/condition and were mapped in the Dementia group (Fig. 5). For most of the classes, they did not appear at the top of the probability scale, so only the ones with the highest likelihood of being connected to Dementia were combined (Additional Table 3. Disease-specific genes, highly ranked in Dementia).
In addition to the described analyses, we followed the same procedure and extracted the top 1079 genes for each disease, resulting in six lists: Diabetes, Arthritis, Heart conditions/diseases, Hypertension, Cancer, and SUD. These lists were reduced by retaining only genes distinct to each specific class of disorders to remove excessive overlapping of biological information. Based on these results, pathway analysis was performed for the specific gene lists and summarized in Table 2.
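The list-reduction step described above can be sketched with plain set operations: take the top-ranked genes per class, then keep only those that appear in no other class's list. The gene identifiers, probabilities, and function names below are invented for illustration, and the top-2 cutoff stands in for the top 1079 genes used in the study.

```python
def top_genes(prob_by_gene, n):
    """Return the set of the n genes with the highest class probability."""
    ranked = sorted(prob_by_gene, key=prob_by_gene.get, reverse=True)
    return set(ranked[:n])

def class_specific(top_by_class):
    """Keep, for each class, only genes absent from every other class's list."""
    specific = {}
    for name, genes in top_by_class.items():
        others = set().union(
            *(g for other, g in top_by_class.items() if other != name)
        )
        specific[name] = genes - others
    return specific

# Hypothetical class probabilities for four genes in three classes.
toy_probs = {
    "Diabetes": {"G1": 0.90, "G2": 0.80, "G3": 0.10, "G4": 0.20},
    "Cancer": {"G1": 0.70, "G2": 0.10, "G3": 0.95, "G4": 0.30},
    "Hypertension": {"G1": 0.85, "G2": 0.90, "G3": 0.20, "G4": 0.10},
}

top_lists = {name: top_genes(probs, 2) for name, probs in toy_probs.items()}
specific_lists = class_specific(top_lists)
```

A class whose top genes are all shared with other classes ends up with an empty specific list, which is one way a result such as "0 genes" for a class can arise from this procedure.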

Pathway overview
Pathways specific for the Dementia class based on selected genes are described above.

Diabetes (109 genes)
The diabetes group included insulin and sugar metabolism-related pathways, glucose transmembrane transporter activity, insulin secretion, monosaccharide transmembrane transport, Type II diabetes mellitus, abnormal hemoglobin, and others, as well as general pathways that can be assigned to a variety of conditions.

Arthritis (32 genes)
The arthritis group included a variety of autoimmune and inflammatory diseases, which is reflected by the list of different characteristic pathways: immune response-activating cell surface receptor signaling pathway, activation of immune response, regulation of B cell receptor signaling pathway, protein deglutamylation, abnormal lymphocyte proliferation, and others.

Heart condition/disease (67 genes)
The group included pathways such as blood circulation, regulation of blood circulation, regulation of heart contraction, regulation of heart rate, heart development, and others. By nature, some of these pathways are related to hypertension, which can explain the absence of specific genes in the Hypertension cluster.

Hypertension (0 genes)
There were no specific genes identified for this group. Hypertension can be caused by various factors. It is related to the function of the heart, diabetes due to damaged arteries, excessive alcohol consumption, and others. Some of these causes are covered by the other six classes, so the absence of hypertension-specific genes is not surprising and can be investigated further.

Cancer (18 genes)
The cancer group can be described by several pathways related to DNA damage sensing activity (34). This class includes the RAD51B-RAD51C-RAD51D-XRCC2-XRCC3 complex, in which inactivating mutations predispose to breast, ovarian and prostate cancers (35).

SUD (397 genes)
The selected SUD group was composed of various addictions and individual substances. This was reflected in identified pathways that included chemical carcinogenesis - DNA adducts, nicotine addiction, morphine addiction, cocaine addiction, common pathways underlying drug addiction, drug metabolism - cytochrome P450, drug ADME (absorption, distribution, metabolism and excretion), steroid hormone biosynthesis, dopamine neurotransmitter receptor activity, dopamine secretion, modulation of chemical synaptic transmission, retrograde endocannabinoid signaling and many others.

Drug repurposing
The classification model predicted the list of genes with high potential to be associated with Dementia. These hidden connections are selected based on the learned patterns and relationships in the data, indirectly revealing associations between terms. Further developments in understanding these relationships will require additional interpretation and analysis beyond the model itself. We selected a list of potential drugs for repurposing analysis based on the presence of genes specific to six groups of diseases within the Dementia class using GeneCards and the Comparative Toxicogenomics Database (CTD) (25). Chosen medications were additionally verified by the DrugBank database (36) to track their approval status as well as the stage of clinical trials (Additional Table 4). Finally, Table 3 summarizes the most significant candidates for drug repurposing based on the combination of statistically predicted gene/disease connections discovered by the classification model and gene/drug connections identified using the databases listed above. Specifically, Bosentan, Mecamylamine, and Methylphenidate are the most compelling candidates, ranking at the top of the lists of small molecules for their predicted targets' membrane involvement, cytoplasmic activities, and more. As of now, this list may not serve as a basis for future target selection, but it effectively illustrates the shared nature of diseases.

Discussion
Over the last decade, various literature-mining methods were introduced for biological analysis. AI technology provides researchers with an opportunity to perform experiments with biomedical entity normalization applied to multiple datasets (43). This facilitates the identification of intricate gene citations in scientific articles and books (44) and aids drug repurposing efforts (45,46). Previously we introduced literature mining methods (7,8,47) and demonstrated their potential application in many areas including the introduction of a new use for existing approved drug therapies (48). As a result of this project, a list of potential drugs for dementia treatment was extracted by AGATHA and advanced statistical analysis. The method identified hidden connections and pathways related to different diseases and neurodegeneration specifically. AGATHA-calculated variables for 122 diseases were separated into seven classes to calculate the PLSDA classification model. Initial discrimination showed that Dementia and SUD were separated from the rest of the group, forming well-defined clusters while the other diseases remain in a uniform cloud. However, when these two classes are excluded from the dataset, the rest of the diseases separate without any overlap (Fig. 6). This illustrates the potential of the method to be applied in future projects studying other diseases or gene combinations.
In the next step, 20889 genes were classified by the PLSDA model, which revealed different distribution patterns among the classes (Fig. 3). It appears that in most cases Dementia genes are present at the bottom of the probability scale. It was noted that certain Dementia genes within the top 20% have a probability of being associated with SUD, with only three showing a connection to Diabetes. These genes are not prevalent in the top tier of the other four classes (Fig. 5). As a result, a close look at the genes at the top of the Dementia class showed elements of neurodegeneration as well as substance abuse. A total of 1079 genes (top 20%, Fig. 2A, Dementia plot) were subjected to pathway analysis, which supported their assignment to that class, since they play crucial roles in numerous vital biological processes. Notably, they are involved in the Glutamatergic synapse pathway, which contributes to ensuring proper brain function. Disturbances in glutamate transmission or the improper regulation of glutamate receptors have been linked to various neurological disorders, such as epilepsy, Alzheimer's disease, and schizophrenia. On the other hand, it has been shown that changes in the metaplasticity of glutamatergic synapses play a significant role in the development of chronic SUD (49). In addition, it is known that tryptophan metabolism can have implications in the context of substance abuse due to its role in the production of neurotransmitters, including serotonin (50), which was shown in studies of patients with alcohol use disorder (51,52). The same pathway leads to the development of Alzheimer's disease due to the inhibition of various enzymes responsible for the biosynthesis of β-amyloid (53). Thus, the genes that have a higher probability of being associated with Dementia can serve as potential targets for future drug repurposing due to their shared nature between SUD and Dementia based on the discovered pathways. As a result of mapping the statistically allocated Dementia genes in the remaining classes, we obtained a list of genes highly ranked in other diseases.
A selection procedure was performed for the Diabetes, Arthritis, Heart conditions/diseases, Hypertension, Cancer, and SUD classes, extracting the same number of genes as for Dementia. Highly ranked genes in every group were then mapped in the Dementia class to evaluate their positions. This resulted in a separate list of genes that are not necessarily specific to any of the selected types of neurodegenerative disorders but have higher scores in general. Based on the acquired information, the list of potential drugs for repurposing was created using GeneCards and CTD (Table 3). The suggested method enabled us to explore textual data from various angles. Apart from examining the interconnection of genes, it facilitates the identification of genes unique to each type of disease (Table 2).
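The two-way mapping described above can be sketched as follows. This is an illustration only. It assumes that "top 20%" refers to genes with a class probability of at least 0.8 (the text does not define the cutoff explicitly), and that the other classes contribute the same number of top-ranked genes. All gene names, probabilities, and function names below are hypothetical placeholders.

```python
def select_above(probabilities, threshold=0.8):
    """Genes whose class probability is at least `threshold`, highest first."""
    hits = [gene for gene, p in probabilities.items() if p >= threshold]
    return sorted(hits, key=probabilities.get, reverse=True)


def top_n(probabilities, n):
    """The n genes with the highest class probability, highest first."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:n]


def positions_in(genes, probabilities):
    """Look up the listed genes in another class (genes absent there are skipped)."""
    return {gene: probabilities[gene] for gene in genes if gene in probabilities}


# Hypothetical per-class probabilities for five placeholder genes.
dementia = {"gene_a": 0.95, "gene_b": 0.85, "gene_c": 0.40, "gene_d": 0.10, "gene_e": 0.05}
sud = {"gene_a": 0.90, "gene_b": 0.20, "gene_c": 0.88, "gene_d": 0.15, "gene_e": 0.60}

# Dementia genes above the cutoff, and an equally sized list for the other class.
dementia_top = select_above(dementia, threshold=0.8)
sud_top = top_n(sud, len(dementia_top))

# Map each list into the opposite class to see where its genes land.
dementia_in_sud = positions_in(dementia_top, sud)
sud_in_dementia = positions_in(sud_top, dementia)
```

In this toy example, gene_a scores highly in both classes, which is the kind of shared gene the mapping is meant to surface, while gene_b is specific to Dementia.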
The exploration of genes common to all 122 groups revealed their tendency to be present in pathways involved in many biological processes simultaneously, which supports the accuracy of the proposed method. The pathways disclosed in this list have a wide range of meanings and can be attributed to many processes or disorders. These similarities could potentially be used in future steps of the research project to discover new hidden connections. To summarize, the combination of the literature-mining method AGATHA with advanced statistical analysis allowed for the separation of four lists of genes: Dementia genes highly ranked in other disease classes; disease genes highly ranked in the Dementia class; genes specific to every disorder; and genes common to all diseases. This information was used for the selection of potential drugs for repurposing and has the potential to be used in future experiments involving finding new common pathways, selecting specific genes within the same group of diseases, or creating a robust automatic prediction method for different inquiries.
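Separating genes specific to one disorder from genes common to all of them can be expressed with plain set operations. The sketch below assumes each disease group has already been reduced to a list of selected genes. How that membership was decided in this study is not restated here, and the gene names and the function name unique_and_common are hypothetical.

```python
def unique_and_common(gene_lists):
    """Split per-group gene lists into group-specific genes and genes shared by all.

    gene_lists maps a group name to an iterable of gene names.
    Returns (unique, common): unique maps each group to the genes found in no
    other group; common is the set of genes present in every group.
    """
    sets = {name: set(genes) for name, genes in gene_lists.items()}
    common = set.intersection(*sets.values()) if sets else set()
    unique = {}
    for name, genes in sets.items():
        # Union of all genes appearing in any other group
        others = set().union(*(s for other, s in sets.items() if other != name))
        unique[name] = genes - others
    return unique, common


# Hypothetical placeholder lists for three groups.
gene_lists = {
    "Dementia": ["gene_a", "gene_b", "gene_x"],
    "SUD": ["gene_a", "gene_c", "gene_x"],
    "Diabetes": ["gene_d", "gene_x"],
}
unique, common = unique_and_common(gene_lists)
```

Here gene_x is common to all groups, gene_b, gene_c, and gene_d are each specific to one group, and gene_a falls into neither list because it is shared by only two groups.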

Conclusions
We developed an AI-based literature mining tool, AGATHA, and proposed its novel use to discover drugs with the potential for repurposing in the context of neurocognitive disorders. We accomplished the primary objective of identifying hidden connections between approved medications and specific health conditions through advanced statistical analysis, including techniques like PLSDA and unsupervised clustering. The methodology involved grouping scientific terms related to different health conditions and genes, followed by building discrimination models to extract lists of disease-specific genes. These genes were explored through pathway analysis to select candidates for drug repurposing. As a result, we selected six main drugs for the subsequent bench study: Bosentan, Mecamylamine, Methylphenidate, Tretinoin, Imatinib, and Hydralazine.

PLSDA prediction. Genes highly ranked for every disease (pink markers), mapped by their prediction probability for the Dementia group.

Abbreviations
AI: Artificial Intelligence
AGATHA: Automatic Graph Mining And Transformer based Hypothesis Generation Approach
NLP: Natural language processing

Figure 2 Results of cross-validated PLSDA classification analysis. A - Probabilities for each disease to be classified as an assigned class, Q residuals vs Hotelling T^2 plot. PLSDA plots represent disease separation for the first two (B) and three (C) latent variables.

Figure 3 PLSDA prediction of gene distribution among diseases. A - Dementia, B - Diabetes, C - Arthritis, D - Heart condition/disease, E - Hypertension, F - Cancer, G - SUD, H - Q residuals vs Hotelling T^2 plot. Genes are colored by the prediction probability of being assigned to Dementia.

Figure 4 Hierarchical clustering analysis using Ward's method. Different colors illustrate gene groupings characterized by various disease markers.

Table 1
Health Condition Data Set.Diseases, conditions, and substances are grouped by the type and color coded regarding the

Table 2
Main pathways identified for the lists of specific genes for all disease classes.