Colorectal cancer
|
148
|
negative = 44
positive = 104
|
GDS2547
|
Metastatic prostate cancer (HG-U95C)
|
Prostate cancer
|
164
|
negative = 75
positive = 89
|
GDS3268
|
Colon epithelial biopsies of ulcerative colitis patients
|
Colitis
|
202
|
negative = 73
positive = 129
|
DisGeNET disease-gene association dataset
The dataset containing genes and associated diseases was downloaded from DisGeNET version 7.0 [25]. The dataset contains 30,170 diseases and 21,666 genes that form 3,241,576 gene-disease connections. Given the massive dataset size, two filters were set to reduce the number of associations in terms of practicality and to reduce the computational complexity. The filters were set on the columns diseaseType and diseaseSemanticType in the DisGeNET dataset. The diseaseType column divided the data into three categories - disease, phenotype, and group - and we only chose disease concerning our study. On the column diseaseSemanticType, we only chose those rows categorized as Neoplastic Process and Disease. This was done to increase compatibility and better understand the workflow results. After filtering, only 15991 genes and 3929 diseases related to diseases remained for further analysis, which accounted for 329936 gene-disease associations. Figure 1 illustrates a part of the disease distribution over the number of genes for each disease.
The merit of our GediNET as Disease-Disease Associations
Let's assume that given a gene expression dataset D, which was designed to study a specific disease R (for example, Lung Cancer or Breast cancer) to detect the significant genes or the biomarker of this disease-based gene expression. The traditional approach of the classification model suggests a list of k genes that can serve as a biomarker for predicting the patients with the disease R. In other words, Identifying disease-gene associations. One solution could be a linear function F(X) as :
F(X) = w1g1+w2g2+...+wkgk, where wi are the weights (scores) while the gi are the gene expression values. The weights could serve as the importance of each gene expression in this equation. For instance, a value weight close to zero indicates that the associated genes contribute less to the equation model. In other words, F(X) describes the biological interaction between those k genes to form biomarkers.
Our new approach GediNet is different from those traditional approaches by suggesting the following model equation that Identifies disease-disease gene associations:
F*(X) = w1*grp_disease1 + w2*grp_disease2+...+wp*grp_diseasep, where the model consists of p groups that were highly scored by the component S of GediNET(See section The main workflow of GediNET). The group grp_diseasei (i=1,2,..p) is a set of genes associated with one disease. F*(x) represents the model by a linear function of groups, also it could be represented as a decision tree, as illustrated in Figure 2 (Right panel). The left panel of Figure 2 illustrates the decision tree model of the significant genes selected by the traditional approach. One needs to take this list and proceed with other functional enrichment processes to discover more biological relationships. On the other hand, the right panel of Figure 2 shows that the decision tree model consists of genes that are associated with the top three GeDiNET ranked diseases. This model contains information about biological knowledge of the diseases showing the disease-disease associations.
For example, considering the datasets GDS1962 that studies the Glioma disease, GediNet might suggest a model of top three significant groups/diseases:
Grp_disease_1 =PAPILLARY RENAL CELL CARCINOMA,
Grp_disease_2= PLASMA CELL,
Grp_disease_3 = NEOPLASM, and ADULT GLIOBLASTOMA.
Where is the following are the sets of genes associated with each disease:
Grp_disease_1 = {SLC16A1, TAGLN2, TIMP3, IGFBP7, TOP2A, TP53, RRM2,..},
Grp_disease_2 = {CD99, TP53, LPL, CD40, CD38, NCAM1, MYC, CSF3, CDKN2A, FGFR3, CCND1},
Grp_disease_3= {EDNRA, CSPG4, MELK, ENPEP, …}.
Applying GediNet will compute F*(x) that describes the association between the genes, associated with different diseases to the current disease under study and also will describe the relationship between the top detect significant diseases(groups).This might lead to new discoveries that have not been observed before by the traditional approaches. Additionally, the model that GediNET discovers could be combined with the top-ranked group of gene diseases (See the M Component in section number 4), which might be used to explore the relationship between those diseases. One can use those genes in enrichment analysis to discover the different pathways and their roles in each disease and the current disease.
The main workflow of GediNET
The main inspiration for developing our novel tool GeDiNET is the generic approach named G-S-M, which was adopted by different tools such as SVM-RCE [36], SVM-RCE-R [25],SVM-RCE-R-OPT[37], SVM-RNE[35], maTE [27], CogNet [28], miRcorrNet [31], Integrating Gene Ontology Based Grouping and Ranking [34], miRModuleNet [32], PriPath[33] and recently reviewed in Yousef et al. [38]. The main workflow of GeDiNET is illustrated in Figure 3, where the G-S-M approach is presented in the three main sections labeled with the orange section (G), the yellow section (S), and the green section (M), which represent:
- The G Component (Grouping): where the genes of each disease are grouped.
- The S Component (Scoring): where the groups are scored and ranked.
- The M Component (Machine Learning model): where the model is created by training a classifier (Random Forest).
The input for the workflow is gene expression data. The data consist of two classes of samples, classes are control (negative) and disease (positive). The data is split into training and testing. The training data is used to build the final model, while the testing data is used to evaluate the model's performance. The whole workflow is repeated 100 iterations using the cross-validation loop, where the input is randomly split into 90% training and 10% testing in each iteration. A one-way ANOVA test is performed on the training set to filter out the top genes. The top 2000 differentially expressed genes with a P-value less than 0.05 are selected. The selected genes are then used to filter the test dataset to contain the same genes.
The main contribution of the generic approach and the description of each component's functions are explained in detail in the following sections.
Grouping Genes based on Disease (The G component)
The first main component in our tool is the grouping component G ( the orange section in Figure 3), which groups genes into groups. The G component can be any algorithm that groups genes. For example, Yousef et al. previously used the maTE algorithm to group the gene expression by the miRNAs that can target them according to the miRTarBase database [29]. In this tool, the G components group genes based on their disease associations extracted from the biological knowledge of the DisGeNET v7 database [39]. The main idea is to group the genes into disease groups. Each group is one of the known diseases where its group members are the genes associated with this disease. Table 2 is an example of such groups where it includes the disease name (group name), the set of genes associated with this disease and the last column is the number of genes that represent the size of the group.
Table 2. An example of groups of diseases with their associated genes. The last column represents the number of genes in each group (group size)
Group Name
|
Genes
|
#Genes
|
Small Cell Carcinoma Of Lung
|
VPS13B, SLC16A1, ANXA1, CD99, SMARCC1, PCNA…
|
41
|
Leukemia, B-Cell
|
TP53, LAMA4, STK11, CSPG4, CD40, TNFRSF1A…
|
43
|
Stage Iii Breast Cancer Ajcc V6
|
TP53, BRCA2
|
2
|
Head And Neck Carcinoma
|
PRMT5, ANXA1, LGALS1, TIMP3, IGFBP7, PCNA, TNC, TP53…
|
149
|
Secondary Malignant Neoplasm Of Bone
|
ADAM9, SLC16A1, CD99, NME1-NME2, DPYSL3, TNC, TP53, NRAS…
|
145
|
Malignant Glioma
|
TK1, NPAS3, CD63, HMGB1, TAGLN2, TXNIP…
|
162
|
Adenocarcinoma, Tubular
|
PCNA, TP53, EFEMP1, APOE, STK11, PRKD1…
|
31
|
Childhood Brain Neoplasm
|
TP53, NRAS, SOX9, MYC, TNFRSF11B
|
5
|
Adult Myelodysplastic Syndrome
|
CSNK1A1, CTNNA1, HMGB1, PCNA, TOP2A, TP53…
|
58
|
Non-Small Cell Lung Cancer Stage I
|
TP53, PRRX1, IGFBP3, VEGFA, S100A6, GSTK1…
|
22
|
Creating Sub-data
Further, a sub-data set needs to be generated for each disease group. This is achieved by extracting the genes belonging to the specific disease and their original class label from the original gene expression training part of the data. Let f=1,..m be the number of groups generated by the G component. This stage we will extract or create m sub_data named grp_disease_genes_subdata1, grp_disease_genes_subdata2, …, grp_disease_genes_subdatam, that will be serving for the S (Scoring) component. Figure 4 is an example of creating sub_data for four different diseases (groups). For example, the Well Differentiated Pancreatic Endocrine Tumor disease group is a group with five genes associated with this specific disease. The genes are RBMS3, TFE3, SSTR2, NTRK1, and PAX8. Moreover, the X-Linked Lymphoproliferative Disorder disease is another group with only two genes which are SERPINA4 and NR0B2. The sample class is also extracted and specified for each sample, where pos is for the positive class and neg for the negative class. Each disease group with its sub-data is the input to the following S component (yellow section in Figure 3) to be scored and sorted.
Scoring the groups of diseases associated with their genes
Consider the gene expression dataset as D, which contains two classes of s covariate samples (patients and control) and n genes. After applying the grouping component G for each disease, the diseases that are now represented by a sub_data, are scored according to their ability to best differentiate between the two classes after training on the associated sub_data using a Random Forest (RF) classifier. The sub_data is divided into a conventional 80:20 training and testing split. We repeat this procedure r=5 times recording different performance metrics while we use the mean of the accuracy as the assigned score for the specific disease. However, one might use a different combination of those metrics to assign the final score. For more information on such an option, we refer to [37].
Table 3 is an example output of the Scoring component for the GDS2525 dataset.
Table 3: An example of the output of the Scoring S component. The first column is the name of the disease name, the Accuracy column is the score given by the S component, and the Rank is the Rank of the group based on the value of the score.
Disease
|
Score as Accuracy
|
Rank
|
PAPILLARY RENAL CELL CARCINOMA
|
0.98
|
1
|
PLASMA CELL NEOPLASM
|
0.98
|
1
|
ADULT GLIOBLASTOMA
|
0.97
|
2
|
INTESTINAL CANCER
|
0.97
|
2
|
MALIGNANT NEOPLASM OF COLON STAGE IV
|
0.97
|
2
|
DERMATOFIBROSARCOMA
|
0.95
|
3
|
Implementation of GeDiNET
We have implemented the GeDiNET tool using the free and open-source platform KNIME [41] due to its simple and intuitive graphical user interface. KNIME is a highly integrative platform that has enabled the scope to utilize scripts in both python and R in tandem to implement our tool as a KNIME workflow.
The workflow created on KNIME comprises several nodes with their separate functions. Meta-nodes are created as a collection of nodes that perform specific tasks.
The KNIME workflow for GeDiNET is presented in Figure 5. It starts by uploading a list of the names of the dataset via the "List Files/Folders" node. Then a loop over those datasets is run to read each dataset by the node "Table Reader," which is then processed by the meta-node "FilterMissingValues" to remove and or filter out rows with missing values. It then sends the filtered data as input to the GeDiNET meta-node. While the "Integer Input" node allows modifying the number of iterations, the tool should be used while training the model.
The flowchart for the GeDiNET tool is presented in Figure 3, while in Figure 5, the implementation of the GeDiNET as a KNIME workflow along with its meta-nodes and components are shown. The left outport (input ports) brings 3 inputs into the GeDiNET node (Top: Variable Flow port, which contains path/location of datasets and output files, Middle: Input dataset with missing values removed, and Bottom: Number of iterations). The input dataset is passed to the "GroupTargetsByDisease" node, which acts as the biological grouping function by grouping all genes concerning their corresponding diseases. The dataset is then normalized and passed onto the "ClassificationBasedDiseaseRanks" component for ranking via a "Partitioning" node that segments the input dataset into training and a testing set.
The "ClassificationBasedDiseaseRanks" node is expanded in Figure 6, which shows the two further meta-nodes, "Genes filter ttest" which performs a one-way ANOVA test to filter out top genes which are further used in the "Rank and Classify."
Finally, the "SaveResults" node collects all the results after each iteration of the "Loop End" node to process and save the results.
The GediNET Knime workflow could be downloaded from: https://github.com/malikyousef/GediNET.git or https://kni.me/w/3kH1SQV_mMUsMTS-
Model Performance Evaluation
We used the Random Forest Classifier while splitting the data into 80% training and 20% testing. Since the datasets are imbalanced, meaning the dataset's target class has an uneven distribution of observations, we employed the under-sampling method. Such a method deals with the imbalanced datasets by pertaining all of the data in the minority class while decreasing the size of the majority class. Besides, for model training, we applied 10-fold Monte Carlo cross-validation (MCCV) [42]. With Monte Carlo cross-validation (MCCV), fractions of the samples are randomly selected as training data, and the rest is assigned for the test data. The performance measures are computed as the average of 100-fold MCCV.
To evaluate the performance of RF model, several quantitative metrics were calculated, such as Accuracy, Sensitivity, Specificity, and Precision [43], using the following formulations:
Sensitivity (SE, Recall) = TP/(TP + FN)
Specificity (SP) = TN/(TN + FP)
Accuracy (ACC) = (TP + TN)/(TP + TN + FP + FN
Where TP: true positive; FP: false positive, TN: true negative; and FN: false negative. Moreover, the Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve [44]. We used the AUC to evaluate the performance results.
In each iteration, our approach generates lists of diseases groups and their associated genes that are slightly different. Hence, there is a need to apply a prioritization approach on those lists. As utilized in miRcorrNet, we have used rank aggregation methods. In this respect, we have embedded the RobustRankAggreg R package[40], developed by (Kolde et al., 2012) into GediNET workflow. The RobustRankAggreg assigns a p-Value to each element in the aggregated list, which describes how good each element/entity was ranked compared to the expected value.