Properties of Drug Targets
We determined the set of approved oncology drugs based on the current version of the Therapeutic Target Database . Approved, clinical and investigational drugs with corresponding indication and their respective targets were downloaded from the database. The data was filtered and curated to determine a high confidence list of approved and clinical trial oncology targets. Our final list consisted of 102 targets for approved drugs and an additional 277 targets for clinical drugs (Table S1).
Next, we determined a comprehensive list of 70 properties for all human proteins by combining the manually curated literature captured in the Swiss-prot database, the computational predictions for missing features and network centrality properties calculated based on the protein-protein interaction network (Methods). The set of features included standard characteristics based on the protein sequences from Swiss-prot database  (such as molecular weight and physicochemical classes) as well as several other properties that have been shown to differentiate drug-targets from non-drug targets based on previous publications [14–16]. These features included: subcellular localization, post-translational modifications, enzyme classification, PEST region (peptide sequence enriched in proline, glutamic acid, serine and threonine amino acids), secondary structure, signal peptide cleavage, protein essentiality, solvent accessibility, and tissue specificity (for a full list of features see Table S2). Protein essentiality was determined based on mouse homozygous loss-of function mutations that lead to lethality and mapped through orthologs to human proteins .
In addition, we calculated network centrality measures for each protein based on the protein-protein network information from the STRING database . It has been shown that network properties of proteins correlate with their biological functions, essentiality and tissue specificity [25, 26]. Therefore, the network information complements these other properties further helping with the evaluation of less studied proteins where information about biological function and essentiality may be limited.
To capture the known biological functions of the proteins we used the comprehensive ontology database of the commercially available database of Metacore  and scored the proteins based on biological processes, molecular function, and signaling pathways. The scoring of the protein biological functions was done according to the ranked ontologies ordered by the enrichment analysis of the targets in the training set (Methods). This procedure captured the most significant gene ontology categories for the validated oncology targets in the positive training dataset.
Next, we performed statistical tests to identify the feature which were significantly different between the set of drug targets and non-drug targets (Figure 1). Our main findings were in good agreement with previous studies [14, 15]. For example, we found that targets were more likely to be membrane proteins (p=2.33*10-7), were enriched in enzymes (p=2.8*10-11) and tended to be tissue-specific (1.6*10-8). We found the presence of more glycosylation sites (p=9.8*10-20) possibly indicating longer half-lives for drug targets. The target essentiality status, determined from mouse knock-out studies , was significantly more established compared to the non-targets (p<2.2*10-16) reflecting the fact that the drug targets are in general more studied proteins.
Interestingly, the network properties of the proteins showed large differences between drug- and non-drug targets (p<2.2*10-16 for all network measure). Indeed, it has been observed in previous studies that drug targets tend to have a higher number of connections compared to non-target proteins based on unbiased high-throughput yeast two-hybrid system . We found that the commonly used network centrality measures showed some of the most significant differentiation between targets and non-targets (Figure 1). We included a complete list of differentiating features in supplementary data (Table S2 and Figure S1).
Machine Learning prediction of target “druggability”
Our main objective is to use the set of 70 protein features described above to prioritize and score proteins in order to promote the discovery of novel targets and/or to filter the candidate list of targets. The set of 102 targets of approved cancer drugs is our positive training set, this is the typical setting where a learner only has access to positive examples and unlabeled data because in the set of proteins outside this small collection of approved targets there are many targets (for example in the pipelines) or targets still to be considered or discovered. This kind of problem is known in machine learning as Positive Unlabeled (PU) [17, 27] with the additional complication of the high unbalance  between the positive set and the wide set on unlabeled samples. Here we adopt an approach combining easy ensemble  and bagging  as shown in Figure 2. In order to have a balanced training set for our model we generated negative training sets of the same size of the positive set by random sampling without replacement all human proteins after excluding both the approved and clinical trial oncology targets. We built 10,000 random forest models using each of the random negative sets and made predictions based on each model. We then assigned a drug target probability score to each protein by averaging the predictions over the 10,000 models. To evaluate the performance of our model we considered the independent set of 277 targets which had at least one oncology clinical trial drug targeting them, but no approved drugs. We achieved a high-accuracy prediction in identifying the clinical trial drugs resulting in an AUC of 0.89 (Figure 3A and Table S3). Furthermore, we investigated if the threshold on the protein-protein interaction confidence would affect our results, but found no significant changes in prediction accuracy (Table S3A). We note that we tested neural network models as well for the predictions and the results and prediction accuracy was very similar to the random forest models (results not shown). The list of targets for approved cancer drugs used for the training set and the clinical trial drug targets for validation was included in the supplementary data (Table S1).
To further evaluate our predictions, we compared our drug target score with a recent genome-scale CRISPR screen by Behan et al. , where the authors defined a priority score based on a 324 cancer cell line panel. We found that our prediction score had a significant correlation with the priority score calculated from the CRISPR screen (p=5.5*10-12 significant correlation) validating our findings.
Predictive features of target “druggability”
In our models, we included all features, regardless of how well they differentiated drug targets form non-drug targets. Most of the features included can be considered as independent variables, however, there were few subsets such as the network centrality measures or the gene ontology classifications which were strongly correlated (Figure S2). Nevertheless, we found that the random forest models were able to pick the most important variables for the predictions. To characterize the relative importance of the protein features in the predictions we looked at the mean decrease of Gini metric . We averaged the variable importance over 10,000 models giving us an overall estimate of feature importance (Figure 3B). The top measures include the biological function of proteins, network centrality measures (page-rank, closeness, betweenness, degree), protein essentiality, tissue specificity, localization, and solvent accessibility. Indeed, it is expected that the biological function is a crucial feature of a drug target protein, because a protein may be druggable based on its 3D structure and other properties, but if it is not involved in disease-relevant processes targeting it with a drug will have minimal impact on the disease state.
The different network centrality measures and protein essentiality are correlated as shown in previous studies . The combination of biological function, essentiality and centrality measures reflects the fact that the target not only has to have the right biological function, but the high centrality also assures that targeting it with a drug will have a high impact on those processes. Another important feature for our predictions was tissue specificity. Indeed, tissue-specific genes have been shown that are more likely to be drug targets [26, 32] due to the reduced risk of side effects.
We compared our predictions scores for the approved oncology targets, clinical targets and the rest of the proteins (Figure 4A). As expected, our training data had the highest score with a median of 0.96. Nevertheless, the independent set of cancer clinical targets was also characterized by a high median score of 0.73 compared to the rest of the non-target proteins which had a median score of only 0.11. It is expected that a large fraction of proteins is not good drug targets because of poor druggability, toxicity or disease irrelevant biological functions. However, the set of non-drug targets had a subset of 2,117 outliers, proteins characterized with scores larger than 0.5 (Figure 4A). We believe this subset of proteins may contain potentially interesting candidates for novel drug targets.
Next, we compared our predictions with existing drug information from the DrugBank (Figure 4B). We found a strong positive correlation between the number of currently approved drugs and the prediction scores of their corresponding targets. Indeed, the proteins with scores of at least 0.7 had in average at least two drugs targeting them. Our top scored proteins (>0.9) were well established targets, being targeted on average by nearly eight drugs (Figure 4B). These results show that using only a small set of 102 validated targets we were able to recover the majority of known drug targets with high confidence. Although our predictions were based on a training set of oncology targets, many of the druggability features in our model most likely were not specific to oncology targets. However, because we scored higher the cancer specific biological functions, we expected that cancer targets may score higher compared to the targets of other indications. In order to test this, we compared the scores of 540 non-cancer clinical trial targets with the scores of the 277 cancer clinical trial targets from TTD. Indeed, we found a small but significant (p=0.02) difference between the scores of non-cancer and cancer clinical trial targets (0.64 versus 0.73). Additionally, we downloaded a set of 402 targets of withdrawn drugs from the DrugBank. A small set of 20 targets of withdrawn drugs were identified as not having any approved drug targeting them. We found that the scores of targets associated only with withdrawn drugs were significantly smaller compared to the targets of drugs with ongoing clinical trials (0.37 vs 0.73, Figure S3). We note that the set of targets associated with withdrawn targets was rather small, as most drugs fail not necessarily because of their targets, but issues related to efficacy, off-target effects and clinical trial design.
Almost all the highest scored proteins not included in the training set had existing drugs targeting them (Table S4). The target proteins with highest scores included many well validated oncology targets (for example: EGFR, VEGF, c-KIT and c-Met).
The indication specificity of the predictions can be further enhanced by filtering the candidate targets using data specific for the indication of interest. For example, oncology targets could be further filtered down by correlation analysis of gene expression with overall survival in cancer based on public data such as the Cancer Genome Atlas (TCGA) . Similarly for immune-oncology targets it is possible to focus on targets with immune cell specific expression profiles based on publicly available datasets such as for example Database of Immune Cell EQTLs (DICE) .
Comparisons with similar approaches
We compared our prediction performance with other similar publications [15, 16]. Direct comparison between these approaches is difficult due to the differences in training and validation sets, the use of different features and the differences in the performance metrics reported. We would note that the main difference between our approach and the publications of Kim et al. and Bakheet at al. [15, 16] are in the addition of network centrality measures and our rank-based scoring of the oncology relevant biological functions. Indeed, including these features had a significant impact on improving the prediction accuracy (Table S3C) and also enabled us to prioritize targets based on our indication of interest. Our leave-one -out cross validation averaged over 100 random non-target sets achieved a sensitivity, specificity and accuracy around 0.91 and an AUC value of 0.97 (Table S3A). These values show improvement compared to the sensitivity between 0.64-0.85 reported by Kim et al. and the accuracy of 0.89 reported by Bakheet et al. Finally, the publication of Li et al. is similar in their choice to include several network properties. Direct comparison is difficult since they use a very large set of different features. Nevertheless, we performed a cross validation on the 1,100 drug targets they utilized in their model and performed a cross validation using random sampling for the negative drug target set (Table S3B). Our performance on this set of targets was very similar within the sampling error to the prediction performance reported by Li et al. (sensitivity is 0.87 vs 0.9; specificity is 0.85 vs 0.8; accuracy 0.87 vs 0.85) Our approach has higher prediction performance when focusing only on oncology targets by applying a rank-ordered scoring method that can prioritize cancer specific biological functions and potentially can make predictions specific to any indication of interest.