Similarity Ensemble Approach
The chemical-centric method can exploit the pharmacological relationships among protein targets in addition to their biological relationships [4]. We queried the target molecule curcumin and found 193 human target proteins associated with it. Desmethoxycurcumin was mapped to 166 human target proteins, bisdemethoxycurcumin to 71, and turmerone to 2. After removing overlapping target proteins, we retained 219 unique target proteins for the subsequent protein-protein interaction network (PPIN) study.
Network formation and Property
We downloaded human protein interaction data (scored links between proteins) from the STRING database and retrieved all interactions involving any of the 219 target proteins. This yielded 208,125 interactions with interaction scores ranging from 150 to 999. We removed edges with a score below 300, which left 58,482 interactions as edges and 11,979 proteins as nodes. The nodes comprised 219 target proteins (TP) and 11,760 interacting proteins (IP).
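The filtering step described above can be sketched as follows. This is a minimal illustration rather than the script used in the study: it assumes the STRING links have already been parsed into (protein_a, protein_b, score) tuples, and the function name build_ppin_edges and all identifiers in the toy data are hypothetical.

```python
def build_ppin_edges(links, targets, min_score=300):
    """Keep scored links that touch a target protein and meet the cutoff.

    links: iterable of (protein_a, protein_b, score) tuples (assumed format).
    targets: set of target-protein identifiers.
    """
    edges = {}
    for a, b, score in links:
        if a == b or score < min_score:
            continue  # drop self-links and edges scoring below the cutoff
        if a in targets or b in targets:
            # Undirected network: store each pair once, in sorted order.
            edges[tuple(sorted((a, b)))] = score
    return edges


# Toy data only; identifiers and scores are invented for illustration.
targets = {"P1", "P2"}
links = [("P1", "P2", 950), ("P2", "P1", 950), ("P1", "P3", 150),
         ("P3", "P4", 800), ("P2", "P4", 300)]
edges = build_ppin_edges(links, targets)
nodes = {protein for pair in edges for protein in pair}
interactors = nodes - targets  # proteins that are not themselves targets
```

Storing each pair under a sorted key prevents the same interaction from being counted twice when the source lists it in both directions.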
Biological Interactions Network (True PPIN) vs. False interaction Network (false PPIN)
To understand the network properties of the two networks (true PPIN and false PPIN), we calculated four edge attributes (scores) for every edge in both networks, using link prediction algorithms implemented in the NetworkX library in Python. The four scores were:
1) Preferential attachment score: the preferential attachment algorithm reflects the idea that the more connected a node is, the more likely it is to receive new links. Thus, an edge connecting two nodes that are themselves highly connected to other nodes has a higher score.
2) Common neighbors score: the common neighbors algorithm captures the idea that two strangers who have a friend in common are more likely to be introduced than two who have no friends in common. Thus, an edge connecting two nodes that share many neighbors has a higher score.
3) Jaccard score: the Jaccard score measures the closeness of two nodes based on their shared neighbors and their degrees. A higher Jaccard score for an edge indicates that its two nodes share many neighbors but are not themselves highly connected to other nodes.
4) Resource allocation score: the resource allocation score measures the closeness of two nodes based on their shared neighbors and the degrees of those shared neighbors. A higher resource allocation score for an edge indicates that its two nodes share many neighbors and that those shared neighbors are not highly connected to other nodes.
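The four scores can be written directly from their standard definitions, as in the sketch below. It is a plain-Python illustration on a toy graph, not the authors' code; the helper names adjacency and edge_scores are hypothetical. NetworkX provides equivalents of these measures (preferential_attachment, common_neighbors, jaccard_coefficient and resource_allocation_index).

```python
from collections import defaultdict


def adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def edge_scores(adj, u, v):
    """Return the four topological scores for the node pair (u, v)."""
    common = adj[u] & adj[v]   # neighbors shared by u and v
    union = adj[u] | adj[v]    # all neighbors of u or v
    return {
        # Product of the two degrees.
        "preferential_attachment": len(adj[u]) * len(adj[v]),
        # Number of shared neighbors.
        "common_neighbors": len(common),
        # Shared neighbors relative to all neighbors of either node.
        "jaccard": len(common) / len(union) if union else 0.0,
        # Each shared neighbor contributes 1 / its own degree.
        "resource_allocation": sum(1.0 / len(adj[w]) for w in common),
    }


# Toy graph: a triangle (A, B, C) with a tail C-D.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
adj = adjacency(toy_edges)
scores = {edge: edge_scores(adj, *edge) for edge in toy_edges}
```

For the edge A-B in this toy graph, the only shared neighbor is C, which has degree 3, so the resource allocation score is 1/3, while the preferential attachment score is 2 × 2 = 4.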
For the true PPIN, we calculated the Pearson correlation coefficient between each edge attribute and the interaction score obtained from STRING. The interaction score correlated poorly with every topological edge attribute, with coefficients ranging from 0.076 to 0.31. Thus, none of the topological edge attributes reflected the biological interaction between two protein nodes. We then tested whether each edge attribute differed significantly between the two groups, true PPIN and false PPIN. The edge attribute that differed most significantly between the two groups, as determined by t-test, was the Jaccard score. The t-test results are available on GitHub in the folder Edge_attributes_hypothesis_testing.
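The two statistics used in this comparison can be sketched as follows. The text does not state which variant of the t-test was applied, so Welch's unequal-variance form is an assumption here, and all numbers are invented for illustration. In practice, scipy.stats.pearsonr and scipy.stats.ttest_ind would also return p-values.

```python
import math
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Invented values, for illustration only.
string_score = [400, 550, 700, 850, 990]
jaccard_true = [0.10, 0.12, 0.11, 0.15, 0.13]
jaccard_false = [0.30, 0.28, 0.35, 0.33, 0.31]

r = pearson_r(string_score, jaccard_true)   # edge attribute vs. STRING score
t = welch_t(jaccard_true, jaccard_false)    # true PPIN vs. false PPIN
```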
Difference in Centrality Measures
Next, we studied the node attributes of the two networks and calculated several centrality measures: degree, closeness centrality, eigenvector centrality, betweenness centrality, local clustering coefficient, and eccentricity.
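A minimal sketch of how these six node attributes might be computed with NetworkX is given below. The wrapper name node_attributes and the toy graph are hypothetical, and the sketch assumes a connected, undirected graph, since eccentricity is undefined otherwise.

```python
import networkx as nx


def node_attributes(graph):
    """Compute six node-level measures on a connected, undirected graph."""
    return {
        "degree": dict(graph.degree()),
        "closeness": nx.closeness_centrality(graph),
        "eigenvector": nx.eigenvector_centrality(graph, max_iter=1000),
        "betweenness": nx.betweenness_centrality(graph),
        "clustering": nx.clustering(graph),
        # Eccentricity raises an error if the graph is not connected.
        "eccentricity": nx.eccentricity(graph),
    }


# Toy graph: a triangle (A, B, C) with a tail C-D.
toy = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
attrs = node_attributes(toy)
```

In the toy graph, node C lies on every shortest path to D, so it has both the highest degree and the highest betweenness.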
We calculated the correlation coefficients among all the centrality measures for the true PPIN and the false PPIN. For the true PPIN, we found a very strong correlation between degree and betweenness centrality (0.95). This shows that high-degree nodes control the information flow in the network by lying on the shortest paths in the PPIN, and that they may contribute to multiple pathways.
For the false PPIN, we found a very strong correlation between degree and eigenvector centrality (0.93) but a poor correlation between degree and betweenness centrality (0.56). This shows that, unlike in the true PPIN, high-degree nodes do not control the information flow in the network.
Further, we used machine learning algorithms, namely logistic regression and random forest, to select the node attributes that best differentiate between the true PPIN and the false PPIN. Closeness centrality was identified as the best classifier. Nodes in the true PPIN have relatively higher closeness centrality values.
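One way to rank node attributes with these two models is sketched below using scikit-learn. The data are synthetic and constructed so that only the "closeness" column separates the classes; the feature list, sample size and model settings are all assumptions and do not reproduce the study's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n)  # 0 = false PPIN node, 1 = true PPIN node

# Synthetic features: only "closeness" is made to differ between classes.
features = ["degree", "closeness", "betweenness"]
X = rng.normal(size=(2 * n, len(features)))
X[:, 1] += 2.0 * labels

# Standardize so logistic-regression coefficients are comparable in size.
X_std = StandardScaler().fit_transform(X)
logit = LogisticRegression().fit(X_std, labels)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# Rank by absolute coefficient (logistic) and impurity importance (forest).
top_logit = features[int(np.argmax(np.abs(logit.coef_[0])))]
top_forest = features[int(np.argmax(forest.feature_importances_))]
```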
Using these findings, we removed the insignificant edges and nodes from the true PPIN to make it sparser. We removed edges with a Jaccard score above the 75th percentile of the true PPIN, and nodes with a closeness centrality below the 25th percentile of the true PPIN. The resulting network had 1900 nodes and 4637 edges.
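The percentile-based pruning can be sketched as below. The function name prune_network and the toy values are hypothetical. The sketch also makes assumptions the text leaves open: both percentiles are computed on the unpruned network, values equal to a cutoff are kept, and an edge is dropped when either endpoint is removed.

```python
import numpy as np


def prune_network(edge_jaccard, node_closeness, edge_pct=75, node_pct=25):
    """Drop high-Jaccard edges and low-closeness nodes by percentile cutoffs.

    edge_jaccard: {(u, v): Jaccard score}; node_closeness: {node: closeness}.
    """
    edge_cut = np.percentile(list(edge_jaccard.values()), edge_pct)
    node_cut = np.percentile(list(node_closeness.values()), node_pct)
    kept_nodes = {n for n, c in node_closeness.items() if c >= node_cut}
    kept_edges = [
        (u, v) for (u, v), score in edge_jaccard.items()
        if score <= edge_cut and u in kept_nodes and v in kept_nodes
    ]
    return kept_nodes, kept_edges


# Invented values, for illustration only.
edge_jaccard = {("A", "B"): 0.1, ("B", "C"): 0.2,
                ("A", "C"): 0.9, ("C", "D"): 0.3}
node_closeness = {"A": 0.5, "B": 0.6, "C": 0.7, "D": 0.1}
kept_nodes, kept_edges = prune_network(edge_jaccard, node_closeness)
```

Here the edge A-C is dropped for its high Jaccard score, and node D is dropped for its low closeness, which also removes the edge C-D.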
Protein Cluster identification
We used the Markov cluster (MCL) algorithm for protein cluster identification. The MCL algorithm is particularly noise-tolerant and effective at identifying high-quality protein clusters [5]. MCL is an unsupervised clustering algorithm for graphs that identifies clusters by manipulating transition probabilities. Protein clusters generally overlap heavily, but MCL is a hard clustering algorithm, so it assigns each protein to exactly one cluster. The fundamental concept behind identifying protein clusters is that two proteins that interact with each other are more likely to share the same function (pathway) than two proteins that do not interact. The MCL algorithm identified 6 clusters within the true PPIN (Figure 2).
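The core MCL loop alternates expansion (taking a power of the transition matrix) with inflation (raising entries to a power and renormalizing columns). The sketch below is a basic NumPy version on a toy graph. The expansion and inflation values of 2 are common defaults assumed here, since the settings used in the study are not given, and the final argmax step is a simple way to obtain a hard assignment.

```python
import numpy as np


def mcl(adjacency, expansion=2, inflation=2.0, max_iter=100, tol=1e-8):
    """Basic Markov clustering; returns sorted lists of node indices."""
    m = np.asarray(adjacency, dtype=float)
    m = m + np.eye(len(m))   # add self-loops
    m = m / m.sum(axis=0)    # make every column sum to 1
    for _ in range(max_iter):
        prev = m
        m = np.linalg.matrix_power(m, expansion)  # expansion: spread flow
        m = np.power(m, inflation)                # inflation: favor strong flow
        m = m / m.sum(axis=0)
        if np.allclose(m, prev, atol=tol):
            break
    # Hard assignment: each node joins the attractor holding most of its flow.
    clusters = {}
    for node, attractor in enumerate(m.argmax(axis=0)):
        clusters.setdefault(int(attractor), []).append(node)
    return sorted(clusters.values())


# Toy graph: two 4-node cliques joined by the single edge 3-4.
adj = np.zeros((8, 8))
for group in ([0, 1, 2, 3], [4, 5, 6, 7]):
    for i in group:
        for j in group:
            if i != j:
                adj[i, j] = 1
adj[3, 4] = adj[4, 3] = 1
clusters = mcl(adj)
```

Because every node is assigned to a single attractor, the clusters cannot overlap, which matches the hard-clustering behavior described above.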
Pathway enrichment Analysis
Target identification and the synergistic interaction among multiple targets are important for unraveling the pharmacological mechanism of action of bioactives. The target proteins belonging to each cluster were searched against the Gene Ontology database (http://pantherdb.org/webservices/go/overrep.jsp). We uploaded the protein list of each cluster and selected the statistical overrepresentation test.
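The idea behind an overrepresentation test can be illustrated with a one-sided hypergeometric tail probability, as sketched below. This is an illustration of the general principle only: the exact test and multiple-testing correction applied by the web service may differ, and the function name and counts are invented.

```python
from math import comb


def overrepresentation_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` proteins are drawn without replacement
    from `population`, of which `annotated` belong to the pathway."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


# Invented counts: 10 background proteins, 4 annotated to the pathway,
# a cluster of 3 proteins, 2 of which are annotated to the pathway.
p_value = overrepresentation_p(population=10, annotated=4, sample=3, hits=2)
```

A small p-value indicates that the cluster contains more pathway members than expected by chance given the background.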
A detailed table listing each cluster with its TP, IP, and pathways is available on the GitHub page as cluster_proteins_pathways.xlsx. The three clusters involved in a substantial number of pathways were clusters 2, 4, and 5, contributing to 25, 35, and 38 pathways, respectively. Three pathways were shared among these three clusters: the Gonadotropin-releasing hormone receptor pathway, the Endothelin signaling pathway, and the Inflammation mediated by chemokine and cytokine signaling pathway. Earlier studies [6] showed a connection between the presence of the gonadotropin-releasing hormone receptor in extra-pituitary tissues and the progression of some cancers, which provides indirect evidence for the anticancer activity of C. longa.