Evaluation of the original catalogue
Between the publication of our original catalogue and the preparation of this manuscript, we identified 79 Drosophila genes that had gathered enough experimental evidence to be considered synaptic genes according to our previous criteria (6). Additional file 2 lists these genes along with the references supporting their synaptic functions. Roughly a third of these NSG (28) were present in our original catalogue, giving an enrichment in NSG of 4.38 with a p-value < 10⁻¹⁰ (Table 1).
Table 1 – Evaluation of our original prediction.
| | Original model | Improved model | New model |
|---|---|---|---|
| Training set | original | original | updated |
| Training scheme | original | new | new |
| Classification threshold | 0.9 | 0.9 | 0.95 |
| Genes above threshold | 988 | 192 | 605 |
| Enrichment in NSG | 4.38 | 6.07 | - |
| p-value | 4.0E-11 | 1.0E-4 | - |
Comparison of the results of the original model, a model trained with the original training set but the new training scheme, and a model trained with the new training scheme and the updated training set. The enrichment in NSG of the catalogue obtained with the new training scheme is 38% higher than that of the catalogue obtained with the original training scheme, even though both models were trained with the same set of genes. The training set for the new model includes the 79 NSG, so the enrichment in NSG of the resulting catalogue cannot be defined.
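The enrichment and p-value reported in Table 1 can be illustrated with a minimal sketch. It assumes that enrichment is the hit rate inside the catalogue divided by the hit rate in the background, and that significance is the upper tail of a hypergeometric distribution; the text does not state which test was used, so both are assumptions. The function names and the background size of 12,000 genes are hypothetical and chosen only for illustration.

```python
from math import comb


def fold_enrichment(hits, catalogue_size, positives, background_size):
    """Hit rate inside the catalogue divided by the hit rate in the background."""
    return (hits / catalogue_size) / (positives / background_size)


def hypergeom_upper_tail(hits, catalogue_size, positives, background_size):
    """P(X >= hits) when catalogue_size genes are drawn without replacement
    from a background containing `positives` genes of interest."""
    total = comb(background_size, catalogue_size)
    upper = min(catalogue_size, positives)
    tail = sum(
        comb(positives, k) * comb(background_size - positives, catalogue_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Counts from the text: 28 of the 79 NSG fall among the 988 catalogue genes.
# BACKGROUND_SIZE is a hypothetical placeholder, not a value from the paper.
BACKGROUND_SIZE = 12_000
print(fold_enrichment(28, 988, 79, BACKGROUND_SIZE))
print(hypergeom_upper_tail(28, 988, 79, BACKGROUND_SIZE))
```

The result depends directly on the background size, so the printed values will match Table 1 only if the same background as in the original analysis is used.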
Improved training scheme
The changes to the training scheme tested here are detailed in the Methods section and schematized in Figure 2.
To test whether these changes improved the predictive performance of our approach, we trained a model implementing them with the original training set and compared its results with those of the original model. As shown in Table 1 and Figure 3, the changes resulted in better predictive power, measured as enrichment in NSG. This improvement was also observed when each classification algorithm was considered separately (Figure 3A-C). The intersection of the classifiers trained with the sub-samples of the training set always performed better than the classifier trained with the full training set. By intersection of the classifiers for a given threshold we mean the set of genes that were assigned a probability above the threshold by the 3 classifiers simultaneously.
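The intersection rule defined above can be sketched as follows. The function name, the input layout (one dictionary of gene probabilities per classifier) and the toy scores are assumptions made for illustration; whether the comparison with the threshold is strict is also an assumption, since the text uses both "above" and "at least".

```python
def intersect_above_threshold(probabilities, threshold):
    """Return the genes scored at or above `threshold` by every classifier.

    `probabilities` maps a classifier name to {gene: probability of being synaptic}.
    Using >= rather than > is an assumption; the text is not explicit.
    """
    selected = None
    for scores in probabilities.values():
        passing = {gene for gene, p in scores.items() if p >= threshold}
        selected = passing if selected is None else selected & passing
    return selected if selected is not None else set()


# Toy scores (hypothetical gene names and values).
toy = {
    "kNN": {"geneA": 0.97, "geneB": 0.92, "geneC": 0.99},
    "SVM": {"geneA": 0.96, "geneB": 0.98, "geneC": 0.40},
    "RF": {"geneA": 0.99, "geneB": 0.96, "geneC": 0.97},
}
print(intersect_above_threshold(toy, 0.95))
```

Requiring agreement among all classifiers trades recall for precision: a gene is dropped as soon as a single classifier scores it below the threshold.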
Evaluation of the new classifiers
After demonstrating that the proposed changes to the training scheme and ensemble rules would have resulted in a series of catalogues more enriched in NSG, we incorporated the 79 NSG into the training set and repeated the whole procedure described above, obtaining 15 new classifiers. Each of these classifiers was evaluated with an independent test set, which was used to calculate the accuracy, the F1 score and the area under the ROC curve (Fig. 2 and Table 2). The values obtained were compared with those reported by other groups that trained models to predict other biological functions (12–14).
Table 2. Evaluation of the 15 classifiers.
| | | Accuracy | F1 | AU ROC |
|---|---|---|---|---|
| kNN | Mean | 0.93 | 0.88 | 0.97 |
| | SD | 0.03 | 0.05 | 0.02 |
| SVM | Mean | 0.93 | 0.89 | 0.97 |
| | SD | 0.02 | 0.04 | 0.01 |
| RF | Mean | 0.95 | 0.91 | 0.97 |
| | SD | 0.03 | 0.03 | 0.01 |
| Kerepesi et al. 2018 | | - | - | 0.93 |
| Kacsoh et al. 2017 | | - | - | 0.81 |
| Moore et al. 2019 | | - | - | 0.87 |
Fifteen classifiers were obtained by training three algorithms with five different training sets. The performance of each classifier was evaluated using a test set composed of genes that were not used during training. The table shows the mean and standard deviation of the accuracy, the F1 score and the area under the ROC curve of the five classifiers trained with each algorithm. The last three rows show the area under the ROC curve obtained by other groups when predicting other biological functions through machine learning.
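For readers unfamiliar with the three metrics in Table 2, the following minimal sketch computes them for one classifier on a labelled test set. It is not the code used in this work; the function names and toy data are hypothetical, labels are assumed to be 1 for synaptic and 0 for non-synaptic, and the 0.5 cut-off used to binarize the scores is an assumption.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score for binary labels (1 = synaptic, 0 = non-synaptic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored higher than a random negative (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy test set (hypothetical labels and scores).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption
print(accuracy_and_f1(labels, predicted), roc_auc(labels, scores))
```

Unlike accuracy and F1, the area under the ROC curve does not depend on a cut-off, which is why it is the metric most easily compared across studies.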
A new catalogue of putative synaptic genes
We trained a new model incorporating the 79 NSG into the training set together with the changes to the training scheme. In our original work, only those genes assigned a probability of being synaptic of at least 0.9 by the three classifiers were included in the final catalogue. This high classification threshold was set to obtain a catalogue of a given size. Here the threshold was set at 0.95 because we aimed for a smaller catalogue, since fewer synaptic genes remain unknown. The resulting catalogue had 601 genes.
Enrichment of the new catalogue in synapse-related GO terms
To evaluate the quality of the new catalogue, we determined its enrichment in synapse-related GO terms. This was possible because we constructed our training set without taking Gene Ontology into account. We found that 83 of the 601 genes to which our 15 classifiers assigned a probability above 0.95 had some synapse-related GO annotation. To determine whether this is a significant enrichment, all the genes in our training set that have some synapse-related GO annotation must be removed from the background set. This analysis was performed with GOrilla (15) and the results are shown in Table 3. After excluding these 83 genes from the catalogue, a final catalogue of 518 putative synaptic genes was obtained (Additional file 3).
Table 3. Enrichment of the new catalogue in synapse-related GO terms.
| GO term | Description | p value | FDR q value | Enrichment |
|---|---|---|---|---|
| GO:0050807 | regulation of synapse organization | 1.66E-12 | 2.03E-10 | 4.67 |
| GO:0051963 | regulation of synapse assembly | 1.38E-10 | 1.27E-8 | 4.87 |
| GO:0008582 | regulation of synaptic growth at neuromuscular junction | 2.21E-9 | 1.59E-7 | 4.65 |
| GO:0016080 | synaptic vesicle targeting | 7.01E-4 | 1.25E-2 | 8.67 |
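The set operations behind this analysis (correcting the background and then removing already-annotated genes from the catalogue) can be sketched as follows. This is only an illustration under stated assumptions: the function name, argument names and toy gene identifiers are hypothetical, and the enrichment test itself is left to the external tool.

```python
def prepare_go_analysis(catalogue, training_set, all_genes, synapse_annotated):
    """Split gene sets for the GO analysis.

    All arguments are sets of gene identifiers. Returns:
      background        - all genes minus training genes with a synapse-related GO term
      already_annotated - catalogue genes that carry a synapse-related GO term
      final_catalogue   - catalogue genes without any synapse-related GO term
    """
    background = all_genes - (training_set & synapse_annotated)
    already_annotated = catalogue & synapse_annotated
    final_catalogue = catalogue - synapse_annotated
    return background, already_annotated, final_catalogue


# Toy example (hypothetical identifiers).
all_genes = {"a", "b", "c", "d", "e", "f"}
training_set = {"a", "b"}
synapse_annotated = {"a", "c"}
catalogue = {"c", "d", "e"}
background, annotated, final = prepare_go_analysis(
    catalogue, training_set, all_genes, synapse_annotated
)
print(len(background), len(annotated), len(final))
```

With the counts given in the text, the second and third sets would contain 83 and 518 genes respectively.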
Regularly updated on-line catalogue
The model we are presenting here will be re-trained as new synaptic genes are identified. This will result in an updated catalogue that will be available here: http://synapticgenes.bnd.edu.uy. The updated list of synaptic genes used to train the model will be available at the same site.