Evaluation of the original catalogue
Between the publication of our original catalogue and the preparation of this manuscript, we identified 79 Drosophila genes that had gathered enough experimental evidence to be considered synaptic genes according to our previous criteria [6]. Additional file 2 lists these genes along with the references supporting their synaptic functions. Roughly a third of these NSG (28 genes) were present in our original catalogue. A standard approach to evaluate the overrepresentation of a certain feature (in this case, being a NSG) in a list of genes is to perform an enrichment analysis (see Methods). Using in-house scripts and the hypergeometric distribution, we calculated the enrichment in NSG of our original catalogue and its associated p-value. We found that our original catalogue has an enrichment in NSG of 4.38 with a p-value < 10⁻¹⁰ (Table 1 in the Supplementary Files).
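The enrichment calculation described above can be illustrated with a minimal sketch. It assumes that enrichment is the fraction of NSG in the catalogue divided by the fraction of NSG in the background, and that the p-value is the upper tail of the hypergeometric distribution; the in-house scripts may differ in detail. The function names and the numbers in the example are hypothetical and are not the values used in this work.

```python
from math import comb


def enrichment(N, K, n, k):
    """Fold enrichment: (k / n) divided by (K / N).

    N: genes in the background, K: NSG in the background,
    n: genes in the catalogue,  k: NSG in the catalogue.
    """
    return (k * N) / (n * K)


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are NSG (upper tail of the hypergeometric distribution)."""
    upper = min(n, K)
    # Sum exact integer counts first, then divide once to limit rounding error.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical numbers for illustration only, NOT the values from this study.
N, K, n, k = 1000, 50, 100, 20
fold = enrichment(N, K, n, k)
p_value = hypergeom_pvalue(N, K, n, k)
print(f"enrichment = {fold:.2f}, p-value = {p_value:.3g}")
```

An enrichment above 1 means the catalogue contains more NSG than expected by chance; the p-value quantifies how unlikely that excess is under random sampling.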
Improved training scheme
The changes to the training scheme of our model tested here are detailed in the Methods section and schematized in Fig. 2.
To test whether these changes improved the predictive performance of our approach, we trained a model implementing them with the original training set and compared its results with those of the original model. As shown in Table 1 and Fig. 3, the changes resulted in better predictive power, measured as enrichment in NSG. This improvement was also observed when each classification algorithm was considered separately (Fig. 3A-C). The intersection of the classifiers trained with sub-samples of the training set always performed better than the classifier trained with the full training set. By the intersection of the classifiers for a given threshold, we mean the set of genes to which all 3 classifiers simultaneously assigned a probability above the threshold.
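The intersection rule can be sketched as follows. This is a minimal illustration, not the code used in this work: the function name, the classifier labels and the gene identifiers are hypothetical, and treating a probability exactly equal to the threshold as passing is an assumption.

```python
def intersect_classifiers(probabilities, threshold):
    """Return the genes that every classifier scores at or above `threshold`.

    `probabilities` maps a classifier label to a dict of gene -> probability
    of being synaptic. Including the boundary (>=) is an assumption.
    """
    selected = None
    for gene_probs in probabilities.values():
        above = {gene for gene, p in gene_probs.items() if p >= threshold}
        # Keep only genes that also passed in all previous classifiers.
        selected = above if selected is None else selected & above
    return selected if selected is not None else set()


# Hypothetical classifier outputs for four made-up genes.
example = {
    "clf1": {"geneA": 0.97, "geneB": 0.92, "geneC": 0.40, "geneD": 0.99},
    "clf2": {"geneA": 0.95, "geneB": 0.85, "geneC": 0.91, "geneD": 0.93},
    "clf3": {"geneA": 0.99, "geneB": 0.96, "geneC": 0.20, "geneD": 0.91},
}
print(sorted(intersect_classifiers(example, 0.9)))
```

Because a gene must pass in every classifier, raising the threshold or adding classifiers can only shrink the resulting set.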
Evaluation of the new classifiers
After demonstrating that the proposed changes to the training scheme and ensemble rules would have resulted in a series of catalogues more enriched in NSG, we incorporated the 79 NSG into the training set and repeated the whole procedure described above, obtaining 15 new classifiers. Each of these classifiers was evaluated with an independent test set, which was used to calculate the accuracy, the F1 score and the area under the ROC curve (Fig. 2 and Table 2 in the Supplementary Files). The values obtained were compared with those reported by other authors who trained models to predict other biological functions [14–16].
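For reference, the three evaluation metrics can be computed as in the sketch below. It uses only the standard definitions, with the area under the ROC curve computed as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties count one half). The labels, scores and the 0.5 decision threshold are hypothetical and serve only as an illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set: 1 = synaptic, 0 = non-synaptic.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy and F1 depend on the decision threshold, whereas the area under the ROC curve depends only on how the classifier ranks the genes.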
A new catalogue of putative synaptic genes
We trained a new model incorporating the 79 NSG into the training set together with the changes to the training scheme. In our original work, only those genes to which the three classifiers assigned a probability of being synaptic of at least 0.9 were included in the final catalogue. This high classification threshold was set to obtain a catalogue of a given size. Here, the threshold was raised to 0.95 because we aimed for a smaller catalogue, given that fewer synaptic genes remain unknown. The resulting catalogue had 601 genes.
Enrichment of the new catalogue in synapse-related GO terms
To evaluate the quality of the new catalogue, we determined its enrichment in synapse-related GO terms. This was possible because we constructed our training set without using Gene Ontology annotations. We found that 83 of the 601 genes to which our 15 classifiers assigned a probability above 0.95 had some synapse-related GO annotation. To determine whether this is a significant enrichment, all the genes in our training set that have some synapse-related GO annotation must be removed from the background set. This analysis was performed with GOrilla [17] and the results are shown in Table 3 (see Supplementary Files). After excluding these 83 genes from the catalogue, a final catalogue of 518 putative synaptic genes was obtained (Additional file 3).
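The set operations behind this step can be sketched as follows. This is only an illustration under stated assumptions: the function names and gene identifiers are hypothetical, and the enrichment test itself is left to the external tool.

```python
def build_background(all_genes, training_genes, synapse_go_genes):
    """Background set for the GO enrichment test: every gene except the
    training-set genes that carry a synapse-related GO annotation."""
    return set(all_genes) - (set(training_genes) & set(synapse_go_genes))


def split_catalogue(catalogue, synapse_go_genes):
    """Split the catalogue into genes with and without a synapse-related
    GO annotation; the second group forms the final catalogue."""
    annotated = set(catalogue) & set(synapse_go_genes)
    return annotated, set(catalogue) - annotated


# Hypothetical gene identifiers for illustration only.
all_genes = {"g1", "g2", "g3", "g4", "g5", "g6"}
training_genes = {"g1", "g2"}
synapse_go_genes = {"g1", "g3"}
catalogue = {"g3", "g4", "g5"}

background = build_background(all_genes, training_genes, synapse_go_genes)
annotated, final_catalogue = split_catalogue(catalogue, synapse_go_genes)
print(sorted(background), sorted(annotated), sorted(final_catalogue))
```

With the counts reported above, the same split leaves 601 - 83 = 518 genes in the final catalogue.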
Regularly updated on-line catalogue
The model we are presenting here will be re-trained as new synaptic genes are identified. This will result in an updated catalogue that will be available here: http://synapticgenes.bnd.edu.uy. The updated list of synaptic genes used to train the model will be available at the same site.