Procedures for building Phosphoproteomes of reference.
GenRS
To obtain the list of protein-coding genes, the whole Arabidopsis thaliana proteome was downloaded from arabidospis.org website (Data Sources). This list (https://github.com/paulati/arabidopsis_phospho/blob/master/data/preproc/Araport11_genes_all.zip) is composed of 48.359 proteins since every splicing variant is included, and they are associated to in-use IDs. So, it became the reference list to map every other dataset to get rid of obsolete IDs. After trimming the splicing variants and filtering for unique registers, a list of 27.655 root-IDs were retrieved, that correspond to all Arabidopsis thaliana protein-coding genes [15]. Procedure can be found at https://github.com/paulati/arabidopsis_phospho/blob/master/preparation/1_all_proteins.ipyn. This dataset is GenRS (Additional File 1: Table S1, https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_preproc/all_ids.zip).
ExpRS
To build ExpRS, 55 of the most significant Arabidopsis thaliana phosphoproteomics datasets published in literature were collected. These datasets were mostly retrieved from the updated databases PhosPhAt4.0 [31] [32] [33] and P3DB3.5 [14] and supplemented with [8] data; and the recent released Arabidopsis thaliana expression atlas which contains one of the most comprehensive single Arabidopsis thaliana phosphoproteomes published so far [9]. Links to databases can be found at Data Sources. Among all these studies included; 17 used cell cultures, and the rest used soil-, plate-, or hydroponically grown tissue, including young seedlings and leaves/rosettes (32), roots (6), pollen (2), or seeds (5). Various types of stress treatments were employed, in particular nutrient stress (nitrogen, sugars, and phosphate), but also hormone treatments (e.g., ethylene and abscisic acid) and biotic stress (as exposed previously in [8]). However, more than 65% of the ExpRS dataset belongs to control conditions. To assemble ExpRS, we retrieve the lists of genes encoding for phosphoproteins identified by mass spectrometry in each study. All in all, ExpRS contains 13.137 genes that encode for proteins with experimental evidence of phosphorylation (Additional File 1: Table S1, https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_preproc/ExpRS.zip).
PredRS
The predicted phosphoproteome named PredRS, was built combining two of the most accurate and updated predictors used in plants: MusiteDeep [34-36] and PhosPhAt [31,32, 33]. We evaluated performance of MusiteDeep, PhosPhAt and the ensemble of both classifiers (PredRS), in order to achieve a high confident prediction with the highest number of phosphoprotein-coding-genes experimentally validated. Procedures can be found at https://github.com/paulati/arabidopsis_phospho/blob/master/preparation/3_predictions.ipynb.
MusiteDeep
MusiteDeep [34,35,36] provides a deep-learning framework for protein post-translational modification site prediction [34]. The method overview and deep learning architecture are well detailed in [34].
MusiteDeep phosphorylation predictor uses protein sequences as input and calculates a score at the amino acid level. All the residues annotated by UniProtKB/Swiss-Prot as phosphorylated were treated as positive sites, while the residues with the same amino acids excluding annotations were regarded as the negative sites [34].
With the aim to formulate the prediction as a binary classification between “gene encoding for a phosphorylated protein” or positive result, and “gene encoding for a non-phosphorylated protein” or negative result; a prediction score was required to be assigned per protein rather than per site. In this regard, we considered a positive prediction if MusiteDeep predicted any site as phosphorylated. This constituted a new classifier based on MusiteDeep.
All the Arabidopsis proteins’ sequences were submitted to MusiteDeep Phosphorylation(S, T) and Phosphorylation(Y) prediction models (Data Sources). The cutoff was set to zero in order to obtain the complete unfiltered set of putative phosphoserine S, phosphothreonine T and phosphotyrosine Y (https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_preproc/musitedeep/filenum_Prediction_results.txt.zip). For every S, T and Y in each protein, a prediction score was obtained (https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_preproc/musitedeep/Prediction_results_all.zip), otherwise we setted -1 as a flag for missing prediction scores. We retrieved the maximum prediction score among phosphoserine score, phosphothreonine score and phosphotyrosine score for each protein (https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_preproc/musitedeep/Prediction_results_all_scores.zip). We then calculated a single score for each protein (protein score) as the maximum between phosphoserine, phosphothreonine and phosphotyrosine scores (https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_preproc/musitedeep/Prediction_results_score_by_protein.zip). As exposed previously, several proteins may be encoded by the same gene due to splicing variants. So we calculated a prediction score (Gene Score) for each gene as the maximum of protein scores within the proteins encoded by the same gene (https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_preproc/musitedeep/Prediction_results_score_by_id_base.zip). These gene scores belong to the new classifier based on MusiteDeep.
A true positive result (tp) is a gene predicted as coding for a phosphorylated protein and included in ExpRS, thus experimentally validated. In this regard, each gene has been labeled according to ExpRS so as to judge it as a true or false positive prediction. According to this, a false positive result (fp) is a gene predicted as coding for phosphorylated protein and not included in ExpRS. A true negative result (tn) is a gene predicted as coding for non-phosphorylated protein and not included in ExpRS. A false negative result (fn) is a gene predicted as coding for non-phosphorylated protein and included in ExpRS. These parameters characterize a classifier’s performance.
To determine the optimal classifier’s threshold for gene scores, we have calculated the value on the ROC curve (Receiver Operating Characteristic) which maximizes true positive rate (tpr = tp / (tp + fn)) and minimizes false positive rate (fpr = fp / (fp + tn)). AUC (Area under ROC Curve) score for this classifier was 0.73 and the value obtained for the threshold was 0.85. Therefore, the optimal threshold for this classifier was calculated as the mean of the results from computing the optimal threshold value for 5000 random samples of 30% of the total set. With threshold set in 0.85, classifier’s performance results in fpr = 0.35, tpr = 0.69 and 4105 genes predicted as fn.
PhosPhAt
PhosPhAt offers a phosphorylation site prediction tool specifically trained on experimentally identified Arabidopsis thaliana phosphorylation motifs [14]. It is based on a Support-Vector-Machines algorithm whose feature-vector consists of the sequence of amino acids and their chemical-physical properties [14]. The high-confident predicted sites are available to download from PhosPhAt website (Data Sources).
We retrieved the genes that encode for proteins with predicted sites by PhosPhAt, and considered as true positive results every gene included in ExpRS. According to this, PhosPhAt’s performance resulted in fpr = 0.001, tpr = 0.769 and 3034 genes predicted as fn.
PredRS performance
PredRS resulted from the combination of PhosPhAt and MusiteDeep predictions. PredRS considered a gene as encoding for a phosphorylated protein, if either PhosPhAt or MusiteDeep predicted it as positive.
Metrics obtained for PredRS performance were the following: Sensitivity or recall = 0.91, specificity = 0.66, precision = 0.71, fpr = 0.34, tpr = 0.91 and 1113 genes predicted as fn.
According to our evaluation of the predictors’ performance on Arabidopsis data, PhosPhAt outperformed MusiteDeep, because of lower fpr and higher tpr. However, PredRS presented the highest tpr and lowest fn, leading to high sensitivity and precision values. This is the reason why it constituted the predicted phosphoproteome of reference in this study. PredRS contains 17.156 high confident predicted phosphoprotein-coding genes (Additional File 1: Table S1, https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_preproc/PredRS.zip), 70% of which are experimentally validated (intersect with ExpRS).
UnRS.
To assemble the current Arabidopsis thaliana phosphoproteome so far, we performed the union of ExpRS and PredRS, leading to a list of 18.269 phosphoprotein-coding genes (https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_preproc/UnRS.zip). For contrast tests, 11 novel phosphoproteins from in-house unpublished data were added to UnRS becoming U11RS (Additional File 1: Table S1).
Phosphoproteomic from etiolated Arabidopsis seedlings.
Plant material.
WT seeds from ecotype Landsberg erecta background were obtained from the Arabidopsis Biological Resource center (ABRC, Ohio, USA). Seeds were sterilized for two hours in a Cl2 (g) atmosphere generated by the addition of 1,5ml HCl (37% v/v) to 50 ml of bleach (sodium hypochlorite). Sterilized seeds were sown on Murashige-Skoog 0.5X agar 0.8% (w/v) medium in petri boxes and incubated at 4°C in D during 3 d for stratification. Chilled seeds were exposed to a 2 h red light pulse (50 µmol m-2s-1) at 22°C to synchronize germination, and then kept in darkness for 5 d before harvest.
Protein extraction, phosphopeptide enrichment and LC-MS/MS.
Fifteen g of seedlings from 5-d-dark-grown WT seedlings were harvested and grinded with mortar and pestle in liquid N2. Seedling powder was resuspended in 750µl of cold extraction buffer (0.7 M sucrose, 0.1 M KCl, 0.5 M Tris-Cl, pH 7.5, 50 mM EDTA and 2% v/v β-Merchaptoethanol). 750µl of phenol equilibrated at pH 8 with Tris-HCl was added and incubated at 4°C for 30 min. Phases were separated by centrifugation during 30 min at 9000 g (4°C). The phenolic phase was recovered and re-extracted with an equal volume of cold extraction buffer. Proteins were precipitated by addition of five volume of 0.1M ammonium acetate in methanol (Chilled at -20°C) to the phenolic phase, incubated over-night at -20°C, and centrifuged during 1 h at 12000 g (4°C). The pellets were washed twice with two volumes of cold methanol. The washing procedure was repeated twice with two volumes of cold acetone. The pellets were dried and resuspended in 100 µl of resuspension buffer (50 mM potassium phosphate buffer, pH 7.5, 8 M urea, antiproteases 1X (RocheTM) and antiphophatases 1X (RocheTM). Total proteins were quantified by Bradford method (Bradford, 1976) and lyophilized. Lyofilized proteins were sent for phosphopeptide enrichment and mass spectrometry label free quantification to the Protomic Plataform of Chu de Quebec Research Centre (Laval, QC). Samples were resuspended in 450 μl of 50 mM Ammonium bicarbonate. A trypsin digestion of 450µg of each sample was performed followed by phosphopeptide enrichment using Pierce™ TiO2 Phosphopeptide Enrichment and Clean-up Kit. 1/5th of the elution for each sample was injected in a Thermo Orbitrap Fusion Mass Spectrometer. It was performed a 120 min run with a 90 min gradient in a data dependent acquisition (DDA) with high energy collision induced dissociation (HCD) MS/MS-IT detection mode. We performed 3 independent replicates.
Bioinformatic analysis
The bioinformatic analysis was performed using MaxQuant/Andromeda engine, setting as fixed modifications: Carbamidomethylation (C), and variable modifications: Oxidation (M) + Phosphorylation (STY). The database employed was UniProt CP_Arabidopsis Thaliana. The quantification was done using signal intensity values of peptides. A peptide was considered quantifiable if it had at least 2 signal intensity values among biological replicates. Missing signal intensity values were imputed with a noise value corresponding to the 1-percentile of all intensity values by sample. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE [37] partner repository with the dataset identifier PXD008274.
This in-house data corresponds to the etiolated Arabidopsis seedlings’ phosphoproteome, called Et in this study. It contains 933 genes that encode for phosphoproteins quantified in this study (Additional File 1: Table S1), out of a total of 1152 identified filtering with FDR 1%.
GO enrichment analysis in TopGO.
Semi-automated Gene Ontology (GO) terms enrichment analysis (The Gene Ontology Consortium 2001) were performed through the TopGO package as described in [38]. We implemented and applied the ParentChild algorithm [39] for eliminating local similarities and dependencies between GO terms. For statistics, we computed Fisher's exact test which is based on gene counts [40]. Procedures can be found at https://github.com/paulati/arabidopsis_phospho/blob/master/topgo/go_terms_analysis.R
The GO is composed of thousands of functional classes (terms) structured according to a directed acyclic graph (DAG). When a gene is annotated with a class, all inferences emerging from the structure of the GO must also hold true. This is known as the “True Path Rule”. In other words, “an annotation for a class in the hierarchy is automatically transferred to its ancestors, while genes unannotated for a class cannot be annotated for its descendants” [41]. Provided a list of terms enriched in a gene dataset, if the child term has highly statistically significant enrichment, the parent term might appear significantly enriched purely as a consequence of including all the genes from the child term. This is the reason why GO enrichment analysis can not be done independently for each term; if so, each gene would be counted multiple times.
To build the TopGO data object, three elements were required: the GO annotations for mapping the genes to the GO terms [42-44]; the dataset of reference to which every dataset was compared (gene universe or population); and the dataset to be tested for enrichment of GO terms (It´s intersection with the gene universe must be complete). GO annotations for the three ontologies (Biological Process, Molecular Function and Cellular Component) were retrieved from GO.db version 3.12 (Data Sources). Mapping was performed as specified in https://github.com/paulati/arabidopsis_phospho/blob/master/topgo/annotations.R. GenRS was defined as the gene universe for comparing every phosphoproteomic dataset in this study (PredRS vs. GenRS, ExpRS vs. GenRS, UnRS vs. GenRS in Fig. 2; and Et vs. GenRS, Un11RS vs. GenRS in Fig.3). And Un11RS became the gene universe when comparing Et (Et vs. Un11RS in Fig. 3). Statistical analysis yielded p-values that were adjusted for multiple testing. In consequence, FDR correction procedure produced conservative p-values and declared fewer terms as significant. We considered significant those values smaller than 0.01. Output results can be found at https://github.com/paulati/arabidopsis_phospho/tree/master/data/results_topgo.
A binary matrix was constructed with the significant (taking value 1) and non-significant (taking value 0) GO terms, for each comparison in Fig. 2 (https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_topgo/fig2/fig2_go_terms_matrix.csv) and Fig.3 (https://github.com/paulati/arabidopsis_phospho/blob/master/data/results_topgo/fig3/fig3_go_terms_matrix.csv). These results were resumed as binary heatmaps, in which the GO terms sharing the same pattern were classified into groups. For each ontology, the count of GO terms in each group was specified.
Semantic similarity analysis.
In order to identify the phosphorylation pathways in Et, we determined the proportion of genes associated with every GO term in groups CG2 and CG3, that are present in Et sample; and their interrelation in the GO graph. First, we constructed the lists of genes associated with the GO terms in each group. Procedure can be found at https://github.com/paulati/arabidopsis_phospho/blob/master/lists_by_group/go_terms_lists.R and data, at https://github.com/paulati/arabidopsis_phospho/tree/master/data/lists_by_groups. Then, we computed the GenRatio for each term in groups CG2 and CG3. Procedure is specified at https://github.com/paulati/arabidopsis_phospho/blob/master/semantic_similarity/et_terms_perc_tables.R. Finally, semantic similarity graphs were built using GenRatios in order to locate the terms in the GO structure (Fig. 4).
The semantic comparisons of Gene Ontology (GO) annotations provide quantitative ways to compute similarities between genes and gene groups [20]. Considering that the specificity of a GO term is usually determined by its location in the GO graph, Wang et al. (2007)[45] proposed a graph-based strategy to compute semantic similarity using the topology of the GO graph structure. In Wang's method, the semantics of GO terms are encoded into a numeric format and the different semantic contributions of the distinct relations are considered [20]. GO terms Semantic similarity was calculated using ViSEAGO package [46] as it is specified in https://github.com/paulati/arabidopsis_phospho/blob/master/semantic_similarity/viseago_topGo_combinatory_fig4.R. The Wang semantic similarity plots display nodes scattered in a 2-dimension space according to their semantic similarity. Every node represents a significant GO term coloured according to the comparison/s, and their size represents their corresponding GenRatio. DAG graphs were built using GOview (Data Sources) so as to show the relationship of GO terms with higher GenRatios (Additional File 2: Figure S1).