Training dataset and encoding
The Pfam database contains more than 19,000 protein families. To obtain a manageable number of protein families for training the neural network, we selected 996 protein families (Supplementary Table S1) by excluding families with fewer than 100 annotated sequences in Pfam's underlying sequence database and by focusing on families present in the plant and animal kingdoms. All sequences of each selected Pfam family were extracted from Pfam's underlying sequence database using MySQL Workbench; the script used is available at https://github.com/LisaVdB/PF-NET. We then excluded long sequences, using a cutoff of 1234 amino acids based on the sequence length distribution. The resulting training dataset consisted of 7,385,028 sequences covering the entire tree of life and was used as input for the neural network without any feature extraction.
Protein sequences are composed of 20 amino acids, each represented by a letter according to the IUPAC guidelines. Four additional letters are used to represent groups of amino acids or rare residues: Z represents glutamic acid or glutamine; O and U represent pyrrolysine and selenocysteine, respectively; and X represents any amino acid. Protein sequences were encoded using a binary encoding scheme in which each amino acid is represented as a string of zeros and ones of length 5 (Fig. 1). Sequences were zero-padded to a length of 1234 to preserve the original input size.
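For illustration, a minimal sketch of such a binary encoding with zero-padding is shown below, assuming a 5-bit code over the 24-letter alphabet; the alphabet ordering and function names are ours and may differ from the published scripts.

```python
import numpy as np

# 20 standard amino acids plus the additional codes mentioned above (assumed ordering).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYZOUX"
MAX_LEN = 1234
N_BITS = 5  # 5 bits suffice to encode the 24-letter alphabet

def encode_sequence(seq: str) -> np.ndarray:
    """Encode a protein sequence as a (MAX_LEN, N_BITS) binary matrix, zero-padded."""
    encoded = np.zeros((MAX_LEN, N_BITS), dtype=np.float32)
    for pos, aa in enumerate(seq[:MAX_LEN]):
        code = ALPHABET.index(aa) + 1          # 1-based so that all-zeros is reserved for padding
        bits = format(code, f"0{N_BITS}b")     # e.g. 3 -> '00011'
        encoded[pos] = [int(b) for b in bits]
    return encoded

example = encode_sequence("MKTAYIAKQR")
print(example.shape)  # (1234, 5)
```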
After binary encoding, and to facilitate training, the dataset was split into 6 batches of around 1.2 million sequences each in a stratified manner, i.e., each batch contained a similar number of sequences for each of the 996 families. Three batches at a time were combined to make 5 different batch combinations of around 3.6 million sequences each. Subsequently, each batch was split in a stratified manner into three proportions: 60% of the sequences were used for training, 20% for model validation, and 20% were held out as test sequences. The stratified train-validation-test split ensures that all classes are divided proportionally across the training, validation, and testing sets 44.
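Such a stratified 60/20/20 split can be obtained, for example, with scikit-learn; the snippet below is a sketch with dummy data, not the authors' exact script.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins: in practice X holds the encoded sequences and y the 996 family labels.
X = np.random.rand(1000, 1234, 5)
y = np.random.randint(0, 10, size=1000)

# Stratified 60/20/20 split: first hold out 40%, then split that portion into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)
```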
Neural network architecture
The neural network architecture consists of four different layer types: (1) a convolutional neural network (CNN) layer, (2) an attention layer, (3) a biLSTM layer, and (4) two dense layers connected to the output vector. To extract putative protein domains within the sequence, we added a 1D CNN that performs a convolution across the encoded sequences with a kernel size of 7 (Fig. 1A). Smaller sequence domains conserved within a protein family can thus be identified by the CNN. The kernel moves along the sequence with a stride of 1 to compute features. A total of 320 filters are applied, yielding 320 different features (Fig. 1A). To emphasize the patterns learned by the CNN layer, we included an attention layer that integrates the entire sequence generated by the CNN layer and produces an output sequence that is a function of the entire input sequence and all its hidden states. This function is an "attention" operation that assigns a different weight to each hidden state (Fig. 1A). As such, the attention layer assigns increased importance to key domains within the sequence. To capture long-distance dependencies within the sequences and between detected domains, we added a biLSTM layer, which processes the output of the attention layer to identify distant dependencies across the entire sequence (Fig. 1A). In this way, the order of distant conserved sequence domains is taken into account. Lastly, two dense layers are connected to the output vector (Fig. 1A).
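A minimal sketch of this architecture in Keras is shown below. Layer sizes not stated in the text (LSTM units, dense units) and the placement of the dropout layers are placeholders, and the attention layer shown is standard dot-product self-attention, which may differ from the published implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

MAX_LEN, N_BITS, N_FAMILIES = 1234, 5, 996

inputs = layers.Input(shape=(MAX_LEN, N_BITS))
x = layers.Conv1D(filters=320, kernel_size=7, strides=1, activation="relu")(inputs)  # (1) CNN layer
x = layers.Dropout(0.3)(x)                                                           # dropout placement assumed
x = layers.Attention()([x, x])                                                       # (2) self-attention layer
x = layers.Bidirectional(layers.LSTM(64, return_sequences=False))(x)                 # (3) biLSTM layer (units assumed)
x = layers.Dropout(0.2)(x)
x = layers.Dense(256, activation="relu")(x)                                          # (4) dense layers (units assumed)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(N_FAMILIES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.summary()
```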
Experimental settings and evaluation
The following hyperparameters were found to give the best performance:
Table 1 – Best performing hyperparameters.
Hyperparameter | Value
Optimizer | Nadam
Learning rate | 1E−7
Dropout probabilities | 0.3, 0.2, 0.5
Filters | 320
Batch size | 100
Activation function | softmax
Epochs | 60 (12 per batch)
Alpha (focal loss) | 0.45
Gamma (focal loss) | 2
Several learning rates for the Nadam (Nesterov-accelerated Adaptive Moment Estimation) optimizer were tested (1E−3, 1E−5, 1E−7) 45. A learning rate of 1E−7 combined with a learning rate decay at every epoch gave the best performance with respect to the loss. Additionally, three different numbers of filters (128, 320, and 480) for the CNN were tested, of which 320 filters gave the best performance with respect to the loss. Two activation functions (softmax, sigmoid) were tested at the final layer, of which softmax led to the best performance with respect to the loss 46,47.
The input dataset from the Pfam database is unbalanced, meaning that the number of sequences per protein family (i.e., the class size) is unequal. Class sizes are thus skewed: 21.5% of protein families are represented by < 1000 sequences, 61.8% by between 1000 and 10,000 sequences, and 16.7% by > 10,000 sequences. To accurately predict such an unbalanced dataset and correct for skewed class sizes, we applied a focal loss function to evaluate the prediction error of PF-NET 48. In addition, to account for the skewed class distribution, class weights were assigned when fitting the model, so that misclassifications are penalized according to class size (i.e., misclassification of minority classes with fewer samples is penalized more heavily than misclassification of majority classes).
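Combining these choices, a hedged sketch of the loss, optimizer, and class-weight setup is shown below, continuing from the architecture and split sketches above (reusing `model` and `y_train`); this is an illustration of the described settings, not the authors' exact code.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

def categorical_focal_loss(alpha=0.45, gamma=2.0):
    """Categorical focal loss with the reported alpha and gamma (a common formulation, assumed here)."""
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        cross_entropy = -y_true * K.log(y_pred)
        weight = alpha * K.pow(1.0 - y_pred, gamma)
        return K.sum(weight * cross_entropy, axis=-1)
    return loss

# Nadam with the selected learning rate; a per-epoch decay schedule or callback would
# implement the described learning-rate decay.
optimizer = tf.keras.optimizers.Nadam(learning_rate=1e-7)
model.compile(optimizer=optimizer, loss=categorical_focal_loss(), metrics=["accuracy"])

# Class weights inversely proportional to class frequency penalize misclassification of
# minority families more heavily (y_train: integer family labels from the split above).
counts = np.bincount(y_train)
class_weight = {c: len(y_train) / (len(counts) * n) for c, n in enumerate(counts)}
# model.fit(X_train, tf.keras.utils.to_categorical(y_train), batch_size=100, epochs=12,
#           validation_data=(X_val, tf.keras.utils.to_categorical(y_val)),
#           class_weight=class_weight)
```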
To evaluate the performance of the neural network, the precision, recall, and f1-score of each individual class were calculated as follows:
$$\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}}$$
$$\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$$
$$\text{f1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
To evaluate the overall performance of the neural network, the accuracy as well as the macro average and weighted average of precision, recall, and f1 score were calculated. For the macro average, the scoring metrics of each class were calculated and averaged. For the weighted average, the scoring metrics for each class were calculated, weighted by the number of true instances for each label, and averaged to account for class imbalance.
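These per-class and averaged metrics can be reproduced, for example, with scikit-learn's classification report; the labels below are toy values for illustration only.

```python
import numpy as np
from sklearn.metrics import classification_report

# y_true: true family labels of the held-out test sequences;
# y_pred: the family with the highest predicted probability for each sequence.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Reports per-class precision, recall and f1-score, together with accuracy and the
# macro and weighted averages described above.
print(classification_report(y_true, y_pred, digits=3))
```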
All scripts for the neural network training, validation, and for making predictions are available on GitHub (https://github.com/LisaVdB/PF-NET). A webtool was also built to easily make predictions with PF-NET (https://psi-rstudio.cals.ncsu.edu/connect/).
Neural network evaluation
To set up and benchmark our interpretable and robust pipeline, we used the yeast proteome as an independent benchmark dataset. The yeast proteome was downloaded from the Saccharomyces Genome Database (Saccharomyces cerevisiae reference strain S288C, release R64-3-1, 2021-04-21; taxon identifier 559292). Sequences longer than 1234 amino acids were split into sequences of 1234 amino acids or shorter. Sequences were encoded, and predictions were made with PF-NET. The predictions were compared against a carefully curated list of experimentally validated kinases and phosphatases from the published literature, which we refer to as the ground truth (Supplementary Table S3). For comparison with PF-NET, hidden Markov models of the Pfam families Metallophos (PF00149), DSP (PF00782), PP2C (PF00481), Phosphatase_Tyr (PF00102), and Pkinase (PF00069) were used to perform an HMMER search (https://www.ebi.ac.uk/Tools/hmmer/search/hmmsearch). We used the default E-value cutoffs (0.01 and 0.03) and restricted the HMMER search by taxonomy (organism: S. cerevisiae).
To identify a threshold that minimizes false positives and false negatives, 100 thresholds between 0 and 1 were tested. The cost associated with each threshold was calculated as follows.
$$\text{Cost} = \text{FDR} - 3 \times \text{TPR}$$
$$\text{FDR} = \frac{\text{False positives}}{\text{False positives} + \text{True positives}}$$
$$\text{TPR} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$$
The threshold with the lowest cost was selected. The R script for threshold identification is available on GitHub (https://github.com/LisaVdB/PF-NET). All plots were generated in Microsoft Excel or R 49 using ggplot2 50.
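The threshold search can be illustrated with the sketch below; the published implementation is an R script, so this Python version with toy data is only an illustration of the cost function above.

```python
import numpy as np

def threshold_cost(probs, truth, threshold):
    """Cost = FDR - 3 * TPR for a given probability threshold (see equations above).
    probs: predicted probabilities for the family of interest; truth: boolean ground-truth labels."""
    calls = probs >= threshold
    tp = np.sum(calls & truth)
    fp = np.sum(calls & ~truth)
    fn = np.sum(~calls & truth)
    fdr = fp / (fp + tp) if (fp + tp) > 0 else 0.0
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return fdr - 3 * tpr

# Evaluate 100 thresholds between 0 and 1 and keep the one with the lowest cost (toy data).
thresholds = np.linspace(0, 1, 100)
probs = np.random.rand(500)
truth = np.random.rand(500) > 0.7
best = min(thresholds, key=lambda t: threshold_cost(probs, truth, t))
print(best)
```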
Neural network predictions in plants
To test whether our neural network generalizes well to plant proteins, we made functional predictions in several model and non-model plant species: A. thaliana, soybean, wheat, maize, sorghum, and rice. After downloading the proteomes of these species from their respective databases, sequences longer than 1234 amino acids were split into sequences of 1234 amino acids or shorter. PF-NET was then used to make functional predictions. The A. thaliana proteome was downloaded from TAIR (Arabidopsis thaliana Genome Annotation Official Release version Araport11, release date June 2016). The soybean proteome was retrieved from SoyBase (Williams 82 Genome Sequencing Project, Assembly 2 Annotation 1). The entire proteomes of Zea mays (maize) (B73 reference NAM v5.0), Sorghum bicolor (sorghum) (Sbi1.4/SbGDB181), Triticum aestivum (wheat) (IWGSC RefSeq v2.1 high-confidence peptides), and Oryza sativa (rice) (v7.0) were downloaded and tested.
We selected the kinases of A. thaliana and soybean by identifying all protein sequences predicted to belong to the Pfam families Pkinase (PF00069) and Pkinase_Tyr (PF07714). To identify the phosphatases of A. thaliana, soybean, rice, wheat, maize, and sorghum at the proteome level, the following four Pfam families were selected: Metallophos (PF00149), DSP (PF00782), PP2C (PF00481), and Phosphatase_Tyr (PF00102).
To test the generalization of our neural network, we compared PF-NET’s results to published computational protein classification studies and HMMER runs, specifically for the Arabidopsis thaliana kinases and phosphatases and the Glycine max kinases. As for Saccharomyces cerevisiae, the cost was calculated for 100 probability thresholds for the predicted A. thaliana kinases and phosphatases and the G. max kinases, using the kinases and phosphatases obtained from the literature as ground truth. This led to optimal thresholds of 0.616 for the A. thaliana kinases, 0.565 for the A. thaliana phosphatases, and 0.646 for the soybean kinases. The A. thaliana and soybean kinases and the A. thaliana phosphatases were retrieved from literature studies 51–53. Hidden Markov models of the Pfam families Pkinase (PF00069) and Pkinase_Tyr (PF07714) were used to search for kinases with HMMER (https://www.ebi.ac.uk/Tools/hmmer/search/hmmsearch) (HmmerWeb version 2.41.2). We used the default E-value cutoffs (0.01 and 0.03) and restricted the HMMER search by taxonomy (species level). Gene descriptions for A. thaliana were downloaded from TAIR. Gene descriptions, GO terms, and PANTHER descriptions for soybean were downloaded from SoyBase 54.
To annotate the as yet undescribed phosphatases in the other, non-model species, we made predictions as described above. To identify the phosphatases of these other crops, we used a lenient threshold of 0.5. Neural network predictions were compared with a HMMER search and with soybean, maize, sorghum, wheat, and rice orthologs of known A. thaliana phosphatases. Hidden Markov models of the Pfam families Metallophos (PF00149), DSP (PF00782), PP2C (PF00481), and Phosphatase_Tyr (PF00102) were used to search for phosphatases with HMMER (https://www.ebi.ac.uk/Tools/hmmer/search/hmmsearch). Orthologs were identified with PLAZA 4.5 through the PLAZA Integrative Method (https://bioinformatics.psb.ugent.be/plaza/) 55. To infer orthogroups for our six species and a rooted species tree, we performed a comprehensive phylogenetic analysis of the six species with OrthoFinder, a software program for phylogenetic orthology inference 56,57.
Label-free phosphoproteomics
Four seeds of the soybean cultivar Altona 58 (USDA-ARS Germplasm Resources Information Network accession PI 548504) were sown at 2 cm depth in one square plastic pot (7 × 7 × 8 cm) filled with regular potting soil (Beroepspotgrond, Saniflor). After 10 days, the seedlings were thinned to two per pot. The pots were placed in a controlled growth room with a light intensity of 400 µmol m⁻² s⁻¹, a constant temperature of 20°C, and a 15 h light/9 h dark day/night rhythm. The plants were watered regularly to keep them under optimal growing conditions; no extra fertilizer was provided, as a base fertilizer (NPK 12-14-24 with micronutrients at 1.2 kg m⁻³) was present in the potting soil. A total of 60 pots per treatment were sown. For the cold treatment, the seedlings were transferred five days after germination to 12°C/5°C. During the night before the start of the cold treatment, the plants were kept in another growth chamber at 20°C, while the original growth chamber was set to 12°C/5°C. The cold treatment started by transferring the plants back into the original growth chamber at the moment the lights were switched on in the morning (at a temperature of 10°C). Every six minutes, up to one hour after the start of the treatment, the tip halves of unifoliate leaves from three distinct plants were pooled for phosphoproteomics in 5 mL tubes containing four 4 mm metal beads and snap-frozen in liquid nitrogen. This process was repeated four times to obtain four biological replicates.
To extract proteins, the frozen material of each pool was ground to a fine powder. The material was then resuspended in homogenization buffer containing 30% sucrose, 50 mM Tris-HCl (pH 8), 0.1 M KCl, 5 mM EDTA, and 1 mM DTT in Milli-Q water, to which one tablet each of cOmplete™ Ultra protease inhibitor mixture (Roche) and PhosSTOP phosphatase inhibitor mixture (Roche) was added. The samples were sonicated on ice to further break cells and subcellular organelles and centrifuged at 4°C for 15 min at 3200 g to remove debris. Supernatants were collected, and a methanol/chloroform precipitation was carried out by adding 3, 1, and 4 volumes of methanol, chloroform, and water, respectively. Samples were centrifuged for 10 min at 3200 g, and the aqueous phase was removed. After addition of four volumes of methanol, the proteins were pelleted by centrifugation for 10 min at 3200 g. Pellets were washed with 80% acetone and centrifuged for 10 min at room temperature at 3200 g. The supernatants were discarded, and the pellets were left to air-dry. Protein pellets were then re-suspended in 8 M urea in 50 mM triethylammonium bicarbonate (TEAB, Sigma-Aldrich) buffer (pH 8). Reduction and alkylation of cysteines were carried out by adding tris(carboxyethyl)phosphine (TCEP, Pierce) and iodoacetamide (Sigma-Aldrich) to final concentrations of 15 mM and 30 mM, respectively, and the samples were incubated for 15 min at 30°C in the dark. Three mg of protein material was pre-digested with 10 µg of MS-grade lysyl endopeptidase (Wako Chemicals) for 2 h 30 min at 37°C. The mixtures were diluted 8-fold with 50 mM TEAB, followed by overnight digestion with trypsin (Promega) at an enzyme-to-substrate ratio of 1:100. The digest was acidified to pH 3 with trifluoroacetic acid (TFA, Biosolve) and desalted using SampliQ C18 SPE cartridges (Agilent) according to the manufacturer’s guidelines. For phosphopeptide enrichment, the desalted peptides were fully dried in a vacuum centrifuge and then resuspended in 500 µl of loading buffer [80% (v/v) acetonitrile (BioSolve), 6% (v/v) TFA] with gentle agitation (800 rpm). The resuspended peptides were incubated with 1 mg of MagReSyn® Ti-IMAC (ReSyn Biosciences) microspheres for 20 min at room temperature with continuous mixing. The tubes were then placed on a magnetic separator for 10 s, and the supernatant was removed and discarded. Next, the microspheres were washed once more with loading buffer, followed by one wash with wash solvent 1 (60% acetonitrile, 1% TFA, 200 mM NaCl (Sigma-Aldrich)) and two washes with wash solvent 2 (60% acetonitrile, 1% TFA). The bound phosphopeptides were eluted with three volumes (80 µl) of elution buffer (40% acetonitrile, 1% NH4OH), immediately followed by acidification to pH 3 with 6 µl of 100% formic acid (Roche). Prior to MS analysis, the samples were vacuum dried and re-dissolved in 50 µl of 2% (v/v) acetonitrile and 0.1% (v/v) TFA.
Peptides were re-dissolved in 50 µl of loading solvent A (0.1% TFA in water/acetonitrile (98:2, v/v)), of which 3 µl was injected for LC-MS/MS analysis on an Ultimate 3000 RSLC nano LC system (Thermo Fisher Scientific) in-line connected to a Q Exactive mass spectrometer (Thermo Fisher Scientific). The peptides were first loaded on a µPAC™ trapping column with C18-endcapped functionality (PharmaFluidics) and, after flushing from the trapping column, were separated on a 50 cm µPAC™ column with C18-endcapped functionality (PharmaFluidics) kept at a constant temperature of 35°C. Peptides were eluted by a linear gradient from 98% solvent A′ (0.1% formic acid in water) to 55% solvent B′ (0.1% formic acid in water/acetonitrile, 20/80 (v/v)) in 120 min at a flow rate of 300 nL/min, followed by a 5 min wash reaching 99% solvent B′.
The mass spectrometer was operated in data-dependent, positive ionization mode, automatically switching between MS and MS/MS acquisition for the 5 most abundant peaks in a given MS spectrum. The source voltage was set to 3 kV and the capillary temperature to 275°C. One MS1 scan (m/z 400–2,000, AGC target 3 × 10⁶ ions, maximum ion injection time 80 ms), acquired at a resolution of 70,000 (at 200 m/z), was followed by up to 5 tandem MS scans (resolution 17,500 at 200 m/z) of the most intense ions fulfilling predefined selection criteria (AGC target 5 × 10⁴ ions, maximum ion injection time 80 ms, isolation window 2 Da, fixed first mass 140 m/z, spectrum data type: centroid, under-fill ratio 2%, intensity threshold 1.3 × 10⁴, exclusion of unassigned and of 1, 5–8, and > 8 positively charged precursors, peptide match preferred, exclude isotopes on, dynamic exclusion time 12 s). The HCD collision energy was set to 25% normalized collision energy, and the polydimethylcyclosiloxane background ion at 445.120025 Da was used for internal calibration (lock mass).
MS/MS spectra were searched against the soybean database (Williams 82 Genome Sequencing Project, Assembly 4 Annotation 1) with MaxQuant software version 1.6.10.43, a program package that allows MS1-based label-free quantification 35,59. Searches were performed per replicate (0, 6, 12, 18, 24, 30, 36, 42, 48, 54, and 60 min samples), including both the control and cold treatment groups, with the “match between runs” feature enabled to maximize peptide identification. Next, the four “Phospho(STY).txt” output files of the four replicates were merged into a single file: if two or more replicates shared the same values for all of the columns “protein id”, “position”, “amino acid”, and “multiplicity”, their quantification values were merged into a single pre-existing row; otherwise, the entry was appended to the data frame as a new row. Only the merged file was used for downstream analysis. The mass spectrometry proteomics data, the MaxQuant settings, the MaxQuant outputs, and the resulting merged file have been deposited to the ProteomeXchange Consortium via the PRIDE 60 partner repository with the dataset identifier PXD037601 (reviewer-only login username: [email protected]; password: 27SjTJbp).
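The replicate merging step described above could be sketched as follows; the column and file names are illustrative and may differ from the actual MaxQuant output and the authors' script.

```python
import pandas as pd
from functools import reduce

# Combine the four replicate tables: rows that share the identifying columns are merged into
# one row, all others become new rows (outer join).
KEY_COLS = ["protein_id", "position", "amino_acid", "multiplicity"]

tables = []
for i in range(1, 5):
    df = pd.read_csv(f"phospho_STY_rep{i}.txt", sep="\t")
    # Suffix the quantification columns so each replicate keeps its own intensity values.
    df = df.rename(columns={c: f"{c}_rep{i}" for c in df.columns if c not in KEY_COLS})
    tables.append(df)

merged = reduce(lambda left, right: pd.merge(left, right, on=KEY_COLS, how="outer"), tables)
merged.to_csv("phospho_STY_merged.txt", sep="\t", index=False)
```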
Analysis and network inference with NetPhorce
Prior to statistical analysis, data cleaning and quality control were performed, including several filtering steps, missing value handling, and normalization. Specifically, contaminants and reverse peptides identified by MaxQuant were removed, as were phosphosites with a localization probability below 0.75. To ensure a balanced dataset for statistical analysis, each phosphosite was required to have at least 3 valid values or at least 3 zeros among the four replicates at each time point of the experiment; phosphosites that did not meet these criteria were filtered out. For example, a phosphosite with 2 valid values and 2 zeros at one of the time points was removed. A phosphosite with 1 valid value and 3 zeros at one or more time points and 3 or more valid values at the other time points was retained and classified as an absent/present phosphosite, which was not subjected to statistical analysis. A phosphosite with 3 or more valid values at all time points was retained and subjected to statistical analysis.
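This filtering rule can be illustrated with the sketch below, which classifies a single phosphosite from its four replicate intensities per time point; it is an illustration of the rule as described, not the NetPhorce implementation itself.

```python
def classify_phosphosite(values_per_timepoint):
    """Classify one phosphosite as 'quantitative', 'absent/present' or 'filtered'.
    values_per_timepoint: dict mapping each time point to the four replicate intensities,
    where 0 denotes a missing/absent measurement."""
    has_absent_timepoint = False
    for intensities in values_per_timepoint.values():
        valid = sum(v > 0 for v in intensities)
        zeros = sum(v == 0 for v in intensities)
        if valid >= 3:
            continue                      # enough valid values at this time point
        elif zeros >= 3:
            has_absent_timepoint = True   # consistently absent at this time point
        else:
            return "filtered"             # e.g. 2 valid values and 2 zeros
    return "absent/present" if has_absent_timepoint else "quantitative"

example = {0: [5.1, 4.9, 5.3, 0], 6: [0, 0, 0, 4.2], 12: [5.0, 5.2, 4.8, 5.1]}
print(classify_phosphosite(example))  # 'absent/present'
```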
Next, to approximate a normal distribution of the intensity values, they were log2-transformed. Variance stabilizing normalization (vsn) was performed to reduce the variation between replicates 36,61. To identify phosphosites that were differentially phosphorylated between conditions or time points, a linear mixed model of the following form was fitted to the phosphoproteomics data, depending on the experimental design:
$$Y = \alpha_i + \beta_j + \gamma_k + \epsilon$$
where \(Y\) is the phosphorylation intensity, \(\alpha_i\) is a fixed effect for the condition variable, \(\beta_j\) is a fixed effect for the time variable, \(\gamma_k\) is a random effect for the replicate variable, and \(\epsilon\) is the within-replicate error. A random effect for replicate was included in the model to account for correlation between plants grown at the same time. To handle the multiple hypothesis-testing problem, which leads to an increased probability of falsely identifying significant hits, q-values were estimated from the calculated p-values 62. A q-value cutoff of 0.05 was chosen. Phosphosites that were significantly differentially phosphorylated between the control and cold conditions, as well as the absent/present phosphosites, were selected for further network inference.
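For illustration, a model of this form can be fitted per phosphosite with, for example, statsmodels; the data frame below uses toy values and illustrative column names, whereas the original analysis uses the NetPhorce R workflow.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One linear mixed model per phosphosite: condition and time as fixed effects,
# replicate as a random intercept; p-values then feed into q-value estimation.
df = pd.DataFrame({
    "intensity": [5.1, 5.4, 4.9, 5.2, 6.0, 6.3, 5.8, 6.1],   # log2, vsn-normalized
    "condition": ["control"] * 4 + ["cold"] * 4,
    "time":      ["0", "6"] * 4,
    "replicate": ["r1", "r2", "r3", "r4"] * 2,
})

model = smf.mixedlm("intensity ~ C(condition) + C(time)", data=df, groups=df["replicate"])
result = model.fit()
print(result.summary())
```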
The network inference was based on dynamic Bayesian network (DBN) principles and consisted of three major steps: (1) identifying potential regulator–target combinations, (2) scoring these regulations according to Bayesian principles, and (3) selecting high-scoring regulations and determining their sign. To infer the regulatory interactions, we leveraged the changes over time for each treatment, allowing us to probabilistically calculate the Bayesian score, as follows.
The median was calculated for each condition and subtracted from the intensity values to obtain median-centered, log2-transformed data. Based on the time series phosphoproteome data, changes in phosphorylation intensity were identified. The delta phosphorylation events of a protein (p) are calculated as follows:
Specifically, an increase or decrease in phosphorylation intensity between two consecutive time points was called when the intensity increased or decreased by at least 25%, respectively. The bottom 10% of all fold changes were considered unchanged. Absent phosphorylation intensities were set to 0. Next, potential regulator (r) – target (p) combinations were identified as follows:
$$\left(\frac{\sum_{t=1}^{n}\left[r_{\Delta}(t) = p_{\Delta}(t)\right]}{n}\right) \;\text{or}\; \left(\frac{\sum_{t=1}^{n}\left[r_{\Delta}(t-1) = p_{\Delta}(t)\right]}{n-1}\right) > 0.5$$
$$r_{\Delta}(t) = r(t) - r(t-1)$$
$$p_{\Delta}(t) = p(t) - p(t-1)$$
where \(n\) is the number of time points. Only identified kinases and phosphatases from the HMMER and PF-NET predictions (analyses were redone using the latest soybean genome annotation, Williams 82 Genome Sequencing Project, Assembly 4 Annotation 1) were considered as potential regulators. A protein is a potential regulator of a target protein if and only if it exhibits a change in phosphorylation intensity at the same time as (time lapse 0) or immediately prior to (time lapse 1) a change in phosphorylation intensity of the target for at least 50% of the time points 20. The NetPhorce R package contains a default list of kinases and phosphatases for 26 species. For rice, soybean, wheat, sorghum, Arabidopsis, and maize, the kinases and phosphatases from the union of the PF-NET, HMMER, and/or ortholog results were included.
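The regulator–target identification can be sketched as follows; this is a simplified illustration that folds the 25% change rule into the discretization and omits the bottom-10% fold-change rule, so the NetPhorce implementation may differ in details.

```python
import numpy as np

def delta_series(intensities):
    """Discretize changes between consecutive time points as +1 (increase >= 25%),
    -1 (decrease >= 25%) or 0 (unchanged or absent)."""
    x = np.asarray(intensities, dtype=float)
    deltas = np.zeros(len(x) - 1, dtype=int)
    for t in range(1, len(x)):
        if x[t] == 0 or x[t - 1] == 0:
            deltas[t - 1] = 0                 # absent intensities are treated as no change
        elif x[t] >= 1.25 * x[t - 1]:
            deltas[t - 1] = 1
        elif x[t] <= 0.75 * x[t - 1]:
            deltas[t - 1] = -1
    return deltas

def is_potential_regulator(reg, tgt):
    """True if regulator and target change in concert at the same time points (time lapse 0)
    or with the regulator leading by one time point (time lapse 1) for > 50% of time points."""
    r, p = delta_series(reg), delta_series(tgt)
    n = len(p)
    lapse0 = np.mean(r == p)
    lapse1 = np.mean(r[:-1] == p[1:]) if n > 1 else 0.0
    return lapse0 > 0.5 or lapse1 > 0.5

print(is_potential_regulator([1, 1.5, 2.2, 2.0], [1, 1.4, 2.0, 1.4]))  # True
```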
Next, the data were discretized into 3 levels: (1) a level below the experiment-determined median, (2) a level above the experiment-determined median, and (3) a level for absent phosphorylation intensities. Inferring a DBN consists of finding the network topology that maximizes a score, i.e., finding the most likely parents of each node. Each potential regulator–target or regulator 1 & 2–target combination was scored with the Bayesian Dirichlet equivalent uniform (BDeu) score 20. The BDeu score of a DBN can be decomposed as the sum of the scores of the log conditional probabilities of each node 20:
$$BDeu\left(D,G\right)= \sum_{i=1}^{n}\sum_{j=1}^{q_{i}}\left(\log\left(\frac{\Gamma\left(\frac{\alpha}{q_{i}}\right)}{\Gamma\left(\sum_{k=1}^{r_{i}}N_{ijk}+\frac{\alpha}{q_{i}}\right)}\right)+\sum_{k=1}^{r_{i}}\log\left(\frac{\Gamma\left(N_{ijk}+\frac{\alpha}{r_{i}q_{i}}\right)}{\Gamma\left(\frac{\alpha}{r_{i}q_{i}}\right)}\right)\right)$$
where \(G\) refers to the Bayesian graph, \(D\) refers to the dataset containing the time point observations, and \(N_{ijk}\) indicates the number of data vectors in which target \(i\) has the value \(k\) while its parents are in configuration \(j\). \(\alpha\), a hyperparameter of the Dirichlet distribution, equals 1E−15. The regulators of target \(i\) are those that lead to the highest value of the BDeu score.
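For illustration, the per-node BDeu contribution can be computed directly from the count table \(N_{ijk}\), for example with scipy's log-gamma function; this is a direct transcription of the formula above, not the NetPhorce code.

```python
import numpy as np
from scipy.special import gammaln

def bdeu_node_score(counts, alpha=1e-15):
    """Log BDeu contribution of a single target node.
    counts: array of shape (q_i, r_i) with N_ijk, i.e. the number of observations in which the
    target takes value k while its candidate parents are in configuration j."""
    counts = np.asarray(counts, dtype=float)
    q_i, r_i = counts.shape
    score = 0.0
    for j in range(q_i):
        n_ij = counts[j].sum()
        score += gammaln(alpha / q_i) - gammaln(n_ij + alpha / q_i)
        for k in range(r_i):
            score += gammaln(counts[j, k] + alpha / (r_i * q_i)) - gammaln(alpha / (r_i * q_i))
    return score

# Example: a target with 3 discretization levels and one candidate regulator with 3 levels
# (q_i = 3 parent configurations, r_i = 3 target values).
example_counts = np.array([[4, 1, 0],
                           [1, 5, 1],
                           [0, 1, 4]])
print(bdeu_node_score(example_counts))
```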
For each inferred edge, a score is calculated to determine whether the inferred interaction is a phosphorylation or a dephosphorylation. For time lapse 0, the change over time of the target and the regulator was compared at the same time points, while for time lapse 1, the change over time of the target was compared with that of the regulator at the immediately preceding time point. If the changes over time are in the same direction (for example, both increase in phosphorylation) for most of the time points, the interaction is considered a phosphorylation. If the intensities of the regulator and target change in opposite directions (for example, one increases in phosphorylation while the other decreases), the interaction is considered a dephosphorylation. If the numbers of same-direction and opposite-direction changes are equal, the sign of the interaction is denoted as undetermined.
All networks were visualized in Cytoscape® 3.8.0 63. To evaluate how the control and cold signaling networks were rewired, we used DyNet, a Cytoscape application that, among other features, allows analysis of the most ‘rewired’ nodes across networks via the node attribute “DyNet REWIRING” 64.