ProteinUnet2 for Fast Protein Secondary Structure Prediction: A Step Towards Proper Evaluation

Background: The importance of protein secondary structure (SS) prediction is widely known; solving it enables learning about the role of a protein in organisms. As the experimental methods are expensive and sometimes impossible, many SS predictors, mainly based on different machine learning methods, have been proposed over the years. SS prediction, as an imbalanced classification problem, should not be judged by the commonly used Q3/Q8 metrics. Moreover, as the benchmark datasets are not random samples, classical statistical null hypothesis testing based on the Neyman-Pearson approach is not appropriate. Also, the state-of-the-art predictors usually have relatively long prediction times. Results: We present a new deep network, ProteinUnet2, for SS prediction, based on the U-Net convolutional architecture. We also propose a new statistical methodology for prediction performance assessment based on the significance from Fisher-Pitman permutation tests accompanied by practical significance measured by Cohen's effect size. Through an extensive evaluation study, we report the performance of ProteinUnet2 in comparison with two state-of-the-art methods, SAINT and SPOT-1D, on the benchmark datasets TEST2016, TEST2018, and CASP12. Conclusions: Our results suggest that ProteinUnet2 has much shorter prediction times while matching (or outperforming) the prediction quality of the mentioned predictors. We strongly believe that our proposed statistical methodology will be adopted, used, and even expanded by the research community.


Background
The function of a protein is correlated with its tertiary structure, also known as the native structure: a unique, stable, and kinetically accessible three-dimensional structure [1]. The first tertiary structure was determined for myoglobin by John Kendrew and his associates in 1957 [2]. For his studies on the structure of globular proteins, Kendrew received the Nobel Prize in Chemistry in 1962.
More than 60 years later, there are 177 426 protein structures deposited in the Protein Data Bank [3] as of May 9th, 2021. For comparison, the UniProtKB/Swiss-Prot database, which contains manually annotated and reviewed protein sequences (primary structures), has 564 638 sequences deposited, and UniProtKB/TrEMBL, which contains automatically annotated and unreviewed sequences, has 214 406 399 sequences deposited as of May 9th, 2021 (The UniProt Consortium, 2021). The cost of determining a sequence is significantly lower than the cost of determining the structure [4]. Hence, researchers try to create statistical or machine learning models that predict the structures of proteins.
For secondary structure prediction, three generations of methods and algorithms are described in the literature [5]. The first generation, represented by the Chou-Fasman method, leveraged statistical propensities of amino acid residues towards a specific secondary structure class [6]. The prediction accuracy of such methods was usually less than 60%.
The second generation of methods started in the 1980s and leveraged sophisticated statistical methods and machine learning techniques, as well as information about the neighboring residues, usually using a sliding window approach [5]. It was represented by methods like GOR [7] or Lim [8], but the accuracy was still less than 65% [9].
The third generation of methods is characterized especially by deep neural networks and additional features based on multiple sequence alignment profiles such as position-specific scoring matrices (PSSM) [10] or HHblits [11]. The accuracy of those methods reached 80% for models such as PSIPRED [12]. Given the growing number of known protein sequences and more effective neural network architectures, recent methods are able to predict the secondary structure with more than 85% accuracy, like SPOT-1D [13], based on long short-term memory (LSTM) bi-directional recurrent neural networks (BRNN), or SAINT [14], based on convolutions with a self-attention mechanism.
In this study, we present ProteinUnet2, a significantly extended and improved version of ProteinUnet, our previous deep neural network architecture for SS3 and SS8 prediction from a single sequence [15]. It is now possible to feed any number of features to the input of the network (like PSSM or HHblits). We performed an analysis of the significance of the input features, resulting in the selection of their best combination. The architecture has been improved with the addition of attention and dropout layers and training with a variable learning rate. This new architecture allowed us to significantly decrease the prediction times compared to the best SS predictors, SAINT and SPOT-1D, while maintaining similar or better performance on the benchmark datasets TEST2016, TEST2018, and CASP12.
For the first time (to our knowledge), we raise the problem of the incorrect methodology used for prediction efficiency assessment in previously published works. SS prediction is a heavily imbalanced classification problem and should not be judged using the commonly used Q3/Q8 metrics. Instead, we propose to use the Adjusted Geometric Mean (AGM) metric [16], which has been proven to be more appropriate for imbalanced bioinformatics classification problems [17]. Moreover, as the benchmark datasets are not random samples, classical null hypothesis significance testing using the Neyman-Pearson inference approach should not be used. We propose a new assessment methodology based on the Fisher-Pitman model of inference: statistical significance from permutation tests. We also suggest supplementing such statistical significance with the practical significance measured by Cohen's effect size. Using the proposed statistical methodology, we compared ProteinUnet2 with the state-of-the-art predictors SAINT and SPOT-1D.
Thus, we have made the following significant contributions: (i) we proposed a new U-Net-based deep architecture that enabled us to decrease the prediction times while maintaining similar or better performance than other state-of-the-art methods, and (ii) we introduced a new statistical methodology for prediction performance assessment, more appropriate for the highly imbalanced SS8 prediction problem.

Results And Discussion
Like the authors of SAINT, we focused only on SS8 prediction analysis as it contains more useful information, does not depend on the SS3 mapping method, and is much more challenging to solve.

Predictors comparison
We compare ProteinUnet2 against the most recent and accurate SS8 predictors, SPOT-1D and SAINT. These state-of-the-art methods have been shown to outperform other popular predictors like MUFOLD-SS [18] or NetSurfP-2.0 [19]. For the reasons stated in the Methods section, in the comparison of performance, we focus mainly on the Adjusted Geometric Mean (AGM) metric for each structure (Table 2) as well as the macro-averaged AGM (Table 3) to assess the overall performance. The results for the F1 score (Table S4) are also presented. Figure 1 presents the boxplots of macro-averaged F1 and AGM as well as Q8 metrics at the sequence level on the TEST2016, TEST2018, and CASP12 datasets for 3 predictors: ProteinUnet2, SPOT-1D, and SAINT. These boxplots reveal small differences between the predictors' medians and means (denoted by red triangles) for all presented metrics. Also, very high variability in all distributions is clearly visible. Moreover, the distributions of metrics on the TEST2016 and TEST2018 datasets contain many outliers. To quantitatively compare the observed slight differences, we used the statistical methodology proposed in the Methods section. Table 2 and Table 3 report the performances obtained for the AGM metric for the 8 structures and the macro-average, respectively. Table 4 presents the obtained p-values together with Cohen's effect sizes for two separate comparisons between classifiers: ProteinUnet2 vs. SAINT and ProteinUnet2 vs. SPOT-1D.
The obtained macro-averaged AGM results in Table 3 and Table 4 show that ProteinUnet2 has a statistically significantly higher mean than SAINT and SPOT-1D on the TEST2016 dataset (p < 0.01). The accompanying Cohen's effect sizes are 0.1399 (very small) and 0.290 (small), respectively. ProteinUnet2 also has a statistically significantly better macro-averaged AGM than SPOT-1D on the TEST2018 dataset (p < 0.01, very small effect of 0.144), while on the CASP12 dataset we observe a similarly very small effect (0.186) but no significance (p = 0.102), probably because of the small sample size. Regarding the differences in performance on single classes (Table 2 and Table 4), ProteinUnet2 is significantly better (p < 0.01) than SAINT and SPOT-1D on the rare class B on the TEST2016 and TEST2018 datasets (small effect sizes of 0.316, 0.327 and 0.280, 0.324, respectively). Very small effect sizes are observed on this class for both classifiers on the CASP12 dataset, but with no statistical significance (small sample size). ProteinUnet2 is also significantly better than both compared classifiers on the rare class G on the TEST2016 dataset. It is worth emphasizing that, despite the lack of significance, ProteinUnet2 obtains small effect sizes (0.259 and 0.199) on class E on the small CASP12 dataset when compared with the other classifiers.
In summary, when the appropriate AGM metric is used to assess classifiers' performance on the imbalanced SS8 prediction problem, ProteinUnet2 is significantly better in overall performance (macro-averaged AGM) than SPOT-1D and SAINT on the TEST2016 dataset, but with small or very small effect sizes. It is also significantly better than SPOT-1D on the TEST2018 dataset and achieves comparable results with SAINT on this dataset. The comparison of ProteinUnet2 on the relatively small CASP12 dataset leads to the conclusion that there is no significant difference between our predictor and either SPOT-1D or SAINT.
For the reasons stated in the Methods section, we do not discuss and compare classifiers using the F1 score or Q8. However, for easier comparison with the previous literature, we report the obtained performances in Table 1 and Table 3.

Table 1: The comparison of F1 score for each SS8 class separately at the residue level on all test sets for ProteinUnet2 vs. SPOT-1D (circle symbol) and SAINT (square symbol).

Figure 2 presents the macro-averaged AGM as a function of the number of effective homologous sequences (Neff) calculated by HHblits. This figure shows that the metrics increase with increasing Neff. AGM for all networks is much lower for sequences with fewer than 4 homologs (Neff < 4). Interestingly, ProteinUnet2 shows the highest scores for these sequences. The advantage of ProteinUnet2 over SPOT-1D is statistically significant for Neff = 1 and Neff in the range 7 to 12, and over SAINT for Neff of 6 and 8.

Analysis of incorrect predictions
We noticed that for particular sequences from TEST2016 (5doiE, 5dokA, 5d6hB), the performance of all networks is very poor (AGM < 0.3). It turned out that they are missing some amino acids in the original PDB files (5doiE: 4 gaps with 35 out of 128 AA missing; 5dokA: 1 gap with 34 out of 204 AA missing; 5d6hB: 8 gaps with 54 out of 152 AA missing). The gaps for the 5d6hB chain are presented in Figure 3, generated using the PDBsum web server [20], with a 3D visualization by the VMD software [21] in Figure 4. Even a single missing amino acid may change the secondary structure [22]. This may explain the very low performance for the mentioned proteins. Thus, the problem lies in the dataset itself.
Running time

Conclusions
ProteinUnet2 significantly extends and improves our previous ProteinUnet deep architecture [15]. It introduces multiple inputs with evolutionary profiles like PSSM, HHblits, and contact maps. The performance is increased by an additional attention mechanism and dropouts. ProteinUnet2 achieves results comparable to or better than the state-of-the-art models: SPOT-1D, based on an LSTM-BRNN architecture, and SAINT, based on self-attention modules, while running several times faster than SPOT-1D and 10% faster than SAINT. That makes it especially useful in large-scale predictions and applications on low-cost and embedded devices.
The proposed methodology for assessing the performance of secondary structure predictors, based on a measure appropriate for imbalanced classification, permutation tests, and an analysis of the significance of performance differences based on effect sizes, may and should be developed further, for example with other measures of effect size or with interpretations appropriate to the application domain.

Datasets
For a fair comparison, we use the same training, validation, and test sets as SPOT-1D and SAINT. The training set TR10029 contains 10 029 proteins, and the validation set VAL983 has 983 proteins. We benchmark our model on 3 test sets: TEST2016 with 1213 proteins, TEST2018 with 250 proteins, and CASP12 with 49 proteins (see [13] and [14] for the details about these datasets).
Metric for the imbalanced secondary structure classification problem
Some protein secondary structures, e.g., alpha-helices, are much more frequent than others (Figure 5). This leads to the class imbalance problem [23], which is rarely mentioned or addressed in the literature on SS prediction. Assessing the performance of SS classifiers plays a vital role in their construction process. The most commonly used metrics of SS prediction performance are the overall accuracies Q3 and Q8 [5,9,24], which are not appropriate for imbalanced problems [25,26]. Using them may lead to the accuracy paradox, where high accuracy is not necessarily an indicator of good classification performance [26]; e.g., a classifier that always predicts class H will have ten times better accuracy than a classifier that always predicts class G (see Figure 5).
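The accuracy paradox above can be made concrete with a toy example. The class counts below are illustrative assumptions only (chosen so that H is ten times more frequent than G), not the exact TEST2016 frequencies:

```python
# Toy illustration of the accuracy paradox on an imbalanced SS8-like label set.
# The counts are hypothetical: H is 10x more frequent than G by construction.
labels = (["H"] * 350 + ["E"] * 220 + ["C"] * 195 + ["T"] * 110 +
          ["S"] * 80 + ["G"] * 35 + ["B"] * 7 + ["I"] * 3)

def q8(predicted_class, labels):
    """Q8 accuracy of a degenerate classifier that always predicts one class."""
    return sum(1 for y in labels if y == predicted_class) / len(labels)

print(q8("H", labels))  # 0.35  -- looks decent, yet the classifier is useless
print(q8("G", labels))  # 0.035 -- ten times lower, equally useless
```

Both classifiers are equally uninformative, yet Q8 rates one of them ten times higher, which is exactly why a per-class, imbalance-aware metric is needed.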
The existing popular measures proposed for imbalanced learning, like the geometric mean or the F-score, can still result in suboptimal models [17]. For these reasons, we used the Adjusted Geometric Mean (AGM), well-suited for imbalanced bioinformatics problems [16]. It has been shown both analytically and empirically to perform better than the F-score, and it has no parameters (like beta in the F-score). It is given by Eq. 1, where GM is the geometric mean, SP is specificity, Nn is the proportion of negative samples, and SE is sensitivity:

AGM = (GM + SP * Nn) / (1 + Nn) if SE > 0, and AGM = 0 if SE = 0. (Eq. 1)
The purpose of AGM is to increase the sensitivity while keeping the reduction of specificity to a minimum. Also, the higher the degree of imbalance, the stronger the reaction to changes in specificity. It returns values between 0 (the worst prediction) and 1 (a perfect prediction). We calculate AGM for each structure separately. To assess the overall quality, we use macro-averaged F1 and AGM scores. That is, we take the unweighted average of the per-structure scores. This way, we do not favor the more frequent classes.
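A minimal sketch of the per-class AGM and its macro-average, assuming the definition from Batuwita and Palade [16] referenced above (the function names are ours):

```python
import math

def agm(sensitivity, specificity, nn):
    """Adjusted Geometric Mean for one class (per Batuwita & Palade [16]).
    nn is the proportion of negative samples for that class; AGM is defined
    as 0 whenever sensitivity is 0."""
    if sensitivity == 0:
        return 0.0
    gm = math.sqrt(sensitivity * specificity)  # geometric mean of SE and SP
    return (gm + specificity * nn) / (1 + nn)

def macro_agm(per_class_scores):
    """Macro-average: unweighted mean over classes, so rare classes count equally."""
    return sum(per_class_scores) / len(per_class_scores)
```

Because the negative-class proportion `nn` weights the specificity term, the metric reacts more strongly to specificity changes as the imbalance grows, matching the behavior described above.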
Significance testing and effect size
Null hypothesis significance testing (NHST) is a commonly used statistical method for comparing classifier performances [26,27], although authors mention its caveats. In the case where the test datasets are not random (like the benchmark datasets used in the evaluation of SS prediction), using classical NHST is problematic [26]. Random permutation tests based on the Fisher-Pitman model of inference [28] are an alternative that we strongly recommend in that case. In our experiments, we used a one-sided paired-sample permutation test for the difference in mean classifier performances (the perm.paired.loc function from the wPerm R package).
The tests are performed at the sequence level. Tests for separate structures are performed only on the subsets of sequences for which it was possible to calculate a given metric (e.g., if the structure is present in the ground truth or prediction).
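The one-sided paired permutation test described above can be sketched as follows. This is our NumPy re-implementation of the idea behind wPerm's perm.paired.loc (Monte Carlo sign-flipping of paired differences), not the authors' code:

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """One-sided paired-sample permutation test for the difference in mean
    performance (H1: mean(a) > mean(b)). Under the Fisher-Pitman model, the
    sign of each paired difference is exchangeable under H0, so we randomly
    flip signs and count how often the permuted mean reaches the observed one."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    perm_means = (signs * diffs).mean(axis=1)
    # add-one correction so the p-value is never exactly 0
    return (np.sum(perm_means >= observed) + 1) / (n_perm + 1)
```

Comparing two per-sequence score vectors at, say, p < 0.01 then reduces to checking `paired_permutation_test(a, b) < 0.01`.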
Here (to our knowledge, for the first time), we propose a new methodology to compare the significance of classifier performance differences. Significance testing, as well as permutation tests alone, does not resolve the problem of inferential interpretation.
Statistical significance shows only that an effect exists; practical significance, the effect size, shows that the effect is large enough to be meaningful in the real world. Statistical significance alone can be misleading because it is influenced by the sample size.
Increasing the sample size always makes it more likely to find a statistically significant effect, no matter how small the effect is in the real world. Effect sizes are independent of the sample size and are an essential component when evaluating the strength of a statistical claim. Some authors [29] proposed to use confidence intervals for the estimation of effect size, but they require a random sample to enable inference. Cohen's effect size d [30], which we propose to use in our study for paired samples, can be calculated by dividing the mean difference by the standard deviation of the differences. Whether an effect size should be interpreted as negligible (d < 0.01), very small (d < 0.2), small (d < 0.5), medium (d < 0.8), or large (d < 1.2) depends on the context (application) and its operational definition [31]. Thus, we propose to report statistical significance (denoted by p-values) together with practical significance represented by effect sizes (here, Cohen's effect size d for paired samples).
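The paired-samples Cohen's d described above (mean of the differences divided by their standard deviation) takes only a few lines; the function name is ours:

```python
import numpy as np

def cohens_d_paired(scores_a, scores_b):
    """Cohen's effect size d for paired samples: mean of the paired
    differences divided by the sample standard deviation of the differences."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    return diffs.mean() / diffs.std(ddof=1)
```

Note that d is scale-free and does not shrink as the number of sequences grows, which is exactly why it complements the permutation-test p-value.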
ProteinUnet2 architecture
U-Net architectures have proven to be extremely effective in image segmentation tasks [32,33]. The U-shaped architecture of ProteinUnet2 is based on the idea from our previous ProteinUnet for secondary structure prediction [15] (for which the results are presented in Supplementary Table S1). The new architecture was adjusted to handle multiple inputs by using multiple contractive paths, one for each input (Figure 6). After each down-block, the features of all inputs are concatenated together and passed to the up-block via a skip connection. There are two output layers with softmax activations connected to the last up-block, separately for SS3 and SS8. In ProteinUnet2, we limited the maximum supported sequence length from 1024 to 704 to further improve training and inference times without losing accuracy. Notably, SPOT-1D and SAINT were not trained with proteins longer than 700, and there are no proteins longer than 704 in our datasets. The input features and the number of filters were selected experimentally, as described in the next section.
To mitigate the problem of the increased number of inputs and parameters of the network, in the final ProteinUnet2 architecture (Figure 6), we modified the architecture to be similar to the Attention U-Net [34]. That is, we decreased the number of convolutions in each down-block from 3 to 2, added dropouts with a 0.1 rate between convolutions in all blocks, and applied attention gates right before the concatenation operation.

The SS8 classes (G, H, I, B, E, S, T, and C, where C denotes coil) are converted into the 3-class problem by grouping the states: G, H, and I into H; B and E into E; and S, T, and C into C.
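Returning to the attention gates mentioned above: a minimal NumPy sketch of an additive attention gate in the spirit of Attention U-Net [34]. The weight shapes and variable names are illustrative assumptions for a 1D sequence, not the actual ProteinUnet2 implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, wx, wg, psi):
    """Additive attention gate (Attention U-Net style) for a 1D sequence.
    x: skip-connection features (L, Cx); g: gating features from the deeper
    layer (L, Cg). Returns x scaled position-wise by coefficients in (0, 1),
    so uninformative positions are suppressed before concatenation."""
    q = np.maximum(0.0, x @ wx + g @ wg)  # ReLU(Wx x + Wg g), shape (L, Ci)
    alpha = sigmoid(q @ psi)              # attention coefficients, shape (L, 1)
    return x * alpha                      # gated skip connection

# toy shapes: L=8 positions, Cx=4 / Cg=6 channels, Ci=3 intermediate channels
L, cx, cg, ci = 8, 4, 6, 3
x, g = rng.normal(size=(L, cx)), rng.normal(size=(L, cg))
wx = rng.normal(size=(cx, ci))
wg = rng.normal(size=(cg, ci))
psi = rng.normal(size=(ci, 1))
gated = attention_gate(x, g, wx, wg, psi)
assert gated.shape == x.shape
```

In the real network the gated features (rather than the raw skip connection) are what gets concatenated with the up-sampled features.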
Similar to SPOT-1D, our final model contains 20 features from PSSM [10] and 30 features from HHM profiles [11]. The features were standardized to ensure a zero mean and an SD of 1 in the training data. Additionally, we use contact maps generated by SPOT-Contact [36].
We use the same windowing scheme as described in SPOT-1D, but we do not standardize the contact maps as they are already in the acceptable range [0, 1]. The window size of 50 was selected experimentally based on the results from Supplementary Table S1, which shows F1 scores and accuracies on the largest TEST2016 set for a single ProteinUnet trained with different input features on TR10029 and validated on VAL983. Supplementary Table S1 suggests that SPOT-Contact features gave better SS8 prediction results than any other input alone. The worst results are reported for the 7 physicochemical properties [37]. Thus, we did not investigate them further in ProteinUnet2.
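The standardization step described above, with statistics fitted on the training split only and then applied to every split, can be sketched as follows (function names are ours):

```python
import numpy as np

def fit_standardizer(train_features):
    """Compute per-feature mean and SD on the training data only."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return mean, std

def standardize(features, mean, std):
    """Apply the training statistics to any split (train/validation/test)."""
    return (features - mean) / std
```

Fitting the mean and SD on the training data alone avoids leaking test-set statistics into the model; contact-map features, already bounded in [0, 1], would simply bypass this step.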

Training procedures and ensembling
For the initial experiments presented in Supplementary Table S1 and Supplementary Table S2, the single models were trained on the TR10029 dataset and validated on VAL983. However, for the final model, the TR10029 and VAL983 datasets were combined and then divided into 10 stratified folds to ensure a similar ratio of each SS8 structure in each fold. There were nine factors of stratification: the sequence length (shorter/longer than the mean sequence length) and one factor for the occurrence of each of the 8 structures (fewer/more occurrences than the mean number of occurrences per chain). We trained 10 models, each time using a different fold as the validation set and the rest as the training set. The models were trained to optimize the categorical cross-entropy loss using the Adam optimizer [38] with a batch size of 8 and an initial learning rate of 0.001. The learning rate was reduced by a factor of 0.1 when there was no improvement in the validation loss for 4 epochs. The training ran until the validation loss had not improved for 7 epochs.
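The learning-rate schedule and stopping rule described above can be expressed as simple control logic. This is an illustrative re-implementation of the described behavior (the function and its arguments are our names, not the authors' training code):

```python
def train_with_schedule(val_losses, lr0=0.001, reduce_patience=4,
                        stop_patience=7, factor=0.1):
    """Simulate the described training control: reduce the learning rate by
    `factor` after `reduce_patience` epochs without validation-loss
    improvement, and stop after `stop_patience` epochs without improvement.
    `val_losses` stands in for the per-epoch validation losses."""
    lr, best = lr0, float("inf")
    since_improve = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_improve = loss, 0
        else:
            since_improve += 1
            if since_improve == stop_patience:
                return epoch, lr, best  # early stopping triggers
            if since_improve % reduce_patience == 0:
                lr *= factor            # reduce LR on plateau
    return len(val_losses) - 1, lr, best
```

With the paper's settings, a plateau first cuts the learning rate at epoch 4 of no improvement, and training halts at epoch 7 of no improvement; in a Keras setup the equivalent would be the ReduceLROnPlateau and EarlyStopping callbacks.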
Finally, the ensemble was created from the models with the lowest validation loss for each fold by taking the average of their softmax outputs, forming the final ProteinUnet2 prediction.
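The ensembling step itself is a plain average over the per-fold softmax outputs; a sketch (function name ours):

```python
import numpy as np

def ensemble_predict(softmax_outputs):
    """Average per-model softmax outputs and pick the most likely class.
    softmax_outputs: array of shape (n_models, L, n_classes).
    Returns the averaged probabilities and the argmax class per residue."""
    probs = np.mean(softmax_outputs, axis=0)
    return probs, probs.argmax(axis=-1)
```

Averaging probabilities (rather than majority-voting hard labels) keeps the ensemble output differentiable-looking and lets confident models outweigh uncertain ones.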
Figure 2
Macro-averaged AGM for predicted SS8 as a function of the number of effective homologous sequences for the TEST2016 set. A circle/square over a point means that ProteinUnet2 has a statistically significantly larger mean than SPOT-1D/SAINT for that Neff value (one-sided paired permutation test at p < 0.01).

Figure 3
The primary and secondary structure of chain B of the 5d6h protein from the PDB.

Figure 5
Frequencies of secondary structures. The frequencies of the 8 secondary structures in the TEST2016 dataset.

Figure 6
The schematic architecture of ProteinUnet2. AA is a one-hot encoded sequence of amino acids. FC and FE are the numbers of features in contractive and expanding paths.

Supplementary Files
This is a list of supplementary files associated with this preprint.