Organization of the dataset
The QSbio database contains carefully curated information about the multimeric state of proteins whose structure has been determined (16). The starting point of our study is a dataset extracted from QSbio, including only homomeric proteins of the highest confidence level, and only a single annotation per Protein Data Bank (PDB (26)) entry (a total of 31,994 unique protein sequences; see Methods for full details). In this redundant dataset, each unique protein sequence is included as a separate entry, and very similar sequences most often, but not always, share the same qs annotation. We separated this dataset into a training set and a validation (hold-out) set, such that sequences sharing over 30% sequence identity were always placed in the same set. We then used the structural domain-based ECOD database (8) to cluster the sequences into domain families (at the family structural similarity level, “f_id”) for further investigation. Overall, this dataset covers 19 different qs, ranging from monomers to 60-mers (Fig. 2A), and many ECOD families contain representatives of various qs (Figs. 2B and 2C; see also Supplementary Figure S1 and Supplementary Tables I & II for detailed information about the dataset and results).
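For illustration, such a homology-aware split can be sketched as follows. This is not the authors' exact procedure: it assumes cluster assignments at the 30% identity level have already been produced by an external tool (e.g., MMseqs2 or CD-HIT), and the function and variable names are hypothetical.

```python
# Sketch of a homology-aware train/hold-out split: whole 30%-identity clusters
# are assigned to one side, so related sequences never straddle the two sets.
import random
from collections import defaultdict

def split_by_cluster(cluster_of: dict, holdout_fraction: float = 0.2, seed: int = 0):
    # cluster_of maps sequence id -> cluster id (from an external clustering tool)
    clusters = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(seq_id)
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_holdout = int(len(cluster_ids) * holdout_fraction)
    holdout = {s for c in cluster_ids[:n_holdout] for s in clusters[c]}
    train = {s for c in cluster_ids[n_holdout:] for s in clusters[c]}
    return train, holdout
```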
Distinction of different qs by the language model
First, we generated embeddings for each entry in the database. Each protein is represented by a single embedding vector, obtained by averaging the per-residue vectors of the protein sequence (see Methods for details). To assess the capacity of pLMs to capture qs, we used supervised dimensionality reduction to visually demonstrate that the data indeed clusters by qs (Fig. 3). The largest label groups (namely monomers, dimers, trimers, tetramers and hexamers), as well as some other qs, are well separated on this map, demonstrating the model’s ability to characterize distinct features of each group. This suggests that the model could be used to predict the qs.
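As a concrete illustration of this averaging step, a minimal sketch using the fair-esm package is shown below. The checkpoint (the 650M-parameter esm2_t33_650M_UR50D) and the helper name `embed` are our illustrative choices; see Methods for the model actually used.

```python
# Sketch: mean per-residue embedding for one sequence with fair-esm.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def embed(sequence: str) -> torch.Tensor:
    _, _, tokens = batch_converter([("query", sequence)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]
    # average over residue positions, skipping the BOS and EOS tokens
    return reps[0, 1 : len(sequence) + 1].mean(0)
```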
Inferring qs by annotation transfer from similar proteins: To assess the ability of QUEEN to correctly predict the qs of a protein, we used annotation transfer to assign a qs to each sequence in the test set (Fig. 4A,B and Supplementary Figure S2). In this approach, the qs is inferred from the most similar protein with an available annotation, i.e., the nearest neighbor in the embedding space (measured by cosine similarity, see Methods), similarly to previous studies (27). This can be compared to a corresponding annotation transfer based on sequence similarity. Using the embedding space has a clear advantage: a distance can be calculated between any two embedding vectors, however different the underlying proteins, since each vector is a fixed-length representation of the whole sequence. This is in contrast to the residue-based sequence representation, which necessitates a sequence alignment, a step that can be challenging when comparing distant sequences.
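A minimal sketch of this nearest-neighbor transfer, assuming precomputed embedding matrices and qs labels (names are illustrative):

```python
# Sketch of annotation transfer in embedding space: each test protein inherits
# the qs of its nearest training neighbor under cosine distance.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def transfer_qs(train_emb, train_qs, test_emb):
    nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(train_emb)
    dist, idx = nn.kneighbors(test_emb)
    cosine_sim = 1.0 - dist[:, 0]  # usable later as a confidence score
    return np.asarray(train_qs)[idx[:, 0]], cosine_sim
```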
Annotation transfer using embedding distance outperformed the corresponding annotation transfer based on sequence identity in predicting qs (Table I). This applies to prediction with available prior knowledge (i.e., on a redundant set that contains sequence-similar proteins, Figure S2). Importantly, it also holds when prior knowledge is not available (i.e., for the test set, which does not contain any entry with significant sequence identity to the training set): the balanced accuracy increases from 0.15 to 0.23 (Table I; compare Figs. 4A and 4B).
Table I: Performance of different models for the prediction of quaternary states (qs)
| Setting | Metric | Annotation transfer: sequence | Annotation transfer: pLM | pLM MLP: ESM-2 embeddings (QUEEN) | pLM MLP: Protbert embeddings |
|---|---|---|---|---|---|
| No information about sequence homologs available ¹ | BA ² | 0.15 | 0.23 | 0.36 | 0.19 |
|  | F1 ³ | 0.43 | 0.54 | 0.52 | 0.41 |
| Full homology information available | BA | 0.6 | 0.67 | – | – |
|  | F1 | 0.79 | 0.85 | – | – |

¹ No sequence with > 30% sequence identity available to transfer from. ² BA: balanced accuracy. ³ F1: F1 score. See Methods for definitions.
When the qs of a new sequence is predicted, that sequence is often homologous to previously annotated sequences, which improves prediction (compare Figure S2 and Fig. 4, and see Table I). In such a setting (i.e., when qs information of homologs is included), we observe a significant separation between the cosine similarities underlying correct and incorrect transfers (Fig. 5A; with the exception of qs = 7; individual p-values are summarized in Supplementary Table III). Thus, in these cases the cosine similarity can be used to assess whether simple annotation transfer may suffice to determine the qs. This is, however, not applicable to qs assignment without information from homologous proteins, as apparent in Fig. 5B.
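For illustration only, such a separation can be tested as sketched below; we assume a two-sided Mann-Whitney U test here, although the test actually underlying Supplementary Table III is described in Methods.

```python
# Sketch: do cosine similarities of correct transfers differ from incorrect ones?
import numpy as np
from scipy.stats import mannwhitneyu

def similarity_separation(sims: np.ndarray, correct: np.ndarray) -> float:
    """Return the p-value comparing similarities of correct vs. incorrect transfers."""
    correct = correct.astype(bool)
    _, p = mannwhitneyu(sims[correct], sims[~correct], alternative="two-sided")
    return p
```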
Training a model for qs prediction based on embeddings
The above initial analysis suggests that the embeddings contain information about qs. To leverage it optimally, we trained an embedding-based MLP model to predict qs. We trained and tuned hyperparameters using 5-fold cross-validation within our training set (each time training on 80% of the data to predict a different 20%, see Methods) to define the final model and its parameters (Fig. 4C; importantly, the hold-out set was not used at any step of the training and validation of QUEEN). To improve performance we experimented with several strategies, which highlighted the importance of down-sampling monomers and dimers to obtain a more balanced dataset for training (resulting in similarly sized monomer, dimer and tetramer classes; see Figs. 2A,C). Even after down-sampling, the resulting confusion matrix (Fig. 4C) still highlights the relatively high success rate of monomer qs predictions. As for multimers, many of the wrong predictions are dimers or tetramers, more so than monomers (see the columns of dimer and tetramer predictions in Fig. 4C). This hints that QUEEN may have learned to detect multimerization rather than the exact qs, as shown in the striking example of the predominant classification of heptamers as dimers (see also Fig. 7 below).
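A minimal sketch of this training setup (not the authors' exact pipeline; the down-sampling threshold and MLP hyperparameters below are placeholders, not the tuned values):

```python
# Sketch: class balancing by down-sampling, then stratified 5-fold
# cross-validation of an MLP on the mean embeddings.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

def downsample(X, y, max_per_class, seed=0):
    """Keep at most max_per_class examples of each qs label."""
    rng = np.random.default_rng(seed)
    keep = np.concatenate([
        rng.permutation(np.flatnonzero(y == label))[:max_per_class]
        for label in np.unique(y)
    ])
    return X[keep], y[keep]

def cross_validate_mlp(embeddings, labels, max_per_class=3000):
    X, y = downsample(np.asarray(embeddings), np.asarray(labels), max_per_class)
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")
```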
Compared to nearest-neighbor annotation transfer, QUEEN shows a significant improvement (compare Fig. 4C to 4B; balanced accuracy 0.36 vs. 0.23, see Table I). While success is not uniform across all labels, correct prediction (> 40% on the diagonal of Fig. 4C) is achieved in 7 out of the 12 trained and predicted qs classes.
Confidence of prediction as an indication of the success of QUEEN
While QUEEN predicts a specific qs, it is also interesting to examine the underlying probabilities that drive the label prediction, and to assess whether they distinguish successful from wrong predictions. Reassuringly, comparison of the corresponding probability distributions (i.e., of true positives and false positives) reveals a significant difference between the two (Fig. 6), with p-values ranging from 0.04 to 6 × 10⁻⁸⁸ (Supplementary Table IV). Of note, for the pLM-based annotation transfer a confidence estimate is only available when information about the qs of homologous sequences exists (based on cosine similarity, Fig. 5). With the QUEEN MLP model, in contrast, such an estimate (the predicted probability) is now also available for qs predictions that are not based on homolog information.
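In practice, such a per-prediction confidence can be read directly off the classifier's class probabilities; a minimal sketch, assuming a fitted sklearn-style model exposing predict_proba:

```python
# Sketch: the probability of the chosen qs class serves as a confidence score.
import numpy as np

def prediction_confidence(clf, X: np.ndarray):
    probs = clf.predict_proba(X)      # shape: (n_proteins, n_qs_classes)
    predicted = clf.classes_[probs.argmax(axis=1)]
    confidence = probs.max(axis=1)    # probability of the predicted qs
    return predicted, confidence
```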
The distributions of predicted qs shown in Fig. 7 reveal a clear preference for the monomer label among proteins that indeed form monomers, as expected from the successful prediction of the qs of monomeric proteins (see Supplementary Fig. 3 for full plots). In contrast, the distribution of qs predictions for heptamers (which QUEEN did not learn well, see Fig. 4C) is much more varied. However, as also apparent in the confusion matrix (Fig. 4C), the missed heptamers are predicted to form various multimeric states rather than monomers, despite monomers being the largest class. This reinforces the notion that protein sequences hold information about multimerization, even if not always about the correct state. The 24-mers are another interesting example: besides a high probability for the 24-mer class itself, the model also predicts divisors of 24, with an enrichment of the rare 8-mer and 12-mer classes.
Examination of diverse families
Incorrect predictions shown in Fig. 7 could be a consequence of inaccurate learning for ECOD families that adopt more than one qs. We therefore compared performance on ECOD families in which all members adopt the same qs to performance on families that adopt diverse qs (families with a single member were treated separately, as a single annotated sequence cannot reveal whether proteins in such a family can adopt different qs). Overall, the variation of qs within a family (i.e., how many different qs it includes) is well captured by QUEEN (Fig. 8A). For homogeneous families QUEEN predominantly predicts a single qs, while for families with more qs, the number of different predicted qs predominantly corresponds to the actual number of different qs. In this context, performance tends to be slightly better for single-qs families (Fig. 8B,C). Moreover, within families of diverse qs, predictions for proteins that adopt the dominant qs of the family are more accurate than corresponding predictions for proteins adopting an outlier qs.
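A sketch of this family-level bookkeeping, assuming a table with hypothetical column names "f_id", "true_qs" and "predicted_qs":

```python
# Sketch: per ECOD family, compare the number of distinct true qs to the
# number of distinct qs predicted by QUEEN.
import pandas as pd

def qs_diversity_per_family(df: pd.DataFrame) -> pd.DataFrame:
    counts = df.groupby("f_id").agg(
        n_true_qs=("true_qs", "nunique"),
        n_predicted_qs=("predicted_qs", "nunique"),
        n_members=("true_qs", "size"),
    )
    # single-member families are treated separately, as in the text
    return counts[counts["n_members"] > 1]
```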
Comparison of QUEEN performance to other related approaches
How well does this approach perform compared to other related methods? Such a comparison may provide new insight into the features that determine qs. Of note, to our knowledge, the present model is the first to use sequence information only, in contrast to other methods, which all use a solved or predicted structure as input for qs determination or prediction. Not surprisingly, therefore, PISA outperforms QUEEN for every class, and EPPIC shows a very similar trend (note, however, the better performance of QUEEN for 24-mers; Fig. 9A). Nevertheless, examination of the complementarity of these approaches shows that a significant fraction of proteins is correctly predicted only by QUEEN (5–20%, depending on the qs; Figs. 9B and C). Therefore, a combination of the different approaches may be beneficial, pending our ability to identify the correct predictions.
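The complementarity estimate amounts to counting proteins predicted correctly by QUEEN but by neither structure-based method; a minimal sketch with assumed boolean arrays, one entry per protein:

```python
# Sketch: fraction of proteins that only QUEEN gets right.
import numpy as np

def only_queen_fraction(correct_queen, correct_pisa, correct_eppic) -> float:
    q = np.asarray(correct_queen, dtype=bool)
    p = np.asarray(correct_pisa, dtype=bool)
    e = np.asarray(correct_eppic, dtype=bool)
    return float((q & ~p & ~e).mean())
```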
Comparison to other language models - Protbert
The classification and qs assignment shown here are based on embeddings of the protein sequences generated with ESM-2, the language model developed by Meta AI. We examined whether the source of the embeddings has an impact on the final results, and indeed it does: repeating this protocol using embeddings from protbert_BFD resulted in worse performance, with less improvement over simple sequence-based nearest-neighbor annotation transfer (see Table I).
Robustness of QUEEN performance demonstrated on the independent hold-out set
The results presented so far were calculated on the test sets of the 5 cross-validation runs. Based on these results, we used the optimized hyperparameter set to define our final model. To evaluate its performance, we now opened the hold-out set that we had set aside at the beginning of our study. The performance of sequence- or pLM-based annotation transfer on this (smaller) hold-out set is slightly better (see Fig. 10 for confusion matrices and Table II; compare to Fig. 4 and Table I above). Reassuringly, QUEEN performs similarly to what we report for the cross-validation, with a slightly lower balanced accuracy of 0.3, and for most qs, correct predictions can be identified based on the estimated prediction probability, as observed during training (compare Supplementary Fig. 4 to Fig. 6).
Table II: Performance on the hold-out set for the prediction of quaternary states (qs)

| Metric | Annotation transfer: sequence | Annotation transfer: pLM | QUEEN |
|---|---|---|---|
| BA | 0.18 | 0.26 | 0.30 |
| F1 | 0.34 | 0.55 | 0.58 |
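For reference, these two metrics can be computed as sketched below; we assume macro-averaged F1 here for illustration, while the precise definitions used in the paper are given in Methods.

```python
# Sketch: balanced accuracy and (assumed macro-averaged) F1 for multi-class qs
# predictions, given arrays of true and predicted labels.
from sklearn.metrics import balanced_accuracy_score, f1_score

def evaluate(y_true, y_pred):
    ba = balanced_accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="macro")
    return ba, f1
```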