Here, we identified self and nonself SCSs throughout the proteome of SARS-CoV-2. This study is based on the theoretical concept that nonself SCSs may be better suited than self SCSs as epitopes for the immune system to boost both T-cell and B-cell responses and not cause autoimmune diseases in the long term. Self-nonself discrimination in vivo is achieved by the complex functions of DCs, Treg cells, and other cell types5–7,20,21 but can be attained relatively simply in silico by SCS-based computation when both host and parasite proteomes are available. From an evolutionary perspective, this concept leads to the sequence mimicry hypothesis.
We examined the SCS distribution in the human proteome (Fig. 1a–c), which suggested a scale-free distribution in the rank-frequency plot, following Zipf’s law (Fig. 1c). Since Zipf’s law is applicable to natural languages, this result justifies the application of SCS-based frequency analysis to human protein “language”, similar to linguistic frequency analyses32,33. The breakdown of linearity in the plot at the largest ranks probably reflects the fact that there are many zero-count SCSs. The zero-count SCSs in the human proteome are themselves nonself SCSs; they lie outside the human proteome vocabulary. In other words, the human proteome is composed of a mathematically coordinated collection of words (i.e., an SCS vocabulary), which may make the identification of nonself SCSs (and hence foreign proteins) practically attainable for the immune system.
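The rank-frequency analysis described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the restriction to the 20 standard residues, and the simple least-squares fit are assumptions made for illustration, and the two toy sequences stand in for a real proteome.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)
STANDARD = set(AMINO_ACIDS)

def count_scs(proteins, k=5):
    """Count every overlapping k-residue window across the given sequences."""
    counts = Counter()
    for seq in proteins:
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if set(window) <= STANDARD:  # skip windows with ambiguous residues
                counts[window] += 1
    return counts

def rank_frequency(counts):
    """Return (rank, count) pairs ordered from most to least frequent."""
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

def loglog_slope(pairs):
    """Least-squares slope of log(count) against log(rank); needs >= 2 ranks."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def zero_count_scs(counts, k=5):
    """Number of possible k-residue words that never occur (nonself words)."""
    return len(AMINO_ACIDS) ** k - len(counts)

# Toy input only; a real analysis would load the full human proteome.
toy_counts = count_scs(["AAAAAA", "AAAAAC"])
toy_pairs = rank_frequency(toy_counts)
print(toy_pairs, loglog_slope(toy_pairs), zero_count_scs(toy_counts))
```

A slope close to -1 over the linear part of the log-log plot would be consistent with Zipf-like behavior; words that never appear have no rank of their own, which is one way to picture the breakdown of linearity at the largest ranks.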
To our knowledge, most SARS-CoV-2 vaccines available at present are based on the antigenicity of the spike protein37–39. The current mRNA vaccines are highly effective, demonstrating that the use of the spike protein for vaccines has probably been the correct choice. Efforts to search for further epitopes continue; studies using neutralizing antibodies and synthetic peptides have identified several epitope sequences in spike proteins11–19. Numerous bioinformatics-based searches for peptide vaccine epitopes have been performed44–47. Potential CTL epitopes have been identified in silico and in mice48–51. However, the concept of self-nonself discrimination has not been incorporated into these efforts. The present study is a novel attempt to incorporate this concept.
We discovered that most of the SARS-CoV-2 proteome is occupied by self SCSs and that nonself SCSs occupy only 8.82% of the proteome and 7.64% of the spike protein. These results may not be surprising, considering that a single SCS in this study contains just 5 aa and that all proteins on Earth may have a common set of SCS distributions24,25. However, this high “similarity” may be surprising considering that the SARS-CoV-2 proteome and its proteins are totally foreign to humans. Theoretically, these results suggest that the human immune system must search for nonself SCSs that are embedded within a sea of self SCSs to avoid the development of autoimmune diseases over the long term.
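As a rough illustration of how such an occupancy figure could be computed, the sketch below builds a host 5-aa vocabulary and reports the fraction of query windows absent from it. The function names and toy sequences are hypothetical, and the fraction here is taken over 5-aa windows; the exact definition behind the percentages reported in this study may differ.

```python
def kmers(seq, k=5):
    """All overlapping k-residue windows of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_self_vocabulary(host_proteins, k=5):
    """Set of every k-residue word that occurs anywhere in the host proteome."""
    vocabulary = set()
    for seq in host_proteins:
        vocabulary.update(kmers(seq, k))
    return vocabulary

def nonself_positions(query, self_vocabulary, k=5):
    """Start indices (0-based) of query windows absent from the host vocabulary."""
    return [i for i, word in enumerate(kmers(query, k))
            if word not in self_vocabulary]

def nonself_fraction(query, self_vocabulary, k=5):
    """Fraction of query windows that are nonself (assumed window-based measure)."""
    n_windows = len(query) - k + 1
    if n_windows <= 0:
        return 0.0
    return len(nonself_positions(query, self_vocabulary, k)) / n_windows

# Toy sequences only; they are not real human or viral proteins.
toy_host = ["MKTAYIAK"]
toy_query = "MKTAYWWW"
vocab = build_self_vocabulary(toy_host)
print(nonself_positions(toy_query, vocab), nonself_fraction(toy_query, vocab))
```

Using a set for the host vocabulary keeps each lookup at constant expected time, so the cost is dominated by a single pass over each proteome.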
In reality, however, the immune system produces antibodies against self SCSs as well as against nonself SCSs. Based on a literature survey, we found that COVID-19 patients produced antisera against both self and nonself sequences11–19. This is not surprising, because nonself SCS regions are relatively infrequent and because an antibody often recognizes a few different short sequences simultaneously in a 3D space, as demonstrated in the case of anti-spike antibodies11–19. Furthermore, Treg cells may change the level of the self-nonself discrimination threshold to allow the production of self-targeted antibodies under various conditions20,21.
A similar discussion may be valid regarding the activation of CTLs via MHC class I molecules. Consider a self SCS cluster of 8 aa residues from SARS-CoV-2 that can be fully presented by MHC class I. With a 5-aa window, this cluster comprises 4 consecutive, overlapping self SCSs (8 − 5 + 1 = 4). Its N-terminal 5-aa SCS is identical to an SCS from a human protein, and its C-terminal 5-aa SCS is identical to a different SCS from another human protein. Moreover, the two 5-aa SCSs in the middle are identical to still other SCSs from other human proteins. These 5-aa SCSs are all self SCSs, but their combination is novel to humans. In this way, a self SCS cluster can behave as a nonself cluster combinatorially. Nevertheless, a single self SCS may also be able to function as an epitope.
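The combinatorial argument above can be made concrete with a small sketch. The peptide and host sequences are invented toy strings, and the function names are assumptions for illustration; the check simply asks whether every 5-aa window of a peptide occurs somewhere in the host while the full peptide does not.

```python
def windows(peptide, k=5):
    """Overlapping k-residue windows; an 8-aa peptide yields 8 - 5 + 1 = 4."""
    return [peptide[i:i + k] for i in range(len(peptide) - k + 1)]

def occurs_in_host(word, host_proteins):
    """True if the word appears as a contiguous stretch in any host protein."""
    return any(word in protein for protein in host_proteins)

def is_combinatorially_novel(peptide, host_proteins, k=5):
    """True if every k-residue window is self but the whole peptide is absent."""
    all_windows_self = all(occurs_in_host(w, host_proteins)
                           for w in windows(peptide, k))
    peptide_absent = not occurs_in_host(peptide, host_proteins)
    return all_windows_self and peptide_absent

# Toy host: each protein carries exactly one of the four windows.
toy_host = ["WACDEFW", "WCDEFGW", "WDEFGHW", "WEFGHIW"]
toy_peptide = "ACDEFGHI"
print(windows(toy_peptide))
print(is_combinatorially_novel(toy_peptide, toy_host))
```

In this toy case each window is a self word contributed by a different host protein, yet the 8-aa peptide as a whole has no counterpart in the host.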
In any case, various self and nonself epitopes are likely targeted simultaneously during acute infection, and we believe that linear self epitopes are mostly, although not completely, “benign” in terms of autoimmunity. A similar discussion may be valid for immunological memory. If self epitopes are not completely safe in terms of autoimmunity, then once pathogenic antigens are eliminated, the immune system should not retain memories of self epitopes of acute pathogens. In contrast, immunological memory for nonself epitopes may safely be retained for life. This may be one of the reasons why it is difficult to establish immunological memory for relatively benign pathogens such as those causing the common cold and influenza. In this sense, establishing a life-long immunological memory for SARS-CoV-2 using vaccines may not be straightforward. The potential risks of autoimmune reactions, although not substantial, should not be ignored in the context of worldwide immunization. Potentially safer and more effective vaccines, from the viewpoint of self-nonself immunological recognition of epitopes, are encouraged in the COVID-19 pandemic era.
Although we found many nonself SCSs and their clusters throughout the SARS-CoV-2 proteome (Fig. 1d, e), we concentrated on the RBD of the spike protein to narrow the analysis to practically important epitopes (Fig. 2a). We indeed discovered nonself SCSs and their clusters in the RBD. All of them, except the single TNVYA nonself SCS, have already been demonstrated to be parts of epitopes of existing neutralizing antibodies in previous studies11–18 (Fig. 2b). Two superclusters were identified. The 17-aa supercluster is composed of the STFKCYGVS and VIAWNSNN clusters, which together form an antiparallel β-sheet (Fig. 3). The self sequences between these two clusters should be eliminated when designing candidate epitopes for vaccine targets, but their elimination would disrupt the conformational relationship between these two clusters. In this sense, the use of this conformational epitope without the inclusion of self SCSs might not be practical. An additional drawback of the VIAWNSNN cluster is that it contains 4 point mutation sites, 3 of which cause a nonself-to-self status change. This cluster thus may be relatively prone to variation that allows it to become “invisible”.
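The notions of a nonself cluster and of a mutation-induced nonself-to-self status change can be sketched as follows. This is an illustration under stated assumptions, not the procedure of this study: a cluster is assumed here to be a run of nonself windows with consecutive start positions, the helper names are hypothetical, and the sequence and vocabulary are toy data.

```python
def nonself_flags(seq, self_vocabulary, k=5):
    """One boolean per window: True if the window is absent from the host vocabulary."""
    return [seq[i:i + k] not in self_vocabulary
            for i in range(len(seq) - k + 1)]

def nonself_clusters(seq, self_vocabulary, k=5):
    """Merge runs of consecutive nonself windows into (start, end, subsequence).

    Assumption: 'start' is 0-based and 'end' is exclusive.
    """
    flags = nonself_flags(seq, self_vocabulary, k)
    clusters = []
    run_start = None
    for i, flag in enumerate(flags + [False]):  # sentinel closes a trailing run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            end = (i - 1) + k  # the last nonself window started at i - 1
            clusters.append((run_start, end, seq[run_start:end]))
            run_start = None
    return clusters

def status_changes(seq, pos, new_residue, self_vocabulary, k=5):
    """Windows whose status flips after a point mutation at 0-based position pos.

    Returns (window_start, nonself_before, nonself_after) for each flipped window.
    """
    mutated = seq[:pos] + new_residue + seq[pos + 1:]
    before = nonself_flags(seq, self_vocabulary, k)
    after = nonself_flags(mutated, self_vocabulary, k)
    first = max(0, pos - k + 1)      # first window covering pos
    last = min(pos, len(seq) - k)    # last window covering pos
    return [(i, before[i], after[i])
            for i in range(first, last + 1) if before[i] != after[i]]

# Toy data: the host vocabulary contains a single word.
toy_vocab = {"AAAAA"}
toy_seq = "AAAAAAWAAAAAA"
print(nonself_clusters(toy_seq, toy_vocab))
print(status_changes(toy_seq, 6, "A", toy_vocab))
```

In the toy sequence, one foreign residue makes five overlapping windows nonself, which merge into a single 9-aa cluster; reverting that residue flips all five windows back to self, which is the kind of change that could make a cluster "invisible".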
In contrast, the 19-aa nonself supercluster, PCNGV-GFNCYF–QSYGF, may be more suitable as a vaccine target. This 19-aa sequence contains 4 point-mutation sites, but they are all at boundaries between nonself and self SCSs (two of them are located in the gap between two nonself SCSs). The structure of the PCNGV nonself SCS (the first part of the 19-aa supercluster) has not been determined, suggesting that it may be within an intrinsically disordered region (Fig. 3). Probably reflecting this fact, this region of the 19-aa supercluster is recognized by just a few neutralizing antibodies, whereas its C-terminal region is recognized by many existing neutralizing antibodies (Fig. 2b). Indeed, the C-terminal region is the most targeted epitope. Among these antibodies, CB6 and B38 recognize not only the C-terminal region of the 19-aa supercluster (forming a β-strand) but also the IADYNYKL cluster (forming an α-helix), indicating that this cluster may join the 19-aa supercluster to constitute a conformational epitope. However, only one side of the α-helix of the IADYNYKL cluster (i.e., D420 and Y421) is likely accessible, suggesting that the contribution of the IADYNYKL cluster to the antigenicity of this epitope is not large. Therefore, the 19-aa supercluster or its C-terminal region alone may be sufficient for vaccines. As an exception, one neutralizing antibody, C144, appears to recognize both superclusters17.
After infection, pathogenic genomes mutate under strong immunological pressure from the host. One consequence of accumulated mutations is CTL escape52,53. Although the mechanisms of CTL escape are elusive and may be multifaceted, CTL escape may be triggered when pathogens continuously mutate to the point that they contain too few nonself epitopes, relative to the number of self epitopes, for the human immune system to recognize. The present study suggests that, following a host change of SARS-CoV-2, probably from bats to humans, in December 2019, the proteome of SARS-CoV-2 may be evolving to contain fewer nonself and more self sequences to escape recognition and elimination by the immune system, including CTLs, according to the sequence mimicry hypothesis. The use of relatively invariant nonself SCSs, such as the 19-aa supercluster identified in the present study, as vaccine targets may alleviate this problem. Further perspectives on the evolution of SARS-CoV-2 may be important for global public health (Supplementary Discussion).