Evaluating the Effect of SARS-CoV-2 Spike Mutations by Causal Inference

Xin Wang Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Sciences (AMMS) Mingda Hu Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Sciences (AMMS) Bo Liu Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Sciences (AMMS) Huifang Xu Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Sciences (AMMS) Yuan Jin Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Sciences (AMMS) Boqian Wang Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Sciences (AMMS) Yunxiang Zhao Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Sciences (AMMS) Jun Wu Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Sciences (AMMS) Junjie Yue Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Sciences (AMMS) Hongguang Ren (  bioren@163.com ) Beijing Institute of Biotechnology, State Key Laboratory of Pathogen and Biosecurity, Academy of Military Medical Sciences (AMMS)

This manuscript formulates a principled framework, which utilizes causal inference to estimate the statistical contribution of Spike mutations to viral tness across lineages. To the best of our knowledge, this research is the rst to apply causal inference to mutational analysis on large-scale genomes of SARS-CoV-2. This work, schematically depicted in Fig. 1, is described in detail in Methodology, including the Data Preprocessing, the Effect Estimation, the Validation and Application, etc. In the Data Preprocessing, 7.7 million high-quality SARS-CoV-2 complete genome sequences as of May 11, 2022 are retrieved from GISAID website (12), aligned for Spike amino acid mutations, and mapped into mutation combinations with the corresponding basic reproduction number (R0), as row vectors in the feature matrix. In the Effect Estimation, causal inference is utilized for an unbiased estimation of the average treatment effect (ATE) of each mutation on the outcome R0, and the estimated ATE serves as the effect score of mutations. Mutations are computationally validated for functional in uences, including the Spike protein stability, the host cell-surface receptor (the human angiotensin-converting enzyme 2, ACE2) binding a nity, and the potential for immune escape. Based on effect scores as the quantitative assessment of mutations, important mutations and protein regions can be recognized and studied. Besides, the tness of SARS-CoV-2 variants is estimated, and the transmission capacity of any new variant can be predicted solely based on the viral sequence. Moreover, secondary results of causal inference models can likewise assist further analysis, which may reveal potential interactions between mutations.

Effect Score And Validation Of Mutations
Based on 7.7 million genomes, 107 mostly frequent amino acid mutations on the Spike protein are identi ed. Meanwhile, the basic reproduction number (R0) of each sequence is quanti ed according to the corresponding variant type (16,17). By causal inference methods, mutations are evaluated on their contribution to R0 across lineages, namely the effect score. As validations, mutations are studied by computing methods for detailed in uences, including the Spike stability, ACE2 binding a nity, and immune escape.
Effect score of the top twenty mutations, with computing validations and references for mutational effects, is presented in Table 1. All mutations in Table 1, except P26-mutation, possess one or more validated positive functional in uences, supported by either computing results or literature references, or both. For instance, the T478K mutation may stabilize the Spike protein and can signi cantly enhance the binding a nity between Spike and ACE2 (18, 19). The D614G mutation, which has been found in VoCs since early 2020, may be involved in the Spike stability, viral replication, and Spike conformation shifting, thus improving viral infectivity and transmissibility (9)(10)(11). As a key mutation in BA. 2 In terms of related VoCs, most mutations in Table 1 are typical mutations for VoC, except the V213mutation. On the contrary, over half of the mutations in Supplementary Table 1 are not typical for VoCs. Besides, most mutations in Table 1 have been found in Omicron strains, except for three mutations (V213-, E484K, and A222V). Accordingly, mutations of VoCs, especially those for Omicron variants, usually have high effect scores due to their contributions.
This work further studies mutant RBD proteins by biological experiments. We designed two new RBD sequences (RBD-1 and RBD-2, c.f. Supplementary Table 2) with several key positions replaced by mutations evaluated by our model to be with possible tness enhancements. Mutant RBDs are expressed, puri ed, and evaluated for ACE2 binding a nity, compared with the wildtype RBD (RBD-WT). After the successful expression and puri cation illustrated in Supplementary Fig. 1A-1C, the ACE2 a nity is estimated and presented in Supplementary Table 3

Key Mutation Identi cation On The Spike Protein
Based on the effect score of mutations, key tness-increasing mutations can be recognized. Supplementary Fig. 2 illustrates the treemap of Spike mutations by subunits, in which the size represents the effect score, and the color represents the count of mutational occurrences. Generally, the overall size of the S1 subunit is considerably larger than S2, suggesting that the S1 subunit may be more contributive to viral tness elevation. Moreover, the effect score of mutations is not necessarily correlated with the mutation count. On the one hand, some long-accumulated mutations, such as D614G and T478K, play a signi cant role in the enhancement of viral tness. On the other hand, some recently emerged mutations, e.g., V213-and S704L, can still achieve high effect scores by contributions, despite fewer occurrences.
For a structural study on high-scoring mutations, residues of the top ten mutations are visualized in the Spike-ACE2 complex in Fig. 2. Spike structure in the closed conformation (i.e., receptor inaccessible state) is also visualized in Supplementary Fig. 3 for a comparative study. In Fig. 2, four mutations (S477N, T478K, Q498R, and N501Y), occur in the binding interface between Spike and ACE2. Such mutations locate in the receptor-binding domain (RBD, amino acid position 319-541), possibly engaged in the Spike-ACE2 a nity. Table 1 and references support that these mutations can increase binding a nity (18, 19,[21][22][23]. S371F in the RBD domain of the Spike is reported to increase the Spike stability and ACE2 a nity, and is also involved in immune escape (23,24). Furthermore, S371 residue may participate in the conformational transition of Spike between the open state ( Fig. 2) and closed state ( Supplementary   Fig. 3), namely the up and down positions of RBD, respectively (25). Two mutations, P26-and V213-, occur in the N-terminal domain (NTD, amino acid position 14-303). NTD can be the target of human monoclonal antibodies (mAbs) (6, 26), hence these mutations may play a part in the immune escape of SARS-CoV-2 (6, 26). For the D614G mutation, besides its in uence on Spike stability and viral replications (9,10), it can participate in the Spike conformation shift toward an ACE2 binding-competent state, before viral membrane fusion with host cells (5,11). In the subdomain linking S1 to S2, the H655Y mutation gives rise to a less tight loop that wraps the furin cleavage nger, and increases infectivity in the presence of N501Y (27). In terms of S704L, despite the lack of references for functional effects, computing validations in Table 1 veri es its effect on the Spike stability, ACE2 a nity, and immune escape.

Regional Study Of The Spike Protein
The Spike protein of SARS-CoV-2 consists of two subunits ( Supplementary Fig. 3), S1 and S2, divided by the furin cleavage site (amino acid position 681-685) (3). S1 mainly includes NTD and RBD, and mediates the ACE2 binding to host cells; S2 functionally conducts the membrane fusion with host cells (3,5). In combination with the regional functions, evaluated mutations can produce a better insight into Spike subunits and domains.
Mutations of positive effect scores are illustrated across the Spike gene in the Manhattan plot in Fig. 3.
Among the top twenty mutations, eighteen mutations clustered in the S1 subunit. Hence, the S1 subunit achieves a greater mutational contribution, compared with S2. Speci cally, nine mutations occur in RBD, including the top-scoring T478K mutation. Consequently, RBD mutations are vastly important to tness enhancement, which can be explained by its function of ACE2 binding and immune escape (3,5,8). NTD likewise plays a part in viral infection, and contains six high-scoring mutations. Some important mutations, such as H655Y and P681H, locate near the S1-S2 subunit boundary, which may be related to the furin cleavage site (27) and facilitate the conformational shift of Spike (3). Compared with the S1 subunit, mutations in S2 usually have modest effect scores, except for S704L and N764K.
The distribution of effect scores in different subunits/domains is presented in Supplementary Fig. 4, including both positive and negative scores. Despite the approximate length of S1 and S2, the S1 subunit has considerably more mutations, especially high-scoring ones. Thus, S1 is more contributive in tness elevation. On the contrary, the S2 subunit can be more conserved, with fewer mutations compared with S1 (7). Among the 81 mutations in S1, 46 mutations are concentrated in NTD. Nevertheless, most scores for mutation in NTD are modest. Compared with other regions, RBD generally has a higher distribution of effect scores. Due to its crucial function, RBD serves as an important domain in the tness enhancement of SARS-CoV-2.

Viral Fitness And R0 Prediction
Based on mutations, the tness score of SARS-CoV-2 strains can be estimated. The tness score of one given sequence can be de ned by the sum of effect scores of its mutations. Particularly, the original Wuhan strain (wildtype) serves as a baseline for tness score evaluation.
The tness score and R0 of wildtype and VoCs is illustrated in Supplementary Fig. 5. The rank by the tness score is consistent with the rank by R0, indicating the correlation between the two sides. Further, those strains are plotted as points in Fig. 4. All points except subsequent BA.2.12.1 are used to train a polynomial regression (with degree 3). These points are generally close to the regression, located in the 75% con dential interval (75% CI). The regression line after BA.2 is prediction other than training. As a validation, the value of BA.2.12.1 is subsequently plotted, which is close to the regression and well ts the prediction.
Further, the historical tness score of SARS-CoV-2 is explored. Because of numerous SARS-CoV-2 strains during each period, the historical tness score is extended to be the overall tness during a speci c period, i.e., the weighted sum of effect scores, with the weight being the mutation frequency during that time period (Fig. 5). A steady increase of the historical viral tness score during the pandemic has been illustrated in Fig. 5, whose rises synchronize well with the contemporaneous emergence of VoCs. For instance, the D614G strain rose to prominence in Feb 2020 (9), leading to the tness increase at that time. Similar increases can be found when the Alpha, Delta, and Omicron variant emerged respectively. Over time, the viral tness increase is accelerated, especially by the emergence of Delta and Omicron. As for Spike regions, the contribution of RBD increases signi cantly, from the minority in 2020 to the majority since mid-2021. Similarly, the contribution of NTD and other regions increases, though the promotion is less than RBD.

Discussion
The Spike protein of SARS-CoV-2 is vastly important to viral tness, especially transmissibility (3,5,8), and is one of the most investigated proteins of SARS-CoV-2. This manuscript concentrates on Spike mutations for mutational contributions to viral tness. Despite studies on individual mutations such as D614G (9-11) and N501Y (18, 19,21), extensive assessment of mutational contributions in the context of large-scale genomes is still challenging. Meanwhile, causal inference is one of the most promising methods in data science. It produces an unbiased estimation of the treatment effect on outcomes, as a function of observable characteristics of samples (13)(14)(15)28). Bene tted from causal inference, Spike mutations can be evaluated and identi ed for further analysis.
To employ causal inference models, a quantitative phenotype is required for representing viral tness as the outcome variable, which is set to be R0 in this paper. On the one hand, R0 clearly reveals viral transmissibility, as one of the most important quantitative phenotypes for viral tness; one the other hand, the R0 estimation of SARS-CoV-2 variants is extensively studied and widely recognized (16, 17). Therefore, R0 is considerably representative and quali ed for viral tness. Likewise, other phenotypes of SARS-CoV-2 are quali ed for causal inference models as long as quantitated. Hence, this work can be transferred and applied to other quantitative phenotypes, and even to other viral genomic data.
The validation of this study includes computing methods, whose results are consistent with literature references. This consistency provides a solid cornerstone for the validation and explanation of this research and further demonstrates the effectiveness of effect scores.
Although this paper includes three important mutational effects, there are still other aspects of effects, such as viral replication and viral pathogenicity. Thus, a mutation with no validated in uence in Table 1 does not necessarily imply no positive effect at all. Accordingly, literature references also serve as supplementary validations. For instance, the R408S mutation shows no remarkable in uence by computing validations in Table 1 but may facilitate the opening of RBD by previous research (29). Another noticeable mutation is S704L, which is positively in uential in Table 1 by computing results. To date, S704L has not been thoroughly investigated, which may be an interesting direction for future research.
Another interesting phenomenon is that the effect of Spike mutations can be correlated with the region. For instance, RBD functionally conducts ACE2 binding (5) and can be the target of antibodies (8), hence many a nity-enhancing and immune-escape mutations occur in RBD, making RBD a region of highscoring mutations. Another domain, NTD, has maximum mutations but mostly with modest average scores, which may be not vastly noticeable. Although some mutations in NTD, e.g., T95I, V213G, and A222V, may be involved with immune escape (30,31), the speci c function of NTD and its mutations still remain to be elucidated. Besides, some important mutations locate at/near the furin cleavage site. As the furin cleavage site is crucial to SARS-CoV-2 (3, 32), those mutations may functionally correlate with it, like P681R (33) and H655Y (27).
Particularly, the S1 and S2 subunits can be rather different in immune escape. As shown in Fig. 3, highscoring mutations are concentrated in S1, indicating that S1 can be more mutational. The S1 subunit is usually the target of antibodies (6, 8, 26) and holds many mutations, so S1 may be vastly important to viral immune escape. On the contrary, S2 is more conserved than S1, with fewer mutations. This characteristic may make S2 a potential target of medicine and general vaccine development (7).
In the present study, contributions of mutations are learned from R0, and conversely, reveal the relative viral tness. During the model training, however, BA.2.12.1 strains are not particularly distinguished but recognized as ordinary BA.2 strains. Even so, BA.2.12.1 still achieves a higher tness score than BA.2, hence the tness score can effectively reveal viral tness. Similarly, the regression in Fig. 4 is not trained by BA.2.12.1 strains, but BA.2.12.1 well ts the regression, which validates the effectiveness of the model. Therefore, for an unknown strain of SARS-CoV-2, its relative tness, as well as R0, can be computationally predicted, solely based on its viral sequence, by computing its tness score by mutations and further estimating R0 according to the regression.
In terms of the historical tness in Fig. 5, one interesting phenomenon is the synchronization between the tness increase and the emergence of variants. The driving forces behind may be related to the selective sweep of SARS-CoV-2 (34), in which previous predominant strains are swept and replaced by new ones. The prevalence of novel variants usually implies possible adaptive advantages compared with previous ones, leading to a higher tness score. According to the upward trend of the regression line in Fig. 4, new strains with higher scores may enjoy signi cantly enhanced vial tness such as transmissibility and immune escape, which may further prolong the pandemic. Accordingly, epidemiological surveillance of new SARS-CoV-2 variants is supposed to be strengthened.
The interpretability of causal inference model can produce secondary information. The model of each mutation includes other mutations as covariates, which can be interpreted by SHAP values (35). Supplementary Fig. 6 interprets the model of the top mutations, which represents the unidirectional in uence of other mutations on the object mutation. For mutations that are mutually top in uential to each other, their bidirectional in uences may reveal possible mutation interactions. These interactions are categorized into positive and negative ones, according to the mutational co-occurrence and exclusion, respectively. Supplementary Fig. 7 illustrates the possible interactions discovered in this study, which remains to be further investigated and con rmed.

Methodology
This work is schematically depicted in Fig. 1, consisting of Data Preprocessing, Effect Estimation, Validation and Application, etc.

Data Preprocessing
Data source and data curation. From GISAID Website (12), 7,699,174 high-quality SARS-CoV-2 complete genome sequences as of 11 May 2022 are retrieved (c.f. Supplementary_FastaID.csv for more information). For each genome, amino acid mutations on the Spike gene are identi ed in alignment with the reference sequence Wuhan-Hu-1 (GenBank accession number NC_045512). Spike mutations with global occurrence of less than 1,500 are considered infrequent and then discarded.
Feature extraction. The feature matrix is generated to depict Spike sequences by the mutation combination and R0. Each Spike sequence is mapped into a mutation combination, and then is represented by a bool vector of Spike mutations, as a row vector in the feature matrix, along with R0 according to the variant type (16,17). Mutation combinations with either global occurrence less than 1500 or inestimable R0 are abandoned. Since replicate rows in the feature matrix have no in uence on the unbiased estimation of causal inference, redundant rows are merged. For a sound estimation, mutations are supposed to be observed in at least two mutation combinations, otherwise will be discarded. Consequently, 107 major amino acid Spike mutations are retained for further studies.

Effect Estimation
Mutations are modelled and evaluated successively. For each mutation, a causal inference model is utilized for an estimation of its effect across lineages. For a given mutation as treatment , its effect score is estimated by , namely the average treatment effect (ATE) on outcomes, with R0 as the observed outcome and other mutations as observable characteristics (covariates) on samples. The estimation is based on all the observed i.i.d. samples from the feature matrix, with the -th row being the sample . Speci cally, Linear Doubly Robust Learner (Linear DRL) (36, 37) is employed, with an assumption of linear form of treatment effect (36, 37).

Validation
For validations of effect scores, computing methods are utilized for functional in uences of mutations, including the Spike protein stability, ACE2 binding a nity, and potential for immune escape.

Rbd-ace2 A nity
For biological experimental investigations, mutant RBD proteins based on RBD mutations in Table 1 are expressed, puri ed, and then evaluated for ACE2 binding a nity, compared with the wildtype RBD.
RBD Expression and Puri cation. The wildtype and mutant RBD proteins are expressed and puri ed by the method in prior works (43). Mutant RBDs are designed by mutation sites into pPICZαA-RBD-WT, according to the mutation in Supplementary Table 2. The plasmids of pPICZαA-RBD are linearized by BglII and transformed into the glycoengineered yeast (44). Positive clones of RBD are screened by western blot analysis. After the shake-ask culture, the product is centrifuged at 8500× g rpm for 15 min.
The harvested supernatant is puri ed as described previously (43). Puri ed proteins are analyzed by sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE).

Conclusions
This manuscript proposes and formulates a principled framework of an unbiased approach for evaluating Spike mutations of SARS-CoV-2 by causal inference models, in the context of large-scale genomes. Based on 7.7 million viral genome sequences, mutations are evaluated on their contributions to viral tness across lineages. As validated, high-scoring mutations possess one or more positive functional in uences, which demonstrates the effectiveness of this research. Based on the effect score, key tness-enhancing mutations and protein regions are identi ed. Particularly, RBD mutations play an important role in the tness elevation of SARS-CoV-2. Besides, the tness and R0 of unknown SARS-CoV-2 strains can be predicted, solely based on the viral sequence. This approach provides reliable guidance about mutations of interest, including some high-scoring but less-studied mutations like S704L. Moreover, the present work can be transferred to other quantitative phenotypes of SARS-CoV-2 for evaluating speci c mutational effects, e.g., immune escape. Altogether, this approach produces an innovative and systematic insight into SARS-CoV-2 mutations, which may contribute to the evolutionary characterization of SARS-CoV-2 and the Spike-targeted medicines and vaccines against SARS-CoV-2. As the rst to apply causal inference to mutational analysis on SARS-CoV-2 genomes, this work may inspire more related applications and promote the development of interdisciplinary elds.

Declarations
Data Availability SARS-CoV-2 genome sequences as of May 11, 2022, are retrieved from GISAID Website (12 Figure 1 Schematic representation for the framework of this study, including the Data Preprocessing, the Effect Estimation, and the Validation and Application. Genome sequences of SARS-CoV-2 are retrieved and integrated for data curation and feature extraction. In data curation, Spike sequences are aligned to the reference sequence for mutation detection; in feature extraction, each sequence is mapped into a mutation combination with the corresponding R0, as a row vector in the feature matrix. For each mutation, the causal inference model is utilized to estimate its average treatment effect (ATE) on the outcome R0, with other mutations serving as observable covariates. Estimated ATE serves as the effect score of each mutation, which is validated by computing methods for detailed in uences, including the Spike protein stability, the human angiotensin-converting enzyme 2 (ACE2) binding a nity, and immune escape. Effect scores can support downstream analysis, including key mutation identi cation, protein region study, viral tness and R0 prediction, etc.   Polynomial regression of tness score and R0. The regression is trained by all points except BA.2.12.1, with the shaded area representing the 75% con dence interval. The point of BA.2.12.1 is subsequently illustrated as a validation.