Predicting Variant-Driven Multi-Wave Pattern of COVID-19 via a Machine Learning Analysis of Spike Protein Mutations

29 Never before such a vast amount of data, including genome sequencing, has been collected for any viral pandemic than for the current case of COVID-19. This offers the possibility to trace the virus evolution and to assess the role mutations play in its spread within the population, in real time. To this end, we focused on the Spike protein for its central role in mediating viral outbreak and replication in host cells. Employing the Levenshtein distance on the Spike protein sequences, we designed a machine learning algorithm yielding a temporal clustering of the available dataset. From this, we were able to identify and deﬁne emerging persistent variants that are in agreement with known evidences. Our novel algorithm allowed us to deﬁne persistent variants as chains that remain stable over time and to highlight emerging variants of epidemiological interest as branching events that occur over time. Hence, we determined the relationship and temporal connection between variants of interest and the ensuing passage to dominance of the current variants of concern. Remarkably, the analysis and the relevant tools introduced in our work serve as an early warning for the emergence of new persistent variants once the associated cluster reaches 1% of the time-binned sequence data. We validated our approach and its effectiveness on the onset of the Alpha variant of concern. We further predict that the recently identiﬁed lineage AY.4.2 (‘Delta plus’) is causing a new emerging variant. Comparing our ﬁndings with the epidemiological

Never before such a vast amount of data, including genome sequencing, has been collected for any viral pandemic than for the current case of COVID-19. This offers the possibility to trace the virus evolution and to assess the role mutations play in its spread within the population, in real time. To this end, we focused on the Spike protein for its central role in mediating viral outbreak and replication in host cells. Employing the Levenshtein distance on the Spike protein sequences, we designed a machine learning algorithm yielding a temporal clustering of the available dataset. From this, we were able to identify and define emerging persistent variants that are in agreement with known evidences. Our novel algorithm allowed us to define persistent variants as chains that remain stable over time and to highlight emerging variants of epidemiological interest as branching events that occur over time. Hence, we determined the relationship and temporal connection between variants of interest and the ensuing passage to dominance of the current variants of concern. Remarkably, the analysis and the relevant tools introduced in our work serve as an early warning for the emergence of new persistent variants once the associated cluster reaches 1% of the time-binned sequence data. We validated our approach and its effectiveness on the onset of the Alpha variant of concern. We further predict that the recently identified lineage AY.4.2 ('Delta plus') is causing a new emerging variant. Comparing our findings with the epidemiological data we demonstrated that each new wave is dominated by a new emerging variant, thus confirming the hypothesis of the existence of a strong correlation between the birth of variants and the pandemic multi-wave temporal pattern. The above allows us to introduce the epidemiology of variants that we described via the Mutation epidemiological Renormalisation Group (MeRG) framework. 30 Introduction this work, a mutation is a single change in the amino acid sequence of the Spike protein (substitution, addition, deletion), 75 while a Spike variant, or simply variant, is a unique sequence of amino acids in the Spike protein that appears in clusters. 76 Like other coronaviruses, the SARS-CoV-2 has relatively low mutation rates 2 , nevertheless the current COVID-19 pandemic 77 has seen the emergence of several epidemiologically relevant variants. Efficient nucleotide sequencing has allowed to track 78 sequence mutations along the genome of SARS-CoV-2, and to identify dangerous variants 3, 4 that appeared to increase the 79 infectivity compared to the initial form that was sequenced from the outbreak in Wuhan, China 5 (GenBank: MN908947.3). 80 Since the second half of 2020, variants of concern (VoCs) and of interest (VoIs) have been identified in various regions of be found on the WHO website (www.who.int). Considering the Alpha VoC as an example, it has been possible to study its 87 infectious power in lab experiments, finding a higher rate of transmission by 67-75%, compared to the previous ones 11 . The 88 transmission advantage has been confirmed by epidemiological data in the UK 15, 16 . Most analyses of the epidemiological 89 data are done applying the time-honoured compartmental models of the SIR type 17-19 , appropriately extended by including 90 more compartments 20 . The main drawback in this approach is the large number of parameters, which need to be fixed by hand 91 or extracted from the data. In this work, we bypassed this bottleneck by using a simplified and effective approach based on 92 theoretical physics methods, the epidemic Renormalisation Group (eRG) framework 21-23 , combined with information directly 93 extracted from the Spike protein sequence via a simple Machine Learning approach. This novel method allowed us to analyse, 94 at the same time, the variability of the SARS-CoV-2 Spike protein in multiple countries and regions of the world, and thus 95 provide a direct comparison of the epidemiological impact of the different Spike variants. A theoretical analysis of the variants 96 within the eRG framework is presented in a companion publication 24 .

97
In the present work, we analysed the protein sequence data for the UK nations downloaded from the GISAID repository 8, 9 . 98 We implemented a simple Machine Learning (ML) algorithm based on the Levenshtein measure (LM) 25, 26 in order to cluster 99 protein sequences based on their distance in terms of number of amino acid substitutions (i.e. the number of amino acid 100 mutations needed to transform one sequence into the other, or vice-versa). The clusters have been defined by setting cutoffs on 101 the Ward distance between branches of a proximity tree, built by the use of a standard hierarchical clustering algorithm. We 102 applied the clustering algorithm on the data binned in temporal units, specifically here identified by months or weeks. Each 103 time unit may include more clusters according to the cutoffs set on the Ward distance and, within each cluster, the dominant 104 variant is the most frequent in terms of identical sequences over the total number of sequences in the cluster. We then developed 105 an algorithm that links clusters appearing in consecutive time units and creates chains of clusters that share the same dominant 106 Spike variant. Empirically we determined that chains that persist for longer than three time units identify emerging variants. In 107 order to reconstruct the origin of each emerging variant, we associated the initial cluster of each chain with a cluster from the 108 previous time unit that maximises the overlap in their sequence content. Hence, a branching relation emerges in our procedure.

109
For clarity, in this work we use the following definitions for variants and mutations:  2. An emerging variant is defined as an established chain that contain more than three consecutive clusters, defined using 114 our linkage algorithm. This criterium is established empirically from the results of the chain reconstruction.

115
It should be noted that some of the emerging variants defined by our procedure can be associated to VoCs and VoIs, as defined 116 by the WHO, as they share the same characteristic Spike mutations.

117
The procedure above has been independently repeated for each geographical region in our study. We validated our results 118 by showing that our approach identifies the Alpha VoC, independently, in all the distinct UK regions we studied. Once the 119 dominant variants were identified, we analysed their temporal spreading within the affected population. Given that only a small 120 fraction of the infected individuals have their viral charge sampled and sequenced, we estimated the number of people infected  physics 27, 28 . The approach has been extensively tested 23,29 , confronted with traditional SIR compartmental models 30 , and, last  The main goal of this work is to understand the viral dynamics that characterises wave patterns stemming from infectious 138 diseases like COVID-19. The eRG approach additionally offers a natural mathematical understanding in terms of the dynamical 139 flow of the system 32, 33 . Importantly, by employing ML analysis to genomic data, we discovered that each pandemic wave 140 is driven by a single emerging and persisting variant. The findings demonstrate that the variant dynamics is one of the main 141 engines behind the emergence of wave patterns for COVID-19. This result can be used as a template for similar infectious 142 diseases. As direct consequence of our study we propose a novel evolutionary model for the interpretation of the virus diffusion 143 that is mutation driven.

145
Spike protein sequences have been extracted from the GISAID repository on a country-specific basis and the date-stamp 146 associated to each sequence has been used to obtain a temporal dimension of viral variants appearance. Note that each genome 147 sequence in the GISAID data collection corresponds to the most frequent Spike variant occurring in a single infected individual.  To define the clusters, we considered a cutoff in the distance so that branches whose Ward distance is larger than the cutoff are 155 considered as separate clusters. We applied the same cutoff to all branches. ones comprising at least 1% of the monthly dataset. The cutoff on the Ward distance r W between branches, as well as the 1% threshold above, were chosen to optimise the coverage of the dataset (i.e. we required that the defined clusters cover at least 90% 165 of the data) while keeping the number of clusters below 10. The optimisation analysis, presented in the supplementary material, 166 showed that the optimal range for the branch cutoff is r W ∈ [50, 200]. After this, we compared the clusters in consecutive 167 months to link those with a "strong similarity", i.e. those that share the same dominant sequence (strong links). More details on 168 this procedure and its validation can be found in the supplementary material. The linkage algorithm we employed allowed 169 to define "chains of clusters" that we associate to emerging variants. The results are shown in Fig. 2 for two choices of the respectively, that last more than three months. In the middle and bottom rows of Fig. 2  VoC) which spins off from v0 closely to v1 and takes over by generating a third wave. We also see the emergence of short-lived 199 variants that do not have the power to start a new wave and therefore die off without infecting a sizeable number of individuals.

200
All short-lived chains have less than two clusters, hence we define a minimum length of three for persisting chains.

217
In Fig. 4 we also show the pattern of diversification occurring along the Spike protein in terms of single amino acid each chain has a typical pattern of substitution. Considered altogether (see Fig. 4B  regions of the Spike protein that also provide hints for more efficient targeting in monitoring or pharmaceutical interventions.    The MeRG framework models the time evolution of the cumulated number of infected by each variant in terms of a logistic function (sigmoid), solution of the eRG equation, and given by: where I c is the cumulative number of infected, γ is the infection rate (in inverse days) and A is the total affected individuals  after the wave (per 100.000 inhabitants). The parameter B controls the timing of the wave, and is of no concern in this study. 240 We recall that the parameter γ encodes the effective diffusion speed of the variant, including not only its intrinsic viral power and Delta VoC with the other ones (in green). We observed a marked increase in the infectivity, by 49% for Alpha, which is 256 compatible with laboratory tests. Interestingly, the frequency percentage for the VoCs, shown in the left plot, can also be fitted 257 very accurately with a logistic function in Eq. (1) as long as only one VoC dominates. The results are also reported in Table 1.

258
The fit parameter γ % is a measure of how more infectious is the new VoC with respect to the previously dominant one. This plot 259 also shows very effectively the switch between the two variants, occurring in May 2021. 260 We repeated the same analysis for South Africa, California and India, which show very good fits notwithstanding the 261 more limited sequencing statistics available on GISAID. This is clearly shown in the left plots, where we report the statistical 262 uncertainly at 65% confidence level, due to the available sequencing. The results, shown in Fig. 5  the MeRG framework provides an excellent modelling of the data.

264
Performance of the ML algorithm as an early warning tool for emerging variants 265 The results for time-ordered chains with monthly clustering (Fig. 2)  this tool, we validated the procedure on the emergence of the Alpha VoC. The first Alpha VoC case has been found on the 269 20th of September, 2020. Following the monthly analysis, we identified a cluster dominated by the Alpha VoC in November 270 2020, at the beginning of chain v2. Once the first cluster is identified, one would need to add the data for the following months 271 to confirm its persistence (as the process is additive, the cluster definitions in the previous months are not affected). We saw 272 empirically from Fig. 2 that persisting chains contain at least three clusters, hence an emerging variant could be defined only 273 two months later (January 2021 for the Alpha VoC).

274
To improve the performance of the ML algorithm in terms of prediction, we reanalysed the same data using a weekly-   Table 2. Early warning for the Alpha VoC. Results of the ML analysis applied to weekly binned data after the first detection of the Alpha VoC in England. The columns contain the 2020 calendar week number, with the initial date (Monday), the total number of sequences in the dataset for each week and the percentage of Alpha VoC sequences identified a posteriori in the data. The fifth column contains the minimal number of clusters in the ML output that allows to isolate the Alpha VoC cases. When this indicator is equal to 1, an early warning can be issued (week 44). After three weeks of the cluster persisting, we identify it with an emerging variant (week 47). This date can be compared to the WHO classification decision (29 Dec., week 53).
rather than monthly-based binning ( Table 2). The cutoff on the Ward distance needs to be adjusted for the weekly analysis, thus would be required. For the Alpha VoC data, this is achieved for weekly binning, as shown in Table 2 (the statistical error is   296 shown as a band in Fig. 6C). Spike variant in cluster 43 (Fig. 2), branching off from the main Delta chain in August 2021 (date at which data acquisition was 300 stopped for the purposes of this work). To determine whether this novel variant could be considered of concern, we carried out 301 our ML analysis with a weekly binning. Despite the fact that this analysis was carried out with a suboptimal r w = 100 (due to 302 the urgency of the situation analyses to determine the optimal r w are ongoing), our weekly analysis clearly indicates that AY. in light of the need to better adjust the clustering cutoff, our analysis of the most recent data in the UK stresses the emergence of 338 the Pango lineage AY.4.2 into a stable chain and thus of a true variant of concern. Our analysis indicates that an early warning 339 could have been issued as of the 19th of September. Thus, our ML analysis tool and these early warning indications could 340 be used by policy makers to implement immediate actions that globally limit the spread of this and of other variants that will 341 emerge in the future.

343
This study was performed on the sequencing data from a single region, England. This is justified by the fact that the sequencing 344 dataset associated to England on the GISAID open-source genome repository is by far the largest and most uniform compared 345 to other countries/regions. Biases relative to specific data-taking practices un England may induce biases in the analysis. To 346 validate the results, however, we have also analysed the data for Wales and Scotland, as presented in the supplementary material. 347 We chose the other two nations of the Great Britain island because they have a very similar epidemiological history compared 348 to England, thus we would expect comparable outcomes. As such, by comparing the results we would test the reliability of the 349 ML procedure alone. In fact, the results for Wales and Scotland, while less significant with respect to statistics, show the same 350 patterns we obtained for England.

351
As an early warning tool, our ML approach is triggered when a new cluster chain branches off. This procedure is sensitive 352 to the working point chosen for the clustering algorithm, and it has a certain chance to produce a false positive. However, the 353 extensive use of this tool on future data and on sequences from other countries/regions will allow to estimate more reliably the  studies are necessary to fully exploit this information.

366
Our work furthermore underlines the importance of sufficient genomic data in order to both scientifically understand and 367 track the temporal evolution of viral diseases, but also to issue sufficiently early warnings of epidemiologically dangerous 368 13/16 variants so that decision makers can take efficient preventative measures. Our approach can be applied to other viral diseases, 369 like influenza, provided that sufficient sequencing data is available.