We analyzed the share of the four selected variants in each HHS region during the study period. The Alpha variant surfaced predominantly in the Southeast. Even at its peak it did not dominate the other variants outright, holding less than 80% of the share in the Southeast regions and even less in other regions. The Delta variant initially surged in HHS Region 7 (IA, KS, MO, and NE), with Kansas City at its center, and persisted longest in this region. While the neighboring regions were switching to another, more prominent variant, Delta remained the most prevalent variant in Kansas City. Delta also spent the longest time at its peak, holding nearly 100% of the share for over 5 months. The Omicron subvariant BA.5 exhibited a unique nationwide spread: all HHS regions experienced a BA.5 surge, and it became the most prevalent variant across all regions simultaneously. In HHS Region 2 (NY, NJ, PR, VI), with New York City at its center, the BA.5 share declined first and at the fastest rate. Variant XBB.1.5 showed a higher concentration in the Northeast, but insufficient data prevented us from observing its behavior over the variant’s full life span (Fig. 1). The set of graphs in Fig. 1 presents a temporal comparison of the prevalence of different COVID-19 variants across the ten HHS regions. Each graph corresponds to a specific variant, labeled B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D), and displays the share of that variant over time, with the x-axis denoting time from early 2021 to early 2023. The y-axis represents the proportion of the variant in the population, ranging from 0 to 1 (0–100%). Each line within a graph represents one of the ten regions, with color coding used to differentiate between them. The graphs show how quickly each variant became prevalent and subsequently declined in each region.
The regional dynamics of the share indicate that the speed and pattern of each variant’s surge and decline differ across regions. To understand the contributing factors, we then used the RF regressor to predict the share of each of the four variants across the HHS regions and to identify the important predictors. Our models exhibited excellent predictive performance, well above the R² of 0.72 obtained for a mixed-variant baseline. The models for the Alpha variant and the Omicron subvariant BA.5 performed best, with R² = 0.94 and R² = 0.93, respectively, followed by the Omicron subvariant XBB.1.5 with R² = 0.92 and the Delta variant with R² = 0.89. Table 1 summarizes the performance metrics of the predictive models for each variant over its modeled time frame. Each row represents a different variant: B.1.1.7, B.1.617.2, BA.5, XBB.1.5, and a category labeled ‘Other’ (mixed variants), which served as a baseline for comparison.
Table 1
Model Performance Metrics for COVID-19 Variant Predictions
Start Date | End Date | Variant Name | Data Length | MSE | RMSE | MAE | R-Square |
2021-01-02 | 2021-10-30 | B.1.1.7 | 4398 | 0.006641 | 0.081491 | 0.033258 | 0.936054 |
2021-01-30 | 2023-02-11 | B.1.617.2 | 33357 | 0.021188 | 0.145561 | 0.031952 | 0.885554 |
2021-09-25 | 2023-02-11 | BA.5 | 8032 | 0.010736 | 0.103613 | 0.025812 | 0.925882 |
2022-10-22 | 2023-02-11 | XBB.1.5 | 210 | 0.006165 | 0.078517 | 0.040888 | 0.920208 |
2021-01-02 | 2023-02-11 | Other | 38751 | 0.016741 | 0.129385 | 0.049647 | 0.715303 |
MSE, Mean Squared Error; RMSE, Root Mean Squared Error; MAE, Mean Absolute Error
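As a minimal sketch of how the four metrics in Table 1 are defined, the function below computes MSE, RMSE, MAE, and R² from paired observed and predicted shares. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the study; the code uses only the Python standard library.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R-squared for paired observations."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n          # mean squared error
    mae = sum(abs(e) for e in errors) / n         # mean absolute error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variance
    ss_res = sum(e * e for e in errors)                 # residual variance
    # R-squared is undefined when the observed values are all identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

# Hypothetical variant shares (observed vs. predicted), for illustration only.
observed = [0.10, 0.40, 0.80, 0.95, 0.60]
predicted = [0.15, 0.35, 0.75, 0.90, 0.70]
metrics = regression_metrics(observed, predicted)
```

Because RMSE is the square root of MSE, the two columns of Table 1 can be checked against each other: for the B.1.1.7 row, the square root of 0.006641 agrees with the reported RMSE of 0.081491 to within rounding.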
Our findings indicate a complex interplay between environmental factors and the spread of different variants. The top features identified by the RF regressor that could affect the spreading patterns include temperature, UV index, ozone value, and air quality index. Each variant appears to have its own favorable environmental conditions. For instance, the Alpha variant showed a strong association with the air quality index and the temperature. The Delta variant exhibited a significant relationship with ozone density. Similarly, the Omicron subvariant BA.5 demonstrated a connection with the UV index. Lastly, the Omicron subvariant XBB.1.5 revealed associations with location, land area, and income (Fig. 2). Figure 2 presents the model’s feature importances for predicting the prevalence of the four COVID-19 variants, labeled B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). The bar charts display the top 15 features that contribute to the model’s predictions, with the length of each bar indicating the relative importance of that feature. The x-axis shows the relative importance, quantifying the strength of each feature’s influence on the model’s output, while the y-axis lists the features.
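The text does not state which importance measure underlies Fig. 2. As one illustration of how a feature's influence on a fitted regressor can be quantified, the sketch below uses permutation importance: each feature column is shuffled in turn, and the resulting increase in MSE is taken as that feature's importance. This is a swapped-in, model-agnostic technique and not necessarily the measure used in the study. The toy model, the data, and all names are hypothetical.

```python
import random

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(targets, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        permuted = mse(targets, [predict(r) for r in shuffled_rows])
        importances.append(permuted - baseline)
    return importances

# Hypothetical data: feature 0 stands in for an informative predictor
# (e.g. UV index), feature 1 for an uninformative one.
data_rng = random.Random(42)
rows = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(50)]
targets = [0.1 * r[0] for r in rows]

def toy_model(row):
    # Stand-in for a fitted regressor (assumption); it only uses feature 0.
    return 0.1 * row[0]

scores = permutation_importance(toy_model, rows, targets, n_features=2)
```

In this toy setting the informative feature receives a positive score, while the ignored feature scores exactly zero because shuffling it cannot change the predictions.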
Correlation results showed that the Alpha variant had a positive correlation with the air quality index (AQI) and a negative correlation with the temperature. The Delta variant had a negative correlation with the ozone value and a positive correlation with the temperature. The BA.5 variant had a positive correlation with the UV index, and the XBB.1.5 variant had a negative correlation with land area and a positive correlation with income (Fig. 3). The set of bar graphs in Fig. 3 represents the Spearman correlation coefficients between various environmental and demographic factors and the prevalence of the four COVID-19 variants: B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). The Spearman correlation is a nonparametric measure that assesses how well the relationship between two variables can be described using a monotonic function. For each variant, the factors are shown on the x-axis, and the y-axis shows the correlation coefficient, which ranges from −1 to 1. A value of 1 indicates a perfect positive correlation, −1 indicates a perfect negative correlation, and values closer to 0 indicate a weaker or no monotonic relationship.
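To make the definition concrete, the following sketch computes a Spearman coefficient as the Pearson correlation of the ranks, with tied values assigned their average rank. The helper names and the sample values are hypothetical and do not come from the study's data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman coefficient: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values: share rises monotonically but non-linearly with UV index.
uv_index = [1, 2, 3, 4, 5, 6]
share = [0.01, 0.02, 0.05, 0.20, 0.60, 0.95]
rho = spearman(uv_index, share)
```

Here the relationship is perfectly monotonic, so the Spearman coefficient reaches 1 even though the Pearson coefficient on the raw values is lower; this is why the measure describes monotonic rather than strictly linear association.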
A detailed correlation graph of the BA.5 variant and its top related factor, the UV index, is shown in Fig. 4. This collection of scatter plots depicts the relationship between the UV index and the prevalence share of the BA.5 COVID-19 variant across the ten regions. Each plot corresponds to a region, with the x-axis representing the UV index and the y-axis showing the variant’s share within the region. The points on each plot indicate individual observations. These plots show a positive correlation between the BA.5 share and the UV index, in agreement with Fig. 3.
Decision tree plots were also generated to illustrate each factor’s impact on each variant, as depicted in Fig. 5. This visualization shows decision tree models for predicting the prevalence of the four COVID-19 variants: B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). In a decision tree, each internal node represents a “test” on an attribute, each branch represents an outcome of the test, and each leaf node holds the prediction: a class label in a classification tree, or a predicted value in a regression tree. The plots trace the tree’s decision process and make the split thresholds visible.
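The thresholds shown in such a plot come from a search over candidate split points. As a minimal sketch, assuming a regression tree (the target here is a continuous share), the function below finds the single threshold on one feature that minimizes the summed squared error when each side is predicted by its mean. It illustrates one node only and is not the study's implementation; the names and sample values are hypothetical.

```python
def best_split(feature, target):
    """Find the threshold on one feature that minimizes the summed squared
    error of predicting each side by its mean (one regression-tree node)."""
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(feature, target))
    best = None
    for k in range(1, len(pairs)):
        lo, hi = pairs[k - 1][0], pairs[k][0]
        if lo == hi:
            continue  # no threshold separates equal feature values
        threshold = (lo + hi) / 2  # midpoint between adjacent feature values
        left = [t for f, t in pairs if f <= threshold]
        right = [t for f, t in pairs if f > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best["cost"]:
            best = {
                "threshold": threshold,
                "cost": cost,
                "left_value": sum(left) / len(left),    # prediction if test is true
                "right_value": sum(right) / len(right),  # prediction otherwise
            }
    return best  # None when every feature value is identical

# Hypothetical observations: temperature vs. variant share.
temperature = [5, 8, 12, 20, 24, 30]
share = [0.80, 0.75, 0.70, 0.20, 0.15, 0.10]
node = best_split(temperature, share)
```

For these sample values the chosen test is "temperature <= 16.0", with the left branch predicting a share of about 0.75 and the right branch about 0.15. A full tree repeats this search recursively on each side and across all features.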