The importance of superspreaders on the spread of the global COVID-19 pandemic

The dynamics of pandemics is most often analyzed using a variation of the SIR (Susceptible-Infected-Recovered) model 1 , the key parameter of which is the basic reproduction number R 0 . Some evidences suggest that the contagion-spreading networks are scale-free, with the biggest nodes corresponding to superspreaders 2,3 . However, current understanding of the scale-free topology of these networks, and of the implications of such topology for the dynamics of pandemics is incomplete. Here we show that the world-wide spreading rate of COVID-19 gives an indirect evidence that the underlying virus-spreading network is scale-free, with the degree distribution exponent close to 2. Furthermore, our results show that the spreading rate of a virus is predominantly controlled by superspreaders who typically get infected and acquire immunity during the initial outbreak stage of the pandemic. Thereby the biggest nodes get immune and hence, removed from the network, resulting in a rapid decrease of the eﬀective reproduction number. These ﬁndings are important for understanding the dynamics of pandemics, and for designing optimal virus control strategies. In par-ticular, screening a population for the number of antibodies of a set of viruses can reveal potential superspreaders, the vaccination or isolation of whom can impede a pandemic at its early stage.

The scale-free nature of the virus-spreading networks has deep implications. To begin with, a small community without potent superspreaders is expected to have a smaller value of effective reproduction number R E than a large community with potent superspreaders.
More importantly, the value of R E may decrease rapidly in time due to what we shall be referring to as self-targeting attacks. This term is derived from the fact that scale-free networks are vulnerable to the so-called targeted attacks 22,23 : the operation of such networks is disrupted if the biggest nodes are destroyed. Self-targeting attacks are acts of the system itself destroying its biggest nodes. The probability of a node getting infected and subsequently removed from a virus spreading network (assuming a recovery from the infection leads to becoming immune) is proportional to the number of links connecting it to the other, possibly infected nodes. Hence, the nodes with the largest number of links are the ones with the highest likelihood of becoming infected and getting removed from the virus spreading network at the earliest stage of the pandemics. Real-world spreading rate of COVID- 19. In order to see if the world-wide spreading rate of COVID-19 carries a fingerprint of self-targeted attacks, we utilized publicly accessible data of COVID-19 spread 24 covering the period from January 2020 until September 2021 by plotting the effective reproduction number R E for a certain country against the percentage r of those who have suffered the virus earlier and have been thereby removed from the virus spreading network. The instantaneous value of R E is calculated as the ratio of the number of new cases at this point in time divided by the number of new cases one week ago. One week is expected to approximately correspond to the average time interval between getting the virus, and infecting the next subjects. This includes the average incubation period of six days 25 , and a time gap from the first symptoms to isolation. When calculating the percentage r, we have excluded those cases which happened before the all-time maximum value of R E was reached. Fig. 1 shows the dependence of R E on the cumulative fraction of infected population. For single countries, the obtained R E (r)-curves fluctuate due to insufficient statistical basis. More stable results are obtained by performing this analysis for a set of countries/regions: for European countries, US counties, and for a set of biggest countries in the world (cf. the section "Data availability"). For each set of countries, we have binned the values of r, and averaged R E (r) over the countries in the set to find the mean value R E . A straight line in this log-log plot corresponds to a power law; for reference, a dependence R E ∝ r −α with α = 1/2 is depicted by a dashed line. The fact that already a very small percentage r of the population with acquired immunity results in a noticeable decrease of R E provides a strong evidence of self-targeted attacks being in action. Scale-free model. In order to verify whether the behaviour observed from the real-world data is really the consequence of the scale-free nature of the virus-spreading network, we have constructed the following mathematical model which, similarly to Ref. 11, combines the SIR model with a scale-free network and is kept as simple as possible.
(i) The network links are (a) fixed in time, and (b) of equal strength: there is a fixed probability p for the virus to spread over each of the links connecting an infected node with a susceptible node. Regarding these simplifications, (a) this model network is to be considered as a snapshot of the real world: all those contacts which are made between people during a certain time period (of duration equal to the average infectious time) form the links of the model network; (b) while in the real world, there are strong and weak links, the strong links (e.g. between family members) can be represented via multiple weak links of the model. (ii) Nodes which have recovered from the infection remain immune until the end of the simulation run. For some simulation runs, a certain fraction of nodes have additionally been "immunized" through "vaccination": random nodes were selected and marked as immune, prior to any infections. (iii) The incubation time is taken to be equal to one time step, i.e. iteration, and each infected node remains infectious during one time step. In other words, if a node got the virus at time t n , it will be infectious at time t n+1 , and will neither be able to get the virus again nor transmit it to anyone for t ≥ t n+2 . This means that in the real world, one time step correspond to the average time interval between getting the virus, and infecting the next subjects, i.e. slightly more than the average incubation period. Keeping in mind that the average incubation time is around six days, one time step corresponds to around one week in real world. Using a probability distribution of the incubation period can increase the accuracy of simulations, but is not expected to affect the scaling behavior. Indeed, scale-free systems which differ only in correlations at the smallest available scale belong usually to the same universality class, i.e. are characterized with the same scaling exponents, cf. 26. (iv) Initially, the whole population is susceptible to the virus, which is introduced by selecting and infecting a random node. The simulation run is considered as a successful one if at least 1% of the population got the virus; otherwise, the data is discarded. For successful runs, only the data after the maximum reproduction number was reached was considered; the averaged R E curves were calculated based on 200 successful simulation runs, mimicking the procedure employed by the analysis of the real-world data. Our networks have been synthesized using the generator from the NetworkX package 27,28 , and are characterized by the cumulative degree distribution exponent κ -the number of nodes n with degree exceeding k scales as n ∝ k −κ . Technical analysis. Simulations were run for different scale-free networks with κ = {1.6, 2.0, 2.5, 3.4} while the virus transmission probability was kept equal to p = 0.1. The middle plot in Fig. 2 shows R E plotted against the cumulative number of infected nodes. Observe that for κ ≤ 2, there is a fairly good power law over a wide range of values, R E (r) ∝ r −α , corresponding to a straight line in Fig. 2. Such a behaviour can be explained as follows. The probability a node will get infected is proportional to its degree, hence the expected reproduction number R exp is proportional to the weighted average degree (averaged over the network), with the weights being equal to the degree: R exp = p k 2 / k , where p denotes the probability of the infection being transmitted over a link, and angular braces-averaging over all nodes of the network.
There is an important difference between the cases with κ ≤ 2, and κ > 2, cf. 9 . For κ ≤ 2, R exp is dominated by the contributions from the largest nodes whose degree is on the order of the maximal degree k max . Hence, R exp can be expected to decrease proportionally to the maximal degree of the effective network : R exp ∝ k max . Here, the term effective network refers to the network where the links to the ineffective nodes (recovered or vaccinated) are removed. The fact that the largest nodes dominate the contribution to R exp means also that the largest nodes catch the infection at the earliest stage of the pandemics, and will be removed from the effective network henceforth. If the number of the biggest removed nodes is proportional to the overall number r of the infected and recovered nodes then r ∼ k −κ max , and hence the maximal degree k max of the remaining nodes scales as k max ∝ r −1/κ , implying that R exp ∝ r −α with α = 1/κ, in an approximate agreement with the simulation results. The scenario described above will cease to work if κ > 2, when R exp is no longer dominated by the contributions from the biggest nodes.
Removal of a small number of the biggest nodes at the earliest stages of the pandemics can no longer dramatically affect the value of R exp . This is confirmed by the simulation results: the curves corresponding to κ > 2 in the middle plot of Fig. 2 start with a nearly horizontal segment, and turn to a steeper descent only once a considerable number of nodes have been infected.
The real-world data are best matched with κ ≈ 2. This fact can be considered as an indirect evidence that the virus spreading network is scale-free, with the degree distribution exponent being close to 2. This match is reasonably good, except that the real world curves follow the law R E ∼ r −α only when R E > 1, see Fig. 1. The reason is that while the simulation runs are stopped as soon as the first wave of pandemics dies out, the real world time series include also data from the subsequent waves.
The remaining plots of Fig. 2 show the simulation results for the same scale-free network with κ = 2. The left plot depicts R E against cumulative number of infected nodes for different values of the virus propagation probabilities. While the curves with smaller p lie beneath the curves with larger p, all the curves exhibit almost the same power law R E ∝ r −α with α ≈ 0.5. Only in the case p = 0.04, this behavior starts breaking down, the reason being that with so small p, the pandemics fluctuates at the verge of extinction and undergoes multiple minor "waves": the infected population remains so small that contrary to the assumptions used to motivate the equality α = 1/κ, sometimes, by chance, there are no infectious superspreaders.
The right plot shows the effect of initial vaccination of the population from 0% to 45%. The scaling law remains intact, vaccination will only affect the prefactor, resulting in a vertical shift of the graphs.
Simulation results show that if the virus spreading network is scale-free, with a big enough degree distribution exponent, κ ≥ 2, the virus propagation rate is controlled by superspreaders, i.e. by subjects with the biggest number of contacts. The fact that the world-wide spreading rate of COVID-19 is consistent with the dynamics on a synthetic scale-free network with κ ≈ 2 suggests that the same is valid for the real world. This triggers a question whether there is a way to identify the superspreaders so that by vaccinating or isolating them, it would be possible to avert the pandemic. Because the superspreaders are highly prone to catch the virus, they are expected to have caught almost all the similar viruses by a certain point in time, developing the corresponding antibodies. Therefore, the potential superspreaders could be found by carrying out a screening based on the number of antibodies for a selected collection of viruses.
The screening viruses in this collection should have neither too high nor too low transmission probability which is henceforth denoted as p v . Viruses with too small p v fail in infecting large enough fraction of the population, and viruses with too large p v infect too many nonsuperspreaders. We have tested the feasibility of this approach by running another set of simulations. First, a collection was built using N different "viruses", all having the same transmission probability p v . For each virus from the collection, a pandemic was initiated by infecting random nodes and letting it evolve until it naturally dies out. Then, the nodes were ranked according to the number of "antibodies", i.e. by the number of times they were infected previously. Then, the target for the vaccination coverage p t was set, and nodes were "vaccinated" (i.e. marked as immune) according to the ranking list, starting with the node with the largest number of "antibodies", and moving downwards until the vaccination coverage target was reached. Finally, the main pandemic was initiated by infecting a random node of the network, and the total number of nodes M which got the virus during the main pandemics was determined. The last step was repeated 300 times to find the average value M .
The simulation results for a population of 100 000 individuals are presented in Fig. 3. The left plot shows M as a function of N and p t by p v = 0.07, and the right plot-M as a function of p v and p t by N = 5. Observe that the number of viruses in the collection does not need to be too big, even just with five different viruses in the collection, vaccination of merely 6% of the nodes can prevent the virus from spreading. In the real world, the results will be somewhat inferior as the contact links between people will vary over time and are not frozen in time as in the case of our simulations. For instance, big conferences temporarily create many superspreading nodes of the contact network. On the other hand, temporary superspreaders create a noise for the screening procedure. Because of that, somewhat bigger virus collection is probably needed for a reasonably reliable screening procedure. Notice also that the suggested screening procedure can neutralize only persistent superspreading nodes, and the temporary ones need to be handled via administrative measures.
the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character 115 (772), 700-721 (1927).
[2] Liljeros, F., Edling, C. R., Amaral, L. A., Stanley The second set included the most populous countries in the world. Some of the countries were removed due to lack of reported data, or the data being erroneous, such as the reported number of cases on a day being negative. The included countries were as follows: • Argentina, Brazil, Colombia, France, Germany, India, Indonesia, Iran, Italy, Japan, Mexico, Peru, Poland, Russia, South Africa, Spain, Turkey, Ukraine, United Kingdom, and United States.
For United States counties, the data from 1809 counties were aggregated. Only counties that had reliable COVID-19 and population census data were considered.

Scale-free network
The underlying graph of the scale-free networks used in our model was generated using the scale free graph function from the NetworkX package, which provides a directed scale-free graph. The function takes as its argument three parameters, α, β, and γ where α + β + γ = 1 28 . Changing these parameters changes the cumulative degree distribution exponent of the ingoing and outgoing connections. The directed graph is then turned into an undirected graph and double connections between nodes as well as self-connections are removed.
As an example, in order to get κ = 2.0, values of α = 0.1, β = 0.8, γ = 0.1 can be used. See Extended Data Table 1 for other κ and parameter value pairs. Examples of the cumulative degree distribution for several different resulting graphs with N = 500 000 nodes are shown on Extended Data Fig. 1.

Simulation description
The stochastic SIR model used in the simulations is already described in some detail in the main text. At each time step, the effective reproduction number R E is found as the ratio of the number of new infections I t at the current time step t to the number I t−1 at the previous time step, and its value is recorded together with the cumulative fraction r t = N −1 t−1 τ =0 I t of all the infections during the whole simulation run. Only successful runs that infected at least 1 % of the population were considered.
The parameters used for running the simulations that were used to generate figures 2 and 3 are displayed in Extended Data Table 2.

Simulations involving screening
The simulations used for figure 3 were carried out at 14 different values of initial vaccination percentage p t , 10 different values of the screening virus spreading probability pv, and 14 different values of screening virus count nv. The values used can be found in Extended Data Table 2. Notably, a screening virus count of nv = 0 was used in the simulations, corresponding to the case where random initial vaccination was used.
For each set of parameters (pv, p t , nv), 300 simulations were carried out and averaged with a virus with spreading probability p = 0.1 and a scale-free network with κ = 2.0 (refer to Table 2).

Extended Data
Extended Data Fig. 1: Cumulative frequency distribution plots of the four scale-free networks which were used for the simulations, with different scaling exponents κ, each with N = 500 000 nodes. The cumulative frequency at a degree k is represented as the number of nodes with degree higher or equal to k.