Modelling COVID-19 cases in Nigeria: Forecasts, uncertainties, projections and the link with weather

The World Health Organization (WHO) declared COVID-19 a global pandemic on 11 March 2020 due to its global spread. In Nigeria, the first case was documented on 27 February 2020. Since then, it has spread to most parts of the country. This study models, forecasts and projects COVID-19 incidence, cumulative incidence and death cases in Nigeria using six estimation methods, i.e. the attack rate, maximum likelihood, exponential growth (EG), Markov chain Monte Carlo (MCMC), time-dependent and sequential Bayesian approaches. A sensitivity analysis with respect to the mean generation time is used to quantify the associated reproduction number uncertainties. The relationship between COVID-19 incidence and five meteorological variables is further assessed. The results show that the highest incidences are recorded on days with either religious activities or markets, while incidence decreases from weekdays towards the weekend. It is also established that COVID-19 incidence significantly increases with increasing sea level pressure (correlation coefficient of 0.7) and significantly decreases with increasing maximum temperature (correlation coefficient of -0.3). Also, selecting an optimal period for reproduction number estimates reduces the variability between estimates. As an example, in the EG approach, the epidemic curve that optimally fits the exponential growth is between 1 and 53 time units, with a reproduction number estimate of 1.60 [1.58; 1.62] at the 95% confidence interval. However, this optimal reproduction number estimate is different from the default reproduction number estimate. Using the MCMC approach, the correlation coefficients between the observed and forecasted incidence, cumulative death and cumulative confirmed cases are 0.66, 0.92 and 0.90 respectively. The projections till December show values approaching 1,000,000, 120,000 and 3,000,000 respectively. Therefore, timely intervention and effective preventive measures are immediately needed to mitigate a full-scale epidemic in the country.


Introduction
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2). The Wuhan Municipal Health Commission, China, first reported COVID-19 as a cluster of pneumonia cases on 31 December 2019 [1]. The novel coronavirus was officially confirmed in Thailand on 13 January 2020, subsequently spreading to Japan and South Korea [1].
COVID-19 has now become a global pandemic due to its alarming spread and severity [2]. There have been 9,473,214 confirmed cases and 484,249 deaths worldwide as of 26 June 2020.
Although the spread of COVID-19 in Africa was gradual relative to other parts of the world, infections have risen exponentially in recent weeks and the disease is still spreading. Africa's first case was reported in Egypt on 14 February 2020 [1], while Nigeria registered its first case on 27 February 2020.
Since the index case was first reported, the number of new cases from COVID-19 community transmission has steadily increased. As of 27 June 2020, Nigeria's cumulative reported cases of COVID-19 had risen to 23,298 with 554 deaths, 8,253 discharged and 14,491 active cases currently receiving treatment [4].
In response, the Federal and State governments in Nigeria have introduced measures to curb the spread of the virus and protect citizens. These include the suspension of all activities and religious meetings, the permanent closure of public and private schools and universities, the ban on domestic and international flights, the building of more isolation centres and the closure of borders.
Even though data paucity at the onset of this pandemic hindered optimal forecasting and projection for the country, the epidemiological characteristics of COVID-19 dynamics in West Africa, and Nigeria in particular, are different from those in other parts of the world. The uncertainties surrounding the modelling of the COVID-19 spread in Nigeria therefore cannot be overemphasized. Nevertheless, few studies have focused on the modelling and forecasting of COVID-19 cases in Nigeria. For example, Ayinde et al. [5] compared different statistical models and estimators in forecasting COVID-19 cases without accounting for the associated uncertainties. Adegboye et al. [6] investigated early COVID-19 transmissibility using the Bayesian method while incorporating serial interval distribution uncertainties, but without projections.
Although this method assumes random mixing of the population, it does not consider the depletion of susceptible individuals.
Consequently, the uncertainty surrounding COVID-19 transmissibility is still inherent. Therefore, it is important to correct these lapses for a robust forecast and projection, which could translate to better intervention and more effective preventive and mitigating measures against the outbreak in the country.
This study contributes to the existing knowledge of COVID-19 dynamics in Nigeria by modelling and forecasting the incidence, cumulative incidence and cumulative deaths using the attack rate, maximum likelihood, exponential growth, Markov chain Monte Carlo, time-dependent and sequential Bayesian approaches to estimating the COVID-19 reproduction number. Subsequently, these variables are projected into the future. The uncertainty associated with the reproduction numbers, as well as the sensitivity of the reproduction ratio to the estimation time-period, is quantified, whilst the correlation between COVID-19 incidence and some meteorological variables is examined.

Data And Methods
Daily confirmed cases and deaths arising from COVID-19 in Nigeria are extracted from the archives of the Nigerian Centre for Disease Control (NCDC) [4] for 117 days, i.e. from February 28 (onset of the outbreak) to June 22, 2020. The first death was reported on March 23, 2020. However, the current analysis focuses on the period from the onset of the locally confirmed cases (March 17) to June 22. The spatial distribution of the confirmed cases and deaths across the country is presented in Fig. 1 [7].
To establish the relationship between COVID-19 incidence and meteorological variables in the country's epidemic centre, daily observed maximum temperature (°C), atmospheric pressure at sea level (hPa), humidity (%), rainfall (mm) and maximum sustained wind speed (km/h) are retrieved from the archives of the Trans-African HydroMeteorological Observatory.
Subsequently, based on the epidemic curve from the initial and time-dependent reproduction number estimations (as described below), the disease transmissibility and death cases are modelled and forecasted. For all time-dependent reproduction number estimations, a weekly sliding window is applied to the incidence series [8,9]. The model outputs are validated for one month. After a performance check, the best model is used to project the future incidence, cumulative incidence and cumulative deaths till December 2020.

Estimating reproduction numbers
The reproduction number is the average number of susceptible people infected by each infectious person. During an epidemic, reproduction numbers can be estimated at any time. This study assesses the initial and time-dependent reproduction numbers using the six methods described below.

Attack Rate (AR)
The AR is the proportion of the population who are eventually infected. This is connected to the basic reproduction number and given as [10]
$$R_0 = -\frac{\ln\left(\frac{1 - AR}{S_0}\right)}{AR - (1 - S_0)},$$
where $S_0$ is the initial proportion of the susceptible population. This method requires a closed population with homogeneous mixing and no intervention throughout the epidemic. The method is modelled after the Susceptible-Infectious-Recovered (SIR) model. For this method, the population size is pegged at the country's population of 200 million people.
From above, the reproduction number throughout an outbreak is estimated based on a single serial interval distribution, thereby not accounting for the serial interval distribution uncertainties. Therefore, five time-dependent reproduction number estimations are further examined and explained below.
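As a minimal sketch of the AR calculation, the snippet below evaluates the final-size relation for a closed SIR population, $R_0 = -\ln((1 - AR)/S_0) / (AR - (1 - S_0))$. The function name and the attack rate of 0.5 are illustrative assumptions, not values from this study.

```python
import math


def r0_from_attack_rate(attack_rate: float, s0: float = 1.0) -> float:
    """Basic reproduction number implied by a final attack rate.

    attack_rate: proportion of the population eventually infected (AR).
    s0: initial proportion of the population that is susceptible (S0).
    """
    if not 0.0 < s0 <= 1.0:
        raise ValueError("s0 must lie in (0, 1]")
    # The denominator AR - (1 - S0) must be positive and AR must stay below 1.
    if not (1.0 - s0) < attack_rate < 1.0:
        raise ValueError("attack_rate must lie strictly between 1 - s0 and 1")
    return -math.log((1.0 - attack_rate) / s0) / (attack_rate - (1.0 - s0))


# Illustrative value only: half of a fully susceptible population infected.
print(f"{r0_from_attack_rate(0.5):.3f}")  # prints 1.386
```

With a fully susceptible population the result can be checked against the classical final-size equation $AR = 1 - e^{-R_0 \cdot AR}$.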

Sequential Bayesian (SB)
This method estimates the initial reproduction number sequentially by approximating the SIR model with a Poisson process, such that the incidence at time $t+1$, $N(t+1)$, is approximately Poisson distributed with mean $N(t)e^{\gamma(R-1)}$ [11], where $N(t)$ is the incidence at time $t$ and $1/\gamma$ is the average duration of the infectious period. The method initializes the daily reproduction number ($R$) distribution with the previous day's posterior distribution [12], and the generation time is explicitly estimated using an exponential distribution. However, this method assumes random mixing of the population, while the depletion of susceptible individuals is not considered.
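The sequential update can be sketched on a discrete grid of candidate $R$ values. This is a simplified illustration, not the implementation used in the study: the flat prior, the grid and the function names are assumptions. Each day's posterior becomes the next day's prior, and the update stops at the first day with zero incidence, where the Poisson mean $N(t)e^{\gamma(R-1)}$ is zero for every $R$.

```python
import math


def poisson_logpmf(k: int, mean: float) -> float:
    """Log of the Poisson probability mass function (mean must be > 0)."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def sequential_bayes_r(incidence, gamma, r_grid):
    """Return the list of daily posterior distributions of R over r_grid."""
    log_prior = [-math.log(len(r_grid))] * len(r_grid)  # flat prior (assumption)
    history = []
    for n_now, n_next in zip(incidence, incidence[1:]):
        if n_now == 0:
            break  # the method cannot continue past a zero-incidence day
        log_post = [
            lp + poisson_logpmf(n_next, n_now * math.exp(gamma * (r - 1.0)))
            for lp, r in zip(log_prior, r_grid)
        ]
        peak = max(log_post)
        weights = [math.exp(v - peak) for v in log_post]  # avoid underflow
        total = sum(weights)
        posterior = [w / total for w in weights]
        history.append(posterior)
        # Yesterday's posterior is today's prior.
        log_prior = [math.log(p) if p > 0 else -math.inf for p in posterior]
    return history


# Synthetic series growing by exp(0.2) per day, i.e. R = 2 when gamma = 0.2.
grid = [i / 100 for i in range(1, 501)]
series = [round(100 * math.exp(0.2 * t)) for t in range(10)]
final = sequential_bayes_r(series, 0.2, grid)[-1]
print(sum(r * p for r, p in zip(grid, final)))  # posterior mean of R
```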

Time-dependent reproduction numbers (TD)
Reproduction numbers in TD are calculated by averaging all reproduction numbers over the transmission networks which are compatible with observations [13]. This is given as
$$Q_{kl} = \frac{N_k\, w(t_k - t_l)}{\sum_{m \neq k} N_k\, w(t_k - t_m)},$$
where $Q_{kl}$ is the probability that case $k$ with onset time $t_k$ is infected by case $l$ with onset time $t_l$, $N$ is the incidence and $w$ is the generation time distribution. Therefore, the effective reproduction number for case $l$ is $R_l = \sum_k Q_{kl}$. This is averaged over all cases with an identical onset date, which is represented as
$$R_t = \frac{1}{N_t} \sum_{\{l:\, t_l = t\}} R_l.$$
With TD, imported cases during the outbreak can be taken into consideration.
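A minimal sketch of this calculation on daily counts is given below, assuming a discrete generation time distribution `w` (with `w[0] = 0`) and no imported cases; the names are illustrative. Because all cases with the same onset date share the same value, the per-case average reduces to one number per day.

```python
def time_dependent_r(incidence, w):
    """Case reproduction number per onset day (None where there are no cases).

    incidence[t]: number of cases with onset on day t.
    w[d]: probability that the generation time equals d days, with w[0] = 0.
    """
    n_days = len(incidence)
    # Total infection pressure acting on a case with onset on day t.
    pressure = [
        sum(incidence[u] * w[t - u] for u in range(t) if t - u < len(w))
        for t in range(n_days)
    ]
    r_t = []
    for s in range(n_days):
        if incidence[s] == 0:
            r_t.append(None)
            continue
        total = 0.0
        # Sum, over later cases, the probability that a case on day s infected them.
        for t in range(s + 1, min(n_days, s + len(w))):
            if pressure[t] > 0:
                total += incidence[t] * w[t - s] / pressure[t]
        r_t.append(total)
    return r_t


print(time_dependent_r([1, 2, 4, 8, 16], [0.0, 0.5, 0.5]))
```

The last days of the series are biased downwards because the secondary cases of recent infections have not yet been observed.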

Exponential Growth (EG)
The EG rate is the change in the number of new epidemic cases per person per unit of time. This is estimated using Poisson regression [14]. In this case, the initial reproduction ratio can be linked to the EG rate during the early phases of the outbreak [15]. However, there is no assumption on population mixing. The reproduction number ($R$) following EG is computed as [12]
$$R = \frac{1}{m(-r)},$$
where $r$ is the EG rate and $m$ is the moment generating function of the generation-time distribution.

The generation-time distribution is derived from the time lag between all infectee-infector pairs [16].
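The two steps of the EG approach can be sketched as follows: a Poisson regression of incidence on time gives the growth rate $r$, and $R = 1/m(-r)$ follows from the moment generating function of a discrete generation-time distribution, $m(-r) = \sum_d w_d e^{-rd}$. The hand-written Newton iteration, the function names and the example distribution are assumptions made for illustration.

```python
import math


def poisson_growth_rate(incidence, iterations=50):
    """Slope r of the Poisson regression log(mu_t) = a + r * t."""
    t = list(range(len(incidence)))
    # Starting values from a least-squares fit to log(y + 0.5).
    logs = [math.log(y + 0.5) for y in incidence]
    t_mean, l_mean = sum(t) / len(t), sum(logs) / len(logs)
    r = sum((ti - t_mean) * (li - l_mean) for ti, li in zip(t, logs)) / sum(
        (ti - t_mean) ** 2 for ti in t
    )
    a = l_mean - r * t_mean
    for _ in range(iterations):
        mu = [math.exp(a + r * ti) for ti in t]
        # Score vector and Fisher information of the Poisson log-likelihood.
        g0 = sum(y - m for y, m in zip(incidence, mu))
        g1 = sum((y - m) * ti for y, m, ti in zip(incidence, mu, t))
        h00 = sum(mu)
        h01 = sum(m * ti for m, ti in zip(mu, t))
        h11 = sum(m * ti * ti for m, ti in zip(mu, t))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        dr = (h00 * g1 - h01 * g0) / det
        a, r = a + da, r + dr
        if abs(da) + abs(dr) < 1e-12:
            break
    return r


def r_from_growth_rate(r, w):
    """R = 1 / m(-r) for a discrete generation-time distribution w[d]."""
    return 1.0 / sum(p * math.exp(-r * d) for d, p in enumerate(w))


# Illustrative inputs only.
rate = poisson_growth_rate([round(10 * math.exp(0.1 * t)) for t in range(20)])
print(rate, r_from_growth_rate(rate, [0.0, 0.2, 0.5, 0.3]))
```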

Maximum Likelihood (ML)
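In ML, the reproduction number ($R$) is estimated by maximizing its log-likelihood. This is given as [17]
$$LL(R) = \sum_{t=1}^{T} \ln\left(\frac{e^{-\mu_t}\,\mu_t^{N_t}}{N_t!}\right), \qquad \mu_t = R \sum_{i=1}^{t} N_{t-i}\, w_i,$$
where $N_t$ is the incidence at time $t$, $T$ is the length of the incidence series and $w$ is the generation time distribution. ML is calculated based on the exponential growth period. The best exponential growth period is selected based on the deviance-based coefficient of determination. ML also assumes that the number of secondary cases arising from an index case is Poisson-distributed with an expected value equal to the reproduction number. However, there is no assumption on population mixing.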
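For a Poisson likelihood with mean $\mu_t = R \sum_i N_{t-i} w_i$, setting the derivative of the log-likelihood to zero gives a closed-form maximizer, $\hat{R} = \sum_t N_t / \sum_t \sum_i N_{t-i} w_i$. The sketch below uses that shortcut instead of a numerical optimizer; the function names and the toy series are illustrative assumptions.

```python
import math


def infection_pressure(incidence, w):
    """sum_i N_{t-i} * w_i for every day t (w[0] is unused and should be 0)."""
    return [
        sum(incidence[t - d] * w[d] for d in range(1, min(t, len(w) - 1) + 1))
        for t in range(len(incidence))
    ]


def log_likelihood(r, incidence, w):
    """Poisson log-likelihood of R, restricted to days with positive pressure."""
    total = 0.0
    for n, lam in zip(incidence, infection_pressure(incidence, w)):
        if lam > 0:
            mu = r * lam
            total += n * math.log(mu) - mu - math.lgamma(n + 1)
    return total


def ml_reproduction_number(incidence, w):
    """Closed-form maximizer of log_likelihood."""
    lam = infection_pressure(incidence, w)
    cases = sum(n for n, l in zip(incidence, lam) if l > 0)
    return cases / sum(l for l in lam if l > 0)


# Toy series doubling daily with a one-day generation time.
print(ml_reproduction_number([1, 2, 4, 8, 16], [0.0, 1.0]))  # prints 2.0
```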

Markov Chain Monte Carlo R estimation (MCMC)
In this method, the serial interval distribution is estimated from the exposure data at prescribed intervals using MCMC, while the reproduction number is calculated from the serial interval posterior distribution. The serial interval posterior distribution is based on an uncertainty bootstrap approach with 100,000 simulations and 1,000 resamples. This accounts for the reproduction number uncertainties. The details of this method are presented in Thompson et al. [18].

Projections
The daily historical incidence, the reproduction number and the serial interval are used to simulate probable epidemic trajectories and to project the future daily incidence. This assumes a branching process in which daily incidence follows a Poisson process determined by the daily infectiousness. The projected daily incidence φ is calculated following [19], where e is Euler's number, t is the number of times an event occurs in an interval, s is the average number of events in an interval (rate parameter), δ is the probability mass function of the serial interval and ys is the incidence time. The number of simulated epidemic curves is set to 10,000.
The ratio of RMSE to the standard deviation of the observations (RSR) [20], the relative index of agreement [21], the percentage bias and Pearson's correlation coefficient are used to quantify the error between the observed and forecasted incidence over a 30-day projection window.
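A branching-process projection of this kind can be sketched as follows, assuming a constant reproduction number and a discrete serial interval distribution `w` with `w[0] = 0`. The Poisson sampler, which switches to a normal approximation for large means, and all names are illustrative choices rather than the implementation used in the study.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw from a Poisson distribution using only the standard library."""
    if mean <= 0:
        return 0
    if mean > 50:
        # Normal approximation; Knuth's method underflows for large means.
        return max(0, round(rng.gauss(mean, math.sqrt(mean))))
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def project_incidence(history, r, w, n_days, n_sims, seed=0):
    """Simulate n_sims future incidence trajectories of length n_days."""
    rng = random.Random(seed)
    trajectories = []
    for _ in range(n_sims):
        series = list(history)
        for _ in range(n_days):
            t = len(series)
            # Infectiousness: past incidence weighted by the serial interval.
            infectiousness = sum(
                series[t - d] * w[d] for d in range(1, min(t, len(w) - 1) + 1)
            )
            series.append(poisson_draw(rng, r * infectiousness))
        trajectories.append(series[len(history):])
    return trajectories


# Illustrative inputs only.
sims = project_incidence([5, 8, 12, 20], r=1.3, w=[0.0, 0.3, 0.4, 0.3],
                         n_days=14, n_sims=1000)
print(sum(s[-1] for s in sims) / len(sims))  # mean projected incidence on day 14
```

Quantiles across the simulated trajectories give an uncertainty band for each projected day.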
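Three of these metrics can be sketched in a few lines. The sign convention for the percentage bias differs between references; here a positive value means the forecast overestimates the observations, which is an assumption. The relative index of agreement is left out of this sketch, and the function names are illustrative.

```python
import math


def rsr(obs, pred):
    """RMSE divided by the standard deviation of the observations."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - p) ** 2 for o, p in zip(obs, pred))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return math.sqrt(squared_error / spread)


def pbias(obs, pred):
    """Percentage bias; positive means overestimation (assumed convention)."""
    return 100.0 * sum(p - o for o, p in zip(obs, pred)) / sum(obs)


def pearson_r(x, y):
    """Pearson's correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy example: a forecast that doubles every observation.
observed, forecast = [1, 2, 3, 4], [2, 4, 6, 8]
print(rsr(observed, forecast), pbias(observed, forecast), pearson_r(observed, forecast))
```

A perfect forecast gives RSR = 0, PBIAS = 0 and r = 1; the toy forecast above keeps r at 1 while RSR and PBIAS expose the systematic overestimation.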
Additionally, the overall infectivity due to previously infected people at any time step t is calculated by weighting the sum of the formerly infected individuals by their infectivity at time t which is given by a discrete serial interval distribution [8].
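Written out, the weighting described here takes the following form. The symbols are chosen for illustration and are not taken from the original text: $\Lambda_t$ is the overall infectivity at time step $t$, $I_{t-s}$ is the incidence $s$ steps earlier and $w_s$ is the discrete serial interval distribution.

```latex
\Lambda_t = \sum_{s=1}^{t} I_{t-s}\, w_s
```

For example, with incidence $I_0 = 1$, $I_1 = 0$ and weights $w_1 = 0.6$, $w_2 = 0.4$, the infectivity at $t = 2$ is $\Lambda_2 = I_1 w_1 + I_0 w_2 = 0.4$.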
To assess the effect of meteorological variables on the spread of COVID-19, Lagos State, the epicentre of COVID-19 cases in Nigeria, is taken as an example. Pearson's correlation is used to quantify the relationship between the local cases and the meteorological variables, with a 7-day lag between the meteorological variables and the onset of the local incidence. Subsequently, a calendar plot is used to identify the days of the week with the highest locally confirmed cases.
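The lagged correlation can be sketched as below, pairing the weather on day $t$ with the incidence on day $t$ + lag. The function names and the synthetic series are illustrative assumptions, not the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_pearson(weather, incidence, lag=7):
    """Correlate weather on day t with incidence on day t + lag."""
    shifted = incidence[lag:]
    n = min(len(weather), len(shifted))
    return pearson_r(weather[:n], shifted[:n])


# Synthetic series in which incidence follows the weather variable 7 days later.
weather = [float(v % 11) for v in range(40)]
incidence = [0.0] * 7 + [3.0 * v for v in weather[:-7]]
print(lagged_pearson(weather, incidence, lag=7))
```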

Result And Discussion
In this section, the results for the COVID-19 incidence and death cases, the sensitivity analysis and the future projections are presented.

Temporal trend of COVID-19
The temporal distribution of the COVID-19 incidence between February 27 and June 24 is presented in Fig. 2. From the onset of the outbreak until mid-April, the number of daily confirmed cases is between 0 and 100. From May, there is a gradual increase in the incidence for most days, with Sunday, May 30 having the highest number of cases and Tuesday, May 5th and 12th having the lowest. The highest incidence overall is recorded on June 18th, with over 730 cases. The lowest number of confirmed cases in June is on Tuesday, June 2nd. The average weekly analysis for the period of the study shows that Friday has the highest range of incidence, with values between 100 and 300. Sunday has the lowest range, between 100 and 220 confirmed cases.
The incidence decreases from weekdays to weekends, and the sharpest increase is from Tuesday to Wednesday. The temporal distribution of the confirmed cases depends on people's lifestyle and their adherence to the quarantine and lockdown measures put in place by the federal government. For example, the highest cases are recorded on days with either religious activities or markets. Moreover, the decrease towards the weekend suggests that most people are indoors on weekends due to the total lockdown adopted by many states in the federation. Despite the lockdown, many institutions offer skeletal services during the week, which could contribute to the high incidence on weekdays. Additionally, the infectivity increases with time irrespective of the incidence pattern (Fig. 3).

Reproduction numbers
Using both the initial and time-dependent approaches for estimating the reproduction number for the entire study period, different estimates are recorded (Fig. 4), with relatively high variations even though the analysis is carried out on the same dataset. These range from 0.9 for MCMC to 11 for SB at a 95% confidence interval.
Particularly for SB, although it is a time-dependent approach, it fails after the first gap due to periods with zero incidence [12]. However, this does not affect the other time-dependent methods inasmuch as the periods with zero incidence are shorter than the maximum generation time. While the AR approach needs the least input information, it is not useful during an epidemic because it considers only the initial reproduction number to estimate the epidemic curve for the whole period [12]. This is mostly inaccurate during an epidemic. Additionally, AR assumes that there is no intervention to curb the outbreak over its entire period, which is most unlikely. Generally, when $R > 1$, the disease spreads in the population, whereas $R < 1$ means that each case infects fewer than one other person on average and the outbreak declines. When $R$ is large, the disease becomes very difficult to control, which can eventually lead to a pandemic. As an example, an $R$ value of 1.2 on the 10th day for the MCMC (Fig. 4) means an average of 1.2 persons are infected by each infected person on that day. To contain the spread of the outbreak, the proportion of the population that is immunized must exceed the herd immunity threshold, $1 - R^{-1}$ [22]. Given this, an $R$ of 1.06 in MCMC means that 11.8 million people (5.9% of the total population) need to be immunized to contain the outbreak. Subsequently, the epidemic curves generated by all methods (except SB and AR) do not overlap while maintaining proximity with the incidence (Fig. 5). Furthermore, the generation time distribution may be uncertain during an evolving outbreak. Nevertheless, the period when the outbreak growth becomes exponential has to be specified in the EG and ML approaches. This is usually the period between the first case and when the incidence is maximum.
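The herd immunity arithmetic can be reproduced with a short sketch (the function name is illustrative). With the rounded value $R = 1.06$ the threshold is about 5.7%, or roughly 11.3 million of 200 million people; the 5.9% quoted above corresponds to an unrounded $R$ of about 1.063.

```python
def herd_immunity_threshold(r: float) -> float:
    """Proportion of the population that must be immune: 1 - 1/R for R > 1."""
    if r <= 1.0:
        return 0.0  # the outbreak declines without further immunization
    return 1.0 - 1.0 / r


population = 200_000_000  # population size used for the AR method
for r in (1.06, 1.063, 1.2):
    share = herd_immunity_threshold(r)
    print(f"R = {r}: {share:.1%} of the population, "
          f"{share * population / 1e6:.1f} million people")
```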
Whilst this outbreak is still emerging in Nigeria, choosing a date could be misleading, thereby increasing the uncertainty. To reduce this uncertainty, a sensitivity analysis is carried out. A deviance R-squared statistic is checked over the entire outbreak period. When the modelled incidence fits the observation optimally, the largest R-squared is recorded. As an example, in the EG approach (Figs. 6 and 7), the epidemic curve that optimally fits the exponential growth is between 1 and 53 time units, with a reproduction number estimate of 1.60 [1.58; 1.62] at the 95% confidence interval. However, this optimal reproduction number estimate (Fig. 6) is different from the default reproduction number estimate shown in Table 1. It is worth noting that choosing an optimal period reduces the variability between estimates [12]. The variation in estimates according to changes in the generation time distribution shows an oscillating pattern of estimates with the mean generation time (Fig. 8) [23]. However, the general pattern shows an increase in estimates with the mean generation time [24]. The epidemic curve shows a range between 1.28 and 1.40 for the reported reproduction ratio as the mean generation time increases from 7.80 to 12.80 days (Fig. 8). Therefore, it is pertinent to carry out a sensitivity analysis with respect to the mean generation time to quantify the associated uncertainties. As in the case of complexities resulting from overlapping generations, a joint estimation of the reproduction number and generation time distribution could be adopted [25]. Alternatively, periods with exponential growth could be chosen [12].
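The general increase of the estimate with the mean generation time can be illustrated under the EG relation $R = 1/m(-r)$. The sketch assumes a gamma-distributed generation time, for which $R = (1 + r\theta)^k$ with shape $k = (\text{mean}/\text{sd})^2$ and scale $\theta = \text{sd}^2/\text{mean}$. The gamma form, the growth rate and the standard deviation are hypothetical choices for illustration, not values from this study.

```python
def r_gamma_generation_time(growth_rate, mean_gt, sd_gt):
    """R = 1 / m(-r) when the generation time is gamma distributed."""
    shape = (mean_gt / sd_gt) ** 2
    scale = sd_gt ** 2 / mean_gt
    return (1.0 + growth_rate * scale) ** shape


# Hypothetical growth rate and standard deviation; mean swept from 8 to 13 days.
growth_rate, sd_gt = 0.03, 4.0
for mean_gt in range(8, 14):
    estimate = r_gamma_generation_time(growth_rate, mean_gt, sd_gt)
    print(f"mean generation time {mean_gt} days: R = {estimate:.3f}")
```

For a positive growth rate and a fixed standard deviation, the estimate rises monotonically with the mean generation time under this assumption.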

Comparing methods and projections
Based on the goodness of fit, the observed and forecasted incidence are compared in time. Fig. 9 shows the varying degrees of forecasting accuracy of each model. The AR and ML approaches consistently underestimate the observed incidence, while SB overestimates it. Satisfactory results are observed for the TD and MCMC approaches.
Using the MCMC approach, errors in the modelled incidence, cumulative death and cumulative confirmed cases are quantified based on a 30-day projection window using four error metrics (Table 2). The result is satisfactory, with Pearson's correlation coefficients of 0.66, 0.92 and 0.90 for incidence, cumulative death and cumulative confirmed cases respectively. Based on this result, the projections are made until December (Fig. 10). There is a continuous increase in the incidence, cumulative death and cumulative confirmed cases, with values approaching 1,000,000, 120,000 and 3,000,000 respectively. In Table 2, RSR is the ratio of RMSE to the standard deviation of the observations, rd is the relative index of agreement, PBIAS is the percentage bias and r is Pearson's correlation coefficient.

Relationship between meteorological parameters and COVID-19
The result of the Pearson's correlation between the outbreak and the meteorological variables in Lagos, considering a 7-day lag period (Fig. 11), shows that the correlation between the atmospheric pressure at sea level and the COVID-19 incidence is the highest, with a value of 0.71 at P ≤ 0.001. This means COVID-19 incidence increases significantly with increasing sea level pressure. However, it decreases with increasing maximum temperature (-0.34 at P ≤ 0.01). The lowest correlation is with the maximum sustained wind speed (-0.03, not significant). In general, the maximum temperature and atmospheric pressure at sea level are the main meteorological variables affecting the COVID-19 incidence in the country's epicentre. The interactions between these meteorological variables are not easily dissociated. For example, there is a negative correlation between rainfall and temperature during the wet season [26], while atmospheric pressure at sea level and temperature are negatively correlated. Historically, the highest maximum temperature occurs between March and May [26], whereas the highest atmospheric pressure at sea level occurs in July [27]. Given this relationship, more incidence and subsequent transmission of COVID-19 are expected in the coming months. Considering the present realities and future projections, there is a need for crucial preparations and assessments of the consequences and complications that could arise from the COVID-19 pandemic, as well as for dedicating relevant resources to understanding and managing its probable effects on the country's economy and on the livelihood and wellbeing of its citizens.

Conclusion
There is a continuous increase in the incidence, cumulative confirmed cases and cumulative deaths. This increase, especially in the incidence, is disturbing. Therefore, health officials, policymakers, the general public and the disease control agencies are reminded of the various risks (e.g. economic recession, social discomfort and insecurity) associated with a further increase and spread of COVID-19 cases. Timely intervention, holistic planning and effective preventive measures are urgently needed to mitigate a full-blown epidemic in the country.

Declarations
Funding: This research received no external funding.
Conflict of Interest: The authors declare no conflict of interest.
Figure 1 Spatial distribution of Nigerian COVID-19 confirmed (a) and death (b) cases on the 19th of June 2020.
Estimates of the reproduction ratio and goodness of fit for the observed incidence (step function) and model-predicted incidence for each method
Figure 6 Reproduction ratio sensitivity to the estimation time-period. The best fit is shown as a dot and the solid black lines show the 95% CI.
Reproduction number sensitivity to the generation time distribution. The vertical bars are the 95% confidence interval
Observed incidence (bar plot) and model-predicted incidence for each method
Figure 10 MCMC incidence, cumulative death, cumulative confirmed cases and their projections
Figure 11 Correlation plot between daily meteorological parameters and COVID-19 incidence between March and May with a 7-day lag in COVID confirmed cases. * represents P ≤ 0.05, ** represents P ≤ 0.01, *** represents P ≤ 0.001; otherwise, the correlation is not significant. TM is daily observed maximum temperature (°C), SP is atmospheric pressure at sea level (hPa), H is humidity (%), PP is rainfall (mm) and VM is maximum sustained wind speed (km/h)