We used a logistic curve to model the spread of COVID-19. The logistic curve used in Prophet is a sigmoidal (S-shaped) curve that models growth in three phases: (i) a first phase in which the quantity increases gradually at the start, (ii) a second phase in which it grows more quickly in the intermediate period, and (iii) a third phase in which growth slows again at the end and the curve approaches a saturation level (referred to as the carrying capacity). Sigmoidal curves of this type are often used to describe biological growth patterns that begin with a period of exponential growth and subsequently approach a maximum level. The pattern followed by COVID-19 is similar to this curve: a large number of people are infected in the early stages, and growth subsequently slows as various preventive measures are implemented.
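The three phases can be illustrated numerically. The sketch below assumes one common three-parameter form of the logistic function, f(x) = c / (1 + exp(-(x - b) / a)), where b is the day of fastest growth and c is the carrying capacity; this form and the parameter values are illustrative assumptions, not values taken from the analysis.

```python
import math


def logistic(x, a, b, c):
    """Assumed form: a = growth time scale, b = inflection day, c = carrying capacity."""
    return c / (1.0 + math.exp(-(x - b) / a))


def daily_increase(day, a, b, c):
    """Growth between one day and the next."""
    return logistic(day + 1, a, b, c) - logistic(day, a, b, c)


# Illustrative parameters only (hypothetical, not fitted to any data).
a, b, c = 8.0, 70.0, 1000.0

early = daily_increase(10, a, b, c)   # phase (i): slow start
middle = daily_increase(b, a, b, c)   # phase (ii): fastest growth near day b
late = daily_increase(130, a, b, c)   # phase (iii): flattening towards c

print(f"daily increase early={early:.3f}, middle={middle:.3f}, late={late:.3f}")
print(f"value at inflection day: {logistic(b, a, b, c):.1f} (half of c={c})")
```

Under this parametrization the curve passes through half the carrying capacity on day b, which is why the position of the fastest growth day relative to the current day indicates which part of the curve the data are in.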
The Prophet model performs well when the data show non-linear trends or non-linear growth patterns. Fundamentally, Prophet uses an additive regression model in which the trend can be modelled either as piecewise linear or as a logistic growth curve. We tried both options and found that the logistic trend gives much better estimates of the spread of COVID-19. Intuitively, too, modelling the spread of COVID-19 with a logistic curve is more realistic. The motivation of this analysis is to identify whether COVID-19 in India is still growing exponentially or whether growth has started to flatten. Fitting India's data to the logistic model is necessary because a country in the initial part of the curve is still seeing COVID-19 grow at a fast rate, whereas a country in the second part of the curve is approaching its saturation point. To ascertain this, the growth of COVID-19 in the available data is modelled with a logistic function, and the growth over the next few days is forecast with Prophet.
a. The data frame used in Prophet needs two columns: (i) ds, which stores the dates of the time series, and (ii) y, which stores the corresponding values of the time series. The carrying capacity parameter (c) represents the maximum number of infections that the virus can cause, i.e. the saturation point of its growth, and is estimated via the logistic model using the equation:
b. To demonstrate COVID-19 spread using this equation, the three parameters a, b and c are computed by nonlinear least-squares estimation using the curve_fit routine of the SciPy optimization library for Python. The logistic curve is then fit to the dataset using these parameters and compared with the actual number of confirmed cases at each time instant (refer Table 1 and Figure 1).
c. Before feeding the dataset to the Prophet model to forecast growth, we need to find the estimated carrying capacity c, which is computed as follows:
Rule 1: When the fastest growth day is still ahead, it implies that the growth is still increasing. In this case add ten days after finding the fastest growth day.
Rule 2: When the fastest growth day is in the past, it means that growth has stabilized. In this case use the current day and add ten days to find the estimated highest number of infections.
d. The fastest growth day identified by our logistic model is 18th May, which is in the past because we analyzed data up to 19th May; therefore Rule 2 applies to our model. The parameter c is 194401.71436567788, which means that the maximum number of infections in India would be approximately 194,402. The Prophet model was then fit using this value of c (shown by the horizontal line at the top of Figure 2). From the forecast curve (Figure 2) it can be inferred that the growth of COVID-19 in India is expected to stabilize after July 10th (refer Table 2).
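Steps b to d can be sketched as follows. This is a minimal illustration, assuming the logistic form c / (1 + exp(-(x - b) / a)) and a synthetic, noise-free series in place of the confirmed-case data; the helper `reference_day` is a hypothetical name for one reading of Rules 1 and 2. How the reference day is then converted into the estimated carrying capacity is not spelled out in the text, so the sketch stops at the reference day.

```python
import numpy as np
from scipy.optimize import curve_fit


def logistic(x, a, b, c):
    """Assumed form: a = growth time scale, b = fastest growth day, c = carrying capacity."""
    return c / (1.0 + np.exp(-(x - b) / a))


def reference_day(fastest_day, last_day, offset=10):
    """Rule 1: peak still ahead -> peak day + offset.
    Rule 2: peak already passed -> last observed day + offset."""
    base = fastest_day if fastest_day > last_day else last_day
    return base + offset


# Synthetic stand-in for the confirmed-case series (NOT the data of this study).
days = np.arange(110, dtype=float)
true_a, true_b, true_c = 8.0, 70.0, 190000.0
cases = logistic(days, true_a, true_b, true_c)

# Nonlinear least squares; p0 is a rough, hand-picked starting guess.
p0 = [5.0, 60.0, cases.max()]
(a, b, c), _ = curve_fit(logistic, days, cases, p0=p0, maxfev=10000)

last_day = days[-1]
target = reference_day(b, last_day)
print(f"a={a:.2f}, b={b:.2f}, c={c:.0f}, reference day={target:.0f}")
```

With real data the fit is sensitive to the starting guess and to how much of the curve has been observed, so the fitted c should be read together with its uncertainty (the second value returned by `curve_fit` is the parameter covariance matrix).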
Table 2: Growth Stabilization of COVID-19 in India
Expected Stabilization Period
After July 10th
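A logistic-growth fit in Prophet with a fixed carrying capacity can be sketched as below. The helper names `build_rows` and `forecast_logistic` are hypothetical, and the sketch assumes the current `prophet` package name (releases from 2020 were distributed as `fbprophet`). Prophet requires a `cap` column in both the training frame and the future frame when `growth="logistic"` is used.

```python
from datetime import date, timedelta


def build_rows(first_day, counts, cap):
    """One record per day with the columns Prophet expects for logistic growth."""
    return [
        {"ds": first_day + timedelta(days=i), "y": float(n), "cap": float(cap)}
        for i, n in enumerate(counts)
    ]


def forecast_logistic(rows, cap, periods):
    """Fit a logistic-trend Prophet model and forecast `periods` days ahead."""
    # Imported lazily so build_rows can be used without Prophet installed.
    import pandas as pd
    from prophet import Prophet

    df = pd.DataFrame(rows)
    df["ds"] = pd.to_datetime(df["ds"])

    model = Prophet(growth="logistic")
    model.fit(df)

    future = model.make_future_dataframe(periods=periods)
    future["cap"] = cap  # the same carrying capacity applies to forecast dates
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# Placeholder counts for illustration only (not the confirmed-case data).
rows = build_rows(date(2020, 1, 30), [1, 1, 2], cap=194401.71436567788)
print(rows[0])
```

Calling `forecast_logistic(rows, cap, periods)` on the full confirmed-case series would return the forecast with its lower and upper bounds, from which a curve such as Figure 2 could be drawn.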
To evaluate the model fit, we used cross-validation. The first case of COVID-19 in India was confirmed on 30th January and the data were collected up to 19th May, giving 110 days of data. The initial 100 days (training data) were used to train the model and the remaining 10 days (test data) were used to evaluate it (refer Table 3). The outcomes obtained are shown in Table 4.
Table 3: Parameters of Cross Validation
Table 4 shows the predicted versus actual values together with the lower and upper bounds of the confidence interval. Data up to 9th May were used for training, hence the cutoff date is 2020-05-09. The error diagnostics of the model are shown in Table 5. The minimum error is reported at a horizon of 8 days, meaning that the number of confirmed cases was predicted most accurately 8 days after the cutoff date.
Table 4: Cross validation: Predicted (yhat) vs. Actual (y) values of Confirmed Cases
Table 5: Error Diagnostics of the Confirmed Cases
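The split and the per-horizon error can be sketched with the standard library alone. The dates below come from the text; the `toy_actual` and `toy_predicted` values are made-up placeholders, and the absolute percentage error is used here as one plausible error measure, a simplified stand-in for Prophet's own diagnostics rather than a reproduction of Table 5.

```python
from datetime import date, timedelta

first_case = date(2020, 1, 30)
last_observation = date(2020, 5, 19)
train_days = 100

# Training ends 100 days after the first case; the rest is held out.
cutoff = first_case + timedelta(days=train_days)
horizon_days = (last_observation - cutoff).days


def ape_by_horizon(actual, predicted):
    """Absolute percentage error for each day after the cutoff (horizon 1, 2, ...)."""
    return {
        h: abs(p - a) / a
        for h, (a, p) in enumerate(zip(actual, predicted), start=1)
    }


# Placeholder values for illustration only (not results of this study).
toy_actual = [100, 110, 121, 133, 146]
toy_predicted = [98, 112, 120, 130, 150]

errors = ape_by_horizon(toy_actual, toy_predicted)
best_horizon = min(errors, key=errors.get)
print(f"cutoff={cutoff}, horizon={horizon_days} days, lowest error at horizon {best_horizon}")
```

The horizon counts days forward from the cutoff, so a minimum at a given horizon identifies the forecast distance at which the held-out values were matched most closely.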
Significant Inferences from Data Analysis:
Lockdown 4.0 is in force until May 31st. The confirmed-case data fit a logistic curve. These data were collected in a controlled environment (lockdown, social distancing, etc.). From the logistic curve it can be inferred that India is in the second part of the logistic growth curve, so the number of confirmed cases is going to stabilize soon, provided the controlled conditions are sustained. The preventive measures have therefore been successful in the controlled setup. The maximum number of people that can be infected by this virus, as inferred from the logistic model, is 194401.71436567788. This means that if the controlled conditions continue to prevail, the growth of COVID-19 will become stable after 10th July and the number of confirmed cases will come close to the carrying capacity. Also, as is evident from the figures below, the number of confirmed cases on May 31st will be 1,44,335 (refer Table 6), which is much less than the carrying capacity of the logistic model (194401.71436567788); therefore the strict preventive measures should be continued at least until 10th July.
Table 6: Forecast of Confirmed Cases in India