Analysis and Prediction of Epidemic Infectious Diseases Based on SEIDR Model -- A Case Study of COVID-19 in New York

In the prevention and control of infectious diseases, improper measures can easily lead to a large-scale epidemic. However, epidemics follow certain rules, so simulating the spread of infectious diseases is necessary and can provide a reference for the formulation of prevention and control measures. This paper proposes a SEIDR model for analyzing and predicting epidemic infectious diseases. Taking the development of COVID-19 in New York City as an example, the SEIDR model proposed in this paper was first compared with the traditional SIR model, and the SEIDR model was found to perform better. Then the SEIDR model and the L-BFGS optimization method were used to fit the early transmission data of COVID-19 in New York City, and important parameters such as the infection rate, latent morbidity rate, disease-related mortality and recovery rate were obtained. Moreover, the value of the basic reproduction number $R_0$, between 4.0 and 4.6, indicated that the situation of COVID-19 in New York City was relatively serious. Finally, these parameters were used to predict the future development of COVID-19 in New York City, and the turning point of COVID-19 in New York City was found. However, even after the turning point is reached, the development trend of COVID-19 will not be controlled in the short term. Data verification shows that the SEIDR model established in this paper can provide a scientific quantitative index for governments in the prevention and control of COVID-19 and other epidemic infectious diseases.

including the economy and people's livelihood. Therefore, it is urgent to analyze and predict COVID-19. Only by understanding the development trend of COVID-19 can we make a more effective response policy.
The United States is currently the country with the largest number of confirmed COVID-19 cases in the world, so this paper takes the COVID-19 data from New York State as an example to analyze and predict COVID-19.
Since the data has a certain lag, there will be slight errors.
The classic models for the spread of infectious diseases include the SI, SIS, SIR, SIRS, and SEIR models. These models divide the population into groups such as susceptible, exposed, infected, and recovered, and use the transmission mechanism from one group to another group or multiple
handled, and will not be infected by infected people at the same time, it has no effect on the changes in the number of other groups, so in terms of the probability of infection, its significance is the same as that of the recovered. Therefore, the removed group R at this time includes both the deaths and the recovered. Each of these groups also experiences natural deaths, so the corresponding number of natural deaths is subtracted from the change in each group.

L-BFGS optimization method
When the scale of the optimization problem is relatively small, the basic optimization algorithms (gradient descent, coordinate descent, Newton's method, and the quasi-Newton method) can solve the problem well, but each has its own shortcomings. Gradient descent relies only on the gradient of the objective function and converges linearly, so on ill-conditioned or large-scale problems its convergence is extremely slow and the algorithm becomes almost unusable. Coordinate descent does not need the gradient of the objective function, but its convergence is still very slow, which gives it a very limited range of use.
Newton's method is based on the Hessian matrix of the objective function. It converges faster and needs fewer iterations; in particular, its convergence near the optimal value is quadratic. However, the inverse of the Hessian matrix must be calculated in each iteration, so when the problem is large or the Hessian matrix is dense, each iteration requires a huge amount of computation, and a large amount of memory is needed to store the matrix. The quasi-Newton method introduces an approximation of the Hessian matrix on the basis of Newton's method, which avoids calculating the inverse of the Hessian matrix in each iteration. The convergence speed of the quasi-Newton method is slightly lower than that of Newton's method and lies between gradient descent and Newton's method.
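The contrast between linear and quadratic convergence can be illustrated with a minimal sketch. The test function, step size and tolerances below are assumptions chosen for illustration and do not come from this paper; the Hessian is diagonal so that the Newton step reduces to an elementwise division.

```python
def gradient_descent(grad, x, step, tol=1e-8, max_iter=100_000):
    """Fixed-step gradient descent; returns (x, number of iterations used)."""
    for k in range(max_iter):
        g = grad(x)
        if max(abs(gi) for gi in g) < tol:
            return x, k
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, max_iter


def newton_diagonal(grad, hess_diag, x, tol=1e-8, max_iter=100):
    """Newton's method for a function whose Hessian is diagonal."""
    for k in range(max_iter):
        g = grad(x)
        if max(abs(gi) for gi in g) < tol:
            return x, k
        h = hess_diag(x)
        # With a diagonal Hessian, H^{-1} g is an elementwise division.
        x = [xi - gi / hi for xi, gi, hi in zip(x, g, h)]
    return x, max_iter


# Ill-conditioned quadratic f(x) = 0.5 * (x0^2 + 1000 * x1^2) (assumed example).
SCALES = [1.0, 1000.0]


def grad(x):
    return [s * xi for s, xi in zip(SCALES, x)]


def hess_diag(x):
    return SCALES


# The step size is limited by the largest curvature, so progress along x0 is slow.
_, gd_iters = gradient_descent(grad, [1.0, 1.0], step=1.0 / 1000.0)
_, newton_iters = newton_diagonal(grad, hess_diag, [1.0, 1.0])
print("gradient descent iterations:", gd_iters)
print("Newton iterations:", newton_iters)
```

On this quadratic, Newton's method reaches the minimum in a single step, while gradient descent needs many thousands of iterations because its step size is dictated by the steepest direction.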
Similar to Newton's method, when the scale of the optimization problem is large, the approximate matrix of the quasi-Newton method is dense, and calculating and storing it consumes a lot of memory. Moreover, Newton's method cannot guarantee that the Hessian matrix of the objective function is positive definite in each iteration. If the Hessian matrix is not positive definite, the optimization direction deviates and Newton's method fails. The quasi-Newton method is better than Newton's method in this respect because it uses an approximation of the inverse Hessian matrix instead of the Hessian matrix itself. Although an iteration is not guaranteed to give the optimal direction, the approximation matrix is kept positive definite, so the algorithm always searches in a descent direction. It can be seen from the above that when the scale of the optimization problem is large, none of the above algorithms can solve the problem well. In practical applications, many problems are ill-conditioned, which causes gradient-based algorithms to fail: even if the algorithm is iterated thousands of times, it will not necessarily reach a good convergence result. A large amount of data also consumes a lot of memory to calculate and save the matrix when using the quasi-Newton method.
The iteration of BFGS is given by Eq (2), with its terms defined in Eq (3), (4), (5) and (6). L-BFGS keeps only a limited amount of curvature information: the oldest curvature information is discarded and the latest curvature information is saved, so that the saved curvature information always comes from the most recent m iterations.
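The idea of keeping curvature information only from the most recent m iterations can be sketched with the standard textbook two-loop recursion. This is a minimal illustration, not the implementation used in this paper: the function names, the backtracking line search, the curvature check and the test function are all assumptions.

```python
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(g, history):
    """Two-loop recursion: approximate -H*g from the stored (s, y) pairs."""
    q = list(g)
    alphas = []
    for s, y in reversed(history):  # newest pair first
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append((a, rho))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    if history:  # scale by an estimate of the inverse curvature
        s, y = history[-1]
        scale = dot(s, y) / dot(y, y)
        q = [scale * qi for qi in q]
    for (s, y), (a, rho) in zip(history, reversed(alphas)):  # oldest pair first
        b = rho * dot(y, q)
        q = [qi + (a - b) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]


def lbfgs_minimize(f, grad, x, m=5, tol=1e-8, max_iter=500):
    # deque(maxlen=m) drops the oldest (s, y) pair automatically when full.
    history = deque(maxlen=m)
    g = grad(x)
    for _ in range(max_iter):
        if max(abs(gi) for gi in g) < tol:
            break
        d = lbfgs_direction(g, history)
        # Backtracking line search with the Armijo sufficient-decrease test.
        t, fx, slope = 1.0, f(x), dot(g, d)
        while t > 1e-12 and f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * slope:
            t *= 0.5
        x_new = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        # Store the pair only if it has positive curvature, which keeps
        # the implicit inverse-Hessian approximation positive definite.
        if dot(y, s) > 1e-10 * dot(s, s):
            history.append((s, y))
        x, g = x_new, g_new
    return x


def quadratic(x):
    return 0.5 * (x[0] ** 2 + 1000.0 * x[1] ** 2)


def quadratic_grad(x):
    return [x[0], 1000.0 * x[1]]


x_min = lbfgs_minimize(quadratic, quadratic_grad, [1.0, 1.0])
print(x_min)
```

Only m pairs of vectors are stored, so memory grows linearly with the number of parameters rather than quadratically as with a dense approximation matrix.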
Formula (8) can be deduced from formula (2). Substituting formula (8) into formula (2) and combining formulas (2)-(9), the result after m iterations can be deduced. The result represents the outcome we need to predict, that is, how the numbers of people who are susceptible, exposed, infected and recovered with COVID-19, and of those who died due to illness, will be distributed over time in the future.

Proposed SEIDR Model
Under normal circumstances, $R_0 > 1$; otherwise the disease will not spread and will be eliminated in evolution. According to the data consulted, the larger the value of $R_0$, the more serious the situation at the peak of the epidemic. This paper defines the basic reproduction number of COVID-19 in New York City as $R_0$.
Table 1
In the traditional SIR model of infectious diseases, only three groups are considered: susceptible persons, infected persons, and removed persons. For details, please refer to Fig. 1 and the system of Eq (1). This model cannot properly account for exposed people who carry the virus and have been infected but have not yet been detected. In order to better analyze and predict the situation of COVID-19 in New York City, the following assumptions are made:
- The spread takes place in a closed system, that is, the entire New York City, and everyone has the same probability of being infected;
- Since COVID-19 has an incubation period, the incubation period must be considered for transmission. Therefore, a latent morbidity rate of COVID-19 is introduced, and the number of patients in the incubation period is this rate multiplied by E;
- Some people infected with COVID-19 will be cured or die at any time. To distinguish the rate of death due to illness from the rate of recovery due to treatment, the removed people of the traditional model are subdivided into D, those who died of illness, and R, those who recovered. The rate of death due to illness ε is introduced, and the number of deaths due to disease is εI;
- It is important to note that the rate associated with R is the recovery rate, not the removal rate.
This paper analyzes and forecasts in units of days. For example, the infection rate can be understood as the probability that a susceptible person who is in contact with an infected person today will be found to be a suspected or confirmed case tomorrow. Similarly, the latent morbidity rate, the disease-related mortality and the recovery rate are also fitted in units of days.
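The assumptions above can be turned into a small simulation. The sketch below is only an illustration: the system of equations of the paper is not preserved in this text, so the compartment flows, the symbols beta (infection rate), alpha (latent morbidity rate) and gamma (recovery rate), and all parameter values are assumptions. Only eps (disease-related mortality) and v (natural mortality) follow names used in the text. One forward Euler step corresponds to one day.

```python
def seidr_step(state, beta, alpha, eps, gamma, v, n, dt=1.0):
    """Advance an assumed SEIDR system by one forward Euler step of length dt."""
    s, e, i, d, r = state
    new_exposed = beta * s * i / n          # susceptible people who become exposed
    ds = -new_exposed - v * s
    de = new_exposed - (alpha + v) * e      # exposed people become infected at rate alpha
    di = alpha * e - (gamma + eps + v) * i  # infected people recover, die, or die naturally
    dd = eps * i                            # deaths due to disease
    dr = gamma * i - v * r                  # recoveries
    return (s + dt * ds, e + dt * de, i + dt * di, d + dt * dd, r + dt * dr)


def simulate_seidr(state, days, **params):
    """Return the list of daily states, starting with the initial state."""
    trajectory = [state]
    for _ in range(days):
        state = seidr_step(state, **params)
        trajectory.append(state)
    return trajectory


# Hypothetical parameters and initial state, for illustration only.
trajectory = simulate_seidr(
    (999_990.0, 0.0, 10.0, 0.0, 0.0),
    days=100,
    beta=0.4, alpha=0.2, eps=0.01, gamma=0.1, v=0.0, n=1_000_000.0,
)
peak_day = max(range(len(trajectory)), key=lambda k: trajectory[k][2])
print("day of peak infections in this toy run:", peak_day)
```

With v set to zero, the five groups always sum to the initial population, which is a quick sanity check on the closed-system assumption.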
Under the above assumptions, the traditional SIR model is modified and the SEIDR model is obtained. Note: in Eq (12), the loss value of each group should be divided by the length of the corresponding data set, but for simplicity it is written in the form of Eq (12). From Eq (11), we can find that when $I'(t) = 0$, there is:

Model Analysis of SEIDR
According to equation (13), a relation between $S(t)$, $I(t)$, $E(t)$, $N$ and the natural mortality rate $v$ is obtained as equation (14). Substituting equation (14) into equation (13) gives equation (15), and solving equation (15) gives equation (16). In summary, when equation (16) is satisfied, the number of people infected with COVID-19 will reach a peak, and then the number of people infected will drop; that is, COVID-19 will reach a turning point at this time.
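How a turning-point condition of this kind can be evaluated is sketched below. The equations of the paper are not preserved in this text, so the sketch assumes a SEIDR system with E'(t) = beta*S*I/N - (alpha + v)*E and I'(t) = alpha*E - (gamma + eps + v)*I, where beta, alpha and gamma are assumed symbols for the infection rate, latent morbidity rate and recovery rate. Setting both derivatives to zero and eliminating E gives S = N*(alpha + v)*(gamma + eps + v)/(alpha*beta) under these assumptions; this is not necessarily identical to equation (16).

```python
def susceptible_threshold(n, beta, alpha, eps, gamma, v=0.0):
    """Value of S at which E'(t) = 0 and I'(t) = 0 hold together
    in the assumed SEIDR system described in the lead-in."""
    if beta <= 0 or alpha <= 0:
        raise ValueError("beta and alpha must be positive")
    return n * (alpha + v) * (gamma + eps + v) / (alpha * beta)


# Hypothetical values, for illustration only.
base = susceptible_threshold(n=1000.0, beta=0.5, alpha=0.2, eps=0.05, gamma=0.2)
lower_infection = susceptible_threshold(n=1000.0, beta=0.4, alpha=0.2, eps=0.05, gamma=0.2)
higher_recovery = susceptible_threshold(n=1000.0, beta=0.5, alpha=0.2, eps=0.05, gamma=0.3)
print(base, lower_infection, higher_recovery)
```

In this sketch the threshold rises when the infection rate falls or the recovery rate rises, so the turning point is reached while more people are still susceptible.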
Observing equation (16) carefully, it is found that the number of susceptible people S can be increased by controlling the latent morbidity rate, the infection rate, the recovery rate, the disease-related mortality and the natural mortality rate v. This paper does not treat the natural mortality rate v as a research objective, so the best case is to reduce the infection rate and the latent morbidity rate and to increase the recovery rate and the mortality rate due to illness as soon as possible, so that the value of S in equation (15)
Similarly, we can get:
This paper uses days as the unit of time, so the change of each group will be relatively small. The number of natural deaths per day is extremely small and not enough to affect the COVID-19 trend. Therefore, in the analysis of the trend of the number of deaths due to disease in equation (16), the influence of natural death factors is not considered for the time being.

 
When the effects of natural death factors are temporarily ignored, both $D(t)$ and $R(t)$ change with $I(t)$, so a change in $I(t)$ corresponds to changes in $D(t)$ and $R(t)$, and it can be found that the curve of $I(t)$ will be smoother than those of $D(t)$ and $R(t)$.

Introduction of experimental data
Data is indispensable for the analysis and prediction of COVID-19, and the authenticity of the data source is extremely important. Conclusions drawn from wrong data may differ considerably from the facts, and a lack of real data also introduces deviation into the model's predictions. In order to analyze and predict the trend of COVID-19 more accurately, the COVID-19 data must be timely, complete and accurate. The COVID-19 data for New York City used in this paper span from 2020-04-01 to 2020-10-26. The data come from two sources. One part comes from NetEase Internet, whose data are updated in real time.
We use the Uniform Resource Locator (URL) to obtain the location and access method of the resource on the Internet, send requests to that address and save the data needed. The other part comes from the official data released by the New York Department of Health. In particular, because it takes a certain time for the health department to report data, the data we use (including the number of confirmed cases, the number of deaths, etc.) may have a certain lag, but the lag is small.
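The request-and-save step described above could look roughly like the following sketch. The URL, the CSV layout and the column names date, confirmed and deaths are hypothetical placeholders, since the paper does not give the actual endpoints or file formats.

```python
import csv
import io
import urllib.request


def fetch_text(url, timeout=30):
    """Send a request to the given URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_daily_counts(csv_text):
    """Parse CSV text with the assumed columns date, confirmed, deaths."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "date": row["date"],
            "confirmed": int(row["confirmed"]),
            "deaths": int(row["deaths"]),
        })
    return rows


def save_text(text, path):
    """Save the downloaded data so later runs do not need the network."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical usage; replace the placeholder with a real data address:
# text = fetch_text("https://example.org/nyc-covid-daily.csv")
# save_text(text, "nyc_daily.csv")
# rows = parse_daily_counts(text)
```

Keeping the parsing separate from the download makes it possible to rerun the analysis on the saved file when the online source changes.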

Check out trends in COVID-19
In order to understand the real situation of COVID-19 in New York City, we first visualize the distribution of the various groups in New York City, as shown in Fig. 4:

Prove that the proposed SEIDR model is superior to SIR model
Table 2
It should be pointed out that since the loss value is the sum of the mean square errors between the real values and the predicted values of several groups, it may seem rather large, but it is only the reference standard we use to measure how much better the SEIDR model is than the SIR model.
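A loss of the kind described here, the sum over groups of the mean square error between real and predicted values, can be sketched as follows. The group names and the numbers are made up for illustration, and the dictionary-based interface is an assumption rather than the paper's implementation.

```python
def mean_square_error(actual, predicted):
    """Mean of squared differences between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def total_loss(observed, simulated):
    """Sum of per-group mean square errors over the groups in `observed`."""
    return sum(
        mean_square_error(observed[name], simulated[name]) for name in observed
    )


# Made-up series for two groups (infected I and deaths D).
observed = {"I": [1.0, 2.0, 3.0], "D": [0.0, 1.0, 2.0]}
simulated = {"I": [1.0, 2.0, 5.0], "D": [0.0, 0.0, 2.0]}
loss = total_loss(observed, simulated)
print(loss)
```

Because the errors of all groups are added together without normalization, groups with large counts dominate the total, which is one reason the loss can look large in absolute terms.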

Fitting parameters
Different training sets were selected from the same data source for 12 comparative experiments, each fitting the four parameters: infection rate, latent morbidity rate, disease-related mortality, and recovery rate. The fitting results are shown in Table 3.
region are effectively fitted, and the basic reproduction number $R_0$, which can measure the severity of infectious diseases, is obtained. The model also predicted the development trend of the epidemic in New York City. According to the projections, with the same level of prevention and control, the situation of COVID-19 in New York City will reach a turning point on February 15, 2021, but it will not be completely controlled in the short term, which requires strengthening epidemic prevention and improving medical standards. The predictions in this paper assume that the local epidemic status quo is maintained. That is, if the government adopts better treatment policies, the future epidemic situation will be better than the results predicted by this paper.
However, this paper does not take into account the reinfection of recovered people, nor the isolation of infected and exposed people. If these factors were taken into account, the COVID-19 situation could be simulated better and the prediction results would be more accurate. Since no data are currently available in this area, it is difficult to study, and we will continue this work if there is an opportunity in the future.
Availability of data and material. Since this paper predicts the trend of COVID-19 in New York, there is no data available to support the results in this paper. Moreover, this paper extends from COVID-19 to epidemiology as a whole, and the SEIDR model proposed in this paper can be used to fit the important parameters of a disease for subsequent studies.
Competing interests. The authors declare that they have no competing interests.
Funding. Not applicable for the moment.

Authors' contributions.
We propose a SEIDR model for analyzing and predicting epidemic infectious diseases.
And we fit the early transmission data of COVID-19 in