Counting the uncounted: estimating the unaccounted COVID-19 infections in India

Undetected infectious populations have played a major role in the COVID-19 outbreak across the globe, and estimating this undetected class is a major concern in understanding the actual size of the COVID-19 infections. Due to the asymptomatic nature of some infections, many cases have gone undetected. Also, despite carrying COVID-19 symptoms, most of the infected population kept the infections hidden and stayed unreported, especially in a country like India. Based on these factors, we have added an undetected compartment to the already developed SEIR model [48] to estimate these uncounted infections. In this article, we have applied a Physics Informed Neural Network (PINN) to estimate the undetected infectious populations in the 20 worst-affected Indian states as well as in India as a whole. The analysis has been carried out for the first as well as the second surge of COVID-19 infections in India. A ratio of the active undetected infectious population to the active detected infectious population is calculated through the PINN analysis, which gives a picture of the real size of the pandemic in India. The rate at which the symptomatic infectious population goes undetected and is never reported is also estimated using the PINN method. Toward the end, an artificial neural network (ANN) based forecasting scenario of the pandemic in India is presented. The prediction is found to be reliable, as the neural network has been trained using unique features obtained from the state-wide analysis of the newly proposed model as well as from the PINN analysis.


Introduction
The impact of the COVID-19 pandemic has been devastating and almost all countries have experienced this new disease, primarily in different phases at different times. Researchers around the globe are making a constant effort to understand the dynamics of the transmission of the disease. We have also seen extraordinary efforts to roll out the COVID-19 vaccines within a very short time, which could save millions of lives. Many COVID-19-specific mathematical models have been developed over the course of the pandemic which can help us understand the way the COVID-19 disease has spread [42,28,8,12,35,16]. These results can help policymakers devise strategies to suppress the spread of similar diseases in the future. Artificial neural networks (ANN), which have wide-ranging applicability due to their universality, have already been used for disease outbreak prediction [41,9,52] and detection [57,43,30], including for COVID-19. Many studies have also found the method of 'fractional derivative' to be efficient in disease modelling [22,14,17,15,20,18,21,19].
Most of the countries in the world have experienced the COVID-19 disease in distinct phases. India has experienced the pandemic in two distinct phases: the first phase effectively started in March 2020 and the second surge of infections started at the end of February 2021. In India, the second phase of the pandemic was observed to be more contagious and lethal compared to the first phase. The spike in COVID-19 cases in the second phase is now believed to be mainly due to the δ-variant of the SARS-CoV-2 virus. New strains of the virus such as the double mutant variant (B.1.617) and the triple mutant variant (B.1.618) were seen to be more vulnerable when the population density is high [3]. The δ-variant of the SARS-CoV-2 virus (B.1.617.2), which was detected in India towards the end of 2020, was highly infectious and soon became a key reason behind the sudden surge of infections during the second phase of the pandemic [53]. Unlike in the first phase of the pandemic, more young people were affected in the second phase. Till mid-December 2021, India recorded ∼ 35 million confirmed COVID-19 cases with 476 thousand deaths, while the tally of confirmed cases crossed 270 million globally with more than 5.3 million deaths [2].
India, being the most populous country in the world now, presents certain unique scenarios as far as the COVID-19 pandemic is concerned. Considering the size of the population in a country like India, it is of great interest to correctly model the progress of a pandemic in order to understand the overall global scenario; at the same time, we have to deal with some uncertainties regarding detailed and correct patient data available through verifiable sources. In the case of the COVID-19 pandemic, one primary concern in this regard was the actual number of infected people against what was reported or available officially. There are many reasons why a discrepancy may exist between these two, many of which probably cannot be controlled due to the widely varying social, economic, and geographical conditions that exist in the country. Our aim in this work is to see what might be the actual size of the infected population, many of whom might not be counted officially, in a pandemic which is still evolving. This may help us understand situations in other countries with similar social and economic conditions and help us understand future pandemic progression. In this work, we have used our previously and successfully developed SEIR (susceptible-exposed-infectious-removed) model [48] and have come up with a new SEIUR (susceptible-exposed-infectious-undetected-removed) model, which is solved with the help of an artificial neural network (ANN). We have applied a relatively new technique known as the Physics Informed Neural Network (PINN) [31] to estimate the undetected infectious population and have also used this technique to determine the unknown parameters of the model, a procedure known as parameter discovery. Based on our results from the PINN model, we have also proposed a PINN-ANN-based prediction scenario, which might be used during similar situations in the future. We note that the artificial neural network (ANN) is rapidly becoming a general-purpose tool which has found wide-ranging applications in extremely diverse situations. In this work, we have applied the PINN analysis for the so-called 'parameter discovery' of our newly developed SEIUR model, which itself is based on our very successful SEIR model applied to the COVID-19 outbreak in India. Though all the methods have individually been tested and developed already, this is the first time that PINN is being applied for parameter discovery of an epidemiological model.
In section 2, we discuss the development of the SEIUR model and calculate the basic reproduction number. In section 3, modelling using the neural network is carried out and the model is solved using the PINN tool for India as well as for 20 Indian states. In section 4, we present a possible forecast scenario of the COVID-19 outbreak in India based on the first and second phases of the pandemic. In section 5, we conclude.

The SEIR model
In a disease like COVID-19, the role of the 'exposed' population becomes very important in mathematical modelling. In our earlier work, we proposed an SEIR model [48] (referred to as the SEIR model hereafter), designed specifically to deal with the unavailability of detailed patient data, which has been successfully applied to predict the progress of the pandemic in the first phase (from March 2020 to February 2021), well beyond the available data [48]. The equations of this model are given by [equations not recovered] where S, E, I, and R are the susceptible, exposed, infectious, and removed populations, respectively, at any point of time. The removed compartment includes the population who have recovered from the disease or have died from the disease. The variables satisfy the condition [equation not recovered]. In this model, we have taken care of the detected asymptomatic population with the zero incubation period [48] and linked them directly to the infectious group without taking an extra compartment for the asymptomatic infectious population. Here, the asymptomatic populations are those who do not develop any symptoms but are infectious and can be detected only via contact tracing or random testing. The two transmission rates β_t (disease transmission rate) and ρ_t (asymptomatic transmission rate) are assumed to be time-dependent piecewise functions. Piecewise functions are used to capture the changes in an ongoing pandemic due to varying conditions. The incubation period is expressed through a time-dependent function, which starts at a very large value and approaches a constant value of 14 days. The parameters δ and γ^{-1} are the natural death rate and the recovery period, respectively. As mentioned before, one primary concern is to estimate the so-called undetected population, which can be sizable and never contributes to the officially recorded number of patients. It is very important to distinguish the undetected population from the detected asymptomatic population. While the latter is the asymptomatic population that is detected via random testing and recorded officially as infected, the former is the asymptomatic population that has gone undetected. In both cases, they do not develop any symptoms but are infectious.

The SEIUR model

Assumptions
So far, India has seen three distinct phases of the COVID-19 pandemic: the first phase, which lasted from January 2020 to February 2021; the second phase, which started in late February 2021, peaked around mid-April 2021, and subsided with a very prolonged and tapered tail till the end of December 2021; and the third phase, which started thereafter. The second phase of the pandemic was believed to have been driven by the δ-variant of the SARS-CoV-2 virus, which caused a huge surge of infections within a very short period of time [53]. In this paper, we have only considered the first and second phases of the pandemic.
Despite the visible difference between the two phases of the pandemic, the overall ambient socioeconomic conditions during both phases of the pandemic remained almost the same. As the vaccination of the population did not start till late March 2021, it can be assumed with considerable accuracy that vaccination had only a small role to play in both phases of the pandemic. With these observations, we can safely assume that the primary nature of the pandemic in both phases remained the same, with only a larger infection rate during the second phase due to the highly infectious δ-variant of the virus. As such, our compartmental model should remain the same, with a difference only in the infection rates. We further assume that all the assumptions that have been made in our previously described model (which was applied only to the first phase of the pandemic) [48] remain valid during both phases of the pandemic.
With these assumptions in place, we now propose a modified SEIUR (Susceptible-Exposed-Infectious-Undetected-Removed) compartmental model in an attempt to estimate the undetected population. Here, the undetected population comprises both the asymptomatic undetected population and those who intentionally hide their disease and are never reported as infected individuals. The undetected group of the population plays an important role in the transmission of the disease in a particular locality. There are also some social factors which have fuelled the pandemic at different times. To incorporate the significance of these two classes of the undetected population, an extra compartment for the undetected population is added to our previously reported SEIR model. An undetected-removed compartment is also added to the new model for the sake of conservation of the total population. In the new model, we have introduced another infective class U to represent the undetected population. There is always a probability that the interaction between U and S will produce new infectious individuals, who are either reported as detected infectious or move to the undetected compartment. Similarly, the interaction between I and S also results in an infectious population. The more interactions there are between S and I, and between S and U, the higher the infectious count will be. The schematic diagram of this new SEIUR model is shown in Fig. 1.
The governing equations of the newly proposed SEIUR model are given by [equations not recovered] where U and R_U represent the undetected and undetected-removed populations, respectively, at any given time. The rest of the variables are the same as stated above. Here we introduce a new transmission rate, the 'undetected transmission rate' denoted by ψ_t, which is an important parameter as it determines how fast the pandemic will rise or decline, while σ^{-1} is the recovery period for the undetected population and ϵ denotes the rate at which symptomatic individuals hide their disease (i.e. prevent themselves from being recorded officially as infected). We call ϵ the rate of symptomatic unreported individuals.

Modelling the rates
Our gross assumption in formulating the new SEIUR model is that the overall evolution of the COVID-19 pandemic in India is correctly modelled by our SEIR model [48], which justifies retaining the basic assumptions made earlier regarding the transmission rates. We note that the transmission rates of the disease change with time, and so they are expressed as time-dependent piecewise functions. The transmission rates β_t and ρ_t are the same as in the SEIR model. The time-dependent incubation period ν_t is also kept unchanged. Here we introduce the new transmission rate ψ_t for the undetected population in terms of the disease transmission rate β_t [equation not recovered], where µ_t = ζ_t κ_t and κ_t is a piecewise function. We note that, from the analysis of the SEIR model [48], the asymptomatic transmission rate η_t was found to be ∼ 15% of the disease transmission rate β_t for the first phase of the pandemic. Therefore, in this new SEIUR model, we can safely retain the same relation due to the same prevailing conditions during both phases of the pandemic. The recovery periods σ^{-1} and γ^{-1} are taken to be 21 days. The natural death rate δ is neglected on the assumption that the deaths caused by COVID-19 during the pandemic period are very high compared to the natural deaths. The new parameters introduced in the SEIUR model are κ_t and ϵ.
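As a rough illustration of what a time-dependent piecewise rate looks like in practice, the following minimal sketch builds a piecewise-constant function of the day index and derives an asymptomatic rate as roughly 15% of it, as quoted in the text. The piecewise-constant form, the helper names `piecewise_rate` and `asymptomatic_rate`, the breakpoints, and the numerical values are all assumptions made for illustration; they are not the functional forms or fitted values of the model.

```python
from bisect import bisect_right


def piecewise_rate(breakpoints, values):
    """Return a function t -> rate that is constant on each interval.

    breakpoints: sorted days at which the rate switches (hypothetical).
    values: one rate per interval, so len(values) == len(breakpoints) + 1.
    """
    if len(values) != len(breakpoints) + 1:
        raise ValueError("need exactly one more value than breakpoints")

    def rate(t):
        # bisect_right gives the index of the interval that contains t
        return values[bisect_right(breakpoints, t)]

    return rate


# Hypothetical numbers for illustration only (not fitted values).
beta_t = piecewise_rate([30, 90], [0.25, 0.15, 0.08])


def asymptomatic_rate(t, fraction=0.15):
    """Asymptomatic transmission rate taken as ~15% of beta_t."""
    return fraction * beta_t(t)
```

With these made-up numbers, `beta_t(10)` falls in the first interval and `beta_t(45)` in the second; a fit would adjust the interval values rather than the code structure.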

Basic reproduction number (R_0)
The basic reproduction number R_0 is a measure of the contagiousness of an infectious disease, defined as the average number of secondary infections that stem from a primary infection. Naturally, R_0 > 1 means an outbreak. For the SEIUR model, the basic reproduction number can be calculated as [equation not recovered], which is plotted in Fig. 2 for both the previous SEIR (first phase) and the new SEIUR (first and second phases) models. As is seen from the figure, R_0 is larger during the first phase of the pandemic compared to the second, and the SEIR and SEIUR models closely agree in their behaviour during the first phase.
The days indicated in the figure start from the 'zeroth' day, which for the first phase is 14 March 2020. We note that the first case in India was detected on 30 January 2020 in the state of Kerala, after which only a few cases were reported till mid-March 2020; in our analysis, we have therefore taken the zeroth day as 14 March 2020, the first day on which the number of active cases saw a significant jump. The 'zeroth' day for the second phase is taken as the day when the number of daily infected cases rose for the first time after the initial decline of the cases in the first phase, which is 24 February 2021.
We note here that the calculation of R_0 always requires an underlying numerical model, based on which the R_0 value can be determined [10,39]. Naturally, different models yield different values for R_0 [45]. The agreed value of R_0 is thus the one due to the model which provides the most reliable pandemic scenario based on modelling of the available data.

Neural network modelling
Of late, the artificial neural network, commonly known as ANN, has become an indispensable part of science and technology and has made its way deep into almost all fields of science. The physics informed neural network (PINN) [31] is a relatively new concept that has emerged from the ANN and has gained importance due to its universality of applications. PINN is an optimization method through which differential equations are treated as an optimization problem with embedded initial or boundary conditions, which is then solved using an ANN. The name PINN, which originated from physics-related equations, does not however limit its use only to physics problems. In this work, we have successfully applied the concept of PINN to our SEIUR model. One important advantage of using the PINN approach is that it can discover unknown parameters from the given dataset. The unknown parameter ϵ in our model is in fact estimated through this PINN approach. For a detailed analysis of the PINN approach, the reader can see Refs. [31,25,46].
At this point, it is worthwhile to mention some of the existing methods in epidemiological modelling, which can help the reader distinguish between the PINN approach and others. One class of modelling is the so-called regression modelling, which, as the name suggests, obtains the pandemic properties through regression and may include various approaches such as logistic regression, Bayesian ridge regression, and Gaussian regression [40,4,49,56,48]. These methods are useful when the pandemic datasets are very large and have relatively less complex and interlinked parameters. In contrast to this, ANN-based methods can handle extremely complex interlinked data requiring a large number of underlying differential equations [37,44,50]. While various ANN-based methods can be applied to many different stages of the modelling, the PINN approach is used to optimise the unknown interlinked parameters which are used in the underlying model.

The PINN setup
We now set up Eqs. (6-10) through a neural network and assume that our field variables χ = (S, E, I, U, R) can be approximated through a neural network N(t).

The corresponding loss functions are [equations not recovered] where f_i(χ_i, p_j) are the right-hand sides of Eqs. (6-10) and p_j = (α_t, κ_t, ζ_t, ϵ, φ) are the parameters. The parameters (α_t, κ_t, ϵ) are those which are to be determined through neural network optimization. The initial conditions are further expressed in a set of another five loss functions [31] [equations not recovered] where χ_0 are the corresponding initial conditions and the function sign(t) is defined as [equation not recovered]. Note that the last equation, Eq. (11), does not need to be included in the optimization problem as it does not affect the rest of the equations. The data sets used throughout the optimization process were taken from the publicly sourced repository at https://www.covid19india.org [1].

Fig. 4 The active ratio U/I of the undetected to the detected population for 20 states.
We have solved the system using a neural network consisting of 8 hidden layers, each with 64 neurons, with kernel regularization. The activation function used is 'tanh' and the optimization scheme is Adam (see Table 1). Our working interface is the SciANN package, which is basically a Tensorflow-Keras wrapper [25].
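The idea of treating a differential equation as an optimization problem, and of recovering an unknown parameter by minimizing the equation residual on data, can be sketched on a toy problem. The example below is only a minimal illustration under stated assumptions: it uses a single decay equation dχ/dt = −kχ instead of the SEIUR system, central finite differences instead of the derivatives of a neural network, and a grid search instead of Adam. The names `residual_loss` and `discover_rate` and the value of `true_k` are hypothetical and do not come from the SciANN setup described above.

```python
import math


def residual_loss(times, values, k):
    """Mean squared residual of d(chi)/dt = -k * chi on sampled data.

    Central differences stand in for the automatic differentiation
    that a neural-network approximation of chi(t) would provide.
    """
    total = 0.0
    count = 0
    for i in range(1, len(times) - 1):
        dchi_dt = (values[i + 1] - values[i - 1]) / (times[i + 1] - times[i - 1])
        total += (dchi_dt + k * values[i]) ** 2
        count += 1
    return total / count


def discover_rate(times, values, candidates):
    """Pick the candidate k with the smallest residual loss (grid search)."""
    return min(candidates, key=lambda k: residual_loss(times, values, k))


# Synthetic "observations" generated with a hypothetical rate.
times = [0.1 * i for i in range(101)]
true_k = 0.3
values = [math.exp(-true_k * t) for t in times]

candidates = [0.01 * j for j in range(1, 101)]
k_hat = discover_rate(times, values, candidates)
```

In a full PINN the same residual term is added to a data-misfit term and to the initial-condition losses, and the unknown rates are trained jointly with the network weights.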

PINN results
The PINN results for both phases of the pandemic in India are summarized in Table 2 and shown in Fig. 3. From the analysis, it is seen that the ratio of the active undetected population to the active detected population was higher during the first phase of the pandemic than during the second. In the first phase, for every detected infection, there were almost 5 persons who went undetected. Interestingly, this scenario is found to be different for the second phase of the pandemic. The count of the undetected population decreases in the second phase of the pandemic and we have just one undetected individual for each detected infectious person.
A state-wise analysis has been done for the 20 most affected states of India up to October 10, 2021. This analysis is indeed important as all the states are different from each other in terms of economy, demography, and diversity in culture and lifestyle. A state-wise analysis can also provide a picture of the response of Indian states towards the COVID-19 outbreak. The PINN results of the COVID-19 outbreak in Indian states are tabulated in Table 3 for the first and the second phases of the pandemic, respectively. Among the 20 Indian states, Delhi had the highest undetected population during the first phase of the pandemic. Also, its ratio of undetected to detected population was nearly 10, the highest among all the states. Our results for India as a whole and for Indian states show a good agreement with the already reported results. The results obtained from the analysis are found to be consistent with the MWSIR and other modified SEIR models [23,6,5,36]. In this regard, we would like to mention other studies which have already been carried out to estimate the undetected or missed COVID-19 infection cases [34,27,51,26,32,7]. An early COVID-19 pandemic study in Europe estimates that the size of the actual undetected count varies within 3.93 − 7.94 times in different parts of Europe [47]. Another early pandemic study reported that the actual number of infections may have been 1.5 to 2.029 times more than the reported count in the United States and 1.44 to 2.06 times more in Canada [55]. A seroprevalence study of COVID-19 infection in rural districts of south India reveals 7 undetected cases for every RT-PCR confirmed case [27]. Chaubey and his colleagues estimated the real number of COVID-19 infections to be 17 times higher in the first phase of the infection from a serosurvey in India [51]. Thus our undetected count for Indian states agrees quite well with the other seroprevalence reports [27,51]. For the first phase of the pandemic, it is reported elsewhere that about 10 − 50 cases have gone missing for every detected case [54]. This report on South Indian slum areas indicates that the ratio of detected to undetected cases was almost 1 : 195. However, it is indeed very difficult to estimate the exact ratio of active undetected to active detected cases in the absence of reliable data. Our PINN model estimates this ratio of active undetected to active detected population to be ∼ 4.58 for the first phase of the pandemic in India. This ratio is found to be smaller than the previously reported results.
The symptomatic unreported rate ϵ is found to be highest in the state of Orissa during the first phase. During the second phase of the pandemic, we have observed that the ratio of undetected to detected population is the highest for the state of Uttar Pradesh. Surprisingly, the peaks per million of active undetected and active detected cases are found to be higher for the state of Kerala. From the analysis, it is also evident that the symptomatic unreported rate is highest in the state of Telangana for the second surge of the outbreak in India. The U/I ratios for these 20 states are shown graphically in Fig. 4. However, for some of the Indian states such as Uttar Pradesh and Delhi, this ratio is found to be ∼ 10, which is comparable to the reported results [54]. For the second phase of the pandemic, the active U/I ratio is found to be ∼ 1.3 for India. The pandemic was at its deadliest during the second phase, and most of the infectious population needed medical attention, which raised the detected infectious count and in turn lowered the U/I ratio compared with the first phase.

ANN-based forecasting scenario
We now test the predictability feature of the ANN-based model with inputs from our PINN model. As mentioned before, our primary premise regarding the two phases of the COVID-19 pandemic in India is that the overall physical situations during both phases are similar except for the more aggressive nature of infection during the second phase. The evolution of the pandemic in terms of active infectives (I) during these two phases in the 20 worst affected states in India is shown in Fig. 5. The horizontal axes in all the panels indicate the number of days starting from the zeroth day of the pandemic in the respective phases. The zeroth day is the day on which we consider the pandemic to have started. The first and second phases are shown in blue and red, respectively. The vertical axes are the number of active infective cases (I_norm) normalized to unity in both phases. The normalization helps us reduce both the datasets for the number of active infective cases to a single framework for further processing through the neural network, which is similar to standardization and pre-processing of the data. Altogether, we now have 40 sets of data which are to be processed through the neural network. As can be seen from the figure, the shapes of the curves for active cases are almost similar in both phases except for the fact that the curves are wider for the first phase.
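The normalization step can be sketched as follows. This is a minimal illustration that assumes 'normalized to unity' means dividing each state's active-case series by its own peak value, so that every curve peaks at 1; the function name `normalize_to_unity` and the sample numbers are hypothetical.

```python
def normalize_to_unity(active_cases):
    """Scale a series of active cases so that its peak value is 1.

    Assumption: normalization is done per series by dividing by the peak.
    """
    peak = max(active_cases)
    if peak <= 0:
        raise ValueError("series must contain a positive peak")
    return [x / peak for x in active_cases]


# Hypothetical daily active-case counts for one state and one phase.
i_norm = normalize_to_unity([10, 40, 20])
```

Applying the same function to each state in each phase puts all 40 series on a common vertical scale before they are passed to the network.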
The most common and logical information to be predicted during such an evolving pandemic is the timeline for when the pandemic is going to subside or end. Toward this, we have constructed the target values for our prediction as the day when the number of active infections reduces to 10% of the daily active peak value for any particular state. These target values are shown graphically in Fig. 5 by the vertical lines. We have identified six input features for the ANN model: the peak day (the day on which the active cases reach their peak), the number of cases recorded on the peak day, β_t on the zeroth day, β_t on the peak day, ϵ (the symptomatic unreported rate), and the ratio of

Conclusion
In this work, we have made an attempt to estimate the infectious population of the COVID-19 outbreak in India during the years 2020 and 2021 that goes undetected and does not contribute to the publicly and officially available records. To this effect, we have constructed a new SEIUR model by adding two new compartments, 'undetected' and 'undetected removed', to our previously developed and successfully deployed model [48]. One of the novelties of this work is the use of a new tool, PINN [31], employed to solve the SEIUR model and estimate the undetected population through 'parameter discovery'. The estimation of the undetected infectious population is itself considered an important finding with reference to such pandemics in India, which is now the most populous country. We performed the PINN analysis for the 20 worst affected Indian states and estimated the undetected population. The ratio of active undetected (U) to active detected (I) cases is calculated for the states as well as for India as a whole. The (U/I) ratio is highest in the state of Delhi (∼ 10) and the state of Uttar Pradesh (∼ 3) in the first phase and the second phase, respectively. For India as a whole, this ratio is 4.58 and 1.3 for the first and second waves of the pandemic, respectively.
One important finding of this PINN analysis is the estimation of the rate at which the symptomatic infectious population goes undetected but contributes equally to the spread of the disease. This rate ϵ is found to be the highest in the state of Orissa (0.0607) and the state of Telangana (0.0553) in the first and the second phases, respectively. The value of ϵ for India is found to be 0.0484 and 0.0493 for the first and the second phases, respectively. We can say that this ϵ parameter is a measure of the law-abiding disciplinary index of the country as well as of the Indian states. The active (U/I) ratio and ϵ give a clear picture of the response of the Indian states and of India as a whole towards tackling future COVID-19 or similar outbreaks.

Strength and weakness of the approach
We note that any modelling of disease outbreaks invariably has to employ some kind of epidemiological model [29,33,38]. However, depending on the complexity of an outbreak, especially when it is relatively new like the recent COVID-19 pandemic, the model parameters may vary widely. Also, the larger the number of independent parameters of an outbreak, the more involved the dynamical model is. In economically developed countries with well-defined healthcare systems, the outbreak data are usually reliable and detailed [11,13,24]. In contrast to this, for countries with developing economies and relatively unorganised healthcare systems, fine-scale accurate data such as health-related data of hospitalised patients, the number of patients requiring various levels of intensive care, etc. are extremely difficult to obtain. In such cases, the epidemiological models have to be based on certain loosely defined parameters and one looks forward to replicating the available outbreak data over a broad timeline. This is where ANN-based methods such as the PINN analysis find their place, where one has to determine a variety of parameters with high degrees of uncertainty. ANN-based methods have the capacity to model such data with minimal effort. We believe we have shown a relatively novel way of performing such 'parameter discovery' with the PINN-based epidemiological SEIUR model.
Naturally, the PINN-based method requires a large amount of outbreak data for successful forecasting. With increasing awareness and preparedness, many countries now have a lot of post-COVID-19 measures in place, which will be able to yield reliable data for future outbreaks. As the number of available datasets increases, the accuracy of ANN-based modelling also increases.

Fig. 1
Fig. 1 Compartmental diagram of the newly proposed SEIUR model.

Fig. 2
Fig. 2 Variation of R 0 with time for the first and second phases with the new SEIUR model and for the first phase with the previous SEIR model.

Fig. 3
Fig. 3 The PINN solutions of the new SEIUR model applied to the first phase (Phase #1) and the second phase (Phase #2) of the COVID-19 pandemic in India for the active infected population (I) and the active undetected population (U). The actual data points are indicated through the open circles '•'. The bottom panel shows the model loss calculated in terms of mean squared error.

Fig. 5
Fig. 5 Graphical representation of the dataset used for ANN-based forecast along with the target values shown by the vertical lines. The blue and red colors represent the first and the second phases of the pandemic.

Fig. 6
Fig. 6. The root mean squared error (RMSE) for the analysis is found to be ∼ 11.5%. The results of the forecast scenario are shown in Fig. 6.
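Two steps of the forecasting pipeline described in the text, the construction of the target value (the day on which active infections fall to 10% of the peak) and the 70%/30% split of the dataset, can be sketched in plain Python. This is a minimal sketch under stated assumptions: the target is taken as the first day at or after the peak on which the series is at or below the threshold, the split is a random shuffle with a fixed seed, and the function names `target_day` and `train_test_split` as well as the sample series are hypothetical.

```python
import random


def target_day(active_cases, fraction=0.10):
    """First day at or after the peak with cases <= fraction * peak.

    Returns None if the series never drops that low.
    """
    peak_day = max(range(len(active_cases)), key=active_cases.__getitem__)
    threshold = fraction * active_cases[peak_day]
    for day in range(peak_day, len(active_cases)):
        if active_cases[day] <= threshold:
            return day
    return None


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical series: peak of 10 on day 2, threshold 1.0 reached on day 5.
example_target = target_day([1, 5, 10, 6, 2, 1, 0])

# 40 samples (20 states x 2 phases) split into 28 training and 12 test rows.
train_rows, test_rows = train_test_split(list(range(40)))
```

Each row would carry the six input features together with its target day; the integer placeholders used here only demonstrate the split sizes.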

Table 1
Neural network parameters

Table 2
The summary of PINN results for India

Table 3
Estimated undetected (U) and detected (I) active cases for the first (top) and second (bottom) phases of the pandemic using the PINN analysis. The rate of unreported undetected population is also estimated by the PINN model. The maximum value for each column is highlighted in bold.

Table 4
The dataset for phase #1 (top) & #2 (bottom), used for forecasting using ANN.

the undetected active cases to the detected active cases (U_peak/I_peak). The transmission rates β_{0,t} are calculated numerically from our new SEIUR COVID-19 model. The (U_peak/I_peak) ratio and ϵ values are taken from the PINN results. Except for the peak day and active cases at the peak day, all the inputs used for forecasting are unique as they are obtained from our new model and PINN analysis. We combine both the first phase data and second phase data of the COVID-19 outbreak for the Indian states and the dataset is supplied to the ANN analysis. We have used 70% of the data to train the model and the remaining 30% is used for prediction purposes. We have used 6 hidden layers with 64 neurons each and the number of epochs used is 200. The loss function used is the mean squared error (MSE) and optimization is carried out with the help of the Adam optimizer. The rectified linear unit (ReLU) is used as the activation function. The dataset used for forecasting is tabulated in Table 4. The prediction results are shown in the left panel of

Fig. 6 The ANN predictions for various sample sizes (left) and the MSE of the optimization (right). The samples are here the target values of the twenty states (see text). Except for two values, the rest of the predictions quite agree with the observed values.