Analysis of Individual-Level Epidemic Data: Study of 2018-2020 Ebola Outbreak in Democratic Republic of the Congo

The 2018-2020 Ebola virus disease epidemic in Democratic Republic of the Congo (DRC) resulted in 3481 cases (probable and conﬁrmed) and 2299 deaths. In this paper, we use a novel statistical method to analyze the individual-level incidence and hospitalization data on DRC Ebola victims. Our analysis suggests that an increase in the rate of quarantine and isolation by approximately 12% during the epidemic’s third and ﬁnal wave was likely responsible for the eventual containment of the outbreak. The analysis further reveals that the total e ↵ ective population size or the average number of individuals at risk for the disease exposure in three epidemic waves over the period of 24 months was around 19,000—a much smaller number than previously estimated and likely an evidence of at least partial protection of the population at risk through ring vaccination and contact tracing as well as adherence to strict quarantine and isolation policies. strategy and the need for supporting DRC Ebola treatment centers. To address this issue, University of Kinshasa researchers collected a large number of individual records of disease histories from probable and conﬁrmed Ebola cases during the 2018-2020 EVD outbreak in DRC. This study describes a model-based Bayesian statistical method developed to estimate the e ﬀ ects of ring vaccination, quarantine, and isolation in Ebola treatment centers across northwestern provinces of the country. The method accounts in particular for missing and censored data, heterogeneity of infection patterns and multiple waves of infection with di ﬀ erent intervention patterns.


Introduction
su ciently flexible to incorporate the heterogeneity of individual disease histories and simple enough to require only a small set of population-level parameters. This allows us to estimate the key quantities of interest in the DRC outbreak, such as the rates of disease reproduction and quarantine/isolation (or hospitalization) and the size of the subpopulation at risk of infection though contact with EVD cases. The model also accounts for observed seasonality and spatial variation in the number of cases (e.g., see [14]) by allowing for the three independent sets of parameters to govern the three waves of infections observed over the course of the outbreak. For the purpose of our analysis we have determined, similarly as in [7], the first wave to end in late February 2019 and the second one to end around late May 2020. See Table 1 below for more details. Our approach may be also viewed as an alternative to the complicated multi-phase longitudinal analysis proposed recently for the DRC outbreak data in [15].

Ebola Dataset
The 2018-2020 DRC EVD outbreak lasted over 24 months and spread over 3 distinct spatial and temporal waves. Between the emergency declaration of the EVD outbreak in northern DRC on August 1, 2018 and the outbreak's o cial end on June 25, 2020, the DRC Ministry of Health has reported a total of 3481 cases (including confirmed and probable), 1162 recoveries, and 2299 deaths [16] in the provinces of Northern Kivu, Southern Kivu, and Ituri. The dataset considered here is a large subset of the entire EVD database compiled by the University of Kinshasa School of Public Health, which comprises 3117 total case records (confirmed and probable) recorded between May 5, 2018, and September 12, 2019. The data included partially de-identified but still detailed patient information, such as each person's location, date of symptom onset and hospitalization, as well as discharge due to recovery or death. These individual records came from the Ebola treatment centers in 24 di↵erent health zones, spread out among the three DRC provinces of Northern Kivu, Southern Kivu, and Ituri.
Of the 24 health zones, 77.1% of all cases were from only 6: Beni, Butembo, Katwa, Kalunguta, Mabalako, and Mandima. Only 9.7% of cases were under the age of 18. There is also a slightly larger proportion of females contracting the disease, comprising 57.0% of the cases. Approximately 5% of the cases were health care workers. About one-third of the EVD fatalities were not identified until patient's death and thus not e↵ectively isolated from the time of infection. Although over 170,000 contacts of confirmed and probable Ebola cases had been monitored across all a↵ected health zones for 21 days after their last known exposure by the end of the epidemic, some of the contact tracing was incomplete due to insecurity that prevented public health response teams from entering some communities. The overall case density map is presented in panel (A) of Figure 1 with the animated version of the map presented in the appendix in Figure 5. Notice that the high-density areas, particularly Butembo, Katwa, and Beni, are all spatially small health zones corresponding to cities or towns with larger populations.

Case Alerts and Definitions
Since early August, 2018, the DRC Ministry of Health has been collaborating with several international partners to support and enhance EVD response activities through its emergency operations center in Goma. To the extent possible given regional security considerations [17], the response teams were deployed to interview patients and their suspected contacts using a standardized case investigation form classifying cases as suspected, probable, or confirmed. A suspected case (whether surviving or not) was defined as one with the acute onset of fever (over 100 F) and at least three Ebola-compatible clinical signs or symptoms (headache, vomiting, anorexia, diarrhea, lethargy, stomach pain, muscle or joint aches, di culty swallowing or breathing, hiccups, unexplained bleeding, or any sudden, unexplained death) in a North Kivu, South Kivu, or Ituri resident or any person who had traveled to these provinces during this period and reported the signs or symptoms defined above. A patient who met the suspected case definition and died but from whom no specimens were available was considered a probable case. A confirmed Ebola case was defined as a suspected case with at least one positive test for Ebola virus using reverse transcription polymerase chain reaction (RT-PCR) [18] testing. Patients with suspected Ebola were isolated and transported to an Ebola treatment center for confirmatory testing and treatment [2].

Onset and Removal
In our analysis of the DRC dataset, we focused on dates of symptom onset and removal, with removal defined as either a death/recovery at home or transfer to an Ebola treatment center (ETC). It was assumed that, once in the treatment center, the probability of further infection spread by an isolated individual was very small due to the strict safety protocols-and later due also to vaccination of healthcare personnel and family members who were in contact with the suspected Ebola case. As summarized in panel (B) of Figure 1, we were able to access 3117 out of 3481 individual records of confirmed and probable Ebola cases. Of these 3117 records, 37 were missing both the onset and recovery dates and were removed from further analysis. In about 30% of the remaining records, either their dates of onset or removal were missing. For these records, the imputation of missing dates was performed as follows. First, from the complete records available, we obtained an empirical distribution of the number of days from disease onset   Table 1. to removal. Then, assuming the standard missing-at-random data pattern [19], the missing values for the number of days in between symptom onset and removal were drawn uniformly from that distribution. In order to adjust for spatial and temporal di↵erences in the removal patterns in course of the 3 waves of the epidemic, a time window of 150 days around the onset or recovery dates was used when sampling the missing dates. A flow diagram summarizing the amount of missing data and data imputation leading to the final dataset is presented in panel (B) of Figure 1. The distribution of the original and the partially imputed records across the three waves of infection is provided for further reference in Table 1.

Spatial and Temporal Patterns
Throughout the pandemic, the incidence rates exhibited strong spatial and temporal patterns that can be summarized as three distinct waves of infections with approximate boundaries marked by vertical lines in Figure 1. The distribution of weekly reported cases across the most a↵ected health zones listed in Table 1 is provided in the bar plot and in the corresponding animation in the appendix (see Figure 5). As seen from the bar chart and the animated plot, the epidemic was initially driven largely by infections in the health zones of Beni, Mandima and Mabalako. After several months, the incidence of new cases in these zones subsided, but the epidemic moved south to the health zones of Katwa and Butembo, where the majority of new infections was registered between weeks 22 to 45 of the epidemic (see Panel (A) in Figure 5 in the Appendix). In the final spatial shift, around week 49, the epidemic returned to the health zones of Beni, Mandima, and Mabalako, where it was mostly extinguished around week 60 (September 2019). Isolated Ebola incidences occurred sporadically across northern DRC until end of the outbreak was o cially declared in June 2020. The empirical patterns of incidence and removal for EVD cases are summarized in Figure 2 with the bar and the dot plots representing the daily numbers of new infections and removals, respectively. As seen from the plot, these daily counts closely follow a three-wave temporal pattern in Table 1. This is further evident from the black and red trendlines representing the loess smoothers (see [20]). The daily ratio of new cases and removals may be interpreted as a crude estimate of the effective reproduction number R t defined more formally in (2) in Section 2.2 below. In particular, the blue trendline for R t indicates that towards the end of the observed time period, the number of removals outpaced the number of new infections (R t < 1). The ability to sustain this pattern for a su ciently long time period, mostly by increasing the rate of quarantine and ETC transfers along with ring vaccination of case contacts was largely credited with the end of EVD epidemic in mid-2020. The quantification of this public health intervention e↵ect in 2018-2020 DRC outbreak is one of the main motivations for our model-based analysis. Although the precise cut-o↵ dates for the three waves of 2018-2020 Ebola infections are di cult to establish, the incidence data along with simple statistical analysis (see Section 2.3) indicate that the first wave lasted approximately until the end of February 2019, whereas the second wave ended around the end of May 2019. For the purpose of the data analysis below, the specific break dates used were February 27, 2019 and May 27, 2019 as marked by vertical lines in Figure 2. September 12, 2019 was the cuto↵ date for the individual records data available from the University of Kinshasa (see Table 1).

Model for Data Analysis
The analysis of the individual-level epidemic data is based on the standard ecological model known as the SIR (susceptible-infected-removed) model and developed for the purpose of analyzing average behavior of a large population with a homogenous pattern of interactions [21,11]. Although there are many variants of SIR models in the literature [22], our current analysis considers the classical Kermack-McKendrick SIR model that assumes the proportions of population categorized as susceptibles (s), infected (ι), or removed (r) to evolve according to the di↵erential equationṡ with s 0 =1 ,ι 0 = ρ>0 and r t =0w h e r eβ>0 is the rate of infection, γ>0 is the rate of recovery and ρ>0 is the initial amount of infection. In particular, the model implies the existence of the basic reproduction number R 0 (R-naught), which determines the average speed of disease spread [11] and is given by the formula R 0 = β/γ.
If R 0 > 1, the proportion of infected initially rises and then subsides, with the final proposition of surviving susceptibles given by s 1 =1− τ>0w h e r eτ is know as the epidemic's final size. In typical statistical analysis, an estimate of R 0 is obtained by separately estimating the parameters β and γ.
Another important quantity related to (1) is the effective reproduction number, which is typically defined as Although equation (1) is typically considered in the context of an average behavior of a large population, for our purposes we interpret it as defining the individual histories of infection and recovery, according to the idea of the dynamic survival analysis (DSA) discussed recently in [10] and [23]. With the DSA approach, we interpret equation (1) as the so-called stochastic master equation [24] describing the change in probability of a randomly selected individual being at time t either susceptible, infected, or removed. These respective probabilities are represented by the scaled proportions s t /(1 + ρ), ι t /(1 + ρ), and r t /(1 + ρ) and evolve according to (1). As outlined in [10], the DSA-based interpretation of the classical SIR equations has a number of advantages that make it particularly convenient for analyzing epidemic data consisting of individual histories of infection onsets and removals, which is exactly the type of data available in the DRC Ebola dataset. The fact that the model is individual-based implies also that we can vary the parameters θ =(β, γ, ρ) to account for individual covariates and changes in the parameter values over time, as di↵erent waves of infection sweep through the population. Finally, for the purpose of our analysis, it is also important to note that the DSA model does not require any knowledge of the size of the susceptible population subjected to the epidemic pressure. For the DRC dataset, that assumption would be di cult to justify due to spatial and temporal heterogeneity of the epidemic and the frequent movements of local populations driven by political conflicts and insecurity. Another element complicating the determination of the size of susceptible population was the ring vaccination campaign that has been conducted since 2019 wherever possible in the northern DRC during periods of relative stability, despite local mistrust and supply issues. This campaign ultimately resulted in over 250,000 vaccinations.
Note that, because s 0 = 1, the values of R 0 and R t coincide for t = 0. Moreover, is a decreasing function of time and therefore, so is R t . However, in practice, this implication is problematic. Rewriting R t = −ṡ t /ṙ t suggests that a crude but sensible way to estimate R t empirically is to take the ratio of daily number of new infections to new removals. The empirical R t thus estimated will not be necessarily monotonically decreasing. In the light of possibly changing parameters and the e↵ective population size, we have adopted this approach to estimating the daily e↵ective reproduction number R t in Figure 2.

Parameter Estimation
We assume that, for each of the three waves of the epidemic, we have a separate and independent set of parameters θ and that, in each wave, we observe n histories of infection that may be represented by the times of disease onset and removal (t i ,T i ) for each individual i. Under these assumptions the wave-specific DSA likelihood function is given by (see Appendix) L(θ|t 1 ...,t n ,T 1 ,...,T n ,T)=( where T is the available time horizon and w i is the binary variable indicating whether T i is right censored (that is, T i ∧ T = T ) in which case w i = 0 and otherwise w i = 1. Note that the likelihood formula depends on the parameters β and ρ only implicitly, through the values of the function s t defined by (1). Note also that we assume T to be unique and exactly known although in practice this may not be true as subsequent waves of infection may be too close in time (perhaps even overlapping) to allow precise specification of T precisely. In our analysis below, we solve this practical problem by considering several candidates for the values of T in each wave and then identifying ones that jointly maximize the combined posterior distribution corresponding to the wave-specific likelihoods in equation (3).
The fitting of the model parameters θ =( β, γ, ρ) by maximizing the likelihood function (3) can be conveniently integrated into the Bayesian estimation framework, which allows for a more complete propagation of uncertainty and the use of external information in the statistical model. This, in turn, allows us to produce estimates that reflect all available information and uncertainty. In our DRC data analysis, the approximate posterior densities of θ were obtained using the Hamiltonian Monte-Carlo sampler [25] implemented in the open source statistical software STAN [26] and integrated with the popular statistical analysis language R via the library Rstan [27]. For the Rstan analysis, we have assumed uniform (sometimes improper) prior distributions on the θ components as follows β ∈ (0.15, ∞), γ ∈ (0,β), ρ ∈ (0, 1).

(4)
The lower bound was placed on β based on empirical information, and the upper bound was placed on γ to enforce the constraint R 0 > 1. Given the wave-specific time horizons (T 's), the set of parameters for each epidemic wave was estimated independently using 2 independent chains of 3000 iterations, with a burn-in period of 1000 iterations. The chains' convergence assessed using Rubin's R statistic [27]. The analysis resulted in approximate samples from the posterior distribution of θ for each of the three waves of the epidemic (see e.g., Figure 4).

Ethics Statement on Human Subjects and Methods
The research was conducted in accordance with the relevant guidelines and regulations of the US law and OSU Institutional Review Board. The research activities involving human subjects discussed in the paper meet the US federal exemption criteria under 45 CFR 46 and 21 CFR 56.

Results
The overall comparison of the parametric DSA model predictions with the empirical data in DRC dataset until September 12, 2019 is given in Figure 3, where the scaled theoretical densities of the epidemic are plotted alongside the observed relative daily counts of infection (onset) and removal shown earlier in Figure 2. As seen from the plots, the multi-wave model appears to capture well the empirically observed  Table 2.
patterns of daily counts represented by the histogram bars. The 95% credibility bounds around the model fit (marked in blue) are calculated based on the model parameter posterior distributions estimated via the MCMC algorithm with priors described in Section 2.3. We note that, although the DSA fit curve appears quite similar to the non-parametric loess smoother presented in Figure 2, the parametric fit has an advantage of providing an explicitly interpretable set of parameters describing the outbreak dynamics. This allows, for instance, for a purely quantitative comparison of the 3 di↵erent epidemic waves. The wave-specific results of the MCMC analysis are summarized in Table 2 with some of the posterior plots presented in Figure 4. In Table 2, the posterior mean and corresponding credibility interval for each component of θ =(β, γ, ρ) are listed for each epidemic wave along with the estimated reproduction numbers. Additionally, in the last two rows, the posterior estimates of the e↵ective population size (N ) and the outbreak size (K 1 ) are reported (see appendix Section B for formal descriptions of these quantities). The MCMC estimation scheme that produced the numerical values listed in the table was based on the wave-specific likelihood functions in equation (3) conditioned on the observation periods (T ) according to the cut-o↵ dates in Table 1. As seen from the entries of Table 2 and from the posterior density plots in Figure 4, the parameter values for the infection rate (β), recovery rate (γ) and the initial prevalence of infection ρ all di↵er considerably across waves. The most notable appears to be an average increase of 26% in the posterior β values between waves 1 and 2. This change is seen to correspond to an 11% increase in the value of the posterior mean of R 0 and the subsequent increase in the number of infections in wave 2 of the EVD outbreak. Another interesting observation in Table 2 is that, while the average value of β increased slightly (about 4%) between epidemic waves 2 and 3, the corresponding average value of γ increased three times as much (over 12%). Recalling the plot of the empirical e↵ective reproduction number in the top part of Figure 2, it appears that this increase was crucial in ultimately controlling epidemic growth and ending the outbreak within the next several months. We note that the increase in the removal rate γ corresponds to the decrease in the duration of the infectious period 1/γ (measured in days). Thus, the increase in the respective γ values corresponds in this case to a decrease in the average infectious period from 5.3 days to 4.7 days. This could be further compared with the average infectious period in the initial wave 1 of the epidemic, which was estimated by the model at almost 6 days (corresponding to the posterior mean γ =0 .168). These di↵erences in the wave-specific estimates of β and 1/γ are also clearly seen in their posterior density plots in the top panels of Figure 4, and they appear to be consistent with the empirical onset and removal rates shown in Figure 2.
As already indicated, one of the advantages of the parametric DSA approach is that it does not require knowledge of the underlying susceptible population size but may instead infer that value from the incidence data and the estimated epidemic parameters (see appendix Section B). The posterior means and 95% credibility bounds for the outbreak size (K 1 ) and the e↵ective population size (N ) are listed in the last two rows of Table 2, and their posterior densities are presented in the bottom panels of

Conclusions
Outbreaks of Ebola in Africa are a persistent threat not only to global public health but also to economic and political stability in some of the world's poorest and most vulnerable regions. Despite early evidence of e↵ectiveness of the ring vaccination e↵ort, the prolonged political and armed conflict in northern DRC, where the latest public health intervention took place, has seeded mistrust towards local authorities and international partners. This has impeded e↵ective community collaboration, complicating the vaccination campaign and the overall response strategy [17]. To evaluate the e↵ects of public health response to EVD outbreak in DRC during 2018-2020, we used the individual-level data based on case ascertainment, vaccination records, and contact enumeration collected by researchers at the University of Kinshasa School of Public Health in collaboration with local health authorities in northern DRC from August, 2018 to September, 2019. The analysis of this dataset is crucial for informing current and future EVD intervention policies and strategies regarding vaccination, quarantine, and isolation. However, the analysis is also quite challenging due to incomplete or missing patient information as some families have resisted putting their loved ones in isolation and some individuals have absconded from Ebola treatment centers. Another challenge is the complexity of the data itself, with individual patient histories spanning multiple waves of infections across multiple seasons and spatial environments result in very heterogenous and sometimes incompatible health records.
To overcome these challenges and analyze the University of Kinshasa dataset, we employed the dynamic survival analysis (DSA) method [10], which combines an individual-level Bayesian survival model with a classical SIR epidemic modeling framework. The fusion of the two allowed us to coherently integrate multiple analyses of individual disease histories into a single analysis based on a simple parametric model. Using that model, we were able to estimate the reproduction numbers and the e↵ective population sizes in each of the three major waves of the EVD epidemic while appropriately accounting for uncertainty due to heterogeneity, missingness, or censoring in the records of EVD patients. This Bayesian framework also allowed us to incorporate external information through informative prior distributions and to provide exact inferences for incidence and intervention e↵ects -the information most relevant to policy makers and public health o cials.
Through our study, we estimated the epidemic e↵ective population size (the overall number of individuals at immediate risk of infection) to be around 19,000. This number is much smaller than the demography-based estimate of the susceptible population that one would usually consider in a standard epidemic model. Indeed, the combined population of North Kivu, South Kivu, and Ituri provinces exceeds 16 million and accounts for approximately 15% of the DRC population, with many large population centers (e.g., Goma) exceeding half a million inhabitants. This discrepancy between demographic estimates and the estimated e↵ective size of the susceptible population emphasizes the individual-based nature of our analysis and reflects the e↵ects of public health intervention e↵orts (in particular, ring vaccination and contact tracing) that largely prevented the wide and uncontrolled community spread of the EVD.
Our analysis also indicated that, in di↵erent epidemic waves, the average removal time was statistically di↵erent with the shortening of the removal time from wave 1 to wave 3 by an average of 1.4 days (from 6 days in wave 1 to 4.6 days in wave 3). This finding is consistent with the general view that increased isolation and vaccination e↵orts in late 2019 largely contributed to breaking local chains of transmission and ultimately ending the epidemic by mid-2020. Assuming similar infectivity in future outbreaks, our results suggest that, in order to limit the spread of EVD in future outbreaks, a rate of removal similar to that achieved in wave 3 will likely be required.
Although the DRC has successfully contained Ebola outbreaks in the past [6,28] and an e↵ective vaccine is now available, the security and political challenges in the northern DRC -especially North Kivu and Ituru provinces -continued to create problems for e↵ective public health interventions during the 2018-2020 outbreak. As political challenges in the DRC are likely to persist in the near future, there is great need for a flexible approach in responding to future outbreaks that combines multiple pharmaceutical and non-pharmaceutical strategies. The individual-level EVD data from the 2018-2020 outbreak presented here is, to our knowledge, the first opportunity to comprehensively look at the multi-wave outbreak data and quantitatively assess the strength of non-pharmaceutical interventions while also accounting for the the e↵ects of ring vaccination in decreasing the size of the population at risk of infection. The methodology developed and used here is also of possible relevance for analyzing other outbreaks exhibiting complicated dynamics and multiple incidence waves, including the current COVID-19 pandemic.