Data Source
In response to the public health emergency caused by SARS-CoV-2, the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University developed an interactive web-based dashboard to visualize and track reported cases in real time (2). Its data sources include the World Health Organization (WHO) (3), the Centers for Disease Control and Prevention (CDC) (4), the European Centre for Disease Prevention and Control (ECDC) (5), and China’s National Health Commission (NHC) (6). All the data collected and displayed are made freely available in a GitHub repository (7). The raw data files used for the subsequent experiments, in comma-separated values (CSV) format, are listed in Table 1.
Table 1
Case Category | File a |
Confirmed | time_series_19-covid-Confirmed.csv |
Deaths | time_series_19-covid-Deaths.csv |
Recovered | time_series_19-covid-Recovered.csv |
a URL path: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series/{} (where {} stands for the file name) |
Data Processing
The data preparation step of the algorithm processes the daily cumulative data from the source described above to create a time series of the daily newly reported cases per category. Next, from these time series, it creates three sets (confirmed, deaths, and recovered) that contain one data point for each reported case, whose value is the index of the date on which the case was reported. Specifically, in our experimental case, the available data start on 22nd Jan 2020; thus, all cases reported on this date are appended to the set with the value ‘1’, cases reported on 23rd Jan 2020 with the value ‘2’, etc.
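The preparation step can be illustrated with a minimal Python sketch. The function names `daily_new` and `expand_to_day_indices` are hypothetical, and the handling of downward corrections in the cumulative counts (treated here as zero new cases) is an assumption that the text does not specify.

```python
def daily_new(cumulative):
    """Convert a cumulative count series into daily newly reported counts."""
    new_counts = []
    previous = 0
    for total in cumulative:
        # Assumption: a downward revision of the cumulative figure
        # counts as zero new cases rather than a negative number.
        new_counts.append(max(total - previous, 0))
        previous = max(previous, total)
    return new_counts


def expand_to_day_indices(new_counts):
    """Create one data point per case, valued with its 1-based day index."""
    return [
        day
        for day, count in enumerate(new_counts, start=1)
        for _ in range(count)
    ]


# Toy cumulative series (not real data): day 1 corresponds to the first date.
toy_cumulative = [1, 3, 3, 6]
toy_new = daily_new(toy_cumulative)
toy_cases = expand_to_day_indices(toy_new)
```

For the toy series above, `toy_new` is `[1, 2, 0, 3]` and `toy_cases` is `[1, 2, 2, 4, 4, 4]`: one case on day 1, two on day 2, and three on day 4.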
Each confirmed case is matched with the earliest unmatched reported case of death or recovery, whichever comes earlier, with death cases taking precedence when there are both unmatched death and recovery cases on the same date. The difference between the dates of each pair is calculated, and the cases that take part in this calculation are removed from their respective sets. This procedure is repeated until there are no death or recovery cases left. At this point, the mean of the intervals of the matched case pairs is calculated.
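The matching procedure might be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the function name `match_cases` is hypothetical, and the sketch assumes that confirmed cases are consumed in chronological order (earliest first), which the description implies but does not state explicitly. Matching stops early if the confirmed cases run out.

```python
from collections import deque


def match_cases(confirmed, deaths, recovered):
    """Pair confirmed cases with outcome cases and return the intervals.

    Each argument is a list of 1-based day indices, one entry per case.
    Returns (intervals of matched pairs, still-unmatched confirmed cases).
    """
    confirmed_q = deque(sorted(confirmed))
    deaths_q = deque(sorted(deaths))
    recovered_q = deque(sorted(recovered))
    intervals = []

    while confirmed_q and (deaths_q or recovered_q):
        # Take the earliest outcome; a death wins a tie on the same day.
        if deaths_q and (not recovered_q or deaths_q[0] <= recovered_q[0]):
            outcome_day = deaths_q.popleft()
        else:
            outcome_day = recovered_q.popleft()
        # Assumption: the earliest unmatched confirmed case is paired first.
        intervals.append(outcome_day - confirmed_q.popleft())

    return intervals, list(confirmed_q)


# Toy input (not real data).
toy_intervals, toy_unmatched = match_cases(
    confirmed=[1, 2, 2, 4, 4, 4], deaths=[5], recovered=[5, 8]
)
```

With the toy input, the death on day 5 is matched first, then the recovery on day 5, then the recovery on day 8, giving intervals `[4, 3, 6]` and leaving the three cases confirmed on day 4 unmatched.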
For each of the remaining confirmed cases, the interval between its confirmation date and the last available date is calculated. If this interval exceeds the previously computed mean, it is appended to the final set unchanged; otherwise, it is replaced by the mean.
The final set includes estimated time intervals between time of confirmation and time of recovery for all confirmed cases to date. From this complete set, the overall mean time-to-recovery is calculated. As new data reports come in daily, this figure can be updated to provide enhanced estimation accuracy based on a growing body of evidence.
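The treatment of the still-unmatched confirmed cases and the final averaging can be sketched as below. The function name `overall_mean_interval` and its parameters are hypothetical, and the sketch assumes that at least one matched pair exists, since otherwise the mean of the matched intervals is undefined.

```python
from statistics import mean


def overall_mean_interval(matched_intervals, unmatched_confirmed, last_day):
    """Estimate the overall mean interval across all confirmed cases.

    matched_intervals: intervals (in days) of the matched case pairs.
    unmatched_confirmed: day indices of confirmed cases with no outcome yet.
    last_day: index of the last available date in the data.
    """
    # Raises statistics.StatisticsError if there are no matched pairs.
    matched_mean = mean(float(x) for x in matched_intervals)

    final_set = [float(x) for x in matched_intervals]
    for confirmed_day in unmatched_confirmed:
        elapsed = float(last_day - confirmed_day)
        # Keep the elapsed time only if it already exceeds the matched mean;
        # otherwise substitute the matched mean.
        final_set.append(elapsed if elapsed > matched_mean else matched_mean)

    return mean(final_set)


# Toy values (not real data): matched mean is 13/3, about 4.33 days.
toy_estimate = overall_mean_interval([4, 3, 6], [4, 4, 4], last_day=10)
```

In this toy case each unmatched case has been open for 6 days, which exceeds the matched mean, so the final set is `[4, 3, 6, 6, 6, 6]`. Rerunning the function as new daily reports arrive corresponds to the update described above.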
Experiments
To test the main hypothesis experimentally, a computer program was written in the Python programming language that applies the algorithm described above to the available data (from 22nd Jan to 9th Mar 2020). The program can run the calculations on any particular country, on a cluster of countries, or on the global figures.
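The selection of a country, a cluster of countries, or the global figures could look like the following sketch. It is not the authors' program: the function name `cumulative_series` is hypothetical, and the sketch assumes that each CSV file has a `Country/Region` column and that the per-date columns begin at the fifth column, after `Province/State`, `Country/Region`, `Lat`, and `Long`. The sample rows are toy values, not real data.

```python
import csv
import io

# Assumed layout: four descriptive columns, then one column per date.
FIRST_DATE_COLUMN = 4


def cumulative_series(csv_text, countries=None):
    """Sum the cumulative counts per date over the selected countries.

    countries: a set of country names, or None for the global figures.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    country_col = header.index("Country/Region")
    totals = [0] * (len(header) - FIRST_DATE_COLUMN)

    for row in reader:
        if countries is not None and row[country_col] not in countries:
            continue
        for i, value in enumerate(row[FIRST_DATE_COLUMN:]):
            # Treat empty cells as zero (an assumption).
            totals[i] += int(value or 0)
    return totals


# Toy sample with made-up counts.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
Region A,Country X,0.0,0.0,1,2,3
Region B,Country X,0.0,0.0,0,1,1
,Country Y,0.0,0.0,0,0,2
"""

global_totals = cumulative_series(SAMPLE)
country_x_totals = cumulative_series(SAMPLE, countries={"Country X"})
```

Passing `None` aggregates every row, while passing a set of names restricts the sum to a single country or a cluster of countries.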
The reported results include (i) global figures; (ii) Mainland China, as the “cradle” of the infection and the largest available dataset; (iii) Italy and Iran, as two countries with widespread disease; and (iv) the US, the UK, Spain, France, Germany, and the Netherlands, as countries with highly probable sustained domestic spread of the disease and relatively high-quality healthcare infrastructure.