Developing an Online Infectious Disease Outbreak Database: Codifying the World Health Organization’s Disease Outbreak News reports

Background: The World Health Organization’s Disease Outbreak News (DONs) reports are the world’s primary source of official information on global disease outbreaks. Access to this information is crucial for informing research analyses, global health priorities, and decision making. However, in its current form, the utility of the DONs reports for research and analysis is limited as a result of their reporting format. To this end, we designed a standardized methodology for codifying the data contained in DONs reports and created an online, searchable database. Methods: We coded DONs reports published between the years 1996 and 2019, systematically collecting data from each individual report using a standardized methodology and tabulating data into a single spreadsheet. We created a Year-Pathogen-Country taxonomy to group related disease events and circumvent issues related to reporting inconsistencies in DONs reports. Results: In total, we reviewed 2,806 DONs reports corresponding to 1,105 unique infectious disease outbreaks from 1996-2019. Overall, H5N1 represented the most frequently reported disease, while China was the country with the most reports. We observed the DONs reports to contain numerous issues relating to the standardization, accuracy, and transparency of reporting procedures. Conclusions: Our database represents a new, accessible resource for research that improves the accessibility of the data contained in DONs reports. The World Health Organization should consider standardizing reporting practices, protocols, and procedures as a means of improving the reporting and transparency of infectious disease outbreaks.


Introduction
Infectious disease outbreaks have significantly increased over the past few decades, possibly as a result of alterations in various climatic, environmental, biological, socioeconomic, and political factors [1][2][3][4].
Globalization has complicated this trend by accelerating the pace at which infectious diseases can spread around the world, as exemplified by the COVID-19 pandemic. These realities underscore the need for effective surveillance systems and rapid outbreak detection and reporting as a means of ensuring global health security.
Disease surveillance and reporting are mandated by the revised International Health Regulations (IHR) [5], and there is an abundance of publicly available data on disease events from Internet-based sources.
These sources provide an open route for reporting that promotes transparency and include the International Society for Infectious Diseases' Program for Monitoring Emerging Diseases (ProMED-mail) [6], the Public Health Agency of Canada's Global Public Health Intelligence Network (GPHIN) [7], and HealthMap [8]. They allow researchers to collect these data and translate them into usable surveillance platforms [9]. The GIDEON database is another resource that provides access to global infectious disease surveillance data [10], but is not freely available to the public. All of these sources are used for research and action but are unofficial records.
The World Health Organization's Disease Outbreak News (DONs) report catalog represents the only official, public, Internet-based source of surveillance data and is the world's primary source for World Health Organization (WHO)-confirmed infectious disease outbreaks [11]. This surveillance resource contains data dating back to 1996 on event-based outbreak information provided by countries and other partners, which the WHO then compiles into prose-based reports organized by date, country, and disease. The DONs reports serve as a tool for international actors to stay informed on health-related events and emergencies around the globe and include crucial information such as relevant dates, the status of laboratory confirmation and response activities, and other relevant contextual details.
Because the data contained in DONs reports represent the only official recordings of outbreaks from the WHO for the last 25 years, many researchers utilize them for their analyses [12][13][14][15][16][17]. However, in its current form, the analytic utility of the DONs database is severely hampered by its unstructured, prose-based system, which makes it infeasible to quickly access information or conduct analyses. This precludes it from being a useful research tool and makes it difficult to identify trends that could inform future outbreak policy, prevention, and response.
We endeavored to create an accessible database of DONs reports that systematically captures information in a non-prose format for every report from 1996-2019. In this paper, we discuss this process, address factors that limit the DONs from being a usable reference for international disease-related events, present recommendations for improving disease reporting, and offer the database of DONs reports to be used for future data analyses.

Methods
Previous work has proposed measures including the start, detection, notification, verification, and laboratory confirmation of an outbreak that could serve as metrics for monitoring infectious disease outbreaks [18]. Based on these measures, we created a standardized methodology for identifying, collecting, and reviewing the information contained in the DONs reports (Table 1). We then reviewed all of the reports contained in the WHO's catalog, strictly adhering to information reported in the text.

Mass Gatherings: Reported mass gathering event

* Including: contact tracing, health monitoring, isolation, quarantine, health education and promotion, multimedia community sensitization, deploying experts to the field, national, multinational or cross-border meetings, vaccination, food, water and/or drug supplementation, or any other preventative or therapeutic interventions aimed at controlling the disease outbreak.
† Including: press conferences, news briefings or other forms of written or oral media.
‡ Including: armed conflict, skirmishes, clashes, war, civil war, civil unrest, militia, raid, unrest, hostilities, combat, confrontation, bloodshed, use of force, violence, instability, contested areas, insecurity, banditry, security compromised, security situation.
§ Including: droughts, earthquakes, famines, fires, floods, heat, hurricanes, landslides, rain, snow, tornados, tsunamis, and effects from El Niño or La Niña.
¶ Including: refugees, internally displaced persons, population displacements, cross-border migration.

In addition to the data collected from the DONs itself, country reports were also matched with their corresponding 3-letter ISO country codes from the ISO 3166 Online Browsing Platform [19].
Following this compilation, we reviewed the rows of data to create a standardized ontology to group DONs reports of the same outbreak, address inconsistencies in reporting, and classify outbreaks according to pathogen etiology. Our labelling conventions followed the "Year-Pathogen-Country" format.
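
As an illustration only, the following minimal Python sketch shows how a "Year-Pathogen-Country" label could be assembled, applying the start-year and "Global" rules detailed in the next paragraph. The function name, the "/" separator between countries, and the alphabetical ordering are assumptions made for this sketch and are not taken from the database itself.

```python
from typing import Iterable

# Rule from the text: more than 10 listed countries is labelled "Global".
GLOBAL_THRESHOLD = 10


def outbreak_label(start_year: int, pathogen: str, countries: Iterable[str],
                   reported_as_global: bool = False) -> str:
    """Build a Year-Pathogen-Country label for one outbreak.

    start_year is the first calendar year of a (possibly multi-year) event.
    The "/" separator and the sorted country order are illustrative choices.
    """
    unique = sorted(set(countries))
    if reported_as_global or len(unique) > GLOBAL_THRESHOLD:
        place = "Global"
    else:
        place = "/".join(unique)
    return f"{start_year}-{pathogen}-{place}"


print(outbreak_label(2012, "MERS-CoV", ["Saudi Arabia", "Jordan"]))
```

Keeping the threshold in a single named constant makes the "Global" rule easy to audit or adjust without touching the labelling logic.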
In the event that one report included more than one outbreak or pathogen, we listed each outbreak separately but indicated that the information was obtained from the same report. In the event that reports for certain types of outbreaks spanned two or more calendar years, outbreaks were labelled using the start year of the multi-year event. When events spanned multiple countries, all countries were listed in the label; however, if the DONs report labelled an outbreak as a global outbreak, or more than 10 countries were listed in the DONs reports, we labelled the event as "Global." Additional regional locations included in the DONs and captured in our ontology were West Africa, African Meningitis Belt, Asia, Northern Hemisphere, and Central America. We labelled travel cases based on the reported location, unless the genome sequencing of the pathogen, as described in the report, showed a clear origin of the outbreak. Table 2 details how we categorized diseases according to their pathogen etiology. Additional details on our ontology are available in an appendix (Additional File 1).

Results
We reviewed a total of 2,806 DONs reports from 1996-2019. The number of reports published annually ranged from 59 reports in 2011 to 205 reports in 2014 (Fig. 1). The average number of annual DONs reports between 1996 and 2019 was 117 and the median number of reports was 114.5. The three most commonly reported diseases were H5N1 influenza, Ebola virus disease, and MERS-CoV, which had totals of 453, 296, and 291 reports, respectively.
Using our ontology to organize the DONs reports resulted in a total of 1,105 unique infectious disease outbreaks. The number of unique outbreaks reported annually between 1996 and 2019 ranged from 30 unique outbreaks in 2007 to 77 unique outbreaks in 1998 (Fig. 2). Classifying the outbreaks by pathogen etiology, 223 (20%) were caused by directly-transmitted human pathogens, 118 (11%) were caused by seasonal and pandemic influenza strains, 228 (21%) were caused by vector-borne pathogens, 312 (28%) were caused by environmental or foodborne pathogens, 29 (3%) were caused by non-transmissible zoonotic pathogens, 169 (15%) were spillover events caused by transmissible zoonotic pathogens, and 26 (2%) were caused by other pathogens or events (Fig. 3).
Geographically, China, Saudi Arabia, the Democratic Republic of the Congo, Indonesia, and Egypt were the five countries most frequently affected by outbreaks reported in the DONs (Fig. 4).
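
The etiology shares above can be reproduced from the reported counts. The short Python sketch below uses only the counts stated in the text; the dictionary keys are shortened category names chosen for this illustration.

```python
# Outbreak counts by pathogen etiology, as reported in the text.
counts = {
    "directly transmitted": 223,
    "influenza": 118,
    "vector-borne": 228,
    "environmental or foodborne": 312,
    "non-transmissible zoonotic": 29,
    "transmissible zoonotic spillover": 169,
    "other": 26,
}

total = sum(counts.values())

# Percentage share of each category, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in counts.items()}

print(total)
for name, pct in shares.items():
    print(f"{name}: {counts[name]} ({pct}%)")
```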

Discussion
This work improves the accessibility of the data in the WHO's DONs reports and promotes future research efforts focused on analyzing WHO-confirmed infectious disease outbreaks. The availability of the data will allow for detailed analyses including descriptive work examining correlations between infectious disease outbreaks and other contextual factors, such as climatic events or conflict. Such analyses could help to predict future infectious disease outbreaks.
There are several notable shortcomings related to consistency, standardization practices, accuracy, and transparency that should be addressed. Regarding consistency, there is no clear format or structure for DONs reports. The reports and information included therein seem to be contingent on the specific disease, the information provided by the source or country, and even the report author. Reports can vary greatly in length and amount of detail provided. The inclusion of data tables, charts or links in reports can provide useful information, but their inclusion is infrequent and does not adhere to specific patterns or methodologies. While the lack of structure may improve flexibility and reflect the information reported to the WHO, these inconsistencies make it difficult to quickly identify important information.
Attempting to characterize the reports also presented a challenge while creating our database. While we strove to adhere strictly to the information contained in DONs reports, the reports do not employ a standardized list of pathogens or naming convention. For example, various DONs reports relating to Ebola referred to the disease as 'Ebola,' 'Ebola Haemorrhagic Fever,' 'EHV,' 'Ebola Virus Disease,' and 'EVD.' Additionally, the DONs reports often switch between spellings of cities or provinces; for example, certain cities in Egypt appeared in consecutive reports with several different spellings. This oversight could make outbreaks appear to be more severe to those not intimately familiar with the geography of the outbreak by suggesting that the disease is spreading to other localities.
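
One way to handle such naming variation when coding reports is a simple alias table. The Python sketch below is illustrative only: the alias keys are the variants quoted above, while the choice of "Ebola virus disease" as the canonical name and the function name are assumptions made for this example.

```python
# Variants quoted in the text, mapped to one canonical name (an assumed choice).
ALIASES = {
    "ebola": "Ebola virus disease",
    "ebola haemorrhagic fever": "Ebola virus disease",
    "ehv": "Ebola virus disease",
    "ebola virus disease": "Ebola virus disease",
    "evd": "Ebola virus disease",
}


def canonical_disease(name: str) -> str:
    """Return the canonical disease name, ignoring case and extra whitespace.

    Names without a known alias are returned with whitespace tidied only.
    """
    tidy = " ".join(name.split())
    return ALIASES.get(tidy.lower(), tidy)


print(canonical_disease("EVD"))
print(canonical_disease("Ebola  Haemorrhagic Fever"))
```

The same pattern could be applied to place names with several spellings, provided a reviewed list of equivalent spellings is available.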
To this end, we believe that the WHO could improve DONs reports by implementing a consistent and systematized reporting format. An appropriate place to start is with the criteria put forward by Smolinski and colleagues [18]; case total tables organized by defined probable, suspected and confirmed cases; and subsections for appropriate or priority contextual factors, such as meteorology or climate hazards, community resistance, conflict, migration, or mass gatherings. Doing so would make information easier to find, ensure consistent reporting, and could allow for the reports to become machine-readable, and thus a more accessible source of information.
The DONs reports also occasionally contained errors that could call the validity of reported results into question. For example, a report on avian influenza in Vietnam published on January 11, 2009 reports that an individual first developed symptoms on January 28, 2009 and was hospitalized on January 31, 2009, several weeks after the report was published [21]. Other reports contain similar errors and chronological inconsistencies. These include the same report published twice on consecutive days [22], and conflicting information regarding the subject of the report [23].
Closely related to the issues surrounding inconsistencies and errors in the DONs reports, transparency regarding how the reports are compiled presents an additional concern. The WHO does not publish any secondary information about the DONs reports or catalog. Many questions arise in the absence of this information, namely with regard to the author of reports, the information included in the reports, and the prioritization of pathogens or geographies. We posit that this information is omitted not out of any intent to obscure the process, but rather as a result of the aforementioned inconsistencies and reporting errors. We recommend the WHO publish secondary information detailing how reports are prepared and compiled. Establishing these parameters and guidelines would improve the transparency and standardization of the reporting process, could improve confidence in the reports themselves as the single authoritative collection of disease outbreaks, and might allow the catalog to be more easily adapted and used for analytic purposes.
Additionally, while reviewing DONs reports, it became clear that there exist significant discrepancies between the coverage allocated to various diseases. Some diseases, such as H5N1 influenza, Ebola virus disease, and MERS-CoV, enjoy consistent, precise, and timely coverage. In contrast, other infectious diseases may not appear in a DONs report until months after the WHO is notified. Cholera, for example, is the fourth most commonly reported disease in the DONs, but some reports are not published until months after initial reporting of the outbreak to relevant authorities [24]. Furthermore, we identified multiple events that received support from the Contingency Fund for Emergencies (CFE) that are not included in the DONs. For example, a 2017 outbreak of dengue fever in Pakistan received funding from the CFE, as did a 2018 outbreak of malaria in Nigeria, but both were omitted from the DONs reports. On the premise that these events constitute important health threats that warrant the allocation of millions of dollars for response efforts, their notable absence in DONs reports is surprising, and ultimately supports the conclusion that, though the WHO's catalog of DONs reports represents the only official source of surveillance data, it is far from a comprehensive summary of priority outbreaks.
To remedy these challenges, we recommend the WHO decide upon and publish a set of criteria outlining what events merit a DONs report. Acknowledging that a majority of emerging infectious diseases are zoonotic in nature [2,4], discussions surrounding pathogen detection in animal and non-human reservoirs must also be a part of this conversation. At present, some DONs reports contain information regarding the detection of pathogens in animals in the absence of any human cases, but these are neither consistent nor comprehensive.
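
To make the idea of a machine-readable report concrete, the following Python sketch outlines one possible record structure. It is a hypothetical illustration, not a format proposed by the WHO or used in our database: every class name and field name is an assumption, and the example values are made-up placeholders that do not come from any DONs report.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CaseCounts:
    """Case totals split by case definition (hypothetical field names)."""
    suspected: int = 0
    probable: int = 0
    confirmed: int = 0


@dataclass
class StructuredReport:
    """One report in a structured form (hypothetical schema)."""
    published: str            # publication date as an ISO 8601 string
    country_iso3: str         # 3-letter ISO 3166 country code
    disease: str
    cases: CaseCounts = field(default_factory=CaseCounts)
    contextual_factors: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Nested dataclasses are converted to plain dictionaries by asdict.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values for illustration only.
example = StructuredReport(
    published="2019-01-01",
    country_iso3="COD",
    disease="Ebola virus disease",
    cases=CaseCounts(suspected=5, probable=2, confirmed=3),
    contextual_factors=["conflict", "migration"],
)
print(example.to_json())
```

A fixed set of fields such as this would let case totals and contextual factors be extracted without reading the prose of each report.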
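
Chronological inconsistencies of this kind lend themselves to automated checks once dates are tabulated. The Python sketch below is a minimal illustration using the dates from the example above; the function name and the event keys are assumptions made for this example.

```python
from datetime import date
from typing import Dict


def events_after_publication(published: date,
                             events: Dict[str, date]) -> Dict[str, date]:
    """Return the events dated later than the report's publication date."""
    return {name: when for name, when in events.items() if when > published}


# Dates taken from the example report described in the text.
flagged = events_after_publication(
    published=date(2009, 1, 11),
    events={
        "symptom_onset": date(2009, 1, 28),
        "hospitalization": date(2009, 1, 31),
    },
)
for name, when in flagged.items():
    print(f"{name} on {when.isoformat()} is after the publication date")
```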
We believe the standardized methodology we provide addresses many of these limitations while maintaining the integrity of the data in the DONs reports and hope that our efforts to create a searchable, standardized database of DONs reports will inform important global health analyses and policy decisions.

Conclusions
This work improves the accessibility of the data in the WHO's DONs reports, the only official and public source for WHO-confirmed infectious disease outbreaks. The resulting dataset ultimately addresses several limitations in the current reporting practices used by the WHO and will allow for important and informative analyses in the future, including descriptive work examining correlations between infectious disease outbreaks and other contextual factors.