The microservices described above are packaged into container images using Docker. This lets local developer testing, continuous integration (CI), and deployment share a single environment configuration, minimising defects introduced by behavioural variance between development time and run time.
Developers can start a local copy of the platform using the development scripts in the line list repository at https://github.com/globaldothealth/list. The development script uses docker-compose to create a local MongoDB instance, launch all of the Global.health microservices, and expose network ports so that the developer can use the API or point their browser at https://localhost:3002/ to see the development version of the curator portal. Developers additionally have the option of running a full-stack version for testing, which also creates a local stand-in for the Amazon Web Services platform using the Localstack project, so that interactions with AWS can be tested.
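As a minimal sketch of a local development session, assuming the repository layout described above (the script name and path are illustrative; the repository's README is authoritative):

```bash
# Clone the line list repository and start the local stack.
# The script name below is a placeholder; consult the repository's
# README for the current entry point.
git clone https://github.com/globaldothealth/list.git
cd list
./dev/run_stack.sh   # wraps docker-compose: MongoDB, microservices, exposed ports
# The curator portal is then served at https://localhost:3002/
```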
Global.health is deployed in production on a Kubernetes cluster hosted in Amazon’s Elastic Kubernetes Service (EKS). The definition of this cluster is supplied in the GitHub repository in the k8s folder. The cluster definition creates separate sets of “pods” (the Kubernetes term for a group of containers deployed together): one running the production instance of Global.health at https://data.covid-19.global.health, one running the development and testing instance at https://dev-data.covid-19.global.health, and one running a QA instance at https://qa-data.covid-19.global.health.
Access to Global.health Data
Researchers can work with the line list data in Global.health by downloading the data from the curator portal in CSV, JSON, or TSV formats. The whole data set can be downloaded in CSV format with one file per country, or a researcher can download the results matching a given filter query in their choice of format.
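A filtered download can also be scripted against the curator API. The endpoint path, query syntax, and authentication header below are illustrative assumptions rather than the documented interface:

```bash
# Hypothetical example: download cases matching a filter as CSV.
# Replace the endpoint, query parameters, and auth header with the
# values documented in the curator portal's API reference.
curl -s "https://data.covid-19.global.health/api/cases/download?format=csv&country=NZ" \
  -H "Authorization: Bearer $GH_API_KEY" \
  -o cases_nz.csv
```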
Project Management and Software Engineering
The Global.health team develop the software in the open, in a collection of publicly available repositories found in the GitHub organisation globaldothealth and distributed under the terms of the open source MIT licence. The project team meets twice per week to check progress against the currently planned milestone, share updates, and identify blocked tasks that need collaboration between developers and data scientists. Additionally, the team uses Slack to communicate asynchronously outside of scheduled meetings.
Milestones are planned at approximately one-month intervals. Work allocated to a milestone is tracked using GitHub issues on the list repository. Developers use GitHub pull requests to share code changes for review and integration, the de facto standard practice on GitHub. A review from at least one core team member and fully passing automated tests are conventionally expected before a pull request is merged into the main branch, though neither is enforced by tooling.
Any change merged into the main branch is automatically packaged into a development release that can be deployed to the development instance of Global.health. To be included in a production build, a change must be merged into the release branch corresponding to the current published version, for example 1.9-stable.
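A minimal sketch of this flow, assuming the branch naming shown above (the change branch name is illustrative):

```bash
# Merging to main produces a development build automatically;
# promoting a change to production means merging it into the
# current release branch.
git checkout 1.9-stable
git merge --no-ff my-reviewed-change
git push origin 1.9-stable   # CI packages this into the production build
```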
Privacy and data governance
A major challenge in working with patient data is ensuring that privacy is maintained and that the data is governed in a way that complies with both data protection requirements and international privacy law. While this challenge is driven by legal and ethical considerations, it is also inherently a technical one, as it requires building data infrastructure that reflects these values and constraints.
Data privacy and governance have received widespread attention in recent years [21], particularly in response to the use of technology for disease surveillance to combat COVID-19 [22]. Much of this attention emphasises that the gains attainable from data-driven approaches in health research and public health decision making should not further increase structural inequalities. However, many of the legal and ethical frameworks developed to address these challenges do not translate easily into implementable technical solutions [23]. This implementation complexity, and its potential for rapid change, is amplified at a global scale, where a plethora of regulatory bodies and guidelines exist, each designed according to different norms regarding individual rights, including the rights to privacy and equality.
Given the multitude of perspectives and implications for our project, we opted to take a multidisciplinary approach to our infrastructure design process. We created a workstream dedicated to privacy and governance, with a core team consisting of an engineer, an epidemiologist, and a social data scientist. In addition, we partnered with a data protection consultancy (AWO, https://awo.agency) to add a legal and rights-based lens. The aim of this workstream was to conduct an in-depth Data Protection Impact Assessment (DPIA) to identify and remediate data protection risks within the platform. Beyond compliance, this exercise was also motivated by a rights-based approach aligned with the Global.health values of promoting health equity through open science.
The findings and learnings from this work are extensive and will be presented in a separate publication. Here we provide a brief summary of the major risks identified in our assessment, which are representative of the risks of building a publicly accessible platform of epidemiological line list records.
- Re-identification: working with disaggregated (i.e. line list) patient records carries a risk of patient re-identification [24]. Given the benefits of analysing individual patient records, we sought a combination of measures that improve patient de-identification without aggregating records. We identified three remediations: (a) aggregate values into ranges, for example providing an age range rather than an exact age (see the sketch following this list); (b) remove free text fields, as the exposure of direct identifiers within them is difficult to control; (c) ensure that the direct identifiers Global.health uses internally to create a unique patient ID are stored separately from the production dataset. These measures complement the existing removal of direct identifiers from Global.health’s public-facing dataset.
- Geographic data compliance: the global distribution of our data sources requires particular attention to regional data compliance requirements. This encompasses both matters relating to data subjects’ and users’ rights (addressed in the re-identification point above) and more property-related questions, such as where the data is hosted.
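As a minimal sketch of remediation (a) above, assuming a CSV export whose third column holds an exact age (the column position and the five-year bucket width are illustrative, not the platform's actual schema):

```bash
# Replace an exact age in column 3 with a five-year range, e.g. 42 -> "40-44".
# Assumes a header row and comma-separated fields; adapt to the real schema.
awk -F, 'NR > 1 { lo = int($3 / 5) * 5; $3 = lo "-" (lo + 4) } 1' OFS=, cases.csv > cases_bucketed.csv
```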
We are taking steps to mitigate the risks listed above. To more readily support compliance with different regulatory regimes, we are enabling the rapid deployment of new replicas of Global.health. These instances can be deployed to different hosting regions within a cloud provider or to an institution’s own data centre.
Deploying new Global.health instances
One of the goals of Global.health is to provide outside organisations or individuals the ability to deploy their own instances of the software stack. This serves a number of purposes, including working with data that is not compatible with the Global.health database schema, tracking data related to novel outbreaks beyond the scope of Global.health, or other unanticipated needs. Doing so currently requires non-trivial IT expertise (see Box 1 for a description of the requirements as they stand at time of writing). We look forward to collaborating with groups who want to run their own copies of the software, learning from the experience and improving the process, including by reducing the technical overhead with a combination of scripting and infrastructure as code.
A brief outline of the steps currently required is given here, followed by a consolidated command sketch; further documentation is found in the globaldothealth/list repository.
- Create a Kubernetes cluster using the cluster configuration for the appropriate environment (development, QA or production) which is designed for use with Amazon EKS.
- Configure application secrets with, for example, API keys for Amazon and Mapbox, and connection strings for a MongoDB database.
- Deploy the services and ingress controller to the cluster by applying the definitions in the repository. The ingress controller is the service that controls access to the rest of the services in the Global.health infrastructure.
- Confirm that services have started correctly by querying the status of the Kubernetes configuration.
- Configure your DNS server with a CNAME, A, or AAAA record as appropriate linking your hostname for your Global.health instance to the ingress controller.
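The steps above might translate into commands like the following. The secret names, key names, and file paths are illustrative assumptions; the repository's documentation remains authoritative:

```bash
# 1. Cluster creation is environment-specific (e.g. via eksctl or the AWS console).
# 2. Configure application secrets (names illustrative):
kubectl create secret generic globaldothealth-secrets \
  --from-literal=MAPBOX_API_KEY=... \
  --from-literal=MONGO_CONNECTION_STRING=...
# 3. Apply the service and ingress definitions from the repository's k8s folder:
kubectl apply -f k8s/
# 4. Confirm that pods and services have started correctly:
kubectl get pods,services
# 5. Find the ingress controller's external address to point DNS at:
kubectl get ingress
```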
The engineering team modified the Global.health platform to operate at large scale, so that user demand does not lead to poor experiences such as long response times or uncompleted requests. When running, the platform must gracefully add and update case data, and serve that data to users without issue.
To add and update case data, the platform ingests new and updated records via separate containerised batch jobs, one per data source, on a regular schedule; delegating this work to infrastructure orthogonal to the serving path allows any volume of data to enter the system from any number of sources.
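If the jobs run on a managed batch service such as AWS Batch, scheduling one ingestion per source might look like the following; the queue name, job definition, and environment variable are assumptions for illustration:

```bash
# Hypothetical: submit one containerised ingestion job per data source.
aws batch submit-job \
  --job-name ingest-source-germany \
  --job-queue ingestion-queue \
  --job-definition gh-ingestor \
  --container-overrides 'environment=[{name=SOURCE_ID,value=germany}]'
```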
To provide data to users, the platform exposes a user interface tightly coupled to the curator service, managed by a Kubernetes cluster. To handle increases in traffic, the cluster can add nodes and increase CPU and memory allocations via configuration changes. At the time of writing, we do not autoscale based on traffic.
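Because scaling is driven by configuration rather than traffic, an operator's response to increased load might be as simple as the following (the deployment name and resource values are illustrative):

```bash
# Add replicas and raise per-container resource limits for the curator service.
kubectl scale deployment/curator --replicas=4
kubectl set resources deployment/curator --limits=cpu=2,memory=4Gi
```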
To further reduce the load on the data service and provide a better overall user experience through faster download times, the platform uses S3 as a caching layer for country-specific data sets and for the complete data set. On a regular cadence, the platform regenerates a file containing all case data for each country; when a user filters for that country’s data in the UI and requests a download, the curator service streams the file from S3 directly to the user, bypassing the database. In the future, this cache can grow to cover other frequently requested data sets.
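As a sketch of this caching layer, with a hypothetical bucket and key layout (the real bucket name and object keys are not documented here):

```bash
# Hypothetical: country files are regenerated on a schedule and the
# curator service streams them to the user without touching MongoDB.
aws s3 ls s3://gh-case-cache/country/          # one object per country
aws s3 cp s3://gh-case-cache/country/DE.csv -  # stream straight to stdout
```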
If a service begins to use an excessive amount of CPU or memory, the engineering team receives a Slack notification with links to the service logs and can triage from there. If an ingestion job fails to complete, its logs remain available for diagnosis, and the failure does not affect running services.
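Triage typically starts from those logs; for example, assuming ingestion jobs write to a CloudWatch log group (the group name below is an assumption):

```bash
# Tail the most recent logs from a failed ingestion job.
aws logs tail /aws/batch/ingestion --since 1h --follow
```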