The microservices described above are packaged into container images using Docker. This lets local developer testing, continuous integration (CI), and deployment share a single environment configuration, minimising defects introduced by behavioural variance between development time and run time.
Developers can start a local copy of the platform using the development scripts in the line list repository at https://github.com/globaldothealth/list. The development script uses docker-compose to create a local MongoDB instance, launch all of the Global.health microservices, and expose network ports so that the developer can use the API or point their browser at https://localhost:3002/ to see the development version of the curator portal. Developers additionally have the option of running a full-stack version for testing, which also creates a local stand-in for the Amazon Web Services platform using the Localstack project, so that interactions with AWS can be tested.
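As a minimal sketch of a local development session, assuming the repository layout described above (the script name and path are illustrative; the repository's README is authoritative):

```bash
# Clone the line list repository and start the local stack.
# The script name below is a placeholder; consult the repository's
# README for the current entry point.
git clone https://github.com/globaldothealth/list.git
cd list
./dev/run_stack.sh   # wraps docker-compose: MongoDB, microservices, exposed ports
# The curator portal is then served at https://localhost:3002/
```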
Global.health is deployed in production on a Kubernetes cluster hosted in Amazon’s Elastic Kubernetes Service (EKS). The definition of this cluster is supplied in the GitHub repository in the k8s folder. The cluster definition creates separate sets of “pods” (the Kubernetes term for a group of containers deployed together): one running the production instance of Global.health at https://data.covid-19.global.health, one running the development and testing instance at https://dev-data.covid-19.global.health, and one running a QA instance at https://qa-data.covid-19.global.health.
Access to Global.health Data
Researchers can work with the line list data in Global.health by downloading the data from the curator portal in CSV, JSON, or TSV formats. The whole data set can be downloaded in CSV format with one file per country, or a researcher can download the results matching a given filter query in their choice of format.
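A filtered download can also be scripted against the curator API. The endpoint path, query syntax, and authentication header below are illustrative assumptions rather than the documented interface:

```bash
# Hypothetical example: download cases matching a filter as CSV.
# Replace the endpoint, query parameters, and auth header with the
# values documented in the curator portal's API reference.
curl -s "https://data.covid-19.global.health/api/cases/download?format=csv&country=NZ" \
  -H "Authorization: Bearer $GH_API_KEY" \
  -o cases_nz.csv
```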
Project Management and Software Engineering
The Global.health team develop the software in the open, in a collection of publicly available repositories found in the GitHub organisation globaldothealth and distributed under the terms of the open source MIT licence. The project team meets twice per week to check progress against the currently planned milestone, share updates, and identify blocked tasks that need collaboration between developers and data scientists. Additionally, the team uses Slack to communicate asynchronously outside of scheduled meetings.
Milestones are planned at approximately one-month intervals. Work allocated to a milestone is tracked using GitHub issues on the list repository. Developers use GitHub pull requests to share code changes for review and integration, the de facto standard practice on GitHub. A review from at least one core team member and fully passing automated tests are conventionally expected before a pull request is merged into the main branch, though neither is enforced by tooling.
Any change merged into the main branch is automatically packaged into a development release that can be deployed to the development instance of Global.health. To be included in a production build, a change must be merged into the release branch corresponding to the current published version, for example 1.9-stable.
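A minimal sketch of this flow, assuming the branch naming shown above (the change branch name is illustrative):

```bash
# Merging to main produces a development build automatically;
# promoting a change to production means merging it into the
# current release branch.
git checkout 1.9-stable
git merge --no-ff my-reviewed-change
git push origin 1.9-stable   # CI packages this into the production build
```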
Privacy and data governance
A major challenge in working with patient data is ensuring that privacy is maintained and that the data is governed in a way that complies with both data protection requirements and international privacy law. While this challenge is driven by legal and ethical considerations, it is also inherently a technical one, as it requires building data infrastructure that reflects these values and constraints.
Data privacy and governance have received widespread attention in recent years [21], particularly in response to the use of technology for disease surveillance to combat COVID-19 [22]. Much of this attention emphasises that the gains attainable from data-driven approaches in health research and public health decision making should not further increase structural inequalities. However, many of the legal and ethical frameworks developed to address these challenges do not translate easily into implementable technical solutions [23]. This implementation complexity, and its potential for rapid change, is amplified at a global scale, where a plethora of regulatory bodies and guidelines exist, each designed according to different norms regarding individual rights, including the rights to privacy and equality.
Given the multitude of perspectives and implications for our project, we opted to take a multidisciplinary approach to our infrastructure design process. We created a workstream dedicated to privacy and governance, with a core team consisting of an engineer, an epidemiologist, and a social data scientist. In addition, we partnered with a data protection consultancy (AWO, https://awo.agency) to add a legal and rights-based lens. The aim of this workstream was to conduct an in-depth Data Protection Impact Assessment (DPIA) to identify and remediate data protection risks within the platform. Beyond compliance, this exercise was also motivated by a rights-based approach aligned with the Global.health values of promoting health equity through open science.
The findings and learnings from this work are extensive and will be presented in a separate publication. Here we provide a brief summary of the major risks identified in our assessment, which are representative of the risks of building a publicly accessible platform of epidemiological line list records.
- Re-identification: working with disaggregated (i.e. line list) patient records carries a risk of patient re-identification [24]. Given the benefits of analysing individual patient records, we sought a combination of measures that improve patient de-identification without aggregating records. We identified three remediations: (a) aggregate values into ranges, for example providing an age range rather than an exact age (see the sketch following this list); (b) remove free text fields, as the exposure of direct identifiers within them is difficult to control; (c) ensure that the direct identifiers Global.health uses internally to create a unique patient ID are stored separately from the production dataset. These measures complement the existing removal of direct identifiers from Global.health’s public-facing dataset.
- Geographic data compliance: the global distribution of our data sources requires particular attention to regional data compliance requirements. This encompasses both matters relating to data subjects’ and users’ rights (addressed in the re-identification point above) and more property-related questions, such as where the data is hosted.
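As a minimal sketch of remediation (a) above, assuming a CSV export whose third column holds an exact age (the column position and the five-year bucket width are illustrative, not the platform's actual schema):

```bash
# Replace an exact age in column 3 with a five-year range, e.g. 42 -> "40-44".
# Assumes a header row and comma-separated fields; adapt to the real schema.
awk -F, 'NR > 1 { lo = int($3 / 5) * 5; $3 = lo "-" (lo + 4) } 1' OFS=, cases.csv > cases_bucketed.csv
```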
We are taking steps to mitigate the risks listed above. To more readily support compliance with different regulatory regimes, we are enabling the rapid deployment of new replicas of Global.health. These instances can be deployed to different hosting regions within a cloud provider or to an institution’s own data centre.
Deploying new Global.health instances
One of the goals of Global.health is to provide outside organisations or individuals the ability to deploy their own instances of the software stack. This serves a number of purposes, including working with data that is not compatible with the Global.health database schema, tracking data related to novel outbreaks beyond the scope of Global.health, or other unanticipated needs. Doing so currently requires non-trivial IT expertise (see Box 1 for a description of the requirements as they stand at time of writing). We look forward to collaborating with groups who want to run their own copies of the software, learning from the experience and improving the process, including by reducing the technical overhead with a combination of scripting and infrastructure as code.
A brief outline of the steps currently required is given here, followed by a consolidated command sketch; further documentation is found in the globaldothealth/list repository.
- Create a Kubernetes cluster using the cluster configuration for the appropriate environment (development, QA or production) which is designed for use with Amazon EKS.
- Configure application secrets with, for example, API keys for Amazon and Mapbox, and connection strings for a MongoDB database.
- Deploy the services and ingress controller to the cluster by applying the definitions in the repository. The ingress controller is the service that controls access to the rest of the services in the Global.health infrastructure.
- Confirm that services have started correctly by querying the status of the Kubernetes configuration.
- Configure your DNS server with a CNAME, A, or AAAA record as appropriate linking your hostname for your Global.health instance to the ingress controller.
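The steps above might translate into commands like the following. The secret names, key names, and file paths are illustrative assumptions; the repository's documentation remains authoritative:

```bash
# 1. Cluster creation is environment-specific (e.g. via eksctl or the AWS console).
# 2. Configure application secrets (names illustrative):
kubectl create secret generic globaldothealth-secrets \
  --from-literal=MAPBOX_API_KEY=... \
  --from-literal=MONGO_CONNECTION_STRING=...
# 3. Apply the service and ingress definitions from the repository's k8s folder:
kubectl apply -f k8s/
# 4. Confirm that pods and services have started correctly:
kubectl get pods,services
# 5. Find the ingress controller's external address to point DNS at:
kubectl get ingress
```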
The engineering team modified the Global.health platform to operate at large scale, so that user demand does not lead to poor experiences such as long response times or uncompleted requests. When running, the platform must gracefully add and update case data, and serve that data to users without issue.
To add and update case data, the platform ingests new and updated records via separate containerised batch jobs, one per data source, on a regular schedule; delegating this work to infrastructure orthogonal to the serving path allows any volume of data to enter the system from any number of sources.
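If the jobs run on a managed batch service such as AWS Batch, scheduling one ingestion per source might look like the following; the queue name, job definition, and environment variable are assumptions for illustration:

```bash
# Hypothetical: submit one containerised ingestion job per data source.
aws batch submit-job \
  --job-name ingest-source-germany \
  --job-queue ingestion-queue \
  --job-definition gh-ingestor \
  --container-overrides 'environment=[{name=SOURCE_ID,value=germany}]'
```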
To provide data to users, the platform exposes a user interface tightly coupled to the curator service, managed by a Kubernetes cluster. To handle increases in traffic, the cluster can add nodes and increase CPU and memory allocations via configuration changes. At the time of writing, we do not autoscale based on traffic.
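Because scaling is driven by configuration rather than traffic, an operator's response to increased load might be as simple as the following (the deployment name and resource values are illustrative):

```bash
# Add replicas and raise per-container resource limits for the curator service.
kubectl scale deployment/curator --replicas=4
kubectl set resources deployment/curator --limits=cpu=2,memory=4Gi
```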
To further reduce the load on the data service and provide a better overall user experience through faster download times, the platform uses S3 as a caching layer for country-specific data sets and for the complete data set. On a regular cadence, the platform regenerates a file containing all case data for each country; when a user filters for that country’s data in the UI and requests a download, the curator service streams the file from S3 directly to the user, bypassing the database. In the future, this cache can grow to cover other frequently requested data sets.
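As a sketch of this caching layer, with a hypothetical bucket and key layout (the real bucket name and object keys are not documented here):

```bash
# Hypothetical: country files are regenerated on a schedule and the
# curator service streams them to the user without touching MongoDB.
aws s3 ls s3://gh-case-cache/country/          # one object per country
aws s3 cp s3://gh-case-cache/country/DE.csv -  # stream straight to stdout
```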
If a service begins to use an excessive amount of CPU or memory, the engineering team receives a Slack notification with links to the service logs and can triage from there. If an ingestion job fails to complete, its logs remain available for diagnosis, and the failure does not affect running services.
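Triage typically starts from those logs; for example, assuming ingestion jobs write to a CloudWatch log group (the group name below is an assumption):

```bash
# Tail the most recent logs from a failed ingestion job.
aws logs tail /aws/batch/ingestion --since 1h --follow
```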