Study setting
The study sites will include 13 hospitals (Kigali University teaching hospital, Butare University teaching hospital, Ngoma regional hospital, Ruhengeri provincial hospital, Muhima district hospital, Kibagabaga district hospital, Nyamata district hospital, Nyagatare district hospital, Kinihira district hospital, Kigeme district hospital, Kirehe district hospital, Gisenyi district hospital and Gihundwe district hospital); two health centers (Remera health center and Nyamata health center); and one centralized dataset gathering 22 COVID-19 test centers (including Kanyinya, Rwankeri, Gatenga, Kicukiro, ASPEK-Ngoma, Kigali Transit Centre, Rugerero, Kabgayi and Rusizi). These study sites were selected to cover all four provinces of the country, for prediction and generalizability purposes.
Study design
The LAISDAR project is a federated data network based on the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) and on open-source Observational Health Data Sciences and Informatics (OHDSI) tools for data analytics and network integration, together with R Studio and Python. As shown in Fig. 1, the network will include several hospitals whose Electronic Health Record (EHR) data will be harmonized to the OMOP CDM and enriched with COVID-19 test results, COVID-19 survey results from a national database, and the results of the community surveys. An initial proof-of-concept (POC) implementation was set up and tested; it included the central LAISDAR instance and 2 data nodes, one on a Mac Mini and one on an AWS EC2 (Amazon Elastic Compute Cloud) instance.
The participating hospitals use 2 different open-source EHR systems: OpenMRS [25] and OpenClinic GA [26]. Therefore, two different ETL (Extract, Transform, Load) processes will be implemented, so that as few local adaptations per hospital as possible are needed.
Enrichment of EHR data is part of the ETL process, whereby available COVID-19 test and survey results will be retrieved from a central repository over a secure interface. One critical challenge with this step is consolidating individual patients; different person identifiers (national ID, mobile number, name, address), or combinations thereof, are used across different systems. We envision generating unique identifiers based on the available keys to facilitate a reliable and reproducible matching of records from the different source systems.
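One possible way to derive such identifiers is sketched below. It is a minimal illustration, not the project's matching procedure: the field names (`national_id`, `phone`, `name`), the shared secret, the rule of keeping the last 9 digits of a phone number, and the example records are all assumptions made for this sketch. Each record is normalized and then hashed with a keyed hash, so that the same person yields the same key in every source system without the raw identifier leaving the site.

```python
import hashlib
import hmac
import re

# Hypothetical project-wide secret; in practice it would be stored securely.
LINK_SECRET = b"replace-with-a-shared-secret"

def normalize_digits(value):
    """Keep only the digits of an identifier such as a national ID or phone number."""
    return re.sub(r"\D", "", value or "")

def normalize_name(value):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join((value or "").casefold().split())

def link_key(record, secret=LINK_SECRET):
    """Return a keyed hash of the strongest identifier available, or None."""
    national_id = normalize_digits(record.get("national_id"))
    # Assumption: the last 9 digits identify the subscriber regardless of prefix.
    phone = normalize_digits(record.get("phone"))[-9:]
    name = normalize_name(record.get("name"))
    if national_id:
        basis = "nid:" + national_id
    elif phone and name:
        basis = "tel:" + phone + "|name:" + name
    else:
        return None  # not enough information for a reliable match
    return hmac.new(secret, basis.encode("utf-8"), hashlib.sha256).hexdigest()

# Made-up records from two hypothetical source systems.
ehr_row = {"national_id": "1 1990 8 0012345 0 67", "name": "Jane Doe"}
lab_row = {"national_id": "1199080012345067", "phone": "+250 788 000 000"}
other_row = {"phone": "0788000001", "name": "John  DOE"}

print(link_key(ehr_row) == link_key(lab_row))    # same person, different formatting
print(link_key(ehr_row) == link_key(other_row))  # different person
```

In this sketch a record that carries only a phone number and a name cannot be matched to a record that carries only a national ID; a real implementation would need a crosswalk between key types or a probabilistic matching step.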
The integration of the sites with a central hub will be accomplished by using the open-source version of Arachne, which provides a platform for performing network studies, integrating OHDSI standards and tools.
Software deployments at the participating hospitals will rely on Docker-based containerization; this approach ensures consistent and reproducible installation across the different sites. For most participating hospitals, a pre-configured Mac Mini will be provided with the complete LAISDAR Dockerized software suite.
Data collection, analysis and management
The study will involve four main steps with regard to data collection, analysis and management:
Step 1: Data gathering/collection
This step includes the inventory of the existing data from the first 24 months of the COVID-19 pandemic in Rwanda (the first case was identified in March 2020) and 20 weeks of new data collection (through community surveys via telephone calls).
We anticipate data sources in different formats, ranging from COVID-19 related data registered in Excel documents, through data sources containing Minimum Clinical Data (MCD) in DHIS2 [25] and other systems, to more granular Electronic Medical Record (EMR) data in OpenClinic GA [26], OpenMRS [25] and other EMR systems. We will start by mapping full hospital patient records, focusing on 15 health facilities located in regions with a high number of COVID-19 patients, and then complete these with other isolated datasets.
The new data collection (community surveys) will use a standardized protocol and questionnaires and will follow a longitudinal approach. The participants will be randomly sampled following the sampling frame used for the recent Rwanda Demographic and Health Survey (based on the fourth Rwanda Population and Housing Census, RPHC) provided by the National Institute of Statistics of Rwanda (NISR) [27]. This sampling frame is a complete list of districts covering the whole country. Data will be collected through bi-weekly phone calls in 6 planned phases (starting in December 2021), involving 30 well-trained data collectors, one per district, supervised by 10 investigators. The minimum sample size required was estimated at 107 people in each of the 30 administrative districts of Rwanda. To anticipate consent refusals and drop-outs, we doubled this number to 214 participants per district. If a participant has a medical file in a participating hospital, or a record in another COVID-19 testing dataset, the datasets will be linked, with the possibility of further data linkage requests in the future.
The sample will include males and females in proportion to the number of inhabitants. Each participant will receive a mobile connection fee and an internet bundle each week to allow data collection. To mitigate the expected effect of the gender digital divide, and to reach selected persons who no longer have a mobile phone, the consortium established mitigation measures including, but not limited to, leveraging the community healthcare workers (CHWs). Each village in Rwanda has a CHW who participates in various Ministry of Health (MoH) programs, and all CHWs have received mobile phones from the MoH. If we select a respondent without a mobile phone, we will liaise with the nearest CHW to reach that respondent.
The questionnaires (which will be translated into 3 languages: Kinyarwanda, English and French) include 10 modules (at least 8 of them have to be fulfilled by the project): 1) Demographics; 2) Face mask use; 3) Hand hygiene; 4) Respect of social distancing measures and risk minimization measures; 5) Recent exposure to risk situations and COVID-19 measures. On the outcome side, the collected data will include 6) Coronavirus-like signs and symptoms; 7) Mental health indicators (based on Generalized Anxiety Disorder, GAD); 8) Socio-economic impact (based on loss of income, or categories) and 9) COVID-19 test results [28].
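The sample-size arithmetic above, and the proportional split by sex, can be checked with a few lines of Python. The figures 107, 2 and 30 come from the text; the 52% female population share is a hypothetical value used only to show the split, and the rounding rule is an assumption of this sketch.

```python
MIN_PER_DISTRICT = 107  # minimum sample size per district (from the text)
INFLATION = 2           # doubling for consent refusals and drop-outs (from the text)
N_DISTRICTS = 30        # administrative districts (from the text)

per_district = MIN_PER_DISTRICT * INFLATION
total_target = per_district * N_DISTRICTS

def split_by_sex(n, female_share):
    """Split n participants in proportion to the female population share."""
    n_female = round(n * female_share)
    return {"female": n_female, "male": n - n_female}

# Hypothetical district in which 52% of inhabitants are female.
example_split = split_by_sex(per_district, 0.52)
print(per_district, total_target, example_split)
```

Under these figures the target is 214 participants per district and 6,420 participants overall.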
Gender considerations.
The SARS-CoV-2 virus does not discriminate. In order to respond effectively to the crisis, we need a whole-society approach to understand its differential impact on women and men. Supporting gender analysis and sex-disaggregated data is an integral part of this project. Gender-related COVID-19 data are still scarce, and little is known on the topic in Rwanda. Therefore, collecting such data will be a key activity to bridge the gap and contribute to better gender-driven policies locally and in the region. Specifically, we articulated the gender considerations in this project in a multi-level approach including: (1) fostering gender balance throughout the research teams, in order to close the gaps in the participation of women, and (2) integrating the gender dimension in research and innovation (R&I) content, which will help improve the scientific quality and societal relevance of the produced knowledge, technology and/or innovation. Gender has been integrated as a transversal theme and not a vertical aspect. Gender facets are found in all COVID-19 consequences, including morbidity in general, mental health problems in particular, and socio-economic outcomes. Social and cultural factors related to gender will be addressed as well, such as specific considerations for some collected data elements (e.g., reproductive health data), the use of gender-sensitive research questions and gender-impartial language. Moreover, the sampling will pay special attention to gender-proportional balance while collecting new COVID-19 data, and key gender outputs and aspects will be derived from the data analysis.
Step 2: Infrastructure for data harmonizing (developing novel techniques)
For data harmonization, custom-designed ETL scripts will be developed per data source to extract, transform and load the source data into an OMOP CDM database instance. In the early stages, when the hospital EHRs are not yet harmonized, we will also use synthetic data approaches to help automate harmonization processes. The data owner-side infrastructure will include the OMOP CDM database instance, the Arachne client, the OHDSI Atlas [29] analytical tool, R Studio [29], and Jupyter [29]. The data harmonization process converts the observational data from the format of the source data system to the OMOP CDM supported by the OHDSI organization.
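To make the transform step concrete, the sketch below maps a patient row to a row of the OMOP CDM person table. It is an illustration under stated assumptions rather than one of the project's ETL scripts: the source field names (`patient_id`, `gender`, `birthdate`) and the example rows are hypothetical, and race and ethnicity are set to concept 0 because this imagined source does not record them. The gender concept IDs 8507 and 8532 are the standard OMOP concepts for male and female.

```python
from datetime import date

# Standard OMOP gender concept IDs; 0 means "no matching concept".
GENDER_CONCEPTS = {"M": 8507, "F": 8532}

def to_omop_person(source_row, person_id):
    """Map one source patient row (hypothetical field names) to an OMOP person row."""
    birth = date.fromisoformat(source_row["birthdate"])
    gender_raw = (source_row.get("gender") or "").strip().upper()
    return {
        "person_id": person_id,
        "gender_concept_id": GENDER_CONCEPTS.get(gender_raw[:1], 0),
        "year_of_birth": birth.year,
        "month_of_birth": birth.month,
        "day_of_birth": birth.day,
        "race_concept_id": 0,       # not recorded in this hypothetical source
        "ethnicity_concept_id": 0,  # not recorded in this hypothetical source
        "person_source_value": str(source_row["patient_id"]),
        "gender_source_value": source_row.get("gender"),
    }

# Made-up source rows, for illustration only.
source_rows = [
    {"patient_id": "A-001", "gender": "F", "birthdate": "1985-04-12"},
    {"patient_id": "A-002", "gender": "male", "birthdate": "1990-11-30"},
    {"patient_id": "A-003", "gender": None, "birthdate": "2001-01-05"},
]
person_table = [to_omop_person(row, i) for i, row in enumerate(source_rows, start=1)]
for person in person_table:
    print(person)
```

Keeping the original values in the `*_source_value` columns lets a later review trace every mapped concept back to what the hospital system actually stored.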
Step 3: Infrastructure for data access, query, and data analysis (Mixing existing methods and innovative techniques)
The central platform for data access, query, and data analysis will handle the participating data sources. The central site will use Arachne for the central portal with the data and study catalogues, as well as a PostgreSQL database, an OHDSI Atlas analytical platform instance and an R Studio instance. Additional tools, such as a Jupyter server instance, can be added. As a standard, the database will include an OMOP CDM schema and additional schema(s) to support the central data catalogue. At the beginning of the harmonization process, as the data from hospital EHRs will not yet be available, this project will use synthetic data to help automate harmonization processes and train models; specifically, we will use the mock-up data available from the OHDSI community (such as Synthea) to train different algorithms/models before we use them on real data.
The Arachne central server setup will allow central management of network studies, with tight integration with the OHDSI tools such as Atlas. The Arachne data catalogue will incorporate the Achilles output from each participating site; the Achilles tool generates a profile of the participating sites’ data on an aggregate level, which will allow a central view of the descriptive statistics for each site. The R Studio and Jupyter instances will allow the development and testing of R scripts as part of a study design, or to analyze data collected from data source sites as part of studies.
Step 4: Data analysis and interpretation (Mixing existing methods and innovative techniques)
The federated datasets are challenging to analyze with traditional statistical methods because, like other real-world data (RWD), they are 1) collected without any intention of being used in research; 2) incomplete and not cleaned; and 3) collected sporadically rather than through a purely longitudinal approach, so that cohort-like data cannot be derived from them directly.
The current project will leverage AI techniques, including machine learning and data mining, which bring added value in discovering hidden patterns or relationships between data points. In this project, as in other similar works [20], we will first evaluate the performance of different deep learning methods, including the hybrid convolutional neural network-long short-term memory (LSTM-CNN), the hybrid generative adversarial network-gated recurrent unit (GAN-GRU), GAN, CNN, LSTM, and Restricted Boltzmann Machine (RBM), as well as baseline machine learning methods, namely logistic regression (LR) and support vector regression (SVR) [20].
Additionally, we will evaluate the added value of a dual-mode system consisting of (1) a continuous-time version of the Gated Recurrent Unit (GRU), building upon the recent Neural Ordinary Differential Equations (Neural ODEs), and (2) a Bayesian update network that processes the sporadic observations (GRU-Bayes). In this approach, GRU-ODE [30] is responsible for learning the continuous dynamics of the latent process that generates the observations, and GRU-Bayes is responsible for handling incoming observations and updating the current conditional estimate of the latent process [30]. These two modules are similar to the propagation and update steps of a Kalman filter. With GRU-ODE, we expect to evaluate the capacity to project the hidden process h(t) forward in time, and hence, indirectly, future observations. GRU-Bayes performs the update of the hidden state conditioned on new observations.
After this evaluation, a final method will be implemented. We hypothesize, however, that the hybrid models (i.e., LSTM-CNN and GAN-GRU) will potentially improve the forecasting accuracy of future COVID-19 trends, based on previous works [31].
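The propagate-then-update structure described above can be sketched in plain Python. This is a toy with untrained random weights and made-up observation times, intended only to show the control flow. The propagation step is an Euler discretization of the input-free GRU-ODE dynamics dh/dt = (1 - z) * (g - h). The update step is an ordinary discrete GRU update applied when an observation arrives, which is a simplified stand-in for GRU-Bayes and not the Bayesian update network of [30]. All function and variable names are assumptions of this sketch.

```python
import math
import random

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(W, x, b):
    """Compute W x + b for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_layer(n_out, n_in, rng):
    """Random weights and zero biases (untrained, for illustration only)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def gates(h, x, layers):
    """Update gate z and candidate state g of a GRU; x may be empty."""
    (Wr, br), (Wz, bz), (Wg, bg) = layers
    r = sigmoid(affine(Wr, x + h, br))
    z = sigmoid(affine(Wz, x + h, bz))
    rh = [ri * hi for ri, hi in zip(r, h)]
    g = [math.tanh(a) for a in affine(Wg, x + rh, bg)]
    return z, g

def propagate(h, ode_layers, dt):
    """One Euler step of dh/dt = (1 - z) * (g - h), with no input."""
    z, g = gates(h, [], ode_layers)
    return [hi + dt * (1.0 - zi) * (gi - hi) for hi, zi, gi in zip(h, z, g)]

def update(h, obs, upd_layers):
    """Discrete GRU update on a new observation (simplified stand-in for GRU-Bayes)."""
    z, g = gates(h, obs, upd_layers)
    return [zi * hi + (1.0 - zi) * gi for hi, zi, gi in zip(h, z, g)]

rng = random.Random(0)
H, D = 4, 1  # hidden size and observation size (arbitrary choices)
ode_layers = [init_layer(H, H, rng) for _ in range(3)]
upd_layers = [init_layer(H, D + H, rng) for _ in range(3)]

# Hypothetical sporadic (time, value) observations.
observations = [(0.7, [0.3]), (1.9, [-0.1]), (4.2, [0.5])]
dt, t, h = 0.1, 0.0, [0.0] * H
trajectory = []
for t_obs, value in observations:
    for _ in range(round((t_obs - t) / dt)):  # propagate between observations
        h = propagate(h, ode_layers, dt)
    h = update(h, value, upd_layers)          # jump when an observation arrives
    t = t_obs
    trajectory.append((t, list(h)))

print(trajectory)
```

Because both steps are convex combinations of the current state and a tanh-bounded candidate, the hidden state stays within [-1, 1] whenever it starts there, which is one of the stability properties that motivates the GRU-ODE form.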
In the prediction models, the sequential reproduction number R(t) will be estimated using a Bayesian approach on the extended SEIR compartmental model. Bayes' rule will be used to update the beliefs about the true R(t) based on model predictions and on the new cases reported each day.
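The daily Bayesian update can be illustrated with a deliberately simplified stand-in. The sketch below does not use the extended SEIR model; instead it assumes that the expected number of new cases today equals yesterday's count multiplied by exp(gamma * (R - 1)), that counts are Poisson distributed, and that R is constant over the window, and it evaluates the posterior on a grid of candidate R values. The recovery rate, the grid and the case series are all hypothetical.

```python
import math

GAMMA = 1.0 / 7.0  # assumed recovery rate (1 / infectious period in days)
R_GRID = [i * 0.05 for i in range(1, 121)]  # candidate R values from 0.05 to 6.00

def log_poisson(k, lam):
    """Log-probability of observing k events under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def bayes_update(prior, cases_prev, cases_now):
    """Posterior over R_GRID after one more day of reported cases."""
    weights = []
    for r, p in zip(R_GRID, prior):
        expected = cases_prev * math.exp(GAMMA * (r - 1.0))  # requires cases_prev > 0
        weights.append(p * math.exp(log_poisson(cases_now, expected)))
    total = sum(weights)
    return [w / total for w in weights]

def estimate_r(daily_cases):
    """Start from a flat prior and update it with each day's cases."""
    posterior = [1.0 / len(R_GRID)] * len(R_GRID)
    for prev, now in zip(daily_cases, daily_cases[1:]):
        posterior = bayes_update(posterior, prev, now)
    mean_r = sum(r * p for r, p in zip(R_GRID, posterior))
    return mean_r, posterior

# Made-up case series, not study data.
rising_cases = [20, 24, 29, 35, 42, 50]
falling_cases = [50, 42, 35, 29, 24, 20]
mean_rising, post_rising = estimate_r(rising_cases)
mean_falling, post_falling = estimate_r(falling_cases)
print(round(mean_rising, 2), round(mean_falling, 2))
```

In the project itself, the expected case count would come from the extended SEIR model's predictions rather than from this one-line growth rule, but the prior-times-likelihood update has the same shape.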
Model definition
We will use an extension of the SEIR model (Fig. 3) inspired by previous works [32]. This model splits the population into different categories, i.e. susceptible, exposed, infected and removed. The latter two categories are further broken down: the infected part of the population into super mild, mild, heavy and critical, and the removed part into the immune and dead fractions. A super mild infection refers to the category of asymptomatic people who are infected but are unaware of their own infection. Recent figures from Chinese scientists put this number at 86% of all infections [32].
Transitions between the different fractions of the population are indicated by the arrows, and their rates are expressed by parameters in the model. The two most important parameters in such a model are 1) the incubation period and 2) the rate of virus spread. Other parameters include the odds of having a super mild, mild, heavy or critical infection and, for each type of infection, the infectious period. All parameters except one were gathered from the available literature on coronavirus. The parameter that remains to be calibrated is ‘beta’, which determines the rate at which individuals transition from susceptible to exposed. Beta can be interpreted as the degree of social interaction or the amount of exposure to the virus. It is this parameter that is targeted when governments impose restrictions on their citizens. We will, therefore, focus on this parameter. Finally, a documented mathematical model will be discussed at a later stage, at the beginning of the project implementation.
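To illustrate the role of beta, the sketch below integrates a basic four-compartment SEIR model with a simple Euler scheme and compares two values of beta. It is not the extended model of Fig. 3: the severity sub-compartments are omitted, and the incubation period (5 days), infectious period (7 days), population size, initial conditions and both beta values are hypothetical choices made only for this example.

```python
def simulate_seir(beta, sigma=1 / 5, gamma=1 / 7, population=1_000_000,
                  exposed0=10, days=365, dt=0.1):
    """Euler integration of a basic SEIR model; returns one (S, E, I, R) tuple per day."""
    s, e, i, r = float(population - exposed0), float(exposed0), 0.0, 0.0
    steps_per_day = round(1 / dt)
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt  # susceptible -> exposed
            new_infectious = sigma * e * dt               # exposed -> infected
            new_removed = gamma * i * dt                  # infected -> removed
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        history.append((s, e, i, r))
    return history

def peak_infectious(history):
    """Largest number of simultaneously infected individuals."""
    return max(state[2] for state in history)

relaxed = simulate_seir(beta=0.6)     # hypothetical: little restriction
restricted = simulate_seir(beta=0.3)  # hypothetical: restrictions halve contacts

print(round(peak_infectious(relaxed)), round(peak_infectious(restricted)))
```

Lowering beta both reduces and delays the peak of simultaneous infections in this toy model, which is the mechanism through which restrictions act in the description above.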
Study expected results, outcomes and impact
Technical infrastructures:
An initial proof-of-concept (POC) implementation was set up and tested early in the project, which included the central LAISDAR instance and 2 data nodes – one on a Mac Mini and one on an AWS EC2 instance. The data nodes were set up using Docker containers providing the following services: a (PostgreSQL) OMOP CDM database, Atlas/web API, Achilles and Arachne (connected to the Arachne Central instance). The central server was set up with Arachne Central, where the Data Catalogue was configured, and studies were created and executed for testing the integration at the data nodes. The objective of the POC was to test the integration layer (Arachne), as well as to demonstrate the overall process flow for network studies; these objectives were met.
The next phase of development, which includes completing the ETL implementations and integrating the central COVID-19 test and survey results, is well underway.
The first phase of the project will include 15 hospitals; additional hospitals will be included in a later phase of the project.
Capacity building through training.
This project will mainly contribute to research and capacity building through training staff before and during the project, both at UR and at participating hospitals. The planned training includes: 1) data mapping infrastructure; 2) training on survey instruments; and 3) training on sensitive patient data handling, data harmonization, interoperability and medical terminology. A team from Ghent University (Belgium) will train the Rwandan research team on OHDSI OMOP CDM mappings, including terminology and coding.
Clinical, epidemiological, mental and socio-economic outcomes results:
This project will yield prediction models for the burden of COVID-19 in the community, as well as for the potential impact on hospital admissions or overall infection rates and the impact of various public health measures on 1) the evolution of the pandemic in the country; 2) the socio-economic situation; and 3) mental health (stratified by gender and other vulnerable groups). As intermediate results, the community survey will be analyzed separately on all scopes, including descriptive statistics of socio-economic impact, epidemiology, and mental and clinical outcomes. For the socio-economic outcome, the variables to be analyzed relate to the effect of COVID-19 on livelihood, with a focus on its effect on basic needs (food, medical care, school fees and transport), income, employment and savings. A logistic model will be formulated and used to analyze the socio-economic characteristics of people who have experienced economic difficulties due to the COVID-19 situation.
On epidemiological aspects, we will investigate the prognostic factors associated with the clinical outcome of COVID-19 and the drivers of the COVID-19 burden in Rwanda.
Regarding gender and mental health, multiple axes of research are planned, including 1) a longitudinal study on stigmatization and associated factors during the COVID-19 pandemic in Rwanda; 2) behavioral and gender-based violence outcomes of COVID-19 in Rwanda; 3) a longitudinal study on mental health wellbeing and associated factors during the COVID-19 pandemic; and others.
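As an illustration of the kind of logistic model mentioned above, the sketch below fits a logistic regression by gradient descent on synthetic respondents. Nothing here is study data or the project's model specification: the two predictors (`informal_job`, household size), the coefficients used to generate the data, and the learning settings are assumptions made for the example.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Fit intercept and coefficients by batch gradient descent on the log-loss."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for row, target in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))) - target
            grad[0] += err
            for j, xj in enumerate(row, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Synthetic respondents (not study data): [informal_job, household_size / 8].
rng = random.Random(42)
X, y = [], []
for _ in range(400):
    informal = rng.randint(0, 1)
    household = rng.randint(1, 8)
    p_difficulty = sigmoid(-1.5 + 1.2 * informal + 0.2 * household)
    X.append([informal, household / 8.0])
    y.append(1 if rng.random() < p_difficulty else 0)  # 1 = reported economic difficulty

coefs = fit_logistic(X, y)
odds_ratio_informal = math.exp(coefs[1])
print([round(c, 2) for c in coefs], round(odds_ratio_informal, 2))
```

The exponentiated coefficient is read as an odds ratio: a value above 1 for `informal_job` would mean that, in these synthetic data, informal workers have higher odds of reporting economic difficulty, holding household size fixed.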
Finally, a cultural analysis is planned to investigate how Rwandans deal with the COVID-19 pandemic and the related control measures.