Full Model
Fig. 1 displays validation set ROC and PR curves, as well as training set calibration curves, for all algorithms applied to the full patient cohort training set. The fraction of positive classes in the validation and testing datasets was 17.2%. The best-performing models during 5-fold stratified cross-validation were the extra trees and random forest classifiers, with an average AUC of 0.832 ± 0.015, followed by the XGBoost classifier (0.830 ± 0.018), the voting classifier (0.826 ± 0.017), and logistic regression (0.812 ± 0.015) (Fig. 1A). The random forest had the best AP (0.514 ± 0.040), followed by the XGBoost classifier (0.513 ± 0.047), the voting classifier (0.512 ± 0.039), the extra trees classifier (0.511 ± 0.034), and logistic regression (0.492 ± 0.040) (Fig. 1B). The training set calibration curves showed that the random forest, voting classifier, and extra trees were well calibrated, while logistic regression and XGBoost overpredicted COVID-19 infection risk in the bottom half of bins and had the highest Brier scores (0.168 ± 0.007 and 0.204 ± 0.005, respectively) (Fig. 1C). The random forest had the lowest Brier score (0.124 ± 0.005), followed by the voting classifier (0.142 ± 0.007) and the extra trees classifier (0.154 ± 0.005).
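The cross-validation protocol above can be sketched with scikit-learn. The synthetic cohort, the ~17% positive fraction, and the model settings below are illustrative assumptions, not the study's actual data or tuned hyperparameters:

```python
# Sketch: 5-fold stratified cross-validation reporting mean ± SD of AUC,
# average precision (AP), and Brier score. Synthetic data for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# class imbalance roughly mirroring the 17.2% positive fraction in the text
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.83, 0.17], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, cv=cv,
    scoring={"auc": "roc_auc",
             "ap": "average_precision",
             "brier": "neg_brier_score"},  # sklearn negates loss scorers
)
for name in ("auc", "ap", "brier"):
    vals = scores[f"test_{name}"]
    print(f"{name}: {abs(vals.mean()):.3f} ± {vals.std():.3f}")
```

Stratified folds preserve the positive-class fraction in each split, which matters at this level of imbalance.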
On the held-out test set, the random forest produced the highest AUC (0.849), compared with the voting classifier (0.848), extra trees classifier (0.842), XGBoost (0.841), and logistic regression (0.835) (Fig. 2A). The random forest also had the highest AP (0.510), while the extra trees, XGBoost, voting classifier, and logistic regression had APs of 0.509, 0.503, 0.497, and 0.487, respectively (Fig. 2B). Using operating thresholds carried over from the cross-validation step, we calculated the recall, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score of our models (Fig. 2C). By this measure, the extra trees classifier had the highest F1 score (0.565), followed by the voting classifier (0.562). Throughout the evaluation, the random forest produced the most stable results, placing first or second among the models we applied. It also had the lowest Brier score, indicating the best-calibrated probability predictions. We therefore selected the random forest as the full model.
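The threshold-dependent metrics reported in Fig. 2C follow directly from the confusion matrix. A minimal sketch, assuming a generic 0.5 operating threshold (the study's thresholds came from cross-validation and are not reproduced here):

```python
# Sketch: recall, specificity, PPV, NPV, and F1 at a fixed operating
# threshold. The threshold and toy data below are illustrative assumptions.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

def threshold_metrics(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "recall": tp / (tp + fn),       # sensitivity
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # precision
        "npv": tn / (tn + fn),
        "f1": f1_score(y_true, y_pred),
    }

# toy example
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.3, 0.7]
print(threshold_metrics(y_true, y_prob))
```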
After training the tuned random forest classifier on the entire training set, feature importance and model explainability were assessed through SHAP values. Figure 3 displays the SHAP summary plot for the top 20 most important features in the full model. Symptom-related features were regarded as the most important by the model, with asymptomatic status, chills, cough, and headache included in the top five. The model also identified that a person in self-isolation after COVID-19 exposure has a high risk of infection. Common COVID-19 symptoms such as fever, runny nose, sore throat, and muscle pain were also deemed important. Health workers who regularly wore N95 masks, medical gloves, hazmat suits, and face shields were at lower risk of COVID-19 diagnosis. Other behavioral features, such as the frequency and duration of outdoor activity, the frequency of offline meetings, and the density of people in the room most frequented by the health worker, were also associated with higher COVID-19 risk. The high ranking of behavioral features highlights the benefit of adding these factors to complement symptomatic ones.
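The SHAP analysis itself requires the `shap` package; as a lighter, related illustration of ranking a random forest's features globally, the sketch below uses scikit-learn's permutation importance instead. This is a stand-in for SHAP, not the study's method, and the feature names are hypothetical:

```python
# Sketch: global feature ranking for a random forest via permutation
# importance (a stand-in for SHAP). Data and feature names are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=0)
names = ["cough", "fever", "headache", "n95_use",
         "outdoor_freq", "room_density"]  # hypothetical labels
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0,
                                scoring="roc_auc")
ranking = sorted(zip(names, result.importances_mean),
                 key=lambda t: t[1], reverse=True)
for name, imp in ranking:
    print(f"{name}: {imp:.3f}")
```

Unlike permutation importance, SHAP additionally shows the direction of each feature's effect per sample, which is what allows statements like "regular N95 use lowers predicted risk."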
Jakarta Model
The Jakarta model was trained and cross-validated on health workers from Jakarta who submitted their results in 2021. The model was then tested on two datasets: respondents from Semarang and from Jakarta in 2022. The Semarang and Jakarta 2022 test sets comprised 12.7% and 18.76% of the entire dataset, respectively. The fraction of positive classes was 10.27% in the validation dataset, and 42.2% and 31.3% in the Semarang and Jakarta 2022 testing datasets, respectively. As shown in Fig. 4, the XGBoost classifier had the best predictive performance during cross-validation; its mean AUC (0.857 ± 0.017) outperformed the random forest (0.856 ± 0.016), extra trees (0.856 ± 0.019), voting classifier (0.856 ± 0.017), and logistic regression (0.843 ± 0.015) (Fig. 4A). The random forest (0.434 ± 0.039) and extra trees (0.434 ± 0.052) produced the best AP, followed by the voting classifier (0.429 ± 0.043), XGBoost (0.416 ± 0.045), and logistic regression (0.409 ± 0.041) (Fig. 4B). The calibration curves from training showed that the random forest was the most well-calibrated model, while the other models appeared poorly calibrated in the upper predicted-probability bins (Fig. 4C). The random forest also had the lowest Brier score (0.080 ± 0.0003).
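The calibration assessment behind Fig. 4C can be sketched with scikit-learn's reliability-curve utility. The data below are synthetic and perfectly calibrated by construction, purely to show the mechanics:

```python
# Sketch: reliability (calibration) curve plus Brier score. A well-calibrated
# model has per-bin observed positive fractions close to its mean predictions.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 500)
y_true = rng.binomial(1, y_prob)  # labels drawn from the predicted probs

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.round(frac_pos, 2))   # observed positive fraction per bin
print(np.round(mean_pred, 2))  # mean predicted probability per bin
print("Brier:", round(brier_score_loss(y_true, y_prob), 3))
```

For these perfectly calibrated probabilities the two printed rows track each other closely; overprediction, as described for logistic regression and XGBoost above, shows up as observed fractions falling below the mean predictions.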
On the Semarang test set, the random forest had the best AUC (0.745), followed by extra trees, XGBoost, the voting classifier, and logistic regression with 0.744, 0.743, 0.740, and 0.726, respectively (Fig. 5A). XGBoost had the best AP (0.705), followed by the random forest (0.694), voting classifier (0.694), extra trees classifier (0.689), and logistic regression (0.672) (Fig. 5B). By F1 score, the extra trees classifier scored highest (0.657), followed by logistic regression, random forest, XGBoost, and the voting classifier with 0.651, 0.649, 0.646, and 0.646, respectively (Fig. 5C).
On the Jakarta 2022 test set, the voting classifier produced the best AUC (0.762), with the random forest close behind (0.761); XGBoost, extra trees, and logistic regression achieved 0.760, 0.757, and 0.753, respectively (Fig. 6A). The highest APs were achieved by the voting classifier (0.548) and logistic regression (0.547), followed by the random forest (0.535), XGBoost (0.529), and extra trees (0.524) (Fig. 6B). The random forest's F1 score (0.582) outperformed the other models, with XGBoost (0.502) and the voting classifier (0.481) in second and third place (Fig. 6C). The F1 scores on this dataset were poorer than on the other test sets, highlighting time-based drift in the data, which rendered the chosen operating threshold suboptimal.40
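The effect of drift on a fixed operating threshold can be illustrated with a small sketch: on a (synthetic) shifted test set, the F1 at a threshold carried over from training falls short of the best F1 achievable on that set. The data and the 0.5 carried-over threshold are assumptions for illustration:

```python
# Sketch: a threshold tuned on training data is suboptimal after the score
# distribution shifts. Synthetic drifted test set; all values illustrative.
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

rng = np.random.default_rng(0)
# drifted test set: predicted scores sit lower than they did in training
y_true = rng.binomial(1, 0.3, size=1000)
y_prob = np.clip(rng.normal(0.25 + 0.20 * y_true, 0.15), 0, 1)

carried = 0.5  # threshold chosen on the (different) training distribution
carried_f1 = f1_score(y_true, (y_prob >= carried).astype(int))

# best F1 attainable on this set, scanning all candidate thresholds
prec, rec, _ = precision_recall_curve(y_true, y_prob)
best_f1 = np.max(2 * prec * rec / np.clip(prec + rec, 1e-12, None))
print(f"F1 at carried threshold: {carried_f1:.3f}, "
      f"best attainable: {best_f1:.3f}")
```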
Based on the AUC, AP, and Brier score, the random forest classifier was chosen for the Jakarta model because of its well-calibrated predictions and high training and testing performance. SHAP analysis was then performed on the random forest. Figure 7 displays the SHAP summary plot for the Jakarta model. The feature importance ranking for the Jakarta model closely resembles that of the full model.
Almost half of the top 20 features relate to symptoms of COVID-19, which shows that symptoms are key to predicting infection. Outdoor activities also contribute to the risk of COVID-19 infection, as shown by the high SHAP rankings of the average duration, frequency, and crowd density of outdoor activities. As in the full model, regular use of personal protective equipment (such as N95 masks, medical gloves, hazmat suits, and surgical hoods) was picked as a feature influencing infection risk. Additionally, the Jakarta model recognized washing hands after wearing masks as a predictive feature.