Evaluation of screening parameters and machine learning models for the prediction of neonatal sepsis: A systematic review

About 2.9 million neonates die every year worldwide, and most of these deaths occur in low-resource settings. Neonatal sepsis occurs when bacteria invade the bloodstream; the immune system mounts a systemic inflammatory response syndrome (SIRS) that damages the body and can quickly advance to severe sepsis, multi-organ failure, and finally, death. Sepsis in neonates can progress more rapidly than in adults; therefore, a timely diagnosis is critical. The gold standard test for diagnosing neonatal sepsis is blood culture, which takes at least 72 hours. Hence, identifying the key predictor variables and the models that work best can help reduce neonatal morbidity and mortality. The matching articles were identified by searching the PubMed, IEEE, and Cochrane bibliographic databases. For the inclusion of articles, the titles and abstracts were first screened against predetermined criteria, and then the full-text articles were screened. Thirty-one studies met the full inclusion criteria. The duration of rupture of membranes (ROM) was found to be more significant than other maternal risk factors. Heart rate and heart rate variability were found to be more significant than other neonatal clinical signs. C-reactive protein and the immature-to-total neutrophil (I/T) ratio were found to be more significant than other laboratory tests. The main limitation is the variation in the performance measures used in the studies, which made it difficult to perform a quantitative assessment. As shown by some of the studies, predictive algorithms that combine multiple variables are urgently needed to improve models for early detection, prognosis, and treatment of neonatal sepsis.


Introduction
About 2.4 million neonates die every year worldwide, and most of these deaths occur in low-resource settings [1] [2]. The third Sustainable Development Goal (SDG) aims to end preventable deaths of newborns and children under five years of age by 2030. However, this may not be achieved without a significant reduction in neonatal deaths directly related to infection in developing countries [3]. Sepsis is a significant cause of neonatal mortality and morbidity around the world [4] [5] [6], and most of the morbidity and mortality from sepsis is preventable.
Neonatal sepsis is classified as early-onset (< 48-72 h) or late-onset (> 48-72 h), depending on the age at onset [7] [8]. About 30-50% of neonatal sepsis survivors develop significant long-term impairments, including prolonged hospitalization, chronic lung disease, and neurodevelopmental disabilities [9] [10] [11]. Sepsis remains one of the most expensive causes of hospitalization, as recent data on its costs and burdens highlight [12] [13] [14] [15]. Physicians caring for infected neonates face multiple challenges in diagnostic and treatment decisions. Despite the increased understanding of its pathophysiology and efforts to improve clinical decision support in intensive care, there have been only modest improvements in neonatal sepsis outcomes [16]. Neonatal sepsis occurs when bacteria invade the bloodstream; the immune system mounts a systemic inflammatory response syndrome (SIRS), which damages the body and can quickly advance to severe sepsis, multi-organ failure, and finally, death [17] [18]. However, early recognition and prompt treatment are expected to improve the clinical management of sepsis and are key to reducing morbidity and mortality [19] [20] [21] [22] [23].
Delays in the recognition and treatment of sepsis remain a challenge despite the established importance of early intervention [6] [16] [24] [25] [26] [27] [28]. The clinical presentation in neonates is non-specific and overlaps with other newborn disease processes. Laboratory tests have limited diagnostic accuracy, which makes rapid diagnosis of neonatal sepsis difficult. The gold standard test for neonatal sepsis diagnosis, blood culture, is hampered by insufficient blood volume and a low number of invading microorganisms in the blood, which often produce false-negative results [29] [30]. Infants suspected of having sepsis are subjected to prolonged antibiotic therapy despite negative cultures. In order to tackle the challenges associated with sepsis recognition and care management, studies are making use of machine learning and statistical modeling approaches [31] [32] [33] [34].
Compared to other significant conditions, neonatal sepsis receives less international investment as a public health priority, despite the heavy burden of newborn deaths related to it [3]. Knowledge of the predictor variables of neonatal sepsis, early identification, and early intervention can reduce neonatal mortality and morbidity rates. This study aims to review the existing screening parameters and models based on their diagnostic performance, strengths, and weaknesses to better understand the algorithm development process.

Materials And Methods
Selection of screening parameters for analysis
A preliminary examination of the available literature was carried out, after which a list of parameters was consolidated for further review. These parameters were selected based on their appearance in the published literature and their potential for the diagnosis and prognosis of neonatal sepsis. The parameters include:
Maternal risk factors (intrapartum fever, chorioamnionitis, postnatal distress, duration of ROM, GBS colonization, and intrapartum antibiotics).
Neonatal clinical signs (gestational age, birth weight, heart rate, and feeding difficulty).
Laboratory tests (absolute neutrophil count, C-reactive protein, I/T ratio, micro-ESR, platelet count, and total leukocyte count).

Search Strategies
To carry out a landscape analysis identifying studies that report the diagnostic performance of the previously mentioned parameters, the PubMed, IEEE, and Cochrane bibliographic databases were searched. The search strategies were designed to maximize the number of relevant records retrieved. The following combinations of text words were used: "neonatal sepsis" AND "prediction" AND "machine learning", "neonatal sepsis" AND "prediction" AND "EHR", "neonatal sepsis" AND "prediction" AND "model", "neonatal sepsis" AND "prediction" AND "algorithm", "neonatal sepsis" AND "diagnostic algorithm" AND "machine learning", "neonatal sepsis" AND "screening parameters" AND "models", "neonatal sepsis" AND "screen" AND "models". The search was restricted to human subjects and to the period from January 2000 to April 2020. A total of 463 citations from PubMed, 305 from IEEE, and 86 from Cochrane were retrieved. These references were imported as separate files into an Excel sheet, except for the Cochrane references, which were imported only as a CSV file. Duplicates were removed, and the titles and abstracts of the retrieved citations were screened to find the articles relevant to the study. Additional relevant studies were retrieved by scrutinizing the bibliographies of the retrieved studies.
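The search strings and the merge-and-deduplicate step described above can be sketched in a few lines of Python. This is an illustration only, not the tooling used in the review: the `Citation` record, the `normalize_title` helper, the title-based duplicate rule, and the three sample records are assumptions introduced here, and the date filter works at year granularity rather than the January 2000 to April 2020 window.

```python
from dataclasses import dataclass

# Term combinations quoted in the Search Strategies section.
TERM_SETS = [
    ("neonatal sepsis", "prediction", "machine learning"),
    ("neonatal sepsis", "prediction", "EHR"),
    ("neonatal sepsis", "prediction", "model"),
    ("neonatal sepsis", "prediction", "algorithm"),
    ("neonatal sepsis", "diagnostic algorithm", "machine learning"),
    ("neonatal sepsis", "screening parameters", "models"),
    ("neonatal sepsis", "screen", "models"),
]


def build_queries(term_sets):
    """Join each term set into a quoted boolean AND query."""
    return [" AND ".join(f'"{term}"' for term in terms) for terms in term_sets]


@dataclass(frozen=True)
class Citation:  # hypothetical record layout, not from the source
    source: str
    title: str
    year: int


def normalize_title(title):
    """Lower-case the title and drop punctuation so near-identical titles match."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def deduplicate(citations, first_year=2000, last_year=2020):
    """Keep the first citation per normalized title within the year window."""
    seen = set()
    unique = []
    for citation in citations:
        if not first_year <= citation.year <= last_year:
            continue
        key = normalize_title(citation.title)
        if key not in seen:
            seen.add(key)
            unique.append(citation)
    return unique


queries = build_queries(TERM_SETS)

# Retrieval counts reported in the text; they sum to the 854 screened records.
retrieved_counts = {"PubMed": 463, "IEEE": 305, "Cochrane": 86}
total_retrieved = sum(retrieved_counts.values())

# Made-up records: two are the same paper from different databases,
# one falls outside the search period.
sample = [
    Citation("PubMed", "Heart rate variability and neonatal sepsis", 2015),
    Citation("IEEE", "Heart Rate Variability and Neonatal Sepsis.", 2015),
    Citation("Cochrane", "An older screening study", 1998),
]
unique_sample = deduplicate(sample)
```

Matching on normalized titles is a deliberately simple rule; a real workflow would probably also compare DOIs or author lists before discarding a record.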

Inclusion Criteria
For the inclusion of articles, the titles and abstracts were screened based on the following predetermined criteria:
The subject population consists of neonates.
Subjects have culture-proven sepsis or suspected sepsis based on a clinical algorithm.
The article evaluated any of the consolidated screening parameters and algorithms/models for neonatal sepsis diagnosis or prognosis.
The exhaustive search based on the titles and abstracts returned a broad spectrum of infection-related studies, from which only cases of neonatal sepsis were considered. Finally, full-text articles meeting the following criteria were included for analysis:
The subject population consists of neonates.
The study provided a clear definition of neonatal sepsis.
The study provided a definition of neonatal sepsis onset (i.e., time of onset).
The study clearly described the predictor variables used.
The study clearly described the machine learning models used or evaluated on any of the consolidated screening parameters.
The study provided diagnostic performance results (i.e., AUROC results).
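The full-text criteria above amount to a checklist in which every item must hold. A minimal sketch of that rule follows; the `StudyRecord` fields are hypothetical names chosen here to mirror the six criteria, and the example record is invented.

```python
from dataclasses import dataclass


@dataclass
class StudyRecord:  # hypothetical field names mirroring the six full-text criteria
    neonatal_population: bool
    defines_sepsis: bool
    defines_onset: bool
    describes_predictors: bool
    describes_models: bool
    reports_performance: bool


def failed_criteria(study):
    """Return the names of the criteria the study does not meet."""
    return [name for name, met in vars(study).items() if not met]


def meets_full_text_criteria(study):
    """A study is included only if no criterion fails."""
    return not failed_criteria(study)


# Invented example: a study that never states the time of sepsis onset.
example = StudyRecord(
    neonatal_population=True,
    defines_sepsis=True,
    defines_onset=False,
    describes_predictors=True,
    describes_models=True,
    reports_performance=True,
)
example_failures = failed_criteria(example)
example_included = meets_full_text_criteria(example)
```

Recording which criterion failed, rather than only a yes/no outcome, makes it easier to report reasons for exclusion at each stage.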

Exclusion Criteria
Selecting the relevant articles for this review from the large number of papers retrieved (n = 854) based on the selection criteria was challenging. To compile a comprehensive list of appropriate papers, articles that did not deal with neonates, as well as duplicates, reviews, meta-analyses, abstracts, editorials, and commentaries, were excluded.

Data Extraction
The available full papers were downloaded from PubMed, IEEE, and Cochrane sources. The data were extracted and compiled in an Excel spreadsheet. The following information was extracted from all the studies:
a. Publication characteristics (author's name, year of publication).
b. Study design (retrospective, prospective data collection and analysis).
i. Methods to avoid overfitting and also any additional external validation approaches.

The quality of the selected ML studies was assessed based on 14 criteria relevant to the objectives of the review, which were adopted from [35]. The assessment consists of five categories described in Table 1 above. A quality assessment table was produced by recording "yes" or "no" for each item in each category using the provided criteria.

Results
Out of 854 studies, 31 met the inclusion criteria. The literature search results, with reasons for exclusion at each stage, are presented in Figure 1 above.

Study Characteristics
Of the 31 included studies, 16 employed solely prospective analyses, 13 employed solely retrospective analyses, and 2 used both retrospective and prospective analyses [37,38]. The most frequent data sources were the University of Virginia Hospital (n = 8; 26%), followed by MIMIC-III (n = 3; 10%). In terms of the neonatal sepsis definition, the majority of the studies employed blood culture (n = 26; 84%), an observational condition (use of clinical signs) (n = 16; 52%), or a combination of blood culture and observational condition (n = 12; 39%). The studies modified the observational and laboratory definitions based on the available data and the predetermined neonatal sepsis onset time, mainly because there is no consensus definition of neonatal sepsis. The prevalence of neonates with sepsis ranged between 0.27% and 87.0%. Five studies did not report the prevalence [39,40,41,42,43]. Regarding the category of neonatal sepsis of interest, most studies focused on late-onset sepsis (n = 18; 58%), followed by early-onset sepsis (n = 4; 13%), while 9 studies [39,44,45,46,47,48,49,50,51] did not report the category of focus. Regarding demographics, 6 studies reported the median or mean age of the neonates, 11 reported the prevalence of male neonates, 2 reported the prevalence of female neonates, and only 3 reported the ethnicity of the investigated cohorts (see supplementary table 1).

Overview of Machine Learning Algorithms and Variables
A wide range of ML algorithms has been employed to build models for the early detection of neonatal sepsis, with some models being specific to the study population. Regression models of various types were the most used (n = 25; 81%), including logistic regression and linear regression [52]. Boosted tree models were the second most used (n = 6; 19%), including gradient boosting [42] or random forest [43], followed by support vector machines (SVM) [42] (n = 5; 16%). Most of the studies (n = 24; 77%) chose one or two ML models without justifying the choice. Seven studies (23%) [53,54,55,42,43,50,56] compared several models and identified the one with the best performance.
As for the analyzed variables, the most common variable category was neonatal clinical signs (n = 28; 90%), followed by laboratory tests (n = 10; 32%) and maternal risk characteristics (n = 5; 16%). Sixteen studies (52%) used a single variable category, while the remaining fifteen studies (48%) combined categories: neonatal clinical signs with laboratory tests (n = 10; 32%), and maternal risk characteristics with neonatal clinical signs (n = 5; 16%). None of the studies explored the combination of maternal risk characteristics, neonatal clinical signs, and laboratory tests. The number of screening parameters included in the respective models ranged between 2 [57] and 22 [50]. Concerning the features for detecting neonatal sepsis, the reviewed studies show that the duration of ROM was more significant than other maternal risk factors [46,58,50,59]. Heart rate and heart rate variability were more significant than other neonatal clinical signs [40,43]. C-reactive protein and I/T ratio were more significant than other laboratory tests [45,57].
[43]
Step 1: Create a set H of 17 heart rate variability features, H = (v_0 … v_d), 1 ≤ d ≤ 17
Step 2: Initialize the elements of set H; T = (c_0 … c_i), 1 ≤ i ≤ 17
Step 3: Check the result of the time, frequency and non-linear analysis in the elements of set T
IF i is defined as "Absolute" THEN RETURN True ELSE RETURN False END IF
Step 4: IF the number of Absolute is defined as "High" THEN RETURN "Neonatal Sepsis" ELSE RETURN "Normal" END IF
Table 3: Phase I: Pseudo code for the observational condition of the Medical Decision Support Algorithm for Neonatal sepsis [54]
The algorithm consists of three phases: observational condition, laboratory condition, and neonatal sepsis.
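The four-step heart rate variability pseudocode attributed to [43] can be read as a counting rule: flag each feature, count the flags, and compare the count with a cut-off. The Python sketch below follows that reading under stated assumptions. The pseudocode does not define when a feature counts as "Absolute" or when the count is "High", so the `is_absolute` callback, the `high_count_threshold` parameter, and the example rule and values are all hypothetical.

```python
from typing import Callable, Sequence

N_FEATURES = 17  # the pseudocode works on up to 17 HRV features


def classify_hrv(
    features: Sequence[float],
    is_absolute: Callable[[int, float], bool],
    high_count_threshold: int,
) -> str:
    """Flag each HRV feature, count the flags, and map the count to a label.

    `is_absolute` and `high_count_threshold` are assumptions: the source
    pseudocode names the "Absolute" and "High" conditions without defining them.
    """
    if not 1 <= len(features) <= N_FEATURES:
        raise ValueError(f"expected between 1 and {N_FEATURES} HRV features")
    # Step 3: evaluate the time, frequency and non-linear result of each feature.
    flags = [is_absolute(index, value) for index, value in enumerate(features)]
    # Step 4: a "High" number of flagged features is read as neonatal sepsis.
    if sum(flags) >= high_count_threshold:
        return "Neonatal Sepsis"
    return "Normal"


def low_value_rule(index: int, value: float) -> bool:
    """Hypothetical rule: a normalized feature below 0.5 is flagged."""
    return value < 0.5


# Invented, normalized feature vectors for illustration only.
reduced_variability = [0.1] * 10 + [0.9] * 7
preserved_variability = [0.9] * 17

label_reduced = classify_hrv(reduced_variability, low_value_rule, 9)
label_preserved = classify_hrv(preserved_variability, low_value_rule, 9)
```

With these invented inputs, ten of seventeen features are flagged in the first vector and none in the second, so the two labels differ. Any clinical use would need the per-feature rules and the cut-off from the original study.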

Model Validation
Approximately 58% of the studies did not report which methods were used to prevent overfitting, while 29% employed cross-validation (e.g., 4-fold, 5-fold, 10-fold, or leave-one-out cross-validation), and 19% employed bootstrapping to avoid overfitting. Concerning the models' limitations, 13% of the studies recommended additional variables to optimize model performance.
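For readers less familiar with the two resampling schemes named above, the following sketch shows how the train/test index sets are formed. It is a generic illustration in plain Python, not the procedure of any reviewed study; the function names, seeds, and sample sizes are assumptions made here.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices into k folds; each fold is the test set once.

    Setting k equal to n_samples gives leave-one-out cross-validation.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for test_position, test_idx in enumerate(folds):
        train_idx = [
            idx
            for position, fold in enumerate(folds)
            if position != test_position
            for idx in fold
        ]
        splits.append((train_idx, test_idx))
    return splits


def bootstrap_indices(n_samples, n_rounds, seed=0):
    """Draw training sets with replacement; unsampled indices form the test set."""
    rng = random.Random(seed)
    everyone = set(range(n_samples))
    rounds = []
    for _ in range(n_rounds):
        train_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        out_of_bag = sorted(everyone - set(train_idx))
        rounds.append((train_idx, out_of_bag))
    return rounds


five_fold = k_fold_indices(n_samples=10, k=5)
leave_one_out = k_fold_indices(n_samples=4, k=4)
bootstrap_rounds = bootstrap_indices(n_samples=10, n_rounds=3)
```

In both schemes the model is fitted on the training indices and scored on the held-out indices, so the reported performance comes from neonates the model has not seen during fitting.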
Additional external validation of the models was performed in only seven studies [60,44,40,58,61,48,38].
[Table 6 is in the supplementary files section.]
Table 6 shows the results of the quality assessment of the studies. The quality of the 31 studies was graded on a scale from poor (meeting ≤ 40% of the criteria) to very good (meeting ≥ 90% of the criteria). None of the studies fulfilled all 14 criteria, and none met ≥ 90% of the criteria. Few studies made the data used in their study available (n = 3; 10%). Only ten studies (32%) explained how features were generated before model training. Only two studies (6%) provided the code used for data cleaning and analysis. Only one study (3%) provided code to reproduce the exact sepsis labels [58]. Few studies reported the hyperparameters needed for study replication (n = 5; 16%). Finally, only seven studies (23%) validated their results on an external data set. With the exception of two studies [64,53], all studies had sample sizes larger than 50, which is a requirement for the interpretation, power, and validity of machine learning methods.
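The grading rule quoted above (14 yes/no criteria, with ≤ 40% met read as poor and ≥ 90% met read as very good) can be expressed as a small function. This is a sketch under assumptions: the text names only the two extreme grades, so the "intermediate" label used for everything in between is a placeholder introduced here, not a grade from the review.

```python
N_CRITERIA = 14  # number of quality criteria stated in the review


def quality_grade(answers):
    """Map a list of 14 yes/no answers (True = yes) to a quality grade."""
    if len(answers) != N_CRITERIA:
        raise ValueError(f"expected {N_CRITERIA} answers")
    percent_met = 100 * sum(bool(a) for a in answers) / N_CRITERIA
    if percent_met <= 40:
        return "poor"
    if percent_met >= 90:
        return "very good"
    # Placeholder: the source does not name the grades between the extremes.
    return "intermediate"


def answers_with(n_yes):
    """Helper for the examples: n_yes criteria met, the rest unmet."""
    return [True] * n_yes + [False] * (N_CRITERIA - n_yes)


grade_5_of_14 = quality_grade(answers_with(5))    # 35.7% of criteria met
grade_12_of_14 = quality_grade(answers_with(12))  # 85.7% of criteria met
grade_13_of_14 = quality_grade(answers_with(13))  # 92.9% of criteria met
```

With 14 criteria, a study needs at least 13 "yes" answers to reach the 90% threshold, and 5 or fewer "yes" answers fall at or below the 40% threshold.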
[Tables 7-10 are in the supplementary files section.]
Association is stronger with EOS than LOS.

Maternal age
Strength: Neonatal sepsis is common among infants of older mothers, and maternal age < 20 can be associated with EOS risk factors.
Weakness: Not validated or considered as a determining risk factor for neonatal sepsis.

Parity
Strength: Has strong association with neonatal sepsis.
Weakness: Association with neonatal sepsis is controversial.

Antibiotic treatment
Strength: Reduces the risk of infection to a mother and neonate.
Weakness: Increases other health risks to newborn infants.

Maternal CRP
Strength: It is associated with neonatal sepsis and a significant risk factor for neonatal sepsis.
Weakness: CRP values are also affected by other factors.

GBS status
Strength: Strongly associated with neonatal sepsis.
Weakness: Even though GBS remains the most frequent pathogen for EOS, there has been a shift in this as Escherichia coli (E. coli) becomes the most important pathogen causing EOS in preterm and very low birth weight infants.

Intrapartum fever
Strength: It's generally considered a major risk factor for EOS.
Weakness: The risk of neonatal sepsis in newborns delivered by mothers with intrapartum fever is low.

Heart rate variability
Strength: It is a significant risk factor for neonatal sepsis, and neonates have reduced heart rate variability (HRV) before clinical signs of sepsis.
Weakness: Its main drawback for early diagnosis of neonatal sepsis is the high false-positive rate.

Birth weight
Strength: It is one of the determining factors for neonatal sepsis as newborns with less than 2.5 kg are 1.42 times more likely to develop neonatal sepsis than newborns born with 2.5 kg and above.
Weakness: Infants with low birth weights are at increased risk for other forms of infection and infection-related mortality.

Respiratory rate
Strength: Its variability can be an indicator of sepsis.
Weakness: The variability in respiratory rate is also associated with other respiratory problems.

Heart rate
Strength: It is one of the most important clinical indicators to evaluate sepsis.
Weakness: An elevated score is not specific for sepsis and may occur in other conditions associated with nonspecific inflammation.

SpO2
Strength: Performs well for preclinical detection of sepsis.
Weakness: High altitudes and other factors may affect what is considered normal for a given neonate.

Poor feeding
Strength: It appears to be crucial in a diagnosis of sepsis.
Weakness: It is a nonspecific symptom seen in newborn

Discussion
This review summarized studies on neonatal sepsis that used ML algorithms to facilitate early prediction. It examined the ML methods, including cohort selection, predictor variables, outcomes, model building, and validation methods. A wide range of ML algorithms was used in the studies to leverage neonates' digital health data to predict sepsis. Based on the findings from the reviewed studies, this section outlines three major challenges that studies on neonatal sepsis prediction leveraging machine learning are currently facing: (i) asynchronicity, (ii) comparability, and (iii) reproducibility.

Asynchronicity
Studies focused on predicting neonatal sepsis with ML have been shown to increase predictive power and have produced promising results [43,50]. So far, however, reports diverge on which of the open challenges are the most pressing, which makes it difficult to achieve the goal of early neonatal sepsis detection. On the one hand, the blood culture test, which is the gold standard, has been described as the most reliable test for confirming neonatal sepsis [62]. On the other hand, recent findings have cast doubt on the validity and meaningfulness of the blood culture test: it has been described as unreliable because of the long time (48-72 hours) it takes to obtain the result and the insufficient amount of blood obtained from neonates, which produces false-negative results [54,50]. Similarly, it has been stated that neonatal clinical signs (e.g., heart rate variability) alone are sufficient for detecting neonatal sepsis [40,61]. However, recent studies cast doubt on this, stating that a combination of predictor variables yields better results in the detection of neonatal sepsis [48,57,43,50]. The developed ML models need to be evaluated in clinical trials to ascertain their usability in clinical settings, as most of the models were developed retrospectively and face multiple obstacles.

Comparability
In terms of comparability of the reviewed studies, several challenges were identified that are yet to be overcome: (i) neonatal sepsis definition, (ii) implementation of a given neonatal sepsis definition, and (iii) performance measures of the models. Each of these challenges is discussed below.

Defining and Implementing Neonatal Sepsis
The choice of neonatal sepsis definition is an obstacle that affects the comparison of studies in terms of the prevalence of septic neonates. A varied set of neonatal sepsis definitions (and modifications) was used in the reviewed studies. Having a large set of septic neonates is anticipated to be useful in training ML models (especially deep neural networks). However, having a high number of septic neonates could make it difficult to differentiate the septic neonates from the non-septic neonates. Neonatal sepsis is inherently hard to define, as over the years there has been no consensus definition for it. A previous study shows that the use of different sepsis definitions on the same dataset gives largely dissimilar cohorts [68]. This study found that blood culture is less inclusive, leading to a small cohort showing severe symptoms, which has been reported in several studies [54,43,50]. It was also seen that even the use of the same definition on the same dataset gives dissimilar cohorts. This can be confirmed from studies carried out at the University of Virginia and studies that used the MIMIC-III dataset (see Table 9). The underlying problem cannot be easily discovered, as the code for assigning the labels is not available in 30 of the 31 studies (97%). The diversity of neonatal sepsis prevalence is another factor that compounds the problem of comparability. Some studies balance their datasets to improve the training of the ML models, but this training setup can partly affect the study [56], while other studies keep the observed case counts to see how their approach will work in clinical settings. From this study's findings, it has been identified that the neonatal sepsis definition used and the data pre-processing steps affect both the prediction of sepsis and the prevalence [68]. The maximum prevalence reported is 87.0% [64].

Performance Measures of the Models
The choice of performance measures is the last obstacle to be discussed that is obstructing comparability. This obstacle is largely affected by the prevalence of neonatal sepsis in the study.
Accuracy is a simple performance metric directly influenced by class prevalence; comparing two studies with different prevalence values is therefore problematic. Some studies report the area under the receiver operating characteristic curve (AUROC, also known as AUC) to give a fuller account of performance. However, AUROC also depends on class prevalence and can be less informative on highly imbalanced classes [69].
The area under the precision-recall curve (AUPRC, also known as average precision) is preferable in such a situation. Both AUPRC and AUROC are affected by prevalence. However, AUPRC allows comparison with a random baseline that just "guesses" the neonate label, and it is useful when the positive class is of primary interest, whereas AUROC can be high even for classifiers that fail to identify the minority class of septic neonates. The effect of the choice of performance metric is most visible with highly imbalanced classes. Recent research recommends reporting the AUPRC of models, particularly in clinical studies [70], which is a good recommendation.
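The three metrics discussed above can be computed directly from labels and scores, which makes their behaviour on imbalanced cohorts easier to see. The sketch below uses plain Python and invented data; it is an illustration, not the evaluation code of any reviewed study. The average precision function ranks by descending score and breaks ties by input order, which is an assumption that only matters when scores repeat.

```python
def accuracy(y_true, y_pred):
    """Fraction of neonates whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """Probability that a random septic neonate scores higher than a random
    non-septic one (ties count as one half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each septic neonate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / hits


# Invented cohort of 100 neonates with 2 septic cases (prevalence 2%).
imbalanced_truth = [1, 1] + [0] * 98
always_negative = [0] * 100
trivial_accuracy = accuracy(imbalanced_truth, always_negative)

# Invented scores for 10 neonates, 2 of them septic (prevalence 20%).
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
example_auroc = auroc(y_true, scores)
example_ap = average_precision(y_true, scores)
random_ap_baseline = sum(y_true) / len(y_true)  # expected AP of random scores
```

In the first example a classifier that never predicts sepsis reaches 98% accuracy while detecting no septic neonate. In the second, the model reaches an AUROC of 0.875 and an average precision of 0.75, which can be read against a random-guess baseline equal to the prevalence (0.2 here), whereas the random-guess AUROC is 0.5 at any prevalence.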

Comparing Studies of Low Comparability
Based on this review's findings, comparing the reviewed studies quantitatively is currently a challenging task, which was also observed in a study by (Moor, et al., 2020). The studies were assessed qualitatively to identify underlying biases that could lead to unduly optimistic results. This was done because the best-performing methods could not be ascertained by just evaluating the numeric values of the performance measures. A meta-analysis would be preferable to summarize the overall trend in the performance of the models.

Reproducibility
Reproducibility, the ability to obtain consistent results using the same data and code as the original experiment, is the basis of scientific accountability. There have been failures of this accountability in several disciplines, including ML [35]. The use of sensitive data makes it difficult to release the datasets used in studies, which is one of the challenges digital medicine poses to reproducibility. Another challenge is the failure to describe in detail the preprocessing methods used in ML papers. The quality assessment identified important areas that need to be improved. Only two studies [50,66] made their analysis code available, and only one study [58] made available the code for generating a "label." Both cases amount to less than 10% of the eligible studies. In addition, only three studies [57,50,66] made available the dataset used for their study.
Only eight studies were found to share the hyperparameters used in their studies. However, a positive finding of this analysis is that a considerable number of studies (n = 10) shared the preprocessing methods used, which is useful information for the reproducibility of computational experiments.
This review focused on publications that studied the prediction of neonatal sepsis using ML algorithms. The majority of the reviewed studies defined neonatal sepsis as a positive blood culture or an observational condition (i.e., the use of neonatal clinical signs). None of the 31 included studies reflects an African cohort, which shows a significant dataset bias in the publications and insufficient research in Africa (see Supplemental table 1 for an overview of demographic information). The review found considerable room for improvement, which would benefit the comparability of different models, most importantly when ML models are evaluated prospectively.

Limitations
This review has some shortcomings. The reviewed studies had certain inherent limitations, as previously mentioned. The reporting of the models' diagnostic performance was suboptimal. The studies were assessed qualitatively due to the variation in the performance measures used. A meta-analysis would be preferable to evaluate the performance of the models. Some studies may have been omitted from the review because English language restrictions were applied.

Conclusion
Combining these variables is expected to strengthen the prediction of neonatal sepsis, as shown in some of the studies above. This study seeks to inform researchers about which predictor variables are required to develop algorithms/models with better diagnostic performance, which will improve the detection of neonatal sepsis. The parameters and machine learning models used in the reviewed studies differed widely, and so did their diagnostic performance. It should be considered that the signs of neonatal sepsis overlap with other symptoms and underlying conditions. What is important here is the weight assigned to a variable. Risk stratification based on maternal risk factors (such as intrapartum fever, chorioamnionitis, duration of ROM, GBS colonization, and intrapartum antibiotics), neonatal clinical signs (such as gestational age, birth weight, heart rate, postnatal distress, and feeding difficulty), and laboratory tests (such as absolute neutrophil count, C-reactive protein, I/T ratio, micro-ESR, platelet count, and total leukocyte count) could be considered for future studies.

Declarations
The work presented in this manuscript is the result of our original research work. Where we have used the works of other persons, due acknowledgements are clearly stated. This work has not been submitted for publication in any journal before.

Ethical approval and consent to participate
Not applicable, as the study reviewed only published data.

Consent for publication
Not applicable.

Availability of data and materials
Not applicable.

Competing Interests
The authors declare that they have no competing interests.

Funding
No external funding was obtained for this study.

Authors' contributions
EDP, WW and AM carried out the preliminary literature search following the PRISMA guidelines, tabulated and analyzed the collected data, and developed the first draft of the manuscript. KS contributed the neonatology expertise and edited the manuscript accordingly. All the authors reviewed and approved the final manuscript.

Acknowledgement
Not applicable. The supplementary file contains the filled PRISMA statement checklist for the study.
PRISMA flow diagram showing the search strategy and identified articles [36]

Supplementary Files
Tables.docx