Non-invasive Estimation of Clinical Severity of Anemia Using Hierarchical Ensemble Classifiers

Current techniques of anemia classification are either invasive, expensive or inaccurate, making them ill-suited for community health-worker based screening programs. In this study, we propose an Artificial Intelligence (AI) based anemia classification method using a multi-wavelength non-invasive photometry device. A finger-mounted photo-plethysmogram (PPG) device was designed to acquire PPG signals at four wavelengths (590, 660, 810, and 940 nm). A set of 13 attenuation and ratio-of-ratio features, derived using the peak and trough information extracted from the PPG signals, was used to develop a three-way hierarchical ensemble classification scheme using a machine-learning algorithm. PPG data from the device and true hemoglobin data from laboratory-based cell counters were collected for 1583 women of childbearing age, and subjects were classified into healthy (Hemoglobin, Hb > 11 g/dL), anemic (Hb: 7–11 g/dL) or severely anemic (Hb < 7 g/dL) categories. We report a classification sensitivity of 92% (p < 0.05) and specificity of 84% (p < 0.05) in differentiating anemic and non-anemic women. We also report a sensitivity of 76% (p < 0.05) and specificity of 74% (p < 0.05) in identifying severe anemia. We believe that the proposed anemia classification algorithm, along with the associated sensor, has the potential to be productized as a low-cost non-invasive anemia-screening device to rapidly determine next steps in clinical decision making in widespread community interventions.


Introduction
Anemia is a condition characterized by reduced levels of hemoglobin (Hb) in the blood, impacting the amount of oxygen in circulation [1], and it affects billions of people globally. Women and children are especially susceptible to the condition in both urban and rural settings, with very high prevalence and mortality in Low- and Middle-income countries (LMICs) [2]. Additionally, frequent monitoring of hemoglobin is required in several other conditions, including infectious diseases such as dengue fever [3,4].
The gold standard for Hb estimation involves an intravenous blood draw followed by testing the sample in a laboratory, making this method unsuitable for large scale communities and/or rural areas. As an inexpensive and appropriate alternative, the World Health Organization (WHO) released the Hemoglobin Color Scale (HCS) in 2001, a method to assess Hb concentrations outside the laboratory using frontline community health-care workers (CHWs) at the last mile [5]. Although this color scale based semi-qualitative method serves as a useful tool in primary healthcare settings, its sensitivity and specificity in field conditions have been largely inconsistent [6] due to the subjectivity in interpretation, adherence to instructions, and ambient lighting conditions. The invasive nature of the method also poses a barrier to adoption due to hesitation from subjects, and the heightened risk and awareness of blood borne infections [7,8].
Other conventional methods of estimating Hb concentrations include photometric techniques measuring absorbance of Hb in blood samples (Hemocue®, Hematocrit, among others) [9,10]. These methods have acceptable clinical accuracies and can be used for on-the-spot analysis but are invasive and expensive [11]. The issue of invasiveness has been addressed by the advent of non-invasive methods to estimate Hb. Non-invasive hemoglobinometers such as Masimo's Pronto and Orsense's NBM 200 have recently been introduced. However, several studies have reported variations in sensitivity and specificity of these devices, as well as low levels of precision and large limits of agreement [12,13]. Additionally, these devices are cost prohibitive for large scale use in public health programs in LMICs.
A study by Swaminathan et al. provides a quick, non-invasive, machine learning based method to estimate the Hb value [14]. Although this is advantageous, for population-based screening of anemia, especially by CHWs, it is the clinical severity of anemia (normal, mild, moderate, or severe as defined by WHO) that determines the next steps in clinical decision making, and decimal-level accuracy is seldom needed. For instance, severe gestational anemia requires urgent referral to a hospital for possible blood transfusion or parenteral iron infusion, whereas mild anemia during early stages of pregnancy can usually be managed by increased iron supplementation and dietary modifications. Hence, instead of estimating/predicting the Hb value, interpreting the anemia level based on age and gender and then deciding the treatment plan for an individual, a CHW could be equipped with an objective, easy-to-use classification tool to rapidly determine the clinical severity of anemia, thereby improving the efficiency of the health care facilities provided in most LMICs.
Additionally, a report on the recommendations for anemia treatment by the National Health Mission, Govt. of India [15] clearly details that in a public healthcare setting, the therapeutic regimens for the categories 7-9 g/dL and 9-11 g/dL do not differ significantly. In order to enhance the efficacy of CHWs, we intended to develop a rapid three-way classification technique for anemia screening, which exhibits better classification accuracy in comparison to the conventional classification scheme.
In this study, we leverage recent advancements in machine learning based approaches [16] to propose a non-invasive method of anemia classification, combining a multi-wavelength photoplethysmographic sensor and a novel classification algorithm based on hierarchical ensemble classifiers. We trained and tested this method against laboratory-based gold standard cell counters and achieved improved accuracies and reliability over the WHO color scale [6].

Hardware
An in-house PPG acquisition device with a mechanical lever-operated, fiber-optics based sensor, developed at Robert Bosch Engineering and Business Solutions Pvt. Ltd. (RBEI), India in collaboration with the Center of Bioengineering Innovation & Design (CBID) at the Johns Hopkins University, USA, was used in the study (Fig. 1). The device architecture consists of a single-point light source, a finger slot, a photodetector, an analog front-end board and a single-board computer to orchestrate the proposed solution. Light from four LED sources of different wavelengths (590 nm, 660 nm, 810 nm and 940 nm) was made to pass through the subject's finger from the dorsal end, and the attenuated intensity of light was detected at the ventral end using a photodetector. The four LEDs were switched ON at preprogrammed intervals of 10 ms. Only one LED was ON at any point of time. The intensity of light absorbed by the finger was de-multiplexed to correspond to the four LEDs, resulting in four independent PPG signals at the respective wavelengths. Actual values of total blood hemoglobin were calculated using the gold-standard automated cell counters. Automated Beckman Coulter LH780, Artocell-200 [Haematology Analyzer], Nihon Kohden Celltac and Mek-6420P were used during the process of data collection. Further details of the architecture, hardware and system design are provided as part of another study [17].
Fig. 1 Schematic representation of PPG collection device, automated peak and trough detection, and the classification scheme implemented in the machine learning algorithm
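The time-multiplexed acquisition described above can be illustrated with a minimal Python sketch. It assumes, which the paper does not state, that the photodetector readings arrive as a single interleaved stream in the fixed LED order 590, 660, 810, 940 nm with one sample per LED slot; the function name `demultiplex` and the sample values are hypothetical.

```python
WAVELENGTHS_NM = (590, 660, 810, 940)  # LED order within a cycle is an assumption

def demultiplex(interleaved, wavelengths=WAVELENGTHS_NM):
    """Split one interleaved photodetector stream into one PPG series per LED."""
    n_led = len(wavelengths)
    # Drop a trailing partial cycle so that all channels have equal length.
    usable = len(interleaved) - (len(interleaved) % n_led)
    return {wl: list(interleaved[k:usable:n_led]) for k, wl in enumerate(wavelengths)}

# Made-up readings: two full LED cycles plus an incomplete third cycle.
stream = [10, 20, 30, 40, 11, 21, 31, 41, 12, 22]
channels = demultiplex(stream)
```

Each entry of `channels` then holds the samples recorded while the corresponding LED was ON, which is the form the later peak and trough detection would operate on.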

Data Collection Protocol
With the assistance of the Prasanna School of Public Health, Manipal Academy of Higher Education (MAHE), India and in compliance with the Manipal University Ethics Board (MUEC/012/2016-17), data collection was conducted at public sector hospitals of Gangavathi and Koppal and a private sector hospital of Karkala. A total of 1583 women (17-52 years), 825 of them pregnant, provided the requisite informed consent to participate in the study, in accordance with the procedure laid out by the ethics committee. A set of four PPG signals of 60 s duration each was recorded for each subject. Each participant also provided an intravenous blood sample to determine the true hemoglobin values using automated cell counters (Beckman Coulter LH780, Artocell-200 [Haematology Analyzer], Nihon Kohden Celltac and Mek-6420P). The demographic distribution of the participants of the study is described in Table 1.

Peak and Trough Detection
All PPG signals were bandpass filtered with a second order Butterworth filter (passband = 0.25-10 Hz). This band was chosen to eliminate signal artefacts due to breathing (baseline drifts), motion artefacts and high frequency noise [18]. Since the information of the PPG signal is in the frequency range of 0.5-4.0 Hz, signal characteristics remained unaffected by the application of the filter. Peaks and troughs in the PPG signal were detected using the methods outlined in Wang et al. [19]. Figure 1 illustrates an example of the automated peak and trough detection on a representative segment of acquired PPG.

Visual Inspection for Motion Artifact Removal
The peak and trough labeled PPG signals were visually checked to remove signals that showed no PPG like waveforms and/or displayed significant motion artefacts, thus eliminating 67 of the 1583 total subjects from the dataset used for this analysis.

Feature Extraction
Estimation of Hb from PPG signals relies on the application of the Beer-Lambert law, which states the relationship between the incident intensity (from the LEDs), the attenuated intensity (captured by the photodetector) and the concentration of an analyte (blood molecules):
$$I' = I \, e^{-\varepsilon c d}$$
where $I$ is the LED intensity, $I'$ is the attenuated intensity, $c$ is concentration, $\varepsilon$ is the extinction coefficient, and $d$ is the path length traversed by LED light within the medium.
The absorption of incident intensity by the steady-state components of the blood (tissues, plasma, etc.) forms the DC component, and the absorption due to the pulsatile component of the blood corresponds to the AC component of the PPG signal [16]. The four LED wavelengths (590 nm, 660 nm, 810 nm and 940 nm) were chosen to maximize the absorption of the four Hb components (Oxygenated Hb, Deoxygenated/Reduced Hb, Carboxy Hb, and Methemoglobin) based on their absorption spectra [14]. A set of features representing the amount of light absorbed (attenuation features) and the pairwise ratio-of-ratios parameters, which capture the influence of the blood hemoglobin concentration on the attenuated signal, were chosen for the development of this classification algorithm.

Ratio of Ratios
A ratio ($r_i$) of the amplitude value of a peak to the amplitude value of the subsequent trough was calculated for every peak-trough pair, and a set of such ratios ($R_\lambda$) was calculated for each subject at each of the four wavelengths ($\lambda$): 590 nm, 660 nm, 810 nm and 940 nm:
$$R_\lambda = \{ r_i \}_{i=1}^{n}, \qquad r_i = \frac{\text{peak}_i}{\text{trough}_i} \tag{1}$$
where $i$ is the index of the peak or trough and $n$ is the total number of peaks/troughs detected in one PPG signal at wavelength $\lambda$.
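The per-beat ratio computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the peak and trough amplitudes are taken as already detected and index-aligned (trough `i` is the trough that follows peak `i`), and the function names and amplitude values are hypothetical rather than taken from the study.

```python
def peak_trough_ratios(peaks, troughs):
    """Return r_i = peak_i / trough_i for each peak and its subsequent trough."""
    n = min(len(peaks), len(troughs))  # ignore an unmatched trailing peak or trough
    return [peaks[i] / troughs[i] for i in range(n)]

def ratio_sets(peaks_by_wl, troughs_by_wl):
    """Build the set R for every wavelength (keys are wavelengths in nm)."""
    return {wl: peak_trough_ratios(peaks_by_wl[wl], troughs_by_wl[wl])
            for wl in peaks_by_wl}

# Made-up amplitudes for two of the four wavelengths.
peaks = {660: [4.0, 3.0], 940: [5.0, 6.0, 7.0]}
troughs = {660: [2.0, 2.0], 940: [2.5, 3.0]}
R = ratio_sets(peaks, troughs)
```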

Attenuation Values
A value defined as the logarithm of the difference between the peak amplitude and the subsequent trough amplitude was also considered as one of the features to train the machine learning algorithm (MLA) (Eq. 5),
where $A$ is the attenuation value, $I_n$ is the set of all peak to peak amplitudes in the PPG signal, and $i$ represents the index of the peak or trough.
For each of the four different LEDs, four distinct values ( A 590 , A 660 , A 810 , A 940 ) were created for each subject and used as features to train the MLA.
The final feature set used to train the MLA consisted of nine ratios of ratios and four attenuation values, listed in Table 2.

Classification Structure
A hierarchical classification scheme, with two classifiers, was adopted to develop the proposed MLA to classify subjects as healthy, mild to moderately anemic or severely anemic (Table 3).
Since the combined performance of a committee of models is expected to outperform a single model [19], both classifiers (Stage-1 and Stage-2) were modelled using an ensemble of eleven MLAs comprising four types of classification models: eight Support Vector Machine (SVM) models with different parameter tunings, one Logistic regression model, one Random forest model and one Boosted trees model. Each model within an ensemble was a binary classifier. The Stage-1 models classified a subject into the healthy or anemic category, while the Stage-2 models classified the anemic subjects into the moderately or severely anemic category. The final classification of a subject to a category at each stage, described in Fig. 1, was determined by the majority of votes across the models in that ensemble.
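The two-stage voting logic can be sketched as follows. This is a minimal illustration, assuming that each of the eleven models in an ensemble has already produced a binary vote; the function names and the example votes are hypothetical, and the underlying SVM, logistic regression, random forest and boosted trees models are not reproduced here.

```python
def majority_vote(votes):
    """Return True when more than half of the binary votes are positive."""
    return sum(1 for v in votes if v) > len(votes) / 2

def classify(stage1_votes, stage2_votes):
    """Stage-1 separates healthy from anemic; Stage-2 grades anemic subjects."""
    if not majority_vote(stage1_votes):
        return "healthy"
    if majority_vote(stage2_votes):
        return "severely anemic"
    return "mild to moderately anemic"

# Made-up votes from eleven models per stage (True = anemic / severely anemic).
stage1 = [True] * 7 + [False] * 4
stage2 = [True] * 3 + [False] * 8
label = classify(stage1, stage2)
```

With an odd number of voters, such as the eleven models per ensemble described above, a tie cannot occur, so no tie-breaking rule is needed in this sketch.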

Model Training and Testing
The data was randomly split into three mutually exclusive and exhaustive partitions: 70% for training and 15% each for validation and testing. For the Stage-1 classifier, all subjects from the training set were considered for training the ensemble models, while for the Stage-2 classifier, only subjects from the training set with true Hb values less than 11 g/dL were considered, as depicted in Fig. 2. In order to train the models independently, each model within the ensemble was trained using a randomly chosen subset of the training data.
To estimate a distribution of the model performance, the procedure was repeated for multiple iterations, and test sets generated at every iteration were kept aside for evaluation of the model performance.
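The partitioning described above can be sketched with the Python standard library. The rounding of partition sizes, the seed handling and the function names are assumptions made for illustration; the Hb values below are made up and are not study data.

```python
import random

def split_subjects(subject_ids, seed, train_frac=0.70, val_frac=0.15):
    """Randomly split subjects into disjoint train/validation/test partitions."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)  # a different seed gives a different iteration
    n_train = round(train_frac * len(ids))
    n_val = round(val_frac * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def stage2_training_subset(train_ids, hb_by_id, cutoff=11.0):
    """Keep only training subjects whose true Hb is below the anemia cut-off."""
    return [s for s in train_ids if hb_by_id[s] < cutoff]

train, val, test = split_subjects(range(1516), seed=0)
hb = {s: 6.0 + (s % 10) for s in range(1516)}  # made-up Hb values in g/dL
stage2_train = stage2_training_subset(train, hb)
```

Calling `split_subjects` with a different `seed` on each iteration yields the repeated random splits from which a distribution of performance values can be collected.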
Each model generated a probability estimate, indicating the chance of a subject belonging to one of the two categories (at each stage). Based on this value being above or below a threshold, the subject was classified into either category.
For a model with probability estimate $p$ and threshold $x$, an individual was assigned to one of the two categories according to whether $p$ was above or below $x$.
To optimize the performance of each classifier, an optimal value of threshold was derived from a range of threshold values using the validation set. The set of threshold values considered for this purpose consisted of 100 equally spaced values from 0 to 1. The predictions of the classifiers at these thresholds for the validation set were calculated, and an array of sensitivity and specificity values was generated. This process was repeated 30 times using random splits, resulting in a distribution of sensitivity and specificity values at each threshold for every model. A receiver operating characteristic (ROC) curve was generated from the median values of the distribution of sensitivity and specificity values. ROC curves for the eleven models arranged row-wise (eight Support Vector Machine (SVM) models with different parameter tunings, one Logistic regression model, one Random forest model and one Boosted trees model) in the Stage-1 and Stage-2 classifier ensembles are shown in Fig. 3. The threshold value corresponding to the best trade-off between sensitivity and specificity was then finalized for a model by calculating the Youden index ($y_x$) [20] for the threshold values from the ROC:
$$y_x = \text{sensitivity}_x + \text{specificity}_x - 1$$
The final threshold $x_{opt}$ for a model was chosen according to the equation:
$$x_{opt} = \arg\max_{x} \, y_x$$
Once the two classifiers were optimized and successfully modelled, the proposed MLA for anemia classification was assembled and then tested.
Table 3 (fragment) Mild to moderately anemic; ≥ 11: Healthy (non-anemic)
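The threshold selection by Youden index can be sketched as below. This is a simplified illustration: it evaluates a single validation set rather than the median over 30 random splits, it assumes that a probability at or above the threshold means the positive class, and the function names and example data are hypothetical.

```python
def sens_spec(probs, labels, threshold):
    """Sensitivity and specificity when p >= threshold is called positive."""
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if y and p >= threshold)
    fn = sum(1 for p, y in pairs if y and p < threshold)
    tn = sum(1 for p, y in pairs if not y and p < threshold)
    fp = sum(1 for p, y in pairs if not y and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)  # needs both classes to be present

def youden(probs, labels, threshold):
    """Youden index: sensitivity + specificity - 1."""
    se, sp = sens_spec(probs, labels, threshold)
    return se + sp - 1.0

def best_threshold(probs, labels, n_thresholds=100):
    """Pick the threshold with the largest Youden index on an equally spaced grid."""
    grid = [k / (n_thresholds - 1) for k in range(n_thresholds)]
    return max(grid, key=lambda x: youden(probs, labels, x))

# Made-up validation data: probability estimates and true binary labels.
probs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
x_opt = best_threshold(probs, labels)
```

When several thresholds share the maximum index, `max` returns the first one on the grid; the paper does not say how such ties were resolved.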

Feature Extraction
A plot of the R values against actual hemoglobin values for all subjects (in the training set) at all wavelengths depicted better correlation at higher values of R, as shown in Fig. 4. (The ratios for R 590 differed from the other wavelengths because the gain setting for the 590 nm LED was kept higher than for the other three wavelengths, to compensate for the higher absorption of light by the finger at lower wavelengths, which resulted in a very low amplitude signal.) The higher percentile values (namely the 70th, 80th, 90th and 100th) of R were chosen for feature development.
Using these R as input, a further set of pairwise ratio of ratios for the four wavelengths at four percentiles were calculated as features. The top two pairwise ratios with maximum correlation with true Hb values in each combination of wavelength were then selected as input features to the MLA (see Table 4).
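One way to form the candidate pairwise ratio-of-ratios features is sketched below. It assumes a nearest-rank percentile definition and assumes that both wavelengths in a pair are evaluated at the same percentile; neither detail is given in the paper, and the names and data are hypothetical. The subsequent selection of the top two ratios per wavelength pair by correlation with true Hb is not shown.

```python
from itertools import combinations

def percentile(values, q):
    """Nearest-rank percentile for integer q in 1..100 (q=100 gives the maximum)."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q*n/100
    return ordered[rank - 1]

def pairwise_ratio_features(ratio_sets, percentiles=(70, 80, 90, 100)):
    """Ratio of percentile values of R for every wavelength pair and percentile."""
    feats = {}
    for wl_a, wl_b in combinations(sorted(ratio_sets), 2):
        for q in percentiles:
            feats[(wl_a, wl_b, q)] = (percentile(ratio_sets[wl_a], q)
                                      / percentile(ratio_sets[wl_b], q))
    return feats

# Made-up ratio sets R for the four wavelengths (ten beats each).
base = [float(v) for v in range(1, 11)]
R = {590: base, 660: [2 * v for v in base],
     810: [4 * v for v in base], 940: [8 * v for v in base]}
features = pairwise_ratio_features(R)
```

Under these assumptions, four wavelengths give six pairs and therefore 24 candidate features, from which a smaller set would then be selected.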

Model Selection
To select the optimal number of models constituting the ensemble, we observed the effect of the number of models on the ensemble performance. The sensitivity and specificity values did not change significantly for the Stage-1 classifier (Fig. 5A, C) but indicated improvement in sensitivity values with an increasing number of models for the Stage-2 classifier (Fig. 5B, D). The metrics stabilized at a count of eleven models, as described in Sect. 2.3.3, beyond which adding models to the ensemble had no effect.

Model Performance
The trained models within the ensembles were evaluated in terms of their sensitivity and specificity values at each stage [21,22]. The models reported a median sensitivity of 0.92 (p < 0.05) and a median specificity of 0.84 (p < 0.05) for the Stage-1 classifier. For the Stage-2 classifier, the median values of sensitivity and specificity were 0.76 (p < 0.05) and 0.74 (p < 0.05) respectively. Stage-1 sensitivity reflects the ability of the trained classifier to detect the presence of anemia, and specificity reflects the ability to detect its absence. Stage-2 sensitivity reflects the ability of the trained classifier to detect the presence of severe anemia, and specificity reflects the ability to detect the absence of severe anemia. Stage-1 metrics indicate that the trained classifier was highly sensitive and precise in detecting anemia cases (< 11 g/dL), whereas the Stage-2 classifier, although significantly sensitive, lacks precision in detecting severely anemic cases. The Positive Predictive Value (PPV) and Negative Predictive Value (NPV) indicate the tendency of the trained classifier to exhibit bias towards a particular class, thereby inflating the sensitivity or recall value of that class. The median PPV values for the Stage-1 and Stage-2 classifiers were 0.92 and 0.43, whereas the median NPV values were 0.8 and 0.9 respectively. The low PPV value for the Stage-2 classifier could be attributed to the imbalance in the distribution of hemoglobin values across the dataset. The performance of the MLA, as calculated across multiple test iterations, is illustrated in Fig. 6. Table 5 summarizes the sensitivity and specificity values of the individual models of the ensemble along with the performance of the voting scheme. The voting classifier performed better than the individual models in Stage-2, and a moderate performance improvement was observed in Stage-1.
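For reference, the four metrics discussed above follow directly from the counts of a binary confusion table. The sketch below uses made-up counts chosen only to make the arithmetic easy to follow; they are not the study's data, and the function name is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV and NPV from binary confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),  # detected positives among all positives
        "specificity": tn / (tn + fp),  # detected negatives among all negatives
        "ppv": tp / (tp + fp),          # true positives among positive calls
        "npv": tn / (tn + fn),          # true negatives among negative calls
    }

# Hypothetical counts for a balanced set of 100 positives and 100 negatives.
metrics = binary_metrics(tp=76, fp=26, tn=74, fn=24)
```

Sensitivity and specificity are computed within the true classes, whereas PPV and NPV are computed within the predicted classes, which is why the latter two shift when the class balance changes.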

Confusion Matrix
The confusion matrix, shown in Fig. 7a, reflects the accuracy of classification (healthy, mild to moderately anemic and severely anemic). The matrix was calculated taking into consideration the average performance of the MLA across all test sets. The rows of the confusion matrix represent the actual class of anemia, while the columns represent the class assigned by the MLA to the subjects.
As evidenced by the results, the risk of extreme misclassification (i.e. an error of more than one class shift) was minimal. None of the subjects in the maximum risk category i.e. severe anemia were misclassified as healthy and among all the healthy subjects across all test sets around 2% were wrongly classified as severely anemic.
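A three-way confusion matrix with rows as actual classes and columns as assigned classes, together with the rate of errors spanning more than one class step, can be sketched as follows. The class ordering by severity, the function names and the example labels are assumptions for illustration only.

```python
CLASSES = ("healthy", "mild to moderately anemic", "severely anemic")

def confusion_matrix(actual, predicted, classes=CLASSES):
    """Count subjects per (actual class row, assigned class column)."""
    index = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def extreme_error_rate(matrix):
    """Fraction of subjects misclassified by more than one class step."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    extreme = sum(matrix[i][j] for i in range(size) for j in range(size)
                  if abs(i - j) > 1)
    return extreme / total

# Made-up labels for five subjects.
actual = ["healthy", "healthy", "mild to moderately anemic",
          "severely anemic", "healthy"]
predicted = ["healthy", "severely anemic", "mild to moderately anemic",
             "mild to moderately anemic", "healthy"]
cm = confusion_matrix(actual, predicted)
rate = extreme_error_rate(cm)
```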

Four Way Anemia Classification
An attempt was also made to develop a four-way anemia classifier to better differentiate the 'mild to moderate anemia' category into 'mild anemia' and 'moderate anemia'. The middle bucket of anemic individuals (7-11 g/dL) in the three-way classifier was split into two sub-classes: 'moderate anemia' (Hb 7-9 g/dL) and 'mild anemia' (Hb 9-11 g/dL).
From the training sample, a regression model was developed for subjects with Hb values in the 7-11 g/dL range. The performance of the trained regression model was tested on the set of observations that had previously been assigned to the anemic category by the three-way classifier. The performance of the newly trained four-way classifier was assessed in a similar manner as discussed in the previous sections, and a four-way confusion matrix was computed by averaging the performance on all the test sets (Fig. 7b). The bucket for Hb level 9-11 g/dL reported a p-value of 0.01, which supports the hypothesis that the developed classifier works better than a random guess. For the bucket 7-9 g/dL, the observed sensitivity is around 0.38 and the reported p-value is 0.01.

Discussion
In this study we demonstrate the capability of a non-invasive opto-electronic device coupled with machine learning techniques for the classification of anemia in community health settings. Anemia is an easily treatable but rampant condition across emerging economies. Technology utilized in the field today includes venous blood-draw based laboratory techniques, finger-prick based point-of-care methods, and sophisticated non-invasive devices. Each has its challenges related to accessibility and/or accuracy in field settings.
By using a rapid point-of-care, handheld, battery-operated automated anemia classification device, doctors, nurses and/or community health workers can easily diagnose and treat a patient in accordance with the WHO therapeutic guidelines.
The non-invasive nature of the device facilitates increased acceptance by patients and reduces the problems of poor follow up, safety, biohazard disposal and supply chain requirements. Photoplethysmographs from an individual's finger provided enough information to a machine learning algorithm to reliably predict Hb ranges. The study described in this paper is a step in this direction.
Features derived from the PPG signals, proposed in this paper, were sufficient to develop a supervised machine learning algorithm with a three-way classification of anemia severity with accuracies that are comparable to current community based (invasive) anemia screening tools. The algorithm developed in this paper is unique in terms of the data volume and heterogeneity. It has been trained on a large cohort (n = 1516) spread over a wide range of hemoglobin values from 1.6 to 14.8 g/dL. Most studies in the past have been implemented on smaller and less diverse data sets. Kavsaoğlu et al. [16] proposed a machine learning based approach for estimating hemoglobin non-invasively using data from 33 subjects with a considerably narrow range of hemoglobin values: 10.1-17.4 g/dL. Ding et al. [23] proposed an MLA for non-invasive hemoglobin monitoring based on principal components analysis and artificial neural networks, which included 109 subjects with hemoglobin values ranging from 11 to 18 g/dL.
The area under the ROC curve of each individual model quantifies the overall ability of the model to discriminate between healthy and anemic individuals for the Stage-1 classifier, and between moderate and severe anemia for the Stage-2 classifier. The proposed machine learning based classifier could differentiate between a healthy and an anemic individual with a sensitivity of 92%, and between a moderately and a severely anemic individual with a sensitivity of 76%. Another study [18], proposing a smartphone based non-invasive monitoring setup, reports sensitivity and specificity values for a two-way classification (presence or absence of anemia) of 85.7 and 70.6 respectively, calculated from a set of 31 individuals.
Although the reported median PPV for severely anemic cases is low (43%), we believe that it is significantly influenced by the class prevalence [24], which was around 14% in the data set. The performance of the proposed model may be further enhanced by including more data from the higher end (> 14 g/dL) and lower end (< 7 g/dL) of the Hb range.
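The dependence of PPV on class prevalence follows from Bayes' rule. As a worked illustration, writing $Se$ for sensitivity, $Sp$ for specificity and $\pi$ for the prevalence of the positive class (symbols introduced here, not in the paper):

```latex
\mathrm{PPV} = \frac{Se \cdot \pi}{Se \cdot \pi + (1 - Sp)\,(1 - \pi)}

% Plugging in Se = 0.76, Sp = 0.74 and \pi = 0.14:
\mathrm{PPV} = \frac{0.76 \times 0.14}{0.76 \times 0.14 + 0.26 \times 0.86}
             = \frac{0.1064}{0.3300} \approx 0.32

% The same Se and Sp at a balanced prevalence, \pi = 0.5:
\mathrm{PPV} = \frac{0.38}{0.38 + 0.13} \approx 0.75
```

These plug-in values are illustrative only and need not equal the reported median PPV of 0.43, since that figure is a median over test iterations and the prevalence among subjects reaching the Stage-2 classifier may differ from the prevalence in the whole data set. The calculation does show, however, that a low PPV can arise from low prevalence alone, even when sensitivity and specificity are unchanged.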
For the data used in this study, the performance of the four way classification was observed to be low for the mild and moderate anemia classes. Since the clinical handling of the intermediate buckets is not significantly different [25], we propose that, a three way classification scheme might provide a more suitable and sufficient method for community health interventions.
We believe that the results from the trained machine-learning algorithm presented in this study show significant potential to be implemented in a non-invasive classification device useful for rapid and widespread hemoglobin screening for diagnosis of anemia. A rapid point-of-care, automated anemia classification device would eliminate the problems of subjective interpretation and ambient lighting conditions posed by the HCS. It could enable doctors, nurses and CHWs to prescribe on-the-spot treatment plans and aid in rapid clinical decision making, thereby strengthening healthcare facilities. The non-invasive nature of such a device would further facilitate increased acceptance by patients and reduce the problems of poor follow up, safety, biohazard disposal, and supply chain requirements.
The data in the current study was focused on the female population. The results of this study presume that the PPG signal acquired from the device architecture is of good quality, that finger motion artefacts are removable and that the finger nails are not polished or cracked. Further improvements to the study could be aimed at evaluating the effects of other confounding factors like age and gender. Also, the variations in light absorption characteristics due to differences in skin pigmentation and the impact of pregnancy status should be investigated in order to make the machine learning algorithms robust to regional and ethnic variations. Other applications in a similar direction have explored the possibility of evaluating critical health conditions non-invasively, such as non-invasive estimation of blood glucose and bilirubin.