Prediction of Medical Waste Utilizing Ensemble Machine Learning Algorithms: A Case from Turkey


Accurately predicting medical waste (MW) generation is an important task for an effective Waste Management System (WMS). The main aim of this study was to compare three ensemble machine learning algorithms for predicting medical waste in İstanbul, the biggest city in Turkey. Ensemble machine learning algorithms are a class of machine learning (ML) algorithms that have shown significant success in other disciplines but have not yet been examined for MW. To bridge this gap in the literature, in this study, for the first time, three ensemble machine learning algorithms, Random Forests (RF), Gradient Boosting Machine (GBM), and AdaBoost, were developed to predict MW generation. Seventeen years of real data were obtained from the İstanbul Metropolitan Municipality Open Data Portal, with the input variables being the number of hospitals, the number of beds available at the hospitals, the crude birth rate, and the Gross Domestic Product (GDP). 80% of the total database was used for developing the models, while the remaining 20% was used to validate them. To compare their performances, 5-fold cross-validation was applied and performance measures (MAE, RMSE, and R²) were calculated. Of the ensemble models, the RF model performed best, with RMSE, MAE, and R² of 1194.2, 898.12, and 0.95, respectively, while the second-best GBM achieved RMSE, MAE, and R² of 1290.76, 1160.43, and 0.94, respectively. AdaBoost, although interpreted as an efficient model for small datasets among ML algorithms, had the poorest accuracy, with RMSE, MAE, and R² of 3349.57, 2698.4, and 0.61. In addition, the findings revealed that GDP and the number of hospitals were the most important inputs for predicting MW generation using ensemble machine learning algorithms.
These results will be helpful to decision makers in both planning and designing medical waste management systems for future facilities, in the sense of sustainable management.

the number of input variables requires more complex modelling. To model the non-linear relationship between the input and output variables, some studies compared conventional statistical techniques such as MLR with machine learning algorithms such as ANN, SVM, and several neuron- and kernel-based machine learning methods (Jahandideh et al. 2009; Karpušenkaitė et al. 2016; Thakur and Ramesh 2018; Golbaz et al. 2019). These studies showed that machine learning algorithms gave better results than statistical techniques, a finding attributed to their ability to capture the non-linear relationship between input and output variables (Karpušenkaitė et al. 2016). However, these methods did not address the significant input variables, which would underline their performance in predicting the medical waste production rate. Some studies utilized time series modeling, such as different autoregressive integrated moving average (ARIMA) models, to predict MW generation when time-based MW amount data were available (Chauhan and Singh 2017; Ceylan et al. 2020). However, MW prediction is a regression problem rather than a time series problem because of the limitations of time series analysis: it requires long historical data to capture seasonality; the data may contain missing values and outliers; and many external factors affect MW generation (Pavlyshenko 2019). Some studies revealed that traditional algorithms and machine-learning-based algorithms for time series data are equally competitive for prediction problems (Papacharalampous et al. 2018; Pavlyshenko 2019). Ensemble machine learning methods such as Random Forest and Gradient Boosting Machine can detect patterns in a time series even with small data (Pavlyshenko 2019). These previous studies show that ML can be utilized for predicting MW generation, given its higher flexibility and its ability to detect patterns, trends, and fluctuations more accurately than conventional regression analysis (Nguyen 2021). Also, most studies on predicting MW generation did not determine the most significant input variables, which would be useful information for an effective medical waste management system. On the other hand, the lack of a historical MW database, especially in developing countries, may cause difficulties in understanding the current situation and forecasting medical waste generation (Nguyen et al.).

ML algorithms are divided into two categories: supervised and unsupervised (Gutierrez 2020). Supervised learning algorithms discover relationships between potential explanatory features and a known target outcome, and are divided into two categories, namely classification and regression. In this study, the ensemble methods Random Forests, Gradient Boosting Machine, and AdaBoost were applied to predict MW generation.

2.1 Ensemble Methods
Ensemble methods are machine learning algorithms that build and combine several base models to solve classification problems, regression problems, and feature selection with excellent performance (Li and Chen 2020). Based on the base learner generation process, they come in two types: a parallel type, represented by Bagging, and a sequential type, represented by Boosting (Li and Chen 2020). In the parallel type, multiple base learners are constructed simultaneously, independently of each other, which improves the performance of the final model; in the sequential type, multiple learners are constructed in sequence, so the model improves because the next learners can avoid the errors of the previous learners.
Stage 1: The bootstrap statistical technique is used to randomly sample from the initial data set (the training data) to create a sequence of sub-datasets; regression trees are then built on these sub-datasets to form the forest. Each tree is trained on a randomly chosen set of variables, and two important parameters, the number of trees (ntree) and the number of variables considered at each split (mtry), can be adjusted during the training stage. Stage 2: A prediction can be made after the model has been trained. The input variables are first evaluated by all regression trees, and the final output is calculated as the average of each individual tree's prediction (Ahmad et al. 2021).
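The two stages above can be sketched with scikit-learn, where n_estimators corresponds to ntree and max_features to mtry. This is a minimal illustration on synthetic data (the study's İstanbul dataset is not reproduced here), not the study's actual configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the four-input MW dataset (illustrative only).
X, y = make_regression(n_samples=17, n_features=4, noise=0.1, random_state=42)

# Stage 1: bootstrap sub-samples plus random feature selection at each split.
# n_estimators plays the role of "ntree", max_features the role of "mtry".
rf = RandomForestRegressor(n_estimators=100, max_features=2, random_state=42)
rf.fit(X, y)

# Stage 2: the final output is the average of the individual trees' predictions.
pred = rf.predict(X[:1])
```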

Boosting
The basic idea is that a weak learner is first constructed on the training set, with each sample assigned a weight based on how well it is handled: correctly handled samples receive relatively small weights, while poorly handled samples receive large weights. Boosting is an iterative process that concentrates on samples with large weights by adding weak learners to the previous ones, readjusting the data weights at each step, until a final strong learner is obtained. Boosting fits an ensemble model of the form

F(x) = f_0 + \sum_{m=1}^{M} \rho_m \phi_m(x)

where f_0 represents the initial guess, \phi_m(x) denotes the base estimator at iteration m, and \rho_m is the weight for the m-th estimator. The product \rho_m \phi_m(x) is the "step" at iteration m. Most boosting algorithms can be viewed as solving

(\rho_m, \phi_m) = \arg\min_{\rho, \phi} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \rho\, \phi(x_i)\big)

at each iteration, where F_{m-1} represents the current estimate.
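The sample-reweighting scheme described above is what AdaBoost implements. A minimal sketch with scikit-learn follows, on synthetic data (not the study's dataset); note that AdaBoostRegressor's default weak learner is a depth-3 decision tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

# Synthetic regression data standing in for the MW inputs (illustrative only).
X, y = make_regression(n_samples=100, n_features=4, noise=0.1, random_state=0)

# Each iteration reweights the samples the previous weak learners handled
# poorly, then fits the next weak learner (a shallow tree by default).
ada = AdaBoostRegressor(n_estimators=50, random_state=0)
ada.fit(X, y)

# estimator_weights_ holds the per-learner weights of the final combination.
weights = ada.estimator_weights_
```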

AdaBoost
Adaptive Boosting (AdaBoost) is a well-known Boosting algorithm that arises when the loss function is of exponential type, with the weights and classifiers derived by means of forward stage-wise additive modeling. It was proposed by Freund and Schapire (Freund and Schapire 1996).

Gradient Boosting Machine

Step 1: The initial constant value is obtained as

F_0(x) = \arg\min_{\rho} \sum_{i=1}^{n} L(y_i, \rho)

Step 2: The negative gradient of the loss function (the pseudo-residual) is written as

r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}} (4)

Step 3: The base model h(x; a) is formed by fitting the sample data; its parameter a_m is calculated using the least squares method:

a_m = \arg\min_{a, \beta} \sum_{i=1}^{n} \left[ r_{im} - \beta\, h(x_i; a) \right]^2

Step 4: The new weight of the model is obtained by minimizing the loss function:

\rho_m = \arg\min_{\rho} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \rho\, h(x_i; a_m)\big)

Step 5: The model is updated as

F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)
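The stage-wise additive fitting above can be observed directly with scikit-learn's GradientBoostingRegressor: staged_predict yields the ensemble's prediction F_m(x) after each iteration m, so the training error should shrink as learners accumulate. This is a sketch on synthetic data, not the study's model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data standing in for the MW inputs (illustrative only).
X, y = make_regression(n_samples=100, n_features=4, noise=0.1, random_state=0)

# Each boosting stage adds one weighted base tree to the running model F_m.
gbm = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1, random_state=0)
gbm.fit(X, y)

# Training MSE after each stage: the additive steps drive the error down.
stage_mse = [np.mean((y - pred) ** 2) for pred in gbm.staged_predict(X)]
```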

3. PREDICTION MODEL
To compare the three machine learning algorithms for medical waste prediction, the experiment consists of five steps: data acquisition, data pre-processing, applying the ensemble methods, hyperparameter tuning, and evaluation. The prediction model experiment procedure is given in Figure 3. All analyses were performed using Python 3.8.6 in PyCharm, based on the scikit-learn and XGBoost libraries. The Python function random.seed() was used to ensure the reproducibility of the data-splitting process.

Fig. 3 The process of the prediction model

MW is the output variable (i.e., the dependent or target variable), while the other variables are the inputs (i.e., the independent variables). Boxplots of the variables are given in Figure 5 to present the distribution of each variable and to detect outliers (Schwertman et al. 2004). Outliers are abnormal observations in a dataset that can distort accuracy scores and skew performance measurements, so that the results do not reflect the true picture. The presence of outliers suggests either pre-processing of the data to improve accuracy or the use of more robust methods. It appears that, except for CBR, no variable has outlier observations.

Fig. 5 Boxplots for input and output variables used to develop the models

Data Pre-processing
Data pre-processing is a crucial step in machine learning modelling, since real-world data is usually incomplete, inconsistent, inaccurate (containing errors or outliers), and often lacking in specific attribute values/trends (Ramírez-Gallego et al. 2017). Analyzing such data can produce misleading results, so it is extremely important to pre-process the data before feeding it into a model (Nguyen et al. 2021). The five-fold CV method utilized in this study divides the original training set into five equal-size subsamples. One part is chosen as the validation set, and the remaining four subsamples are used as the training subset. This process is repeated five times, until each subsample has been used as a validation set. The average accuracy over the five validation sets is then used to assign the optimal hyperparameter values. The procedure is shown in Figure 6. The data set is divided into two groups: a training set (80% of the samples) used for training the model, and a testing set (20% of the samples) used for evaluating it.
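The 80/20 split and five-fold cross-validation described above can be sketched as follows with scikit-learn, on a synthetic 17-sample dataset mirroring the study's size. The fixed random_state plays the role of the random.seed() call mentioned earlier, making the division reproducible.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the 17-year, four-input MW dataset (illustrative only).
X, y = make_regression(n_samples=17, n_features=4, noise=0.1, random_state=7)

# 80% training set, 20% testing set; the seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# Five-fold CV on the training set: each subsample serves once as validation.
cv = KFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(
    RandomForestRegressor(random_state=7), X_train, y_train, cv=cv
)
```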

Hyperparameter Optimization
where n is the number of observations, ŷ_i is the predicted waste generation for the i-th observation, y_i is the actual value for the i-th observation, and ȳ is the average of the observed values. Models with high R² values are preferred over models with low R² values; the closer R² is to 1, the better the fit of the model.
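The three performance measures can be computed directly from their definitions. The actual/predicted values below are toy numbers for illustration only, not the study's data.

```python
import numpy as np

# Toy actual and predicted values (illustrative only).
y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred = np.array([11.0, 12.5, 13.0, 16.5, 17.0])

mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
```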

Overall Prediction Results
The prediction results of the RF, GBM, and AdaBoost ensemble algorithms were obtained on the test set. The performance measures MAE, RMSE, and R² were calculated for each algorithm, as presented in Table 3 and Figure 7. As seen in Figure 8 and Figure 9, the same performance order is achieved with respect to R² and RMSE, with values of 0.95, 0.94, 0.61 and 1194.29, 1290.76, 3349.57, respectively. According to these results, RF and GBM performed well in predicting MW. On the whole, considering all performance measures, RF outperforms the other machine learning algorithms GBM and AdaBoost, with the lowest MAE and RMSE and the highest R² scores. In the literature, ANN has provided more accurate results than linear regression and random forest models. These results indicate that, even with a small dataset, an adequate prediction model can be developed using ensemble machine learning algorithms.

Feature Importance of input variables
The relative importance of the input variables can be valuable information for an effective medical waste management system. In this study, the importance of each input variable was obtained from the Random Forests (RF), Gradient Boosting Machine (GBM), and AdaBoost algorithms, as shown in Figure 10. For the Random Forests (RF) algorithm, the rank of importance was NB > GDP > NH > CBR. Using the GBM algorithm, the rank of importance was GDP > NH > CBR > NB. According to the AdaBoost algorithm, the rank of importance was NH > NB > GDP > CBR.
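Such importance rankings can be read off a fitted ensemble's impurity-based importances in scikit-learn. A minimal sketch follows; the feature names stand in for the study's inputs, and the data is synthetic, so the resulting ranking is purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-ins for the study's inputs: NH, NB, CBR, GDP.
feature_names = ["NH", "NB", "CBR", "GDP"]
X, y = make_regression(n_samples=17, n_features=4, noise=0.1, random_state=1)

rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)

# Impurity-based importances are normalized to sum to 1;
# sorting them yields the importance rank of the variables.
ranking = sorted(
    zip(feature_names, rf.feature_importances_),
    key=lambda t: t[1],
    reverse=True,
)
```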
Fig. 10 The feature importance for each ensemble machine learning algorithm

A comparative analysis of each variable across the algorithms is shown in Figure 11. For instance, the number of beds is the most significant factor in the RF algorithm but the least significant variable in the GBM algorithm. The number of hospitals has nearly the same importance degree across the different algorithms in this study.

Fig. 11 Importance degree of each variable with respect to the algorithms

Fig. 12 Total importance degree for each variable

Overall, according to the total importance degree of each variable, shown in Fig. 12, GDP is the most influential factor affecting MW generation. Daskalopoulos et al. (1998) and Dissanayaka and Vasanthapriyan (2019) concluded that GDP is highly correlated with MSW, and this study concluded that a high correlation also holds between GDP and MW. There is a direct relationship between an increase in the GDP or wealth of a city and an increase in waste generation, including medical waste generation. NH and NB are the second and third most important factors in this study for predicting MW generation, consistent with previous results (Golbaz et al. 2019; Bdour et al. 2020). Obviously, NH and NB affect the production rate of infectious waste. CBR contributed least to the predictive models.

There are some limitations in this study. First, the dataset is relatively small. Even for İstanbul, the biggest city in Turkey, the data on MW generation is still incomplete with respect to other factors, such as socio-economic conditions, health institution type, and medical waste type, that may affect MW generation as well as an effective waste management system. Although ensemble machine learning algorithms work well with small datasets, prediction performance would improve with a larger dataset.
This is the main limitation on the prediction of MW generation. In the future, the adequacy of these algorithms for prediction could be confirmed with a larger database and with additional input variables, such as the number of doctors and the type of medical institution, which may significantly affect MW generation. The second limitation of this study is the use of only one hyperparameter tuning method, grid search. Other hyperparameter tuning approaches, such as random search and Bayesian hyperparameter optimization, can be utilized in future research.
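The grid search tuning mentioned above can be sketched with scikit-learn's GridSearchCV, combining the five-fold CV with an exhaustive search over a parameter grid. The grid below is a hypothetical example; the study's actual search space is not reproduced here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the MW inputs (illustrative only).
X, y = make_regression(n_samples=60, n_features=4, noise=0.1, random_state=3)

# Hypothetical grid over the RF parameters discussed earlier (ntree, mtry).
param_grid = {"n_estimators": [50, 100], "max_features": [2, 4]}

# Every grid point is scored with five-fold CV; MAE as selection criterion.
search = GridSearchCV(
    RandomForestRegressor(random_state=3),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
best = search.best_params_
```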
The results of this study will help decision makers and practitioners establish an efficient medical waste management system, for example by selecting suitable algorithms for accurate estimation of MW, and provide further insight into the most significant input variables. With this study, it is hoped that the use of machine learning algorithms in medical waste estimation models will increase.