4.1 Model Development
To develop a suitable predictive model for the early prediction of lupus disease in patients, based on the medical symptoms they exhibit, six experiments were carried out. There were two experiments for each modelling technique: the first applies the base model, while the second is further enhanced with parameter tuning. The same dataset was used for all the models. Once the models were fitted to the dataset and predictions were performed, the overall model performance and results were compiled and thoroughly analysed to identify the model that best fulfils the aim of this research. The six experiments covered three tree-based models: the Decision Tree Classifier, the Random Forest Classifier and the Extreme Gradient Boosting (XGBoost) Classifier. The base models were developed using the default parameters, while the second model for each technique had its parameters adjusted. Each model was trained on the training dataset, and the testing dataset was used for prediction in order to assess how well the model would perform on unseen data when classifying the disease.
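The overall workflow can be summarised in the minimal sketch below. The file name, the target column ("Current Status") and the 80/20 split are illustrative assumptions, not details reported in the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Hypothetical file and target column names; adjust to the actual dataset.
df = pd.read_csv("lupus.csv")
X = df.drop(columns=["Current Status"])
# XGBoost expects zero-based integer labels, so the classes are encoded.
y = LabelEncoder().fit_transform(df["Current Status"])

# Hold out a test set so every model is evaluated on unseen patients.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

base_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
}

for name, model in base_models.items():
    model.fit(X_train, y_train)          # train on the training split
    preds = model.predict(X_test)        # predict the held-out patients
    print(f"{name}: {accuracy_score(y_test, preds):.4f}")
```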
4.2 Decision Tree
The Decision Tree is a commonly used tree-based model for classification tasks. Two experiments were created for this model: a base model and a model with parameter tuning.
Experiment 1: Decision Tree Classifier
The fitted model was used on the test dataset to produce predictions, which were compared against the actual test values. Visualisation of the results showed that although there were a fair number of correct classifications, a larger number of cases were wrongly classified. Metrics such as accuracy, precision, recall and the confusion matrix were computed. The accuracy of this model is very low, at only 39.53%, which does not indicate a good fit. This could be due to the imbalance found in the target variable or to a lack of data for training the model. Looking at precision and recall, the highest precision obtained by the model is 0.556, and it goes as low as 0.000; the 0.000 is most probably due to the very small sample for "Class 5". The recall, on the other hand, ranges from a top value of 0.600 down to 0.000. Overall, the model performs poorly at classifying patients into categories.
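A hedged sketch of how these metrics can be computed, assuming the split from the pipeline sketch above; `average=None` returns one score per class, which is how per-class values such as the 0.000 precision surface.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.tree import DecisionTreeClassifier

# Base model with scikit-learn's default parameters.
dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
preds = dt.predict(X_test)

print("Accuracy:", accuracy_score(y_test, preds))
# average=None yields one value per class, exposing the 0.000 scores
# that a very small class such as "Class 5" can produce.
print("Precision:", precision_score(y_test, preds, average=None, zero_division=0))
print("Recall:   ", recall_score(y_test, preds, average=None, zero_division=0))
print(confusion_matrix(y_test, preds))
```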
The decision tree in Fig. 6 looks very complex because all the variables are considered, which makes it harder to analyse. "Out_R", "Ethnic" and "PD_mth" were identified as the most significant factors.
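A sketch of how the tree in Fig. 6 and the ranked factors can be reproduced, assuming the fitted base model `dt` from the sketch above.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import plot_tree

# Full, untuned tree -- this is why Fig. 6 is hard to read.
plt.figure(figsize=(20, 10))
plot_tree(dt, feature_names=list(X.columns), filled=True, fontsize=6)
plt.show()

# Gini importances rank the attributes; "Out_R", "Ethnic" and "PD_mth"
# came out on top in the experiment reported above.
importances = pd.Series(dt.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```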
Experiment 2: Decision Tree Classifier with Parameter Tuning
The prediction results obtained from the second model contained many instances where the model could not correctly identify the class a patient belonged to, so the model is still not predicting patients accurately.
The accuracy of this model is 58.14%, which is fairly acceptable, as higher accuracy indicates better performance. However, accuracy alone is not enough to evaluate performance, so precision and recall were computed as well. Precision indicates the proportion of predicted positives that are true positives, while recall is the proportion of actual positives that the model correctly identifies. The precision for this model ranges from a highest value of 0.600 down to 0.400, while the recall ranges from 0.920 down to 0.182. These values show that although the accuracy of a model is important, precision and recall are more informative when evaluating its performance.
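The paper does not report which parameters were tuned, so the grid in the sketch below is an assumption; a typical approach is a cross-validated grid search over the tree's depth and split settings.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumed search space -- the grid actually used is not reported.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
tuned_dt = search.best_estimator_   # the model behind Fig. 7 and the metrics above
```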
The decision tree in Fig. 7 is considerably easier to read and understand than the first visual. The tree indicates that the top and most significant variable is "Out_R", which captures the renal status of the patients.
4.3 Random Forest
Experiment 3: Random Forest Classifier
The prediction results for the first experiment with this model showed that a large portion of the predictions were incorrect, indicating that the model did not fit the data very well.
The accuracy of the model is 55.81%, which is not very high but acceptable. The precision shows that the model can correctly identify the "Current Status" of patients up to 58.5% of the time, while the recall of 96% indicates that, across the entire dataset, the model captures most of the true lupus statuses.
The random forest tree was plotted once predictions were made; however, since this is the base model and no adjustments were made to restrict the parameters taken into account, the resulting tree was very complex and proved challenging to interpret. "Ethnic" and "C_nephrotic" were found to be the top attributes.
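One way to reproduce such a plot is to draw a single estimator from the fitted forest; with default parameters each tree is grown to full depth, which explains the complexity described above.

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Inspect the first of the (by default) 100 fully grown trees.
plt.figure(figsize=(20, 10))
plot_tree(rf.estimators_[0], feature_names=list(X.columns),
          filled=True, fontsize=5)
plt.show()
```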
Experiment 4: Random Forest Classifier with Parameter Tuning
Feature importance analysis is an important step in analysing the attributes found in the dataset, as it allows selection of the attributes best suited to the model. Based on this output, six variables were used to create a new dataset to fit to the model: "D_Illness", "Ethnic", "PD_Mth", "D_Age", "Out_R" and "C_hypert".
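A sketch of this selection step, assuming the fitted base forest `rf` from the sketch above; the six column names come from the text.

```python
import pandas as pd

# Rank every attribute by its importance in the fitted forest.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Keep the six variables named above for the tuned model.
selected = ["D_Illness", "Ethnic", "PD_Mth", "D_Age", "Out_R", "C_hypert"]
X_train_sel, X_test_sel = X_train[selected], X_test[selected]
```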
The predictions show a very large difference between the predicted and actual results: all the predicted results were status "1", while the test dataset fluctuated between the different classes. This indicates that the results are not very good, as the model did not fit well. The accuracy of the model was 58.14%, which is not very different from the other models but slightly higher than the rest. The precision was about 58.1%, while the recall was 100%, which follows from every prediction being the majority class. These results show that the model correctly predicts more than half of the cases, but its perfect recall reflects the one-class predictions rather than genuine discriminative ability.
Based on the Random Forest tree in Fig. 8, "D_Illness", "Out_R" and "PD_Mth" were found to be the most important features for analysis.
4.4 Extreme Gradient Boosting (XGBoost)
Experiment 5: XGBoost Classifier
The prediction outputs show that a fair number of the results were correctly identified, but some cases were misclassified. The accuracy of the model was 48.84%, which is very low and shows that the model needs to be adjusted to fit the dataset better. The precision was 58.8% and the recall 80%, so the model's ability to correctly identify the status of the patients is reasonable despite its low accuracy.
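A minimal sketch of the base XGBoost experiment, using the library's default parameters and the zero-based labels from the encoding step shown earlier.

```python
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Base model with xgboost's default parameters.
xgb = XGBClassifier(random_state=42).fit(X_train, y_train)
print(accuracy_score(y_test, xgb.predict(X_test)))
```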
This visualisation highlights the factors most significant to the "Current Status" of patients with lupus disease. Figures 9 and 10 show the trees that were plotted from the experiment. According to the results, the most important attributes were "Out_R" and "Ethnic".
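Plots such as Figs. 9 and 10 can be produced with xgboost's built-in helpers (the tree plot requires graphviz); `num_trees` selects which boosted tree to draw.

```python
import matplotlib.pyplot as plt
import xgboost

# Draw the first boosted tree (requires the graphviz package).
xgboost.plot_tree(xgb, num_trees=0)
plt.show()

# Importance plot ranking attributes such as "Out_R" and "Ethnic".
xgboost.plot_importance(xgb)
plt.show()
```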
Experiment 6: XGBoost Classifier with Parameter Tuning
The prediction outputs for this experiment indicate that the model classified some parts of the data well but erred on others. The accuracy of the model was 55.81%, a clear improvement on the base model, but it is still not able to fully classify the "Current Status" of the patients correctly. The precision and recall of the model were 59% and 92%, respectively, indicating that the model correctly classified more than half of the predicted classes and captured most of the actual cases. The XGBoost trees were plotted for better understanding, as shown in Figs. 11 and 12 below. The most important attributes were found to be "D_Illness" and "PD_Mth".
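As with the decision tree, the tuned search space is not reported, so the grid below is an assumption about typical XGBoost tuning.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Assumed search space -- the parameters actually tuned are not reported.
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 200],
}
search = GridSearchCV(XGBClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
tuned_xgb = search.best_estimator_   # the model behind Figs. 11 and 12
```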
Table 1 below summarises the experiments and the accuracy of each model. The highest accuracy observed was 58.14%, achieved by both the Decision Tree Classifier with Parameter Tuning and the Random Forest Classifier with Parameter Tuning. Both models included hyperparameter tuning to ensure a better fit to the dataset.
Table 1: Experiments Summary and Results

| No | Experiment | Accuracy |
| --- | --- | --- |
|   | Decision Tree Classifier |   |
| 1 | Decision Tree Classifier | 39.53% |
| 2 | Decision Tree Classifier with Parameter Tuning | 58.14% |
|   | Random Forest Classifier |   |
| 3 | Random Forest Classifier | 55.81% |
| 4 | Random Forest Classifier with Parameter Tuning | 58.14% |
|   | Extreme Gradient Boosting (XGBoost) Classifier |   |
| 5 | Extreme Gradient Boosting (XGBoost) Classifier | 48.84% |
| 6 | Extreme Gradient Boosting (XGBoost) Classifier with Parameter Tuning | 55.81% |
To further narrow down which model best serves the classification and early prediction of lupus in patients, the precision and recall scores indicate that the Random Forest Classifier is slightly better. However, further analysis is needed to confirm this and to develop the best model for use by medical practitioners when making decisions in the medical field.