One of the fundamental challenges in ML is choosing a technique appropriate for the field of application. This matters even more when the data relate to healthcare, because experts use the results to decide how to diagnose or treat diseases, and those decisions directly affect an individual's health. It is therefore essential to select techniques that provide the highest possible level of confidence and correctness in the results. In this research, we selected two techniques, k-nearest neighbor (KNN) and the support vector machine (SVM), because both are widely used and provide high-quality analyses.
An SVM is a supervised ML algorithm that can be applied to both regression and classification problems. Due to its robustness, it is commonly used for classification. The algorithm first represents the data points in an n-dimensional space. Based on statistical approaches, it then determines the boundary (a line in two dimensions, a hyperplane in general) that best separates the classes in the data [18].
The K-nearest neighbor technique is a simple classification technique that groups data according to their characteristics. When a new sample is examined, it is assigned to the group of previously seen samples that it most closely resembles. Because the new sample is classified using the class information of the most similar previous data, the technique can improve the accuracy of the analysis results [19].
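The nearest-neighbor idea described above can be illustrated with a minimal sketch. It is not the implementation used in this research: the function names `hamming` and `knn_predict`, the choice of Hamming distance, the value k = 3, and the toy records are all assumptions made for illustration.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_x, train_y, sample, k=3):
    """Return the majority label among the k training samples closest to `sample`."""
    # Rank training indices by distance to the new sample
    # (the sort is stable, so ties keep their original order).
    order = sorted(range(len(train_x)), key=lambda i: hamming(train_x[i], sample))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy records: (ANOREXIA, LIVER BIG, LIVER FIRM, SPLEEN PALPABLE).
train_x = [
    ("N", "N", "N", "N"),
    ("N", "P", "N", "N"),
    ("N", "N", "P", "N"),
    ("P", "P", "P", "P"),
    ("P", "P", "N", "P"),
    ("P", "N", "P", "P"),
]
train_y = ["LIVE", "LIVE", "LIVE", "DEATH", "DEATH", "DEATH"]

prediction = knn_predict(train_x, train_y, ("P", "P", "P", "P"), k=3)
```

In this sketch, the new sample receives the label that most of its three closest training records share. With real data, the distance measure and the value of k would need to be chosen and validated.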
Hepatitis refers to inflammation and swelling of the liver, which can be caused by various factors, such as smoking, drinking alcohol, or using chemical substances. Hepatitis has several types, including the following:
- Hepatitis type A
- Hepatitis type B
- Hepatitis type C
The types of hepatitis are distinguished by the virus responsible for inflaming and swelling the liver. People can protect themselves against hepatitis types A and B through vaccination. However, no vaccine is available for hepatitis type C. A person can also contract two or more types of hepatitis simultaneously. Hepatitis type C is transmitted through blood, either directly or through tattoos, drug use, or traditional medicine, and it is considered one of the most dangerous types of hepatitis [20].
The data used in this research relate to 150 patients with hepatitis. The University of California at Irvine, a prestigious university in the United States, has made the dataset available on the Kaggle website. The dataset can be classified into different categories based on its content and includes, among other attributes, the patients' gender and whether each patient who underwent the tests survived or died. This study conducts a preliminary analysis of several influential factors to examine their impact on the survival or death of individuals with hepatitis. ML is used to predict the probability of survival or death for patients with hepatitis, and the effectiveness of the proposed techniques is assessed against defined measurement criteria. The patient information covers many factors recorded during examination. Because of the number and diversity of these factors, this study focuses on four of them, treated as the variables on which a patient's survival or death may depend.
The following four factors were selected and examined in relation to the mortality or survival of individuals with hepatitis:
- ANOREXIA
- LIVER BIG
- LIVER FIRM (fatty liver)
- PALPABLE SPLEEN
A diagram illustrating the implementation process is shown in Fig. 1.
3.1 Primary processing
ML techniques require formal, standardized data structures to function correctly and optimally, and the data on hepatitis patients are no exception. The data must have a structure appropriate to the techniques used in the ML process, so the initial raw data need to be modified.
3.1.1 Standardization of patient information
The ML techniques selected for this research are supervised, classification-based techniques, so the information on hepatitis patients must be organized in rows and columns. Because the raw information already had a basic structure, only a small amount of modification was needed to convert it into this standard format.
3.1.2 Convert the results into P and N
The patients' information relates to the tests they underwent. After the data are converted to the standard form, the values of some features need to be changed. The test results are recorded as the numbers 1 and 2. According to the explanations accompanying the patient data, a value of 1 for a variable indicates that the patient's test result for that variable was negative, and a value of 2 indicates that it was positive. To improve the performance of the ML techniques and to make the results easier to read and understand, we replaced the values 2 and 1 with P and N, which represent positive and negative, respectively.
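The recoding described above can be sketched as follows. The mapping of 1 to N and 2 to P comes from the text; the helper names `recode` and `recode_record`, the record layout, and the use of `None` for a missing value are assumptions made for illustration.

```python
# Mapping described in the text: 1 -> negative (N), 2 -> positive (P).
CODE_TO_LABEL = {1: "N", 2: "P"}

def recode(value):
    """Map a numeric test result to 'P' or 'N'; leave any other value (e.g. a missing one) unchanged."""
    return CODE_TO_LABEL.get(value, value)

def recode_record(record, columns):
    """Return a copy of `record` in which only the listed columns are recoded."""
    return {key: (recode(val) if key in columns else val) for key, val in record.items()}

FEATURES = ["ANOREXIA", "LIVER BIG", "LIVER FIRM", "SPLEEN PALPABLE"]

# Hypothetical patient record; None marks a missing value.
raw = {"ANOREXIA": 2, "LIVER BIG": 1, "LIVER FIRM": None, "SPLEEN PALPABLE": 2}
converted = recode_record(raw, FEATURES)
```

Leaving unrecognized values untouched keeps missing entries visible so that they can be handled in a later step.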
3.2 Feature selection and data preparation
The available patient information covers many types of examinations. Because of the large number of tests and clinical parameters, some of these parameters must be selected as the features for analysis before the ML techniques are implemented.
3.2.1 Features selection
The dataset of hepatitis patients contains a total of 20 clinical parameters. These parameters are not all of equal importance, and examining all of them is outside the scope of this research. This research therefore selects only four critical clinical parameters and uses ML techniques to analyze their effect on the death or survival of patients. These four parameters, which can also be called factors or features, were chosen because they have the most significant impact on the treatment process of the affected people in terms of their importance, frequency, and effectiveness. Table 1 lists the selected parameters.
Table 1
| Properties | Values |
|---|---|
| CLASS | LIVE, Death |
| ANOREXIA | P (Positive), N (Negative) |
| LIVER BIG | P (Positive), N (Negative) |
| LIVER FIRM | P (Positive), N (Negative) |
| SPLEEN PALPABLE | P (Positive), N (Negative) |
3.2.2 Replacement of missing values
Some fields of the features selected as main variables have no values, and these missing values can affect the results and analyses produced by the ML techniques. Such empty fields can be filled in to reduce this effect. Many methods have been proposed for this purpose; this study uses the averaging method, in which each empty field is filled with the average of the available values in the column of the corresponding feature. Table 2 shows the number of missing values in each selected feature.
Table 2
Missing values for selected features
| Features | Missing Values |
|---|---|
| SPLEEN PALPABLE | 7 |
| LIVER FIRM | 14 |
| ANOREXIA | 13 |
| LIVER BIG | 9 |
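The averaging method can be sketched as follows. This sketch assumes that the averaging is applied to the numeric codes 1 and 2 and that a missing field is represented by `None`. The function name `impute_mean` and the example column are hypothetical. How a non-integer average is mapped back to P or N is not stated here, so the sketch leaves the average as it is.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values in the column."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

# Hypothetical LIVER FIRM column with two missing fields.
liver_firm = [1, 2, None, 2, 2, None]
filled = impute_mean(liver_firm)
```

Only the observed values contribute to the average, so the filled-in fields do not change the column mean.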
3.3 Modeling process
After the data preparation stage, data modeling is the main part of the ML implementation process. In this stage, ML techniques are applied to data whose structure is appropriate and standard. Modeling involves acquiring knowledge from existing data and using this knowledge to build a model. Implementing ML techniques therefore always involves two components: a part of the data used to train the techniques and a part used to make predictions based on what was learned during training. In general, the more a technique learns about the dataset, the more accurate and reliable its predictions will be.
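The division of the data into a training part and a prediction part can be sketched as follows. The split ratio, the random seed, and the function name `split_records` are assumptions made for illustration; the proportions used in this research are not stated here.

```python
import random

def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical record identifiers standing in for patient rows.
records = list(range(10))
train, test = split_records(records, test_fraction=0.3, seed=0)
```

Shuffling before splitting avoids a test part that reflects only the order in which the records were stored.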
3.3.1 Selection and implementation of ML techniques
Because many ML techniques are available, techniques appropriate for the target must be selected before they are applied to the dataset. In turn, the data must be prepared according to the selected techniques. In this research, the selected techniques are supervised, classification-based techniques, so the data should be organized to suit these two techniques.
3.3.2 Model creation
After the data are prepared and the appropriate techniques are chosen, the selected techniques are applied to the data to build the model. This is the final stage of the ML implementation process, in which an ML model is obtained by applying the techniques to the prepared data. The analysis of the obtained results consists of two parts. The first part predicts the impact of the selected characteristics on the mortality of hepatitis patients. The second part evaluates the efficiency of the ML techniques and identifies the more efficient one. Evaluation criteria are used to assess the techniques' performance; this research uses accuracy (ACC), error rate (ERR), specificity (SPE), and negative predictive value (NPV) to assess the performance of the two techniques, KNN and SVM.
3.3.2.1 Specificity
Specificity is an essential criterion for assessing the performance of ML techniques. The SPE criterion is calculated by dividing the number of true negative cases, that is, negative cases whose class the model predicted correctly, by the total number of false positive and true negative cases. This evaluation criterion is calculated according to Eq. 1 [21].
$$\text{SPE} = \frac{\text{TN}}{\text{FP} + \text{TN}} \tag{1}$$
3.3.2.2 Accuracy
Accuracy is among the most essential and commonly used criteria for evaluating a model. The ACC criterion is calculated by dividing the number of cases whose class the model predicted correctly by the total number of cases predicted. This evaluation criterion is calculated using Eq. 2 [22].
$$\text{ACC} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FN} + \text{FP} + \text{TN}} \tag{2}$$
3.3.2.3 Negative predictive value
NPV, or negative predictive value, is a widely known evaluation criterion. It is the ratio of true negative cases to the total of true negative and false negative cases, that is, to all cases predicted as negative. This evaluation criterion is calculated according to Eq. 3 [23].
$$\text{NPV} = \frac{\text{TN}}{\text{FN} + \text{TN}} \tag{3}$$
3.3.2.4 Error rate
The error rate is the complement of the accuracy criterion. The ERR criterion is calculated by dividing the number of cases whose classes are incorrectly predicted by the total number of cases predicted by the model. Equivalently, it is obtained by subtracting the accuracy value from 1. This evaluation criterion is calculated using Eq. 4 or, equivalently, Eq. 5 [24].
$$\text{ERR} = \frac{\text{FN} + \text{FP}}{\text{TP} + \text{FN} + \text{FP} + \text{TN}} \tag{4}$$
$$\text{ERR} = 1 - \text{ACC} \tag{5}$$
The components of Eqs. 1 through 4 are counts of the predictions that the models created by the techniques make on the samples. They are defined as follows:
- TP (true positive): positive samples correctly identified as positive.
- TN (true negative): negative samples correctly identified as negative.
- FP (false positive): negative samples incorrectly identified as positive.
- FN (false negative): positive samples incorrectly identified as negative.
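Eqs. 1 through 5 can be checked with a short sketch that counts TP, TN, FP, and FN from actual and predicted labels. Treating DEATH as the positive class is an assumption made only for this example, as are the function names and the toy labels.

```python
def confusion_counts(actual, predicted, positive):
    """Count (TP, TN, FP, FN) for the given positive class label."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # positive sample identified as positive
            else:
                fp += 1  # negative sample identified as positive
        elif a == positive:
            fn += 1      # positive sample identified as negative
        else:
            tn += 1      # negative sample identified as negative
    return tp, tn, fp, fn

def evaluate(tp, tn, fp, fn):
    """Compute SPE, ACC, NPV and ERR (Eqs. 1-4); assumes non-zero denominators."""
    total = tp + fn + fp + tn
    return {
        "SPE": tn / (fp + tn),
        "ACC": (tp + tn) / total,
        "NPV": tn / (fn + tn),
        "ERR": (fn + fp) / total,
    }

# Hypothetical labels; DEATH is assumed to be the positive class.
actual    = ["DEATH", "DEATH", "DEATH", "LIVE", "LIVE", "LIVE", "LIVE", "LIVE"]
predicted = ["DEATH", "DEATH", "LIVE", "DEATH", "LIVE", "LIVE", "LIVE", "LIVE"]

counts = confusion_counts(actual, predicted, positive="DEATH")
scores = evaluate(*counts)
```

For these toy labels the counts are TP = 2, TN = 4, FP = 1, and FN = 1, which gives SPE = 0.8, ACC = 0.75, NPV = 0.8, and ERR = 0.25, consistent with ERR = 1 - ACC in Eq. 5.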
3.4 Results analysis
The purpose of this research is to assess the effect of the selected characteristics on the mortality of people with hepatitis, so each characteristic is analyzed individually. In addition, evaluation criteria are used to examine the performance of the ML techniques and to identify the technique with superior performance. The obtained results are analyzed at this stage.