3.2. Features
The data set in the file named “dermatology.data” consists of 366 records of 34 comma-separated features. Features 1–11 and 34 contain the values of the clinical findings, and features 12–33 contain the values of the histopathological findings. Except for the Age feature (number 34), all features between 1 and 33 take values between 0 and 3 (Family history, feature 11, takes only 0 or 1). The Age feature has 8 unknown values and 358 values ranging from 0 to 75. The patient counts are: Psoriasis 112, Seborrheic dermatitis 61, Lichen planus 72, Pityriasis rosea 49, Chronic dermatitis 52, and Pityriasis rubra pilaris 20, for a total of 366 patients in the dataset.
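For reference, a minimal sketch of how this file can be loaded with pandas. The column names are illustrative (the raw file has no header row), and the "?" marker for the unknown Age values is an assumption based on the dataset documentation:

```python
import pandas as pd

# 33 feature columns, then Age (feature 34) and the disease class
columns = [f"f{i}" for i in range(1, 34)] + ["age", "class"]

# "?" is assumed to mark the 8 unknown Age values in the raw file
df = pd.read_csv("dermatology.data", header=None, names=columns, na_values="?")

print(df.shape)                    # expected: (366, 35)
print(df["class"].value_counts())  # patient counts per disease class
```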
The clinical and histopathological features are explained below.
Clinical features
(with values 0, 1, 2, 3)
1: Erythema (The severity of erythema in wounds)
2: Scaling (Squames, dandruff-like peeling of the skin; the amount of dandruff in the lesions)
3: Definite borders (Whether the wounds are sharply circumscribed)
4: Itching (Intensity of itching in wounds)
5: Koebner phenomenon (Limited manifestation of dermatological disease in the area of stimulation as a result of traumatic stimulation of the skin (Rifaioğlu et al., 2014))
6: Polygonal papules (Multi-edged, raised lesions on the skin, less than 1 cm in diameter)
7: Follicular papules (Swellings less than 1 cm in height, distributed at equal distances from each other)
8: Oral mucosal involvement (Lesions formation in the oral mucosa)
9: Knee and elbow involvement (Lesions formation on knees and elbows)
10: Scalp involvement (Lesions formation on the scalp)
11: Family history, (0–1) (Whether there is a family history)
34: Age (takes linear values)
Histopathological features
These are findings obtained from biopsies taken from patients (values are in the range 0–3).
12: Melanin incontinence (Brown granules that appear on the skin under the epidermis layer)
13: Eosinophils in the infiltrate (An increase in a type of white blood cell)
14: PNL infiltrate : Polymorphonuclear leukocyte infiltration; the migration of neutrophils to the disease site, i.e., an increase in white blood cells indicating inflammation.
15: Fibrosis of the papillary dermis : Accumulation of new fibrotic material (collagen) due to disease in the papillary dermis layer of the skin.
16: Exocytosis : Accumulation of white blood cells towards the epidermis.
17: Acanthosis : Thickening of the epidermis layer.
18: Hyperkeratosis : Thickening of the keratin layer.
19: Parakeratosis : Nuclear cell formation in the keratin layer.
20: Clubbing of the rete ridges : Clubbing of the ridges of the rete.
21: Elongation of the rete ridges : Elongation of the ridges of the rete.
22: Thinning of the suprapapillary epidermis : Thinning of the epidermis over the papillary dermis.
23: Spongiform pustule : Spongy vesicles (pustules) filled with pus (neutrophils).
24: Munro microabscess : Small vesicles filled with neutrophils in the epidermis.
25: Focal hypergranulosis : Focal thickening of the granular layer of the epidermis.
26: Disappearance of the granular layer : Disappearance of the granular layer of the epidermis.
27: Vacuolisation and damage of basal layer : Formation of spongy cavities as a result of damage to the basal layer.
28: Spongiosis : Edema between epidermis cells.
29: Saw-tooth appearance of retes : Formation of rete ridges in a sawtooth appearance.
30: Follicular horn plug : Formation of plugs in hair follicles.
31: Perifollicular parakeratosis : Presence of nucleated cells around the hair follicle in the corneum layer.
32: Inflammatory mononuclear infiltrate : Migration of mononuclear inflammatory cells.
33: Band-like infiltrate : Migration of white blood cells in band appearance.
3.3. Classification of Dataset Features
In order to determine which disease class is associated with the clinical and histopathological features in the data set, the 0–3 values were first expressed as ratios. All features except the Age and Class features were scaled. As a result of this process, the ratio of each feature to its disease class was determined as a value between 0 and 3. The classification ratios of each feature were determined graphically, and these values are shown separately in the table. While some features are seen in more than one disease, others were found to be specific to a single disease.
All the features in the data set were charted separately, and the mean values of each feature were calculated per disease. Values 1 to 6 on the x-axis of the figures represent the diseases 1-Psoriasis, 2-Seborrheic dermatitis, 3-Lichen planus, 4-Pityriasis rosea, 5-Chronic dermatitis and 6-Pityriasis rubra pilaris, respectively. The numbers in the range 0–3 on the y-axis show the mean value of the related feature for each disease in the data set.
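These per-disease means can be computed in a few lines of pandas; a sketch continuing the illustrative column names from the loading example:

```python
# Mean value of every feature per disease class (rows: classes 1-6)
class_means = df.groupby("class").mean()

# e.g. the per-disease means of the Erythema feature (feature 1),
# which underlie Figure 1
print(class_means["f1"])
```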
Figure 1 shows the mean values of the Erythema feature per disease. In 1-Psoriasis and 2-Seborrheic dermatitis, Erythema has the highest incidence, with a value of 2.3 on the 0–3 scale. With a value of 1.5, Erythema has the lowest incidence in 5-Chronic dermatitis.
In Fig. 2, the Definite borders feature is seen at the highest rate in 1-Psoriasis and 3-Lichen planus, and at the lowest rate of 0.8 in 5-Chronic dermatitis.
In Fig. 3, the Itching feature is seen at the highest rate in 3-Lichen planus, and at the lowest rate of 0.5 in 4-Pityriasis rosea and 6-Pityriasis rubra pilaris.
In Fig. 4, the Koebner phenomenon feature occurs in 1-Psoriasis, 3-Lichen planus and 4-Pityriasis rosea, while it is absent (value zero) in the other diseases, 2-Seborrheic dermatitis, 5-Chronic dermatitis and 6-Pityriasis rubra pilaris.
As shown in Fig. 5, the Band-like infiltrate feature is seen only in 3-Lichen planus, with a rate of 2.7 in this disease.
As shown in Fig. 6, the Clubbing of the rete ridges feature is seen only in 1-Psoriasis, with a rate of 2.1 in this disease.
As shown in Fig. 7, the Perifollicular parakeratosis feature is seen only in 6-Pityriasis rubra pilaris, with a rate of 2.0 in this disease.
As shown in Fig. 8, the Fibrosis of the papillary dermis feature is seen only in 5-Chronic dermatitis, with a rate of 2.3 in this disease.
Figure 9 shows the age distribution of the Age feature per disease. While patients between the ages of 0 and 40 are observed in 1-Psoriasis, 2-Seborrheic dermatitis, 3-Lichen planus, 4-Pityriasis rosea and 5-Chronic dermatitis, the ages of the patients with 6-Pityriasis rubra pilaris lie in the 0–10 range.
In Fig. 10, a statistical graph of the patients' ages was drawn instead of their mean values. Thanks to this graph, a 22-year-old patient, detected as an outlier outside the 7–16 age range of 6-Pityriasis rubra pilaris, is shown as a dot in the 6th column.
Table 1 : An outlier value detected in the Koebner phenomenon feature of Class 2 (Seborrheic dermatitis)
When the conditions seen in seborrheic dermatitis are examined in the introduction of our article, it is understood that the Koebner phenomenon should not be seen in this disease. When the dataset is examined, however, an outlier value of "2" appears in the Koebner phenomenon feature of the Seborrheic dermatitis records; Table 1 marks this outlier value in red.
In the dataset, Class 2 denotes Seborrheic dermatitis and has 61 patient records. The Koebner phenomenon feature has sixty "0" values and only one "2" value. This single outlier affects the machine learning classification scores.
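This single outlier can be verified directly; a sketch using the hypothetical column names from the loading example (the Koebner phenomenon is feature 5):

```python
# Value counts of the Koebner phenomenon feature (f5) within class 2
koebner_class2 = df.loc[df["class"] == 2, "f5"].value_counts()
print(koebner_class2)  # expected: 60 records with value 0, one record with value 2
```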
The presence of such outliers in the dataset reduces the classification performance of machine learning algorithms. For this reason, we aimed to develop an algorithm that deletes outliers incompatible with the data density from the data set.
The mathematical and logical algorithm stated below was applied to detect and delete the outlier data.
The letter \(C\) represents the class feature of the database; its index \(i\) represents the disease class, in the range 1–6.
The letter \(F\) represents the database features 1 to 33; its index \(j\) represents the feature number, in the range 1–33. The Age and Class features are not members of \(F\). Eq. 1 is as follows:
\(\left\{\left({C}_{1}{F}_{1}\right), \dots , \left({C}_{i}{F}_{j}\right)\right\}, \quad i=1,\dots ,6,\; j=1,\dots ,33\)  (1)
Since the number of elements belonging to each class differs, the letter \(n\) represents the number of elements of the active class. The mean values of all the features belonging to class \({C}_{i}\) are calculated separately; the mean of each feature of the relevant class is given by Eq. 2.
\(\text{mean}\,{C}_{i}{F}_{j}= \frac{\sum_{k=1}^{n}{F}_{j}}{n}\)  (2)
The values found in \({C}_{i}{F}_{j}\) are indexed, and the number of occurrences of each indexed value in the corresponding feature is counted. As shown in Eq. 3, the letter \(v\) denotes the indexed values of the active feature of the active class, and the letter \(q\) denotes the count of each indexed value, computed separately for each feature.
\(\text{count}\left(\forall {C}_{i}{F}_{j},\, \forall {v}_{k}\right)=\left\{\begin{array}{c}{v}_{1}\to {q}_{1}\\ \vdots \\ {v}_{n}\to {q}_{n}\end{array}\right.\)  (3)
The letter \(x\) represents the maximum of the \(q\) values and gives the most repeated value of the active feature. Eq. 4 is as follows:
\(x=\max\left({q}_{1}, \dots , {q}_{n}\right)\)  (4)
The letter \(y\) represents the minimum of the \(q\) values and gives the least repeated value of the active feature. Eq. 5 is as follows:
\(y=\min\left({q}_{1}, \dots , {q}_{n}\right)\)  (5)
As in Eq. 2, \(n\) is the number of elements of the active class, i.e., the number of values of the active feature within that class.
The ratio of the count of the most repeated value of the studied feature to the total number of values of that feature is calculated by Eq. 6.
\(ratio= \frac{x}{n}\)  (6)
We need a threshold value to be able to evaluate the ratio we found.
In order to determine the outlier data density at the most appropriate rate, a threshold ratio was set, and classification accuracy rates were determined for each threshold value. When the threshold value is 100%, classification is performed on all data without any elimination. Since feature loss occurred, the minimum threshold value was not lowered below 70%. Threshold values in the range 100–70% were considered in 10% steps, and classification studies were carried out on the data set from which the outlier values had been removed.
If the ratio equals 1, all values of the active feature are identical and the feature contains no outliers. Otherwise, if the ratio is greater than the threshold and lower than 1, the active feature has outlier values to be deleted from the database; the algorithm deletes the records indicated by \(y\) from Eq. 5. This is expressed in Eq. 7 below.
\(\left\{\begin{array}{l}ratio=1\to \text{all values of the active feature are identical}\\ \left(ratio>threshold\right)\wedge \left(ratio<1\right)\to \text{delete the records indicated by } y\end{array}\right.\)  (7)
With Eq. 8, the ratio evaluation is applied separately to all the features in the entire dataset.
\(ratio\to \left(\forall {C}_{i},\, \forall {F}_{j}\right)\)  (8)
With this method, all records considered outlier data are removed from the data set. Machine learning methods can then be applied to the remaining cleaned data set.
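A minimal Python sketch of Eqs. 1–8, assuming the dataframe layout of the earlier examples; the variable names mirror the symbols above, and the exact deletion rule is our reading of Eq. 7:

```python
import pandas as pd

def remove_outliers(df, threshold=0.8, feature_cols=None):
    """Drop records holding a least-repeated feature value within their
    disease class, following Eqs. 1-8 (Age and Class are excluded)."""
    if feature_cols is None:
        feature_cols = [f"f{i}" for i in range(1, 34)]
    keep = pd.Series(True, index=df.index)
    for c_i, group in df.groupby("class"):    # each class C_i
        n = len(group)                        # size of the active class
        for f_j in feature_cols:              # each feature F_j
            q = group[f_j].value_counts()     # counts q of indexed values v (Eq. 3)
            x = q.max()                       # count of the most repeated value (Eq. 4)
            ratio = x / n                     # Eq. 6
            if threshold < ratio < 1:         # Eq. 7
                rare = q[q == q.min()].index  # values indicated by y (Eq. 5)
                keep.loc[group.index[group[f_j].isin(rare)]] = False
    return df[keep]

clean_df = remove_outliers(df, threshold=0.8)
print(len(clean_df))  # number of records remaining after deletion
```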
Diagram 1 : General flow chart of classification work
The general flow chart of the classification work to be done is shown in Diagram 1.
Diagram 2 : Detailed flow chart of detecting the best classification method.
Diagram 2 gives the flow chart of the machine learning methods applied to the dataset obtained by evaluating the features against the threshold value and deleting the related records.
3.4. Performance Analysis of Classification Methods
In our study, the classification methods of the scikit-learn (sklearn) library in the Python programming language were used. The data set was divided into 33% test and 67% training sections, so the classification studies were carried out with 121 test records, corresponding to 33% of the 366 records in the data set. Using the training and test data, the LogisticRegression, KNeighborsClassifier, Support Vector Classification (SVC), GaussianNB, DecisionTreeClassifier and RandomForestClassifier methods were applied in turn.
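A sketch of this split-and-fit procedure; the hyperparameters (scikit-learn defaults, plus a raised max_iter for LogisticRegression), the random seed and the mean imputation of the 8 unknown ages are assumptions, as the paper does not report them:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns=["class"]).copy()
X["age"] = X["age"].fillna(X["age"].mean())  # assumed handling of unknown ages
y = df["class"]

# 67% train / 33% test, i.e. 121 of the 366 records held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SVC": SVC(),
    "GaussianNB": GaussianNB(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))  # single-fold (hold-out) accuracy
```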
The classification results were evaluated with k-fold cross-validation values. The results obtained without deleting the outlier records were first evaluated with the single-fold validation method.
By combining the classification results obtained for the threshold values in the range 100–70%, the threshold value that yielded the largest number of high classification rates across the methods used was determined. By applying the 5-fold cross-validation method to the results obtained, the machine learning method providing the highest classification rate was determined.
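A sketch of this threshold sweep, reusing remove_outliers and the classifiers dictionary from the earlier examples (the age imputation is again an assumption):

```python
from sklearn.model_selection import cross_val_score

for thr in (1.0, 0.9, 0.8, 0.7):                 # 100%-70% in 10% steps
    sub = remove_outliers(df, threshold=thr)
    X = sub.drop(columns=["class"]).copy()
    X["age"] = X["age"].fillna(X["age"].mean())  # assumed handling of unknown ages
    y = sub["class"]
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
        print(thr, name, round(scores.mean(), 3), "records:", len(sub))
```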
Logistic regression, KNeighbors Classifier, Support Vector Classification, Gaussian Naive Bayes, Decision Tree Classifier and Random Forest Classifier methods were used as classification methods.
3.4.1. Logistic Regression
With logistic regression, a discrimination model is created according to the number of groups in the structure of the data, and new data added to the dataset are classified with this model. The purpose of logistic regression is to create a model that establishes the relationship between the dependent and independent variables with the fewest variables and the best fit (Bircan, 2004).
3.4.2. KNeighbors Classifier
In the KNeighborsClassifier (k-nearest neighbors) classification, clusters are formed from the distances between the classes in the existing data set, depending on the parameter k, and new data are classified according to their similarity to these clusters (Keskenler et al., 2021).
3.4.3. Support Vector Classification (SVC)
Support vector classification predicts the outputs of new data based on existing data. It performs classification by finding the separating plane with the widest margin between the classes (Güner et al., n.d.).
3.4.4. Gaussian Naive Bayes
Gaussian Naive Bayes applies a classification algorithm based on the probabilities of the Gaussian distribution (Karatay & Algahani, 2021).
3.4.5. Decision Tree Classifier
The decision tree classifier consists of three components: nodes, branches and leaves. Questions are asked over the attributes in the training data to create a tree structure, and this process continues until branchless nodes (leaves) are reached (Çölkesen & Kavzoğlu, 2010).
3.4.6. Random Forest Classifier
The purpose of the random forest classifier is to combine the decisions of many trees trained on different training sets rather than relying on a single decision tree (Daş et al., n.d.).
Table 2
Single-fold validation scores obtained for threshold values between 100% and 70%

| No | Threshold | Classification Type | Single-fold validation score | Chronic dermatitis | Lichen planus | Pityriasis rosea | Pityriasis rubra pilaris | Psoriasis | Seborrheic dermatitis | Total patients |
|----|-----------|---------------------|------------------------------|--------------------|---------------|------------------|--------------------------|-----------|-----------------------|----------------|
| 1 | 1.0 | Logistic Regression | 0.992 | 52 | 72 | 49 | 20 | 112 | 61 | 366 |
| 2 | 1.0 | KNeighborsClassifier | 0.934 | 52 | 72 | 49 | 20 | 112 | 61 | 366 |
| 3 | 1.0 | SVC | 0.975 | 52 | 72 | 49 | 20 | 112 | 61 | 366 |
| 4 | 1.0 | GaussianNB | 0.818 | 52 | 72 | 49 | 20 | 112 | 61 | 366 |
| 5 | 1.0 | DecisionTreeClassifier | 0.95 | 52 | 72 | 49 | 20 | 112 | 61 | 366 |
| 6 | 1.0 | RandomForestClassifier | 0.959 | 52 | 72 | 49 | 20 | 112 | 61 | 366 |
| 7 | 0.90 | Logistic Regression | 0.99 | 42 | 65 | 45 | 15 | 103 | 46 | 316 |
| 8 | 0.90 | KNeighborsClassifier | 0.981 | 42 | 65 | 45 | 15 | 103 | 46 | 316 |
| 9 | 0.90 | SVC | 0.981 | 42 | 65 | 45 | 15 | 103 | 46 | 316 |
| 10 | 0.90 | GaussianNB | 0.933 | 42 | 65 | 45 | 15 | 103 | 46 | 316 |
| 11 | 0.90 | DecisionTreeClassifier | 0.99 | 42 | 65 | 45 | 15 | 103 | 46 | 316 |
| 12 | 0.90 | RandomForestClassifier | 1.0 | 42 | 65 | 45 | 15 | 103 | 46 | 316 |
| 13 | 0.80 | Logistic Regression | 1.0 | 29 | 41 | 40 | 15 | 87 | 31 | 243 |
| 14 | 0.80 | KNeighborsClassifier | 0.988 | 29 | 41 | 40 | 15 | 87 | 31 | 243 |
| 15 | 0.80 | SVC | 1.0 | 29 | 41 | 40 | 15 | 87 | 31 | 243 |
| 16 | 0.80 | GaussianNB | 1.0 | 29 | 41 | 40 | 15 | 87 | 31 | 243 |
| 17 | 0.80 | DecisionTreeClassifier | 1.0 | 29 | 41 | 40 | 15 | 87 | 31 | 243 |
| 18 | 0.80 | RandomForestClassifier | 1.0 | 29 | 41 | 40 | 15 | 87 | 31 | 243 |
| 19 | 0.70 | Logistic Regression | 1.0 | 29 | 30 | 31 | 6 | 63 | 12 | 171 |
| 20 | 0.70 | KNeighborsClassifier | 0.965 | 29 | 30 | 31 | 6 | 63 | 12 | 171 |
| 21 | 0.70 | SVC | 1.0 | 29 | 30 | 31 | 6 | 63 | 12 | 171 |
| 22 | 0.70 | GaussianNB | 1.0 | 29 | 30 | 31 | 6 | 63 | 12 | 171 |
| 23 | 0.70 | DecisionTreeClassifier | 0.947 | 29 | 30 | 31 | 6 | 63 | 12 | 171 |
| 24 | 0.70 | RandomForestClassifier | 0.982 | 29 | 30 | 31 | 6 | 63 | 12 | 171 |
Table 2 shows the single-fold validation scores of the machine learning classification methods applied to the database at each threshold value. While all 366 records were retained at the 100% threshold, the number of patient records decreased to 171 at the 70% threshold.
When the whole table is evaluated, the classification algorithms achieve the highest accuracy rates at the 80% threshold value. Since these rates were determined with the single-fold validation method, the 5-fold cross-validation method was then applied to the results for the 80% threshold in order to obtain more reliable results.
Table 3
5-fold cross-validation result table

| Classification Method | Normal classification: Single fold | Normal classification: 5-fold CV | Classification at 80% threshold: Single fold | Classification at 80% threshold: 5-fold CV |
|-----------------------|------------------------------------|----------------------------------|----------------------------------------------|--------------------------------------------|
| Logistic Regression | 0.992 | 0.954 | 1.0 | 0.987 |
| KNeighbors Classifier | 0.934 | 0.922 | 0.988 | 0.975 |
| Support Vector Classification (SVC) | 0.975 | 0.962 | 1.0 | 0.975 |
| Gaussian Naive Bayes | 0.818 | 0.881 | 1.0 | 1.0 |
| Decision Tree Classifier | 0.95 | 0.959 | 1.0 | 0.968 |
| Random Forest Classifier | 0.983 | 0.967 | 1.0 | 0.981 |
When Table 3 is examined, a difference between the single-fold validation and 5-fold cross-validation values is seen. Since 5-fold cross-validation is, by construction, applied over the entire dataset, it gives a more accurate picture of the performance of each classification method.
When the normal classification results are compared with the results obtained at the threshold value, the success rate of the threshold-based classification is higher. Under the 5-fold cross-validation method, the Gaussian Naive Bayes method achieved the highest classification success with 100%.
As can be seen in Fig. 11, the classification success rates lie in the 96–100% range when the threshold value is 1, and at the 0.8 point all methods succeed at values close to 100%. From this graph, it is understood that decreasing the threshold value from 1 to 0.8 caused 20% of the records to be detected as outliers and deleted from the table.
Following the normal classification study of Table 2, the correlation graph of the dataset features was drawn (Fig. 12). Under normal classification, the features of the diseases did not appear clearly.
The correlation graph of the outlier-free dataset, obtained after applying the threshold value determined in Table 2, is shown in Fig. 13. The white areas in the graph indicate a high correlation between the related feature and the disease.
Clinical and histopathological findings such as knee and elbow involvement, scalp involvement, clubbing of the rete ridges, elongation of the rete ridges, and thinning of the suprapapillary epidermis are compatible with psoriasis. The correlation graph of psoriasis in the database cleared of outlier data was observed to be fully compatible with the findings of this disease.
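A correlation graph of this kind can be drawn from the cleaned dataframe; a sketch with matplotlib, where one-hot encoding the class column so that feature-disease correlations appear is our assumption about how the figure was produced:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Correlate every feature with a 0/1 indicator column per disease class
indicators = pd.get_dummies(clean_df["class"], prefix="disease").astype(int)
corr = pd.concat([clean_df.drop(columns=["class"]), indicators], axis=1).corr()

plt.figure(figsize=(10, 8))
plt.imshow(corr, cmap="gray")  # in this colormap, white = high correlation
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns, rotation=90, fontsize=6)
plt.yticks(range(len(corr)), corr.columns, fontsize=6)
plt.tight_layout()
plt.show()
```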