The real-time skin disease data were collected from a popular private hospital in Chennai that is always crowded with patients suffering from various illnesses. According to the in-patient records, on average 2000 to 6000 patients visit the hospital per day across multiple departments, and among them 200 to 250 patients visit the Dermatology department with skin-related ailments. The inflow of patients to the Dermatology department rises sharply as the temperature increases. The climatic dataset comprises daily and monthly readings of temperature, rainfall, humidity and precipitation collected from 2000 to 2018. This research was carried out in two phases: in the first phase, we collected the hospital data and ran the experiments; in the second phase, we obtained the climatic dataset from the National Data Centre, India. The aim of this work is to find the association between the climate data and the hospital data. Gender is encoded as 0 for Female and 1 for Male. The experimental results show that males are more affected than females, which we attribute to their longer exposure to the sun. The plotted distribution shows that 69.5% of males and 31.5% of females are affected by skin disease.
To address this problem, we propose a framework built through a trial-and-error combination of methods: machine learning models are used to measure the sensitivity and accuracy of disease-outbreak prediction, and the output values are used for better forecasting and for applying ensemble feature selection.
Table 1 lists the 15 attributes, their data types and the feature labels assigned for the experimental investigation.

Table 1. Representation of all feature names with data types
| Feature | Data Type | Feature Label |
| --- | --- | --- |
| Year_week | Int64 | F1 |
| Recorded_year | Int64 | F2 |
| Recorded_month | Int64 | F3 |
| Air_temp | Float64 | F4 |
| Humidity | Float64 | F5 |
| Surface_water3 | Float64 | F6 |
| Total_vegetation | Float64 | F7 |
| Min_air_temp | Float64 | F8 |
| Surface_water5 | Float64 | F9 |
| Total_precipitation | Float64 | F10 |
| Max_air_temp | Float64 | F11 |
| Total_precipitaion in KG | Float64 | F12 |
| Northeast_NVDI | Float64 | F13 |
| Mean_duepoint | Float64 | F14 |
| Mean_humidity | Float64 | F15 |
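A minimal sketch of how the Table 1 schema can be enforced with pandas is shown below. The file name is hypothetical; the feature names and data types are taken directly from Table 1.

```python
# Minimal sketch (file name hypothetical): load the collected dataset and
# cast each column to the data type declared in Table 1.
import pandas as pd

df = pd.read_csv("skin_disease_climate_weekly.csv")  # assumed export of the collected data

table1_dtypes = {
    "Year_week": "int64", "Recorded_year": "int64", "Recorded_month": "int64",
    "Air_temp": "float64", "Humidity": "float64", "Surface_water3": "float64",
    "Total_vegetation": "float64", "Min_air_temp": "float64",
    "Surface_water5": "float64", "Total_precipitation": "float64",
    "Max_air_temp": "float64", "Total_precipitaion in KG": "float64",
    "Northeast_NVDI": "float64", "Mean_duepoint": "float64",
    "Mean_humidity": "float64",
}

df = df.astype(table1_dtypes)  # raises an error if a column is missing or not convertible
print(df.dtypes)
```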
Figure 2 shows the architecture of the proposed approach. It consists of two phases, namely a training phase and a testing phase. During the training phase, real-time data collected from patients are preprocessed and stored in a database. During the testing phase, the same procedure is used for preprocessing and feature extraction. The test features extracted in this phase are compared against the training features stored in the dataset using the KNN, SVM and Random Forest algorithms, as sketched below.
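The following sketch illustrates the two-phase flow of Fig. 2 under stated assumptions: the Table 1 features are held in a pandas DataFrame `df` with a hypothetical binary class column `skin_disease` (0 = not affected, 1 = affected), and a standard scaler stands in for the stored preprocessing step.

```python
# Minimal sketch of the two-phase architecture in Fig. 2 (assumed column
# name "skin_disease"; scaler used as an illustrative preprocessing object).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["skin_disease"])  # features F1-F15
y = df["skin_disease"]                 # class label

# Training phase: preprocess the collected data and keep the fitted
# preprocessing objects alongside the training features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
scaler = StandardScaler().fit(X_train)
X_train_prepared = scaler.transform(X_train)

# Testing phase: the same preprocessing is reused on unseen data before the
# test features are passed to the KNN, SVM and Random Forest classifiers
# (trained in Section 3.3).
X_test_prepared = scaler.transform(X_test)
```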
3.1 Preprocessing Step
The data collected from the hospital come in various formats such as video, structured records, audio and images. During preprocessing, these data are converted into a machine-readable format, i.e., 0's and 1's; preprocessing is the step in the machine learning workflow that makes the raw data readable by the machine. A dataset is a collection of samples, entities, points, cases, patterns and observations, and each data object is described by a number of attributes or variables. The data are classified into two types: categorical and numerical.
Categorical data take constant, Boolean-like values (e.g., yes/no), whereas numerical data are continuous and dynamic (e.g., temperature, age). Data quality is achieved by applying preprocessing techniques, because data generated from different sources are raw and can affect model accuracy. Missing-value gaps arise during data collection, caused either by machines or by human mistakes at recording time. Eliminating rows and columns is one option, but it is not useful because it reduces the sensitivity of the data. The most common plan for handling missing values in rows and columns is to impute the mean, mode or median value of the corresponding feature, as sketched below.
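A minimal sketch of this imputation strategy, assuming the raw records are in a pandas DataFrame `df`: numeric gaps are filled with the column mean (median is an alternative) and categorical gaps with the mode, instead of dropping rows or columns and losing sensitivity.

```python
# Minimal sketch: mean/median imputation for numeric columns, mode for
# categorical columns (DataFrame `df` is assumed from the earlier sketch).
import pandas as pd

for column in df.columns:
    if pd.api.types.is_numeric_dtype(df[column]):
        df[column] = df[column].fillna(df[column].mean())          # or .median()
    else:
        df[column] = df[column].fillna(df[column].mode().iloc[0])  # most frequent value
```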
3.2 Feature Selection Step
Feature selection is the process of automatically selecting the variables in the dataset that are most relevant for the forecasting methods. Medical data contain irrelevant attributes that may affect model performance, and not all attributes are useful; a larger number of features changes the model and makes it more complex. Any feature selection technique aims to improve model performance and to provide a cost-effective, faster understanding of the hidden patterns. Feature selection methods fall into three types; the various feature selection methods available for data analysis are represented in Fig. 3.
Filter methods apply a statistical technique to assign a score to each variable, and variables are selected depending on that score. Wrapper methods select features by evaluating different feature combinations and comparing them against other combinations. Embedded (ensemble) methods, such as regularization techniques, penalize the model so that weak coefficients shrink towards zero.
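A minimal sketch of the three families using scikit-learn, assuming the prepared training features from the Fig. 2 sketch (`X_train_prepared`, `y_train`); the choice of eight selected features and the specific estimators are illustrative, not the authors' configuration.

```python
# Minimal sketch of filter, wrapper and embedded/regularization selection.
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# Filter: a statistical test (here the ANOVA F-score) ranks each variable
# and the top-scoring ones are kept.
filter_sel = SelectKBest(score_func=f_classif, k=8).fit(X_train_prepared, y_train)

# Wrapper: feature combinations are evaluated around a base estimator
# (here recursive feature elimination with logistic regression).
wrapper_sel = RFE(LogisticRegression(max_iter=1000),
                  n_features_to_select=8).fit(X_train_prepared, y_train)

# Embedded/regularization: an L1 penalty shrinks weak coefficients to zero.
embedded = LogisticRegression(penalty="l1", solver="liblinear",
                              C=0.5).fit(X_train_prepared, y_train)

print("Filter scores:    ", filter_sel.scores_)
print("Wrapper selection:", wrapper_sel.support_)
print("L1 coefficients:  ", embedded.coef_)
```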
3.3 Classification by Supervised learning model
In this paper, supervised learning methods, namely KNN, SVM and Random Forest, are considered for classification. The performance of a machine learning algorithm is measured by statistical tests: precision, accuracy and F-measure. Precision is defined as the proportion of predicted positive cases that are truly positive; accuracy is defined as the proportion of correct forecasts among all cases; and the F-measure is defined as the harmonic mean of recall and precision. These three performance metrics are used to select significant attributes; most machine learning models rely on such metrics, and they play an essential role in attribute selection. Different combinations of variables are applied, the main idea being to remove irrelevant characteristics that degrade the performance of the model. All three metrics are calculated for each combination and recorded in the tables below.
Accuracy = \(\frac{TP + TN}{TP + TN + FP + FN}\) (1)

F-Measure = \(\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\) (2)

Precision = \(\frac{TP}{TP + FP}\) (3)
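A minimal sketch of the classification and evaluation step, assuming the prepared matrices from the Fig. 2 sketch (`X_train_prepared`, `X_test_prepared`, `y_train`, `y_test`); the hyperparameters shown are illustrative defaults, not the authors' tuned values.

```python
# Minimal sketch: train KNN, SVM and Random Forest, then compute the
# metrics of Eqs. (1)-(3) from the binary confusion matrix.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train_prepared, y_train)
    y_pred = model.predict(X_test_prepared)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    accuracy = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (1)
    precision = tp / (tp + fp)                                 # Eq. (3)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (2)

    print(f"{name}: accuracy={accuracy:.3f}, "
          f"precision={precision:.3f}, F-measure={f_measure:.3f}")
```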