This step is performed in the NN. It ensures that the input data is presented in a clear and organized format. The primary advantage of feature extraction is its ability to identify the most effective features from which the model classifier learns the representation [15]. Certain errors may arise from human mistakes during the data collection phase, resulting in labeling errors.
3.3.6. Train/Test
The dataset is split into three subsets: a training set, a validation set, and a test set. The training set is used to train the model, while the validation set is used to evaluate its performance. Finally, the test set serves as the ultimate benchmark, allowing us to assess the model’s effectiveness in real-world scenarios. We divide the MUCHD dataset into smaller batches, with 70% of the data allocated for training and validation and 30% reserved for testing. We then feed those batches into the DLNN technique.
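The 70/30 division and the batching can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function names, the random seed, the batch size of 20, and the use of 548 samples (the sample count that appears in the hidden-neuron calculation) are assumptions made for the example.

```python
import random

def split_indices(n_samples, test_fraction=0.30, seed=0):
    """Shuffle sample indices and split them into train+validation and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test_idx = indices[:n_test]
    train_val_idx = indices[n_test:]
    return train_val_idx, test_idx

def batches(indices, batch_size):
    """Yield consecutive mini-batches of indices; the last one may be smaller."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]

# Assumed sample count and batch size, for illustration only.
train_val_idx, test_idx = split_indices(548)
first_batch = next(batches(train_val_idx, batch_size=20))
```

The train+validation part would then be divided further into training and validation folds during cross-validation.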
Algorithm 1: The Train Algorithm
1: Start
2: The input features are fed to the input layer of the NN.
3: Randomly initialize the parameters (weights and biases) of the NN.
4: Compute the output of each neuron in the hidden layers and output layers.
5: Calculate the gradients of the parameters.
6: Apply the activation function to the output of each neuron computed in step 4.
7: Update each parameter using the gradient optimization.
8: Update the weights of the NN based on the propagation approach.
9: Repeat steps 4–8 until all conditions are completed. If the output is >= 0.5 then diagnosis = "diabetic" else diagnosis = "non-diabetic" End If.
10: End
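The steps of Algorithm 1 can be illustrated with a small pure-Python network. This is only a sketch under stated assumptions, not the authors' Keras implementation: it assumes one ReLU hidden layer, a single sigmoid output, a binary cross-entropy loss (the source does not name the loss), per-sample gradient descent, and a fixed epoch budget as the stopping condition. The toy data is made up.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x, w1, b1, w2, b2):
    """Steps 4 and 6: weighted sums, then activations (ReLU hidden, sigmoid output)."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output

def train(samples, labels, n_hidden=8, lr=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0])
    # Step 3: random initial weights, zero biases.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    for _ in range(epochs):  # Step 9: repeat until the epoch budget is used up.
        for x, y in zip(samples, labels):
            hidden, out = forward(x, w1, b1, w2, b2)
            # Step 5: gradient of the (assumed) cross-entropy loss at the output.
            delta_out = out - y
            for j in range(n_hidden):
                # The gradient reaches hidden unit j only if its ReLU was active.
                delta_h = delta_out * w2[j] if hidden[j] > 0.0 else 0.0
                # Steps 7-8: gradient-descent updates of weights and biases.
                w2[j] -= lr * delta_out * hidden[j]
                b1[j] -= lr * delta_h
                for i in range(n_in):
                    w1[j][i] -= lr * delta_h * x[i]
            b2 -= lr * delta_out
    return w1, b1, w2, b2

def diagnose(x, params):
    """Threshold rule from step 9 of Algorithm 1."""
    _, out = forward(x, *params)
    return "diabetic" if out >= 0.5 else "non-diabetic"

# Toy data with two made-up features; not taken from the MUCHD dataset.
toy_x = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
toy_y = [0, 1, 0, 1]
params = train(toy_x, toy_y)
```

Calling `diagnose(toy_x[1], params)` applies the 0.5 threshold to the trained network's output for one sample.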
The feature vector is fed directly into the input nodes. These nodes are initialized with random weights and fine-tuned parameters for the DMLP. Each node generates an output using an activation function, and the outputs are passed to the next hidden layer. The activation functions vary across the hidden layers. The features are then retrieved and concatenated to create a new feature vector, which the classifier receives to determine the confidence of each relation. The classifier then produces a binary output vector.
Training the classifier is the most crucial aspect of the classification process. The role of this phase is to generate a model by training it with predefined diagnosis class labels; the model is later used to classify unlabeled diagnoses. The training data is essentially the means by which the classifier model is learned. After the data has been fed in, forward propagation occurs. The loss is computed with the loss function, and the parameters are adjusted according to the incurred loss. Throughout the training process, the algorithm searches for patterns that correlate with the desired output, as declared in Eq. 3.3.6.1.
In a neural network, each hidden neuron computes the weighted sum of its inputs and adds a bias term, according to Eq. 3.3.6.1, and then decides whether it should be ‘fired’ or not. A specific neuron is therefore computed as follows.
\(Y=\sum \left(\text{input} \times \text{weight}\right)+\text{bias}\)    (3.3.6.1)
where Y is the weighted-sum output of the neuron before activation and input denotes the input features.
The value of Y can range from -∞ to +∞, so the neuron cannot decide on its own whether to fire. The activation function is used to decide whether the neuron fires. We used ReLU as the activation function [18].
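As a worked illustration of Eq. 3.3.6.1, the snippet below computes the weighted sum Y for a single neuron and then applies ReLU to decide whether it fires. The input values, weights, and bias are made-up numbers chosen for the example.

```python
def weighted_sum(inputs, weights, bias):
    """Y = sum(input * weight) + bias, as in Eq. 3.3.6.1."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def relu(y):
    """ReLU passes positive values through and maps everything else to zero."""
    return max(0.0, y)

inputs = [1.0, 2.0, 3.0]  # made-up feature values

# A neuron whose weighted sum is negative does not fire.
y_silent = weighted_sum(inputs, [0.5, -1.0, 0.25], bias=0.5)
# A neuron whose weighted sum is positive fires with that value.
y_firing = weighted_sum(inputs, [0.5, 1.0, 0.25], bias=0.5)

outputs = (relu(y_silent), relu(y_firing))
```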
The maximum number of hidden neurons that won't result in over-fitting [19] is calculated as shown in Eq. 3.3.6.2.
\(N_h=\frac{N_s}{\alpha \left(N_i+N_o\right)}\)    (3.3.6.2)
where \(N_h\) is the number of neurons in the hidden layers, \(N_s\) is the total number of samples in the MUCHD dataset, \(N_i\) is the number of input neurons, \(N_o\) is the number of output neurons, and α is an arbitrary scaling factor that usually takes a value between 2 and 10. With \(N_s\) = 548, α = 4, \(N_i\) = 17, and \(N_o\) = 1:
\(N_h\) = 548 / (4 × (17 + 1)) = 7.61 ≈ 8 neurons
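The arithmetic behind Eq. 3.3.6.2 can be checked with a few lines of Python. The function name is hypothetical; the values 548, 17, 1, and α = 4 are the ones used in the calculation above, and the α = 2 and α = 10 cases are added only to show how the bound moves across the stated range.

```python
def max_hidden_neurons(n_samples, n_inputs, n_outputs, alpha):
    """Upper bound on hidden neurons: Ns / (alpha * (Ni + No))."""
    return n_samples / (alpha * (n_inputs + n_outputs))

nh = max_hidden_neurons(n_samples=548, n_inputs=17, n_outputs=1, alpha=4)
nh_rounded = round(nh)

# The same bound at both ends of the usual range for alpha.
nh_alpha_2 = max_hidden_neurons(548, 17, 1, alpha=2)
nh_alpha_10 = max_hidden_neurons(548, 17, 1, alpha=10)
```

A smaller α permits more hidden neurons, and a larger α permits fewer.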
The NN is trained using the gradient descent algorithm to control the range of weight values throughout the training phase. We used the Rectified Linear Unit (ReLU) activation function in the first hidden layer, as depicted in Fig. 4 (a), and a sigmoid activation function in the second layer, as depicted in Fig. 4 (b). Generally, the sigmoid and ReLU activation functions are employed for binary classification outputs, whereas the softmax and logarithmic activation functions are typically used for multi-class outputs [20].
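The difference between a binary and a multi-class output can be made concrete with a short sketch. The definitions below are the standard sigmoid and softmax functions written in plain Python; the input scores are made-up values.

```python
import math

def sigmoid(z):
    """Map one score to a probability in (0, 1), as used for a binary output."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(scores):
    """Map several scores to probabilities that sum to 1, one per class."""
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p_binary = sigmoid(0.0)                   # a score of 0 sits on the 0.5 threshold
p_multiclass = softmax([2.0, 1.0, 0.1])   # three hypothetical class scores
```

A single sigmoid unit is enough for the diabetic versus non-diabetic decision, whereas softmax would need one output unit per class.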
In the validation process, we run the proposed model on various subsets of the training and validation datasets and then obtain model quality measures [23]. This step can be further categorized into two techniques: exhaustive and non-exhaustive cross-validation. In the exhaustive cross-validation approach, training and testing are performed on all data samples. A portion of the dataset is designated for testing, while the remaining portions are used for training. It is also divided into:
In the non-exhaustive cross-validation approach, the dataset is divided into multiple subsets, each consisting of several blocks. Each block is divided into training samples and test samples, and the overall result is the average over all test samples. It is divided into:
- K-fold Cross-Validation splits the data into k subsets. One of the k subsets is used as the validation set, while the other k-1 subsets are used as the training set [22–23].
- The holdout method sets aside a portion of the dataset for evaluation and trains the model on the rest of the dataset [21].
- Stratified K-fold Cross-Validation is suited to an imbalanced dataset. Each fold contains approximately the same proportion of samples of each output class.
In our model, the data is divided into 5 folds, each containing 20% of the dataset. We employ k = 5 folds, which means that four-fifths of the data is used for training and only one fold is used for validation. In the first iteration, we designate the first fold as the validation set and use the remaining folds for training. This is valuable for quantitative evaluation, measuring the model quality on a 20% holdout set. We repeat this process, using each fold once as the holdout set, and the error is averaged over the iterations (see Fig. 5).
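The 5-fold procedure can be sketched in plain Python as follows. This is an illustration rather than the scikit-learn routine used in the study: the function names are hypothetical, the sample count of 100 is chosen only so that each fold holds exactly 20%, no shuffling is done, and `validation_share` is a placeholder for a real evaluator that would train the model and return its validation error.

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, validation

def cross_validate(n_samples, evaluate, k=5):
    """Average the per-fold value returned by the caller-supplied `evaluate`."""
    scores = [evaluate(train, val) for train, val in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)

def validation_share(train, val):
    """Placeholder evaluator: reports the validation share instead of an error."""
    return len(val) / (len(train) + len(val))

splits = list(k_fold_splits(100, k=5))
mean_share = cross_validate(100, validation_share, k=5)
```

Each sample appears in exactly one validation fold, so every sample is used once for validation and four times for training.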
The stratify parameter is a valuable tool for addressing imbalances in data. For example, if diabetic patients (class one) make up 75% of the data and non-diabetic patients (class zero) make up 25%, the stratify parameter ensures that each data split preserves the same proportions. The validation structure is presented in Fig. 6.
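The effect of stratification can be demonstrated with a small stand-in written in plain Python. It is not the scikit-learn implementation; the function name is hypothetical, and the 75/25 label list mirrors the example proportions above rather than the actual class balance of the dataset.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Deal the indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        for position, idx in enumerate(class_indices):
            folds[position % k].append(idx)
    return folds

# Example proportions only: 75 class-one and 25 class-zero labels.
labels = [1] * 75 + [0] * 25
folds = stratified_folds(labels, k=5)
diabetic_share = [sum(labels[i] for i in fold) / len(fold) for fold in folds]
```

Every fold ends up with the same 75% share of class one as the full label list, which is the property the stratify parameter is meant to preserve.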
The data with the unlabeled classes was prepared previously in the pre-processing phase. The mapping function is employed to classify unseen or unlabeled data, determining which label each case belongs to. In the test phase, the model finds the data features that correlate with a defined class. The classification technique is then responsible for assigning an accurate diagnosis class label to unlabeled cases, specifically distinguishing between diabetes and non-diabetes.
We train the DLNN sequential model using various hyperparameters, experimenting with different values to determine the best-fit setting for each:
- epochs (the number of passes over the training data): three values (10, 50, 100)
- batch size (the number of samples fed to the NN before each parameter update): six values (10, 20, 40, 60, 80, 100)
- optimizer (the gradient-based rule used to update the parameters): seven values ('SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam')
- activation function: eight values ('softmax', 'softplus', 'softsign', 'ReLU', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear')
- weight constraint: five values (0.01, 0.02, 0.03, 0.04, 0.05)
- neurons: eleven values (1, 5, 6, 7, 8, 10, 15, 17, 20, 25, 30)
- momentum: six values (0.0, 0.2, 0.4, 0.6, 0.8, 0.9)
- learning rate: five values (0.001, 0.01, 0.1, 0.2, 0.3)
We used the Keras and TensorFlow libraries to create a sequential NN model. In the NN, the Stochastic Gradient Descent (SGD) optimizer is used to reduce the output error of the feed-forward network. We used the train-test split and cross-validation functions from the scikit-learn library to perform the splitting task.
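To give a sense of the size of this search space, the snippet below stores the listed values in a dictionary and counts the combinations. The dictionary keys are hypothetical labels, and the text does not say whether the hyperparameters were tuned jointly or one group at a time, so both counts are shown purely as an illustration.

```python
from itertools import product
from math import prod

# Candidate values as listed in the text; the key names are illustrative.
grid = {
    "epochs": [10, 50, 100],
    "batch_size": [10, 20, 40, 60, 80, 100],
    "optimizer": ["SGD", "RMSprop", "Adagrad", "Adadelta", "Adam", "Adamax", "Nadam"],
    "activation": ["softmax", "softplus", "softsign", "ReLU", "tanh",
                   "sigmoid", "hard_sigmoid", "linear"],
    "weight_constraint": [0.01, 0.02, 0.03, 0.04, 0.05],
    "neurons": [1, 5, 6, 7, 8, 10, 15, 17, 20, 25, 30],
    "momentum": [0.0, 0.2, 0.4, 0.6, 0.8, 0.9],
    "learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3],
}

# Size of a full joint grid versus tuning each hyperparameter on its own.
n_joint = prod(len(values) for values in grid.values())
n_one_at_a_time = sum(len(values) for values in grid.values())

# First configuration of the joint grid, built lazily without expanding it.
first_config = dict(zip(grid, next(product(*grid.values()))))
```

The gap between the two counts shows why such searches are often run over one hyperparameter group at a time instead of the full Cartesian product.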