Fig. 2 gives a pictorial view of our samples. These samples are divided into training and test data for the three selected models.
4.1 Decision Tree
The decision tree is one of the supervised learning algorithms and is used mainly for classification and regression tasks. A decision tree consists of a root node, branches, internal nodes, and leaf nodes. Using a divide-and-conquer strategy, the algorithm searches for the attribute that best splits the data at the root node and repeats this process recursively; at each node, the split quality is evaluated with the Gini index [4] or the entropy formula.
The salient features of decision tree algorithms are:
- They require less effort for data preprocessing.
- They do not require any normalization of data.
- Missing values in the dataset do not affect the construction of the decision tree.
- The results of a decision tree model are easy to obtain and interpret.
As the tree grows, the training data is split into smaller and smaller subsets; this is known as data fragmentation, and it can often lead to overfitting. To reduce complexity and prevent overfitting, pruning is usually employed; this process removes branches that split on features of low importance. Pruning deletes parts of the tree that contribute little to its accuracy, which reduces the model's storage size and speeds up prediction. The pruned model's fit can then be evaluated through cross-validation. When several such trees are combined into an ensemble, the classifier predicts more accurate results, particularly when the individual trees are uncorrelated.
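As a rough illustration of pruning and cross-validation, the sketch below fits a decision tree with scikit-learn, tries a few cost-complexity pruning strengths (ccp_alpha), and keeps the one with the best cross-validated accuracy. The data is a synthetic placeholder, not the water-quality samples of Fig. 2, and the parameter values are illustrative only.

# Minimal decision-tree sketch with cost-complexity pruning and cross-validation.
# X (features) and y (class labels) are synthetic placeholders for the samples
# described in the text.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Try a few pruning strengths and keep the one that cross-validates best.
best_alpha, best_score = 0.0, -np.inf
for alpha in [0.0, 0.001, 0.01, 0.05]:
    tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=best_alpha,
                                     random_state=0).fit(X, y)
print(f"best ccp_alpha={best_alpha}, cross-validated accuracy={best_score:.3f}")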
Choosing the best attribute at each node
The best attribute at each node can be selected using measures such as information gain and Gini impurity. These measures evaluate the quality of each test condition and how well it separates the samples into classes.
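A minimal sketch of the Gini impurity measure mentioned above is given below; the class labels in the example are hypothetical.

from collections import Counter

def gini_impurity(labels):
    # Gini impurity of a set of class labels: 1 - sum(p_i ** 2).
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0; an evenly mixed two-class node has impurity 0.5.
print(gini_impurity(["good", "good", "good"]))          # 0.0
print(gini_impurity(["good", "poor", "good", "poor"]))  # 0.5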
Entropy and Information Gain
Entropy is used to measure the uncertainty in the data. It is an essential metric for evaluating the quality of a model and its ability to make accurate predictions; here we use entropy to determine the best split at each node, which yields more robust and accurate models. Entropy measures the impurity of the sample values and is defined by the following formula [7]:

Entropy(S) = - Σ_i p_i log2(p_i),

where p_i is the proportion of samples in the data set S that belong to class i. Information gain is defined in terms of this entropy.
Entropy values lie between 0 and 1. The entropy is zero when all samples in the data set S belong to the same class, and it is 1 when half of the samples belong to one class and the other half to another. To select the best feature to split on and find the optimal decision tree, the attribute that yields the smallest weighted entropy after the split should be used. The difference in entropy before and after a split on a given attribute is the information gain. The attribute with the highest information gain produces the best split, as it does the best job of classifying the training data according to the target classification.
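The following sketch computes entropy and the information gain of a candidate split, matching the definitions above; the labels are illustrative only.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a set of class labels: -sum(p_i * log2(p_i)).
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_sets):
    # Entropy of the parent node minus the weighted entropy of its children.
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in child_label_sets)
    return entropy(parent_labels) - weighted

# An evenly mixed parent has entropy 1; splitting it into two pure children
# recovers the full 1 bit of information gain.
parent = ["good", "good", "poor", "poor"]
print(entropy(parent))                                                  # 1.0
print(information_gain(parent, [["good", "good"], ["poor", "poor"]]))   # 1.0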
4.2 K-means clustering
Among the unsupervised machine learning algorithms, K-means clustering is one of the most effective. K-means constructs a centroid for each of the K clusters and assigns every data point to the cluster whose centroid is closest. Choosing the value of K is the key decision in the K-means algorithm; the steps of the algorithm are listed below.
K-means algorithm steps
Step 1: Choose the number of clusters, K.
Step 2: Select K random points as the initial centroids.
Step 3: Assign each data point to its closest centroid, forming the K clusters.
Step 4: Compute a new centroid for each cluster by taking the average of the samples assigned to it.
Step 5: Reassign each data point to the new closest centroid.
Step 6: If no reassignment occurs, the model is ready; otherwise, go to Step 4.
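A minimal NumPy sketch of these steps follows; it is not the exact implementation used in our experiments, and the toy data stands in for the water-quality samples.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Plain K-means following the steps above (empty clusters are not handled).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # Step 2
    labels = None
    for _ in range(n_iters):
        # Steps 3 and 5: assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                 # Step 6: converged
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its cluster.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Toy two-cluster data standing in for the water-quality samples (hypothetical).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)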
4.3 Naive Bayes
The Naive Bayes algorithm is based on the Bayes theorem and is one of the simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate, and reliable algorithm that maintains high accuracy and speed on large datasets. Prior and posterior probabilities are used in this algorithm. Figure 3 shows the median value of the water quality index of our samples obtained with this algorithm. The steps of the Naive Bayes algorithm are listed below.
- For the given class labels, calculate the prior probability.
- Apply the Bayes formula to find the posterior probability.
- Assign the given input to the class with the higher posterior probability.
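As a rough illustration of these steps, the sketch below uses scikit-learn's GaussianNB, which learns the prior and class-conditional likelihoods and predicts the class with the highest posterior probability. The feature matrix and labels are random placeholders, not the actual water-quality data.

# Minimal Naive Bayes sketch; X and y are hypothetical stand-ins for the samples.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(150, 4)                  # placeholder feature matrix
y = np.random.randint(0, 2, size=150)       # placeholder class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB().fit(X_train, y_train)  # learns priors and likelihoods
posteriors = model.predict_proba(X_test)    # posterior probability per class
predictions = model.predict(X_test)         # class with the higher posterior
print("test accuracy:", model.score(X_test, y_test))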