The ML workflow for the classification of undegraded and aged PET MPs is illustrated in Fig. 1. The entire process, including data processing, model training, and classification, was executed using the Python programming language within a Jupyter Notebook environment. The computations were conducted on a system equipped with a 64-bit Intel Core i5 vPro processor and 4 GB of RAM.
2.3.5. ML data training and testing methodology for PET MPs classification
Seven (7) ML models were evaluated for classifying the undegraded and aged PET MPs in this study: Random Forest classifier (RF), Logistic Regression (LR), Support Vector Machines classifier (SVM), Neural Networks based on the multilayer perceptron classifier (MLP), Gradient Boosting (GB), Decision Trees (DT), and k-Nearest Neighbor (k-NN).
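A minimal sketch of how these seven classifiers could be trained and tested with scikit-learn is given below; the synthetic stand-in dataset, the 70/30 split, and any hyperparameters not stated in the text are illustrative assumptions rather than the study's actual configuration.

```python
# Hedged sketch: evaluate the seven classifiers on a placeholder dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Stand-in binary dataset (0 = undegraded, 1 = aged); replace with the
# measured PET MP features used in the study.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "RF": RandomForestClassifier(random_state=42),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf"),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                         solver="adam", random_state=42, max_iter=1000),
    "GB": GradientBoostingClassifier(learning_rate=0.1, random_state=42),
    "DT": DecisionTreeClassifier(random_state=42),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)                            # train on the training split
    acc = accuracy_score(y_test, model.predict(X_test))    # evaluate on the held-out split
    print(f"{name}: accuracy = {acc:.3f}")
```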
RF
The RF classifier is composed of multiple DTs. When making a new classification, each DT independently provides a classification for the input data. The RF algorithm then evaluates these classifications and selects the final prediction based on the class that receives the most votes from the individual trees (Mao and Wang, 2012; Cinar and Koklu, 2019). RF is particularly efficient in handling datasets with a large number of variables (Enyoh et al., 2023a). The simplified equation for the RF, represented by Eq. (3), is as follows:
$$RF\left(x\right)= \mathrm{mode}\left(DT_{1}\left(x\right), DT_{2}\left(x\right), \dots , DT_{n}\left(x\right)\right)$$
3
Here, RF(x) represents the class prediction made by the RF for a given input instance x. The mode function selects the most frequently occurring class prediction from the individual decision trees DT₁, DT₂, ..., DTₙ, where n is the number of trees in the forest. Each decision tree in the RF is built independently on a randomly selected portion of the training data, and a random subset of the predictor variables is considered when partitioning the data at each node of the tree.
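To make Eq. (3) concrete, the sketch below queries each fitted tree individually and takes the mode of their votes; scikit-learn and the synthetic data are assumptions, and accessing the trees through `estimators_` is only for illustration (internally, scikit-learn averages the trees' class probabilities, which usually coincides with the majority vote).

```python
# Illustration of Eq. (3): the forest's output is the most frequent class
# among the individual tree predictions DT_1(x), ..., DT_n(x).
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

x_new = X[:1]                                       # one input instance x
votes = [rf.classes_[int(t.predict(x_new)[0])] for t in rf.estimators_]
majority = Counter(votes).most_common(1)[0][0]      # mode(DT_1(x), ..., DT_n(x))
print(majority, rf.predict(x_new)[0])               # usually identical
```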
LR
The primary purpose of LR is to model the relationship between a dependent variable and one or more independent variables. To achieve this, LR fits the weights of the input variables to the training data, aiming to minimize the discrepancy between the predicted probabilities and the actual class labels (Cruyff et al., 2016). The simplified equation for logistic regression, represented as Eq. (4), is as follows:
$$y=\frac{1}{\left(1+ {e}^{(-z)}\right)}$$
4
where the variable "y" denotes the predicted output or the probability of a specific class. This probability is obtained by passing the linear combination of the input variables and their respective weights, represented by "z," through the sigmoid function. The sigmoid function transforms any real-valued number to a value within the range of 0 to 1, enabling the interpretation of the output as a probability. This property makes logistic regression suitable for tasks where the prediction is associated with a probability score, allowing for a more nuanced understanding of the model's predictions.
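As a quick check of Eq. (4), the sketch below recomputes the class probability from the fitted weights and the sigmoid and compares it with the library output; scikit-learn and the synthetic data are illustrative assumptions.

```python
# Illustration of Eq. (4): probability = sigmoid of the weighted input sum.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=42)
lr = LogisticRegression(max_iter=1000).fit(X, y)

z = (X[:1] @ lr.coef_.T + lr.intercept_).item()     # z = linear combination of inputs and weights
p = 1.0 / (1.0 + np.exp(-z))                        # y = 1 / (1 + e^(-z)), Eq. (4)
print(p, lr.predict_proba(X[:1])[0, 1])             # manual and library probabilities agree
```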
SVM
SVM is a fundamental technique used for both classification and regression tasks. It creates a hyperplane that aids in distinguishing between different classes or predicting numerical values. In two-dimensional space, the separating boundary is a line; in three-dimensional space, it is a plane; and in higher-dimensional space, SVM relies on a hyperplane for effective separation of the data points (Schölkopf et al., 2001). The classification process in SVM involves identifying the optimal hyperplane that maximizes the margin between different classes. The larger the margin, the better the separation and generalization of the model (Cinar and Koklu, 2019). The simplified form for the predicted output from SVM, represented by Eq. (5), is as follows (Enyoh et al., 2023):
$${y}_{pre}\left(x\right)= \sum _{i=1}^{n}{\alpha}_{i}\,K\left({x}_{i},{x}_{j}\right)+b$$
5
where K(xᵢ, xⱼ) is the radial basis function (RBF) kernel, and αᵢ and b denote the Lagrange multipliers and the threshold parameter, respectively.
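The sketch below reproduces the SVM decision value of Eq. (5) from the fitted support vectors, assuming scikit-learn's SVC with the default "scale" setting for the RBF kernel width; the dataset is a synthetic placeholder.

```python
# Illustration of Eq. (5): a weighted sum of RBF kernel evaluations plus a bias.
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=42)
svm = SVC(kernel="rbf").fit(X, y)

x_new = X[:1]
gamma = 1.0 / (X.shape[1] * X.var())                       # default "scale" RBF width
K = rbf_kernel(svm.support_vectors_, x_new, gamma=gamma)   # K(x_i, x_j) for each support vector
manual = (svm.dual_coef_ @ K + svm.intercept_).item()      # sum_i a_i K(x_i, x_j) + b
print(manual, svm.decision_function(x_new).item())         # the two decision values agree
```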
MLP
In this study, we further utilized a popular artificial neural network (ANN) known as the Multilayer Perceptron (MLP). The MLP learns through a technique called backpropagation, in which the weights are adjusted either after analyzing the entire dataset or after each individual data point. The architecture of the MLP organizes neurons into layers, with one or more hidden layers situated between the input and output layers; depending on the complexity of the problem, an MLP can consist of multiple hidden layers. The input layer captures information about the problem to be addressed, while the output layer produces the final results or predictions of the network (Enyoh et al., 2023). In its simplest form, the ANN system can be represented by Eq. (6), in which f is the activation function, N is the number of inputs per neuron, and k denotes the layer (hidden or output) (Enyoh et al., 2023a).
$${Y}_{j}^{k+1}=f\left(\sum _{i=1}^{N}{X}_{i}^{k}{w}_{ij}^{k}+{b}_{j}^{k}\right)$$
6
In this research, the model was configured with a single hidden layer of 100 neurons, and the activation function used was the Rectified Linear Unit (ReLU). ReLU, represented by the function f(x) = max(0, x), introduces non-linearity to the model and effectively addresses the issue of vanishing gradients, which has made it one of the most widely adopted activation functions in deep learning. For optimizing the training process, the Adam optimization algorithm was employed, and a random state of 42 was set for reproducibility. Adam uses an adaptive learning rate strategy that dynamically adjusts the learning rate during training by scaling the gradients based on their estimated first and second moments, resulting in faster convergence and improved generalization to unseen data compared to traditional gradient descent algorithms.
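A sketch of this configuration and of a manual forward pass through Eq. (6) is shown below, assuming the model corresponds to scikit-learn's MLPClassifier with a single 100-neuron ReLU hidden layer; the data are a synthetic stand-in.

```python
# Illustration of Eq. (6) with f = ReLU for one hidden layer of 100 neurons.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    solver="adam", random_state=42, max_iter=1000).fit(X, y)

def relu(a):
    return np.maximum(0.0, a)                              # f(x) = max(0, x)

hidden = relu(X[:1] @ mlp.coefs_[0] + mlp.intercepts_[0])  # Eq. (6): hidden-layer activations
logit = (hidden @ mlp.coefs_[1] + mlp.intercepts_[1]).item()
prob = 1.0 / (1.0 + np.exp(-logit))                        # logistic output for the binary task
print(prob, mlp.predict_proba(X[:1])[0, 1])                # manual and library outputs agree
```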
GB
The Gradient Boosting (GB) classifier is an ensemble learning technique that combines multiple weak learners, represented by Decision Trees (DTs), to create a robust and accurate model (Hastie et al., 2009). The algorithm follows an iterative process in which new DTs are gradually added to the ensemble, each subsequent tree focusing on reducing the errors made by the previous trees (Hastie et al., 2009). During each iteration, the algorithm calculates the gradient of the loss function with respect to the predicted values and constructs a new tree to minimize this gradient (Piryonesi et al., 2021). The predictions from all the trees in the ensemble are then combined to make the final prediction. The simplified equation for the Gradient Boosting classifier is the sum of weak learners, where each weak learner compensates for the errors made by the preceding learner. It can be expressed as shown in Eq. (7).
$$y\left(x\right) = {y}_{0}\left(x\right) + \eta \,{g}_{1}\left(x\right) + \eta \,{g}_{2}\left(x\right) + \dots + \eta \,{g}_{n}\left(x\right)$$
7
where y(x) is the predicted output (undegraded or aged), y₀(x) is the initial prediction, g₁(x), g₂(x), …, gₙ(x) are the weak learners (usually decision trees), and η is the learning rate (set to 0.1 in this study). At the start, y(x) is initialized with y₀(x), which is typically the mean or median value of the target variable.
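The additive structure of Eq. (7) can be inspected through the staged predictions of a fitted model, as sketched below; scikit-learn and the synthetic data are illustrative assumptions.

```python
# Illustration of Eq. (7): each boosting stage adds eta * g_m(x) to the score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
gb = GradientBoostingClassifier(learning_rate=0.1, random_state=42).fit(X, y)

x_new = X[:1]
# Score after 1, 2, ..., n stages: y0 + eta*g1, y0 + eta*g1 + eta*g2, ...
staged = [s.item() for s in gb.staged_decision_function(x_new)]
# The increment between consecutive stages is eta * g_m(x); check it for the second tree.
eta_g2 = gb.learning_rate * gb.estimators_[1, 0].predict(x_new).item()
print(np.isclose(staged[1] - staged[0], eta_g2),
      np.isclose(staged[-1], gb.decision_function(x_new).item()))
```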
DT
DT is often visualized as a tree diagram, where each branch and node represent a classification query. The root node stands for an attribute, and the inner nodes indicate tests or evaluations of properties. The branches depict the outcomes of these evaluations, leading to the final decision represented by the leaf nodes, which correspond to the classes (Enyoh et al., 2023; Rokach and Maimon, 2005). DT offers several advantages, making it well-suited for handling complex problems and providing inferences in the form of logical classification rules (Cinar and Koklu, 2019). Its distinct advantages include ease of implementation, seamless integration into databases, and high reliability (Wu et al., 2008). In its simplified form, a Decision Tree can be expressed as shown in Eq. (8).
$$\left(x, y\right)=\left({x}_{1},{x}_{2},{x}_{3}, \dots , {x}_{n}, y\right)$$
8
where y is the target variable to be classified (undegraded or aged), and the vector x is composed of the features x₁, x₂, …, xₙ that are used for that task.
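As an illustration of how a fitted tree exposes its splits as logical classification rules, the sketch below fits a small tree and prints it as if-then statements; scikit-learn, the synthetic data, and the generic feature names are placeholder assumptions.

```python
# Illustration of Eq. (8): a tree maps the feature vector (x1, ..., xn) to the class y.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=100, n_features=4, random_state=42)
dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Print the learned splits as if-then rules over the features.
print(export_text(dt, feature_names=[f"x{i+1}" for i in range(X.shape[1])]))
```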
k-NN
k-NN is a widely used machine learning model, especially for large-scale training datasets. It operates based on a distance metric to identify the most similar data points in the training set (Ibeto et al., 2021). In the k-NN algorithm, each data point is conceptually plotted in a multi-dimensional space, where each axis represents a different variable or feature. When a new data point needs to be classified (the test data), the algorithm compares it with all the available data points in the training set. The test data will have several neighbors that are close to it in terms of all the measured characteristics. To determine the class of the test data, the algorithm selects the k nearest data points based on the distance metric. The class with the majority of data points among these selected neighbors is assigned to the test data (Richman, 2011).
In this specific study, the k value, representing the number of nearest neighbors to consider, was chosen as 5. This means that when classifying new data points, the algorithm will look at the class labels of the 5 nearest neighbors to make the final prediction.
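A minimal sketch of this k = 5 vote with scikit-learn is shown below; the synthetic data and the default Euclidean distance metric are illustrative assumptions.

```python
# Illustration of the k-NN vote: find the 5 nearest training points and take
# the majority class among them.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

x_new = X[:1]
dist, idx = knn.kneighbors(x_new)                     # distances and indices of the 5 nearest neighbors
majority = Counter(y[idx[0]]).most_common(1)[0][0]    # class with the most votes among them
print(majority, knn.predict(x_new)[0])                # the vote matches the model's prediction
```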