Enhancement of Multilayer Perceptron Model Training Accuracy through the Optimization of Hyperparameters: A Case Study of the Quality Prediction of Injection Molded Parts

Injection molding has been broadly used in the mass production of plastic parts and must meet the requirements of efficiency and quality consistency. Machine learning can effectively predict the quality of injection molded parts. However, the performance of machine learning models largely depends on training accuracy. Hyperparameters such as the activation function, momentum, and learning rate are crucial to the accuracy and efficiency of model training. This research analyzed the influence of hyperparameters on testing accuracy, explored the corresponding optimal learning rates, and provides an optimal training model for predicting the quality of injection molded parts. In this study, stochastic gradient descent (SGD) and stochastic gradient descent with momentum were used to optimize the artificial neural network model. Through optimization of these training hyperparameters, the testing accuracy for the width of the injection molded part improved. The experimental results indicated that in the absence of momentum effects, all five activation functions achieved more than 90% training accuracy with a learning rate of 0.1. Moreover, when optimized with the SGD at a learning rate of 0.1, the Sigmoid activation function reached a testing accuracy of 95.8%. Although momentum had the least influence on accuracy, it affected the convergence speed of the Sigmoid function, reducing the number of required learning iterations (an 82.4% reduction rate). Optimizing hyperparameter settings can improve the accuracy of model testing and markedly reduce training time.


Introduction
Injection molding is a key step in polymer processing and comprises the five stages of clamping, filling, packing, cooling and plasticizing, and demolding. When polymer materials are molded, the melt and mold temperatures, filling speed, packing pressure, and time are the primary factors that affect the quality of the parts [1]. In particular, polymers are temperature sensitive; they exhibit distinct rheological properties at different temperatures, which affect the flow of the melt. Melt filling is driven by pressure, and the required pressure is related to the setting of the forward screw speed. A filling speed that is too low or too high results in short shots or jetting problems, respectively. In addition, the holding pressure (also called postfilling) can compensate for the gap left by the polymer melt cooling and shrinking in the mold cavity, ensuring that the finished product meets the size requirements. Therefore, machine settings also influence the quality of the final product. However, under the same machine settings, because of the adverse effects of actual machine movement, material stability, and environmental factors, this quality cannot be guaranteed.
To determine the actual flow behavior of the polymer melt, pressure sensors installed on the surface of the mold cavity can measure the pressure changes of the melt during molding [2][3][4]. The temperature distribution of the melt on the mold surface directly affects the quality of the product. The use of a composite sensor to track melt pressure and temperature changes during the injection molding process reveals the relationship between pressure and temperature, which can be monitored to further control the volume change in injection molded products [4].
Furthermore, some researchers have employed nondestructive ultrasonic sensing technology to measure changing melt pressure in the mold during the injection process, through which the melt pressure state can be monitored without damaging the mold structure and the various stages of the molding process can be identified [5].
Following the capturing of the molding data, diverse methods can be applied to predict quality, including domain knowledge, statistical methods, and artificial intelligence techniques.
Regarding domain knowledge, the pvT relationship is crucial to describing the specific volume state of the polymer melt corresponding to pressure and temperature changes during the injection molding process. The optimal setting of the pvT molding path (e.g., the use of scientific molding methods [6][7][8][9]) assists in obtaining optimal product quality.
Statistics-based mathematical models can describe the relationship between process parameters and part quality. Among them, the Taguchi experimental method combined with analysis of variance or correlation coefficients is widely used to identify the ideal combination of injection molding parameters. This method can reduce the number of experiments and determine the factors that lower manufacturing costs and achieve stable quality goals [10][11][12][13][14]. The emergence of artificial intelligence has also offered highly nonlinear fitting possibilities. Through the establishment and testing of different training models, it can be effectively applied to various scenarios. Artificial neural networks (ANNs) enable users to rapidly create artificial networks and quality prediction solutions through appropriate hyperparameter adjustments. These adjustments link a large amount of process parameter information with the resultant molding quality [15][16][17][18][19]. ANN technology can be employed in the learning process of different data types from large volumes of data to achieve clustering, classification, prediction, and regression functions. In particular, ANNs are suitable for modeling tasks involving nonlinear relationships, such as thermoplastic injection molding processes with complex viscoelastic material behavior and nonlinear relationships between quality, process, and machine parameters. Hyperparameters, including the hidden layer architecture, activation functions, optimization solver, learning rate, and momentum, play key roles in model learning. Common optimization solvers include stochastic gradient descent (SGD) [20,21], stochastic gradient descent with momentum (SGDM) [22,23], and Adam [24,25]. Learning rate and momentum both influence the quality and speed of model learning but are rarely studied in the literature.
To explore the influence of hyperparameters on model training speed in the actual training process, this research examined the injection molding of integrated circuit (IC) trays and extracted quality indices from the pressure curves that indicate part quality. The indices and part width were applied to the input and output layers of the neural model, respectively, and the two optimization solvers SGD and SGDM were explored. For each solver, the activation function, learning rate, and momentum were varied, and their influence on the accuracy, convergence speed, and oscillation of the training model was analyzed to provide a reference for adjustment.

Multilayer perceptron model
The multilayer perceptron (MLP) model is a supervised ANN learning model with a forward propagation structure that maps a set of input vectors to a set of output vectors. An MLP can be regarded as a directed graph composed of multiple node layers, with each layer fully connected to the next. The MLP model contains three types of layers, namely the input, hidden, and output layers. Except for the input nodes, each node is a neuron with a summation function and an activation function. The activation function, expressed in Eq. (1), is a nonlinear function f that maps the summation (xw + b) to the output value y:

y = f(xw + b)  (1)

where x, w, b, and y represent the input vector, weighting vector, bias, and output value, respectively.
The weighting values range between 0 and 1. These values change with the training data and represent the memory of the neural network, linking the inputs and outputs learned during model training.
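As a minimal Python sketch of Eq. (1), the output of a single neuron can be computed as follows; the input values and weights here are placeholders for illustration, not data from this study:

```python
import numpy as np

def neuron_forward(x, w, b, activation):
    """Single-neuron forward pass of Eq. (1): y = f(xw + b)."""
    return activation(np.dot(x, w) + b)

# Placeholder example with a Sigmoid neuron (values are illustrative only).
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.2, 0.7, 0.1])  # input vector (e.g., normalized indices)
w = np.array([0.5, 0.3, 0.8])  # weighting vector in [0, 1]
b = 0.1                        # bias
y = neuron_forward(x, w, b, sigmoid)
print(y)                       # output value y
```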

SGDM
SGDM is an optimization algorithm typically employed in MLP model training. As described in Eqs. (2) and (3), the SGD optimization algorithm, also known as the steepest descent method, iteratively searches for the weighting value (w) by stepping opposite the direction of the gradient of the loss function L, with the step distance set by the learning rate (α):

g_t = ∇L(w_t)  (2)

w_{t+1} = w_t − α g_t  (3)
Although the weights can be updated iteratively using the SGD algorithm, if the loss function contains multiple local minima, the search may settle in a local minimum or saddle point rather than the global minimum. This halts the training iterations and leads to erroneous learning results.
The SGDM algorithm, which combines SGD with a momentum (β) adjustment, has attracted considerable attention. The SGDM algorithm detailed in Eqs. (4) and (5) takes β into account, allowing the optimization to jump out of local minima during model training:

v_t = β v_{t−1} + α g_t  (4)

w_{t+1} = w_t − v_t  (5)

This procedure enables the entire iteration to stably converge to the minimum value of the loss function.
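For concreteness, the two update rules can be contrasted in a short Python sketch; the toy quadratic loss below is an assumption for demonstration and is unrelated to the study's loss function:

```python
def sgd_step(w, grad, lr):
    """SGD (Eqs. (2)-(3)): step opposite the gradient by the learning rate."""
    return w - lr * grad

def sgdm_step(w, v, grad, lr, beta):
    """SGDM (Eqs. (4)-(5)): the velocity v accumulates past gradients,
    which helps the search escape local minima and saddle points."""
    v = beta * v + lr * grad
    return w - v, v

# Toy demonstration on L(w) = w**2 (gradient 2w); not the paper's loss.
w_sgd, w_sgdm, v = 5.0, 5.0, 0.0
for _ in range(20):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd, lr=0.1)
    w_sgdm, v = sgdm_step(w_sgdm, v, 2 * w_sgdm, lr=0.1, beta=0.9)
```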

Learning rate
The learning rate (α) determines the iteration speed of weight adjustment in the neural network according to the gradient of the loss function. That is, the smaller the learning rate, the slower the descent along the loss gradient. A lower learning rate can prevent potentially optimal values from being overlooked, thus achieving higher training accuracy, but it requires a longer convergence time. Generally, the setting of the learning rate depends on experience, model size, and numerical complexity. Even for the same training model, input data of different dimensions change the learning rate at which convergence is optimal. Therefore, the learning rate must be adjusted for various data conditions.
The optimization behavior varies with the learning rate. A low learning rate has a slow convergence speed but ensures that the minimum is approached carefully at each training step, yielding optimal training accuracy. By contrast, a high learning rate accelerates convergence but may settle on a suboptimal solution.
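This trade-off can be demonstrated on a toy one-dimensional problem (an illustration only, not the paper's experiment): minimizing L(w) = w^2 by gradient descent at several learning rates:

```python
# Minimize L(w) = w**2 (gradient 2w) from w = 1.0 for 50 steps.
def descend(lr, w=1.0, steps=50):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

for lr in (0.01, 0.3, 1.1):
    print(f"lr={lr}: w after 50 steps = {descend(lr):.3g}")
# lr=0.01 creeps toward the minimum slowly; lr=0.3 converges quickly;
# lr=1.1 overshoots back and forth and diverges (|1 - 2*lr| > 1).
```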
Activation functions
The Sigmoid activation function, presented in Eq. (6) and Fig. 3(a), compresses the input value within a range of 0 to 1:

f(x) = 1 / (1 + e^{−x})  (6)

When the input value is very large or very small, the gradient approaches zero and the neuron saturates.
The Tanh activation function detailed in Eq. (7) and Fig. 3(b) compresses the input value within a range of −1 to 1:

f(x) = (e^x − e^{−x}) / (e^x + e^{−x})  (7)

This function is similar to the Sigmoid function; thus, when the input value is very large or very small, the neuron likewise approaches a dead (saturated) state.
When the Sigmoid or Tanh activation functions are used, the gradient shrinks as the number of layers increases. The gradient values of the initial layers become so small that those layers cannot be correctly learned; as the network deepens and drives these values toward 0, the gradient effectively disappears, which is known as the vanishing gradient problem [26].
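A quick numerical sketch of this effect (illustrative only): the derivative of the Sigmoid is at most 0.25, so a gradient backpropagated through a stack of Sigmoid layers shrinks multiplicatively:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

grad, a = 1.0, 0.0             # pre-activation held at 0, the point where
for _ in range(10):            # the Sigmoid derivative is largest (0.25)
    s = sigmoid(a)
    grad *= s * (1.0 - s)      # chain rule: multiply by sigmoid'(a)
print(grad)                    # ~0.25**10 ≈ 9.5e-7 even in this best case
```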
The rectified linear unit (ReLU) presented in Fig. 3(c) and Eq. (8) has a first derivative of either 0 or 1:

f(x) = max(0, x)  (8)

Compared with the Sigmoid and Tanh activation functions, the gradient of the ReLU is simple, does not saturate, and converges more rapidly. However, when a large error gradient accumulates and produces a large update of the neural network weights, an exploding gradient occurs, resulting in unstable learning. Moreover, the dying ReLU problem arises because the gradient for all negative input values is 0; if a ReLU neuron is stuck on the negative side and consistently outputs 0, it is considered "dead." Because the slope of the ReLU in the negative range is also 0, once the neuron becomes negative, it is unlikely to recover. Such neurons cannot play any role in distinguishing inputs and are essentially useless.
The Leaky ReLU activation function, described in Eq. (9) and Fig. 3(d), is an improved version of the ReLU. It primarily prevents the gradient from vanishing when x is less than 0 by allowing a nonzero gradient when a neuron is inactive:

f(x) = x for x > 0;  f(x) = C1 x for x ≤ 0  (9)

where C1 is a small positive constant.
The exponential linear unit (ELU) activation function presented in Eq. (10) and Fig. 3(e) contains an adjustable positive constant C2:

f(x) = x for x > 0;  f(x) = C2 (e^x − 1) for x ≤ 0  (10)

which keeps a nonzero gradient for negative inputs and thereby mitigates possible saturation.
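For reference, the five activation functions compared in this study can be written compactly in Python; c1 and c2 correspond to the constants C1 and C2 above, and the default values shown are common choices rather than the settings used in this paper:

```python
import numpy as np

def sigmoid(x):                # Eq. (6): compresses input to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                   # Eq. (7): compresses input to (-1, 1)
    return np.tanh(x)

def relu(x):                   # Eq. (8): first derivative is 0 or 1
    return np.maximum(0.0, x)

def leaky_relu(x, c1=0.01):    # Eq. (9): small nonzero slope c1 for x <= 0
    return np.where(x > 0, x, c1 * x)

def elu(x, c2=1.0):            # Eq. (10): smooth exponential branch for x <= 0
    return np.where(x > 0, x, c2 * (np.exp(x) - 1.0))
```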

Quality indices
To predict the quality of parts in terms of molding condition changes and reduce the amount of calculation data in the experiment, this study combined domain knowledge and data mining technology to convert the injection molding data into quality indices. The correlation of these indices with the quality of the finished product was then calculated. Those that correlated highly were selected as input parameters to improve the prediction accuracy and effectively reduce the amount of calculation required during model training. The following four main quality indices were introduced in this study [16,17]:
(1) First-stage holding pressure index (P_h^index): This represents the average holding pressure in the first stage, as expressed in Eq. (11). The function of the holding pressure is to compensate for the volumetric shrinkage of the part caused by the cooling of the polymer melt. The holding pressure affects the geometric dimensions of injection molded parts and is crucial to precision injection molding.
(2) Peak pressure index (P_p^index): This represents the maximum pressure during filling and compression, as described in Eq. (12). Injection molding constitutes a series of pressure-driven melt flow processes. The pressure not only determines the speed of the melt flow and its flow inertia but also affects the quality of the melt flowing into the mold cavity. Therefore, the peak pressure index affects the quality and geometry of injection molded parts.
These four indices have different ranges of values and are normalized between 0 and 1.
Using these indices as inputs for model training can support rapid convergence and high accuracy.
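A hedged sketch of how such indices might be extracted from a sampled pressure curve follows. The exact formulas of Eqs. (11) and (12) follow [16,17] and are not reproduced here, so the windowed mean and maximum below are simplified stand-ins:

```python
import numpy as np

def holding_pressure_index(pressure, t, t_start, t_end):
    """Simplified stand-in for Eq. (11): average pressure over the
    first holding stage [t_start, t_end]."""
    mask = (t >= t_start) & (t <= t_end)
    return pressure[mask].mean()

def peak_pressure_index(pressure):
    """Simplified stand-in for Eq. (12): maximum pressure during
    filling and compression."""
    return pressure.max()

def minmax_normalize(values):
    """Normalize an index to [0, 1] across all shots, as described above."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())
```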

Mold, machine, materials, and measurements
In this experiment, an IC tray of 76 × 76 × 4.4 mm³ with a flow length-to-thickness ratio of 124 was fabricated as the research carrier for method verification (Fig. 4). Table 1 lists the process parameter settings in the injection molding experiment. A two-factor full factorial method was adopted, in which the injection speed was adjusted from 40 to 120 mm/s and the first-stage holding pressure varied from 50 to 100 MPa; data were collected four to eight times for each parameter combination. At each shot, a system pressure curve and seven cavity pressure curves (SN1-SN7) were recorded. The system pressure curve was used to obtain two quality indices, namely P_h^index and P_I^index. The seven cavity pressure curves (SN1-SN7) were used to obtain P_p^index, and the four curves (SN4-SN7) produced by sensors installed far from the gate were used to obtain P_r^index. A total of 445 subexperiments were conducted, each comprising 11 quality indices. These quality indices were used as the input data of the MLP model. To classify the quality of injection molded parts, we converted the measured geometrical width into multiple grades. We aggregated the data into five grades evenly spaced between the minimum and maximum width values, which were used as the output data of the MLP model.

Experimental procedure
A flowchart of the entire process is presented in Fig. 6, covering data preprocessing, hyperparameter adjustment, and data training. The steps are detailed as follows: (1) Pressure signal extraction and quality measurement: The pressure curve of the molding process was obtained through the in-mold sensors, and a three-dimensional measuring instrument was used to measure the width of each part, thereby establishing a database of pressure histories and width values.
(2) Outlier filtering: The standard score was used as the basis for judgment, with a z value of 1.78 selected as the threshold for eliminating outliers under the same molding parameters. A standard score of 1.78 corresponds to retaining 92.5% of the data volume [17].
(3) Quality indices transformation: The extracted in-mold pressure curve was converted into quality indices for subsequent machine learning.
(4) Data normalization: The size and dimension of the input values affect data convergence differently. Normalizing the data beforehand reduces the adverse effects of the differing dimension values on the convergence results [27]. We used max-min normalization in this experiment.
(5) MLP model training: We employed the SGD and SGDM as optimizers in this experiment. These were applied to a fixed-architecture MLP for data training; the nodes of the input, first hidden, second hidden, and output layers numbered 11, 50, 25, and 5, respectively. The remaining hyperparameter settings are listed in Table 3. A minimal code sketch of steps (2), (4), and (5) follows this list.
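The following sketch ties steps (2), (4), and (5) together using scikit-learn's MLPClassifier. The paper does not state which software was used, so this framework, the random placeholder data, and the per-column outlier filtering are assumptions for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for the experimental database:
# 445 shots x 11 quality indices, and five width grades (0-4).
rng = np.random.default_rng(0)
X = rng.random((445, 11))
y = rng.integers(0, 5, size=445)

# (2) Outlier filtering with a standard-score threshold of 1.78
# (applied per column here; the grouping by molding parameters is omitted).
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
keep = (z <= 1.78).all(axis=1)
X, y = X[keep], y[keep]

# (4) Max-min normalization of each quality index to [0, 1].
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# (5) Fixed-architecture MLP (11-50-25-5); momentum=0.0 gives plain SGD,
# while a nonzero momentum gives SGDM.
model = MLPClassifier(hidden_layer_sizes=(50, 25), activation='logistic',
                      solver='sgd', learning_rate_init=0.1, momentum=0.0,
                      max_iter=7200)
model.fit(X, y)
print(model.score(X, y))  # training accuracy
```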

Results and discussion
To explore the difference between conventional SGD and SGDM, the model was trained with varying activation function, learning rate (α), and momentum (β) settings to capture the training and testing accuracy. The performance of the five activation functions (Sigmoid, Tanh, ReLU, Leaky ReLU, and ELU) at each learning rate was also analyzed. The α values were divided into Group 1, ranging from 10^−5 to 10^−1, and Group 2, ranging from 0.1 to 0.9, for experimental adjustment.
Finally, the β was set in a range of 0.1 to 0.9.
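Such a sweep could be organized as in the sketch below (under the same scikit-learn assumption as in the procedure section; note that MLPClassifier supports only the logistic, Tanh, and ReLU activations, so Leaky ReLU and ELU would require another framework):

```python
from itertools import product

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)          # placeholder data, as before
X, y = rng.random((445, 11)), rng.integers(0, 5, size=445)

activations = ['logistic', 'tanh', 'relu']          # subset of the five
group1 = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]             # Group 1 learning rates
group2 = [round(0.1 * k, 1) for k in range(1, 10)]  # Group 2: 0.1-0.9

for act, lr in product(activations, group1 + group2):
    model = MLPClassifier(hidden_layer_sizes=(50, 25), activation=act,
                          solver='sgd', learning_rate_init=lr,
                          momentum=0.0, max_iter=7200)
    model.fit(X, y)
    print(act, lr, model.score(X, y))   # record accuracy per setting
```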

Group 1 with zero momentum
The goal was to compare the effect of learning rate adjustment on the training accuracy of the MLP by using the five activation functions with the SGD. We evaluated the training accuracy of the various activation functions after 7200 iterations. Figures 7(a) and 7(b) illustrate the training and testing accuracy when the learning rate was set to 10^−5 and 10^−4, respectively; both settings proved inefficient (14-65% accuracy) and are not recommended for training MLP models.
However, when the learning rate was set to 10^−3, the ReLU and ELU activation functions exhibited high training and testing accuracy (over 90%) and rapid convergence. When the learning rate was increased to 10^−1, the accuracy of all activation functions reached over 90%.
Among them, the Tanh and ELU functions exhibited the most rapid convergence at a learning rate of 10^−1, with their accuracy exceeding 90% in only 151 iterations (Table 4). At a low learning rate of 10^−3, the ELU and ReLU reached a learning accuracy of over 90% in 6193 and 6332 iterations, respectively, indicating effective convergence in model training. A learning rate of 10^−1 provided effective training and convergence results in relation to learning accuracy (Table 4). For the Sigmoid function, the highest training accuracy was observed at a learning rate of 10^−1, but 1177 iterations were required before its training accuracy exceeded 90%, the slowest of all the activation functions. (In Table 4, "-" denotes accuracy of less than 90% at 7200 iterations.)

Group 2 with zero momentum
Different activation functions require an appropriate learning rate to achieve high learning accuracy. A learning rate of 10^−1 enabled all five activation functions to obtain the optimal model learning performance. The learning rate was thus increased beyond 10^−1 to investigate its effect on accuracy, convergence, and stability; Figure 8 and Table 5 present the corresponding results for Group 2 with zero momentum. In addition to the stability and quality of the learning accuracy, learning efficiency is critical to reducing computation time. Figure 9 and Table 6 describe the number of iterations required for Group 2 with zero momentum to exceed a training accuracy of 90%. To further analyze the relationship between the learning rate and the number of iterations required to exceed 90%, the correlation coefficients are listed in Table 7. As described in Fig. 9, for the activation functions Tanh, ReLU, Leaky ReLU, and ELU, the number of iterations required to exceed 90% accuracy was positively correlated with the learning rate, with correlation coefficients of 0.82, 0.74, 0.84, and 0.88, respectively. These strong positive correlations indicate that an increase in the learning rate may slow convergence, thus requiring more iterations.
Moreover, even at the highest learning rates, the ReLU and ELU functions could not achieve 90% training accuracy. Conversely, the Sigmoid function had a strong negative correlation coefficient (−0.84), demonstrating that effective convergence can be obtained by increasing the learning rate. Notably, with a learning rate of 0.8, only 271 iterations were required to obtain a learning accuracy of over 90%.
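The correlation analysis itself reduces to a Pearson coefficient computation; the iteration counts below are placeholders, not the measured values behind Table 7:

```python
import numpy as np

lr = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
iters = np.array([417, 520, 640, 700, 805, 930, 1010, 1150, 1240])  # placeholder
r = np.corrcoef(lr, iters)[0, 1]   # Pearson correlation coefficient
print(round(r, 2))
```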

Fig. 9 Number of iterations required for Group 2 with zero momentum to achieve 90% training accuracy

Although a high learning rate generally promotes rapid convergence, the stability of the combined learning rate and activation function must be monitored in model learning. This study evaluated stability by calculating the deviation in training accuracy after the model exceeded 90% accuracy in training. Figure 10 illustrates that a learning rate in the range of 0.1 to 0.5 resulted in a training accuracy deviation of approximately 7% for Group 2 with zero momentum. By contrast, when the learning rate was in the range of 0.6 to 0.9, the training accuracy deviation of the ReLU and ELU functions increased considerably, indicating model instability and a high correlation with the learning rate. The Sigmoid activation function maintained a deviation of less than 5% under various learning rates, demonstrating low oscillation at a high learning rate and the ability to effectively maintain convergence (Table 8).

Fig. 10 Range of learning accuracy for Group 2 with zero momentum after exceeding 90% (unit of accuracy: %; "-" denotes less than 90% at 7200 iterations)

Effect of momentum on convergence
This experiment employed momentum acceleration in the SGDM method to explore the influence of momentum on model accuracy. Figure 11 and Table 9 present the training and testing accuracy under various momentum settings (unit of accuracy: %). Figure 12 and Table 11 detail the number of iterations required for model learning to achieve a 90% training accuracy.

Conclusion
This study examined the influence of hyperparameters on the accuracy and convergence of MLP model training. In this investigation, IC tray injection molding cavity pressure curves were measured using sensors and converted to normalized quality indices to serve as the input data for model training; the part size was used as the output data. The MLP architecture comprised 11 nodes in the input layer, 75 and 50 nodes in the first and second hidden layers, respectively, and 5 nodes in the output layer. In addition, this study provided a comparison of the SGD and SGDM optimizers, a comparison of five activation functions (Sigmoid, ReLU, Tanh, Leaky ReLU, and ELU), and an evaluation of learning rate and momentum adjustment and their effects on the accuracy and efficiency of model training. The results are summarized as follows:
(1) Regarding the change of learning rate in the SGD algorithm without momentum, when the learning rate was altered from 10^−5 to 10^−1, a learning rate of 10^−1 performed most effectively, with all functions achieving training accuracy exceeding 90%. The numbers of iterations required for the training accuracy of the five activation functions to exceed 90% were as follows: Sigmoid, 1177; Tanh, 151; ReLU, 417; Leaky ReLU, 239; and ELU, 151.
Therefore, the Tanh and ELU activation functions combined with the SGD can obtain rapid convergence, whereas the Sigmoid function was relatively slow, with a 7.8-fold difference in convergence rate.
(2) With the introduction of momentum ranging from 0.1 to 0.9, the Sigmoid function achieved an average training accuracy of 93.8%, outperforming the other four functions.
In addition, its strong correlation with momentum was reflected in the rapid convergence rate. For instance, when the learning rate was 0.8, only 271 iterations were required to obtain a learning accuracy over 90%.
(3) When the learning rate was between 0.1 and 0.5, the training accuracy deviations of the five functions were all within 7%. When the learning rate was between 0.6 and 0.9, only the training accuracy deviation of the Sigmoid activation function remained below 5%, indicating its stability in MLP model training.
Funding Not applicable.

Data availability
The authors confirm that the data supporting the findings of this study are available within the article.

Declarations
Ethical approval Not applicable.

Consent to publish
Not applicable.

Competing interests
The authors declare no competing interests.