A methodology based on the cross-industry standard process for data mining (CRISP-DM) [16] was applied to develop the predictive model. Six fundamental steps, each with its own particularities and functions, were followed. The first three steps contextualize, collect, and organize the data to be analyzed: problem definition, raw data definition, and pre-processing of the database. The last three steps create the model from the previous steps and put it into practice: algorithm definition, model validation, and deployment. This set of predefined steps, ordered in a flow, aims to ensure the correct translation of the project scope into a product capable of improving an existing process. The deployment stage is not addressed in this work.
4.1 Contextualization and problem definition
Machining and dimensional control of journal-bearing housings require high-tech equipment. Still, human interaction is relied on in decision-making, which introduces unpredictability into the processes, resulting in low productivity and increased costs. In this context, a predictive model based on machine learning techniques is essential to achieve predictability in the processes and increase industrial competitiveness.
Based on the description of the manufacturing process and dimensional control (Section 3), the housing size corresponding to shaft diameters from 160 to 220 mm was selected, and the critical dimensions D1, D2, D3, and D4 were defined for the development of the predictive model.
The critical dimension D1 corresponds to the radius of the spherical seat on which the shell is supported. Dimension D2 corresponds to the measurement from the center of the spherical seat to the housing divider. Dimension D3 corresponds to the distance from the center of the spherical seat to the flanged face, and dimension D4 to the distance from the center of the spherical seat to the face opposite the flanged face. These dimensions were chosen based on their dimensional tolerance requirements and on the complexity of the machining and assembly process. The critical dimensions D1, D2, D3, and D4 are shown in Fig. 2 for the upper half and in Fig. 3 for the lower half.
In order to detail the characteristics and requirements of the critical dimensions, an analysis was carried out of the technical information contained in the housing manufacturing documents and dimensional control reports, and of their relationship with the analysis stage and process decision-making. From this, it was possible to identify and describe the data to be used in the proposed model. The smallest possible adjustment on the machining center axes is 0.001 mm. Table 1 contains the critical dimensions and the respective machining and geometry information.
Table 1
Critical dimensions and respective machining and geometry information.
Dimension | Machining process | Measurement tolerance (mm) | Geometric characteristic | Geometric tolerance (mm) |
D1 | Milling, circular interpolation | 0.029 | Concentricity | 0.010 |
D2 | Milling, linear interpolation | 0.020 | Flatness | 0.050 |
 | | | Perpendicularity | 0.050 |
 | | | Symmetry | 0.050 |
D3 | Milling, linear interpolation | 0.040 | Perpendicularity | 0.050 |
 | | | Parallelism | 0.100 |
D4 | Milling, linear interpolation | 0.040 | Perpendicularity | 0.050 |
 | | | Parallelism | 0.100 |
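Because the smallest possible adjustment on the machining center axes is 0.001 mm, any correction derived from the measurements must be quantized to that resolution. A minimal sketch, where the helper name and example values are hypothetical:

```python
# Quantize a predicted axis adjustment to the machining center's
# 0.001 mm resolution (the smallest possible adjustment).
AXIS_RESOLUTION_MM = 0.001

def quantize_adjustment(adjustment_mm: float) -> float:
    """Round an adjustment to the nearest multiple of the axis resolution."""
    steps = round(adjustment_mm / AXIS_RESOLUTION_MM)
    return round(steps * AXIS_RESOLUTION_MM, 3)

print(quantize_adjustment(0.01234))   # -> 0.012
print(quantize_adjustment(-0.0047))   # -> -0.005
```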
4.2 Raw data definition
The data of interest for the elaboration of the model were obtained from the dimensional reports, the history of adjustments to the CNC machining program, the record of ambient temperature at the time of machining, and the thermal equalization condition of the piece at the time of measurement. The parameters are classified into two types [17]:
- Dependent variables: represent quantities whose values depend on the independent variables. They are the output data of the model: the adjustments in the CNC machining program.
- Independent variables: represent quantities whose values significantly influence the dependent variables. They are the input data of the model.
As a premise for the cause-and-effect analysis (the definition of relevant variables), specialists' knowledge and experience were used to create a relationship matrix between the dependent and independent variables shown in Table 2. This matrix presents the dependent variables A1, A2, A3, and A4, which are the adjustments of the displacement parameters of the machining center axes of the respective critical dimensions D1, D2, D3, and D4, and the following independent variables:
- Temperature: the measured value of the ambient temperature at the time of the final machining of the housings.
- Climatized: the condition in which the piece has its temperature equalized with the coordinate measuring machine before being measured.
- Deviation: the difference between the measured value and the nominal measure of a critical dimension.
- Out of tolerance (OOT): the amount by which a measurement exceeds a critical dimension's tolerance limits.
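The deviation and OOT variables can be computed directly from a measurement. A minimal sketch, using a hypothetical nominal value and tolerance band rather than the actual drawing values:

```python
# Illustrative computation of the "deviation" and "out of tolerance" (OOT)
# variables for one critical dimension.
def deviation(measured_mm: float, nominal_mm: float) -> float:
    """Difference between the measured value and the nominal measure."""
    return measured_mm - nominal_mm

def out_of_tolerance(measured_mm: float, lower_mm: float, upper_mm: float) -> float:
    """Amount by which a measurement exceeds the tolerance limits (0 if inside)."""
    if measured_mm > upper_mm:
        return measured_mm - upper_mm
    if measured_mm < lower_mm:
        return measured_mm - lower_mm
    return 0.0

# Hypothetical nominal of 100.000 mm with a +0.029/-0 tolerance band:
print(round(deviation(100.035, 100.000), 3))                   # -> 0.035
print(round(out_of_tolerance(100.035, 100.000, 100.029), 3))   # -> 0.006
```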
It is possible to verify that the independent variables temperature and climatized influence all the dependent variables (A1, A2, A3, and A4). In addition, the deviations of critical dimensions D1 and D2 are related to both A1 and A2: the definition of adjustment A1 depends on the measurements of both D1 and D2, and the same applies to the definition of adjustment A2.
Table 2
Relationship matrix between variables.
Independent variables | Dependent variables |
| A1 | A2 | A3 | A4 |
Temperature | X | X | X | X |
Climatized | X | X | X | X |
Deviation (D1) | X | X | - | - |
OOT (D1) | X | - | - | - |
Deviation (D2) | X | X | - | - |
OOT (D2) | - | X | - | - |
Deviation (D3) | - | - | X | - |
OOT (D3) | - | - | X | - |
Deviation (D4) | - | - | - | X |
OOT (D4) | - | - | - | X |
4.3 Base pre-processing
To obtain valuable and efficient data from the raw data, it is essential to carry out a set of data preparation, organization, and structuring activities. This step is very important, as it determines the quality of the data to be analyzed and can directly impact the predictive model generated.
The data (detailed in Section 4.2) were initially evaluated with respect to their structure. The data follow a predefined pattern with a well-defined, rigid structure and can be represented as a collection of rows and columns organized in tables; these characteristics define the database as structured [18]. Next, the database was checked for missing and irrelevant data, outliers were identified and removed, and inconsistencies were resolved. Finally, the data were already in formats suitable for the analyses and tests, so no transformations were performed on the original data.
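The cleaning steps can be illustrated as follows. The record fields are hypothetical, and the 1.5×IQR rule shown is one common outlier criterion; the text does not specify which criterion was applied:

```python
# Sketch of two pre-processing activities: dropping records with missing
# fields and removing outliers outside 1.5x the interquartile range (IQR).
from statistics import quantiles

def remove_outliers(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# Hypothetical records; the one with a missing temperature is dropped.
records = [
    {"temperature": 22.1, "deviation": 0.012},
    {"temperature": None, "deviation": 0.010},
    {"temperature": 22.4, "deviation": 0.011},
]
complete = [r for r in records if all(v is not None for v in r.values())]
print(len(complete))  # -> 2

# The 0.500 mm entry is an obvious outlier and is filtered out.
devs = [0.010, 0.011, 0.012, 0.011, 0.010, 0.012, 0.011, 0.500]
print(remove_outliers(devs))
```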
4.4 Algorithm definition
The definition of the algorithm considered the following conditions:
- Non-linear relationships.
- Ability to define decision boundaries.
- Speed of execution.
- Metrics and scoring.
- Interpretability of the algorithm.
- Domain knowledge.
The decision tree is a robust algorithm modeled on the basic human process of deciding against criteria or thresholds. Decision-tree-based models are among the fastest to respond and visually present a set of "if-then" rules that improve the understanding and interpretation of results. This feature is fundamental because the model will be applied in a manufacturing environment with intense human-machine interaction. The problems and corresponding datasets that are the objects of decision-making are domain-specific; therefore, the choice of algorithm depends strongly on the experience of the team members involved.
A decision tree consists of a hierarchy of nodes connected by branches. A decision node is the decision-making unit that evaluates, through a logical test, which descendant node comes next. The leaf, or terminal node, associated with the result value, is found at the last level of the hierarchy [19].
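This hierarchy can be sketched with a toy structure; the feature names, thresholds, and leaf values below are illustrative, not the fitted model:

```python
# Toy regression-tree structure matching the description: decision nodes
# apply a logical test to choose a branch; leaf nodes hold the result value.
class Leaf:
    def __init__(self, value):
        self.value = value

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right  # binary split, as in CART

def predict(node, sample):
    """Walk from the root to a leaf, following the logical test at each node."""
    while isinstance(node, Node):
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.value

# Hypothetical tree predicting an axis adjustment (mm) from two variables.
tree = Node("deviation", 0.010,
            Leaf(0.000),
            Node("temperature", 23.0, Leaf(-0.005), Leaf(-0.008)))
print(predict(tree, {"deviation": 0.015, "temperature": 24.5}))  # -> -0.008
```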
Instead of implementing our own version of the algorithm, we used the Python scikit-learn module to develop the predictive model. It implements many popular machine learning algorithms while maintaining an easy-to-use interface fully integrated into Python [20].
scikit-learn uses an optimized version of the classification and regression tree (CART) algorithm. CART builds binary trees, so each internal node has exactly two outgoing branches, using the feature and threshold that yield the largest information gain at each node. An important feature of CART is its ability to generate regression trees, whose leaves predict a real number rather than a class. In the regression case, CART looks for splits that minimize the squared error of the prediction, and the prediction at each leaf is based on the weighted average for the node [21].
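A minimal sketch of fitting a CART regression tree with scikit-learn's DecisionTreeRegressor; the features (temperature, deviation) and targets here are synthetic stand-ins for the housing data:

```python
# Fit a CART regression tree on synthetic data and predict an adjustment.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform([20.0, -0.02], [26.0, 0.02], size=(40, 2))  # temperature, deviation
y = -0.5 * X[:, 1] + 0.001 * (X[:, 0] - 23.0)               # synthetic adjustment (mm)

# min_samples_split / min_samples_leaf mirror the parameters varied later.
model = DecisionTreeRegressor(min_samples_split=2, min_samples_leaf=3, random_state=0)
model.fit(X, y)
pred = model.predict([[23.0, 0.015]])
print(round(float(pred[0]), 4))
```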
4.5 Model validation
The dataset used to train and validate the model was based on 62 pieces manufactured between May and December 2022. scikit-learn provides several functions to divide datasets into subsets in different ways; the most straightforward is train_test_split [21]. However, the k-fold cross-validation method was used instead, splitting the total dataset into 5 mutually exclusive subsets of the same size; one subset was used for testing and the remaining 4 for parameter estimation.
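The 5-fold split of the 62-piece dataset can be sketched as:

```python
# Split 62 samples into 5 mutually exclusive folds: each fold is used once
# for testing while the remaining four folds train the model.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(62).reshape(-1, 1)  # stand-in for the 62 measured pieces
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)}")
```

With 62 samples the fold sizes cannot be exactly equal; KFold makes them as balanced as possible (two folds of 13 and three of 12).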
The random forest ensemble algorithm was used as a benchmark for the performance of the proposed regression tree model. Random forest combines many individual decision trees created by randomizing the split at each tree node. The algorithm is very efficient for analyzing large multidimensional datasets; however, due to its random nature, it is not always intuitive and understandable for the user [19]. The mean absolute error (MAE) was defined as the metric for error analysis and subsequent evaluation of the model results.
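A sketch of the tree-versus-forest comparison scored by MAE, on synthetic stand-in data:

```python
# Compare a regression tree against a random forest on the same data,
# scoring both by mean absolute error (MAE).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(62, 4))  # stand-in for the independent variables
y = X @ np.array([0.002, 0.001, -0.001, 0.0005]) + rng.normal(0, 0.0005, 62)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)
for name, model in [("tree", DecisionTreeRegressor(random_state=2)),
                    ("forest", RandomForestRegressor(n_estimators=35, random_state=2))]:
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name}: MAE = {mae:.6f} mm")
```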
The simulations were executed as an incremental sweep intended to progressively improve the model's performance and to determine which independent variables contribute to that improvement. The following parameters were varied in the sweep:
- min_samples_split: minimum number of samples required to split a node.
- min_samples_leaf: minimum number of samples required at a leaf.
- kfold_split: number of subsets for testing and validation.
- n_estimators: number of trees in the forest.
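The sweep over the conditions of Table 3 can be sketched as follows; only a subset of the conditions is shown, and the data are synthetic:

```python
# Score several (min_samples_split, min_samples_leaf) conditions by
# 5-fold cross-validated MAE and pick the best one.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(62, 4))
y = X @ np.array([0.002, 0.001, -0.001, 0.0005])

# Subset of the conditions in Table 3: (min_samples_split, min_samples_leaf).
conditions = {"A": (2, 2), "B": (3, 2), "E": (2, 3), "H": (3, 3)}
results = {}
for cond, (split, leaf) in conditions.items():
    tree = DecisionTreeRegressor(min_samples_split=split, min_samples_leaf=leaf,
                                 random_state=3)
    scores = cross_val_score(tree, X, y, cv=5, scoring="neg_mean_absolute_error")
    results[cond] = -scores.mean()  # MAE (sklearn reports it negated)

best = min(results, key=results.get)
print(best, round(results[best], 4))
```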
Tables 3 and 4 show the configuration of the selected parameters for the simulation.
Table 3
Parameters for simulation in the regression tree.
Condition | A | B | C | D | E | F | G | H | I | J | K | L | M | N |
min_samples_split | 2 | 3 | 4 | 5 | 2 | 2 | 2 | 3 | 3 | 4 | 4 | 4 | 5 | 5 |
min_samples_leaf | 2 | 2 | 2 | 2 | 3 | 4 | 5 | 3 | 4 | 3 | 4 | 5 | 4 | 5 |
kfold_split | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
Table 4
Parameters for simulation in the random forest.
Condition | A | B | C | D | E | F | G | H | I | J | K | L |
n_estimators (D1) | 10 | 10 | 10 | 10 | 35 | 35 | 35 | 35 | 310 | 310 | 310 | 310 |
n_estimators (D2) | 10 | 10 | 10 | 10 | 55 | 55 | 55 | 55 | 300 | 300 | 300 | 300 |
n_estimators (D3) | 3 | 3 | 3 | 3 | 29 | 29 | 29 | 29 | 47 | 47 | 47 | 47 |
n_estimators (D4) | 3 | 3 | 3 | 3 | 7 | 7 | 7 | 7 | 11 | 11 | 11 | 11 |
min_samples_leaf | 2 | 3 | 4 | 5 | 2 | 3 | 4 | 5 | 2 | 3 | 4 | 5 |
kfold_split | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
The first simulation round was performed by fixing the dependent variable and toggling each independent variable separately. The best result was then fixed for the next round, and this systematic procedure continued until all independent variables had been evaluated individually in each simulation round. At the end of the process, a regression tree model, together with the set of variable correlations yielding the smallest MAE, was obtained for each adjustment of the respective critical dimension. Tables 5 and 6 show the results with the lowest MAE after the simulations.
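This procedure amounts to a greedy forward selection of independent variables by cross-validated MAE. A sketch under assumed column names (the data are synthetic, not the housing dataset):

```python
# Greedy forward selection: fix one dependent variable (adjustment), add the
# independent variable that most lowers the 5-fold MAE, and repeat until no
# remaining variable improves the score.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
cols = ["temperature", "climatized", "deviation_d1", "oot_d1"]
X = rng.normal(size=(62, 4))
y = 0.5 * X[:, 2] + 0.1 * X[:, 0]  # adjustment A1, driven mainly by deviation_d1

def cv_mae(idx):
    """5-fold cross-validated MAE of a tree using the columns in idx."""
    tree = DecisionTreeRegressor(random_state=4)
    s = cross_val_score(tree, X[:, idx], y, cv=5, scoring="neg_mean_absolute_error")
    return -s.mean()

selected, best_mae = [], float("inf")
improved = True
while improved:
    improved = False
    for j in [c for c in range(len(cols)) if c not in selected]:
        mae = cv_mae(selected + [j])
        if mae < best_mae:
            best_mae, best_j, improved = mae, j, True
    if improved:
        selected.append(best_j)

print([cols[j] for j in selected], round(best_mae, 4))
```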
Table 5
Results obtained after regression tree simulations.
ADJUSTMENT | ALGORITHM | CONDITION | MAE (mm) |
A1 | Regression Tree | E / H / J | 0.001495 |
A2 | Regression Tree | E / H / J | 0.002042 |
A3 | Regression Tree | E / H | 0.001456 |
A4 | Regression Tree | D | 0.001121 |
Table 6
Results obtained after random forest simulations.
ADJUSTMENT | ALGORITHM | CONDITION | MAE (mm) |
A1 | Random Forest | A | 0.001312 |
A2 | Random Forest | C | 0.002011 |
A3 | Random Forest | A | 0.001226 |
A4 | Random Forest | E | 0.001186 |
After the simulation stage defined the optimal variables and parameters for each predictive model (one per critical-dimension adjustment), the graphic model of the regression tree was generated to improve the operators' understanding and interpretation of the results. Figures 4, 5, 6, and 7 show the graphic model for each adjustment variable.