Explore data and establish noise
Data exploration was carried out using pandas, a data analysis library for the Python programming language. Time-series plots of the attributes were analysed for trends and anomalies. Many of the stage 1 and stage 2 output measurements exhibited trends similar to those shown in Fig. 2. In the dataset, noisy values were identified as values falling outside a user-specified range, including zeros and outliers.
A correlation matrix was generated to determine whether attributes were strongly, moderately or weakly correlated. A strong correlation indicates that one attribute in a pair may be redundant. It also indicates that statistical tools can be used to explain pairwise relationships. For example, if a stage 1 output measurement and a stage 2 output measurement are highly correlated, then either can be used to explain the other through simple statistical expressions. The correlation matrix, however, indicated mainly weak correlations (see the correlation coefficients shown in the matrix of Fig. 5).
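For illustration, the pairwise correlation analysis can be reproduced with pandas; the column names and values below are synthetic stand-ins for the actual measurement attributes:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the measurement attributes
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "S1M1": rng.normal(10, 1, 200),
    "S1M2": rng.normal(5, 2, 200),
    "S2M1": rng.normal(7, 1, 200),
})

corr = df.corr()                                     # pairwise Pearson correlation matrix
# Flag strongly correlated attribute pairs (|r| > 0.8, off the diagonal)
strong_pairs = (corr.abs() > 0.8) & (corr.abs() < 1.0)
print(corr.round(2))
```

Pairs flagged in `strong_pairs` would be candidates for redundancy; for this dataset, the coefficients were mainly weak.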
Convert noisy values to missing values
Converting noisy values to missing values was undertaken using the pandas and NumPy Python libraries, which were also used to explore the raw dataset. The focus was on those stage 1 and stage 2 output measurements that exhibited a significant number of noisy values. A user-specified range of correct values was used to sift out erroneous values falling outside that range. Table 2 shows the attributes with missing values.
Table 2
Attributes in the dataset with their missing values.

| At location | Stage 1 output measurements (% missing) | Stage 2 output measurements (% missing) |
|---|---|---|
| 0 | 0.5 | 9 |
| 1 | 58 | 41 |
| 2 | 3 | 12 |
| 3 | 1 | 6 |
| 4 | 1 | 91 |
| 5 | 95 | 1 |
| 6 | 88 | 1 |
| 7 | 62 | 7 |
| 8 | 6 | 7 |
| 9 | 5 | 30 |
| 10 | 2 | 5 |
| 11 | 74 | 4 |
| 12 | 23 | 6 |
| 13 | 14 | 3 |
| 14 | 57 | 15 |
Remove non-relevant attributes
Some attributes do not add value to a learning task; when included in the dataset for learning, they hinder model performance. For the learning task at hand, the time-stamp attribute was not needed. In addition, the set point measurements in the dataset were not needed, since they were constant, non-changing values. Had the correlation analysis established a strong correlation between any two attributes, one of them could have been discarded as redundant.
Attributes with significant missing values can also be discarded; attributes with more than 70% missing values fall into this category. The reasoning is that the non-missing values, which would be used to predict the missing values, are insufficient to comprehensively describe the pattern across the entire set of values for the attribute. Three attributes in the stage 1 output measurements (5, 6 and 11) and one in the stage 2 output measurements (4) fell into this category.
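The 70% rule can be applied programmatically; the column names and values below are illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative attributes (column names hypothetical)
df = pd.DataFrame({
    "S1M5": [np.nan, np.nan, np.nan, 1.0],   # 75% missing: discard
    "S1M8": [1.0, np.nan, 3.0, 4.0],         # 25% missing: keep
})

pct_missing = df.isna().mean() * 100          # % missing per attribute
df_reduced = df.drop(columns=pct_missing[pct_missing > 70].index)
print(list(df_reduced.columns))
```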
Set class attribute
A class attribute can be described as the particular attribute whose values are to be predicted, also known as the target, dependent or output variable. In the dataset being worked on, there are multiple class attributes, since there are multiple attributes whose values are noisy, something that is common in real-life datasets. This is a classic multi-target classification problem, where many attributes are considered as targets. There are two known approaches for treating this type of multi-target classification problem (Last et al., 2010). One is to build a separate model for each target. The second is to use a multi-target classification algorithm [46], such as Random Linear Target Combination or Multi-Objective Random Forest [47]. Using single-target classifiers for a dataset that has many class attributes would be an arduous and time-consuming task, since a separate model must be built for each class attribute. Despite this, they perform just as well [48] and sometimes even better [47] than multi-target classifiers. Moreover, the single-target classification approach is more prevalent and more straightforward [48]. For the small number of attributes to be corrected in the current dataset, a single-target classification approach is sufficient.
In this research, the prediction learning is based on a single-target classification approach; hence, steps four to ten (see Fig. 4) are repeated for each class attribute that is to be corrected for noise. The approach is a systematic one in which attributes with erroneous values are corrected one attribute at a time [49].
Exclude non-relevant attributes
The dataset being corrected represents a multi-stage manufacturing process; specifically, the system is a two-stage continuous flow manufacturing process. It is better to split the class-specified dataset into relevant and non-relevant sets to improve learning performance [50]. With reference to Fig. 1, the variables relevant to predicting the stage 1 output measurements are those upstream of the stage 1 output measures. The stage 2 processes and stage 2 output measures are not relevant, since they are downstream of the stage 1 output measurements. Similarly, the stage 1 output measurements and stage 2 processes are relevant for predicting the stage 2 output measurements. Including the stage 1 process parameters in the set for predicting the stage 2 output measurements would be redundant, because the stage 1 output measures are the result of the stage 1 process parameters.
Bearing this in mind, a dataset is carved out of the class-specified dataset. The generated dataset includes the class attribute as well as those attributes that represent process parameters upstream of it. As an example, if stage 1 output measurement 2 (S1M2) is chosen as the class attribute, then the dataset is pruned to exclude all other stage 1 output measurements, as well as the stage 2 process attributes and stage 2 output measurements. On the other hand, if stage 2 output measurement 7 (S2M7) is selected as the class attribute, the dataset includes the stage 1 output measurements and the stage 2 process attributes, but excludes the other stage 2 output measurement attributes. The stage 1 output measurements should have been corrected prior to this, as noisy attributes can cause prediction errors. It is possible to use only the stage 2 process attributes (representing machines 4 and 5) to predict the stage 2 output measurement values, but by including the stage 1 output measurement attributes, it is assumed that the stage 2 output is formed from the stage 1 output. In other words, the stage 1 output measurements can be used to explain the stage 2 output measurements. If a stage 3 process existed, the same reasoning would apply, and so on for subsequent stages. To eliminate confusion, it is best that the dataset attributes are ordered in such a way that the ordering represents the actual process flow.
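As a sketch of this carving step, assuming hypothetical column names (S1P* for stage 1 process parameters, S1M*/S2M* for output measurements, S2P* for stage 2 process parameters) ordered by process flow:

```python
import pandas as pd

# Hypothetical columns ordered by process flow: stage 1 process parameters,
# stage 1 output measurements, stage 2 process parameters, stage 2 outputs
cols = ["S1P1", "S1P2", "S1M1", "S1M2", "S2P1", "S2M1", "S2M2"]
df = pd.DataFrame([[0.0] * len(cols)], columns=cols)

def carve(df, class_attr):
    """Keep only the attributes relevant (upstream) to the chosen class attribute."""
    if class_attr.startswith("S1M"):
        # stage 1 output: only the stage 1 process parameters are relevant
        keep = [c for c in df.columns if c.startswith("S1P")]
    else:
        # stage 2 output: stage 1 outputs and stage 2 process parameters are relevant
        keep = [c for c in df.columns if c.startswith(("S1M", "S2P"))]
    return df[keep + [class_attr]]

print(list(carve(df, "S1M2").columns))   # ['S1P1', 'S1P2', 'S1M2']
```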
Partition non-missing dataset
Although tree-based algorithms such as Decision Tree and Random Forest have been known to learn noisy datasets very well [42], performance is often degraded where noise exists in a dataset. As a result, the methodology proposes using a clean (no missing values) dataset to build and test the prediction model. To do this, a dataset that has non-erroneous (non-missing) values for the class attribute is carved out of the class-specified dataset (which has both missing and non-missing values). This dataset is progressed for building the classifier model for predicting the class attribute. The built model is then applied to the class-specified dataset to predict values for the missing entries. The data partitioning was accomplished in the WEKA Explorer environment.
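The partitioning can be expressed in pandas as a simple split on the class attribute; column names and values are illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative class-specified dataset; S1M1 is the class attribute
df = pd.DataFrame({
    "S1P1": [1.0, 2.0, 3.0, 4.0],
    "S1M1": [0.1, np.nan, 0.3, np.nan],
})

clean = df[df["S1M1"].notna()]      # non-missing class values: build/test the model
to_impute = df[df["S1M1"].isna()]   # missing class values: to be predicted later
```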
Apply feature reduction modelling
In the methodology, the main function of feature reduction is to boost the performance of the classifier model. To confirm that the few selected features can adequately represent the entire feature space, a classifier model is applied to the feature-reduced dataset (as described in Fig. 3). The performance is compared with learning using the full feature space. If prediction accuracy is not degraded significantly but modelling cost is improved, then the selected features are capable of representing the entire feature space, otherwise, the full dataset is used.
For the dataset being corrected, a feature reduction model was built on the basis of the ReliefF feature selection algorithm [51, 52]. ReliefF ranks features from most sensitive to least sensitive to the class attribute. Data modellers can then decide which cut-off point to test on the classification model. Principal Component Analysis was ruled out as it is computationally costly. Additionally, it uses an orthogonal transformation to map features that may be correlated into a new feature space with minimum correlation between the new features, thereby creating new features from existing ones [16]. The new features diminish the physical meaning of the original features, and require additional modelling and computation to understand what they represent. Correlation-based Feature Selection was also considered, but it does not provide a ranking of how important the features are in explaining a class attribute.
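The feature ranking in this work was produced with WEKA's ReliefF implementation. As a rough illustration of the idea behind the algorithm, the following is a minimal sketch of the classic ReliefF weight update for a discrete class on synthetic data (the full algorithm, which also handles numeric class attributes, differs in detail):

```python
import numpy as np

def relieff_weights(X, y, n_neighbors=5):
    """Minimal ReliefF sketch for a discrete class: a feature gains weight when
    near misses differ on it, and loses weight when near hits differ on it."""
    X = np.asarray(X, dtype=float)
    X = (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-12)  # scale feature diffs to [0, 1]
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)             # Manhattan distance to all rows
        dist[i] = np.inf                                # exclude the instance itself
        hits = np.flatnonzero(y == y[i])
        hits = hits[hits != i]
        misses = np.flatnonzero(y != y[i])
        hits = hits[np.argsort(dist[hits])[:n_neighbors]]
        misses = misses[np.argsort(dist[misses])[:n_neighbors]]
        w -= np.abs(X[hits] - X[i]).mean(axis=0) / n    # penalise within-class spread
        w += np.abs(X[misses] - X[i]).mean(axis=0) / n  # reward between-class spread
    return w

# Tiny demonstration: only feature 0 carries the class signal
rng = np.random.default_rng(1)
X = rng.random((300, 3))
y = (X[:, 0] > 0.5).astype(int)
w = relieff_weights(X, y)
# feature 0 should receive the largest weight and rank first
```

Sorting `w` in descending order gives the sensitivity ranking from which a cut-off (e.g. top 10) can be chosen.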
In the current research, subsets of the dataset made up of the top ranked 5, 10, 15 and 20 ReliefF selected attributes were assessed using models built with Logistic Regression (LR), Random Forest (RF), and Neural Networks (NN). The results (see Fig. 6) indicate the optimal number to be the first 10 ReliefF selected attributes, beyond which there is marginal or no improvement in prediction accuracy.
Table 3 compares the prediction accuracy of a Random Forest classifier model using: a) the full dataset; b) the top 10 ReliefF attributes (missing and non-missing); and c) the top 10 ReliefF attributes using the non-missing set only. From the results presented in Table 3, it can be concluded that there is an improvement in prediction accuracy when using the ReliefF-reduced subset instead of the full dataset. There is only a very marginal improvement of 2.2% from using the non-missing (clean) subset instead of the subset that contains both missing and non-missing values for the class attribute. For the current dataset, there is therefore no need to set apart a clean subset for building the classifier model, as this would add to the computational steps and time. The partitioning step 6 in the methodology (see Fig. 4) is therefore bypassed for the current dataset, but would be needed for a dataset where missing values significantly degrade the performance of the built classifier model.
The results indicate that the top 10 ReliefF selected attributes are able to represent the entire feature set for predicting S1M1, and are also able to boost the performance of the classifier model. The prediction results for the three candidate classifiers (see Fig. 6) suggest that a Random Forest-based classifier model is the best learner for the dataset. On this basis, the classification prediction is progressed on a hybrid modelling approach, namely ReliefF to reduce the feature space and Random Forest for value prediction.
Table 3
Comparison of prediction accuracy on the full dataset and the feature selection-based datasets.

| Dataset | MAE | % improvement |
|---|---|---|
| Full dataset | 0.0421 | – |
| Top 10 ReliefF (missing and non-missing) | 0.0363 | 13.78% |
| Top 10 ReliefF using non-missing set | 0.0355 | 2.20% |
Build classifier model
A Random Forest classifier model was used to evaluate the feature selection subset. The Random Forest algorithm is a combination of tree-based classifiers such that, after a large number of trees is generated, they vote for the most popular class [43]. Random Forest is by nature an ensemble, i.e. it consists of multiple classifiers (tree models), and its accuracy has been shown to be quite high for small to medium-sized datasets [53]. The main drawback of Random Forest is its speed, a result of combining multiple classifiers. In the current research the dataset is considered small to medium-sized, so the speed deficiency was not noticeable. Moreover, the feature space has been reduced through feature reduction, which should further boost the speed of the classifier model. Another drawback of Random Forest, as with other tree-based algorithms, is that it cannot provide good estimates outside the boundaries of the training dataset; in other words, it extrapolates poorly, unlike regression algorithms [53]. The current prediction task does not warrant extrapolating forecasts outside the boundaries of the current dataset, so a tree-based classifier is sufficient for the present purpose.
The dataset used for training is the top 10 ReliefF-reduced dataset, which included missing and non-missing values for the class attribute: an 11-attribute x 14,074-instance dataset. The dataset instances were partitioned 80:20 into train and test sets using random sampling. Mean absolute error was used as the evaluation metric. It is a prediction accuracy metric for classification models that learn numeric data types, and it indicates the average magnitude of the error between the predicted and the actual values. It is given by Eq. 1, where e_i is the prediction error of the i-th sample and n is the number of samples:

MAE = (1/n) Σ |e_i|, i = 1, …, n    (1)
Values of mean absolute error closer to 0 indicate good prediction capability of the model. The mean absolute error of the Random Forest classifier model for predicting stage 1 output measurement 1 (S1M1) values, using the top 10 ReliefF selected attributes, was 0.0363 (see Table 3). The model is saved to be applied for missing-value imputation.
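As an illustration of this train/test workflow (the original modelling was done in WEKA), a scikit-learn analogue on synthetic data might look like the following; RandomForestRegressor is used because the class attribute is numeric, and the feature matrix and target are stand-ins, not the actual dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((500, 10))                           # stands in for the top 10 ReliefF attributes
y = X @ rng.random(10) + rng.normal(0, 0.05, 500)   # synthetic S1M1-like numeric target

# 80:20 train/test split with random sampling
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

mae = mean_absolute_error(y_te, model.predict(X_te))   # Eq. 1
print(round(mae, 4))
```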
Apply built classifier model on class-specified dataset
The saved model is re-applied to the same dataset, but this time without partitioning the dataset instances. This is so that the dataset instances are not disordered, as happens with random sampling and partitioning. Re-ordering a disordered dataset after generating prediction results would add to the computational complexity of the machine learning process.
An excerpt of the prediction results (instances 1 to 4 and 9671 to 9689) is shown in Fig. 7. In the column of actual values, '?' denotes a missing value. These results are saved to a text file to enable extraction of the relevant information.
Figure 8a is a plot of the prediction error for predicting S1M1. The points lying along the zero axis are mostly entries for which the error is missing, because the actual values are missing. An analysis of the prediction error for the actual values showed that 99% of values lie below 0.05 mm and 56% below 0.01 mm (see Table 4). The results indicate that if 100 samples are taken, 99 can be predicted to an accuracy within 0.05 mm using the model. Figure 8b shows the scatter plot of predicted vs actual values. Most data points lie along the line of best fit, indicating good prediction accuracy of the model.
Table 4
Prediction accuracy summary.

| Prediction error (mm) | Percentage of samples |
|---|---|
| 0.01 and below | 56 |
| 0.03 and below | 94 |
| 0.05 and below | 99 |
Extract predicted values
A Python script was used to extract and organize the relevant information from the WEKA-generated results. The logic in the code creates a csv file for the results: where the value in the actual column is '?', the predicted value is selected; otherwise, the actual value is selected. In this way, the column in the generated csv file contains the actual values and the predicted values for the class attribute. The code is shown in Table 5. It truncates the information (by extracting only the relevant column values) and parses it (from text to csv).
Table 5
Python-based code to extract and save relevant information from results buffer.
import csv

fhand = open('S1M1.txt')                 # open saved text results of predicted values
with open('S1M1_predicted.csv', 'a', newline='') as f:
    thewriter = csv.writer(f)
    for line in fhand:
        word = line.split()              # split each line into words
        if len(word) < 3:
            continue                     # skip blank or header lines
        if word[1] == '?':               # the actual value (index 1) is missing
            value = word[2]              # so take the predicted value (index 2)
        else:
            value = word[1]              # otherwise, keep the actual value
        thewriter.writerow([value])      # write each value as a row
fhand.close()
Replace missing with predicted values
Steps four to ten are repeated until the missing values in the selected stage 1 output measurements have all been replaced with their predicted values, applying the combination of ReliefF and Random Forest in each iteration. The corrected dataset is then progressed to correcting the stage 2 output measurement values, where steps four to ten are applied in the same way.
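The replacement step amounts to keeping the actual value where it exists and falling back to the predicted value otherwise; in pandas, with illustrative values:

```python
import numpy as np
import pandas as pd

actual = pd.Series([0.10, np.nan, 0.30, np.nan], name="S1M1")   # '?' entries as NaN
predicted = pd.Series([0.11, 0.22, 0.29, 0.41])                 # model output for every row

corrected = actual.fillna(predicted)   # actual where present, predicted otherwise
print(corrected.tolist())              # [0.1, 0.22, 0.3, 0.41]
```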
It is important to note that missing values of less than 5% are considered trivial [54, 55]; they are too insignificant to cause major performance degradation or to cast suspicion on the results of applying a machine learning algorithm. Attributes with less than 5% missing values were therefore not corrected in this research.