This study adopts the CRISP-DM approach to analyze the problem and applies data analytics using machine learning techniques to build a predictive model that can improve the operational efficiency of the production line [20]. Fig 1 shows the phases of CRISP-DM, followed by the details of the research process undertaken in this study.
Business Understanding
CRISP-DM starts with the Business Understanding phase, which consists of identifying the business goal and the data mining goal. As mentioned above, the dataset used for this research came from the case study company, which has been operating for more than 40 years. As part of its initiative to improve overall performance, especially in the production and operations sector, the case company is executing a project that requires it to explore data analytics and machine learning. Delays, which can affect the ultimate output and set production back by days, are a common issue throughout the manufacturing process. The business goal is therefore to identify the significant reasons for delay, which will help make manufacturing more efficient and remove the potential causes of delay.
The data mining goal is to use the details of production operations to predict the cause of delay. To achieve this goal, four machine learning algorithms are used to develop prediction models: Naive Bayes, Decision Tree, Neural Network, and Random Forest. Stratified sampling is used to deal with the imbalanced data. To evaluate the performance of the predictive models, standard metrics such as sensitivity, precision, and accuracy are used. The KNIME Analytics Platform, an open-source data science tool, was used to carry out the data mining process.
Data Understanding
The target dataset was obtained from the Operations Department of the case company and contains 180 rows and 24 columns, including Job Start, Job End, Total Operation Time, Operation Start, Maintenance Plan, Maintenance Unplanned, Insurance Briefing, Full Stockpile, Blasting, Pump Cleaning, Out of Stone, Rain, Stone Stocked, Late Lorry, Quarry-Top Full Water, Road Expansion Quarry-Top, Real-Time Operation Hour, Lorry Trip, Total Output, and Total Tonnes per Hour. The outcome variable is the column “Delay”, which contains True/False values indicating whether a delay occurred in that production period. Fig 2 below shows a screenshot of a few samples of the dataset before data pre-processing.
Data Preparation
In the Data Preparation phase, this research first explored the dataset to check whether the input data was standardized and whether any missing values were present. In this preliminary process, we observed that the data for each month had a different format and was not standardized, so major preparations were made to standardize all parameters in each month's data. In addition, the dataset had many missing values, represented by “-” in most delay predictors; these were later changed to “0” to signify that there was no delay value for that predictor. The dataset was also split into separate spreadsheets for Machine 1 and Machine 2, which were later restructured with added columns labelled “Machine” and “Delay”. Furthermore, rows where the date fell on a holiday and nothing was produced were removed, as they contained no data. As a result, 151 rows and 23 columns of data for both machines were merged into one spreadsheet.
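The cleaning steps above (replacing the “-” placeholder with 0, removing holiday rows, adding the “Machine” and “Delay” columns, and merging the two machines' sheets) were done in spreadsheets and KNIME, but they can be sketched in pandas. The column names and values below are hypothetical stand-ins for the case dataset, not the real data:

```python
import pandas as pd

# Hypothetical per-machine sheets mirroring the case dataset's layout.
m1 = pd.DataFrame({
    "Date": ["2021-01-04", "2021-01-05", "2021-01-06"],
    "Maintenance Unplanned": ["-", "1.5", "-"],
    "Late Lorry": ["0.5", "-", "-"],
    "Total Output": [1200, 0, 1350],  # 0 output marks a holiday with no production
})
m2 = m1.copy()

def clean(df, machine_label):
    df = df.copy()
    df["Machine"] = machine_label
    delay_cols = ["Maintenance Unplanned", "Late Lorry"]
    # Replace the "-" placeholder with 0 to signify no delay under that predictor
    df[delay_cols] = df[delay_cols].replace("-", 0).astype(float)
    # Drop holiday rows that carry no production data
    df = df[df["Total Output"] > 0]
    # Label the outcome: True when any delay predictor is non-zero
    df["Delay"] = df[delay_cols].sum(axis=1) > 0
    return df

# Merge both machines' cleaned sheets into one table
merged = pd.concat([clean(m1, "Machine 1"), clean(m2, "Machine 2")],
                   ignore_index=True)
```

The same logic scales to all of the dataset's delay predictor columns by extending `delay_cols`.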
Most of the pre-processing work focused on formatting the parameters, since much of the data input was not in a consistent format. Prior discussion established that the operation-time predictors were recorded in hours; however, the input was not standardized, as some entries were in time format and a few in number format. The time format was also changed to 24-hour formatting. The column “Operation” was removed because it was redundant, overlapping the real-time operation category, and mostly empty, so it was not used in the prediction model. Finally, all the variables were combined into one spreadsheet to ensure it was readable by the software.
Observing the distribution of delays, it was found that 19 production days were delayed due to Maintenance Unplanned and 67 due to Late Lorry. The delay was therefore labelled with two categories, where “True” means the production was delayed and “False” means there was no delay. All the columns were normalized using the min-max normalization method as part of the pre-processing required to apply the neural network technique.
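Min-max normalization rescales each column to the [0, 1] range via (x − min) / (max − min), which keeps the neural network's inputs on a comparable scale. A minimal sketch with illustrative (not actual) values:

```python
import numpy as np

def min_max_normalize(x):
    # Rescale each column to [0, 1]: (x - min) / (max - min)
    x = np.asarray(x, dtype=float)
    mins, maxs = x.min(axis=0), x.max(axis=0)
    return (x - mins) / (maxs - mins)

# Illustrative operation-hour and output columns (hypothetical values)
hours = np.array([[2.0, 10.0],
                  [4.0, 20.0],
                  [6.0, 30.0]])
scaled = min_max_normalize(hours)
```

Note that a constant column would give max − min = 0; in practice such columns are dropped or handled before normalizing.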
Modelling
In this step, various machine learning techniques were used to develop predictive models: Neural Network, Naive Bayes, Random Forest, and Decision Tree were used to predict the cause of delays on production days. Naive Bayes models can produce robust predictions when the predictors have small correlations, even with a simple architecture [3]. Decision Trees are easy to interpret and can give insights into the important features. Random Forest is an improved version of the decision tree that can produce very good and robust predictions [15]. Artificial Neural Networks (ANN) allow complex nonlinear relationships between the target variable and its predictors [13].
The stratified sampling method was used with k-fold cross-validation to handle the imbalanced dataset; the number of validations was set to ten for training and testing. Cross-validation also helps evaluate the quality of a model, facilitating the selection of the model that will perform best on unseen data and helping to avoid overfitting and underfitting. Lastly, precision, accuracy, and sensitivity were chosen as the performance evaluation metrics to determine which machine learning model gives the best results. Fig 3 shows the overall KNIME workflow for predicting production delay within a manufacturing company. The workflow consists of three major parts: descriptive analysis, unsupervised clustering, and supervised learning classification.
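Stratified ten-fold cross-validation preserves the True/False class ratio in every fold, so the minority class is represented in each test split despite the imbalance. The study ran this in KNIME; the equivalent in scikit-learn, on synthetic stand-in data (151 rows, imbalanced labels, not the case company's data), looks like:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Synthetic stand-in: 151 rows of features, imbalanced Delay labels
X = rng.normal(size=(151, 5))
y = np.array([1] * 86 + [0] * 65)

# Ten stratified folds: each test fold keeps roughly the same class ratio
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

mean_accuracy = float(np.mean(scores))
```

Averaging the ten fold scores gives a more stable performance estimate than a single train/test split, which matters for a dataset this small.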
Since the dataset showed no obvious segregation into groups, clustering was required to group the data appropriately. For this study, the k-means clustering technique was applied [3]. Based on the silhouette coefficient score, k = 3 was selected as the optimal number of clusters, labelled Low Performance, Medium Performance, and High Performance. Through the clustering process, the data is segmented into high-performance and low-performance production, as seen through its productivity. It was observed that the high-performance operation has a higher proportion of delays than the low-performance operation: 67.5% of high-performance production occurrences were delayed, whereas only 50% of low-performance occurrences were labelled as delayed. Therefore, the company should prepare itself when it receives a job that requires a high volume of production.
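Selecting k by the silhouette coefficient means fitting k-means for several candidate values of k and keeping the one whose clusters are most cohesive and well separated. A sketch with scikit-learn on synthetic data with three planted groups (standing in for the Low/Medium/High Performance clusters, not the real production data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three synthetic productivity groups standing in for Low/Medium/High Performance
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in (0.0, 3.0, 6.0)])

# Fit k-means for candidate k values and keep the best silhouette score
best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
```

The silhouette score ranges from −1 to 1, with higher values indicating tighter, better-separated clusters; on the planted three-group data above the search recovers k = 3.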
Once pre-processing was completed for each prediction technique, four predictive models were built: Decision Tree, Neural Network, Random Forest, and Naive Bayes. As the dataset is small, the k-fold cross-validation method of sampling was used, with the number of validations set to ten (10). K-fold cross-validation allows the machine learning process to improve the reliability of the prediction estimate by learning from every part of the data. After the machine learning process is complete, the performance of the predictive models is calculated.
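The four-model comparison under ten-fold cross-validation was built as a KNIME workflow; an equivalent sketch with scikit-learn estimators, using synthetic stand-in features and labels rather than the case data, would be:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
# Synthetic stand-in for the 151-row production dataset
X = rng.normal(size=(151, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic Delay labels

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                    random_state=0),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Mean ten-fold accuracy per model, for side-by-side comparison
mean_accuracy = {name: cross_val_score(m, X, y, cv=cv, scoring="accuracy").mean()
                 for name, m in models.items()}
```

Swapping `scoring="accuracy"` for `"precision"` or `"recall"` yields the other two metrics the study compares.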
Evaluation
In this phase of CRISP-DM, all the machine learning models are evaluated and compared to select the best model for predicting potential delay in the case company's production operations. The most commonly used performance evaluation metrics, accuracy, sensitivity, and precision, are calculated and compared for all four machine learning models.
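The three metrics follow directly from the confusion matrix of each model, where a "positive" is a predicted delay. A small sketch with illustrative counts (not results from the case study):

```python
def evaluate(tp, fp, tn, fn):
    # Accuracy: share of correct predictions over all cases
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Sensitivity (recall): share of actual delays the model catches
    sensitivity = tp / (tp + fn)
    # Precision: share of predicted delays that are real delays
    precision = tp / (tp + fp)
    return accuracy, sensitivity, precision

# Illustrative confusion-matrix counts for one model on 151 rows
acc, sens, prec = evaluate(tp=60, fp=10, tn=70, fn=11)
```

For an imbalanced outcome such as this one, sensitivity and precision are more informative than accuracy alone, since a model that always predicts "no delay" can still score a deceptively high accuracy.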
Deployment
After analyzing the performance of the machine learning models using the standard metrics (accuracy, sensitivity, and precision), the best model is recommended for deployment, together with the insights found in the dataset, to support data-driven decision making.