Towards big industrial data mining through explainable automated machine learning

Industrial systems are capable of producing large amounts of data. These data are often heterogeneous and distributed, yet they provide the means to mine information that can support the deployment of intelligent management tools for production activities. To this end, it is necessary to implement knowledge extraction and prediction processes using Artificial Intelligence (AI) models, but the selection and configuration of suitable AI models tend to be increasingly complex for a non-expert user. In this paper, we present an approach and a software platform that allow industrial actors, who are usually not familiar with AI, to select and configure algorithms optimally adapted to their needs. The approach is essentially based on automated machine learning. The resulting platform effectively enables a better choice among combinations of AI algorithms and hyper-parameter configurations. It also provides explainability features for the resulting algorithms and models, thus increasing their acceptability among practitioners. The proposed approach has been applied in the field of predictive maintenance. Current tests are based on the analysis of more than 360 databases from this field.


Introduction
Data-driven decision-making may be defined as the set of practices aiming to make decisions based on data analysis rather than on intuitive insights [1]. Business entities that have deployed data-driven decision-making have been observed to be more profitable and productive than traditional ones [2]. Nowadays, decision-making tools are mainly based on the results of current AI research [3]. The success of AI-based tools is mainly due to advances in machine learning approaches [4], stimulated in particular by the availability of large datasets covering various real-world features [3] and by computational gains generally attributed to powerful GPU cards [5].
The manufacturing area is one of those generating huge amounts of data, gathered by means of Cyber-Physical System (CPS) devices. The availability of such data, combined with the knowledge of manufacturing experts, may be an opportunity to build AI-based processes and models providing high-value insights and assets for decision makers. Nevertheless, building such processes and models requires AI and data science skills and expertise that are not always available in manufacturing workbenches and laboratories.
The work presented in this paper aims to bridge the gap between AI expertise and manufacturing experts. We believe that automated machine learning (AutoML) [6] is one of the most powerful approaches to deal with this problem. Instead of searching for appropriate Machine Learning (ML) algorithms and manually tuning their hyper-parameters in a virtually infinite space, AutoML automatically and iteratively configures the hyper-parameters of multiple machine learning algorithms so as to optimize them over a predefined search space.
The interest in building complex AI models able to achieve unprecedented performance levels has gradually been replaced by a growing concern for alternative design factors leading to improved usability of the resulting tools. Indeed, in many application areas, complex AI models are of limited practical utility [12]. The major reason is that AI models are often designed with performance as the main objective, leaving aside other important and sometimes crucial aspects such as confidence, transparency, fairness or accountability. The absence of explanations for predictions makes AI models black boxes, which expose input and output parameters but conceal the associations between them. Such a lack of transparency should preferably be avoided in real-life applications such as industrial manufacturing processes. Since these applications may involve critical decisions, it is desirable to have some justification for the individual predictions produced by an AI algorithm, particularly in an automated environment. The acceptance of, and trust in, an AutoML system is therefore highly dependent on the transparency of its recommendations.
Because of the lack of transparency of AutoML systems used as Decision Support Systems (DSS), users tend to question the validity of automatic results: did the AutoML run long enough? Did it miss some suitable models? Did it sufficiently explore the search space? Does the recommended configuration over- or under-fit? Such questions may make users reluctant to apply the results of AutoML in more critical situations [13]. Moreover, when AutoML provides unsatisfactory results, users are unable to reason about them and thus cannot improve them. They can only increase the computational budget (e.g., the run-time) as much as possible, which becomes a barrier to the effectiveness of AutoML.
A preliminary objective of the current work is therefore to make the output of such well-performing AutoML systems transparent, interpretable and self-explainable. This shall make AutoML support systems more reliable and operational through visual summaries, at different levels, of the provided models and configurations. It may render the AutoML system more transparent and controllable, hence increasing its acceptance.
In the current work, we propose a transparent and self-explainable AutoML system that recommends the most adequate ML configuration for a given problem and explains the rationale behind each recommendation. It further allows the predictive results to be analyzed in an interpretable and reliable manner. In the proposed approach, end users can explore the AutoML process at different levels:
- The AutoML-oriented level (i.e. exploring the AutoML process from recommendation to refinement).
- The Data-oriented level (i.e. exploring data properties through different visualization levels).
- The Model-oriented level (i.e. exploring the models provided by the AutoML system, e.g. model performance, what-if analysis, decision paths, etc.).
The system consists of two integrated modules: the Automated Machine Learning tool for Big Industrial Data (AMLBID) module and the Automated Machine Learning explainer (AML-Explainer). The latter, the AML-Explainer module, is not system- or algorithm-specific; it is interoperable with a variety of AutoML frameworks.
The main contributions of this work are summarized as follows:
1. We present AMLBID, a premier transparent, interpretable and self-explainable meta-learning [14] based AutoML tool that identifies the optimal or near-optimal ML configuration for a given problem.
2. We provide an assisted traceability of the reasoning behind the AutoML recommendation generation process.
3. We develop a module that can explain the predictions of any recommendation through linked visual summaries and/or textual information.
4. We provide a multi-level interactive visualization tool that facilitates model operation and performance inspection, to address the "trusting the model" issue.
5. We provide reliable guidance, when AutoML returns unsatisfying results, to improve the expected performance by assessing the importance of an algorithm's hyperparameters.
The rest of the paper is organized as follows: Sect. 2 discusses closely related work on ML-based data analytics solutions and the need for transparency to gain trust in AI models. Section 3 briefly describes the different types of explanations, their respective information content, and their use in practice. Section 4 provides an overview of the proposed framework and discusses how its components collaborate to achieve the pursued goals. Finally, Sect. 5 concludes the paper and outlines future perspectives.

Related works
The available literature testifies to considerable progress in automated machine learning methodologies and their application in multi-disciplinary areas. We have studied some of the most relevant works, particularly recommendation-based decision support systems in the manufacturing industry, and especially those dealing with the transparency and explainability of automated machine learning and those related to big industrial data mining. Through advances in high-tech sensing and the widespread use of applications such as electronic manufacturing records, mobile sensors, and Industrial Internet of Things (IIoT) tools, manufacturing data are being accumulated at an exponentially growing rate. According to a recent research report [15], the global big data market size will grow from USD 138.9 billion in 2020 to USD 229.4 billion by 2025 (an increase from Petabytes to Exabytes). Machine learning is a key technology to transform large manufacturing data sets, or "big industrial data", into actionable knowledge.
In this section, we review the limitations of available research on explainable AI/AutoML systems as DSS, which primarily motivate the work described in this paper. The current literature can be summarized as an overlapping overview of three major research areas, described in the following subsections.

Challenges in selecting and configuring machine learning algorithms
Machine learning is widely used in many industrial applications across different levels, including the process, machine, shop floor, and supply chain levels. For instance, machine learning models can be used to control product quality [16], to monitor the condition of tools by tracking the evolution of their state [17], or to monitor the health of machines by predicting when machine failures will occur and estimating their criticality [18]. However, despite its countless benefits and advances, building a machine learning pipeline is still a challenging task, partly because of the difficulty of manually selecting an effective combination of algorithm and hyperparameter values for a given task or problem.
Owing to the development of open-source ML packages and active research in the ML field, there are dozens of machine learning algorithms, each with two types of model parameters: (1) ordinary parameters that are automatically optimized or learned during the model training phase; and (2) hyperparameters (categorical and continuous) that are typically set manually by the user before training the model (as shown in Table 1).
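As a minimal illustration of this distinction, the following sketch uses scikit-learn's DecisionTreeClassifier: `max_depth` and `criterion` are hyperparameters fixed before training, while the tree's split thresholds and leaf values are ordinary parameters learned during `fit`. The synthetic dataset is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hyperparameters: chosen by the user BEFORE training.
clf = DecisionTreeClassifier(max_depth=4, criterion="gini", random_state=0)

# Ordinary parameters: learned automatically DURING training
# (here, the split thresholds and leaf values of the tree).
clf.fit(X, y)
print("hyperparameter max_depth:", clf.get_params()["max_depth"])
print("learned parameters: a tree with", clf.tree_.node_count, "nodes")
```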
To achieve the desired performance on a particular problem, users typically try a set of models and configurations based on their understanding of the algorithms and their observation of the data, since no algorithm performs well on all possible problems (i.e., No Free Lunch [19]). Then, based on feedback about how the learning tools performed, the practitioner may adjust the configuration to see whether the performance can be improved. Such a trial-and-error process terminates once the desired performance is achieved or the computational budget runs out (as shown in Fig. 1).
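The trial-and-error loop described above can be sketched as follows; the candidate configurations and the synthetic dataset are illustrative assumptions, not those used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate (algorithm, hyperparameter) configurations tried one by one.
candidates = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0),
    SVC(C=1.0, kernel="rbf"),
]

best_score, best_model = -1.0, None
for model in candidates:
    score = cross_val_score(model, X, y, cv=5).mean()  # feedback step
    if score > best_score:  # keep the best configuration so far
        best_score, best_model = score, model

print(type(best_model).__name__, round(best_score, 3))
```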

Automated machine learning
The Automated Machine Learning or AutoML [20] field is among the rapidly emerging sub-fields of ML that attempts to address the theoretical and algorithmic challenges in order to fully automate the ML process. It addresses also the development and deployment of systems in this regard. AutoML has two main goals: (1) democratizing the application of ML to non-experts of data analysis by providing them with "off the shelf" solutions, and (2) enabling the knowledge practitioners to save time and effort.
Over the last few years, a plethora of AutoML systems have been developed, providing partial or complete ML automation, such as Auto-sklearn [7], TPOT [8], Auto-WEKA [9], ATM [10], as well as commercial systems such as Google AutoML [11], RapidMiner [21], DarwinAI [22], and DataRobot [23]. These tools range from automatic data preprocessing [24,25] and automatic feature engineering [26,27] to automatic model selection [20,28] and automatic hyperparameter tuning [29,30]. Some approaches attempt to choose a learning algorithm and optimize its hyperparameters automatically and simultaneously; this is known as the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem [7][8][9][30][31][32][33]. Table 2 compares some of the most popular AutoML tools in terms of training framework, supported ML tasks, automatic feature engineering, user interface and process transparency. While all of these tools provide partial or complete ML process automation, each works differently and targets different dataset structures, platforms, algorithms, or end users. Each also has its own advantages and disadvantages. For instance, Auto-sklearn is embedded in Python but only operates on structured data. Auto-WEKA supports Weka ML algorithms and provides a graphical user interface (GUI), but it is limited to statistical algorithms. RapidMiner provides feature engineering capabilities but requires expert guidance. Likewise, Google AutoML supports most types of datasets and algorithms, but the dedicated service is commercial and only available in the cloud.
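A toy version of the CASH idea, jointly sampling an algorithm and its hyperparameters by random search, might look as follows; the search space, budget and dataset are illustrative assumptions (real systems such as Auto-sklearn use far more sophisticated optimizers):

```python
import random

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = random.Random(0)

# Joint search space: each entry pairs an algorithm with a hyperparameter grid.
space = {
    DecisionTreeClassifier: {"max_depth": [2, 4, 8], "criterion": ["gini", "entropy"]},
    LogisticRegression: {"C": [0.01, 0.1, 1.0, 10.0], "max_iter": [1000]},
}

best = (-1.0, None)
for _ in range(10):  # random search budget
    algo = rng.choice(list(space))  # select an algorithm...
    params = {k: rng.choice(v) for k, v in space[algo].items()}  # ...and a config
    score = cross_val_score(algo(**params), X, y, cv=3).mean()
    if score > best[0]:
        best = (score, (algo.__name__, params))

print(best)
```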
Despite their revolutionary capabilities and unprecedented levels of performance, the interest in developing ever more complex AI models has progressively declined, mainly in favor of alternative design factors that make such models more usable in practice [3]. Existing AutoML systems provide useful assistance, but their usability is greatly limited when it comes to providing detailed analysis of the recommended configurations and insight into the inner workings of the black-box models [13]. The reason is that AI models are often designed with performance as their only goal; ignoring other important matters such as privacy awareness, confidence, transparency, and accountability makes them untrustworthy black boxes [13,34].

The need for transparency to trust in AI and in AutoML
Black-box AI systems have been used in various areas. When they are deployed in domains such as power consumption forecasting or supply chain management (e.g., to analyze brand trends or consumer sentiment), quality features such as transparency and explainability usually receive less attention than overall system performance. When such systems fail, e.g., a Quality Control system misses a defect, or an Equipment Failure Prevention system produces false or inaccurate predictions and cannot identify the exact cause of a failure, the consequences are rather limited. The situation is different in critical industrial applications, where the lack of transparency of ML techniques can be a disqualifying factor. A single wrong decision can endanger an entire production line (e.g., failure of a critical unit) and cause significant financial losses (e.g., product non-conformity). Relying on an incomprehensible black-box data-driven system would therefore not be the best option. The lack of transparency is among the most relevant reasons why the adoption of AI models in the manufacturing industry is questioned: stakeholders there are more cautious than in the consumer entertainment or e-commerce industries.
Explaining the reasoning behind one's decisions or actions is an important part of human social interactions [35]. Just as explanations help build trust in human-to-human relationships, they should also be part of human-to-machine interactions [3]. In this work, we investigate the contributions and feasibility of a process designed to make such powerful DSS transparent, interpretable and self-explainable in order to foster trust, both in situations where the AI system has a supportive role (e.g., production planning) and in those where it provides directions and makes decisions (e.g., quality control, predictive maintenance or autonomous driving). In the former case, explanations provide extra information, which helps the human in the loop gain an overall view of the situation or problem at hand in order to take decisions. Just as an expert has to provide a detailed report explaining his or her findings, a supportive AI system should explain its decisions in detail instead of providing only a prediction or a decision.

Explainable AI
Explainable AI (XAI) [12] refers to artificial intelligence technologies that can provide human-understandable explanations for their outputs or actions [36]. End users naturally wonder how and why algorithms arrive at their decisions [34]. As the complexity of AI algorithms and systems grows, they are increasingly viewed as "black boxes" [32]. This complexity results in a lack of transparency that hampers understanding of the reasoning of these systems, which negatively affects users' trust.
Model explainability can be divided into two categories: global explainability and local explainability. Global explainability means that users can understand the model directly from its overall structure. Local explainability considers a specific input and tries to find out why the model makes a certain decision for it.
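The two categories can be illustrated with a scikit-learn decision tree: feature importances summarize the whole model (global), while the decision path explains one individual prediction (local). This is a simplified sketch, not the explanation machinery of the proposed system.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Global explanation: importance of each feature over the whole model.
global_expl = clf.feature_importances_
print("global feature importances:", global_expl.round(2))

# Local explanation: the decision path followed for ONE specific input.
node_indicator = clf.decision_path(X[:1])
print("nodes visited for the first sample:", node_indicator.indices.tolist())
```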
The development of methods for explaining, visualizing and interpreting machine learning models has recently gained increasing attention under the Explainable AI (XAI) umbrella [12,13,34,36]. In recent years, XAI has advanced rapidly, but there are still broad gaps in generalizing XAI approaches: the current major XAI methodologies are only applicable to specific types of data and models, and mostly require the pre-configuration of input parameters that are not easily coded by non-experts. In contrast, our proposed system supports the analysis and inspection of any machine learning classification model, without data-type dependency and without writing a single line of code. The characteristics of a variety of XAI methods, in terms of explanation level, data and model dependency, and pre-configuration requirements, are highlighted in Table 3.
It is useful to first establish a consensus on what the term explainability refers to in the context of artificial intelligence and, more specifically, machine learning. Different levels of explanation provide insight into different aspects of a model, ranging from information about the learned representations to the identification of distinct prediction strategies and the assessment of the overall model behavior. Depending on the recipient and his or her intent, it may be advantageous to focus on one particular level of explanation. Recent design recommendations emphasize the importance of intuitive interfaces, a clean and concise presentation of the explanation facilities, and easy user interaction [42]. To make AI-based decision support systems accessible and effortless for both machine learning experts and neophytes, system builders and designers should present not only the final model prediction or recommendation coming out of the system, but also the pipeline steps and the decisions made at each step of the prediction or decision generation process. In this regard, more clarity and transparency are expected from AutoML systems. To accommodate user needs for transparency and trust, a few recent works have proposed prototype designs for increasing the transparency of AI systems [42][43][44]. However, most of these systems fall short of providing an overview of how and why a recommended pipeline configuration was generated, and their interfaces are complex and not suited to manufacturing routines and needs. Furthermore, the incorporated explanation facilities are often insufficient and/or not tailored to the industrial domain.
Our proposed system aims to provide guidance for solving particular problems (Fig. 2). Given a dataset, the tool automatically recommends the most adequate ML configurations and allows users to easily observe and analyze these models through an interactive multiple-view module that explains the inner workings of any machine learning classifier in an interpretable and faithful manner. The goal is manifold: (1) facilitate the inspection of model workings and performance through linked visual summaries and textual information; (2) provide a visual summary of all evidence items and their relevance for the computed result; and (3) present a guided investigation of the reasoning behind the recommendation generation.

The conceptual framework
Given a predictive modeling problem for an industrial application, it is often difficult to build an accurate machine learning-based predictive model that is both easy to develop and interpretable by non-ML experts. The key idea of our transparent and explainable automated machine learning vision is to separate recommendations from explanations by using two modules simultaneously. The first module recommends the most adequate ML configuration for the problem at hand and aims to maximize the requested predictive performance metric (e.g. accuracy, precision, recall). The second module provides the rationale behind the recommended configuration and automatically explains the inner workings of the model.
The following section describes the design and implementation choices of the proposed tool, a complete, transparent and self-explainable AutoML system, as shown in Fig. 3. In the recommender module (AMLBID), AutoML is performed when a new dataset is presented, and a list of candidate pipelines is provided for the given task. The dataset characteristics, the output of AutoML and the list of candidate pipelines are supplied to the explanatory module in order to generate an interactive dashboard that helps end users better understand the provided results, diagnose the performance of the generated pipelines and explore possibilities of performance refinement.

The recommender module
The AMLBID module is a meta-learning [14] based system that automates the algorithm selection and tuning problem using a recommendation system bootstrapped with a collaborative meta-knowledge base. This knowledge base is derived from a large set of experiments conducted on 360 real-world datasets from different manufacturing levels, generating more than 3 million different ML configurations (pipelines). Each pipeline consists of a machine learning model and its hyperparameter configuration. By exploring the interactions between dataset and pipeline topologies, the system is able to identify effective pipelines without performing a computationally expensive analysis.
Building a meta-learning-based system to deal with the algorithm selection and configuration problem requires a meta-knowledge base for the learning process. This involves collecting datasets, choosing machine learning algorithms, extracting meta-features (dataset and pipeline characteristics), and determining the performance of the algorithm configurations according to different evaluation measures (e.g. accuracy, precision, recall). Then, when a new problem is presented to the system, its meta-features are extracted, and a recommendation mechanism that makes use of the meta-knowledge base ranks the pipelines (algorithms and configurations) for the unseen problem according to the desired performance measure. AMLBID consists of two main phases: the learning phase and the inference phase. During the learning phase, we evaluate different classification algorithms, analyze multiple datasets (to extract meta-features), and train a ranking meta-model. During the inference phase, the meta-model generated in the training phase is used to produce a ranked list of promising classification pipelines for a new dataset and a classification performance metric.

The learning phase
During the learning phase, we evaluate different classification algorithms with multiple hyperparameters configurations on a large collection of various datasets. Then, we generate meta-features that are used to train a meta-model able to recommend promising classification pipelines for a given dataset and performance metric. The entire training phase is illustrated in Fig. 4.

The datasets
We conduct the experiments on 360 real-world manufacturing classification datasets collected from the popular UCI, OpenML, Kaggle and KEEL repositories, among other real-world scenarios. These datasets cover various tasks and differ in size, number of attributes, composition and class imbalance.
It is worth noting that the used datasets cover a broad range of application areas. They include, among others, process-level studies, machine-related problems and supply-chain-level problems (as shown in Table 4). We did not perform any preprocessing on the datasets, in order to avoid any potential bias on the classifiers' performance and to ensure fair performance comparisons.
The meta-features

The core paradigm of meta-learning is to relate the performance of learning algorithms and configurations to data characteristics (meta-features). Meta-features are characteristics shared by several problems; their aim is to identify structural similarities and differences among problems. These characteristics can be divided into three categories [33]:

Simple Based on general measures, such as the number of instances, attributes and classes, and the dataset dimensionality. They are designed, to some extent, to measure the complexity of the underlying problem.
Statistical Based on statistical measures obtained from the dataset attributes, such as means, standard deviation, class entropy, and correlations, etc.
Landmark Characterize datasets by the performance of basic machine learning algorithms (with default configurations) run on them. The landmark characteristics used in our system include the performance of the Linear Discriminant Analysis (LDA), Gaussian Naive Bayes (GNB), Decision Tree (DT) and k-Nearest Neighbor (KNN) landmarkers.
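A minimal sketch of meta-feature extraction covering the three categories might look as follows; the chosen measures and landmarkers are a small illustrative subset, not the full set used by AMLBID:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def extract_meta_features(X, y):
    """Compute a few simple, statistical and landmarking meta-features."""
    return {
        # Simple: general measures of the dataset.
        "n_instances": X.shape[0],
        "n_attributes": X.shape[1],
        "n_classes": len(np.unique(y)),
        "dimensionality": X.shape[1] / X.shape[0],
        # Statistical: measures computed from the attributes.
        "mean_std": float(X.std(axis=0).mean()),
        "mean_abs_correlation": float(np.abs(np.corrcoef(X.T)).mean()),
        # Landmarking: accuracy of fast baseline learners with defaults.
        "landmark_gnb": cross_val_score(GaussianNB(), X, y, cv=3).mean(),
        "landmark_knn": cross_val_score(KNeighborsClassifier(), X, y, cv=3).mean(),
    }

X, y = load_iris(return_X_y=True)
print(extract_meta_features(X, y))
```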
The pipelines generation To build the meta-knowledge base, we used eight classifiers from the popular Python-based machine learning library Scikit-learn: AdaBoost, Support Vector Classifier (SVC), Extra Trees, Gradient Boosting, Decision Tree, Logistic Regression, Random Forest, and Stochastic Gradient Descent (SGD). Detailed descriptions of the algorithms and their tuned hyperparameters are given in Tables 11-17 in the Appendix.
For each run of a classifier C over a dataset D, we generated 1000 different combinations of its hyperparameter configurations. This process resulted in an average of 8000 pipelines per dataset. In particular, for each classifier, we generated a list of all possible and reasonable combinations and conducted, for each dataset, a random search among them [45].
During the training phase, we used a fivefold stratified cross-validation strategy to construct our meta-datasets. As a result, our knowledge base consists of more than 3 million evaluated classification pipelines. Note that, due to the different numbers of algorithm hyperparameters, not every algorithm has the same number of configurations/evaluations. The knowledge base is continuously improved by running more tasks, making AMLBID smarter as it accumulates experience through the growing knowledge base.
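The construction of meta-knowledge rows, random hyperparameter sampling evaluated with fivefold stratified cross-validation, can be sketched as follows (one classifier, five configurations and a synthetic dataset here, versus eight classifiers, 1000 configurations and 360 datasets in the actual system):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterSampler, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Random search over a classifier's hyperparameter space.
param_space = {"n_estimators": [10, 50, 100], "max_depth": [2, 4, 8, None]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

knowledge_rows = []
for params in ParameterSampler(param_space, n_iter=5, random_state=0):
    score = cross_val_score(
        RandomForestClassifier(**params, random_state=0), X, y, cv=cv
    ).mean()
    # One row of the meta-knowledge base: (pipeline, performance).
    knowledge_rows.append({"classifier": "RandomForest", **params, "accuracy": score})

for row in knowledge_rows:
    print(row)
```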

Measures
As part of our core idea, we aim to recommend high-performing ML pipelines for a given combination of dataset and evaluation measure. Unlike most state-of-the-art systems, the proposed system supports various classification performance measures to evaluate the performance of ML pipelines (ML algorithms and related hyperparameter configurations). Table 5 gives details of the supported measures.

The recommending phase
The recommending phase is initiated when a new dataset to be analyzed arrives. The user selects a predictive analytic metric to be used for the analysis (e.g. accuracy, recall, F1 score), and the system then automatically recommends a set of machine learning algorithms and their related hyperparameter configurations such that the predictive performance is maximized. To do so, the system first extracts the dataset characteristics (meta-features). The extracted meta-features are then fed to the meta-model to provide candidate pipelines. Finally, the suggestion engine ranks the pipelines, according to the meta-knowledge base, with respect to the chosen analytic metric. The recommending process is shown in Fig. 5.

Meta-model After having generated a meta-dataset with all the necessary metadata, the goal is to build a predictive meta-model that can learn the complex relationship between a task's meta-features and the utility of specific ML pipelines, in order to recommend the most useful ML algorithm configuration(s) given the meta-features M of a new task t_new.
Formally, each task t_j ∈ T is described by a vector F(t_j) = (m_{j,1}, ..., m_{j,K}) of K meta-features m_{j,k} ∈ F, the set of all known meta-features. This can be used to define a task similarity measure based on, for instance, the Euclidean distance between F(t_new) and F(t_j), so that information can be transferred from the most similar tasks to the new task t_new.
The distance between the meta-features of t_new and t_j is given by Eq. (1):

d(t_new, t_j) = sqrt( Σ_{k=1..K} (m_{new,k} − m_{j,k})² )    (1)

One of the aims of our work is to produce an enriched meta-model able to recommend the top-performing classification configuration(s) for a combination of an unseen dataset and a classification evaluation measure. For this purpose, two state-of-the-art learning algorithms were used to produce meta-models able to predict the most appropriate pipelines for the dataset at hand: Random Forest (RF) and k-Nearest Neighbor (kNN).
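A toy illustration of this distance-based retrieval, with hypothetical meta-feature vectors and recorded best pipelines, might look as follows:

```python
import numpy as np

def euclidean(m_new, m_j):
    """Euclidean distance of Eq. (1) between two meta-feature vectors."""
    return float(np.sqrt(np.sum((np.asarray(m_new) - np.asarray(m_j)) ** 2)))

# Hypothetical meta-knowledge base: meta-features of known tasks and the
# best-performing pipeline recorded for each of them.
known_tasks = {
    "task_a": ([150.0, 4.0, 3.0], "RandomForest(max_depth=8)"),
    "task_b": ([1000.0, 20.0, 2.0], "SVC(C=1.0)"),
    "task_c": ([200.0, 5.0, 3.0], "KNN(k=5)"),
}

m_new = [180.0, 5.0, 3.0]  # meta-features of the new, unseen task

# Rank known tasks by similarity and transfer their best pipelines.
ranking = sorted(known_tasks.items(), key=lambda kv: euclidean(m_new, kv[1][0]))
for name, (mf, pipeline) in ranking:
    print(name, round(euclidean(m_new, mf), 1), pipeline)
```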

Fig. 5 Recommending phase workflow

Thus, when the meta-learning system is applied to a new dataset, the meta-model returns a ranking of the most suitable classification algorithms and configurations, based on its meta-feature values.
Ranking with the kNN classifier is a commonly used strategy to obtain top-K rankings. When a new dataset is presented to the meta-learning system, the kNN identifies the closest neighbors of the candidate dataset in the meta-knowledge base using the Euclidean distance measure (Eq. (1)). Based on this measure, a vector d = [d_1, d_2, ..., d_k] containing the dissimilarities between the datasets' characteristics (meta-features) is built, and a weighted average of the individual neighbors' performances is used to forecast the optimal pipeline configuration for the relevant measure.
For the Random Forest meta-model, we produce for each supported classification evaluation measure E a large labeled training set using the following process:
1. For each combination of d ∈ D and A(i) ∈ A, where D is the set of 360 learning datasets and A(i) a learning algorithm configuration from the 3 million evaluated configurations, we retrieve the set of all best predictive results R(Ei, d, A(i)) for each evaluation metric Ei (e.g., accuracy, F1-score, recall and precision).
2. For each d ∈ D, we label the learning algorithm configuration A(i) as Class 1 (top-performing algorithm configuration for the dataset) if its best predictive result for the dataset is greater than or equal to the highest performance achieved by all other algorithm configurations. Otherwise, we label A(i) as Class 0 (low-performing algorithm configuration) for the dataset.
3. For each combination of d ∈ D and A(i) ∈ A, we generate a joint set F = {F_d ∪ F_A(i)}, where:
- F_d: the dataset's meta-features generated in the learning step.
- F_A(i): a discrete feature describing the learning configuration A(i).
4. The joint meta-feature vectors F are used to fit the RF meta-model for the top-performing algorithm configurations, using the meta-feature variables as predictors and the learners' labels as targets of the meta-model.
For our meta-model, we have mainly been interested in optimizing the prediction recall of Class 1 (the classifier has the potential to be among the best-performing classifiers); we therefore considered different levels of the decision tree model's hyperparameter configuration. The main functionality of the meta-model can be formally defined as follows: given a learning algorithm space A, where A(i) is the i-th hyperparameter configuration of A, a dataset D divided into disjoint training (D_train) and validation (D_validation) sets, and an evaluation measure E, the goal is to identify the ML algorithm(s) A(i)*, where A(i)* ∈ A and A(i)* is a tuned version of A(i) that minimizes or maximizes E on D. Figure 6 presents the performances of the Random Forest and kNN meta-models on suggesting the best predictive pipeline configuration. The kNN-based meta-model clearly performs better than the Random Forest-based meta-learner according to the accuracy metric.
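Steps 1 to 4 of the labeling and fitting process can be sketched as below, assuming the best results per (dataset, configuration) pair are already available; all names and the toy values are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_meta_training_set(datasets_meta, best_results):
    """Steps 1-3: label each (dataset, configuration) pair.

    datasets_meta : {dataset_id: meta-feature vector F_d}
    best_results  : {dataset_id: {config_id: best score R(E, d, A_i)}}
    Returns joint feature vectors F = F_d U F_A(i) and 0/1 labels.
    """
    X, y = [], []
    for d, results in best_results.items():
        top = max(results.values())
        for cfg_id, score in results.items():
            # F_A(i): the configuration encoded as a discrete feature
            X.append(np.append(datasets_meta[d], cfg_id))
            # Class 1 iff the configuration reaches the best score on d
            y.append(1 if score >= top else 0)
    return np.array(X), np.array(y)

# Two toy datasets, two toy configurations each.
X, y = build_meta_training_set(
    {0: [3.0, 1.2], 1: [5.0, 0.7]},
    {0: {0: 0.91, 1: 0.84}, 1: {0: 0.78, 1: 0.88}},
)
# Step 4: fit the RF meta-model on the joint meta-feature vectors.
meta_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```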

The evaluation of robustness
In this evaluation, we investigate the performance that can be achieved by using the proposed recommending module on various manufacturing-related problems. We evaluate its ability to predict the ML algorithms, with related hyperparameter configurations, that shall provide the best analysis results. We benchmark on a highly varied selection of 30 datasets covering binary and multi-class classification problems from different Industry 4.0 levels, with sample sizes from 1,000 to 100,000 instances (a common sample size for general real-world datasets). These 30 fresh datasets were not previously exploited by any learning method during the offline phase of our framework. Table 6 shows the characteristics of a sample of the evaluation datasets in terms of number of classes, instances and covered tasks.
The proposed system uses the meta-model to rank all the pipelines in the meta-knowledge base with respect to the analyzed dataset and then returns its top-ranked pipelines according to the provided performance criteria. The top-ranked pipelines are then fitted on the dataset, which was split into train and test sets using a 70%/30% ratio. The results of this evaluation were used to compare the performance of AMLBID to that of the TPOT and Autosklearn state-of-the-art frameworks. It is important to highlight that the majority of state-of-the-art frameworks evaluate a set of pipelines by running them on the given dataset before the recommendation step, which demands a considerable computational budget that is not always available. They also take a huge amount of time in the majority of cases, which makes them impractical in real-world problems such as industrial contexts. AMLBID immediately produces a ranked list of potential top pipeline configurations using its meta-model and meta-knowledge base at an imperceptible computational cost (in terms of time and computational resources). The evaluation results are presented in Table 7.
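The 70%/30% evaluation protocol can be sketched as follows; the synthetic dataset and the gradient-boosting model are hypothetical stand-ins for a benchmark dataset and a recommended pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for one benchmark dataset (the real ones come from the
# 30-dataset benchmark) and one recommended pipeline.
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=0)  # 70% / 30% split

recommended = GradientBoostingClassifier(random_state=0)  # hypothetical top pipeline
recommended.fit(X_tr, y_tr)
acc = accuracy_score(y_te, recommended.predict(X_te))  # compared against baselines
```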
As shown in Table 7 and summarized in Table 8, the performances of AMLBID are comparable to, and even better than, those of the baselines, even though it does not run any pipeline on the dataset prior to the recommendation.
As state-of-the-art systems support only predictive accuracy as the performance measure of the recommended configuration, further robustness comparison on other performance measures such as Recall, F1 score and Precision could not be done, whereas in some concrete cases Recall or Precision may be more important and informative than predictive accuracy. The proposed system is therefore the first AutoML system to support different predictive performance measures (Precision, Recall, Accuracy and F1 score).
Beyond their black-box nature, one of the major shortcomings of AutoML solutions is their computational complexity, which necessitates a huge budget of time and resources. On the contrary, AMLBID has the advantage of O(1) computational complexity; consequently, it generates the recommendation in a negligible amount of time. Table 9 presents the running times of AMLBID, TPOT and Autosklearn on the benchmarked datasets.
Note that the rather "long" time taken by AMLBID on some massive datasets relates to the calculations made for extracting the dataset's characteristics (meta-features).

The explainer module
AMLExplainer is implemented as a client-server tool integrated with the recommender module. The server acts as the coordinator of the AutoML support system. As the client, the visual interface provides graphical interaction with the AutoML results and maps the summary data for visualization through a set of visual summary levels of the recommended models. We structure the explainability process in the proposed XAI framework into the following three phases:

Understanding phase This phase is the entry point of the proposed explainability workflow. For an ML expert, this step offers the information necessary to fit and deploy the recommended model(s). For a model user (supposedly a non-ML expert), it explains the recommended models and their functionalities by providing visual representations, descriptions and external information about the recommended model(s). In the current version of the prototype system, we implement this phase as the integration of information cards with the properties, summary and classification statistics of the recommended model(s), along with a report regarding the input data.

Diagnosis phase
In the proposed framework, we define the diagnosis phase as the most important part of the XAI workflow. It enables end users (experts or novices) to visually explore the internals of the recommended ML models and thus understand the reasoning behind the models' outputs. It offers a decision support tool that helps end users examine the attribution value of each input feature with respect to the model predictions, through what-if analysis and the decision path.
Refinement phase This phase presents interactive refinement recommendations. These are based on the recommended model architecture, findings from the previous stages (meta-knowledge base), and general heuristics (ANOVA [49]), to improve the predictive performance of the recommended model(s).
AMLExplainer users can explore the models provided by the AutoML process at four main levels of detail (i.e., AutoML Overview, Recommendation-level View, What-if analysis-level View, and Refinement-level View). Meanwhile, AMLExplainer provides end users with guidance to improve the predictive performance when AutoML returns unsatisfying results, thence increasing the transparency, controllability and acceptance of AutoML.
The workflow of the proposed auto-explanatory AutoML system consists of two main components:
- The AutoML component, which shows the high-level view of the AutoML process, from recommendations to refinements.
- The recommended configuration component, which allows users to inspect the recommended model's inner workings and decision-generation process (it includes the Recommendation-level and What-if analysis-level views).
The AutoML overview The AutoML overview level (Fig. 7) summarizes high-level information of the AutoML process. Users are able to compare and choose between the top-K recommended configurations. They can then focus their analysis on the top model configuration in the next level view, which highlights the corresponding algorithm in the detail views.

The recommendation-level view The recommendation-level view enables users to inspect recommendations with respect to the performance distribution. As shown in Fig. 8, a detailed explanation of the top-performing recommendation is generated through multiple granularity levels, such as statistics about the configuration performances (Fig. 8(A)) and a tree-based explanation of the conducted predictions (Fig. 8(B)).
By providing intelligible explanations of the process and reasoning behind an individual prediction, as illustrated in Fig. 8, the decision-maker, whether a manufacturing engineer or a machine learning practitioner, is much better positioned to make decisions, since he or she usually has prior knowledge about the data and the application domain that can be used to trust, accept or reject a prediction if the reasoning behind it is well explained.

Figure 9 is designed to investigate the machine learning models. It supports model understanding by enabling end users to investigate attribution values of individual input features in relation to the model predictions. Explaining the inner workings of a model helps users gain an understanding of what the model does and does not do. This is important so that they can develop an intuition for when the model is likely missing information and may have to be overruled, and can therefore explore scenarios, test and evaluate/validate business assumptions, and gain intuition for modifications.

Figure 10 shows the correlation between the performances and the hyperparameters of a recommended algorithm. To accomplish that, we take as input the performance data gathered with different hyperparameter settings of the algorithm (from the recommender module's knowledge base), fit a random forest to capture the relationship between hyperparameters and performances, and then apply functional analysis of variance (ANOVA) to assess how important each hyperparameter, and each low-order interaction of hyperparameters, is to performance. Guided by this in-depth analysis, end users have guidance to improve the predictive performance when AutoML returns unsatisfying results.
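The hyperparameter-importance analysis can be approximated as follows: a random forest is fitted on (hyperparameter settings, performance) records and its feature importances are read as a rough stand-in for the functional ANOVA decomposition used in the paper. The data below are synthetic, with `max_depth` constructed to dominate performance:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical performance records gathered from the knowledge base:
# columns = hyperparameter settings, target = achieved accuracy.
rng = np.random.default_rng(0)
max_depth = rng.integers(1, 11, size=200)
min_samples_split = rng.integers(2, 21, size=200)
# Accuracy depends on max_depth only (plus small noise), by construction.
accuracy = 0.6 + 0.03 * max_depth + rng.normal(0.0, 0.01, size=200)

H = np.column_stack([max_depth, min_samples_split])
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(H, accuracy)

# Importance of each hyperparameter for performance (a simple
# approximation of the functional ANOVA analysis).
for name, imp in zip(["max_depth", "min_samples_split"],
                     rf.feature_importances_):
    print(f"{name}: {imp:.2f}")
```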

The refinement-level view
The tool documentation and a detailed list of features, with an illustrative example, are available in the GitHub repository in the form of a Python package 5 .

Evaluation of user explanations
The following sections describe the evaluation methodology of the proposed auto-explainable AutoML tool. We draw global insights from the feedback that we received from the various target users.

Demonstration test case: application to manufacturing quality prediction
The proposed system is designed for machine learning-based predictive modeling problems. The major goal of the work has been to show the feasibility of achieving the maximum possible performance for a specific predictive modeling problem and of automatically explaining the results. Among these records, 74.39% are diagnosed as compliant products.

User interview
Participants and apparatus To evaluate the proposed white-box AutoML system as a decision support system, we conducted a semi-structured qualitative user study with two different groups of target users. These groups range from ML novices to experts (53% male and 47% female, aged between 24 and 38 years with an average age of 26.78 years). Among the ML users, 48% of the participants had particular knowledge in big industrial data analysis, while among the ML experts, 52% of the participants had experience in developing ML models for their domain problems. All participants had experience in machine learning or data analysis, but none had prior experience with AutoML. The evaluation studies were conducted on a set of dedicated computers equipped with an Intel Core i5-2400 at 3.10 GHz and 8 GB of DDR4 RAM.
Tasks and procedure The evaluation study began with a tutorial session in which the tasks and the usage of the self-explainable AutoML system were introduced to the participants. Participants were asked to complete the Post-Study System Usability Questionnaire (PSSUQ), third version [50], after using the explanatory module on the three tasks: understand, diagnose, and refine.
The PSSUQ-3 is a 16-item measure (shown in the questionnaire sheet, Fig. 12 of the Appendix). The questionnaire consists of an overall satisfaction scale (the mean of items 1 through 16) and three subscales. The system usefulness subscale assesses the ease of learning and using the system (the mean of items 1 through 6). The information quality subscale evaluates the feedback provided by the system to the user (the mean of items 7 through 12). Finally, the interface quality subscale quantifies the familiarity of the user with the system, i.e., whether the system has met the expected functionality (the mean of items 13 through 16). For all scales, the rating range is 1 to 7; the lower the score, the higher the satisfaction with the tool.
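The PSSUQ-3 scale scores described above reduce to simple item means; a minimal sketch with hypothetical ratings:

```python
def pssuq_scores(items):
    """Compute PSSUQ-3 scale scores from the 16 item ratings (1-7).
    Lower scores indicate higher satisfaction."""
    assert len(items) == 16
    mean = lambda xs: sum(xs) / len(xs)
    return {
        "overall": mean(items),                    # items 1-16
        "system_usefulness": mean(items[0:6]),     # items 1-6
        "information_quality": mean(items[6:12]),  # items 7-12
        "interface_quality": mean(items[12:16]),   # items 13-16
    }

# Hypothetical ratings from one participant:
scores = pssuq_scores([1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1])
```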
The results of the usability questionnaire are summarized in Table 10 and Fig. 11. The PSSUQ overall and subscale scores were extremely positive, with an overall mean score of 1.53 (standard deviation 0.71) and a range from strongly agree to neutral (1 to 5). The mean system usefulness, information quality, and interface quality subscale scores were 1.74, 1.09, and 1.22, respectively.
As shown in Fig. 11, most of the participants agreed that the auto-explainable AutoML DSS is easy to learn and use. Among them, 80% strongly agreed that they are confident in their recommended model(s). We also conducted semi-structured post-study interviews to gather more detailed feedback from the participants. The interviews reflect the difference between the initial expectations and the experience during the pair analytics regarding the workflow of the system.
We collected the participants' feedback about the AutoML module, first as a black-box decision support system assisting the experts to choose and configure ML models for their problems, and afterwards with the entire system (recommendation module and explanation module). Based on their feedback, we summarize two main appreciations of the proposed tool:
- AutoML can help stakeholders (neophytes as well as experts) improve the application of machine learning algorithms. AutoML enables quick experimentation with a large number of models and configurations, whose results could provide useful knowledge to ML researchers and domain practitioners, and illustrates the importance of hyperparameter tuning for ML algorithms. The participants highlighted the fact that being able to match prior knowledge about machine learning to the visualizations produced by AMLExplainer creates confidence in the underlying AutoML process and increases the likelihood of adopting AutoML.
- The participants appreciated the human-machine interaction introduced in AMLExplainer. They observed that such interaction could improve an AutoML process, enhance the user experience and make such powerful black boxes trustworthy. One of the experts commented: "Users with more domain knowledge, such as myself, are usually critical of automated methods and like to be in control. I do not like getting a score back and hearing trust me".
Overall, the feedback on the system remains positive. In addition, the users provided several suggestions for complementary features. For the understanding and diagnosis tasks, in addition to the provided explanation levels, users wanted to gain insight into the underlying data; such an exploratory data analysis feature [51] is an integral part of any knowledge discovery process. For example, a data profiling level could review the dataset's characteristics and quality and show them to the user. For results reporting, the feedback is mostly unanimous: all participants liked the code export function of the recommended ML pipeline and aspire to use it to communicate their results. Furthermore, regarding code export, participants came up with several suggestions for enhancing this feature, such as the possibility to store/export all the explanation levels as a PDF report. For the refinement task, there are mixed feedback and expectations. Most of the participants are optimistic and suggest additional ways to interactively refine the recommended model(s). Rather than static guidance content, some non-ML experts asked for further guidance to select the appropriate refinements. Moreover, the ideas to enhance this functionality include proposing code fragments, providing building blocks, or even scaling it up to a click-to-refine functionality. In the future, the current system should be extended with the suggested refinement methods and additional guidance for selecting the appropriate refinements.
Our documentation of the real-world evaluation case illustrates how to overcome the transparency problems of AutoML systems as decision support systems, for instance the absence of human interaction and of analysis of the inner workings and reasoning of such tools. This could extend the use of, and trust in, intelligent AutoML systems to areas where they have so far been neglected due to the insistence on comprehensible models. Separating the automatic selection and configuration of machine learning algorithms from model explanation is another benefit of an expert and intelligent AutoML DSS.

Conclusion
There has been significant progress in democratizing the application of ML for non-experts of data analysis by providing them with "off-the-shelf" solutions. However, these powerful support systems fail to provide detailed instructions about the recommended configurations and the inner workings of the resulting models, thence making them highly performant but less trustworthy black boxes. In this work, we presented a novel transparent and self-explained AutoML system, along with an interactive visualization module, that supports machine learning experts and neophytes in analyzing the results of an AutoML DSS.
To our knowledge, the proposed system is the first application of general explanation methods to AutoML systems as decision support systems. We explore several levels of explanations, ranging from individual decisions to the entire model's recommendations and predictions. The explanations of the prediction models and the what-if analysis proved to be an effective support for manufacturing-related problems. A set of evaluations demonstrates the utility and usability of AMLBID in a real-world manufacturing problem. We show how powerful black-box ML systems can be made transparent and help domain experts to iteratively evaluate and update their beliefs. Based on the promising findings presented in this paper, further validation of the proposed framework in other real-world applications, with a larger and more diverse group of users, shall improve the visualization and presentation of explanations. At present, we are planning to expand AMLBID to support algorithms of regression, deep learning and distributed ML libraries (e.g., Spark ML [52]), since we are dealing with Big Industrial Data.

Hyperparameter | Range/values | Description
Degree | [2, 3] | Degree for the 'poly' kernel.
min_samples_leaf | [1, 20] | The minimal number of data points required in order to create a leaf.
min_samples_split | [2, 20] | The minimal number of data points required to split an internal node.
imputation | {mean, median, mode} | Strategy for imputing missing numeric variables.
split criterion | {entropy, gini} | Function to determine the quality of a possible split.
learning_rate | — | Shrinks the contribution of each classifier.
max_depth | [1, 11] | The maximal depth of the decision trees; controls their complexity.
— | — | Number of features to consider when computing the best node split.
min_samples_leaf | [1, 21] | The minimum number of samples required to be at a leaf node.
min_samples_split | [2, 21] | The minimum number of samples required to split an internal node.
criterion | {entropy, gini} | Function used to measure the quality of a split.
— | [50, 501] | Number of decision trees in the ensemble.

Fig. 12
The Post-Study System Usability Questionnaire