Machine learning has become a key discipline that can be applied across many domains to solve complicated problems. In this section, we review existing Auto-ML systems and frameworks that automate the machine learning process, giving a brief summary of each. Each framework performs one or more tasks in order to automate part of, or the full, machine learning process. In recent years, several automated machine learning frameworks have been developed on top of centralized machine learning packages.
Auto-WEKA [28] is an Auto-ML framework implemented on top of the popular data mining tool WEKA [29]. Auto-WEKA was the first Auto-ML framework to automate the machine learning process by automating both the selection of the machine learning algorithm and the tuning of its hyper-parameters. These two steps later became known as the Combined Algorithm Selection and Hyper-parameter optimization (CASH) problem. Auto-WEKA uses the SMAC [30] optimization algorithm and relies on feature selection algorithms implemented in WEKA to solve the CASH problem. Auto-WEKA 2.0 [31], the new release of Auto-WEKA, was developed to be integrated within the WEKA ecosystem rather than being a standalone piece of software, and it supports regression problems in addition to classification problems.
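The CASH problem can be stated formally following the standard Auto-WEKA formulation: given a set of algorithms $\mathcal{A} = \{A^{(1)}, \ldots, A^{(R)}\}$ with associated hyper-parameter spaces $\Lambda^{(1)}, \ldots, \Lambda^{(R)}$, choose the algorithm and hyper-parameter setting that minimize the average loss over $k$ cross-validation folds:

$$
A^{*}, \lambda^{*} \in \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\!\left(A^{(j)}_{\lambda},\; \mathcal{D}^{(i)}_{\mathrm{train}},\; \mathcal{D}^{(i)}_{\mathrm{valid}}\right)
$$

where $\mathcal{L}(A^{(j)}_{\lambda}, \mathcal{D}^{(i)}_{\mathrm{train}}, \mathcal{D}^{(i)}_{\mathrm{valid}})$ denotes the loss of algorithm $A^{(j)}$ with hyper-parameters $\lambda$ when trained on the $i$-th training fold and evaluated on the corresponding validation fold.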
Auto-sklearn [32] is the most popular of the existing Auto-ML frameworks because it was developed on top of the most popular Python machine learning library, Scikit-Learn [20]. Moreover, given the widespread adoption of Python among data science developers, both Scikit-Learn and Auto-sklearn have received great attention. Auto-sklearn also addresses the CASH problem tackled by Auto-WEKA, but it introduces several improvements, such as meta-learning [33]. The meta-learning approach recognizes the characteristics of a dataset and recommends the machine learning pipeline that fits those characteristics. Auto-sklearn characterizes datasets by computing a set of meta-features such as the number of instances, number of features, number of classes, and data skewness [21]. To construct this base of meta-features, the training stage of Auto-sklearn evaluates 140 different datasets from the OpenML repository [34]: the meta-features are calculated, and then Bayesian optimization [35] is applied to determine and store the best-performing machine learning pipeline for each dataset. Given a new dataset, its meta-features are calculated and compared, using the L1 distance, with the meta-features of the training datasets in order to rank them and retrieve the stored machine learning pipelines of the nearest 25 datasets. The second improvement introduced by Auto-sklearn is the construction of automated ensemble models: instead of discarding the models considered during the training stage, Auto-sklearn stores them all and applies a post-processing method to build an ensemble out of them, which helps avoid overfitting. According to the study conducted in [36], Auto-sklearn outperformed Auto-WEKA in almost 86% of the cases.
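The warm-starting step described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not Auto-sklearn's actual code: the meta-feature set and knowledge-base layout are simplified assumptions, and real meta-feature vectors would normally be normalized before computing distances.

```python
# Toy sketch of Auto-sklearn's meta-learning warm start: rank previously
# seen datasets by L1 distance between meta-feature vectors and return
# the stored best pipelines of the nearest ones.

def meta_features(n_instances, n_features, n_classes, skewness):
    """A toy meta-feature vector; Auto-sklearn computes many more."""
    return [n_instances, n_features, n_classes, skewness]

def nearest_pipelines(new_mf, knowledge_base, k=25):
    """knowledge_base maps dataset name -> (meta_features, best_pipeline).
    Returns the stored pipelines of the k datasets closest in L1 distance."""
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    ranked = sorted(knowledge_base.items(),
                    key=lambda item: l1(new_mf, item[1][0]))
    return [pipeline for _, (mf, pipeline) in ranked[:k]]

kb = {
    "iris":   (meta_features(150, 4, 3, 0.1), "pipeline_A"),
    "digits": (meta_features(1797, 64, 10, 0.0), "pipeline_B"),
    "large":  (meta_features(100000, 200, 2, 1.5), "pipeline_C"),
}
warm_starts = nearest_pipelines(meta_features(160, 5, 3, 0.2), kb, k=2)
print(warm_starts)  # → ['pipeline_A', 'pipeline_B'], nearest first
```

In Auto-sklearn these retrieved pipelines seed the Bayesian optimizer, so the search starts from configurations that already worked well on similar datasets.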
TPOT [37] was initially developed for biomedical data science, but it was later adapted to handle any machine learning problem. TPOT is an open-source Auto-ML framework that automates the machine learning process by handling feature preprocessing, model selection, and hyper-parameter optimization. Similar to Auto-sklearn, TPOT gained popularity for being built on top of the popular Python machine learning library Scikit-Learn. TPOT uses the genetic programming algorithm described in [38] to construct genetic programming trees that combine the algorithms used in machine learning pipelines. TPOT utilizes different algorithms for each task: for feature preprocessing, algorithms such as Principal Component Analysis (PCA) and scalers; for feature selection, algorithms such as Recursive Feature Elimination (RFE) [39] and variance thresholds; and for classification, algorithms such as K-Nearest Neighbors (KNN), Decision Tree, and Random Forest. In general, TPOT is one of the most popular Auto-ML frameworks today.
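The genetic-programming idea behind TPOT can be illustrated with a minimal toy sketch. The operator names below come from the paragraph above, but the representation (a flat list of steps rather than TPOT's actual expression trees) and the mutation operators are simplifying assumptions, not TPOT's API:

```python
import random

# Toy sketch of GP-style pipeline search: a candidate pipeline is a
# sequence of preprocessing operators ending in a classifier, and new
# candidates arise by mutating (insert/delete/replace) the step list.

PREPROCESSORS = ["PCA", "Scaler", "RFE", "VarianceThreshold"]
CLASSIFIERS = ["KNN", "DecisionTree", "RandomForest"]

def random_pipeline(rng):
    steps = rng.sample(PREPROCESSORS, rng.randint(0, 2))
    return steps + [rng.choice(CLASSIFIERS)]

def mutate(pipeline, rng):
    """Insert, delete, or replace one preprocessing step."""
    steps, clf = pipeline[:-1], pipeline[-1]
    op = rng.choice(["insert", "delete", "replace"])
    if op == "insert" or not steps:
        steps.insert(rng.randrange(len(steps) + 1), rng.choice(PREPROCESSORS))
    elif op == "delete":
        steps.pop(rng.randrange(len(steps)))
    else:
        steps[rng.randrange(len(steps))] = rng.choice(PREPROCESSORS)
    return steps + [clf]

rng = random.Random(0)
population = [random_pipeline(rng) for _ in range(5)]
offspring = [mutate(p, rng) for p in population]
# Every candidate still ends with exactly one classifier.
assert all(p[-1] in CLASSIFIERS for p in population + offspring)
```

In the real framework, each candidate would be scored by cross-validated accuracy and the best candidates would be selected and recombined over many generations; this sketch shows only the variation step.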
Although the three frameworks mentioned above are the dominant open-source Auto-ML frameworks, many other frameworks have been proposed in the last few years.
H2O [40] is a distributed automated machine learning framework that automates the training of a large variety of machine learning models. The H2O training step is executed on a server and can be accessed through APIs in different programming languages such as R, Python, Java, and Scala. H2O automates feature engineering, data preprocessing, model selection, and hyper-parameter optimization, and it adopts fast random search and stacked ensembles to optimize the recommended pipeline.
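The stacked-ensemble idea can be sketched minimally in pure Python. This is an illustrative sketch, not H2O's API: H2O's actual stacked ensembles train a metalearner over cross-validated base-model predictions, whereas here we simply pick a blend weight for two base models by grid search on held-out data.

```python
# Minimal stacking sketch: blend two base models' held-out predictions
# with a weight chosen to minimize validation mean squared error.

def blend(p1, p2, w):
    return [w * a + (1 - w) * b for a, b in zip(p1, p2)]

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def fit_blend_weight(p1, p2, truth, grid_steps=101):
    """Grid-search the blend weight w in [0, 1] minimizing validation MSE."""
    return min((i / (grid_steps - 1) for i in range(grid_steps)),
               key=lambda w: mse(blend(p1, p2, w), truth))

# Held-out predictions of two hypothetical base models.
truth   = [1.0, 2.0, 3.0, 4.0]
model_a = [1.1, 2.1, 2.9, 4.2]   # accurate base model
model_b = [2.0, 2.0, 2.0, 2.0]   # weak constant predictor
w = fit_blend_weight(model_a, model_b, truth)
assert w > 0.5  # the stronger base model dominates the blend
```

The design point this illustrates is that the ensemble's weights are fitted on data the base models did not train on, which is what lets stacking improve on any single base model without simply overfitting.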
LightAutoML [41] is an open-source Auto-ML framework that was proposed to serve the financial sector. It automates machine learning pipeline steps such as feature engineering, data preprocessing, model selection, and hyper-parameter optimization.
AMLBID [42] is an open-source, meta-learning-based Auto-ML framework proposed to automate the building of machine learning models over industrial data. ATM [43] is an example of a distributed and scalable Auto-ML framework that lets machine learning users upload their datasets, choose among machine learning algorithms, and define a search space for hyper-parameters. ATM then recommends the optimal machine learning pipeline for those datasets by utilizing a Bayesian optimization system and meta-learning techniques.
ML-Plan [44] is another Auto-ML framework; it relies on hierarchical task networks (HTNs) [45] and automates the machine learning process through automating the algorithm selection and algorithm configuration tasks.
AlphaD3M [46] is an Auto-ML framework that uses reinforcement learning to optimize the machine learning pipeline. In AlphaD3M, model discovery (model recommendation) is achieved by performing iterative experiments with different machine learning pipelines, which are generated by inserting, deleting, or replacing different pipeline parts. The optimal pipeline is eventually identified among these trained pipelines, since all actions and decisions taken during the search are recorded.
Additionally, as R is one of the most popular programming and statistical languages in data science, SmartML was proposed as the first R-based Auto-ML framework for automating classification problems [47]. The automation process in SmartML is divided into two phases. In the first phase, a knowledge base of meta-features and algorithm performance across different training datasets is constructed. In the second phase, for a given dataset, the meta-features are extracted and compared with those stored in the framework's knowledge base; the nearest-neighbor technique is then used to identify similar datasets in the knowledge base. The retrieved datasets are used to identify the best-performing algorithms on them, and hence to recommend the dominant algorithm. SmartML relies on SMAC Bayesian optimization [16] for the hyper-parameter tuning step.
Moreover, several cloud-based platforms have considered automating the machine learning process, taking advantage of the high computational power that characterizes cloud environments, which facilitates running experiments with different machine learning algorithms over a wide range of hyper-parameters. For example, Google Auto-ML [23] is a cloud-based service provided by Google Cloud to automate the machine learning process for machine learning experts as well as users with less technical background. Google Auto-ML provides a wide range of machine learning algorithms for different tasks such as traditional machine learning, natural language processing (NLP), and computer vision. For traditional machine learning tasks on tabular structured data, the Auto-ML Tables service automates machine learning pipeline tasks such as feature engineering, model selection, and hyper-parameter tuning. For natural language processing tasks, the Auto-ML Natural Language and Auto-ML Translation services handle text in tasks like text analysis, language detection, sentiment analysis, and text similarity. For computer vision tasks, the Auto-ML Vision and Video Intelligence services extract insights from visual data, including object detection, image classification, and object localization.
Moreover, Azure Auto-ML [24] is another cloud-based service, provided by Microsoft, that automates both classification and regression tasks. Azure Auto-ML relies on Bayesian optimization and collaborative filtering to search for the optimal machine learning pipeline for a given dataset, and it builds on the search space of learning algorithms in Scikit-Learn. Amazon SageMaker [22] is a cloud-based service provided by Amazon to automate the machine learning process; among the different Auto-ML cloud platforms, it has seen wide adoption in the last few years. Amazon SageMaker provides users with a wide range of machine learning and deep learning frameworks. Moreover, machine learning models can be deployed on auto-scaling clusters across multiple zones, guaranteeing high availability and high performance during online predictions. Furthermore, Amazon provides a large set of pre-trained models for different tasks such as recommendation systems, image classification, text analysis, and voice recognition.
In addition, Auto-AI [25] is a cloud-based service provided by IBM to automate machine learning classification and regression tasks. Auto-AI automates the machine learning process through a set of steps: first, it identifies the best model for the given dataset (model selection); second, it applies a feature selection step to keep only the features that support the problem and eliminate the rest; finally, it examines a wide range of hyper-parameters. Auto-AI then recommends the best-performing machine learning pipeline based on metrics such as accuracy and precision.
Generally, automated machine learning has been broadly adopted in various domains such as healthcare [48] [49] [50] [51] [52], finance [53] [41] [54], transportation [55] [56], and manufacturing [57] [42]. This wide spectrum of applications increases the need for more work and research to enhance the capabilities of existing Auto-ML approaches.
Table 1 Comparison between State-of-the-art Automated Machine Learning Frameworks

| | Auto-sklearn | Auto-Weka | Smart-ML | TPOT | H2O | AMLBID | LightAutoML |
|---|---|---|---|---|---|---|---|
| Language | Python | Java | R | Python | Python (also provides Java/R APIs) | Python | Python |
| Training framework | On top of Scikit-Learn | On top of Weka | On top of R | On top of Scikit-Learn | On top of Scikit-Learn | On top of Scikit-Learn | On top of Scikit-Learn |
| Supported data type | Tabular | Tabular | Tabular | Tabular | Tabular, text, images | Tabular | Tabular |
| Supported operating systems | Linux | Windows, Linux, Mac | Windows, Linux, Mac | Windows, Linux, Mac | Windows, Linux, Mac | Windows, Linux, Mac | Windows, Linux, Mac |
| Uses meta-learning | No | No | Yes | Yes | No | Yes | No |
| Supports ensembling | Yes | Yes | Yes | No | Yes | Yes | Only linear models and GBMs |
| Feature preprocessing | Yes | Yes | Yes | No | Yes | No | Yes |
| Performs well with imbalanced datasets | No | No | No | No | No | No | No |
Table 1 shows a feature comparison between the most recent and popular state-of-the-art frameworks. While all of these Auto-ML frameworks provide partial or complete ML pipeline automation, each one works differently and targets different algorithms or dataset structures. Although some frameworks consider the meta-learning approach and ensembling algorithms, some challenges still exist.
Challenge 1: these frameworks are limited to simple and balanced datasets.
Challenge 2: performing Auto-ML experiments using these frameworks requires high computational power.
In the next section, we present our proposed Auto-ML framework, which tackles these two challenges.