Machine learning has become a key discipline that can be applied across many domains to solve complicated problems. In this section, we review existing Auto-ML systems and frameworks that automate the machine learning process, giving a brief summary of each. Each framework performs one or more tasks in order to automate part of, or the full, machine learning process. In recent years, several automated machine learning frameworks have been developed on top of centralized machine learning packages.
Auto-WEKA [28] is an Auto-ML framework implemented on top of the popular data mining tool WEKA [29]. Auto-WEKA was the first Auto-ML framework to automate the machine learning process by automating both the selection of the machine learning algorithm and the tuning of its hyper-parameters. These two steps later became known as the Combined Algorithm Selection and Hyper-parameter optimization (CASH) problem. Auto-WEKA uses the SMAC [30] optimization algorithm and relies on feature selection algorithms implemented in WEKA to solve the CASH problem. Auto-WEKA 2.0 [31], the new release of Auto-WEKA, was developed to be integrated within the WEKA ecosystem rather than being a standalone piece of software, and it supports regression problems in addition to classification problems.
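The CASH problem can be stated formally following the standard Auto-WEKA formulation: given a set of algorithms $\mathcal{A} = \{A^{(1)}, \ldots, A^{(R)}\}$ with associated hyper-parameter spaces $\Lambda^{(1)}, \ldots, \Lambda^{(R)}$, choose the algorithm and hyper-parameter setting that minimize the average loss over $k$ cross-validation folds:

$$
A^{*}, \lambda^{*} \in \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\!\left(A^{(j)}_{\lambda},\; \mathcal{D}^{(i)}_{\mathrm{train}},\; \mathcal{D}^{(i)}_{\mathrm{valid}}\right)
$$

where $\mathcal{L}(A^{(j)}_{\lambda}, \mathcal{D}^{(i)}_{\mathrm{train}}, \mathcal{D}^{(i)}_{\mathrm{valid}})$ denotes the loss of algorithm $A^{(j)}$ with hyper-parameters $\lambda$ when trained on the $i$-th training fold and evaluated on the corresponding validation fold.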
Auto-sklearn [32] is the most popular of the existing Auto-ML frameworks because it was developed on top of the most popular Python machine learning library, Scikit-Learn [20]. Moreover, given the widespread adoption of Python among data science developers, both Scikit-Learn and Auto-sklearn have received great attention. Auto-sklearn also addresses the CASH problem tackled by Auto-WEKA, but it introduces several improvements, such as meta-learning [33]. The meta-learning approach recognizes the characteristics of a dataset and recommends the machine learning pipeline that fits those characteristics. Auto-sklearn characterizes datasets by computing a set of meta-features such as the number of instances, number of features, number of classes, and data skewness [21]. To construct this base of meta-features, the training stage of Auto-sklearn evaluates 140 different datasets from the OpenML repository [34]: the meta-features are calculated, and then Bayesian optimization [35] is applied to determine and store the best-performing machine learning pipeline for each dataset. Given a new dataset, its meta-features are calculated and compared, using the L1 distance, with the meta-features of the training datasets in order to rank them and retrieve the stored machine learning pipelines of the nearest 25 datasets. The second improvement introduced by Auto-sklearn is the construction of automated ensemble models: instead of discarding the models considered during the training stage, Auto-sklearn stores them all and applies a post-processing method to build an ensemble out of them, which helps avoid overfitting. According to the study conducted in [36], Auto-sklearn outperformed Auto-WEKA in almost 86% of the cases.
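The warm-starting step described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not Auto-sklearn's actual code: the meta-feature set and knowledge-base layout are simplified assumptions, and real meta-feature vectors would normally be normalized before computing distances.

```python
# Toy sketch of Auto-sklearn's meta-learning warm start: rank previously
# seen datasets by L1 distance between meta-feature vectors and return
# the stored best pipelines of the nearest ones.

def meta_features(n_instances, n_features, n_classes, skewness):
    """A toy meta-feature vector; Auto-sklearn computes many more."""
    return [n_instances, n_features, n_classes, skewness]

def nearest_pipelines(new_mf, knowledge_base, k=25):
    """knowledge_base maps dataset name -> (meta_features, best_pipeline).
    Returns the stored pipelines of the k datasets closest in L1 distance."""
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    ranked = sorted(knowledge_base.items(),
                    key=lambda item: l1(new_mf, item[1][0]))
    return [pipeline for _, (mf, pipeline) in ranked[:k]]

kb = {
    "iris":   (meta_features(150, 4, 3, 0.1), "pipeline_A"),
    "digits": (meta_features(1797, 64, 10, 0.0), "pipeline_B"),
    "large":  (meta_features(100000, 200, 2, 1.5), "pipeline_C"),
}
warm_starts = nearest_pipelines(meta_features(160, 5, 3, 0.2), kb, k=2)
print(warm_starts)  # → ['pipeline_A', 'pipeline_B'], nearest first
```

In Auto-sklearn these retrieved pipelines seed the Bayesian optimizer, so the search starts from configurations that already worked well on similar datasets.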
TPOT [37] was initially developed for biomedical data science, but it was later adapted to handle any machine learning problem. TPOT is an open-source Auto-ML framework that automates the machine learning process by handling feature preprocessing, model selection, and hyper-parameter optimization. Similar to Auto-sklearn, TPOT gained popularity for being built on top of the popular Python machine learning library Scikit-Learn. TPOT uses the genetic programming algorithm described in [38] to construct genetic programming trees that combine the algorithms used in machine learning pipelines. TPOT utilizes different algorithms for each task: for feature preprocessing, algorithms such as Principal Component Analysis (PCA) and scalers; for feature selection, algorithms such as Recursive Feature Elimination (RFE) [39] and variance thresholds; and for classification, algorithms such as K-Nearest Neighbors (KNN), Decision Tree, and Random Forest. In general, TPOT is one of the most popular Auto-ML frameworks today.
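The genetic-programming idea behind TPOT can be illustrated with a minimal toy sketch. The operator names below come from the paragraph above, but the representation (a flat list of steps rather than TPOT's actual expression trees) and the mutation operators are simplifying assumptions, not TPOT's API:

```python
import random

# Toy sketch of GP-style pipeline search: a candidate pipeline is a
# sequence of preprocessing operators ending in a classifier, and new
# candidates arise by mutating (insert/delete/replace) the step list.

PREPROCESSORS = ["PCA", "Scaler", "RFE", "VarianceThreshold"]
CLASSIFIERS = ["KNN", "DecisionTree", "RandomForest"]

def random_pipeline(rng):
    steps = rng.sample(PREPROCESSORS, rng.randint(0, 2))
    return steps + [rng.choice(CLASSIFIERS)]

def mutate(pipeline, rng):
    """Insert, delete, or replace one preprocessing step."""
    steps, clf = pipeline[:-1], pipeline[-1]
    op = rng.choice(["insert", "delete", "replace"])
    if op == "insert" or not steps:
        steps.insert(rng.randrange(len(steps) + 1), rng.choice(PREPROCESSORS))
    elif op == "delete":
        steps.pop(rng.randrange(len(steps)))
    else:
        steps[rng.randrange(len(steps))] = rng.choice(PREPROCESSORS)
    return steps + [clf]

rng = random.Random(0)
population = [random_pipeline(rng) for _ in range(5)]
offspring = [mutate(p, rng) for p in population]
# Every candidate still ends with exactly one classifier.
assert all(p[-1] in CLASSIFIERS for p in population + offspring)
```

In the real framework, each candidate would be scored by cross-validated accuracy and the best candidates would be selected and recombined over many generations; this sketch shows only the variation step.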
Although the three frameworks mentioned above are the dominant open-source Auto-ML frameworks, many other frameworks have been proposed in the last few years.
H2O [40] is a distributed automated machine learning framework that automates the training of a large variety of machine learning models. The H2O training step is executed on a server and can be accessed through APIs in different programming languages such as R, Python, Java, and Scala. H2O automates feature engineering, data preprocessing, model selection, and hyper-parameter optimization, and it adopts fast random search and stacked ensembles to optimize the recommended pipeline.
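The stacked-ensemble idea can be sketched minimally in pure Python. This is an illustrative sketch, not H2O's API: H2O's actual stacked ensembles train a metalearner over cross-validated base-model predictions, whereas here we simply pick a blend weight for two base models by grid search on held-out data.

```python
# Minimal stacking sketch: blend two base models' held-out predictions
# with a weight chosen to minimize validation mean squared error.

def blend(p1, p2, w):
    return [w * a + (1 - w) * b for a, b in zip(p1, p2)]

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def fit_blend_weight(p1, p2, truth, grid_steps=101):
    """Grid-search the blend weight w in [0, 1] minimizing validation MSE."""
    return min((i / (grid_steps - 1) for i in range(grid_steps)),
               key=lambda w: mse(blend(p1, p2, w), truth))

# Held-out predictions of two hypothetical base models.
truth   = [1.0, 2.0, 3.0, 4.0]
model_a = [1.1, 2.1, 2.9, 4.2]   # accurate base model
model_b = [2.0, 2.0, 2.0, 2.0]   # weak constant predictor
w = fit_blend_weight(model_a, model_b, truth)
assert w > 0.5  # the stronger base model dominates the blend
```

The design point this illustrates is that the ensemble's weights are fitted on data the base models did not train on, which is what lets stacking improve on any single base model without simply overfitting.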
LightAutoML [41] is an open-source Auto-ML framework that was proposed to serve the financial sector. It automates machine learning pipeline steps such as feature engineering, data preprocessing, model selection, and hyper-parameter optimization.
AMLBID [42] is an open-source, meta-learning-based Auto-ML framework proposed to automate the building of machine learning models over industrial data. ATM [43] is an example of a distributed and scalable Auto-ML framework that lets machine learning users upload their datasets, choose among machine learning algorithms, and define a search space for hyper-parameters. ATM then recommends the optimal machine learning pipeline for those datasets by utilizing a Bayesian optimization system and meta-learning techniques.
ML-Plan [44] is another Auto-ML framework; it relies on hierarchical task networks (HTNs) [45] and automates the machine learning process through automating the algorithm selection and algorithm configuration tasks.
AlphaD3M [46] is an Auto-ML framework that uses reinforcement learning to optimize the machine learning pipeline. In AlphaD3M, model discovery (model recommendation) is achieved by performing iterative experiments with different machine learning pipelines, which are generated by inserting, deleting, or replacing different pipeline parts. The optimal pipeline is eventually identified among these trained pipelines, since all actions and decisions taken during the search are recorded.
Additionally, as R is one of the most popular programming and statistical languages in data science, SmartML was proposed as the first R-based Auto-ML framework for automating classification problems [47]. The automation process in SmartML is divided into two phases. In the first phase, a knowledge base of meta-features and algorithm performance across different training datasets is constructed. In the second phase, for a given dataset, the meta-features are extracted and compared with those stored in the framework's knowledge base; the nearest-neighbor technique is then used to identify similar datasets in the knowledge base. The retrieved datasets are used to identify the best-performing algorithms on them, and hence to recommend the dominant algorithm. SmartML relies on SMAC Bayesian optimization [16] for the hyper-parameter tuning step.
Moreover, several cloud-based platforms have considered automating the machine learning process, taking advantage of the high computational power that characterizes cloud environments, which facilitates running experiments with different machine learning algorithms over a wide range of hyper-parameters. For example, Google Auto-ML [23] is a cloud-based service provided by Google Cloud to automate the machine learning process for machine learning experts as well as users with less technical background. Google Auto-ML provides a wide range of machine learning algorithms for different tasks such as traditional machine learning, natural language processing (NLP), and computer vision. For traditional machine learning tasks on tabular structured data, the Auto-ML Tables service automates machine learning pipeline tasks such as feature engineering, model selection, and hyper-parameter tuning. For natural language processing tasks, the Auto-ML Natural Language and Auto-ML Translation services handle text in tasks like text analysis, language detection, sentiment analysis, and text similarity. For computer vision tasks, the Auto-ML Vision and Video Intelligence services extract insights from visual data, including object detection, image classification, and object localization.
Moreover, Azure Auto-ML [24] is another cloud-based service, provided by Microsoft, that automates both classification and regression tasks. Azure Auto-ML relies on Bayesian optimization and collaborative filtering to search for the optimal machine learning pipeline for a given dataset, and it builds on the search space of learning algorithms in Scikit-Learn. Amazon SageMaker [22] is a cloud-based service provided by Amazon to automate the machine learning process; among the different Auto-ML cloud platforms, it has seen wide adoption in the last few years. Amazon SageMaker provides users with a wide range of machine learning and deep learning frameworks. Moreover, machine learning models can be deployed on auto-scaling clusters across multiple zones, guaranteeing high availability and high performance during online predictions. Furthermore, Amazon provides a large set of pre-trained models for different tasks such as recommendation systems, image classification, text analysis, and voice recognition.
In addition, Auto-AI [25] is a cloud-based service provided by IBM to automate machine learning classification and regression tasks. Auto-AI automates the machine learning process through a set of steps: first, it identifies the best model for the given dataset (model selection); second, it applies a feature selection step to keep only the features that support the problem and eliminate the rest; finally, it examines a wide range of hyper-parameters. Auto-AI then recommends the best-performing machine learning pipeline based on metrics such as accuracy and precision.
Generally, automated machine learning has been broadly adopted in various domains such as healthcare [48] [49] [50] [51] [52], finance [53] [41] [54], transportation [55] [56], and manufacturing [57] [42]. This wide spectrum of applications increases the need for more work and research to enhance the capabilities of existing Auto-ML approaches.
Table 1 Comparison between State-of-the-art Automated Machine Learning Frameworks

| | Auto-sklearn | Auto-Weka | Smart-ML | TPOT | H2O | AMLBID | LightAutoML |
|---|---|---|---|---|---|---|---|
| Language | Python | Java | R | Python | Python (also provides Java/R APIs) | Python | Python |
| Training framework | On top of Scikit-Learn | On top of Weka | On top of R | On top of Scikit-Learn | On top of Scikit-Learn | On top of Scikit-Learn | On top of Scikit-Learn |
| Supported data type | Tabular | Tabular | Tabular | Tabular | Tabular, text, images | Tabular | Tabular |
| Supported operating systems | Linux | Windows, Linux, Mac | Windows, Linux, Mac | Windows, Linux, Mac | Windows, Linux, Mac | Windows, Linux, Mac | Windows, Linux, Mac |
| Uses meta-learning | No | No | Yes | Yes | No | Yes | No |
| Supports ensembling | Yes | Yes | Yes | No | Yes | Yes | Only linear models and GBMs |
| Feature preprocessing | Yes | Yes | Yes | No | Yes | No | Yes |
| Performs well with imbalanced datasets | No | No | No | No | No | No | No |
Table 1 shows a feature comparison between the most recent and popular state-of-the-art frameworks. While all of these Auto-ML frameworks provide partial or complete ML pipeline automation, each one works differently and targets different algorithms or dataset structures. Although some frameworks consider the meta-learning approach and ensembling algorithms, some challenges still exist.
Challenge 1: these frameworks are limited to simple and balanced datasets.
Challenge 2: performing Auto-ML experiments using these frameworks requires high computational power.
In the next section, we present our proposed Auto-ML framework, which tackles these two challenges.