Data
To compare the performance of a globally optimized model tree with that using general machine learning or a bilevel GA in a previous research, we evaluate the prediction accuracy using multiple benchmark data. Because the model tree constructed in this study aims to unravel a complicated data structure and extract a universal pattern for population, the data used for accuracy evaluation should also be applicable to complicated and noisy conditions.
1. UC Irvine Machine Learning Depository
From the UC Irvine Machine Learning Depository, we select relatively simple classification problems that have been used in many previous studies in addition to relatively complex regression problems meeting the following conditions.
-
Explanatory variable type: continuous variable
-
Data type: time series
-
Number of explanatory variables: more than 10
-
Number of samples: about 10,000 or more
-
Low sparseness
Table 1 shows a summary of the data used in this study.
Table 1 Summary of different works pertaining to face and speech fusion
2. Financial market time series data
In addition, time series data of financial markets are used as actual data of complex systems. Financial market data contain many one-off factors and noise and serve as representative data for which high prediction accuracy cannot be obtained even by machine learning. This study uses as an objective variable the intraday return for the TOPIX Futures nearby month from the opening price at 08:45 to the closing price at 15:15. There are many other stock indices which represent the global financial markets, such as S&P 500, Dow Jones Industrial Average, Euro Stoxx 50 and so on; however, most of them have been on a consistent upward trend since 2009. Although TOPIX Futures have a smaller trading volume than other indices, they have no long-term trends and are a better example of complex system data. As explanatory variables, we select indicators that represent the financial markets of the United States and Japan. Unlike the objective variable, the explanatory variables do not necessarily have to be the price of the product that can actually be traded, although they need to reflect the movement of the entire financial market from a different perspective. For this reason, the change in the closing value of the stock index, exchange rate, and interest rate shown in Table 2 are used.
The financial data used to predict the TOPIX Futures in next business day include 4,901 intraday returns between January 04, 2001, and December 30, 2020 as an objective variable.
Table 2. Financial data used to predict TOPIX Futures in next business day
Methods
The purpose of this study is to obtain high prediction accuracy by a globally optimized model tree. For this purpose, a model tree is constructed by splitting a sample space with certain splits of a tree, evaluating the versatility of the pattern recognition model at each final node of the tree, and searching the best splits so the average versatility of the pattern recognition models in all final nodes becomes highest. The splits are not recursively searched individually; instead, they are simultaneously searched from large-scale combination optimization.
1. Bilevel GA
When optimizing the overall structure of a decision tree collectively, the fitness value of the tree changes discontinuously when the features used for the splits of a certain node change; therefore, the search method based on the continuity of functions cannot be used. In this study, we first identify the features to be used for each split and their positions in a tree randomly, and we then optimize the threshold for each feature by inner level search. Finally, we search the best features and their positions in a tree by outer level search. This method, outlined in Fig. 1, is referred to as a bilevel GA.
In the inner level search, the features used for each split and their positions in a tree are given, which enables the threshold value of each feature to be searched by a method using continuous distribution. All of the explanatory variables in this study are continuous; therefore, the threshold value change continuously. Because the data sample under a certain split changes discontinuously when the threshold values change, the evaluation function of the tree becomes discontinuous. However, data samples under certain splits change gradually one by one as the threshold changes, which enables the use of an evolutionary computation method that generates individuals based on continuous distribution.
In previous bilevel GA research, the position of the feature was also changed at the inner level, and the search method that presupposes the continuity of the function could not be used. Conversely, the inner level search in this study can be performed efficiently using continuous distribution.
In the outer level search, multiple trees with different features and their positions are generated randomly, and the optimal tree is searched by repeating the selection, crossover, and mutation. When evaluating each individual tree at the outer level, that in which all thresholds are already optimized according to the inner level is used.
2. Inner level optimization
CMA-ES, an evolutionary computation that performs multipoint search based on normal distribution, is used in this study to search the threshold value at the inner level search. This method updates the covariance matrix based on the evolutionary path that accumulates the previous solutions and generates offspring in the direction of movement of the solutions. It is suitable for the inner level search because it can search non-continuous evaluation functions by assuming normal distribution. In addition, CMA-ES is more resistant to noise than other efficient search methods that can handle discontinuous evaluation functions.
However, for data with particularly high levels of complexity such as financial time series data, the evaluation function becomes steep and multimodal, and the search for a global optimal solution requires a large amount of calculation even when using CMA-ES.
In general CMA-ES applications, the degree of freedom is ; the time complexity is O(n2); and the spatial complexity is O(n3) for the number of the dimension O(n2) of the evaluation function. By limiting the variance–covariance matrix C(t+1) used for individual generation to diagonal components, the degree of freedom becomes n, and the amount of time and spatial complexity is reduced to O(n):
(1)
where ccov ∈[0,1] is the learning rate of diagonal element updates; ∈[0,1] is the weighting coefficient of the evolution path ; is the i-th most rated of the z(t+1); and is the i-th component of .
3. Outer level optimization
In the outer level search, the features used for each split and their positions are optimized. The parameters to be optimized are discrete values. In a decision tree, if the features used in a certain split are changed, the structure below will change significantly. Therefore, in the outer level search, the search method assuming continuous distribution cannot be used. For this reason, we use a GA that searches for individuals with high fitness values by repeating the selection, selection, crossover, and mutation because the shape of the evaluation function is not the issue.
As many trees in which thresholds of all splits are already optimized according to the inner level search as the population size are randomly generated as the initial population. A certain percentage among them (preservation rate) is left for the next generation in the order of the fitness value of the tree, and a pair is randomly created from the next certain percentage (crossover rate) group. Branches at random positions are swapped in a pair, and the rest of the population will not be passed on to the next generation and will instead be replaced by trees with new features and their positions. This process is regarded as one generation, and the generation change is repeated. In this study, we use 20 population sizes, a 20% preservation rate, a 20% crossover rate, and 200 generations, as shown by the outer level search in Fig. 1.
4. Model evaluation method
In the classification problem, we use the weighted accuracy rate of each final node as the fitness value of the individual (model tree) generated by the bilevel GA. In the regression problem, we use the weighted prediction accuracy by linear regression analysis at the final node. The prediction accuracy is the average R^2 obtained by the five-fold cross-validation method. The linear regression model uses lasso regression with a regularization parameter of 0.1.
Large-scale combination optimization is required to select the optimal model tree when using such an evaluation method. Therefore, the Oakbridge-CX supercomputer system at the Information Technology Center, University of Tokyo, iss used for the calculation.
5. Accuracy comparison of each method
To evaluate the prediction accuracy of each method, the results obtained from following approaches are compared: linear discriminant analysis; logistic regression analysis; support vector machine; neural network; classification and regression tree (CART); random forest; XGBoost; a decision tree constructed by bilevel GA proposed in [1], hereinafter referred to as bilevel GA by related work; and a decision tree constructed by bilevel GA proposed in this study, hereinafter referred to as bilevel GA by this study. For comparing the prediction accuracy used for comparison, the average classification accuracy rate of the results of five verifications is used according to the five-fold cross-validation method
For regression problems, the following methods are used: multiple regression analysis, lasso regression analysis, partial minimum error, neural network, XGBoost, bilevel GA by related work, and a model tree constructed by bilevel GA proposed in this study, hereinafter also referred to as bilevel GA by this study. For the prediction accuracy used for comparison, the average R2 according to the five-fold cross-validation method is used.
For problems using financial time series data, the following methods are used: multiple regression analysis, lasso regression analysis, partial minimum error, neural network, XGBoost, bilevel GA by related work, and bilevel GA by this study. For the prediction accuracy used for comparison, the average R2 according to the five-fold cross-validation method is used.