Breast cancer Wisconsin: This is a breast cancer dataset provided by the University of Wisconsin, comprising 212 negative and 357 positive samples, for a total of 569. The total number of features is 32, including the radius, texture, and perimeter.
Forest cover type: This dataset categorizes seven forest cover types and consists of 581,012 samples. It has 54 features, including elevation, aspect, slope, and soil type.
Spambase: This is an email dataset with a label indicating whether a message is spam or non-spam. It contains 4,601 samples, consisting of 1,813 spam messages and 2,788 non-spam messages. There are 57 features in this dataset, and each feature conveys information about the frequency of specific words or characters appearing in a message.
The insurance company benchmark: This dataset is labeled according to whether a customer has purchased Caravan insurance. There are 9,822 samples, consisting of 586 subscriber records and 9,236 non-subscriber records. This dataset has 86 features, composed of product usage data and socio-demographic data derived from zip codes.
Musk: This is a dataset with a Musk or non-Musk label for each sample. It has 6,598 samples, consisting of 1,017 Musk samples and 5,581 non-Musk samples, and 168 features. The first two attributes of the Musk dataset were excluded from this experiment because they refer to the names of molecules and conformations.
Colon cancer: This is a dataset provided by Princeton University. It has 62 samples, of which 40 are normal and 22 are from colorectal cancer patients. It has 2,000 features, each representing gene expression information.
The experimental parameters were set as follows: the initial epsilon (ε) value was 0.5, and the EDR was 0.9995. The number of main and guide agents generated depends on the number of features in the dataset being tested. The learning rate was 0.01, and the maximum number of episodes was 10,000. For the initial main agent action, the Q-value corresponding to 0 (Deselect) was randomly initialized to a real value between 0 and 1, and the Q-value corresponding to 1 (Select) was randomly initialized to a real value between 0 and 0.05. This was intended to allow only a small number of features to be selected initially, with the number increasing gradually.
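The asymmetric Q-value initialization and epsilon decay described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the variable names are hypothetical, and the feature count (57, as in Spambase) and the epsilon-greedy action rule are assumptions based on the surrounding text.

```python
import numpy as np

rng = np.random.default_rng(0)

EPSILON_INIT = 0.5   # initial exploration rate (from the text)
EDR = 0.9995         # epsilon decay rate applied each episode
N_FEATURES = 57      # e.g., Spambase; one main agent per feature

# Asymmetric initialization: Q-values for action 0 (Deselect) lie in
# [0, 1), while Q-values for action 1 (Select) lie in [0, 0.05), so
# only a few features are selected at first and the count grows later.
q_table = np.stack([rng.uniform(0.0, 1.0, N_FEATURES),
                    rng.uniform(0.0, 0.05, N_FEATURES)], axis=1)

def select_actions(q_table, epsilon):
    """Epsilon-greedy action for every main agent (0=Deselect, 1=Select)."""
    greedy = q_table.argmax(axis=1)
    explore = rng.random(len(q_table)) < epsilon
    random_actions = rng.integers(0, 2, len(q_table))
    return np.where(explore, random_actions, greedy)

epsilon = EPSILON_INIT
for episode in range(3):          # 10,000 episodes in the actual experiment
    actions = select_actions(q_table, epsilon)
    epsilon *= EDR                # decay exploration after each episode
```

With this initialization, the greedy action for most agents is initially Deselect, so early episodes select only the few features whose random Select value happens to exceed their Deselect value, plus those chosen by exploration.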
In this experiment, an artificial neural network was used as the classifier. It consists of an input layer, two hidden layers, and an output layer. The two hidden layers consist of five and two nodes, respectively. The rectified linear unit function was used as the activation function, and the learning rate was set to 0.01. The maximum number of training epochs was set to 10.
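A classifier with this configuration can be sketched with scikit-learn's `MLPClassifier`, shown here on synthetic data. This is an assumed reconstruction from the stated hyperparameters, not the authors' code; the dataset and random seeds are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for one of the benchmark datasets.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two hidden layers with 5 and 2 nodes, ReLU activation,
# learning rate 0.01, and at most 10 training epochs (as in the text).
clf = MLPClassifier(hidden_layer_sizes=(5, 2), activation='relu',
                    learning_rate_init=0.01, max_iter=10, random_state=0)
clf.fit(X, y)
preds = clf.predict(X[:5])
```

With only 10 epochs the network typically does not fully converge; in the feature-selection loop this keeps each accuracy evaluation cheap.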
In addition, to compare and verify the performance of the proposed method, feature selection was also performed using mRMR, Relief, and a genetic algorithm, each implemented and tested directly. For the genetic algorithm, only the basic concept was used without introducing any specialized operators, and its fitness value was defined as the classification accuracy. The classifier used to obtain this accuracy had the same configuration as the classifier described above.
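A basic genetic algorithm of the kind described, with classification accuracy as the fitness, can be sketched as below. The selection, crossover, and mutation operators shown (truncation selection, one-point crossover, bit-flip mutation) are assumed standard choices; the text only states that the basic concept was used, so the details here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=150, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fitness(mask):
    """Fitness = accuracy of the classifier trained on the selected features."""
    if mask.sum() == 0:
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(5, 2), max_iter=10, random_state=0)
    clf.fit(X_tr[:, mask], y_tr)
    return clf.score(X_te[:, mask], y_te)

def evolve(pop_size=8, generations=5, p_mut=0.1):
    # Each individual is a boolean mask over the features.
    pop = rng.integers(0, 2, (pop_size, X.shape[1])).astype(bool)
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[:pop_size // 2]]          # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(0, len(parents), 2)]
            cut = rng.integers(1, X.shape[1])         # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(X.shape[1]) < p_mut     # bit-flip mutation
            children.append(child ^ flip)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(m) for m in pop])
    return pop[scores.argmax()], scores.max()
```

Calling `evolve()` returns the best feature mask found and its held-out accuracy; with a real benchmark dataset the population size and generation count would be much larger.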
Figure 4 shows the number of features selected per episode in each dataset, and Fig. 5 shows the classification accuracy per episode. For each experiment, 10,000 episodes were performed, and each value plotted on the graphs is the average, over 100 consecutive episodes, of the number of selected features and the classification accuracy.
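The per-100-episode averaging used for the plots amounts to a simple block mean, sketched here on a hypothetical accuracy log (the random values are placeholders, not experimental data):

```python
import numpy as np

# Hypothetical per-episode accuracy log; each plotted point in Figs. 4
# and 5 is the mean over a block of 100 consecutive episodes.
episode_acc = np.random.default_rng(0).random(10_000)
plotted = episode_acc.reshape(-1, 100).mean(axis=1)   # 100 points per curve
```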
It is evident that the experiments conducted with the guide agent achieve higher classification accuracy than those without it. When the guide agent is applied, the classification accuracy increases as the episodes progress; without it, the accuracy either shows no significant increase over the initially derived value or remains roughly constant, and rather than being maintained steadily, it rises and falls continuously. In the experiments without the guide agent, all agents received the initial result values as a reward, without any strategy; consequently, the Q-value for one action increases considerably at the beginning, and the opportunity to take the other action diminishes. In addition, the continuous change in accuracy can be attributed to exploration, the behavior by which an agent tries out various actions; the observed fluctuations in accuracy appear to be caused largely by this exploration. Accordingly, the accuracy varies greatly when the set of selected features changes substantially, because such changes strongly affect the behavior of the agents as a whole.
Through various experiments, it was deduced that with the guide agent applied, the number of selected features remains stable while the accuracy increases steadily. With the guide agent, the main agents can reliably determine whether their actions are correct, and because the proposed learning strategy learns only a small number of features at a time, large fluctuations are avoided. However, for the forest cover type dataset, the number of selected features changes significantly compared to the other datasets. This is because many of the features of the forest cover type dataset have meaningless values; these features are filtered out during the learning process while other features are selected, so the amount of change is large compared to the other datasets.
Table 1 lists the experimental results for each dataset, where each numerical value represents the classification accuracy. In parentheses, the number on the left indicates the total number of features in the dataset, and the number on the right indicates the number of features finally selected by our algorithm. Evidently, compared with the other experiments, our proposed method achieves slightly better performance. The other methods did not report the number of selected features; therefore, a comparison of the number of features could not be made. The results of the six experiments show that our proposed feature selection method enables efficient feature selection for classification, regardless of the number of features in the dataset.
Table 2 lists the results of our proposed method alongside the feature selection methods commonly used in other studies. For each method, the reported result is that of the feature subset yielding the highest accuracy.
In the case of the Wisconsin breast cancer dataset, there was no significant difference in accuracy among the four feature selection methods and our proposed method; however, the mRMR method achieved the highest accuracy. A satisfactory result of 0.9404 was obtained even when the experiment was conducted without feature selection, indicating that all features of the Wisconsin breast cancer dataset carry some significant information. Moreover, from the results in Table 2, it can be observed that the linear analysis methods are the most efficient for this dataset.
For the forest cover type dataset, our algorithm achieved the best result, with an accuracy of 0.8802. For this dataset, the accuracy improved by approximately 0.17 or more when a feature selection method was used. However, the linear analysis methods, as opposed to the selection methods based on combinations of features, do not appear to perform very well here, as the classification accuracy they achieve is comparatively low.
For the Spambase dataset, our algorithm achieved the best result, with an accuracy of 0.963. Compared with the other experimental results, the genetic algorithm and the proposed method achieved high accuracy. These results indicate that, for this dataset, feature selection methods based on combinations of features achieve better results than the linear analysis methods.
In the case of the insurance company benchmark dataset, our algorithm achieved the second-highest result, with an accuracy of 0.943. The feature selection method using Relief achieved the best accuracy, and a similar method, mRMR, also achieved high accuracy, whereas the genetic algorithm's accuracy was relatively low. Therefore, for the insurance company benchmark dataset, there seems to be a linear relationship between the features and the labels.
For the Musk dataset, the proposed method achieved the highest accuracy of 0.984. Feature selection yielded a significant accuracy improvement over using no selection method, which suggests that the Musk dataset contains many meaningless features. Compared with the other algorithms, the genetic algorithm and our proposed method achieved higher classification accuracy. Moreover, given the high accuracy of the experiment using mRMR, it can be assumed that there is some linearity between each feature and the label of the Musk dataset.
The experiment with the colon cancer dataset showed the greatest difference compared to the other dataset experiments. Feature selection using the linear analysis methods achieved only slightly higher accuracy than using no method at all. However, the feature selection methods based on combinations, such as the genetic algorithm and our algorithm, showed superior performance. In addition, the difference between the results obtained using our proposed feature selection method and the genetic algorithm was 0.0364, i.e., approximately 3.64 percentage points.