Over the last two decades, the volume of data has grown dramatically in many domains, such as business, science, and social media, so the storage and transmission of data are growing at an enormous rate [15]. In other words, with the rapid development of the Internet and communication channels, the information explosion has become a serious problem [12]. The sheer volume and dimensionality of data pose a severe challenge for machine learning and data-mining systems [2, 5, 20]. The dimensionality of data directly affects the performance of these algorithms, especially in clustering, classification, regression, and time-series prediction. Moreover, a huge amount of data increases computation cost and storage pressure. At the same time, datasets contain valuable hidden knowledge that plays a crucial role in analysis tasks.
In general, the original data contain redundant and irrelevant features [21]. Irrelevant features provide misleading information, which degrades learning accuracy. For example, in K-Nearest Neighbor, irrelevant features inflate the distances between samples of the same class, which makes it more difficult to classify data correctly [4, 13].
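The following minimal sketch illustrates this effect on synthetic data using scikit-learn; the dataset, the number of appended noise features, and the KNN settings are assumptions chosen only for illustration.

```python
# Sketch: irrelevant features degrade a K-Nearest-Neighbor classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)

# Two informative features plus 20 purely random (irrelevant) ones.
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 20))])

knn = KNeighborsClassifier(n_neighbors=5)
print("informative only  :", cross_val_score(knn, X, y, cv=5).mean())
print("with 20 irrelevant:", cross_val_score(knn, X_noisy, y, cv=5).mean())
```

The noise dimensions dominate the Euclidean distances, so same-class neighbors are no longer the closest points and the cross-validated accuracy typically drops.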
To avoid losing useful information and to improve the performance of learning systems, efficient preprocessing mechanisms should be applied to extract and preserve the information [3, 8, 12]. Hence, to cope with this problem, an effective knowledge compression algorithm for compressing and retrieving useful data is proposed in this work. Generally, there are two types of compression: lossy and lossless. In lossy compression, which is typically employed for multimedia data, the data recovered in the decompression phase differ from the original. In lossless compression, on the other hand, the retrieved data are identical to the original; it is widely employed for documents and executable files [8]. Compression can also be classified into two categories in terms of dimensionality reduction: the data can be compressed at the row (record) level, or feature selection or extraction can be applied to compress the dataset [6]. Over the last decades, researchers have concentrated mainly on compression at the feature level.
In the proposed method, a subset of the data is first selected at the row level. Afterward, the features relevant to the target concept are selected to compress the data further. Feature selection techniques decrease the dimensionality of the data and speed up the analysis process; in other words, they simplify the training phase and improve the quality of the feature set. Overall, they can significantly shorten the running time and improve the accuracy [3, 10, 20].
1.1 Literature review
As mentioned before, compression techniques can be classified into two categories: feature (column) reduction and record (row) reduction [9, 11, 19]. In the last decades, most research has focused on feature selection for reducing the volume of data and improving the performance of learning and mining algorithms. Feature selection pursues two main goals: reducing the number of features and preserving classification accuracy [21]. Record reduction, on the other hand, plays a significant role in decreasing the space and time complexity of these algorithms. In this section, a brief review of several feature-based schemes proposed in recent years is presented, along with the advantages and weaknesses of each technique.
In [8], a novel scheme was proposed for database compression based on data-mining techniques. In this work, the redundant data existing in the transaction database were eliminated by replacing them by means of compression rules. The work illustrated how association rules can be extracted through mining and how the compression rules can then be employed to generate a compressed database. The experimental results demonstrated the efficiency of the method compared to Apriori-based techniques in terms of both compression ratio and running time.
In [16], a threshold-based scheme was proposed for feature selection in high-dimensional data. In detail, 11 feature-ranking techniques were presented that consider the strength of the relationship between each attribute and the class. For this purpose, each attribute is normalized and paired individually with the class. The method was evaluated with Naive Bayes and Support Vector Machine classifiers trained on the resulting datasets. The results demonstrated the superiority of the technique compared to traditional standard filter-based feature selection.
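As a rough illustration of this style of filter, the sketch below min-max normalizes each attribute and scores it individually against the class with absolute Pearson correlation; the specific eleven ranking metrics and thresholds used in [16] may differ, so this is an assumption-laden stand-in rather than a reproduction of that method.

```python
# Sketch: rank attributes by their individual relationship with the class.
import numpy as np

def rank_features(X, y):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Min-max normalize each attribute independently.
    X_norm = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)
    # Score each attribute by its absolute Pearson correlation with the class.
    scores = np.array([abs(np.corrcoef(X_norm[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]   # best-ranked attributes first
    return order, scores
```

A threshold on the returned scores then decides how many of the top-ranked attributes are kept.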
In [18], a new scheme based on K-Nearest Neighbor (KNN) was proposed to decrease the time complexity of the feature selection phase. In detail, a novel strategy was presented to evaluate the quality of candidate features with the help of KNN. The primary purpose of this work is to accelerate wrapper-based feature subset selection. The results demonstrated that the scheme can significantly speed up the evaluation process and decrease the running time without degrading accuracy.
In [17], the authors proposed a feature selection method based on Ant Colony Optimization (ACO) and a Genetic Algorithm (GA). The method includes two main models: a visibility density model and a pheromone density model. In this approach, each feature is modeled as a binary bit with two states, selected and deselected. In the experiments, the scheme was compared with other evolutionary algorithms, and the results demonstrated the effectiveness of the optimization algorithm for solving feature selection problems.
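The binary encoding used by such evolutionary approaches can be sketched as follows: each candidate solution is a bit mask over the features, and its fitness is the accuracy of a classifier trained on the selected columns. The KNN evaluator and cross-validation setup below are illustrative assumptions, not the exact fitness function of [17].

```python
# Sketch: evaluate a binary feature mask (1 = selected, 0 = deselected).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(bitmask, X, y):
    selected = np.flatnonzero(bitmask)
    if selected.size == 0:
        return 0.0                      # an empty subset is worthless
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, selected], y, cv=5).mean()
```

The ACO/GA search then explores the space of such bit masks, guided by pheromone and visibility information.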
In another work [5], a novel feature selection scheme was proposed based on genetic programming. In detail, a permutation strategy was employed to select features for high-dimensional symbolic regression. The regression results demonstrated the method's efficiency compared to other traditional schemes in identifying truly relevant features.
In [21], a novel unsupervised feature selection method was proposed based on Particle Swarm Optimization (PSO). Two filter-based strategies were presented to speed up the convergence of the algorithm: the first is based on average mutual information, and the second on feature redundancy. The first filter is employed to remove irrelevant and weakly relevant features; the second is applied to improve the exploitation capability of the swarm. The experimental results showed improved classification accuracy together with a reduced number of features.
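A minimal sketch of an average-mutual-information prefilter in the spirit of the first strategy is shown below. Since [21] is unsupervised, relevance is approximated here, as an assumption, by each feature's average mutual information with the remaining features after equal-width discretization; the threshold and estimator are illustrative choices, not those of the original paper.

```python
# Sketch: drop features whose average mutual information is below a threshold.
import numpy as np
from sklearn.metrics import mutual_info_score

def avg_mi_prefilter(X, threshold=0.05, bins=10):
    X = np.asarray(X, dtype=float)
    # Discretize each feature into equal-width bins.
    D = np.stack([np.digitize(X[:, j],
                              np.histogram_bin_edges(X[:, j], bins)[1:-1])
                  for j in range(X.shape[1])], axis=1)
    n = D.shape[1]
    avg_mi = np.array([np.mean([mutual_info_score(D[:, i], D[:, j])
                                for j in range(n) if j != i])
                       for i in range(n)])
    keep = np.flatnonzero(avg_mi >= threshold)   # weakly relevant features removed
    return X[:, keep], keep
```

The surviving features then form the search space for the PSO stage.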
In [6], for the first time, ensemble feature selection was modeled as a Multi-Criteria Decision-Making (MCDM) process. The authors used the VIKOR method to rank the features based on the evaluation of several feature selection methods treated as different decision-making criteria. The proposed method first builds a decision matrix from the ranks assigned to every feature by the various rankers. The VIKOR approach is then used to assign a score to each feature based on the decision matrix. Finally, a rank vector over the features is produced, from which the user can select the desired number of features.
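The following sketch shows VIKOR-style scoring over such a feature-by-ranker decision matrix, where rows are features and columns are rank positions from different feature selection methods (lower is better). Equal criterion weights and v = 0.5 are assumptions made for illustration; the exact weighting in [6] may differ.

```python
# Sketch: score features with a simplified VIKOR over a rank decision matrix.
import numpy as np

def vikor_scores(rank_matrix, weights=None, v=0.5):
    R = np.asarray(rank_matrix, dtype=float)
    m, n = R.shape
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, float)
    best, worst = R.min(axis=0), R.max(axis=0)       # lower rank is better
    norm = (R - best) / (worst - best + 1e-12)       # regret per criterion
    S = (norm * w).sum(axis=1)                       # group utility
    Rg = (norm * w).max(axis=1)                      # individual regret
    Q = v * (S - S.min()) / (S.max() - S.min() + 1e-12) \
        + (1 - v) * (Rg - Rg.min()) / (Rg.max() - Rg.min() + 1e-12)
    return np.argsort(Q), Q          # features ordered best-first, VIKOR scores

# Example: 5 features ranked by 3 different rankers.
ranks = np.array([[1, 2, 1], [3, 1, 2], [2, 3, 4], [5, 5, 3], [4, 4, 5]])
order, Q = vikor_scores(ranks)
```

The resulting order plays the role of the final rank vector from which the user picks the top features.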
1.2 Key contributions
As discussed, one of the main solutions for compressing the dataset is record reduction. In this work, one (or more) time window is considered as background, so the user's activity is observed across different time windows. In detail, a weight is assigned to each time window: smaller weights are assigned to distant time windows, while larger weights are assigned to recent ones. Afterward, an attention mechanism manages the whole process by combining the windows into a universal representation. The main idea of the attention mechanism is to learn accurate weights for the set of features; the weights are updated so that features carrying crucial information for the desired task receive larger values. Finally, the technique calculates the importance of each click for each user and provides the results to the system to capture the priorities and interests of users.
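The record-level part of this idea can be sketched as follows: user clicks are grouped into fixed-length time windows, and each window receives a weight that shrinks with its distance from the present, so recent activity contributes more to the compressed profile. The exponential decay, the window length, and the function and variable names below are assumptions chosen for illustration, not the exact formulation of the proposed method.

```python
# Sketch: weight user clicks by the age of their time window.
from collections import defaultdict

def compress_clicks(clicks, now, window_len=7 * 24 * 3600, decay=0.5):
    """clicks: iterable of (timestamp, item_id); returns weighted item counts."""
    scores = defaultdict(float)
    for ts, item in clicks:
        window_idx = int((now - ts) // window_len)   # 0 = most recent window
        weight = decay ** window_idx                 # smaller for distant windows
        scores[item] += weight
    return dict(scores)

# Example: three clicks on two items over the last few weeks.
now = 1_700_000_000
clicks = [(now - 3600, "A"), (now - 10 * 24 * 3600, "B"),
          (now - 20 * 24 * 3600, "A")]
print(compress_clicks(clicks, now))   # the recent click on "A" dominates
```

An attention layer can then replace the fixed decay by learning the per-window weights from the data.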
Generally, the significant properties of the proposed method are listed below:
- In contrast to clustering techniques, which discard the least frequent records, the entire record of the user's activity over its lifetime is maintained.
- To improve the performance of recommender systems, dynamic values are assigned to the user's behavior based on when it occurred.
- The performance of the neural network is boosted by decreasing the number of windows.
- The compression operation is applied at both the record and feature levels.
- The amount of data compression at the row level depends on the selected time window length.
1.3 Road map
The rest of this paper is organized as follows. The details of the proposed method are described in Section 3. The experimental results, analysis, and performance comparison are given in Section 4. Finally, the conclusions and future work are presented in Section 5.