OCFSP: self-supervised one-class classification approach using feature-slide prediction subtask for feature data

One-class classification (OCC) is a machine learning problem where the training data contains only one class. Recently, self-supervised OCC algorithms have been attracting increasing attention. These algorithms train a model on a pretext task and use the model error for OCC. However, existing pretext tasks are specialized for images, and applying them to feature data is neither practical nor appropriate. The motivation of this study is to apply self-supervised OCC to feature data. For this purpose, this paper proposes an OCC approach using a feature-slide prediction (FSP) subtask for feature data (OCFSP). The main originality is the FSP subtask, the first classification subtask for feature data. In particular, the proposed method creates a self-labeled dataset by generating additional feature vectors through feature slides of the original vectors and self-annotating these vectors with the number of slides. This dataset is used to train a multi-class classifier to predict the number of feature slides. Since this classification model learns data from only one class, the FSP accuracy for the seen class is higher relative to unseen classes. Accordingly, OCC can be performed using the accuracy of FSP. The proposed methods are evaluated using the imbalanced-learn, covtype, and kddcup datasets. OCFSP shows fair accuracy where little training data is given. In addition, the classification subtask for feature data shows a relatively fast testing speed, unlike image data. Therefore, the bottleneck of the self-supervised approach is considered to be memory size, which is the main difference between image and feature data. Source code is available at https://github.com/ToshiHayashi/OCFSP


Introduction
In recent years, Machine Learning (ML) has been introduced to various fields. In particular, supervised learning is widely applied to practical systems, with annotation by experts (Litjens et al. 2017; Huang et al. 2020). However, such methods require a large volume of data. Moreover, model accuracy worsens due to dataset problems such as data imbalance (Sun et al. 2020) and outliers.
Besides, a supervised learning model cannot predict classes that are not in the training data. In particular, there are cases where only one class is collectible as training data. Since all class labels are the same in the training data, a supervised model classifies all data into that same class. This problem is called one-class classification (OCC) (Gautam et al. 2019a), which is an important issue in ML and is related to anomaly detection (Gautam et al. 2019b), novelty detection (Sadooghi and Khadem 2018), intrusion detection (Mazini et al. 2019), and zero-shot learning (Socher et al. 2013). The objective of OCC is to classify input data into the seen class or the rest of the unseen classes; these classes are defined as being included in the training data or not, respectively.
For this purpose, various OCC algorithms have been proposed. Early studies are shallow methods, such as OCSVM (Scholkopf et al. 2001), Local Outlier Factor (LOF) (Breunig et al. 2000), and Isolation Forest (IF) (Liu et al. 2008). These methods are effective for feature data. However, shallow methods have limitations for image data since they lack a feature extraction process, such as convolutional layers (Ruff et al. 2018).
Recently, Deep Learning (DL)-based OCC methods have been proposed for image datasets (Ruff et al. 2018). These methods are roughly classified into three groups: shallow methods with feature extraction (Ruff et al. 2018), the fake-unseen-samples approach (Yang et al. 2019), and the self-supervised approach (Golan and El-Yaniv 2018; Hayashi et al. 2021). The last approach is the best in terms of accuracy (Golan and El-Yaniv 2018; Hayashi et al. 2021). These methods consider a pretext task for the data and train a model for that task using the training data. Since all training data is seen, the model error for the seen class is small relative to unseen classes. Therefore, OCC can be performed using the model error on the pretext task.
The motivation of this study is to apply the self-supervised approach to feature data. This motivation has three backgrounds, as follows:

• OCC is an important issue for detecting data from unseen classes. Moreover, classifying feature data is significant because features are present in every data type.

• The advance of DL has improved the feature extraction process for image and time-series data. However, this advance does not benefit feature data, whose features are already extracted. Therefore, algorithm-level improvement is needed.

• Self-supervised OCC algorithms show the best accuracy on image datasets. However, their pretext tasks are specialized to images (Golan and El-Yaniv 2018; Hayashi et al. 2021), and applying these tasks to feature data is neither practical nor suitable. Accordingly, designing an effective subtask for feature data is a significant challenge.
Existing pretext tasks can be roughly classified into two groups: generation (regression) (Hayashi et al. 2021) and classification (Golan and El-Yaniv 2018). Generation tasks are related to reconstruction or transformation, and these methods have an advantage in processing speed. In contrast, classification tasks aim to classify a self-labeled dataset created according to a defined classification subtask, and these tasks have an advantage in accuracy. These aspects are a trade-off in practice (Hayashi et al. 2021).
This study prioritizes accuracy and selects classification tasks. The main problem is then what kind of classification should be applied. For image data, pretext tasks include rotation classification (Gidaris et al. 2018), classification of geometric transformations (Golan and El-Yaniv 2018), and image perturbation classification (Gao et al. 2020). However, applying these tasks to feature data is not possible because rotations and geometric transformations are not applicable.
Accordingly, a one-class classification approach using feature-slide prediction (OCFSP) is proposed (Hayashi and Fujita 2021a) with a novel subtask, namely feature-slide prediction (FSP). This task uses a self-labeled dataset including additional feature vectors created by sliding the dimensions of the original vectors. These vectors are annotated with the number of feature slides. Then, a multi-class classifier is trained to classify these feature slides. Since this classification model is built using data from only one class, the accuracy of FSP for seen data is high relative to unseen data. Therefore, the FSP accuracy can discriminate between seen and unseen classes.
This paper is an extension of the work presented in Hayashi and Fujita (2021a), in which additional classification algorithms are applied to the FSP subtask. Moreover, OCFSP is compared with other OCC algorithms under multiple train-test splits. In this comparison, OCFSP has relatively high accuracy where the training data is small.
The contributions of this study are listed as follows:

• A novel one-class classification algorithm, namely OCFSP, is proposed. The main originality is the FSP subtask, which is a classification subtask for feature data. In particular, a self-labeled dataset is created by sliding the feature vectors, and a classification model is trained to predict the number of feature slides. The accuracy of this model is used to classify between seen and unseen classes.

• OCFSP is experimented with using the imbalanced-learn dataset and two real-world datasets and compared with other OCC algorithms. OCFSP shows a high AUC score where the seen data has a small distribution.

• Time complexity is analyzed. OCFSP shows fast processing speed in the testing stage, unlike classification subtasks on image datasets. Therefore, the bottleneck of the self-supervised approach is considered to be memory size, which is the main difference between image and feature data.
The organization of the paper is summarized as follows. Section 2 describes related work, such as OCC and self-supervised OCC. Section 3 presents the proposed OCFSP framework. Sections 4 and 5 provide experiment results and discussions. Finally, Sect. 6 gives the conclusion and future work.
2 Related work

2.1 One-class classification

OCC is a promising research area because it can detect samples from unseen classes, which could improve supervised learning. In this problem, only one class is seen in the training data, and the other classes are unseen. The main challenge is how to detect unseen classes without training on them.
In addition, combining multiple one-class classifiers is a promising solution for binary or multi-class training data. This strategy trains OCC models class by class and classifies testing samples with all models (Zhou and Fujita 2017). Such a strategy is called a one-class ensemble (Silva et al. 2017; Krawczyk et al. 2018; Hayashi and Fujita 2021b) and is applicable in many situations. In addition, this strategy is effective for data imbalance problems: since each model trains on one class, data balance is not a problem (Hayashi and Fujita 2021b).
The objective in OCC is to obtain the class label Y, defined in Eq. (1):

Y ∈ {S, U}    (1)
where S and U are the seen and unseen classes, respectively. Figure 1 shows the general OCC framework, which consists of two stages, training and testing. In the training stage, the OCC model is trained with data from one class. Then, in the testing stage, this model classifies the testing data into the seen class or the unseen classes.
In which the OCC model f is represented as shown in Eq. (2):

f(X) = S if score(X) ≥ k; U otherwise    (2)
where X is input data, and the score is related to the seen class. In addition, k is a threshold value to discriminate between seen and unseen classes. In this framework, the main challenge is the algorithm part of training the OCC model. The main requirement is considering how to compute the score related to the seen class.
For this purpose, several OCC algorithms have been proposed. One-class Support Vector Machine (OCSVM) applies a mapping function from the seen data into a feature vector space. In this vector space, unseen samples are represented by the origin O, and the maximum-margin hyperplane between the mapped seen vectors and O is computed (Scholkopf et al. 2001). In contrast, the Local Outlier Factor (LOF) computes an outlier score for each sample using the local density of the sample and its neighbors; this score becomes large where the data is far from its neighbor samples (Breunig et al. 2000). Additionally, Isolation Forest (IF) is a technique to detect outliers using a tree structure with random splits. Such a tree isolates outliers with high probability since outliers are far from normal samples (Liu et al. 2008). Furthermore, recent studies extend these algorithms (Silva et al. 2017; Karczmarek et al. 2020). In addition, the cluster-based method generates clusters from the seen class; data not assigned to any cluster is then considered to belong to unseen classes (Hayashi and Fujita 2021c).
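The shallow baselines described above can be sketched with scikit-learn; the toy data below is illustrative only, and each model's `decision_function` plays the role of the seen-class score in Eq. (2):

```python
# Minimal sketch of the shallow OCC baselines (OCSVM, LOF, IF) with
# scikit-learn; the Gaussian toy data is an illustrative assumption.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_seen = rng.normal(0.0, 1.0, size=(200, 5))              # one seen class only
X_test = np.vstack([rng.normal(0.0, 1.0, size=(10, 5)),   # seen-like samples
                    rng.normal(6.0, 1.0, size=(10, 5))])  # unseen-like samples

scores = {}
for model in (OneClassSVM(gamma="scale"),
              LocalOutlierFactor(novelty=True),
              IsolationForest(random_state=0)):
    model.fit(X_seen)
    # decision_function: larger values indicate the seen class
    scores[type(model).__name__] = model.decision_function(X_test)
```

For each model, the seen-like samples receive higher scores on average than the unseen-like ones, which is exactly the property the threshold k exploits.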
Apart from these studies, DL-based OCC methods are developed for image data (Ruff et al. 2018). These methods are roughly classified into three groups: feature extraction + shallow method (Ruff et al. 2018), the fake-unseen approach (Yang et al. 2019), and self-supervised OCC (Golan and El-Yaniv 2018; Hayashi et al. 2021). The first method extracts a feature vector and applies a shallow method. In contrast, the second approach generates fake unseen samples and applies supervised classification (Yang et al. 2019). However, generating fake samples is not easy since there is no information about samples belonging to unseen classes. On the other hand, self-supervised OCC considers a subtask and trains a supervised learning model for it. Such methods are the best in terms of accuracy (Golan and El-Yaniv 2018; Hayashi et al. 2021).

Self-supervised one-class classification
Recently, self-supervised OCC has become a promising framework and is attracting increasing attention. Such a framework considers a pretext task and trains an ML model for that subtask. Since the training data contains only the seen class, the model error for the seen class is small relative to unseen classes. Therefore, OCC can be performed using the model error on the pretext task.
Several pretext tasks are proposed for self-supervised OCC (Golan and El-Yaniv 2018; Hayashi et al. 2021; Gao et al. 2020; Bergman and Hoshen 2019; Baldacci et al. 2016; Blázquez-García et al. 2021). These tasks are roughly classified into two groups: generation (regression) and classification. Generative tasks, such as reconstruction and transformation, have an advantage in processing speed (Hayashi et al. 2021). However, these methods have a limitation in terms of accuracy. In contrast, classification tasks aim to classify a self-labeled dataset. Such a dataset includes the original data and additional data created from the original data, where the self-labels correspond to how the data are created. These tasks have an advantage in terms of accuracy (Golan and El-Yaniv 2018). However, this approach takes time to create the self-labeled dataset. These aspects are a trade-off in practice.
Self-supervised OCC is mainly used for image data, for which various subtasks are proposed, such as classification of geometric transformations (Golan and El-Yaniv 2018), perturbation classification (Gao et al. 2020), and image transformation to one image (Hayashi et al. 2021). However, these subtasks are specialized for image data. Apart from these methods, Bergman et al. propose the classification of random affine transformations to extend the subtask to all data types (Bergman and Hoshen 2019).
Self-supervised techniques have also been applied to time-series data in the context of anomaly detection (Baldacci et al. 2016; Blázquez-García et al. 2021). Baldacci et al. use a gas-consumption forecasting task for time-series anomaly detection (Baldacci et al. 2016). Blázquez-García et al. proposed a classification subtask for water leak detection, generating additional signals by multiplying the original signals (Blázquez-García et al. 2021).
On the other hand, this paper proposes a classification subtask for feature data, namely FSP, in which the self-labeled dataset is created by sliding the original feature vectors.

One-class classification for feature data
As for feature data, shallow methods, such as OCSVM, LOF, IF, and GMM, are still viable solutions, and DL methods are not the best choice for OCC. In particular, the advantage of DL models is the feature extraction process (Ruff et al. 2018; Cao et al. 2019). However, such an advantage does not apply to feature data, whose features are already extracted. Moreover, DL needs much training data, and its performance degrades if the training data size is small.
The autoencoder (AE) is virtually the only DL approach for OCC on feature data. The usage of AE in OCC can be roughly classified into two types: (1) using the encoder output (Cao et al. 2019), and (2) using the reconstruction error (Hawkins et al. 2002). These approaches are related to feature extraction and the self-supervised approach, respectively. As for feature data, only the second approach is applied because the features are already extracted.
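Approach (2) above can be sketched as follows. This is a hedged illustration, not the paper's implementation: an `MLPRegressor` trained to reproduce its input stands in for a real autoencoder, and the bottleneck of int(d/2) units mirrors the compression mentioned later in this paper, while the data and settings are assumptions.

```python
# Hedged sketch of OCC via autoencoder reconstruction error (approach (2)).
# An MLPRegressor reproducing its input stands in for a real autoencoder.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
d = 10
X_seen = rng.normal(0.0, 1.0, size=(300, d))  # seen-class training data

# Bottleneck of int(d/2) hidden units acts as the compression layer
ae = MLPRegressor(hidden_layer_sizes=(d // 2,), max_iter=2000, random_state=0)
ae.fit(X_seen, X_seen)  # target = input, i.e. reconstruction

def reconstruction_error(x):
    # Larger error -> the sample is more likely from an unseen class
    return float(np.mean((ae.predict(x.reshape(1, -1)) - x) ** 2))

seen_err = reconstruction_error(rng.normal(0.0, 1.0, size=d))
unseen_err = reconstruction_error(rng.normal(5.0, 1.0, size=d))
```

A sample drawn far from the seen distribution reconstructs poorly, so its error exceeds that of a seen-like sample, which is the signal thresholded for OCC.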
Besides, Lenz et al. propose Average Localised Proximity (ALP) as an improvement of LOF. Their proposal is based on the hypothesis that LOF has an extreme localization problem (Lenz et al. 2021). Their source code is released in the fuzzy-rough-learn package (Lenz et al. 2020).
Several methods treat a part of the seen data as a fake unseen class (Aguilar et al. 2021; Kang 2022). Aguilar et al. propose PBC4occ (Aguilar et al. 2021). They train decision trees by treating 10% of the seen data as unseen. Since 90% of the training data is labeled as seen, seen data should be classified into the seen class with high probability. The source code is released as a Weka package (Aguilar et al. 2021).
Similarly, Kang (2022) proposes an OCC algorithm for feature data using clustering and binary classifications. He applies a clustering algorithm to annotate the seen data and trains one-vs-all classifiers for each cluster. Since every training sample is self-labeled as "one" only once, the models classify seen data into "all" in most cases. On the other hand, data belonging to unseen classes is assigned to "one" with higher probability because such data is unrelated to the training data. Therefore, the binary classification results can discriminate between seen and unseen classes (Kang 2022).
3 One-class classification using feature-slide prediction subtask

In this section, the novel one-class classification algorithm OCFSP is presented. The main originality is FSP, a new subtask for feature data. Figure 2 shows the framework, which consists of two stages, training and testing. In which, only seen data is used as training data.
In the training stage, additional data are generated by sliding the feature vectors. Then, the self-labeled dataset is created by gathering the original and additional data, where the self-label represents the number of feature slides. Thereafter, the classification model is trained using this dataset. Finally, the threshold value is computed based on the accuracy of the FSP model.
In the testing stage, additional data are generated in the same way as training. Then, the FSP model is applied, and the accuracy is computed. Data is treated as a seen class if accuracy is larger than a threshold value. Otherwise, data is concluded as an unseen class. The following paragraphs provide mathematical descriptions.
Data is defined as a d-dimensional feature vector X, as shown in Eq. (3):

X = (x_1, x_2, ..., x_d)    (3)

where d is the number of dimensions. The objective in OCC is predicting the class label Y, defined in the previous Eq. (1). For this purpose, this study proposes the FSP subtask, which considers additional data A generated by computing the feature slide T of the original data X, as in Eqs. (4) and (5):

T(X, z) = (x_{d-z+1}, ..., x_d, x_1, ..., x_{d-z})    (4)

A_z = T(X, z)    (5)

In which z is the number of applied slides, where 0 < z < d, because d-dimensional data has d - 1 possible slides; the original feature vector X is slid forward by the number of slides. The original and additional data are annotated with the self-label SL as in Eq. (6):

SL(X) = 0,  SL(A_z) = z    (6)

In which the original data is self-labeled as 0, while the additional data are labeled with the number of slides. Accordingly, the FSP model g is defined as in Eq. (7):

ŜL = g(V)    (7)

In which g aims to predict SL, the number of feature slides, for an original or slid vector V.
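The feature slide and self-labeling steps above can be sketched in NumPy, assuming the slide is a circular shift (implied by "d-dimensional data has d - 1 possible slides"); the helper names are illustrative:

```python
import numpy as np

def feature_slide(X, z):
    # Slide every d-dimensional row forward by z positions, wrapping around
    return np.roll(X, z, axis=1)

def make_self_labeled(X, Z):
    # Label 0 for the originals and z for each z-slid copy
    data = np.vstack([feature_slide(X, z) for z in range(Z + 1)])
    labels = np.repeat(np.arange(Z + 1), len(X))
    return data, labels

X = np.arange(12.0).reshape(3, 4)         # three 4-dimensional vectors
D_self, SL = make_self_labeled(X, Z=2)    # self-labels 0, 1, 2
print(D_self.shape, SL.shape)             # → (9, 4) (9,)
```

With Z slides, each original vector yields Z additional vectors, so the self-labeled set is (Z + 1) times the size of the seen data.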
Finally, the score for data related to the seen class is computed using the original data X and the additional data A, as in Eq. (8):

score(X) = Σ_{z=0}^{Z} ℓ_z    (8)

where ℓ_z is the likelihood value for each slide z; the score is computed using the likelihood of correct prediction rather than hard accuracy.
Likelihood provides a more fine-grained score than plain accuracy when the data size is small. Therefore, such scoring highlights the difference between seen and unseen classes.
The following sub-sections present the training stage and testing stage.
Training stage

In the training stage, the self-labeled dataset Dself is created by merging the training data Dtr and its slid additional data Atr, as in Eq. (11):

Dself = Dtr ∪ Atr    (11)

Then, the self-label SL is assigned to Dself, as shown in Eq. (6). Besides, the feature-slide classifier g is trained using Dself and SL, as shown in Eq. (7). In this process, existing classification algorithms are applied. Since training is done using only seen samples, this classifier has high accuracy for the seen class relative to unseen classes. Therefore, the threshold value between seen and unseen can be computed using the FSP accuracy for the seen class. This computation is made in a heuristic optimal way.
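The training stage can be sketched as follows, under the circular-shift assumption for the slide; the toy data (features with distinct per-dimension means, so the slides are learnable) and the choice of a decision tree as g are illustrative assumptions:

```python
# Sketch of the FSP training stage: build the self-labeled set Dself
# from slid copies of the seen data and fit a multi-class classifier g.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
D_tr = rng.normal(np.arange(8), 0.3, size=(100, 8))  # seen-class training data
Z = 3                                                # number of slides

A_tr = [np.roll(D_tr, z, axis=1) for z in range(1, Z + 1)]  # additional data
D_self = np.vstack([D_tr] + A_tr)                           # merged dataset
SL = np.repeat(np.arange(Z + 1), len(D_tr))                 # self-labels

g = DecisionTreeClassifier(random_state=0).fit(D_self, SL)  # FSP model
```

Any off-the-shelf multi-class classifier can play the role of g; the paper evaluates DT, LR, GNB, and MLP in Sect. 4.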

Testing stage
In the testing stage, the input is Xtest, and the additional data Atest are generated from Xtest as shown in Eq. (5). These data are merged into the self-labeled testing set Dtest, as shown in Eq. (12):

Dtest = {Xtest} ∪ Atest    (12)

Then, the score of Xtest is computed as in Eq. (13):

score(Xtest) = Σ_{z=0}^{Z} ℓ_z    (13)

Such an equation is based on the previous Eq. (8).
Finally, the seen-unseen classification f is established using Eq. (14):

f(Xtest) = S if score(Xtest) ≥ k; U otherwise    (14)

where k is a threshold value, determined in a heuristic optimal way.
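The testing stage can be sketched end-to-end as follows, again assuming a circular-shift slide and a decision tree as g; the threshold k and all data are illustrative, not the paper's settings:

```python
# Sketch of FSP scoring and seen/unseen classification at test time.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
Z, d = 3, 8
D_tr = rng.normal(np.arange(d), 0.3, size=(200, d))          # seen class
D_self = np.vstack([np.roll(D_tr, z, axis=1) for z in range(Z + 1)])
SL = np.repeat(np.arange(Z + 1), len(D_tr))
g = DecisionTreeClassifier(random_state=0).fit(D_self, SL)

def fsp_score(x):
    # Sum over slides of the likelihood assigned to the correct self-label
    D_test = np.vstack([np.roll(x.reshape(1, -1), z, axis=1)
                        for z in range(Z + 1)])
    proba = g.predict_proba(D_test)                  # shape (Z+1, Z+1)
    return float(proba[np.arange(Z + 1), np.arange(Z + 1)].sum())

def classify(x, k=2.0):
    # Threshold rule: seen if the FSP score reaches k, unseen otherwise
    return "seen" if fsp_score(x) >= k else "unseen"

s_seen = fsp_score(rng.normal(np.arange(d), 0.3))    # seen-like sample
s_unseen = fsp_score(rng.normal(0.0, 0.3, size=d))   # flat, unseen-like sample
```

A sample from the seen distribution has its slides predicted correctly, so its score approaches Z + 1; a flat unseen-like vector yields near-identical slid copies and therefore at most one correct slide prediction.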

Experiment
The proposed method (OCFSP) has been validated using the dataset listed in Sect. 4.1. The measurement of evaluation is shown in Sect. 4.2. The experiment results are shown in Sect. 4.3.

The data
OCFSP is evaluated with the imbalanced-learn dataset (Lemaitre et al. 2017), implemented as a Python package. This dataset consists of 27 sub-datasets with class imbalance for binary classification (Lemaitre et al. 2017). Table 1 shows information on the datasets, such as dimension, number of samples per class, and Imbalance Ratio (IR). All datasets are normalized with min-max scaling. In the experiment, one class is treated as the seen class, and the other class is treated as unseen.
Training data and testing data are split with three ratios: 80%:20%, 60%:40%, and 10%:90%. Then, the training set is split into minority and majority data, and the one-class classifiers are trained separately. Each split is applied five times, and the average scores are reported as the experiment result.
In addition, the Kddcup99 and covtype datasets are used as real-world datasets. These datasets are included in the scikit-learn library (Pedregosa et al. 2011). Table 2 shows the class balance of kddcup99, which includes 23 classes. In addition, Table 3 provides the class balance of the covtype dataset, which has seven classes.
These two datasets are used for the comparison stage; one class is used as seen, and the other classes are regarded as unseen. In addition, the training testing split ratio is 80%:20%.

Measurement of the evaluation
Evaluation is done using the Area Under the ROC Curve (AUC). This curve plots the classification performance at all possible thresholds, with the x-axis and y-axis being the False Positive Rate (FPR) and True Positive Rate (TPR), respectively. These values are computed as in Eqs. (15) and (16), where, following Table 4, positive and negative correspond to the minority and majority classes, respectively:

FPR = FP / (FP + TN)    (15)

TPR = TP / (TP + FN)    (16)
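The FPR, TPR, and AUC used in the evaluation can be computed with scikit-learn; the labels and scores below are illustrative only:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([1, 1, 1, 0, 0, 0, 0])       # 1 = positive (minority/seen)
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1])

# roc_curve evaluates FPR and TPR at every threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(round(auc, 3))                           # → 0.917
```

Since AUC aggregates over all thresholds, it evaluates the score function directly, without requiring the threshold k of Eq. (14) to be fixed.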

Experiment result
The experiment consists of the following three parts: (1) considering the appropriate number of slides, (2) comparing with other OCC algorithms, and (3) computing the processing time.
In this experiment, four OCC and supervised classification algorithms are applied using the scikit-learn package (Pedregosa et al. 2011). Table 5 lists the applied packages for the algorithms. The baseline methods are OCC algorithms, while OCFSP uses supervised classification algorithms for the FSP model. AUC scores are computed using the outputs of the score functions shown in Table 5. These scores are the likelihood related to the seen class or to correct slide prediction.
In addition, the FSP model is built with four classification algorithms: Decision Tree (DT), Logistic Regression (LR), Gaussian Naïve Bayes (GNB), and Multi-layer Perceptron (MLP). These algorithms are selected from those included in the scikit-learn library (Pedregosa et al. 2011) and applied with default parameters. Processing time was the criterion for model selection because model training takes time as the size of the self-labeled dataset increases.
The first experiment considers the appropriate number of feature slides for OCFSP. Figure 3 shows the average AUC score for the imbalanced-learn dataset. These results correspond to seen classes, train-test splits, and classification algorithms for FSP. In addition, Z represents the number of slides applied to the original data; therefore, (Z + 1)-class classification is done as the pretext task.
In this figure, the AUC is small where the z value is small, because random prediction then leads to high accuracy and the FSP accuracy for unseen classes becomes unfairly high. On the other hand, the AUC score decreases where the z value is large, because FSP becomes unpredictable even for the seen class.
Besides, Table 6 shows the best AUC scores for each pair of seen class and classification algorithm. Overall, MLP is the best classifier where the seen class is the minority. In contrast, DT shows the best performance for the majority seen class. In addition, a larger ratio of training data provides a higher AUC score.
The following experiments use the number of feature slides z shown in Table 6. In some datasets, z is larger than the dimension of the data; in such cases, the reported result is the AUC score where z = d - 1. Besides, the performance of the minority classifier is better than that of the majority classifier. The reason is considered to be the seen data distribution: the minority class likely has a narrower distribution than the majority class, and the distribution of its feature slides is correspondingly narrow.
On the other hand, the majority class and its feature-slide vectors should have wide distributions. In such cases, the self-labeled data distributions overlap, making FSP difficult, and the resulting model error cannot discriminate between seen and unseen classes.

Comparison with other OCC algorithms
The proposed OCFSP is compared with seven other OCC algorithms, as follows:

• One-class support vector machine (OCSVM) (Scholkopf et al. 2001) maps the seen data into a feature vector space. In this space, the maximum-margin hyperplane is computed between the mapped seen data and the origin O, which is regarded as unseen.

• Local outlier factor (LOF) (Breunig et al. 2000) computes an outlier score for each sample, calculated from the densities of the sample and its neighbors. The outlier score is large where the data is far from its neighbor samples.

• Isolation forest (IF) (Liu et al. 2008) uses a tree structure with random splits. Such a tree tends to assign seen data to the same region and isolate unseen data. Therefore, a score can be computed from where the data is assigned.

• Gaussian mixture model (GMM) (Mario et al. 2002) is an unsupervised clustering algorithm that can be applied to discriminate between seen and unseen data. GMM is applied to the training data to generate clusters, and a score related to these clusters is computed for each sample. Since the clusters are created from the seen class, the score is related to the seen class, and data with a small score is concluded to belong to unseen classes (Hayashi and Fujita 2021c).

• Autoencoder (AE) is a neural network combining an encoder and a decoder. The reconstruction error is used to discriminate between seen and unseen classes (Hawkins et al. 2002). This study implements AE with three dense layers and a compression dimension of int(d/2).

• Average localised proximity (ALP) is proposed by Lenz et al. as an improvement of LOF that reduces its localization (Lenz et al. 2021). This method is applied with the fuzzy-rough-learn library (Lenz et al. 2020).

• Kang (2022) applies a clustering algorithm to annotate the seen data and trains one-vs-all classifiers for each cluster. Since every training sample is self-labeled as "one" only once, the models classify seen data into "all" in most cases. Following Kang (2022), k-means is used as the clustering method with k = 20.
In addition, LR, Random Forest (RF), and MLP are applied as classification algorithms.
OCSVM, LOF, IF, GMM, and Kang methods are applied with the scikit-learn package (Pedregosa et al. 2011). In which default parameters are used for the experiment.
Tables 7, 8 and 9 provide the comparison where the seen data is the minority class. In particular, Table 7 shows the result where the train-test split is 8:2; OCFSP methods show the best AUC scores for seven datasets in total. Table 8 shows the comparison where the train-test split is 6:4, in which OCFSP methods show the best scores for eight datasets in total. In addition, OCFSP-MLP is 0.1 point behind the best average score, which is comparable with the baseline. In the same way, Table 9 provides the comparison where the train-test split is 1:9. OCFSP outperforms the baseline in nine datasets, and OCFSP-MLP shows the best AUC score among the compared methods. Moreover, ALP and the methods from Kang produce errors because the number of training samples is less than their algorithms require; the baseline algorithms and OCFSP do not have such limitations.
Besides, Tables 10, 11 and 12 provide the experiment result where majority class is the seen class. In particular, Table 10 shows the comparison where the train test split is 8:2. In which OCFSP shows the best scores in three datasets. On the other hand, OCFSP underperforms baseline in the average score.
In the same way, Table 11 provides a comparison where the split is 6:4. OCFSP shows the best score in two datasets and underperforms baseline methods in average AUC score.
Finally, Table 12 provides the comparison for the 1:9 split. OCFSP shows the best AUC scores in five datasets. However, OCFSP underperforms the other OCC methods in the average AUC score. This result is due to the difficulty of FSP for the majority seen class (see Sect. 5.3).
OCFSP shows comparable performance where the seen class is the minority. However, the AUC score worsens when the seen class is the majority; AUC scores are less than 50 in several datasets because the FSP accuracy for the unseen minority class is larger than for the seen majority class. This problem is related to the data distributions of the classes: the majority distribution is wider than the minority one, so learning the majority distribution can cover the minority distribution.
On the other hand, minority data exists in a narrow distribution. In such cases, the feature slides differ significantly from the original data, so the FSP task becomes easier.
Besides, Table 13 shows the experiment result for the Kddcup dataset. OCFSP shows the best AUC scores in seven seen classes. In addition, OCFSP-DT provides the third-best average AUC score.
In addition, Table 14 provides the experiment result for the covtype dataset. OCFSP underperforms the other methods due to the large number of seen samples; OCFSP needs improvement in terms of AUC for large distributions.
OCFSP is an alternative solution where a small seen distribution is given. However, OCFSP is not suitable for learning wide distributions.
In the overall comparisons, ALP is the best approach for large distributions, provided the training data exceeds its requirement. In contrast, the methods from Kang show poor results in terms of AUC. The original paper reports results using only small (fewer than 3000 samples) and relatively balanced datasets (Kang 2022); the algorithm does not perform well on larger and more imbalanced datasets.

Discussion
This section provides discussion of OCFSP.

One-class ensemble classifier
In this section, OCFSP is applied as part of a one-class ensemble classifier, in which OCC classifiers are trained class by class and then combined to create the final classifier. The imbalanced-learn dataset has two classes, minority and majority. Therefore, the minority classifier and the majority classifier are trained separately. Then, the ensemble of both OCC classifiers is computed using Eqs. (17) and (18).
These two equations are based on the previous paper (Hayashi and Fujita 2021b), where k is the threshold value (deciding this value is not necessary to compute the AUC score). Tables 15, 16 and 17 compare the ensemble AUC scores for OCC and OCFSP.
In particular, Table 15 provides experiment result where the split is 8:2. In which OCFSP shows the best results in two datasets.  Table 16 shows the result where the train test split is 6:4. OCFSP shows the best AUC scores in the four datasets. Table 17 provides the result where the training testing split is 1:9. OCFSP outperforms other methods in five datasets. Moreover, OCFSP-MLP is the best in the average AUC score.
The previous three tables show that the OCFSP ensemble is relatively effective where training data is small size.
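The one-class ensemble above can be sketched with a generic combination rule: one model per class, with the final label taken from the model giving the higher seen-score. This stands in for Eqs. (17) and (18), which are not reproduced here; IsolationForest and the toy data are illustrative assumptions.

```python
# Hedged sketch of a one-class ensemble: one IsolationForest per class,
# final label from the model with the higher seen-score. This is a
# generic combination rule, not necessarily the paper's Eqs. (17)-(18).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_min = rng.normal(0.0, 0.5, size=(50, 4))    # minority-class training data
X_maj = rng.normal(3.0, 0.5, size=(500, 4))   # majority-class training data

models = {"minority": IsolationForest(random_state=0).fit(X_min),
          "majority": IsolationForest(random_state=0).fit(X_maj)}

def ensemble_predict(x):
    # Each one-class model scores the sample; pick the best-matching class
    s = {c: m.decision_function(x.reshape(1, -1))[0] for c, m in models.items()}
    return max(s, key=s.get)

print(ensemble_predict(np.zeros(4)))   # sample near the minority cluster
```

Because each model is trained on only one class, the imbalance between the two classes does not affect either model's training, which is the property noted in Sect. 2.1.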

Statistical test
This section provides a statistical test. Table 18 provides the average ranks for the imbalanced-learn datasets. The scores marked in bold and underline show the best and second-best results, respectively. OCFSP achieves a relatively good rank where the seen class is the minority; on the other hand, its performance degrades where the seen class is the majority.
Besides, the following tables report the p-values from the Wilcoxon signed-rank test. The null hypothesis is that the ranks of the other methods are better than those of the OCFSP algorithms. This hypothesis is accepted (the other method is better) where p > 0.95 and rejected (OCFSP is better) where p < 0.05; otherwise, the two methods perform statistically equally. This study applies the test to the best (minority, 1:9) and the worst (majority, 8:2) results for OCFSP. Table 19 provides the p-values for the best result, where the seen class is the minority and 10% of the data is used for training.
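The test above can be reproduced with SciPy; a minimal sketch with toy AUC values (the actual paired scores come from the paper's result tables), using a one-sided alternative so that a small p-value rejects the null hypothesis in favor of OCFSP:

```python
import numpy as np
from scipy.stats import wilcoxon

# Toy paired AUC scores of OCFSP and one comparative method over the
# same eight datasets (illustrative values, not the paper's numbers).
auc_ocfsp = np.array([0.91, 0.88, 0.93, 0.85, 0.90, 0.87, 0.92, 0.89])
auc_other = np.array([0.90, 0.86, 0.90, 0.81, 0.85, 0.81, 0.85, 0.81])

# One-sided Wilcoxon signed-rank test: "greater" tests whether the
# paired differences (OCFSP - other) are shifted above zero.
stat, p = wilcoxon(auc_ocfsp, auc_other, alternative="greater")
print(p < 0.05)  # True here: the null is rejected, OCFSP ranks better
```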
OCFSP-MLP significantly outperforms six comparative methods and shows no significant difference from the other two. Therefore, OCFSP shows top-level performance where the data size is small. In summary, the statistical test results show that OCFSP's performance depends on the training data size and the seen distribution; in particular, OCFSP is an effective alternative solution for small training data.
Table 21 shows the processing time for OCC and OCFSP, measured on a webpage dataset with 300 dimensions, split into 20,868 training samples (589 minority and 20,279 majority) and 13,912 testing samples. The comparison is made for both the minority and the majority class, and the number of slides Z is 29 for OCFSP. The processing time is measured on the following computer. CPU: Intel(R) Core(TM) i9-9900K @ 3.60 GHz, RAM: 64 GB, no GPU.

Time complexity analysis
Among the FSP models in the OCFSP algorithm, GNB is the fastest to train, whereas MLP takes the most training time.
Overall, OCFSP takes more time in the training stage because the FSP model is trained on the additional data generated as the self-labeled dataset; training time increases when the number of slides Z is large. In contrast, OCFSP is fast in the testing stage. This result differs from previous self-supervised OCC on image datasets: the earlier study reports that the classification of geometric transformation tasks is time-consuming, whereas OCFSP is applicable to real-time processing. For this reason, the following hypotheses are considered: (1) applying a feature slide is faster than a geometric transformation, and (2) the memory size for feature data is smaller than that for image data. In other words, increasing memory size could speed up self-supervised OCC on image datasets. In addition, OCFSP's testing time does not change with the training data size. Therefore, OCFSP is applicable to real-time processing for feature data.
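The self-labeled dataset that drives this training cost can be sketched as follows. This is a minimal illustration that models the feature slide as a circular shift of each feature vector via `np.roll` (an assumption for illustration; the paper's exact slide operator is defined in the method section); it shows how the training set grows by a factor of Z + 1:

```python
import numpy as np

def make_fsp_dataset(X, Z):
    """Build the self-labeled FSP dataset from seen-class data X.

    Each original vector is slid 0..Z times (modeled here as a circular
    shift), and the number of slides becomes the self-annotated label.
    """
    Xs, ys = [], []
    for z in range(Z + 1):
        Xs.append(np.roll(X, shift=z, axis=1))  # slide every feature vector by z
        ys.append(np.full(len(X), z))           # self-label = number of slides
    return np.vstack(Xs), np.concatenate(ys)

X = np.random.rand(100, 30)   # 100 seen samples, 30 features
Xs, ys = make_fsp_dataset(X, Z=29)
print(Xs.shape, ys.shape)     # (3000, 30) (3000,): the data grows by (Z + 1)x
```

Because the same expansion is applied once per test sample at inference, the testing cost stays proportional to Z rather than to the training set size, which is consistent with the constant testing time observed above.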
On the other hand, ALP requires considerable testing time because it uses neighbor samples. This is a tradeoff with its high accuracy; therefore, OCFSP can be considered one of the alternative solutions.

Accuracy of feature slide prediction subtask
Figure 4 provides accuracy scores for the FSP subtask, consisting of ten sub-figures. The left sub-figures show the accuracy where the seen data is the minority, while the right sub-figures provide the results for the majority seen class. In addition, the bottom sub-figures compare the accuracy of all applied classification algorithms for the seen class. These scores are computed using the self-labeled testing set of the optical digits dataset, with a train-test split of 8:2. FSP accuracy for the seen class is higher than for the unseen class where the seen class is the minority; in contrast, there is no accuracy difference where the seen class is the majority. In addition, MLP shows the highest accuracy for the seen class.
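The seen-versus-unseen accuracy gap described above is what drives the OCC score. The following is a minimal end-to-end sketch using scikit-learn's MLPClassifier as the FSP model, with the slide again modeled as a circular shift and toy data standing in for the real datasets (the paper's exact slide operator and scoring function may differ):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def slide(X, z):
    # Feature slide modeled as a circular shift (assumption for illustration)
    return np.roll(X, z, axis=1)

def fit_fsp(X_seen, Z):
    """Train the FSP model on the self-labeled dataset of the seen class."""
    Xs = np.vstack([slide(X_seen, z) for z in range(Z + 1)])
    ys = np.repeat(np.arange(Z + 1), len(X_seen))
    return MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                         random_state=0).fit(Xs, ys)

def fsp_accuracy(model, X, Z):
    """Mean accuracy of predicting the slide label; used as the OCC score."""
    return np.mean([model.score(slide(X, z), np.full(len(X), z))
                    for z in range(Z + 1)])

rng = np.random.default_rng(1)
mu = np.linspace(0.0, 1.0, 10)                        # per-feature pattern
X_seen = mu + rng.normal(0, 0.05, (200, 10))          # toy seen class
X_unseen = mu[::-1] + rng.normal(0, 0.05, (200, 10))  # toy unseen class

model = fit_fsp(X_seen, Z=9)
print(fsp_accuracy(model, X_seen, 9) > fsp_accuracy(model, X_unseen, 9))
```

Because the FSP model is trained only on seen-class slides, its slide-prediction accuracy is high for seen-like samples and low for unseen-like ones, which is exactly the gap visible in Fig. 4.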

Limitation and future direction
The limitation of OCFSP is its AUC score when learning a wide distribution. OCFSP shows a fair AUC score for minority data but does not provide a high AUC score where the seen class is the majority. The majority distribution is considered broader and more diverse than the minority, so learning the majority distribution could also cover the minority distribution. Moreover, a wide distribution inevitably produces overlap between the feature slides and the originals. For example, Fig. 5 visualizes a two-dimensional vector space with randomly generated data. The left graph shows the case where the original data has a small distribution: the original data and the feature slides are clearly separated. The right graph shows the case where the original data has a wide distribution: training the FSP model is not possible because the self-labels overlap. This problem can persist even when the data has more dimensions. These issues reduce the model error difference between seen and unseen classes and degrade the AUC score.
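The overlap effect illustrated by Fig. 5 can be reproduced numerically; a minimal sketch in two dimensions, where the slide is modeled as a circular shift (a coordinate swap in 2-D, an assumption for illustration) and a GNB model tries to separate originals from their slides:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def slide_overlap_accuracy(sigma):
    """FSP accuracy on 2-D toy data whose spread is controlled by sigma.

    With a narrow cluster, the originals and their slides are clearly
    separated; with a wide cluster, the self-labels overlap and the FSP
    model cannot distinguish them.
    """
    X0 = rng.normal(loc=[3.0, 0.0], scale=sigma, size=(500, 2))  # originals
    X1 = np.roll(X0, 1, axis=1)                                  # one slide
    X = np.vstack([X0, X1])
    y = np.repeat([0, 1], 500)                                   # self-labels
    return GaussianNB().fit(X, y).score(X, y)

print(slide_overlap_accuracy(0.3))   # narrow distribution: near-perfect
print(slide_overlap_accuracy(10.0))  # wide distribution: self-labels overlap
```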
As a future direction, transforming the data distribution into a narrow/sparse distribution (Wu et al. 2021) is a significant challenge in addressing these issues. In addition, OCFSP is applicable to sparse data (Luo et al. 2021a), such as data in recommendation systems (Luo et al. 2021b).
Besides, considering other pretext tasks for feature data is a significant challenge; the main question is how to avoid overlap between self-labels. Combining the feature slide with an autoencoder is also a fascinating idea; an encoder-slide-decoder structure could be an interesting study for pretext tasks on image or time-series data.

Conclusion and future work
This paper proposes the OCFSP algorithm as a self-supervised OCC approach for feature data. The main originality is the FSP subtask, in which a self-labeled dataset is created by sliding feature vectors and a supervised classification algorithm is trained on it. Since the training uses only seen data, the accuracy for a seen class is expected to be high relative to an unseen class. Accordingly, OCC is computable based on the accuracy of the FSP model.
The proposed OCFSP is evaluated using the imbalanced-learn, covtype, and kddcup datasets, together with a statistical analysis. OCFSP shows top-level AUC where the training data has few samples and a small distribution. Moreover, OCFSP shows a consistent testing speed, which makes it applicable to real-time processing. As a weak point, OCFSP takes much time for training, and the FSP subtask has limitations in learning from a wide distribution.
Considering other pretext tasks is a promising research area for tackling these weak points. In particular, a classification subtask could provide a high-speed testing time for feature data in the same way as OCFSP; the main challenge is how to avoid the overlap between self-labels. In addition, feature slides could be combined with an autoencoder.

Fig. 4 Feature slide prediction accuracy for the optical_digits dataset. The blue line shows the accuracy for the seen class, and the orange line shows the accuracy for the unseen class.
Fig. 5 Feature slides in a two-dimensional vector space: the class overlap problem for the FSP subtask where a seen class has a wide distribution.