Kernel-based data transformation model for nonlinear classification of symbolic data

Symbolic data usually consist of categorical variables used to represent discrete entities in many real-world applications. Mining symbolic data is more difficult than mining numerical data due to the lack of inherent geometric properties in this type of data. In this paper, we use two kinds of kernel learning methods to create a kernel estimation model and a nonlinear classification algorithm for symbolic data. Using the kernel smoothing method, we construct a squared-error consistent probability estimator for symbolic data; we then propose a new data transformation model to embed symbolic data into Euclidean space. Based on the model, the inner product and distance measure between symbolic data objects are reformulated, allowing a new Support Vector Machine (SVM), called SVM-S, to be defined for nonlinear classification of symbolic data using the Mercer kernel learning method. The experimental results show that SVM can be much more effective for symbolic data classification based on our proposed model and measures.


Introduction
Symbolic data, alternatively known as categorical data or nominal data, are widely used in real-world applications, where the attributes are represented by symbols denoting qualitative categories of things (Agresti 2008). Taking two attributes named gender and height as examples, the former is usually represented by the category "male" or "female," while the latter takes one of the categories from {"low," "medium," "high"}. Compared to numeric data, mining symbolic data is a more challenging task due to the lack of inherent geometric characteristics (Buttrey 1998; Guha et al. 2000; Yan et al. 2018; Zhu and Xu 2018; Wang et al. 2018): the basic operations defined for numerical data, such as the Euclidean distance, inner product and mean, are not well-defined for symbolic data (Dos Santos and Zárate 2015).
As an important tool in data mining, data classification, which assigns unlabeled samples to known classes using supervised learning methods, has been a subject of wide interest in categorical data mining, especially in the fields of business, finance, social sciences and health sciences. A number of methods have been developed to classify symbolic data, including decision trees (DT), Naive Bayes (NB) (Seeger 2006) and distance-based methods such as the k-nearest neighbors (KNN) and prototype-based classifiers (Han and Karypis 2000; Zhang et al. 2013). Since both DT and NB are typically based on the assumption that symbolic attributes are conditionally independent given the class attribute, they cannot identify the nonlinear correlation between attributes, which has been validated to be useful in high-quality classification (Wang et al. 2020). With an elaborate distance measure, it is possible to apply traditional distance-based classifiers to nonlinear categorical data classification; however, defining such a meaningful distance measure directly on symbolic data remains a difficult problem due to the challenges discussed previously (Dos Santos and Zárate 2015; Boriah et al. 2008).
Recently, kernel learning has become popular for efficiently learning the nonlinear correlation between attributes and for nonlinear data classification (Hofmann et al. 2008; Vo and Sowmya 2010; Zhong et al. 2014). For example, the nonlinear Support Vector Machine (SVM) (Cortes and Vapnik 1995) makes use of Mercer kernel functions to embed raw objects into a reproducing kernel Hilbert space, such that the data can be classified in the new space with high quality. Such a method cannot, however, be directly applied to nonlinear symbolic data classification because it is essentially designed for numeric data, where the Mercer kernels and some key intermediate operations, such as the inner product, are well-defined. A popular solution to this problem is to transform symbolic data into numeric data as a preprocessing step, using a frequency estimation-based encoding model such as the well-known One-Hot Encoding (Alaya et al. 2017). Note that such a data transformation model typically results in large estimation variance, as measured by the finite-sample mean squared error (Ouyang et al. 2006; Li and Racine 2007).
To utilize the intrinsic nonlinear learning capabilities of kernel methods, in this paper we propose a kernel learning model for symbolic data classification. Using the kernel smoothing method (Ghosh 2018), the probability density of each discrete symbol can be estimated, based on which we present a new data transformation model, namely the kernel-based self-representation model, to embed symbolic data objects into Euclidean space. Based on this model, we define novel inner product and distance measures for symbolic data, and show that a kernel-based attribute-weighting scheme can be combined into the distance measure with the space transformation. Applying the proposed model and measures to SVM, we obtain a new classifier, named SVM-S, for nonlinear classification of symbolic data.
The following sections of the paper are organized as follows. Section 2 introduces related work. Section 3 describes the kernel probability estimator for symbolic data. Section 4 presents our data transformation model and the nonlinear SVM classifier for symbolic data, SVM-S. Section 5 experimentally evaluates the proposed model and SVM-S. Section 6 gives our conclusion and discusses future directions.

A sampling of classification methods for symbolic data
Real-world data mining applications usually need to deal with various types of data, such as images, text, audio, and video. A few methods have been suggested to classify symbolic data in the input space, including decision trees such as the C4.5 classifier (Quinlan 1995), Naive Bayes (NB) and distance-based methods such as the KNN algorithm. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. Decision trees generally use the entropy gain (e.g., C4.5) or the Gini index (e.g., CART) (Bremner and Taplin 2002) to choose the split attribute, so they can be directly applied to symbolic data, but they encounter difficulties when the data set contains a large number of classes or attributes. NB is a probability-based classifier built on the Bayes theorem and the assumption that each attribute is conditionally independent given the class. For categorical data, NB computes the posterior probability with the frequency estimator; such an estimator generally results in large estimation variance, especially when the number of samples is small. Distance-based classifiers arouse our interest due to their inherent simplicity and flexibility. They classify samples by the dissimilarity or distance between them; therefore, the performance of a distance-based classifier largely depends on the effectiveness of the chosen distance measure. When applied to symbolic data, the distance measure must be specially designed for symbolic attributes: examples include the simple matching (SM) distance (Huang 1998), frequency difference (Boriah et al. 2008) and information theory-based measures (He et al. 2008). Since such a measure is typically defined for the categories distributed on each symbolic attribute separately, the correlation between attributes is ignored in the dissimilarity computation.

Data transformation methods
To apply methods originally defined for numeric data, such as SVM, BP Neural Networks (Jin et al. 2000) and restricted Boltzmann machines (Larochelle et al. 2012), to machine learning on complex data, a natural solution is to convert the data into numerical vectors, that is, to embed them into Euclidean space. For example, the Word2Vec family of algorithms (Mikolov et al. 2013) maps each word into a numerical vector by using artificial neural networks, and the Locally Linear Embedding (LLE) method embeds data from a manifold into a low-dimensional Euclidean space (Roweis and Saul 2000).
For symbolic data, due to the lack of spatial structure, it is impossible to directly use measures that are typically defined for numeric data, such as the mean, variance, and inner product. If symbolic data can be mapped to Euclidean space, many essential issues, like distance measures, are easily addressed, and algorithms originally designed for numerical data can then be transposed to symbolic data mining. For instance, Label Encoding, one of the popular encoding techniques, assigns a unique integer to each symbol based on alphabetical ordering; thus, the transformed data (i.e., a series of integer values) are ordinal. However, symbolic data usually lack such a natural ordering in practice. Another popular technique is One-Hot Encoding (Alaya et al. 2017), which converts each symbol into a set of binaries. It thus easily results in the dummy variable trap (also known as the multi-collinearity problem) (Cerda et al. 2018).
Recently, a few alternative encoding methods have been suggested. For example, NPOD (Neural Probabilistic Outlier Detection for categorical data) (Cheng et al. 2019) embeds symbolic data into Euclidean space by using a log-bilinear neural network, where the relationship between two symbolic attributes is analogous to that of words and their context in an article. However, symbolic data from many real-world applications often lack such natural semantics. Moreover, with this method, the symbols have to be encoded in advance to feed the artificial neural network, using the One-Hot Encoding technique. Qian et al. (2016) suggested an alternative transformation method using the data self-representation trick, based on which a general framework for space-structure-based categorical data clustering (abbreviated SBC) was derived. In this method, symbolic data are embedded into Euclidean space as a set of N-dimensional vectors, where N is the size of the dataset. Since N is typically large in practice, such a method generally results in a huge increase in storage and computing costs.

Kernel learning on symbolic data
Due to their intrinsic nonlinear learning capabilities, kernel learning methods have been widely used in machine learning in recent years. A successful example is the nonlinear SVM, which makes use of the kernel trick (Cortes and Vapnik 1995) to map raw data to a high-dimensional feature space with Mercer kernel functions. Through this implicit mapping, samples in the input space that are difficult to separate with a linear hyperplane can become linearly separable in the high-dimensional feature space. Here, the kernel can be regarded as a similarity measure between samples, due to the equivalence between the inner product and the distance metric for two sample vectors. However, as discussed previously, the inner product operation defined in Euclidean space does not naturally exist for symbolic data.
Another type of kernel learning method is the so-called kernel smoothing (Ghosh 2018), which refers to the smoothing bandwidth method used in nonparametric density estimation, nonparametric regression and trend estimation (Cortes and Vapnik 1995; Deng et al. 2018). The use of kernel smoothing for symbolic data learning can be traced back to Aitchison and Aitken (1976), where a discrete kernel function was defined and used to estimate the probability distribution of symbolic data, called kernel density estimation (KDE) or simply kernel estimation. Ouyang et al. (2006) then presented a data-driven bandwidth estimation method, and a series of KDE-based classification algorithms for symbolic data has been proposed (Yan et al. 2018; Chen and Wang 2013; Chen and Guo 2015; Chen et al. 2014b). For example, the $K^2$NN algorithm (Chen et al. 2014b), an extension of the conventional KNN classifier, derives a weighted SM distance measure from the KDE on symbolic data; in later work, three new linear classifiers were defined for symbolic data classification and, interestingly, it was demonstrated that the classes can be made more separable by kernel learning of symbolic attributes.
In this paper, we propose a KDE-based data transformation model to embed symbolic data into Euclidean space, called kernel-based self-representation of symbolic data, followed by the newly defined inner product and distance measures for symbolic data. The results thus allow symbolic data to be nonlinearly classified using a Mercer kernel-based classifier; in particular, we shall show that the SVM can be much more effective for symbolic data classification based on our novel formulation to the inner product and distance measures.

Discrete kernel estimation
In what follows, the symbolic data set is denoted by $DB = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$ is a data object featured by $D$ symbolic attributes and $y_i$ is the class label of $x_i$. The set of categories of the $d$th attribute, $d = 1, 2, \ldots, D$, is denoted by $O_d$, with $|O_d|$ being its cardinality (i.e., the $d$th attribute takes $|O_d|$ discrete values). An arbitrary category of the $d$th attribute is denoted by $o_{dl}$ ($l = 1, 2, \ldots, |O_d|$), whose frequency estimator is

$$f(o_{dl}) = \frac{1}{N}\sum_{i=1}^{N} I(x_{id} = o_{dl}), \qquad (1)$$

where $I(\cdot)$ is the indicator function, i.e., $I(\text{true}) = 1$, $I(\text{false}) = 0$. Let $X_d$ be a random variable associated with the observations for the $d$th attribute, and denote its probability density by $p(X_d)$. In order to estimate $p(X_d)$, we define the discrete kernel function as follows:

$$K(o_{dl}, x_d; \lambda_d) = \begin{cases} 1 - \dfrac{|O_d| - 1}{|O_d|}\lambda_d, & o_{dl} = x_d,\\[4pt] \dfrac{\lambda_d}{|O_d|}, & o_{dl} \neq x_d, \end{cases} \qquad (2)$$

where $\lambda_d \in [0, 1]$, called the bandwidth, is the smoothing parameter of the kernel function corresponding to the $d$th attribute. Note that Eq. (2) can be rewritten in a much simpler form, given as

$$K(o_{dl}, x_d; \lambda_d) = \frac{\lambda_d}{|O_d|} + (1 - \lambda_d)\, I(o_{dl} = x_d). \qquad (3)$$

It can be seen that the kernel function defined by Eq. (2) or (3) satisfies the basic property of a probability distribution, i.e., $\sum_{l=1}^{|O_d|} K(o_{dl}, x_d; \lambda_d) = 1$. Now, based on the kernel density estimation (KDE) method (Scott 1992; Li and Racine 2007), the kernel probability of the category $o_{dl}$ can be estimated by

$$\hat{p}(o_{dl}; \lambda_d) = \frac{1}{N}\sum_{i=1}^{N} K(o_{dl}, x_{id}; \lambda_d) = (1 - \lambda_d)\, f(o_{dl}) + \frac{\lambda_d}{|O_d|}. \qquad (4)$$

It is worth remarking that the kernel probability estimate of $o_{dl}$, as shown in Eq. (4), depends on both the frequency estimator $f(o_{dl})$ and the bandwidth $\lambda_d$, which, in fact, is related to the data distribution characteristics of the $d$th attribute.
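To make the estimator concrete, here is a minimal Python sketch (our illustration, not the authors' released code; the function names are ours) of the discrete kernel in Eq. (3) and the kernel probability estimator in Eq. (4):

```python
import numpy as np

def discrete_kernel(o, x, lam, n_categories):
    """Eq. (3): K(o, x; lambda_d) = lambda_d/|O_d| + (1 - lambda_d) * I(o == x)."""
    return lam / n_categories + (1.0 - lam) * float(o == x)

def kernel_probability(o, column, lam, n_categories):
    """Eq. (4): p_hat(o; lambda_d) = (1 - lambda_d) * f(o) + lambda_d/|O_d|,
    i.e., the average of the discrete kernel over the attribute column."""
    freq = float(np.mean(np.asarray(column) == o))
    return (1.0 - lam) * freq + lam / n_categories
```

For instance, with `column = ['a', 'b', 'a', 'c']`, `lam = 0.2` and `n_categories = 3`, `kernel_probability('a', ...)` returns 0.8 × 0.5 + 0.2/3 ≈ 0.467, smoothing the raw frequency 0.5 toward the uniform value 1/3.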
Moreover, the estimator has an interesting property, as shown in the following theorem.

Theorem 1 (Squared-error consistency) For the kernel estimator given in Eq. (4),

$$\mathrm{MSE}\big(\hat{p}(o_{dl}; \lambda_d)\big) = \mathrm{Bias}^2\big(\hat{p}(o_{dl}; \lambda_d)\big) + \mathrm{Var}\big(\hat{p}(o_{dl}; \lambda_d)\big) = O(\lambda_d^2) + O(N^{-1}),$$

where $O(\cdot)$ is the big-Oh notation (the 'O' stands for 'order of') and $N$ is the number of samples. Since $\lambda_d \in [0, 1]$, the estimator is squared-error consistent: the MSE vanishes as $N \to \infty$ and $\lambda_d \to 0$. The full proof is given in Appendix A.

In addition, from the proof of Theorem 1, we can find that the smaller the $\lambda_d$, the smaller the bias. It can also be seen that, by minimizing the mean squared error, the bias and variance of the kernel estimation can be balanced.
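As a quick numerical illustration of Theorem 1 (a simulation sketch under an assumed toy distribution, not an experiment from the paper), the bias term driven by $\lambda_d$ and the variance term shrinking with $N$ can both be observed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # assumed true distribution over |O_d| = 3 categories
for lam in (0.05, 0.2):
    for n in (50, 500):
        # kernel estimate of p(o_1) over 2000 simulated samples of size n
        est = np.array([(1 - lam) * np.mean(rng.choice(3, size=n, p=p) == 0) + lam / 3
                        for _ in range(2000)])
        print(f"lambda_d={lam}, N={n}: MSE={np.mean((est - p[0]) ** 2):.5f}")
```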

Bandwidth optimization
Bandwidth optimization, which determines the asymptotic characteristics of the kernel estimation (Yan et al. 2018), is a key issue in KDE methods. Because the optimal bandwidth is closely related to the data distribution, it is a reasonable choice to use a data-driven method (Deng et al. 2018; Stone 1984), that is, to learn the optimal bandwidth from the data themselves. Here, we aim to learn the optimal bandwidth by minimizing the MSE of the kernel estimation accumulated over all categories of the $d$th attribute:

$$L(\lambda_d) = \sum_{l=1}^{|O_d|} E\Big[\big(\hat{p}(o_{dl}; \lambda_d) - p(o_{dl})\big)^2\Big]. \qquad (5)$$

Substituting for $\hat{p}(o_{dl}; \lambda_d)$ in Eq. (5) according to Eq. (4), the loss function to be optimized can be rewritten as

$$L(\lambda_d) = \sum_{l=1}^{|O_d|} E\bigg[\Big((1 - \lambda_d)\, f(o_{dl}) + \frac{\lambda_d}{|O_d|} - p(o_{dl})\Big)^2\bigg]. \qquad (6)$$

We then have the following results.
Theorem 2 For the $d$th attribute, the optimal bandwidth obtained by minimizing the loss function $L(\lambda_d)$ is

$$\lambda_d^{*} = \frac{\sigma_d^2}{N\big(1 - \frac{1}{|O_d|} - \sigma_d^2\big) + \sigma_d^2}, \qquad (7)$$

with

$$\sigma_d^2 = 1 - \sum_{l=1}^{|O_d|} \big[p(o_{dl})\big]^2.$$

The proof is given in Appendix B.
Note that the underlying probability distribution $p(o_{dl})$ is unknown, which means that $\sigma_d^2$ cannot be directly computed. A practical approach is to use the frequency distribution of the training samples, so that $\sigma_d^2$ can be easily estimated by its sample counterpart $S_d^2 = 1 - \sum_{l=1}^{|O_d|} [f(o_{dl})]^2$. In this way, with $\sigma_d^2$ in Eq. (7) replaced by $S_d^2$, the optimal bandwidth is estimated as

$$\hat{\lambda}_d = \frac{S_d^2}{N\big(1 - \frac{1}{|O_d|} - S_d^2\big) + S_d^2}. \qquad (8)$$

Here are some comments on the optimal kernel bandwidth according to Eq. (8): (1) The larger the $S_d^2$, the larger the bandwidth. Note that $S_d^2$ is widely known as the Gini-Simpson index (Casquilho 2020) and can be used to measure the data dispersion. In particular, when the data of an attribute are uniformly distributed, the bandwidth reaches its maximum, $\hat{\lambda}_d = 1$. (2) The larger the sample size $N$ relative to $|O_d|$, the smaller the bandwidth, and the kernel estimate will be very close to the frequency estimate. This also reflects the asymptotic behavior of the kernel estimation: it converges to the frequency estimator as the sample size grows.
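A minimal sketch (ours; `optimal_bandwidth` is a hypothetical helper name) of the data-driven bandwidth in Eq. (8):

```python
import numpy as np

def optimal_bandwidth(column):
    """Eq. (8): lambda_hat = S2 / (N * (1 - 1/|O_d| - S2) + S2), with
    S2 = 1 - sum_l f(o_dl)^2 the Gini-Simpson index of the attribute column."""
    column = np.asarray(column)
    n = len(column)
    _, counts = np.unique(column, return_counts=True)
    freqs = counts / n
    s2 = 1.0 - float(np.sum(freqs ** 2))  # Gini-Simpson index S_d^2
    if s2 == 0.0:                         # degenerate column with a single category
        return 0.0
    m = len(counts)                       # |O_d| as observed in the sample
    return s2 / (n * (1.0 - 1.0 / m - s2) + s2)
```

For a uniformly distributed column the first term of the denominator vanishes and the function returns 1, matching comment (1) above.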

Kernel-based self-representation model
In this subsection, a new data transformation model is proposed to embed symbolic data into Euclidean space, based on the kernel estimation method discussed in the previous section. We begin by representing the category taken on each attribute, say, $x_{id}$ on the $d$th symbolic attribute $A_d$ of the sample $x_i$, by a probability vector, as set out in Definition 1.

Definition 1 (Category self-representation) For the category $x_{id}$ taken on the $d$th attribute $A_d$ of the object $x_i$, its self-representation is the probability vector

$$v_{id} = \big(p(o_{d1} \mid x_{id}),\, p(o_{d2} \mid x_{id}),\, \ldots,\, p(o_{d|O_d|} \mid x_{id})\big)^{\mathrm{T}}, \quad \|v_{id}\|_1 = 1,$$

where $p(o_{dl} \mid x_{id})$ denotes the conditional probability of $X_d$ taking the category $o_{dl} \in O_d$ given the observation $x_{id}$.

To estimate the conditional probabilities in Definition 1, we use the kernel estimator as defined in Eq. (3), i.e., $\hat{p}(o_{dl} \mid x_{id}) = K(o_{dl}, x_{id}; \hat{\lambda}_d)$. Since $\sum_{l=1}^{|O_d|} K(o_{dl}, x_{id}; \hat{\lambda}_d) = 1$, the constraint $\|v_{id}\|_1 = 1$ in Definition 1 is always satisfied. Based on this, our kernel-based symbolic data transformation model can be obtained as follows:

Definition 2 (Kernel-based data transformation model, KDTM) Each symbolic data object $x_i$ is embedded into Euclidean space by being transformed into a numeric vector $X_i$, defined as

$$X_i = \big(v_{i1}^{\mathrm{T}}, v_{i2}^{\mathrm{T}}, \ldots, v_{iD}^{\mathrm{T}}\big)^{\mathrm{T}} \in \mathbb{R}^{\sum_{d=1}^{D} |O_d|}.$$

The numbers of categories, $|O_d|$ ($d = 1, 2, \ldots, D$), are usually much smaller than $N$ in practice; therefore, we have $\sum_{d=1}^{D} |O_d| \ll N$. Compared with the representation model suggested in Qian et al. (2016), where the dimensionality of the embedding equals $N$, the dimensionality of the Euclidean space obtained by our KDTM is much smaller, providing better usability for large-scale data classification.
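As a concrete illustration of Definitions 1 and 2 (our sketch; the bandwidths would come from Eq. (8)), one object is embedded by concatenating its per-attribute self-representation vectors:

```python
import numpy as np

def embed_object(x, bandwidths, category_lists):
    """KDTM (Definition 2): X_i = (v_i1^T, ..., v_iD^T)^T, where each v_id
    holds the kernel estimates K(o_dl, x_id; lambda_d) of Eq. (3)."""
    parts = []
    for x_d, lam, cats in zip(x, bandwidths, category_lists):
        v = np.full(len(cats), lam / len(cats))  # lambda_d/|O_d| for every category
        v[cats.index(x_d)] += 1.0 - lam          # extra mass on the observed category
        parts.append(v)                          # entries of v_id sum to 1
    return np.concatenate(parts)

# e.g., embed_object(('male', 'high'), [0.1, 0.3],
#                    [['female', 'male'], ['low', 'medium', 'high']])
```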

Inner product and distance measures of symbolic data
Based on our KDTM, which involves only numeric values, the inner product of two symbolic objects can be formulated as shown in the following definition.

Definition 3 (Inner product of symbolic objects)
The inner product of two symbolic data objects $x_i$ and $x_j$ is defined as

$$\langle x_i, x_j \rangle = X_i \cdot X_j = \sum_{d=1}^{D} v_{id} \cdot v_{jd}. \qquad (9)$$

In a bit more detail, based on Definition 1, we have that

$$\langle x_i, x_j \rangle = \sum_{d=1}^{D} \bigg[\frac{\lambda_d (2 - \lambda_d)}{|O_d|} + (1 - \lambda_d)^2\, I(x_{id} = x_{jd})\bigg]. \qquad (10)$$

It is easy to verify that the inner product defined in Definition 3 satisfies the common properties of symmetry, linearity and additivity. Furthermore, for the case where $x_{id} = x_{jd}$, the value of Eq. (10) has one more term, $(1 - \lambda_d)^2$, than that for $x_{id} \neq x_{jd}$, which is an obviously reasonable result.

On the other hand, the distance between two symbolic objects can be calculated using a similar approach. First, based on our KDTM, the dissimilarity between $x_i$ and $x_j$ on the $d$th attribute can be easily measured by

$$\mathrm{dis}_d^2(x_i, x_j) = \|v_{id} - v_{jd}\|^2 = \sum_{l=1}^{|O_d|} \big[\hat{p}(o_{dl} \mid x_{id}) - \hat{p}(o_{dl} \mid x_{jd})\big]^2.$$

Then, substituting the conditional probability with Eq. (3), we compute the squared distance between $x_i$ and $x_j$ by adding up the dissimilarity for each attribute, as given in the following Definition 4.
Definition 4 (Distance measure of symbolic objects) The distance between symbolic objects $x_i$ and $x_j$ is defined as

$$\mathrm{dis}(x_i, x_j) = \bigg[\sum_{d=1}^{D} 2(1 - \lambda_d)^2\, I(x_{id} \neq x_{jd})\bigg]^{1/2}. \qquad (11)$$

From Eq. (11), we can see that the distance measure depends on the bandwidth $\lambda_d$. This means that our distance measure for symbolic data is defined based on the data distribution characteristics of a data set. In addition, Eq. (11) implies that each symbolic attribute is, in effect, assigned an individual weight, namely $2(1 - \lambda_d)^2$. As the bandwidth is related to the data dispersion (see Theorem 2), it can be seen that the attribute weight is inversely related to the data dispersion. Note that such a weighting scheme is similar to that commonly used for numerical data (Huang et al. 2005).
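The two measures are direct to transcribe (our sketch; the $\lambda_d$ values would be estimated by Eq. (8)):

```python
import numpy as np

def inner_product(x_i, x_j, bandwidths, sizes):
    """Eq. (10): sum_d [lambda_d(2 - lambda_d)/|O_d| + (1 - lambda_d)^2 I(x_id == x_jd)]."""
    return sum(lam * (2.0 - lam) / m + (1.0 - lam) ** 2 * float(a == b)
               for a, b, lam, m in zip(x_i, x_j, bandwidths, sizes))

def distance(x_i, x_j, bandwidths):
    """Eq. (11): sqrt(sum_d 2(1 - lambda_d)^2 I(x_id != x_jd))."""
    return np.sqrt(sum(2.0 * (1.0 - lam) ** 2
                       for a, b, lam in zip(x_i, x_j, bandwidths) if a != b))
```

Note that neither function materializes the embedded vectors $X_i$; the attribute weight $2(1 - \lambda_d)^2$ enters the distance only through the bandwidths.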

SVM-S: SVM for symbolic data
This subsection aims at deriving an SVM for nonlinear classification of symbolic data, named SVM-S, using our new data transformation model KDTM and the inner product and distance measures formulated in the previous subsections. The main goal of the SVM algorithm is to establish a maximum-margin classification model in the feature space that maximizes the distance between the hyperplane and the two classes of samples. Using a Mercer kernel function $\kappa(\cdot, \cdot)$, SVM is able to map nonlinearly separable samples (in the input space) into a high-dimensional feature space, so that they can be effectively classified in the new space. Generally, such a classification model is learned by solving the optimization problem (see Cortes and Vapnik (1995) for more details of the formulation)

$$\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, \kappa(x_i, x_j) \quad \text{s.t.}\; \sum_{i=1}^{N} \alpha_i y_i = 0,\; 0 \le \alpha_i \le C,\; i = 1, 2, \ldots, N. \qquad (12)$$

There are several choices for the kernel function $\kappa$, including the commonly used polynomial and Gaussian kernels, defined as $\kappa_p(x_i, x_j) = (x_i \cdot x_j + 1)^p$ and $\kappa_g(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$, respectively. Clearly, such kernels cannot be computed for symbolic objects, where both the inner product $x_i \cdot x_j$ and the distance $\|x_i - x_j\|^2$ are undefined. However, based on our KDTM given in Definition 2, the kernels can be adapted to symbolic data using the definitions in Definitions 3 and 4. Formally, for the symbolic objects $x_i$ and $x_j$, we compute the kernels by

$$\kappa_p(x_i, x_j) = \big(\langle x_i, x_j \rangle + 1\big)^p \qquad (13)$$

based on our new inner product formulation in Eq. (10), and

$$\kappa_g(x_i, x_j) = \exp\big(-\mathrm{dis}^2(x_i, x_j) / (2\sigma^2)\big) \qquad (14)$$

using the new distance measure presented in Eq. (11). In this way, the traditional SVM can be adapted for nonlinear symbolic data classification, as outlined in Table 2. [Table 2 outlines the SVM-S algorithm; its final steps are: (3) solve the optimization problem shown in Eq. (12) and obtain the SVM model based on the method presented in Cortes and Vapnik (1995); (4) identify the class of the test samples using the SVM prediction method, with the inner product or distance between symbolic objects computed using the new formulation in Eq. (10) or Eq. (11). The source code of SVM-S is freely available at https://github.com/Yan-XuanHui/SVM-S.git.]
It is interesting to remark that, similar to the kernel trick (Cortes and Vapnik 1995), in our SVM-S the inner product or distance between symbolic objects can be computed directly in the input space using Eq. (10) or (11), without actually converting the symbolic data to Euclidean vectors via KDTM.
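This remark suggests a compact implementation route: evaluate the Gaussian kernel of Eq. (14) directly on symbolic objects and feed the resulting Gram matrix to an off-the-shelf SVM. A minimal sketch with scikit-learn (our illustration, not the released SVM-S code; `distance` and `optimal_bandwidth` are the hypothetical helpers sketched earlier):

```python
import numpy as np
from sklearn.svm import SVC

def gram_matrix(A, B, bandwidths, sigma=1.0):
    """Gram matrix of kappa_g (Eq. (14)) between the objects in A and in B."""
    return np.array([[np.exp(-distance(a, b, bandwidths) ** 2 / (2 * sigma ** 2))
                      for b in B] for a in A])

# bandwidths = [optimal_bandwidth(col) for col in zip(*X_train)]
# clf = SVC(C=1.0, kernel='precomputed')
# clf.fit(gram_matrix(X_train, X_train, bandwidths), y_train)
# y_pred = clf.predict(gram_matrix(X_test, X_train, bandwidths))
```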

Experimental analysis
In this section, we aim to verify the rationality and effectiveness of the proposed KDTM and the performance of the classification algorithm SVM-S for symbolic data.

Data sets and experimental setup
Nine real-world symbolic datasets were used in the experiments, all of which were obtained from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php). Table 3 summarizes the details. We chose to compare the performance of our SVM-S with a few representative classifiers: KNN, the $K^2$NN algorithm (Chen et al. 2014b), Naive Bayes (NB), Random Forest (RF) (Breiman 2001), XGBoost (Chen and Guestrin 2016), and SVM (Cortes and Vapnik 1995). For KNN and SVM, the pairwise distance was computed using the Euclidean distance function based on the One-Hot Encoding method (Alaya et al. 2017). In the experiment, all algorithms were tested on the nine data sets and their results were compared in terms of the weighted F1-measure,

$$\mathrm{WF1} = \frac{1}{N}\sum_{k=1}^{m} F_k \times n_k,$$

where $m$ is the number of classes, $F_k$ is the F1-measure of the $k$th class, and $n_k$ is the number of samples in the $k$th class.
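For reference, the evaluation metric can be computed as below (our sketch; it agrees with sklearn.metrics.f1_score with average='weighted'):

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """WF1 = (1/N) * sum_k F_k * n_k, with F_k the F1-measure of class k."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        prec = tp / max(np.sum(y_pred == c), 1)  # guard against unpredicted classes
        rec = tp / np.sum(y_true == c)
        f_k = 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)
        total += f_k * np.sum(y_true == c)       # weight F_k by the class size n_k
    return total / len(y_true)
```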
As it is currently a difficult problem to choose an appropriate kernel function for SVM (and our SVM-S), a training-and-validation method was used to configure the SVMs in the experiments. For each dataset, we first divided the training set into two disjoint subsets to create a validation set and a new training subset. Next, two SVMs (one with a polynomial kernel and another with a Gaussian kernel) were trained on the training subset and their classification accuracies on the validation set were computed. Finally, the kernel corresponding to the highest accuracy was chosen for each dataset. The results showed that the SVM with a Gaussian kernel was preferred for the Vote, GermanCredit and Tic-Tac-Toe sets, while the SVM with a polynomial kernel was more accurate on the remaining six sets.

Classification performance
In all the experiments, each dataset was classified by each classifier 20 times using 10-fold cross-validation, and the average WF1-score was calculated. For fairness, the grid search method was utilized to find the optimal parameters of each algorithm on each data set. The test results of each algorithm on the nine data sets are summarized in Table 4. The highest WF1-score on each data set is highlighted in bold typeface.
From Table 4, we can see that our SVM-S achieves the highest classification score on seven data sets (BreastCancer, Promoters, Soybean, Dermatology, Vote, GermanCredit and Tic-Tac-Toe). On the Chess data set, SVM-S obtains accuracy comparable to XGBoost; in fact, the classification performances of Random Forest, XGBoost, SVM and SVM-S on this set are approximately equivalent, all reaching a high classification score of more than 99%. SVM-S is slightly worse than XGBoost on the Car set, due to the fact that the set is extremely imbalanced (the numbers of samples in the majority class and the smallest class are 1210 and 65, respectively). Overall, our SVM-S significantly outperforms the KNN, Random Forest, SVM and $K^2$NN algorithms for symbolic data classification. Moreover, it can be more accurate than XGBoost, a state-of-the-art classification algorithm, as evidenced by the average WF1-scores shown in the last line of Table 4, which are 0.973 and 0.942, respectively.
Our SVM-S can be viewed as an SVM variant specially designed for symbolic data classification. Its excellent performance is mainly due to the use of our newly formulated inner product and distance measures for symbolic data. This shows that the proposed methods not only provide a new way of applying the Mercer kernel learning method to symbolic data mining, but also achieve better performance than other commonly used algorithms.

Attribute-weight analysis
To gain insights into the good performance of our SVM-S, we now focus on experimentally analyzing the kernel-based data transformation model KDTM. As discussed in Section 4.2, converting symbolic data objects into numeric vectors using our KDTM is equivalent to weighting each symbolic attribute according to its data dispersion. To demonstrate the effectiveness of the kernel-based weighting scheme, in this set of experiments, the weights learned by KDTM for each attribute of the nine datasets are used for further analysis.
As shown in Eq. (11), the weight assigned to the $d$th symbolic attribute equals $2(1 - \lambda_d)^2$, with the bandwidth $\lambda_d$ computed by Eq. (8). To provide context, we chose the entropy-based attribute weighting method (Zhou et al. 2016) for comparison, in which the weight of the $d$th attribute is computed from $\mathrm{entropy}(A_d)$, the entropy of the $d$th attribute in terms of its category distribution. For convenience, we use $w_{\mathrm{kernel}}$ and $w_{\mathrm{entropy}}$ to denote the two kinds of attribute weights, respectively. The weights learned by the different methods on the nine datasets are shown in Fig. 1.
To examine the relationship between the two sets of weights, the Pearson correlation coefficient was used, computed as $\rho(X, Y) = \mathrm{cov}(X, Y)/(\sigma_X \sigma_Y)$, where $X$ and $Y$ denote $w_{\mathrm{kernel}}$ and $w_{\mathrm{entropy}}$, respectively. From the figure, an obvious positive correlation between $w_{\mathrm{kernel}}$ and $w_{\mathrm{entropy}}$ on the same dataset can be observed. The Pearson correlation coefficient between $w_{\mathrm{kernel}}$ and $w_{\mathrm{entropy}}$ is beyond 0.9 on eight of the data sets; the exception is BreastCancer, on which the coefficient is still larger than 0.87. We also observe that the weights $w_{\mathrm{kernel}}$ and $w_{\mathrm{entropy}}$ are precisely equal on the Car data set; this is because the data of each attribute in this set are uniformly distributed (note that our KDTM is an unsupervised data transformation model). These results mean that the attribute weighting scheme implied in our KDTM behaves similarly to the entropy-based method, which, consequently, provides our model with more capacity to distinguish between symbolic attributes.
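The correlation computation itself is a one-liner; with hypothetical weight vectors for one dataset:

```python
import numpy as np

w_kernel = np.array([0.42, 0.10, 0.33, 0.27])   # hypothetical per-attribute weights
w_entropy = np.array([0.38, 0.12, 0.30, 0.25])
rho = np.corrcoef(w_kernel, w_entropy)[0, 1]    # Pearson correlation coefficient
print(f"rho = {rho:.3f}")
```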

Concluding remarks
In this work, we first used a kernel smoothing method to construct a kernel probability estimation model for symbolic data and proved its convergence and consistency. Building on this, we proposed a kernel-based data transformation model, called KDTM, to embed symbolic data into Euclidean space. We also defined new measures for the inner product and kernel-based weighted distance computation between symbolic objects. Finally, we extended the traditional SVM to SVM-S (i.e., SVM for symbolic data) by using the newly defined measures for nonlinear classification of symbolic data. The performance of the proposed methods was evaluated on nine real-world symbolic data sets, and the experimental results show outstanding classification accuracy, outperforming popular methods.
An important insight from SVM is that a nonlinear transformation can be achieved through the inner product with the kernel learning method, as in, for example, kernel principal component analysis (KPCA) (Wang and Tanaka 2016). Therefore, our work in this paper (e.g., the new kernel-based inner product measure) can help extend more related methods to nonlinear mining of symbolic data. Another interesting direction would be to extend our kernel-based method to learning from more complex data, such as mixed-type data and multivariate time series.

Conflict of interest
The authors declare that they have no conflict of interest.

A Proof of Theorem 1
Since $[I(\cdot)]^2 = I(\cdot)$ and $\sum_{o \in O_d} p(o) = 1$, the expectation of $\hat{p}(o_{dl}; \lambda_d)$ can be obtained from Eq. (4):

$$E\big[\hat{p}(o_{dl}; \lambda_d)\big] = (1 - \lambda_d)\, E[f(o_{dl})] + \frac{\lambda_d}{|O_d|} = (1 - \lambda_d)\, p(o_{dl}) + \frac{\lambda_d}{|O_d|}.$$

So the bias and variance of $\hat{p}(o_{dl}; \lambda_d)$ can be computed as

$$\mathrm{Bias}\big(\hat{p}(o_{dl}; \lambda_d)\big) = E\big[\hat{p}(o_{dl}; \lambda_d)\big] - p(o_{dl}) = \lambda_d\Big(\frac{1}{|O_d|} - p(o_{dl})\Big),$$

$$\mathrm{Var}\big(\hat{p}(o_{dl}; \lambda_d)\big) = (1 - \lambda_d)^2\, \mathrm{Var}\big(f(o_{dl})\big) = \frac{(1 - \lambda_d)^2}{N}\, p(o_{dl})\big(1 - p(o_{dl})\big).$$

By combining the above two equalities via $\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Var}$, the theorem is proved.

B Proof of Theorem 2
For each $o_{dl}$ in Eq. (6), we have that

$$E\bigg[\Big((1 - \lambda_d) f(o_{dl}) + \frac{\lambda_d}{|O_d|} - p(o_{dl})\Big)^2\bigg] = \mathrm{Var}\big((1 - \lambda_d) f(o_{dl})\big) + \Big(E\big[(1 - \lambda_d) f(o_{dl})\big] + \frac{\lambda_d}{|O_d|} - p(o_{dl})\Big)^2.$$

Based on the facts that $E[f(o_{dl})] = p(o_{dl})$ and $[I(\cdot)]^2 = I(\cdot)$, which give $\mathrm{Var}(f(o_{dl})) = p(o_{dl})(1 - p(o_{dl}))/N$, the above equality can be simplified as

$$E\bigg[\Big((1 - \lambda_d) f(o_{dl}) + \frac{\lambda_d}{|O_d|} - p(o_{dl})\Big)^2\bigg] = \frac{(1 - \lambda_d)^2}{N}\, p(o_{dl})\big(1 - p(o_{dl})\big) + \lambda_d^2\Big(\frac{1}{|O_d|} - p(o_{dl})\Big)^2.$$

Therefore, $L(\lambda_d)$ can be computed as

$$L(\lambda_d) = \lambda_d^2 \sum_{l=1}^{|O_d|}\Big(p(o_{dl}) - \frac{1}{|O_d|}\Big)^2 + \frac{(1 - \lambda_d)^2}{N}\sum_{l=1}^{|O_d|} p(o_{dl})\big(1 - p(o_{dl})\big) = \lambda_d^2\Big(1 - \frac{1}{|O_d|} - \sigma_d^2\Big) + \frac{(1 - \lambda_d)^2}{N}\, \sigma_d^2.$$

Letting $\partial L(\lambda_d)/\partial \lambda_d = 0$, we obtain the optimal estimate of $\lambda_d$ and Eq. (7).