Tutorial On Support Vector Machines


 The aim of this tutorial is to help students grasp the theory and applicability of support vector machines (SVMs). The contribution is an intuitive-style tutorial that helps students gain insights into SVMs from a unique perspective. An internet search will reveal many videos and articles on SVM, but many of them give simplified explanations that leave gaps in the derivations that beginning students cannot fill. Most free tutorials lack guidance on practical applications and considerations. The software wrappers in popular programming libraries such as Python and R hide many of the operational complexities. Free software tools often use default parameters that ignore domain knowledge or leave knowledge gaps about the important effects of SVM hyperparameters, resulting in misuse and subpar outcomes. The author uses this tutorial as a course reference for students studying artificial intelligence and machine learning. The tutorial derives the classic SVM classifier from first principles and then derives the practical form that a computer uses to train a classification model. An intuitive explanation about confusion matrices, the F1 score, and the AUC metric extends insights into the inherent tradeoff between sensitivity and specificity. A discussion about cross-validation provides a basic understanding of hyperparameter tuning to maximize generalization by balancing underfitting and overfitting. Even seasoned self-learners with advanced statistical backgrounds have gained insights from this tutorial style of intuitive explanations, with all related considerations for tuning and performance evaluations in one place.


Introduction
Data scientists can apply support vector machines (SVMs) to solve both regression and classification problems. This tutorial focuses on the SVM classifier, which is a non-probabilistic binary classifier. The non-probabilistic aspect is in contrast with probabilistic classifiers, such as Naïve Bayes, that compute class membership likelihood based on the training examples (Aggarwal 2015). An SVM separates data across a decision boundary, which is a hyperplane in multidimensional feature space. Only a small subset of the data touches or supports the decision boundary, which is why the inventors named those data points support vectors.
An SVM cannot classify data into more than two classes. Handling multiclass data with SVMs is still an active area of research. Nevertheless, there are some workarounds. Methods involve creating multiple SVMs that compare feature vectors among themselves by using various techniques such as one-versus-rest (OVR) or one-versus-one (OVO) (Bhavsar and Ganatra 2012). For k classes, the OVR method trains k classifiers so that each class discriminates against the remaining k-1 classes. OVO creates one binary classification problem for all possible pairings of classes, so it requires k(k-1)/2 classifiers. After constructing the number of required binary classifiers for either the OVR or OVO methods, the algorithm classifies a new object according to the majority vote among the set of classifiers.
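As an illustration of the OVO bookkeeping, the pair enumeration and majority vote can be sketched in a few lines of Python (the class names and pairwise winners below are hypothetical):

```python
from collections import Counter
from itertools import combinations

# Hypothetical 4-class problem: OVO needs k(k-1)/2 binary classifiers.
classes = ["cat", "dog", "bird", "fish"]
k = len(classes)
pairs = list(combinations(classes, 2))
print(len(pairs))  # k(k-1)/2 = 6 pairwise classifiers

# Majority vote over the (hypothetical) winner of each pairwise contest.
pairwise_winners = ["cat", "cat", "fish", "dog", "bird", "cat"]
prediction = Counter(pairwise_winners).most_common(1)[0][0]
print(prediction)  # cat
```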
The SVM is a supervised machine learning (ML) method where each data point is a feature vector xi = (x1, ..., xp) with a class label yi. SVM treats each data object as a point in feature space that belongs to one of only two classes. SVM defines the class labels as either yi = +1 or yi = -1. Hence, the mathematical representation of the dataset is

Data = {(xi, yi) | xi ∈ ℜ^p, yi ∈ {-1, +1}}, i = 1, ..., n (1)

where p is the dimension of the feature vector and n is the number of vectors.
During training, the SVM classifier finds a linear decision boundary in feature space that best separates the data objects into the two classes. The equivalent optimization problem finds two parallel hyperplanes that form the widest gap that is void of data objects. A hyperplane is a subspace with one dimension less than the ambient space. The perpendicular distance between the parallel hyperplanes is the margin. The hyperplane that equidistantly bifurcates the space between the parallel hyperplanes defines a multidimensional decision boundary that separates the data into two parts.

Methods
This first subsection introduces the notion of a hyperplane in mathematical terms to help set up the optimization problem that an SVM solves. The second subsection presents the practical implementation of an SVM by introducing Lagrangian multipliers, Kernels, and slack variables that make them suitable for real-world data. The third subsection discusses various scoring metrics that users must consider when evaluating the performance of the trained SVM classifier.

Hyperplanes
An SVM uses a linear hyperplane rather than a non-linear one because in practice the latter will tend to too tightly fit (overfit) a boundary that would perfectly separate the training data, but not necessarily new data. That is, overfitting the model on the training data can cause the classifier to generalize poorly by inaccurately predicting the class of new data (Aggarwal 2015). The general equation for a linear hyperplane in two dimensions is the line

y = ax + b (2)

Rewritten in the standard form for a hyperplane gives

-b - ax + y = 0 (3)

The equivalent vectors are

w = (w0, w1, w2)ᵀ and x = (1, x, y)ᵀ (4)

where w0 = -b, w1 = -a, and w2 = 1. So, taking the dot product of w and x and setting that equal to zero produces the standard form equation for a line:

w · x = 0 (5)

Specific values for the vector components of w define a specific line (linear hyperplane), and any point defined by an (x, y), which is the feature vector, must be located somewhere on that line. Given that the vectors are column vectors, the dot product is equivalent to the matrix operation wᵀx = 0.
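A quick numerical check of the standard-form claim that every point on the line satisfies w · x = 0 (a minimal sketch; the slope and intercept values are arbitrary):

```python
# Check that any point on the line y = a*x + b satisfies w . x = 0,
# with w = (-b, -a, 1) and the augmented feature vector x = (1, x, y).
a, b = 2.0, 1.0            # slope and intercept (arbitrary example values)
w = (-b, -a, 1.0)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

for x in (-1.0, 0.0, 3.0):
    y = a * x + b          # point on the line
    assert abs(dot(w, (1.0, x, y))) < 1e-12
print("all on-line points satisfy w . x = 0")
```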
The vector and matrix forms are easier to represent in computer memory for rapid computations. The dot product of two vectors x and y is

x · y = ∑ xiyi = ‖x‖‖y‖cos(θ) (6)

which is a scalar value. Hence, for the dot product of w and x to be zero, the vectors must be perpendicular to each other because cos(90°) = 0. That is, if x lies on a plane of the space, then the w vector that defines that plane must be perpendicular (normal) to both x and the plane. Hence, the unit vector

u = w/‖w‖ (8)

where ‖w‖ = √(∑ wi²) (7) is the vector norm or length, must also be perpendicular to the plane. For the 2D example,

u = (w1, w2)/√(w1² + w2²) (9)

This unit vector of w is important for finding the distance of any point (feature) from the hyperplane by projecting that point onto a vector that is normal to the hyperplane. For example, vector p in Figure 1 is the projection of point A onto the plane of the w vector, which is normal to the hyperplane. Hence, the distance from point A to the hyperplane is the same as the length of p, which is ‖p‖. In this example, the projection of vector a onto the plane of w is

p = (a · u)u (10)

The dot product produces a scalar, which is the magnitude (length) of the vector such that

‖p‖ = a · u = ∑ aiui (11)

and the direction of the vector is u.

Figure 1: Projection of a vector to compute the distance to a hyperplane.
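The projection-based distance can be computed directly. The following sketch uses arbitrary example values for the normal vector, bias, and point:

```python
import math

# Distance of a point from the hyperplane w . x + w0 = 0 is the length of
# the projection of the point onto the unit normal u = w / ||w||:
# d = |w . x + w0| / ||w||.  Example values below are arbitrary.
w = (3.0, 4.0)   # normal vector to the hyperplane
w0 = -5.0        # bias term
point = (4.0, 3.0)

norm_w = math.sqrt(sum(wi * wi for wi in w))           # ||w|| = 5
score = sum(wi * xi for wi, xi in zip(w, point)) + w0  # 12 + 12 - 5 = 19
distance = abs(score) / norm_w
print(distance)  # 3.8
```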

Margin Optimization
The illustration of Figure 2 shows that the region bounded by the two hyperplanes H1 and H2 is void of data points. The "support vectors" are the data objects at the boundaries of the two hyperplanes. Figure 2 highlights the four support vectors (points) of this example with thick black borders. The decision boundary is a linear hyperplane H0 that is equidistant between the two hyperplanes H1 and H2. The hyperplane margin is the separation distance between the two hyperplanes H1 and H2. Equivalently, hyperplanes H1 and H2 are offset by some amount δ from the decision boundary hyperplane H0. Associating w with the hyperplane H0, the parallel hyperplane with offset +δ to one side is

w · x + w0 = +δ (12)

and an equal and opposite offset to the other side of the hyperplane is

w · x + w0 = -δ (13)

Given a feature vector xi, the assigned class must satisfy

w · xi + w0 ≥ +δ (14)

for a class label of yi = +1 and

w · xi + w0 ≤ -δ (15)

for a class label of yi = -1. Solving the optimization problem will find the optimum w that maximizes the offset or margin. For mathematical convenience, setting the offset to δ = 1 facilitates combination of the two classification constraints into a single constraint. That is, multiplying both sides of equation (14) by yi, and assigning the class label value of +1 to the right yields

yi(w · xi + w0) ≥ 1 (16)

Doing the same for equation (15) with its yi label value at -1 yields

yi(w · xi + w0) ≥ 1 (17)

Note that per the rule for inequalities, multiplying by -1 flips the inequality sign. Consequently, the single constraint of the optimization problem becomes

yi(w · xi + w0) ≥ 1 for all i (18)

Given that the vector w is perpendicular (normal) to the hyperplane H0, its unit vector u = w/‖w‖ must be perpendicular to H0 with magnitude 1. The vector du is the perpendicular vector from hyperplane H0 to a parallel hyperplane H1 some distance d away. Let x0 be the base coordinate of the du vector on the hyperplane H0, and z0 be the tip coordinate that lies on the hyperplane H1.
The distance vector is

z0 = x0 + du (19)

The fact that z0 is on H1 means that

w · z0 + w0 = 1 (20)

Multiplying both sides of Equation (19) by the dot product of w and substituting z0 yields

w · z0 = w · (x0 + du) (21)

Substituting u from Equation (8) yields

w · z0 = w · (x0 + d w/‖w‖) (22)

Expanding Equation (22) yields

w · z0 = w · x0 + d(w · w)/‖w‖ (23)

Given that

w · w = ‖w‖² (24)

Equation (23) becomes

w · z0 = w · x0 + d‖w‖ (25)

Hence, substituting into Equation (20),

w · x0 + d‖w‖ + w0 = 1 (26)

The fact that x0 is on H0 means that

w · x0 + w0 = 0 (27)

Substituting Equation (27) into Equation (26) yields

d‖w‖ = 1 (28)

Solving for distance d yields

d = 1/‖w‖ (29)

Therefore, the margin, which is the distance between the hyperplanes H1 and H2, must be 2/‖w‖. Hence the margin is

m = 2d = 2/‖w‖ (30)

Note again that using the class labels of +1 and -1 as the decision boundaries for vectors above and below H1, respectively, allowed for a simple derivation of the margin in terms of only the length of the w vector. From Equation (30) the optimization problem of maximizing the margin becomes equivalent to minimizing the length of the w vector. Hence, the optimization problem is to minimize the norm of w subject to the constraint of Equation (18), which keeps the hyperplane void of data objects. That is, the optimization problem becomes

minimize ‖w‖ subject to yi(w · xi + w0) ≥ 1, i = 1, ..., n (31)

Note that this notation for the constraint separates w0, the bias component, from the w vector, which means that the iteration indices for the components of w run from 1 to p rather than 0 to p. For intuition, one can mentally visualize the optimization in 2D space as finding the orientation and maximum size of a rectangle that is void of data points. Upon converging to the optimum solution, the trained SVM classifier becomes

y = sgn(w · x + w0) (32)

That is, the predicted class label for a new feature vector x is the sign of the dot product between the w vector and the new vector. The signum function produces -1 if its argument is negative (less than zero), 0 if its argument is equal to 0, and +1 if its argument is positive (greater than zero).
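A quick numerical sketch of the margin 2/‖w‖ from Equation (30) and the signum classifier of Equation (32), using arbitrary example values for w and w0:

```python
import math

# For the hard-margin SVM, the margin is 2 / ||w|| and the trained
# classifier is sign(w . x + w0).  Example values are arbitrary.
w = (3.0, 4.0)
w0 = -1.0

margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print(margin)  # 2 / 5 = 0.4

def predict(x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return (score > 0) - (score < 0)  # signum: -1, 0, or +1

print(predict((1.0, 1.0)))   # 3 + 4 - 1 = 6 > 0  ->  1
print(predict((-1.0, 0.0)))  # -3 - 1 = -4 < 0    ->  -1
```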

Practical Implementation
A recasting of the minimization optimization problem presented in Equation (31) results in the primal problem, which is a quadratic programming problem with constraints as

minimize ½‖w‖² subject to yi(w · xi + w0) ≥ 1, i = 1, ..., n (33)

Note that the squaring of the length of the w vector produces a parabola with a single minimum. The ½ factor is for the mathematical convenience of cancellation when taking the partial derivative with respect to w, such as in a gradient descent search.

Lagrangian Dual Form
The primal quadratic form can become impractical to solve with large datasets because the optimization space considers the dot product of w and all the training vectors. Fortunately, researchers discovered that a dual form based on Lagrangian multipliers provides a practical alternative (Aggarwal 2015). Some benefits of the dual form are a reduction of computational resources and direct support for using Kernel functions to find suitable hyperplanes that can separate non-linearly separable data. The dual problem requires learning only as many non-zero parameters as there are support vectors, which can be significantly fewer than the number of feature space dimensions. Constructing the dual form involves constructing the Lagrangian (see the appendix) by combining both the objective function f(w) and the equality constraint g(w) such that the Lagrangian function is L(w) = f(w) - λg(w), with λ being the so-called Lagrangian multiplier. Hence, the objective function and the constraint combine to yield the Lagrangian formulation

L = ½‖w‖² - ∑ αi[yi(w · xi + w0) - 1] (36)

where the αi parameters are the Lagrange multipliers. Note that the Lagrangian is a function of only the w and the αi parameters because the data objects xi and labels yi are known from the data. Expanding the summation yields

L = ½(w · w) - ∑ αiyi(w · xi) - w0 ∑ αiyi + ∑ αi (37)

The maximum (or minimum) solutions of the Lagrangian that satisfy the constraints are where the gradient is zero such that

∇L = 0 (38)

Finding the gradient requires taking the partial derivatives with respect to each variable, setting them equal to zero, and then solving the resulting simultaneous equations. So, differentiating with respect to w yields

∂L/∂w = w - ∑ αiyixi = 0 (39)

Rearranging terms yields a solution for w as

w = ∑ αiyixi (40)

Then differentiating with respect to the bias component w0 yields

∂L/∂w0 = -∑ αiyi = 0 (41)

This solution forms a constraint

∑ αiyi = 0 (42)

Substituting the solution for w from Equation (40) and the constraint from Equation (42) back into the Lagrangian Equation (37) yields

L = ∑ αi - ½(∑ αiyixi) · (∑ αjyjxj) (43)

A simplification of the last remaining term is

V = ∑ αiyixi (44)

The arbitrary assignment to vector V simplifies the remaining algebra.
That is,

V · V = ‖V‖‖V‖cos(0) = ‖V‖² (45)

and

‖V‖ = √(V · V) (46)

Hence, squaring ‖V‖ removes the square root. This gives a simplified notation for the Lagrangian of Equation (43) as

L = ∑ αi - ½‖V‖² (47)

Expanding the norm-squared vector yields

‖V‖² = ∑∑ αiαjyiyj(xi · xj) (48)

So, the final Lagrangian expression becomes

L = ∑ αi - ½ ∑∑ αiαjyiyj(xi · xj) (49)

Finally, the optimization problem now becomes

maximize L(α) subject to αi ≥ 0 and ∑ αiyi = 0 (50)

Substituting the solution for w into Equation (32) gives the classifier as

y = sgn(∑ αiyi(xi · x) + w0) (51)

The dual form transforms the optimization problem to one of learning the αi parameters, given the training data xi and their classifications yi. A key feature of Equation (40) is that αi = 0 for all vectors except the support vectors; the non-zero parameters are those of the support vectors.
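The dual optimization can be illustrated with projected gradient ascent on a toy two-point dataset whose solution is known analytically (a sketch, not a production solver such as SMO):

```python
# Toy dataset: x1=(1,0) with y1=+1 and x2=(-1,0) with y2=-1.  By symmetry
# the equality constraint sum(alpha_i * y_i) = 0 stays satisfied, so only
# the alpha_i >= 0 projection is applied here; real solvers handle the
# equality constraint explicitly.
X = [(1.0, 0.0), (-1.0, 0.0)]
Y = [1.0, -1.0]
n = len(X)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Q[i][j] = y_i * y_j * (x_i . x_j), the quadratic term of the dual.
Q = [[Y[i] * Y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]
alpha = [0.0] * n
eta = 0.1
for _ in range(200):
    grad = [1.0 - sum(Q[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    alpha = [max(0.0, alpha[i] + eta * grad[i]) for i in range(n)]

# Recover w from Equation (40): w = sum(alpha_i * y_i * x_i)
w = [sum(alpha[i] * Y[i] * X[i][d] for i in range(n)) for d in range(2)]
print([round(a, 3) for a in alpha])  # [0.5, 0.5]
print([round(c, 3) for c in w])      # [1.0, 0.0]
```

Both points are support vectors here, so both multipliers end up non-zero; the recovered w = (1, 0) gives the expected margin 2/‖w‖ = 2 between the two points.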

Kernel Trick
Some data is not separable using linear hyperplanes. Fortunately, transforming non-linearly separable data to a higher dimensional space can allow the SVM to separate them with a linear hyperplane in the new space. The transformation function Φ(x) creates new vectors by mapping the input vectors to a higher dimensional feature space. The transform does so by operating on individual features or combinations of features in the lower (ambient) dimension. For example, the exponential transform can create infinite feature dimensions because the exponential is an infinite series that operates on a single dimension z such that

e^z = 1 + z + z²/2! + z³/3! + ⋯ (52)

Applying the transformation function changes the training problem to the optimization

maximize ∑ αi - ½ ∑∑ αiαjyiyj(Φ(xi) · Φ(xj)) subject to αi ≥ 0 and ∑ αiyi = 0 (53)

and the classification problem to

y = sgn(∑ αiyi(Φ(xi) · Φ(x)) + w0) (54)

As with the non-transformed vectors, the dot product of the transformed vectors produces a scalar value called the SVM score, and the sign yields the class label. The core idea behind using a Kernel is that the problem of first transforming each vector to a higher dimension and then taking the dot product Φ(xi) · Φ(xj) becomes equivalent to applying a special Kernel function K to the original vectors that achieves the same thing, where

K(xi, xj) = Φ(xi) · Φ(xj) (55)

The trick aspect is that the Kernel achieves the same result as the dot product of the transformed vectors without computing the transforms. This "trick" saves computing time and lowers memory requirements.
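The equivalence can be verified numerically for the degree-2 polynomial kernel, whose explicit feature map for 2D vectors is known:

```python
import math

# Verify the "trick": for 2D vectors, the polynomial kernel K(x, y) = (x . y)^2
# equals the dot product of the explicit degree-2 feature maps
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), without ever computing phi.
def phi(v):
    return (v[0] ** 2, math.sqrt(2.0) * v[0] * v[1], v[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 0.5)
kernel = dot(x, y) ** 2         # (3 + 1)^2 = 16
explicit = dot(phi(x), phi(y))  # same value via the explicit transform
print(kernel, round(explicit, 10))  # 16.0 16.0
```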

Slack Variables
Data may not be completely separable in some cases, particularly if noisy. That is, some points will lie within the margin, or they may not be correctly classified. Therefore, practical implementations of SVMs allow for some slackness in the classification by introducing an adjustable cost penalty hyperparameter C. The design is such that if C = 0, the SVM allows all errors. If C → ∞, then the SVM does not tolerate errors. Hence, SVM users must set the hyperparameter C to some finite positive value. The classifier with slackness is

yi(w · xi + w0) ≥ 1 - ξi, ξi ≥ 0 (67)

where the ξi are the slack variables. The cost function to optimize then becomes

J = ½‖w‖² + C ∑ ξi^k (68)

such that C determines the contributions of the slack variables relative to the influence of ‖w‖. The power k is an integer hyperparameter that SVM users normally set to k = 1. Inclusion of the slack variables requires forming the Lagrangian to include both the new objective, Equation (68), and the new constraint functions, Equation (67). As done before, setting the partial derivatives of the modified Lagrangian to zero, solving for w and w0, and substituting them back into the Lagrangian yields the general SVM as

maximize ∑ αi - ½ ∑∑ αiαjyiyjK(xi, xj) subject to 0 ≤ αi ≤ C and ∑ αiyi = 0 (69)

Note that the only change is that the constraint involving the Lagrange multipliers has an upper bound of C. The generalization is that setting C = ∞ results in the original solution. Setting C closer to zero results in a slack margin that trades off classifying more data by allowing for a higher rate of misclassification.
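To illustrate the role of C, the following is a minimal sketch of subgradient descent on the primal hinge-loss objective ½‖w‖² + C∑max(0, 1 - yi(w · xi + w0)), not the dual solver a production library would use; the toy dataset, step size, and epoch count are arbitrary:

```python
# Soft-margin sketch via per-sample subgradient steps on the hinge loss.
data = [((2.0, 2.0), 1), ((2.0, 3.0), 1), ((-2.0, -2.0), -1), ((-3.0, -2.0), -1)]
C, eta, epochs = 1.0, 0.05, 200
w, w0 = [0.0, 0.0], 0.0

for _ in range(epochs):
    for x, y in data:
        margin = y * (w[0] * x[0] + w[1] * x[1] + w0)
        if margin < 1:  # inside the margin or misclassified: hinge is active
            w = [wi - eta * (wi - C * y * xi) for wi, xi in zip(w, x)]
            w0 = w0 + eta * C * y
        else:           # outside the margin: only the regularizer pulls on w
            w = [wi - eta * wi for wi in w]

predictions = [1 if w[0] * x[0] + w[1] * x[1] + w0 > 0 else -1 for x, _ in data]
print(predictions)  # [1, 1, -1, -1]
```

Raising C makes the hinge term dominate (fewer tolerated violations); lowering C lets the regularizer shrink w, widening the margin at the cost of more misclassification.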

Classifier Evaluation
Classifiers often report a set of performance parameters such as accuracy, precision, and recall (Géron 2019). This section will define the various performance metrics by using the following abbreviations:
• TP = True Positive
• TN = True Negative
• FP = False Positive (Type I error or false alarm)
• FN = False Negative (Type II error or missed detection)

Confusion Matrix
A confusion matrix is a table that summarizes the performance of the classifier. Table 1 illustrates a confusion matrix with the relationships defined between each performance metric.
The error rate or misclassification rate is 1 - a, where a is the accuracy. The precision p of classification for each class is a measure of its prediction accuracy for that class. Precision is the proportion of positive predictions that are correct, where

p = TP/(TP + FP) (70)

Precision, therefore, provides a relative comparison of performance for each class to gauge unbalanced performers, that is, those that predict better for one class than another. Practitioners sometimes conflate precision with specificity. Intuitively, one can think of specificity as the ability of a classifier to identify the negative class without errors. Specificity is the true negative rate TN/(TN + FP), which is equivalently 1 minus the false positive rate.
The recall r is the proportion of the positive classes that the classifier correctly predicts, where

r = TP/(TP + FN) (71)

Intuitively, one can think of recall as the proportion of the true positives that a classifier can recall from the sample by its predictions. Therefore, practitioners sometimes refer to recall as sensitivity.
The F1 score is the harmonic mean of precision and recall such that

F1 = 2pr/(p + r) (72)

Hence, the harmonic mean combines the precision and recall into a single score where 1 is best and 0 is worst. Note that F1 is 1 if p = r = 1 and zero if either p or r is zero. Rewriting the F1 score as

F1 = TP/(TP + (FP + FN)/2) (73)

makes it clear that the score is equivalent to either a precision or a recall based on the mean of the total errors, which is (FP + FN)/2. In practice, there is always a tradeoff between sensitivity and specificity. Hence, the F1 score reflects the amount of the tradeoff involved.
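The metric definitions above can be checked numerically; the confusion counts below are arbitrary example values:

```python
# Compute the metrics from raw confusion-matrix counts and check the
# rewritten F1 identity F1 = TP / (TP + (FP + FN)/2).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 85 / 100
precision = TP / (TP + FP)                   # 40 / 45
recall = TP / (TP + FN)                      # 40 / 50
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.85 0.889 0.8 0.842
assert abs(f1 - TP / (TP + (FP + FN) / 2)) < 1e-12
```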

ROC and AUC
ROC is often confused with the term region of convergence. However, it stands for Receiver Operating Characteristic in the context of evaluating classifier performance. The ROC is a graphical plot of the TP rate versus the FP rate (Type I error), as a function of a classifier's decision threshold (Fawcett 2006). The y-axis and x-axis are ranges for the TP and FP rates, respectively. Figure 4 provides an illustration for deriving a ROC based on hypothetical probability distributions of two classes. In this scenario, a radio receiver predicts the transmission of a zero binary digit or a one binary digit based on a received feature. In this scenario, the feature is a voltage level derived from the antenna. The plot spans the feature value on the x-axis and the probability of reception for each class on the y-axis. The feature threshold T is set such that the classifier, which is a binary receiver in this scenario, predicts the transmission as a zero and a one digit for voltages received below and above T, respectively. The amount of overlap between the two distributions is proportional to the amount of noise in the transmission channel, which causes confusion about the binary digits received. That is, some of the voltages representing zeros that exceed the threshold will cause the classifier to predict them as ones instead, so those would be FP predictions. Similarly, some of the voltages representing ones that fall below the threshold would cause the classifier to predict them as zeros instead, so those would be FN predictions. A low noise channel would yield distributions for zeros and ones that do not overlap, in which case the classifier can find a clear threshold to separate the classes without FP or FN errors.
The amount of overlap in the distributions affects the shape of the ROC curve. No overlap moves the curve towards the coordinate of perfect classification (0, 1). Full overlap moves the ROC curve towards the diagonal, which represents the results for random guessing. When the ROC curve falls below the diagonal, the classifier performs worse than random guessing. Given some amount of overlap in the distribution of feature values for each class (Figure 4a), moving the threshold is equivalent to moving the point along the ROC curve (Figure 4b). The goal of classifier design is to maximize the area under the ROC curve (AUC), and to move the detection threshold towards the point of perfect classification.
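A minimal sketch of tracing a ROC curve and estimating AUC from classifier scores, assuming illustrative scores and labels:

```python
# Sweep the decision threshold over the classifier scores to trace the ROC
# curve as (FPR, TPR) points, then integrate with the trapezoid rule.
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
P = sum(labels)           # number of positives
N = len(labels) - P       # number of negatives

points = []
for t in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    points.append((fp / N, tp / P))  # (FPR, TPR) at threshold t
points = [(0.0, 0.0)] + points      # the curve starts at the origin

auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(round(auc, 3))  # 0.938
```

One mislabeled score (0.4 scored below the negative at 0.55) pulls the AUC below the perfect value of 1.0.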
The tradeoff between sensitivity and specificity becomes evident by imagining moving the threshold line T. Increasing the threshold level by moving it towards the right will decrease the FP rate because the sensitivity will be lowered, thus reducing the probability of predicting noise as a positive. However, the higher threshold (lower sensitivity) will also increase the number of missed detections by not detecting true positives that fall below the threshold, thus increasing the FN rate. In summary, decreasing the sensitivity of the receiver will decrease the FP rate but increase the FN rate. The reverse occurs when increasing the sensitivity by lowering the threshold. That is, the FP rate will increase (decreased specificity) because the receiver may detect noise peaks as positives whereas the FN rate will decrease (increased sensitivity) because the receiver will be less likely to miss a weak signal that is a true positive.

Cross Validation
The goal of validating a classifier is to determine how it performs on unseen data. This is a mechanism to help prevent setting hyperparameters that cause the model to overfit the training data, which can lead to poor generalization in performance on new data. The simplest method of classifier performance validation splits the dataset into a training and a validation subset. Typically, the classifier trains on 70% of the data and uses the remaining 30% to validate its performance using one or more of the performance measures such as F1 score and AUC. The literature also refers to the two-part data split as leave-p-out validation. That is, the approach leaves p observations in a validation dataset, and uses the rest for training. Some techniques use leave-one-out validation (p = 1) for relatively small datasets.
Another technique called cross-validation evaluates the performance of a classifier by segmenting the dataset into several near equal parts so that the evaluation cycle uses each part for validation at least once and the union of the remaining parts to train a new model. Thereafter, the average score of the models becomes the reported performance of the classifier. K-fold cross-validation is a generalization that partitions the data into k subsets (folds); each fold serves once as the validation set while the union of the remaining k - 1 folds trains a new model. The algorithm creates the partitions by randomly selecting data points for each subset. The user can also modify the randomization to yield stratification that balances the proportion of classes across each subset. The key advantage of k-fold cross-validation is that it uses all the data for both training and validation to prevent bias towards any portion of the data. The disadvantage is increased processing time for larger datasets.
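The fold bookkeeping can be sketched as follows (a minimal splitter, not a library implementation):

```python
import random

# A minimal k-fold splitter: shuffle the indices, carve them into k near
# equal folds, and yield (train, validation) index lists so that every
# data point is validated exactly once.
def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

splits = list(k_fold_indices(10, 5))
print(len(splits))  # 5 folds
validated = sorted(j for _, val in splits for j in val)
print(validated)    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Stratification would additionally constrain the shuffle so each fold preserves the class proportions.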

Discussion
This section expands on some of the intuition behind the Lagrangian formulation, the Kernel Trick, and tuning hyperparameters. A few remarks also highlight some of the main advantages and limitations of SVMs relative to other types of machine learning classifier models.

The Lagrangian
The Lagrangian formulation is a mathematical tool that packages an optimization problem with constraints so that the problem becomes more practical to solve using a computer. The formulation comes from a realization that at each solution point in feature space, the gradient vector of the optimization function f is a scaled version of the gradient vector of the combined constraint function g such that

∇f = λ∇g (75)

where λ is the scaling factor or the Lagrangian multiplier. Rewriting Equation (75) and assigning it to a function (the Lagrangian) gives

L(w, λ) = f(w) - λg(w) (76)

This Lagrangian formulation includes an extra variable λ in the solution set. Hence, solving ∇L(w, λ) = 0 finds all the solution points where the optimization and combined constraint functions share the same tangent. At each solution point, the two gradient vectors are perpendicular to the tangent. After substituting the solutions into the original Lagrangian expression, a subsequent optimization problem finds the maximum or minimum in the solution set. Some of the advantages of the Lagrangian formulation are:
1. There exists a global minimum or maximum (none are local), unlike the situation for other optimization problems such as using gradient descent to train artificial neural networks. Therefore, the solution set for a given training dataset is global and not local.
2. There are relatively few support vectors that define the class decision boundary, which reduces the computational complexity over other methods that use all the training data.
3. During training, the dot product of the feature vectors can facilitate rapid computation by leveraging the efficient multiply-accumulate (MAC) operation of digital signal processors.
4. The dot product reduces complexity by facilitating the direct substitution of any Kernel function that can generalize the optimization problem of finding either linear or non-linear hyperplanes in the ambient space.

The Kernel Trick
As demonstrated earlier, a Kernel achieves the same effect as computing a dot product of the vectors transformed to a higher dimensional feature space by performing a dot product, along with some other manipulations, only in the ambient space. The type of Kernel selected determines the nature of the higher dimension space and its effectiveness in separating the classes so that an optimal bisecting hyperplane becomes possible. Given that a dot product is a measure of similarity between feature vectors, a Kernel retains the relative relationship among the feature vectors in the higher dimension space. Hence, feature clusters in ambient space will similarly cluster in the higher dimension feature space. Data objects that are more alike tend to cluster together and dissimilar data objects tend to be further apart from each other. Figure 5 illustrates the dot product of two vectors x and y to highlight their relationship based on the mathematical definition x · y = ‖x‖‖y‖cos(θ). Vectors that point in the same direction are maximally similar because the angle between them is zero degrees and cos(0) = 1. Therefore, the measure of similarity for parallel vectors is just the product of their lengths. In contrast, perpendicular vectors will be dissimilar regardless of their lengths because the angle between them is 90 degrees and cos(90°) = 0. In summary, although a Kernel function adds the extra dimensions of feature space to enable the SVM to separate the classes with a linear hyperplane, the Kernel preserves the relative within-class and across-class relationships in the ambient space.

Hyperparameter Tuning
As noted, the value of the slack hyperparameter C can range from zero to infinity and accommodates the tradeoff of classifying more data by allowing for a higher rate of misclassification. Other hyperparameters include the type of Kernel function and their parameters. Setting the best value for the hyperparameters requires some intuition about the nature of the data, some domain knowledge about the processes that generated the data, and some idea about the relationship among the vectors in feature space. Unsupervised machine learning methods such as principal component analysis (PCA) (Jolliffe and Cadima 2016), k-means clustering, and non-linear dimensionality reduction (Becht, et al. 2019) can help to visualize relationships in a lower dimension space such as 2D or 3D. Such visualizations can reveal how sparse, clustered, noisy, or separable the data might be.
A blind approach to finding the best hyperparameters for a classifier is to conduct several cross-validation cycles and plot the performance trends with different hyperparameter settings (Bridgelall and Tolliver 2021). Of course, doing so also requires selecting the best cross-validation type, the cross-validation parameters, the performance scores to monitor, and the number of cycles to average those scores.
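The blind search can be sketched as a grid loop over candidate settings; `cross_val_score` below is a hypothetical stand-in for an actual cross-validation run that would average, for example, F1 scores:

```python
from itertools import product

# Skeleton of a "blind" grid search: evaluate every (C, gamma) pair with a
# scoring function and keep the best pair.
def cross_val_score(C, gamma):
    # Placeholder score surface peaking at C=1.0, gamma=0.1 (illustrative
    # only; a real run would train and validate an SVM per fold).
    return 1.0 / (1.0 + abs(C - 1.0) + abs(gamma - 0.1))

grid_C = [0.01, 0.1, 1.0, 10.0]
grid_gamma = [0.001, 0.01, 0.1, 1.0]

best = max(product(grid_C, grid_gamma), key=lambda p: cross_val_score(*p))
print(best)  # (1.0, 0.1)
```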

Limitations of SVM
Some advantages of SVM classifiers are: