Journal of Statistical Software, April 2006, Volume 15, Issue 9. http://www.jstatsoft.org/

Support Vector Machines in R

Alexandros Karatzoglou (Technische Universität Wien), David Meyer (Wirtschaftsuniversität Wien), Kurt Hornik (Wirtschaftsuniversität Wien)

Abstract

Being among the most popular and efficient classification and regression methods currently available, implementations of support vector machines exist in almost every popular programming language. Currently four R packages contain SVM related software. The purpose of this paper is to present and compare these implementations.

Keywords: support vector machines, R.

1. Introduction

Support Vector learning is based on simple ideas which originated in statistical learning theory (Vapnik 1998). The simplicity comes from the fact that Support Vector Machines (SVMs) apply a simple linear method to the data, but in a high-dimensional feature space non-linearly related to the input space. Moreover, even though we can think of SVMs as a linear algorithm in a high-dimensional space, in practice they do not involve any computations in that high-dimensional space. This simplicity, combined with state-of-the-art performance on many learning problems (classification, regression, and novelty detection), has contributed to the popularity of the SVM. The remainder of the paper is structured as follows. First, we provide a short introduction to Support Vector Machines, followed by an overview of the SVM-related software available in R and other programming languages. Next follows a section on the data sets we will be using. Then, we describe the four available SVM implementations in R. Finally, we present the results of a timing benchmark.

2. Support vector machines

SVMs use an implicit mapping $\Phi$ of the input data into a high-dimensional feature space
defined by a kernel function, i.e., a function returning the inner product $\langle \Phi(x), \Phi(x')\rangle$ between the images of two data points $x, x'$ in the feature space. The learning then takes place in the feature space, and the data points only appear inside dot products with other points. This is often referred to as the "kernel trick" (Schölkopf and Smola 2002). More precisely, if a projection $\Phi : X \to H$ is used, the dot product $\langle \Phi(x), \Phi(x')\rangle$ can be represented by a kernel function $k$,

\[
k(x, x') = \langle \Phi(x), \Phi(x')\rangle, \qquad (1)
\]

which is computationally simpler than explicitly projecting $x$ and $x'$ into the feature space $H$.

One interesting property of support vector machines and other kernel-based systems is that, once a valid kernel function has been selected, one can practically work in spaces of any dimension without any significant additional computational cost, since the feature mapping is never effectively performed. In fact, one does not even need to know which features are being used.

Another advantage of SVMs and kernel methods is that one can design and use a kernel for a particular problem that could be applied directly to the data without the need for a feature extraction process. This is particularly important in problems where a lot of structure of the data is lost by the feature extraction process (e.g., text processing).

Training an SVM for classification, regression or novelty detection involves solving a quadratic optimization problem. Using a standard quadratic problem solver for training an SVM would involve solving a big QP problem even for a moderately sized data set, including the computation of an $m \times m$ matrix in memory ($m$ being the number of training points). This would seriously limit the size of problems an SVM could be applied to. To handle this issue, methods like SMO (Platt 1998), chunking (Osuna, Freund, and Girosi 1997) and simple SVM (Vishwanathan, Smola, and Murty 2003) exist that iteratively compute the solution of the SVM, scale $O(N^k)$ with $k$ between 1 and 2.5, and have a linear space complexity.

2.1. Classification

In classification, support vector machines separate the different classes of data by a hyper-plane

\[
\langle w, \Phi(x)\rangle + b = 0 \qquad (2)
\]

corresponding to the decision function

\[
f(x) = \mathrm{sign}(\langle w, \Phi(x)\rangle + b). \qquad (3)
\]

It can be shown that the optimal hyper-plane, in terms of classification performance (Vapnik 1998), is the one with the maximal margin of separation between the two classes. It can be constructed by solving a constrained quadratic optimization problem whose solution $w$ has an expansion $w = \sum_i \alpha_i \Phi(x_i)$ in terms of a subset of training patterns that lie on the margin. These training patterns, called support vectors, carry all relevant information about the classification problem. Omitting the details of the calculation, there is just one crucial property of the algorithm that we need to emphasize: both the quadratic programming problem and the final decision function depend only on dot products between patterns. This allows the use of the "kernel trick" and the generalization of this linear algorithm to the nonlinear case.
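As a concrete illustration of working with kernel functions in R, the short sketch below evaluates a Gaussian RBF kernel using the kernlab package (one of the implementations discussed later in this paper); the toy data and the value of sigma are arbitrary choices made purely for illustration.

R> library("kernlab")
R> x <- matrix(rnorm(20), ncol = 2)   # 10 random points in R^2
R> rbf <- rbfdot(sigma = 1)           # k(x, x') = exp(-sigma * ||x - x'||^2)
R> rbf(x[1, ], x[2, ])                # kernel value for a single pair of points
R> kernelMatrix(rbf, x)               # 10 x 10 matrix of all pairwise kernel values

Since such kernel evaluations are all that the learning algorithm needs, the dimensionality of the feature space induced by the kernel never enters the computation explicitly.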
In the case of the L2-norm soft margin classification the primal optimization problem takes the form:

\[
\begin{aligned}
\text{minimize} \quad & t(w, \xi) = \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & y_i(\langle \Phi(x_i), w\rangle + b) \ge 1 - \xi_i \quad (i = 1, \dots, m) \\
& \xi_i \ge 0 \quad (i = 1, \dots, m)
\end{aligned}
\qquad (4)
\]

where $m$ is the number of training patterns and $y_i = \pm 1$. As in most kernel methods, the SVM solution $w$ can be shown to have an expansion

\[
w = \sum_{i=1}^{m} \alpha_i y_i \Phi(x_i) \qquad (5)
\]

where non-zero coefficients (support vectors) occur when a point $(x_i, y_i)$ meets the constraint. The coefficients $\alpha_i$ are found by solving the following (dual) quadratic programming problem:

\[
\begin{aligned}
\text{maximize} \quad & W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \\
\text{subject to} \quad & 0 \le \alpha_i \le \frac{C}{m} \quad (i = 1, \dots, m) \\
& \sum_{i=1}^{m} \alpha_i y_i = 0.
\end{aligned}
\qquad (6)
\]

This is a typical quadratic problem of the form:

\[
\begin{aligned}
\text{minimize} \quad & c^\top x + \frac{1}{2} x^\top H x \\
\text{subject to} \quad & b \le Ax \le b + r \\
& l \le x \le u
\end{aligned}
\qquad (7)
\]

where $H \in \mathbb{R}^{m \times m}$ with entries $H_{ij} = y_i y_j k(x_i, x_j)$, $c = (-1, \dots, -1)^\top \in \mathbb{R}^m$, $u = (C, \dots, C)^\top \in \mathbb{R}^m$, $l = (0, \dots, 0)^\top \in \mathbb{R}^m$, $A = (y_1, \dots, y_m) \in \mathbb{R}^m$, $b = 0$, and $r = 0$. The problem can easily be solved by a standard QP solver such as quadprog() in package quadprog (Weingessel 2004) or ipop() in package kernlab (Karatzoglou, Smola, Hornik, and Zeileis 2005), both available in R (R Development Core Team 2005). Techniques like SMO and chunking (Osuna et al. 1997), which take advantage of the special structure of the SVM QP problem, offer much better performance in terms of speed, scalability and memory usage.

The cost parameter C of the SVM formulation in Equation 7 controls the penalty paid by the SVM for misclassifying a training point and thus the complexity of the prediction function. A high cost value C will force the SVM to create a prediction function complex enough to misclassify as few training points as possible, while a lower cost parameter will lead to a simpler prediction function. Therefore, this type of SVM is usually called C-SVM.
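As a small illustration of the role of the cost parameter, the following sketch fits the same C-SVM twice with ksvm() from the kernlab package (using its default Gaussian kernel on the iris data), once with a small and once with a large value of C; the particular values are arbitrary and serve only to show the mechanics.

R> library("kernlab")
R> data(iris)
R> loose <- ksvm(Species ~ ., data = iris, type = "C-svc", C = 0.1)
R> tight <- ksvm(Species ~ ., data = iris, type = "C-svc", C = 100)
R> nSV(loose); nSV(tight)         # number of support vectors of each fit
R> error(loose); error(tight)     # training error of each fit

One would typically expect the model trained with the larger cost to fit the training data more closely at the price of a more complex decision function; in practice C is usually chosen by cross-validation, for instance via the cross argument of ksvm().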
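Returning to the quadratic program in Equations 6 and 7, the dual problem can in principle be handed directly to a general-purpose solver. The sketch below sets it up for a small artificial two-class data set and solves it with ipop() from kernlab; note that the maximization in Equation 6 is turned into a minimization of the form of Equation 7 by taking c = (-1, ..., -1), and that the data, kernel parameter and upper bound are arbitrary choices for illustration.

R> library("kernlab")
R> m <- 40
R> x <- rbind(matrix(rnorm(m, mean = -2), ncol = 2),
+             matrix(rnorm(m, mean = 2), ncol = 2))   # two separated groups of points
R> y <- c(rep(-1, m/2), rep(1, m/2))                  # class labels y_i = +/- 1
R> H <- kernelPol(rbfdot(sigma = 0.5), x, z = y)      # H_ij = y_i y_j k(x_i, x_j)
R> sol <- ipop(c = matrix(rep(-1, m)), H = H, A = t(y), b = 0,
+             l = matrix(rep(0, m)), u = matrix(rep(1, m)), r = 0)
R> alpha <- primal(sol)          # the alpha_i of Equation 6
R> which(alpha > 1e-4)           # indices of the support vectors

Here the vector u plays the role of the upper bound on the alpha_i, and primal() extracts the solution vector of the quadratic program. Dedicated decomposition methods such as SMO avoid forming the full m x m matrix H and therefore scale to much larger problems.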
Another formulation of the classification problem with a more intuitive hyperparameter than C is the ν-SVM (Schölkopf, Smola, Williamson, and Bartlett 2000). The ν parameter has the interesting property of being an upper bound on the training error and a lower bound on the fraction of support vectors found in the data set, thus controlling the complexity of the classification function built by the SVM (see Appendix for details).

For multi-class classification, mostly voting schemes such as one-against-one and one-against-all are used. In the one-against-all method $k$ binary SVM classifiers are trained, where $k$ is the number of classes, each trained to separate one class from the rest. The classifiers are then combined by comparing their decision values on a test data instance and labeling it according to the classifier with the highest decision value.

In the one-against-one classification method (also called pairwise classification; see Knerr, Personnaz, and Dreyfus 1990; Kreßel 1999), $\binom{k}{2}$ classifiers are constructed, each trained on the data from two classes. Prediction is done by voting: each classifier gives a prediction and the class which is most frequently predicted wins ("Max Wins"). This method has been shown to produce robust results when used with SVMs (Hsu and Lin 2002a). Although this suggests a larger number of support vector machines to train, the overall CPU time used is less than for the one-against-all method, since the problems are smaller and the SVM optimization problem scales super-linearly.

Furthermore, SVMs can also produce class probabilities as output instead of class labels. This can be done by an improved implementation (Lin, Lin, and Weng 2001) of Platt's a posteriori probabilities (Platt 2000), where a sigmoid function

\[
P(y = 1 \mid f) = \frac{1}{1 + e^{Af + B}} \qquad (8)
\]

is fitted to the decision values $f$ of the binary SVM classifiers, $A$ and $B$ being estimated by minimizing the negative log-likelihood function. This is equivalent to fitting a logistic regression model to the estimated decision values. To extend the class probabilities to the multi-class case, the class probability outputs of all binary classifiers can be combined as proposed in Wu, Lin, and Weng (2003).

In addition to these heuristics for extending a binary SVM to the multi-class problem, there have been reformulations of the support vector quadratic problem that deal with more than two classes. One of the many approaches for native support vector multi-class classification is the one proposed in Crammer and Singer (2000), which we will refer to as 'spoc-svc'. This algorithm works by solving a single optimization problem including the data from all classes. The primal formulation is:

\[
\begin{aligned}
\text{minimize} \quad & t(\{w_n\}, \xi) = \frac{1}{2}\sum_{n=1}^{k} \|w_n\|^2 + \frac{C}{m}\sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & \langle \Phi(x_i), w_{y_i}\rangle - \langle \Phi(x_i), w_n\rangle \ge b_i^n - \xi_i \quad (i = 1, \dots, m)
\end{aligned}
\qquad (9)
\]

where

\[
b_i^n = 1 - \delta_{y_i, n}, \qquad (10)
\]

and the decision function is

\[
\operatorname{argmax}_{n = 1, \dots, k} \langle \Phi(x_i), w_n\rangle. \qquad (11)
\]

Details on performance and benchmarks on various approaches for multi-class classification can be found in Hsu and Lin (2002b).
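To illustrate multi-class classification and class probabilities in R, the following sketch trains a C-SVM on a random subset of the iris data with ksvm() from kernlab, which handles the multi-class case by the one-against-one approach, and then requests probability estimates for the held-out observations; the train/test split and all tuning values are arbitrary defaults chosen only for illustration.

R> library("kernlab")
R> data(iris)
R> train <- sample(1:150, 100)
R> model <- ksvm(Species ~ ., data = iris[train, ], type = "C-svc",
+               prob.model = TRUE)
R> predict(model, iris[-train, -5])                          # predicted class labels
R> predict(model, iris[-train, -5], type = "probabilities")  # class probability estimates

Setting prob.model = TRUE makes ksvm() fit a probability model on the binary decision values along the lines of Equation 8, so that predict() can return a matrix of class membership probabilities instead of hard labels.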
2.2. Novelty detection

SVMs have also been extended to deal with the problem of novelty detection (or one-class classification; see Schölkopf, Platt, Shawe-Taylor, Smola, and Williamson 1999; Tax and Duin 1999), where essentially an SVM detects outliers in a data set. SVM novelty detection works by creating a spherical decision boundary around a set of data points by a set of support vectors describing the sphere's boundary. The primal optimization problem for support vector novelty detection is the following:

\[
\begin{aligned}
\text{minimize} \quad & t(w, \xi, \rho) = \frac{1}{2}\|w\|^2 - \rho + \frac{1}{m\nu}\sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & \langle \Phi(x_i), w\rangle + b \ge \rho - \xi_i \quad (i = 1, \dots, m) \\
& \xi_i \ge 0 \quad (i = 1, \dots, m).
\end{aligned}
\qquad (12)
\]

The ν parameter is used to control the volume of the sphere and consequently the number of outliers found. The value of ν sets an upper bound on the fraction of outliers found in the data.

2.3. Regression

By using a different loss function, the ε-insensitive loss function $\|y - f(x)\|_\epsilon = \max\{0, \|y - f(x)\| - \epsilon\}$, SVMs can also perform regression. This loss function ignores errors that are smaller than a certain threshold $\epsilon > 0$, thus creating a tube around the true output. The primal becomes:

\[
\begin{aligned}
\text{minimize} \quad & t(w, \xi) = \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^{m} (\xi_i + \xi_i^*) \\
\text{subject to} \quad & (\langle \Phi(x_i), w\rangle + b) - y_i \le \epsilon + \xi_i \qquad (13) \\
& y_i - (\langle \Phi(x_i), w\rangle + b) \le \epsilon + \xi_i^* \qquad (14) \\
& \xi_i, \xi_i^* \ge 0 \quad (i = 1, \dots, m)
\end{aligned}
\]

We can estimate the accuracy of SVM regression by computing the scale parameter of a Laplacian distribution on the residuals $\zeta = y - f(x)$, where $f(x)$ is the estimated decision function (Lin and Weng 2004).

The dual problems of the various classification, regression and novelty detection SVM formulations can be found in the Appendix.

2.4. Kernel functions

As seen before, kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity, with little computational cost even in very high-dimensional spaces. Kernels commonly used with kernel methods and SVMs in particular include the following: