Least Squares Linear Discriminant Analysis

Jieping Ye (jieping.ye@asu.edu)
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287 USA

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

Abstract

Linear Discriminant Analysis (LDA) is a well-known method for dimensionality reduction and classification. LDA in the binary-class case has been shown to be equivalent to linear regression with the class label as the output. This implies that LDA for binary-class classifications can be formulated as a least squares problem. Previous studies have shown a certain relationship between multivariate linear regression and LDA for the multi-class case. Many of these studies show that multivariate linear regression with a specific class indicator matrix as the output can be applied as a preprocessing step for LDA. However, directly casting LDA as a least squares problem is challenging for the multi-class case. In this paper, a novel formulation for multivariate linear regression is proposed. The equivalence relationship between the proposed least squares formulation and LDA for multi-class classifications is rigorously established under a mild condition, which is shown empirically to hold in many applications involving high-dimensional data. Several LDA extensions based on the equivalence relationship are discussed.

1. Introduction

Linear Discriminant Analysis (LDA) is a well-known method for dimensionality reduction and classification that projects high-dimensional data onto a low-dimensional space where the data achieves maximum class separability (Duda et al., 2000; Fukunaga, 1990; Hastie et al., 2001). The derived features in LDA are linear combinations of the original features, where the coefficients are from the transformation matrix. The optimal projection or transformation in classical LDA is obtained by minimizing the within-class distance and maximizing the between-class distance simultaneously, thus achieving maximum class discrimination. It has been applied successfully in many applications (Belhumeour et al., 1997; Swets & Weng, 1996; Dudoit et al., 2002), including face recognition and microarray gene expression data analysis. The optimal transformation is readily computed by solving a generalized eigenvalue problem. The original LDA formulation, known as Fisher Linear Discriminant Analysis (FLDA) (Fisher, 1936), deals with binary-class classifications. The key idea in FLDA is to look for a direction that separates the class means well (when projected onto that direction) while achieving a small variance around these means.

FLDA bears strong connections to linear regression with the class label as the output. It has been shown (Duda et al., 2000; Mika, 2002) that FLDA is equivalent to a least squares problem. Many real-world applications deal with multi-class classifications, and LDA is generally used to find a subspace with k - 1 dimensions for multi-class problems, where k is the number of classes in the training dataset (Fukunaga, 1990; Hastie et al., 2001). However, directly casting LDA as a least squares problem is challenging for multi-class problems (Duda et al., 2000; Hastie et al., 2001; Zhang & Riedel, 2005).

Multivariate linear regression with a specific class indicator matrix has been considered in (Hastie et al., 2001) for multi-class classifications. It follows the general framework of linear regression with multiple outputs.
As pointed out in (Pages 83-84, Hastie et al. (2001)), this approach has a serious problem when the number of classes k ≥ 3, especially prevalent when k is large. More specifically, classes can be masked by others due to the rigid nature of the regression model, which is not the case for LDA (Hastie et al., 2001). This multivariate linear regression model is in general different from LDA. However, there is a close connection between multivariate linear regression and LDA. More specifically, it can be shown (Hastie et al., 1994; Hastie et al., 2001) that LDA applied to the
space transformed by multivariate linear regression with a specific class indicator matrix as the output is identical to LDA applied to the original space. In this case, multivariate linear regression is applied as a preprocessing step for LDA. One natural question is whether LDA in the multi-class case can be directly formulated as a least squares problem.

In this paper, we propose a novel formulation for multivariate linear regression based on a new class indicator matrix. We establish the equivalence relationship between the proposed least squares formulation and LDA under a mild condition (see Section 5), which holds in many applications involving high-dimensional and undersampled data. We call the proposed LDA formulation Least Squares Linear Discriminant Analysis (or LS-LDA for short). We have conducted experiments using a collection of high-dimensional datasets from various data sources, including text documents, face images, and microarray gene expression data. Experimental results are consistent with the presented theoretical analysis.

2. Overview of Linear Discriminant Analysis

We are given a dataset that consists of n samples \{(x_i, y_i)\}_{i=1}^n, where x_i \in \mathbb{R}^d and y_i \in \{1, 2, \cdots, k\} denotes the class label of the i-th sample, n is the sample size, d is the data dimensionality, and k is the number of classes. Let the data matrix X = [x_1, x_2, \cdots, x_n] be partitioned into k classes as X = [X_1, \cdots, X_k], where X_j \in \mathbb{R}^{d \times n_j}, n_j is the size of the j-th class X_j, and \sum_{j=1}^k n_j = n. Classical LDA computes a linear transformation G \in \mathbb{R}^{d \times \ell} that maps x_i in the d-dimensional space to a vector x_i^L in the \ell-dimensional space as follows: x_i \in \mathbb{R}^d \rightarrow x_i^L = G^T x_i \in \mathbb{R}^\ell (\ell < d). In discriminant analysis (Fukunaga, 1990), three scatter matrices, called the within-class, between-class, and total scatter matrices, are defined as follows:

S_w = \frac{1}{n} \sum_{j=1}^{k} \sum_{x \in X_j} (x - c^{(j)})(x - c^{(j)})^T,    (1)

S_b = \frac{1}{n} \sum_{j=1}^{k} n_j (c^{(j)} - c)(c^{(j)} - c)^T,    (2)

S_t = \frac{1}{n} \sum_{i=1}^{n} (x_i - c)(x_i - c)^T,    (3)

where c^{(j)} is the centroid of the j-th class, and c is the global centroid. It follows from the definitions that S_t = S_b + S_w. Furthermore, trace(S_w) measures the within-class cohesion, while trace(S_b) measures the between-class separation. In the lower-dimensional space resulting from the linear transformation G, the scatter matrices become

S_w^L = G^T S_w G,  S_b^L = G^T S_b G,  S_t^L = G^T S_t G.    (4)

An optimal transformation G would maximize trace(S_b^L) and minimize trace(S_w^L) simultaneously, which is equivalent to maximizing trace(S_b^L) and minimizing trace(S_t^L) simultaneously, since S_t^L = S_w^L + S_b^L. The optimal transformation, G^{LDA}, of LDA is computed by solving the following optimization problem (Duda et al., 2000; Fukunaga, 1990):

G^{LDA} = \arg\max_G trace\left( S_b^L (S_t^L)^{-1} \right).    (5)

The optimal G^{LDA} consists of the top eigenvectors of S_t^{-1} S_b corresponding to the nonzero eigenvalues (Fukunaga, 1990), provided that the total scatter matrix S_t is nonsingular. In the following discussion, we consider the more general case when S_t may be singular. G^{LDA} then consists of the eigenvectors of S_t^{+} S_b corresponding to the nonzero eigenvalues, where S_t^{+} denotes the pseudo-inverse of S_t (Golub & Van Loan, 1996). Note that when S_t is nonsingular, S_t^{+} equals S_t^{-1}.
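To make the computation in Eqs. (1)-(5) concrete, the following is a minimal NumPy sketch (not part of the original paper) that forms the scatter matrices and extracts the eigenvectors of S_t^+ S_b; the function name and the (d, n) data layout are illustrative assumptions.

```python
import numpy as np

def lda_transform(X, y):
    """Classical LDA of Eqs. (1)-(5): eigenvectors of S_t^+ S_b.

    X is a (d, n) data matrix with one sample per column; y holds the
    integer class labels of the n samples.
    """
    d, n = X.shape
    classes = np.unique(y)
    c = X.mean(axis=1, keepdims=True)                    # global centroid
    Xc = X - c
    St = Xc @ Xc.T / n                                   # total scatter, Eq. (3)
    Sb = np.zeros((d, d))
    for j in classes:
        Xj = X[:, y == j]
        cj = Xj.mean(axis=1, keepdims=True)              # class centroid
        Sb += Xj.shape[1] * (cj - c) @ (cj - c).T / n    # between-class scatter, Eq. (2)
    # The pseudo-inverse covers the singular-St case discussed in the text.
    evals, evecs = np.linalg.eig(np.linalg.pinv(St) @ Sb)
    order = np.argsort(-evals.real)
    q = np.linalg.matrix_rank(Sb)                        # number of nonzero eigenvalues
    return evecs[:, order[:q]].real                      # columns of G^LDA (d x q)
```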
The above LDA formulation is an extension of the original Fisher Linear Discriminant Analysis (FLDA) (Fisher, 1936), which deals with binary-class problems, i.e., k = 2. The optimal transformation, G^F, of FLDA is of rank one and is given by (Duda et al., 2000)

G^F = S_t^{+} (c^{(1)} - c^{(2)}).    (6)

Note that G^F is invariant to scaling. That is, \alpha G^F, for any \alpha \neq 0, is also a solution to FLDA.

3. Linear Regression versus Fisher LDA

Given a dataset of two classes, \{(x_i, y_i)\}_{i=1}^n, x_i \in \mathbb{R}^d and y_i \in \{-1, 1\}, the linear regression model with the class label as the output has the following form: f(x) = x^T w + b, where w \in \mathbb{R}^d is the weight vector and b is the bias of the linear model. A popular approach for estimating w and b is least squares, in which the following objective function is minimized:

L(w, b) = \frac{1}{2} \|X^T w + b e - y\|^2 = \frac{1}{2} \sum_{i=1}^{n} \|f(x_i) - y_i\|^2,    (7)

where X = [x_1, x_2, \cdots, x_n] is the data matrix, e is the vector of all ones, and y is the vector of class labels. Assume that both \{x_i\} and \{y_i\} have been centered, i.e., \sum_{i=1}^n x_i = 0 and \sum_{i=1}^n y_i = 0. It follows that y_i \in \{-2n_2/n, 2n_1/n\},
where n_1 and n_2 denote the number of samples from the negative and positive classes, respectively. The bias term b becomes zero, and we look for a linear model f(x) = x^T w by minimizing

L(w) = \frac{1}{2} \|X^T w - y\|^2.    (8)

The optimal w is given by (Duda et al., 2000; Golub & Van Loan, 1996; Hastie et al., 2001) w = (X X^T)^{+} X y. Note that X X^T = n S_t (the data matrix X has been centered) and X y = \frac{2 n_1 n_2}{n} (c^{(1)} - c^{(2)}). It follows that

w = \frac{2 n_1 n_2}{n^2} S_t^{+} (c^{(1)} - c^{(2)}) = \frac{2 n_1 n_2}{n^2} G^F,

where G^F is the optimal solution to FLDA in Eq. (6). Hence linear regression with the class label as the output is equivalent to Fisher LDA, as the projection in FLDA is invariant to scaling. More details on this equivalence relationship can be found in (Duda et al., 2000; Mika, 2002).

4. Multivariate Linear Regression with a Class Indicator Matrix

In the multi-class case, we are given a dataset that consists of n samples \{(x_i, y_i)\}_{i=1}^n, where x_i \in \mathbb{R}^d, y_i \in \{1, 2, \cdots, k\} denotes the class label of the i-th sample, and k > 2. It is common to apply linear regression with a class membership indicator matrix Y \in \mathbb{R}^{n \times k}, which assigns a vector-valued class code to each of the samples (Hastie et al., 2001).

There are several well-known indicator matrices in the literature. Define the class indicator matrices Y_1 = (Y_1(ij))_{ij} \in \mathbb{R}^{n \times k} and Y_2 = (Y_2(ij))_{ij} \in \mathbb{R}^{n \times k} as follows: Y_1(ij) = 1 if y_i = j, and Y_1(ij) = 0 otherwise; Y_2(ij) = 1 if y_i = j, and Y_2(ij) = -1/(k - 1) otherwise. The first indicator matrix Y_1 is commonly used in connecting multi-class classification with linear regression (Hastie et al., 2001), while the second has recently been used in extending Support Vector Machines (SVM) to multi-class classification (Lee et al., 2004) and in generalizing the kernel target alignment measure (Guermeur et al., 2004), originally proposed in (Cristianini et al., 2001).

In multivariate linear regression (MLR), a k-tuple of separating functions f(x) = (f_1(x), f_2(x), \cdots, f_k(x)), for any x \in \mathbb{R}^d, is considered. Denote by \tilde{X} = [\tilde{x}_1, \cdots, \tilde{x}_n] \in \mathbb{R}^{d \times n} and \tilde{Y} = (\tilde{Y}_{ij}) \in \mathbb{R}^{n \times k} the centered data matrix X and the centered indicator matrix Y, respectively. That is, \tilde{x}_i = x_i - \bar{x} and \tilde{Y}_{ij} = Y_{ij} - \bar{Y}_j, where \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i and \bar{Y}_j = \frac{1}{n} \sum_{i=1}^n Y_{ij}. Then MLR determines the weight vectors, \{w_j\}_{j=1}^k \in \mathbb{R}^d, of the k linear models, f_j(x) = x^T w_j, for j = 1, \cdots, k, via the minimization of the following objective function:

L(W) = \frac{1}{2} \|\tilde{X}^T W - \tilde{Y}\|_F^2 = \frac{1}{2} \sum_{j=1}^{k} \sum_{i=1}^{n} \|f_j(\tilde{x}_i) - \tilde{Y}_{ij}\|^2,    (9)

where W = [w_1, w_2, \cdots, w_k] is the weight matrix and \|\cdot\|_F denotes the Frobenius norm of a matrix (Golub & Van Loan, 1996). The optimal W is given by (Hastie et al., 2001)

W = (\tilde{X} \tilde{X}^T)^{+} \tilde{X} \tilde{Y},    (10)

which is determined by \tilde{X} and \tilde{Y}.

Both Y_1 and Y_2 defined above, as well as the indicator matrix in (Park & Park, 2005), could be used to define the centered indicator matrix \tilde{Y}. An interesting connection between the linear regression model using Y_1 and LDA can be found in (Page 112, Hastie et al. (2001)). It can be shown that if X^L = W_1^T \tilde{X} is the data transformed by W_1, where W_1 = (\tilde{X} \tilde{X}^T)^{+} \tilde{X} \tilde{Y}_1 is the least squares solution in Eq. (10) using the centered indicator matrix \tilde{Y}_1, then LDA applied to X^L is identical to LDA applied to the original space. In this case, linear regression is applied as a preprocessing step before classification, and is in general not equivalent to LDA.
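As a concrete illustration of the indicator matrices Y_1 and Y_2 and of the least squares solution in Eq. (10), here is a short NumPy sketch (not from the paper); it assumes 0-based integer labels 0, ..., k-1 rather than the 1, ..., k convention used in the text, and the function names are illustrative.

```python
import numpy as np

def indicator_Y1(y, k):
    """Y1(ij) = 1 if y_i = j, and 0 otherwise."""
    Y = np.zeros((len(y), k))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def indicator_Y2(y, k):
    """Y2(ij) = 1 if y_i = j, and -1/(k-1) otherwise."""
    Y = np.full((len(y), k), -1.0 / (k - 1))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def mlr_weights(X, Y):
    """Least squares solution of Eq. (10): W = (X~ X~^T)^+ X~ Y~.

    X is (d, n) with one sample per column; Y is an (n, k) indicator matrix.
    Both are centered here, as in the text.
    """
    Xc = X - X.mean(axis=1, keepdims=True)      # centered data matrix X~
    Yc = Y - Y.mean(axis=0, keepdims=True)      # centered indicator matrix Y~
    return np.linalg.pinv(Xc @ Xc.T) @ Xc @ Yc  # (d, k) weight matrix W
```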
The second indicator matrix Y_2 has been used in SVM, and the resulting model using Y_2 is also not equivalent to LDA in general. This is also the case for the indicator matrix in (Park & Park, 2005). A natural question is whether there exists a class indicator matrix \tilde{Y} \in \mathbb{R}^{n \times k} with which multivariate linear regression is equivalent to LDA. If this is the case, then LDA can be formulated as a least squares problem in the multi-class case.

Note that in multivariate linear regression, each \tilde{x}_i is transformed to (f_1(\tilde{x}_i), \cdots, f_k(\tilde{x}_i))^T = W^T \tilde{x}_i, and the centered data matrix \tilde{X} \in \mathbb{R}^{d \times n} is transformed to W^T \tilde{X} \in \mathbb{R}^{k \times n}, thus achieving dimensionality reduction if k < d. Note that W = (\tilde{X} \tilde{X}^T)^{+} \tilde{X} \tilde{Y}. A natural measure for evaluating \tilde{Y} is the class discrimination used in LDA. We thus look for the \tilde{Y} that solves the following optimization problem (the pseudo-inverse is applied to deal with a possibly singular scatter matrix):

\max_{\tilde{Y}} trace\left( (W^T S_b W)(W^T S_t W)^{+} \right)  subject to  W = (\tilde{X} \tilde{X}^T)^{+} \tilde{X} \tilde{Y}.    (11)

In the following, we construct a specific class indicator matrix Y_3 and show that it solves the optimization problem in Eq. (11).
More importantly, we show in Section 5 that multivariate linear regression using the indicator matrix Y_3 is equivalent to LDA under a mild condition, which has been shown empirically to hold for most high-dimensional and undersampled data. The indicator matrix Y_3 = (Y_3(ij))_{ij} \in \mathbb{R}^{n \times k} is constructed as follows:

Y_3(ij) = \begin{cases} \sqrt{\frac{n}{n_j}} - \sqrt{\frac{n_j}{n}} & \text{if } y_i = j, \\ -\sqrt{\frac{n_j}{n}} & \text{otherwise,} \end{cases}    (12)

where n_j is the sample size of the j-th class and n is the total sample size. It can be shown that Y_3 defined above has been centered (in terms of rows), and thus \tilde{Y}_3 = Y_3.

Define matrices H_w, H_b, and H_t as follows:

H_w = \frac{1}{\sqrt{n}} [X_1 - c^{(1)}(e^{(1)})^T, \cdots, X_k - c^{(k)}(e^{(k)})^T],    (13)

H_b = \frac{1}{\sqrt{n}} [\sqrt{n_1}(c^{(1)} - c), \cdots, \sqrt{n_k}(c^{(k)} - c)],    (14)

H_t = \frac{1}{\sqrt{n}} (X - c e^T),    (15)

where X_j is the data matrix of the j-th class, X is the data matrix, c^{(j)} is the centroid of the j-th class, c is the global centroid, e^{(j)} is the vector of all ones of length n_j, and e is the vector of all ones of length n. Then S_w, S_b, and S_t can be expressed as follows:

S_w = H_w H_w^T,  S_b = H_b H_b^T,  S_t = H_t H_t^T.

Let H_t = U \Sigma V^T be the Singular Value Decomposition (SVD) (Golub & Van Loan, 1996) of H_t, where H_t is defined in Eq. (15), U and V are orthogonal, \Sigma = diag(\Sigma_t, 0), \Sigma_t \in \mathbb{R}^{t \times t} is diagonal, and t = rank(S_t). Then

S_t = H_t H_t^T = U \Sigma \Sigma^T U^T = U diag(\Sigma_t^2, 0) U^T.    (16)

Let U = (U_1, U_2) be a partition of U, such that U_1 \in \mathbb{R}^{d \times t} and U_2 \in \mathbb{R}^{d \times (d-t)}. That is, U_2 lies in the null space of S_t, i.e., U_2^T S_t U_2 = 0. Since S_t = S_b + S_w, we have 0 = U_2^T S_t U_2 = U_2^T S_b U_2 + U_2^T S_w U_2. Thus U_2^T S_b U_2 = 0, since S_w is positive semi-definite. It follows that

U^T S_b U = \begin{pmatrix} U_1^T S_b U_1 & 0 \\ 0 & 0 \end{pmatrix}.    (17)

Denote

B = \Sigma_t^{-1} U_1^T H_b \in \mathbb{R}^{t \times k},    (18)

where H_b is defined in Eq. (14), and let

B = P \hat{\Sigma} Q^T    (19)

be the SVD of B, where P and Q are orthogonal and \hat{\Sigma} \in \mathbb{R}^{t \times k} is diagonal. Since S_b = H_b H_b^T, we have

\Sigma_t^{-1} U_1^T S_b U_1 \Sigma_t^{-1} = B B^T = P \hat{\Sigma} \hat{\Sigma}^T P^T = P \Sigma_b P^T,    (20)

where

\Sigma_b = \hat{\Sigma} \hat{\Sigma}^T = diag(\alpha_1^2, \cdots, \alpha_t^2),    (21)

\alpha_1^2 \geq \cdots \geq \alpha_q^2 > 0 = \alpha_{q+1}^2 = \cdots = \alpha_t^2,    (22)

and q = rank(S_b). It follows from Eqs. (16) and (17) that

S_t^{+} S_b S_t^{+} = U_1 \Sigma_t^{-2} U_1^T S_b U_1 \Sigma_t^{-2} U_1^T = U_1 \Sigma_t^{-1} P \Sigma_b P^T \Sigma_t^{-1} U_1^T,    (23)

where the last equality follows from Eq. (20). We have the following result:

Lemma 4.1. Let P, U_1, \Sigma_t, \tilde{X}, and \tilde{Y} be defined as above. Then W^T S_b W = \frac{1}{n^2} F^T \Sigma_b F and W^T S_t W = \frac{1}{n^2} F^T F, where W is defined in Eq. (10), \Sigma_b is defined in Eq. (21), and F = P^T \Sigma_t^{-1} U_1^T (\tilde{X} \tilde{Y}).

We are now ready to present the main result of this section, namely that \tilde{Y} = Y_3 solves the optimization problem in Eq. (11), where Y_3 is defined in Eq. (12), as summarized in the following theorem:

Theorem 4.1. Let S_b, S_t, \Sigma_b, W, and \tilde{Y} be defined as above. Then for any \tilde{Y}, the following inequality holds: trace\left( (W^T S_b W)(W^T S_t W)^{+} \right) \leq trace(\Sigma_b). Furthermore, the equality holds when \tilde{Y} = Y_3, where Y_3 is defined in Eq. (12).

With \tilde{Y} = Y_3 as the class indicator matrix, the optimal weight matrix W^{MLR} for multivariate linear regression (MLR) in Eq. (10) becomes

W^{MLR} = (\tilde{X} \tilde{X}^T)^{+} \tilde{X} \tilde{Y} = (n S_t)^{+} n H_b = S_t^{+} H_b.    (24)
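The construction of Y_3 in Eq. (12) and the resulting LS-LDA solution in Eq. (24) can be sketched as follows (again assuming 0-based labels; this is an illustrative sketch rather than the authors' code). Since Y_3 already has zero column means, no centering of the indicator matrix is needed.

```python
import numpy as np

def indicator_Y3(y, k):
    """Y3 of Eq. (12): sqrt(n/n_j) - sqrt(n_j/n) if y_i = j, and -sqrt(n_j/n) otherwise."""
    n = len(y)
    counts = np.array([(y == j).sum() for j in range(k)], dtype=float)  # class sizes n_j
    Y = np.tile(-np.sqrt(counts / n), (n, 1))         # default entry: -sqrt(n_j / n)
    Y[np.arange(n), y] += np.sqrt(n / counts[y])      # add sqrt(n / n_j) on the own-class column
    return Y                                          # columns already sum to zero

def lslda_weights(X, y, k):
    """LS-LDA weight matrix of Eq. (24): W^MLR = (X~ X~^T)^+ X~ Y3 = S_t^+ H_b."""
    Xc = X - X.mean(axis=1, keepdims=True)            # centered data matrix X~
    return np.linalg.pinv(Xc @ Xc.T) @ Xc @ indicator_Y3(y, k)
```

Samples are then projected as W^T \tilde{x}; Section 5 below shows that, under a mild rank condition, these projections are essentially equivalent to those produced by G^{LDA}.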
5. Relationship between Multivariate Linear Regression and LDA

Recall that in LDA, the optimal transformation matrix, G^{LDA}, consists of the top eigenvectors of S_t^{+} S_b corresponding to the nonzero eigenvalues. In this section, we study the relationship between W^{MLR} = S_t^{+} H_b in Eq. (24) and the eigenvectors of S_t^{+} S_b. From Eqs. (16) and (17), we can decompose the matrix S_t^{+} S_b as follows:

S_t^{+} S_b = U \begin{pmatrix} \Sigma_t^{-1} B B^T \Sigma_t & 0 \\ 0 & 0 \end{pmatrix} U^T = U \begin{pmatrix} \Sigma_t^{-1} P & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} \Sigma_b & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} P^T \Sigma_t & 0 \\ 0 & I \end{pmatrix} U^T,
where the equalities follow since B = \Sigma_t^{-1} U_1^T H_b = P \hat{\Sigma} Q^T is the SVD of B as in Eq. (19) and \Sigma_b = \hat{\Sigma} \hat{\Sigma}^T. Thus, the transformation matrix in LDA is given by G^{LDA} = U_1 \Sigma_t^{-1} P_q, where P_q consists of the first q columns of P, since only the first q diagonal entries of \Sigma_b are nonzero. On the other hand,

S_t^{+} H_b = U \begin{pmatrix} (\Sigma_t^2)^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^T H_b = U_1 \Sigma_t^{-1} \Sigma_t^{-1} U_1^T H_b = U_1 \Sigma_t^{-1} B = U_1 \Sigma_t^{-1} P \hat{\Sigma} Q^T = U_1 \Sigma_t^{-1} P_q [\hat{\Sigma}_q, 0] Q^T = [G^{LDA} \Sigma_{bq}^{0.5}, 0] Q^T,    (25)

where \hat{\Sigma}_q, \Sigma_{bq} \in \mathbb{R}^{q \times q} consist of the first q rows and the first q columns of \hat{\Sigma} and \Sigma_b, respectively; the fifth equality follows since only the first q rows and the first q columns of \hat{\Sigma} are nonzero, and the last equality follows since \Sigma_b = \hat{\Sigma} \hat{\Sigma}^T. It follows that

W^{MLR} = [G^{LDA} \Sigma_{bq}^{0.5}, 0] Q^T,

where Q is orthogonal.

The K-Nearest-Neighbor (K-NN) algorithm (Duda et al., 2000) based on the Euclidean distance is commonly applied as the classifier in the dimensionality-reduced (transformed) space of LDA. If we apply W^{MLR} for dimensionality reduction before K-NN, the matrix W^{MLR} is invariant to an orthogonal transformation, since any orthogonal transformation preserves all pairwise distances. Thus W^{MLR} is essentially equivalent to [G^{LDA} \Sigma_{bq}^{0.5}, 0] or G^{LDA} \Sigma_{bq}^{0.5}, as the removal of zero columns does not change the pairwise distances either. The essential difference between W^{MLR} and G^{LDA} is thus the diagonal matrix \Sigma_{bq}^{0.5}.

Next, we show that the matrix \Sigma_{bq} is the identity matrix of size q, that is, W^{MLR} and G^{LDA} are essentially equivalent, under the mild condition that the rank difference of the three scatter matrices is zero, i.e., rank(S_b) + rank(S_w) - rank(S_t) = 0, which holds in many applications involving high-dimensional and undersampled data (Ye & Xiong, 2006). The main result is summarized in the following theorem:

Theorem 5.1. Let \Sigma_{bq} \in \mathbb{R}^{q \times q} consist of the first q rows and the first q columns of \Sigma_b as defined above, where \Sigma_b is defined in Eq. (21). Assume that the following equality holds: rank(S_b) + rank(S_w) - rank(S_t) = 0. Then \Sigma_{bq} = I_q, where I_q is the identity matrix of size q and q = rank(S_b).

6. Experiments

We performed our experimental studies using nine high-dimensional datasets, including text documents, face images, and gene expression data. DOC1, DOC2, and DOC3 are three text document datasets; ORL, AR, and PIX are three face image datasets; and GCM, ALL, and ALLAML are three gene expression datasets. The statistics of the datasets are summarized in Table 1 (the first column).

To compare LS-LDA and LDA, we use the K-NN algorithm with K = 1 as the classifier. For all datasets, we performed our study by repeated random splittings of the whole dataset into training and test sets, as in (Dudoit et al., 2002). The data was partitioned randomly into a training set, in which each class consists of two-thirds of the whole class, and a test set, in which each class consists of the remaining one-third. The splitting was repeated 10 times, and the resulting accuracies of the different algorithms for the ten splittings are summarized in Table 1. The rank difference of the three scatter matrices, i.e., rank(S_b) + rank(S_w) - rank(S_t), as well as the ratio of the largest to the smallest diagonal entries of \Sigma_{bq}^{0.5}, is also reported for each of the splittings. Recall from Theorem 5.1 that when the rank difference of the scatter matrices is zero, the matrix \Sigma_{bq} equals the identity matrix and the ratio of the largest to the smallest diagonal entries of \Sigma_{bq}^{0.5} is 1.
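As an illustration of the rank condition in Theorem 5.1, the following sketch (not part of the paper's experimental code) generates a synthetic high-dimensional, undersampled dataset and checks that rank(S_b) + rank(S_w) - rank(S_t) = 0; the dimensions and the data-generating model are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 500, 60, 4                      # d >> n: high-dimensional, undersampled
y = np.repeat(np.arange(k), n // k)       # balanced labels, 15 samples per class
means = 2.0 * rng.standard_normal((d, k)) # one random centroid per class
X = means[:, y] + rng.standard_normal((d, n))

c = X.mean(axis=1, keepdims=True)
St = (X - c) @ (X - c).T / n
Sw = np.zeros((d, d))
Sb = np.zeros((d, d))
for j in range(k):
    Xj = X[:, y == j]
    cj = Xj.mean(axis=1, keepdims=True)
    Sw += (Xj - cj) @ (Xj - cj).T / n
    Sb += Xj.shape[1] * (cj - c) @ (cj - c).T / n

rank_diff = (np.linalg.matrix_rank(Sb) + np.linalg.matrix_rank(Sw)
             - np.linalg.matrix_rank(St))
print("rank(Sb) + rank(Sw) - rank(St) =", rank_diff)  # expected to be 0 here
```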
We can observe from Table 1 that the rank difference, rank(S_b) + rank(S_w) - rank(S_t), equals zero in all cases except for the DOC2 dataset. For most datasets, the n data points are linearly independent, i.e., rank(S_t) = n - 1. In this case, the k centroids are also linearly independent, i.e., rank(S_b) = k - 1, while in S_w each data point has its class centroid subtracted from it, so that rank(S_w) = n - k. Hence, the rank difference is zero. Furthermore, LS-LDA and LDA achieve the same classification performance in all cases when the rank difference is zero. The empirical result confirms the theoretical analysis in Section 5. For DOC2, LS-LDA and LDA still achieve the same classification performance, although the rank difference is not zero.

Recall that the value of ratio denotes the ratio of the largest to the smallest diagonal entries of the matrix \Sigma_{bq}^{0.5}. From Table 1, the value of ratio equals 1 in all cases when the rank difference is zero. This is consistent with the theoretical result in Theorem 5.1. For DOC2, where the rank difference is not zero for several of the splittings, the value of ratio is close to 1 in all cases. That is, the matrix \Sigma_{bq}^{0.5} is close to the identity matrix. This explains why LS-LDA and LDA achieve the same classification performance for DOC2, even though the rank difference is not zero.
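For completeness, the ratio reported in Table 1 can be computed along the lines of Eqs. (14)-(22): the diagonal entries of \Sigma_{bq}^{0.5} are the nonzero singular values of B = \Sigma_t^{-1} U_1^T H_b. The following is a sketch under those assumptions (illustrative function name and simple tolerance choices), not the authors' implementation.

```python
import numpy as np

def sigma_bq_ratio(X, y, tol=1e-10):
    """Ratio of the largest to the smallest diagonal entry of Sigma_bq^{0.5}.

    The diagonal entries are the nonzero singular values of
    B = Sigma_t^{-1} U_1^T H_b, cf. Eqs. (18)-(22).
    """
    d, n = X.shape
    classes = np.unique(y)
    c = X.mean(axis=1, keepdims=True)
    Ht = (X - c) / np.sqrt(n)                                       # Eq. (15)
    Hb = np.column_stack(
        [np.sqrt((y == j).sum()) * (X[:, y == j].mean(axis=1) - c.ravel())
         for j in classes]) / np.sqrt(n)                            # Eq. (14)
    U, s, _ = np.linalg.svd(Ht, full_matrices=False)
    nz = s > tol * s[0]                                             # nonzero singular values
    B = (U[:, nz].T @ Hb) / s[nz][:, None]                          # Sigma_t^{-1} U_1^T H_b
    alpha = np.linalg.svd(B, compute_uv=False)
    alpha = alpha[alpha > tol * alpha[0]]                           # alpha_1 >= ... >= alpha_q > 0
    return alpha[0] / alpha[-1]                                     # equals 1 under Theorem 5.1
```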