that takes a new data point and produces a prediction of its population. The three examples above are just a small sample of the uses of classification. Given these preliminaries, this paper will examine various methods of classification, starting with linear discriminant analysis and Fisher's discriminant analysis and proceeding to Support Vector Machines (SVMs). Our work will conclude with an analysis of the linear classifiers and SVMs applied to an actual data set.
Chapter 2

Basic Discriminants

The first classification methods we examine are linear discriminants: in particular, Linear Discriminant Analysis and Fisher's Discriminant Analysis. The two are similar in that both produce linear decision functions that are in fact nearly identical, but they rest on different assumptions and take different approaches. In this chapter the two methods are compared and contrasted.

2.1 Linear Discriminant Analysis for Two Populations

Given a pair of known populations, $\pi_1$ and $\pi_2$, assume that $\pi_1$ has a probability density function (pdf) $f_1(x)$ and, similarly, that $\pi_2$ has a pdf $f_2(x)$, where $f_1(x) \neq f_2(x)$. Intuitively, a decision function for the two populations arises from looking at the probability ratio
\[
D(x) = \frac{f_1(x)}{f_2(x)}.
\]
A new observation $x$ is classified as $\pi_1$ if $D(x) > 1$ and as $\pi_2$ if $D(x) < 1$. (For cases where $D(x) = 1$, the vector $x$ is unclassifiable.) Let $\Omega$ be the space of all possible observations; denote the set of $x \in \Omega$ where $f_1(x)/f_2(x) > 1$ by $R_1$ and, similarly, the set of $x \in \Omega$ where $f_1(x)/f_2(x) < 1$ by $R_2$. (Denote the set $\{\,x \mid f_1(x)/f_2(x) = 1\,\}$ by $R_3$.)

Such a decision function is simple but effective: by determining from which population $x$ is more likely to have come, one can make quick predictions about its origin. Note, however, that the probability ratio decision function is not foolproof: whether or not the densities $f_1(x)$ and $f_2(x)$ are close together, there is always a chance that $x$ could come from $\pi_2$ even when $f_1(x) > f_2(x)$, or vice versa. The conditional probability, $P(2|1)$, of classifying an observation $x$ as $\pi_2$ when in fact $x \in \pi_1$ is
\[
P(2|1) = P(x \in R_2 \mid \pi_1) = \int_{R_2} f_1(x)\,dx. \tag{2.1}
\]
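To make the ratio rule concrete, here is a minimal sketch (not part of the original development) that classifies points by the ratio $f_1(x)/f_2(x)$ and estimates the misclassification probability $P(2|1)$ of (2.1) by Monte Carlo. The two univariate normal densities and their parameters are hypothetical choices made only for illustration, and the sketch assumes numpy and scipy are available.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical one-dimensional populations (illustrative parameters only):
# pi_1 ~ N(0, 1) and pi_2 ~ N(2, 1).
f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=2.0, scale=1.0).pdf

def classify(x):
    """Probability-ratio rule: pi_1 if f1/f2 > 1, pi_2 if f1/f2 < 1, else unclassifiable."""
    D = f1(x) / f2(x)
    if D > 1:
        return "pi_1"
    if D < 1:
        return "pi_2"
    return "unclassifiable"

# Monte Carlo estimate of P(2|1) = integral of f1 over R_2, i.e. the chance
# that a draw from pi_1 lands in the region where f1(x)/f2(x) < 1.
rng = np.random.default_rng(seed=0)
draws = rng.normal(0.0, 1.0, size=100_000)   # samples from pi_1
p_2_given_1 = np.mean(f1(draws) / f2(draws) < 1)

print(classify(0.5), classify(1.7), p_2_given_1)
```

For these made-up parameters the boundary $f_1(x) = f_2(x)$ is the single point $x = 1$, so $R_2 = \{x > 1\}$ and the Monte Carlo estimate should be close to $P(Z > 1) \approx 0.16$ for a standard normal $Z$.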
Similarly, the conditional probability of classifying an observation $x$ as $\pi_1$ when in fact $x \in \pi_2$ is
\[
P(1|2) = P(x \in R_1 \mid \pi_2) = \int_{R_1} f_2(x)\,dx. \tag{2.2}
\]

Oftentimes the costs of misclassification for the two populations are different; take, for example, testing for a fatal but easily curable disease. The cost of misclassifying a patient as healthy when he or she is sick would in this case be higher than that of misclassifying a healthy patient as sick. To account for such situations, we would like a classification method that classifies patients as healthy only when there is sufficient evidence to overcome a certain cost threshold. We define $c(1|2)$ as the cost of misclassifying a data point into $\pi_1$ when it is actually a member of $\pi_2$, and similarly, $c(2|1)$ as the cost of misclassification into $\pi_2$ when $x \in \pi_1$.

Another factor that can affect accurate classification is the prior probability of belonging to one or the other of the populations. Again referring to the above example, if the disease has a low prevalence, even a test with high sensitivity will, when administered to enough people, classify many patients as sick when they are not, making it difficult to determine whether a positive test actually indicates a sick patient. Some weight ought to be given to the prior probability that a random observation $x$ is from $\pi_1$ or $\pi_2$; we denote these probabilities by $p_1$ and $p_2$ respectively.

Note that with these prior probabilities we can find the overall probabilities of misclassification by multiplying the priors into the earlier conditional probabilities:
\begin{align*}
&P(\text{observation comes from } \pi_1 \text{ and is misclassified as } \pi_2) \tag{2.3} \\
&\quad = P(x \in R_2 \mid \pi_1)\,P(\pi_1) \tag{2.4} \\
&\quad = P(2|1) \times p_1, \tag{2.5}
\end{align*}
and
\begin{align*}
&P(\text{observation comes from } \pi_2 \text{ and is misclassified as } \pi_1) \tag{2.6} \\
&\quad = P(x \in R_1 \mid \pi_2)\,P(\pi_2) \tag{2.7} \\
&\quad = P(1|2) \times p_2. \tag{2.8}
\end{align*}
Then the expected cost of misclassification (ECM) is obtained by multiplying each population's overall probability of misclassification by its cost of misclassification and summing:
\[
\mathrm{ECM} = c(2|1)\,P(2|1)\,p_1 + c(1|2)\,P(1|2)\,p_2. \tag{2.9}
\]
One criterion for a classification method is to minimize the ECM, which leads us to the following result.

Theorem 2.1.1. (Johnson and Wichern, [4]) The regions $R_1$ and $R_2$ that minimize the ECM are defined by the values of $x$ for which the following inequalities hold:
\[
R_1:\ \frac{f_1(x)}{f_2(x)} > \frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}, \tag{2.10}
\]
\[
R_2:\ \frac{f_1(x)}{f_2(x)} < \frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}. \tag{2.11}
\]
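Before turning to the proof, here is a minimal numerical sketch of the minimum-ECM rule in (2.10) and (2.11). The densities, costs, and priors below are hypothetical values chosen only to show how the cost/prior threshold shifts the decision boundary away from the plain ratio rule above; numpy and scipy are assumed.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical densities for pi_1 and pi_2 (illustrative only).
f1 = norm(loc=0.0, scale=1.0).pdf
f2 = norm(loc=2.0, scale=1.0).pdf

# Hypothetical costs and priors: misclassifying a pi_2 member into pi_1 is
# five times as costly, and pi_2 is much rarer than pi_1.
c12, c21 = 5.0, 1.0   # c(1|2), c(2|1)
p1, p2 = 0.9, 0.1     # prior probabilities

threshold = (c12 / c21) * (p2 / p1)

def allocate(x):
    """Minimum-ECM rule: compare the density ratio with the cost/prior threshold."""
    ratio = f1(x) / f2(x)
    if ratio > threshold:
        return "pi_1"
    if ratio < threshold:
        return "pi_2"
    return "unclassifiable"

# At x = 1.1 the ratio f1/f2 is below 1, so the plain ratio rule would pick pi_2,
# but with these costs and priors the minimum-ECM rule allocates x to pi_1.
print(threshold, allocate(1.1))
```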
Proof. Let us substitute the integral expressions for $P(2|1)$ and $P(1|2)$ given by (2.1) and (2.2) into the ECM:
\[
\mathrm{ECM} = c(2|1)\,p_1 \int_{R_2} f_1(x)\,dx + c(1|2)\,p_2 \int_{R_1} f_2(x)\,dx. \tag{2.12}
\]
Note that $\Omega = R_1 + R_2 + R_3$, so
\[
1 = \int_{\Omega} f_1(x)\,dx = \int_{R_1} f_1(x)\,dx + \int_{R_2} f_1(x)\,dx + \int_{R_3} f_1(x)\,dx. \tag{2.13}
\]
Since we know that $f_1(x) \neq f_2(x)$, $R_3$ must be a union of distinct points, meaning $\int_{R_3} f_1(x)\,dx = 0$ and leaving us with
\[
1 = \int_{R_1} f_1(x)\,dx + \int_{R_2} f_1(x)\,dx. \tag{2.14}
\]
Plugging (2.14) into (2.12), we see that
\[
\mathrm{ECM} = c(2|1)\,p_1 \left[ 1 - \int_{R_1} f_1(x)\,dx \right] + c(1|2)\,p_2 \int_{R_1} f_2(x)\,dx \tag{2.15}
\]
\[
\Rightarrow\ \mathrm{ECM} = \int_{R_1} \Big[ c(1|2)\,p_2\,f_2(x) - c(2|1)\,p_1\,f_1(x) \Big]\,dx + c(2|1)\,p_1. \tag{2.16}
\]
Recall that $p_1$ and $p_2$ are positive because they are probabilities, that the costs $c(1|2)$ and $c(2|1)$ are positive, and that $f_1(x)$ and $f_2(x)$ are nonnegative functions. Then by inspection, the ECM is minimized when $R_1$ is defined by those $x$ such that $c(1|2)\,p_2\,f_2(x) - c(2|1)\,p_1\,f_1(x) \le 0$. However, if in (2.13) we had decided to use $f_2(x)$ instead of $f_1(x)$, (2.16) would have been
\[
\mathrm{ECM} = \int_{R_2} \Big[ c(2|1)\,p_1\,f_1(x) - c(1|2)\,p_2\,f_2(x) \Big]\,dx + c(1|2)\,p_2,
\]
leading to the definition of $R_2$ as the set $\{\,x \mid c(2|1)\,p_1\,f_1(x) - c(1|2)\,p_2\,f_2(x) \le 0\,\}$. Since those $x$ satisfying $c(1|2)\,p_2\,f_2(x) = c(2|1)\,p_1\,f_1(x)$ could therefore be placed in both $R_1$ and $R_2$, we require that the inequalities in Theorem 2.1.1 defining the classification regions be strict, and we declare the case of equality unclassifiable. (Note that unclassifiable points occur with probability zero.)

The decision function given in Theorem 2.1.1 compares the probability ratio to the product of the cost ratio and the ratio of prior probabilities. The use of ratios is important because it is often much easier to estimate the cost ratio than each cost explicitly. For example, if we are considering the cost to a state university of educating an eventual dropout versus the cost of not educating an eventual graduate, the former can be estimated from school taxes and tuition, but the latter is more difficult to gauge. Even so, one could plausibly estimate that such a cost ratio might be 5:1 or so.

Let us examine the case where the $f_i(x)$ are multivariate normal densities with known mean vectors $\mu_i$ and covariance matrices $\Sigma_i$.

2.1.1 Classification of Normal Populations when $\Sigma_1 = \Sigma_2 = \Sigma$

Suppose that the density functions $f_1(x)$ and $f_2(x)$ for populations $\pi_1$ and $\pi_2$ are given by
\[
f_i(x) = \frac{1}{(2\pi)^{m/2}\,|\Sigma|^{1/2}} \exp\!\left[ -\frac{1}{2}\,(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) \right] \quad \text{for } i = 1, 2, \tag{2.17}
\]
for $x \in \mathbb{R}^m$.
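As a quick check on (2.17), the following sketch evaluates the multivariate normal density directly from a mean vector and covariance matrix and compares the result with scipy's built-in implementation. The particular $\mu_1$, $\Sigma$, and $x$ are hypothetical values used only for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Evaluate the density in (2.17) at x, given mean mu and covariance Sigma."""
    m = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)             # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# Hypothetical parameters, for illustration only.
mu1 = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, -1.0])

print(mvn_pdf(x, mu1, Sigma))
print(multivariate_normal(mean=mu1, cov=Sigma).pdf(x))     # should agree
```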
Then by substituting (2.17) into Theorem 2.1.1, we get the following minimum-ECM regions:
\[
R_1:\ \exp\!\left[ -\frac{1}{2}\,(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2}\,(x - \mu_2)^T \Sigma^{-1} (x - \mu_2) \right] > \frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}, \tag{2.18}
\]
\[
R_2:\ \exp\!\left[ -\frac{1}{2}\,(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2}\,(x - \mu_2)^T \Sigma^{-1} (x - \mu_2) \right] < \frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1}. \tag{2.19}
\]
Given the regions $R_1$ and $R_2$, we can construct the following classification rule:

Theorem 2.1.2. [4] Let the populations $\pi_1$ and $\pi_2$ be described by multivariate normal densities with known parameters $\mu_1$ and $\mu_2$ and $\Sigma_1 = \Sigma_2 = \Sigma$. Then the allocation rule that minimizes the ECM is as follows: allocate $x$ to $\pi_1$ if
\[
(\mu_1 - \mu_2)^T \Sigma^{-1} x - \frac{1}{2}\,(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2) > \ln\!\left[ \frac{c(1|2)}{c(2|1)} \times \frac{p_2}{p_1} \right];
\]
otherwise, allocate $x$ to $\pi_2$. As before, equality implies that the data point is unclassifiable.

Proof. Note that
\[
-\frac{1}{2}\,(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2}\,(x - \mu_2)^T \Sigma^{-1} (x - \mu_2) \tag{2.20}
\]
\[
= \frac{1}{2} \Big[ (x^T \Sigma^{-1} - \mu_2^T \Sigma^{-1})(x - \mu_2) - (x^T \Sigma^{-1} - \mu_1^T \Sigma^{-1})(x - \mu_1) \Big] \tag{2.21}
\]
\[
= \frac{1}{2} \Big[ x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_2 - \mu_2^T \Sigma^{-1} x + \mu_2^T \Sigma^{-1} \mu_2 - x^T \Sigma^{-1} x + x^T \Sigma^{-1} \mu_1 + \mu_1^T \Sigma^{-1} x - \mu_1^T \Sigma^{-1} \mu_1 \Big]. \tag{2.22}
\]
Here the $x^T \Sigma^{-1} x$ terms cancel; moreover, since $(x^T \Sigma^{-1} \mu_2)^T = \mu_2^T \Sigma^{-1} x$ is a scalar,
\[
-x^T \Sigma^{-1} \mu_2 - \mu_2^T \Sigma^{-1} x = -2\mu_2^T \Sigma^{-1} x, \tag{2.23}
\]
\[
x^T \Sigma^{-1} \mu_1 + \mu_1^T \Sigma^{-1} x = 2\mu_1^T \Sigma^{-1} x. \tag{2.24}
\]
(Recall that $\Sigma$ is a symmetric matrix.) Substituting equations (2.23) and (2.24) into equation (2.22)