International Statistical Review (1990), 58, 2, pp. 153-171. Printed in Great Britain. © International Statistical Institute

Maximum Likelihood: An Introduction

L. Le Cam
Department of Statistics, University of California, Berkeley, California 94720, USA

Summary

Maximum likelihood estimates are reported to be best under all circumstances. Yet there are numerous simple examples where they plainly misbehave. One gives some examples for problems that had not been invented for the purpose of annoying maximum likelihood fans. Another example, imitated from Bahadur, has been specially created with just such a purpose in mind. Next, we present a list of principles leading to the construction of good estimates. The main principle says that one should not believe in principles but study each problem for its own sake.

Key words: Estimation; Maximum likelihood; One-step approximations.

1 Introduction

One of the most widely used methods of statistical estimation is that of maximum likelihood. Opinions on who was the first to propose the method differ. However, Fisher is usually credited with the invention of the name 'maximum likelihood', with a major effort intended to spread its use and with the derivation of the optimality properties of the resulting estimates.

Qualms about the general validity of the optimality properties have been expressed occasionally. However, as late as 1970 L.J. Savage could imply in his 'Fisher lecture' that the difficulties arising in some examples would have rightly been considered 'mathematical caviling' by R.A. Fisher.

Of course nobody has been able to prove that maximum likelihood estimates are 'best' under all circumstances. The lack of any such proof is not sufficient by itself to invalidate Fisher's claims. It might simply mean that we have not yet translated into mathematics the basic principles which underlay Fisher's intuition.
The present author has, unwittingly, contributed to the confusion by writing two papers which have been interpreted by some as attempts to substantiate Fisher's claims.

To clarify the situation we present a few known facts which should be kept in mind as one proceeds along through the various proofs of consistency, asymptotic normality or asymptotic optimality of maximum likelihood estimates.

The examples given here deal mostly with the case of independent identically distributed observations. They are intended to show that maximum likelihood does possess disquieting features which rule out the possibility of existence of undiscovered underlying principles which could be used to justify it. One of the very gross forms of misbehavior can be stated as follows.

Maximum likelihood estimates computed with all the information available may turn out to be inconsistent. Throwing away a substantial part of the information may render them consistent.

The examples show that, in spite of all its presumed virtues, the maximum likelihood procedure cannot be universally recommended. This does not mean that we advocate
some other principle instead, although we give a few guidelines in § 6. For other views see the discussion of the paper by Berkson (1980).

This paper is adapted from lectures given at the University of Maryland, College Park, in the Fall of 1975. We are greatly indebted to Professor Grace L. Yang for the invitation to give the lectures and for the permission to reproduce them.

2 A Few Old Examples

Let $X_1, X_2, \ldots, X_n$ be independent identically distributed observations with values in some space $\{\mathcal{X}, A\}$. Suppose that there is a $\sigma$-finite measure $\mu$ on $A$ and that the distribution $P_\theta$ of $X_j$ has a density $f(x, \theta)$ with respect to $\mu$. The parameter $\theta$ takes its values in some set $\Theta$. For $n$ observations $x_1, x_2, \ldots, x_n$ the maximum likelihood estimate is any value $\hat{\theta}$ such that

$$\prod_{j=1}^n f(x_j, \hat{\theta}) = \sup_{\theta \in \Theta} \prod_{j=1}^n f(x_j, \theta).$$

Note that such a $\hat{\theta}$ need not exist, and that, when it does, it usually depends on what version of the densities $f(x, \theta)$ was selected. A function $(x_1, \ldots, x_n) \mapsto \hat{\theta}(x_1, \ldots, x_n)$ selecting a value $\hat{\theta}$ for each $n$-tuple $(x_1, \ldots, x_n)$ may or may not be measurable. However all of this is not too depressing. Let us consider some examples.

Example 1. (This may be due to Kiefer and Wolfowitz or to whoever first looked at mixtures of Normal distributions.) Let $\alpha$ be the number $\alpha = 10^{-10^{17}}$. Let $\theta = (\mu, \sigma)$, $\mu \in (-\infty, +\infty)$, $\sigma > 0$. Let $f_1(x, \theta)$ be the density defined with respect to Lebesgue measure $\lambda$ on the line by

$$f_1(x, \theta) = \frac{1-\alpha}{\sqrt{2\pi}} \exp\left\{-\tfrac{1}{2}(x-\mu)^2\right\} + \frac{\alpha}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}.$$

Then, for $(x_1, \ldots, x_n)$ one can take $\hat{\mu} = x_1$ and note that

$$\sup_{\sigma > 0} \prod_{j=1}^n f_1(x_j; \hat{\mu}, \sigma) = \infty.$$

If $\sigma = 0$ was allowed one could claim that $\hat{\theta} = (x_1, 0)$ is maximum likelihood.

Example 2. The above Example 1 is obviously contaminated and not fit to drink. Now a variable $X$ is called log normal if there are numbers $(a, b, c)$ such that $X = c + e^{aY+b}$ with a $Y$ which is $N(0, 1)$. Let $\theta = (a, b, c)$ in $\mathbb{R}^3$. The density of $X$ can be taken zero for $x \le c$ and, for $x > c$, it is equal to

$$f_2(x, \theta) = \frac{1}{a(x-c)\sqrt{2\pi}} \exp\left\{-\frac{1}{2a^2}\left[\log(x-c) - b\right]^2\right\}.$$

A sample $(x_1, \ldots, x_n)$ from this density will almost surely have no ties and a unique minimum $z = \min_j x_j$. The only values to consider are those for which $c < z$. Fix a value of $b$, say $b = 0$. Take a
$c \in (z - 1, z)$ so close to $z$ that

$$|\log(z-c)| = \max_j |\log(x_j - c)|.$$

Then the sum of squares in the exponent of the joint density does not exceed $n\,|\log(z-c)|^2$. One can make sure that this does not get too large by taking $a = n\,|\log(z-c)|$. The extra factor in the density then has a term of the type

$$\frac{[n\,|\log(z-c)|]^{-n}}{z-c},$$

which can still be made as large as you please.

If you do not believe my algebra, look at the paper by Hill (1963).

Example 3. The preceding example shows that the log normal distribution misbehaves. Everybody knows that taking logarithms is unfair. The following shows that three dimensional parameters are often unfair as well. (The example can be refined to apply to $\theta \in \mathbb{R}^2$.)

Let $\mathcal{X} = \mathbb{R}^3 = \Theta$. Let $\|x\|$ be the usual Euclidean length of $x$. Take a density

$$f_3(x, \theta) = C\,\frac{\exp\{-\|x-\theta\|^2\}}{\|x-\theta\|^{\beta}},$$

with $\beta \in (0, 1)$ fixed, say $\beta = \tfrac{1}{2}$. Here again

$$\prod_{j=1}^n f_3(x_j, \theta)$$

will have a supremum equal to $+\infty$. This time it is even attained by taking $\hat{\theta} = x_1$, or $x_2$.

One can make the situation a bit worse selecting a dense countable subset $\{a_k\}$, $k = 1, 2, \ldots$, in $\mathbb{R}^3$ and taking

$$f_4(x, \theta) = \sum_k C(k)\,\frac{\exp\{-\|x-\theta-a_k\|^2\}}{\|x-\theta-a_k\|^{\beta}},$$

with suitable coefficients $C(k)$ which decrease rapidly to zero.

Now take again $\alpha = 10^{-10^{17}}$ and take

$$f_5(x, \theta) = \frac{1-\alpha}{(2\pi)^{3/2}} \exp\left\{-\tfrac{1}{2}\|x-\theta\|^2\right\} + \alpha f_4(x, \theta).$$

If we do take into account the contamination $\alpha f_4(x, \theta)$ the supremum is infinite and attained at each $x_j$. If we ignore it everything seems fine, but then the maximum likelihood estimate is the mean

$$\bar{x} = \frac{1}{n} \sum_{j=1}^n x_j,$$

which, says C. Stein, is not admissible.

Example 4. The following example shows that, as in Examples 2 and 3, one should not shift things. Take independent identically distributed observations $X_1, \ldots, X_n$ from the
gamma density shifted to start at $\xi$, so that it is

$$f(x, \theta) = \beta^{\alpha}\,\Gamma^{-1}(\alpha)\, e^{-\beta(x-\xi)} (x-\xi)^{\alpha-1}$$

for $x \ge \xi$ and zero otherwise. Let $\beta$ and $\alpha$ take positive values and let $\xi$ be arbitrary real. Here, for arbitrary $0 < \alpha < 1$, and arbitrary $\beta > 0$, one will have

$$\sup_{\xi} \prod_{j=1}^n f(x_j, \theta) = \infty.$$

One can achieve $+\infty$ by taking $\hat{\xi} = \min_j X_j$, $\alpha \in (0, 1)$ and $\beta$ arbitrary. The shape of your observed histogram may be trying to tell you that it comes from an $\alpha \ge 10$, but that must be ignored.

Example 5. The previous examples have infinite contaminated inadmissible difficulties. Let us be more practical. Suppose that $X_1, X_2, \ldots, X_n$ are independent uniformly distributed on $[0, \theta]$, $\theta > 0$. Let $Z = \max_j X_j$. Then $\hat{\theta}_n = Z$ is the m.l.e. It is obviously pretty good. For instance

$$E_\theta(\hat{\theta}_n - \theta)^2 = \theta^2\,\frac{2}{(n+1)(n+2)}.$$

Except for mathematical caviling, as L.J. Savage says, it is also obviously best for all purposes. So, let us not cavil, but try

$$\theta_n^* = \frac{n+2}{n+1}\,Z.$$

Then

$$E_\theta(\theta_n^* - \theta)^2 = \theta^2\,\frac{1}{(n+1)^2}.$$

The ratio of the two is

$$\frac{E_\theta(\hat{\theta}_n - \theta)^2}{E_\theta(\theta_n^* - \theta)^2} = \frac{2(n+1)}{n+2}.$$

This must be less than unity. Therefore one must have $2(n+1) \le n+2$, or equivalently $n \le 0$. It is hard to design experiments where the number of observations is strictly negative. Thus our best bet is to design them with $n = 0$ and uphold the faith.

3 A More Disturbing Example

This one is due to Neyman and Scott. Suppose that $(X_j, Y_j)$, $j = 1, 2, \ldots, n$, are all independent random variables with $X_j$ and $Y_j$ both Normal $N(\xi_j, \sigma^2)$. We wish to estimate $\sigma^2$. A natural way to proceed would be to eliminate the nuisances $\xi_j$ and use the differences $Z_j = X_j - Y_j$, which are now $N(0, 2\sigma^2)$. One could then estimate $\sigma^2$ by

$$s^2 = \frac{1}{2n} \sum_{j=1}^n Z_j^2.$$

That looks possible, but we may have forgotten about some of the information which is contained in the pairs $(X_j, Y_j)$ but not in their differences $Z_j$. Certainly a direct application of maximum likelihood principles would be better and much less likely to lose information. So we compute $\hat{\sigma}^2$ by taking suprema over all $\xi_j$ and over $\sigma$.
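Before doing the algebra, one can watch the outcome numerically. The following is a small simulation sketch (mine, not from the lectures; the sample size, seed and the range of the nuisance means are arbitrary choices). For each pair the likelihood is maximized in $\xi_j$ by the pair mean $(X_j + Y_j)/2$, and the maximizing $\sigma^2$ is then the average squared residual over all $2n$ observations.

```python
import random

random.seed(0)

n = 100_000        # number of pairs; an arbitrary illustrative choice
sigma = 2.0        # true sigma, so sigma^2 = 4

# Each pair (X_j, Y_j) shares its own unknown mean xi_j.
pairs = []
for _ in range(n):
    xi_j = random.uniform(-100.0, 100.0)
    pairs.append((random.gauss(xi_j, sigma), random.gauss(xi_j, sigma)))

# Difference-based estimate: Z_j = X_j - Y_j is N(0, 2 sigma^2),
# so s^2 = (1/(2n)) * sum Z_j^2 is consistent for sigma^2.
s2 = sum((x - y) ** 2 for x, y in pairs) / (2 * n)

# Full maximum likelihood: the supremum over xi_j is reached at the
# pair mean (x + y)/2, and sigma^2 is then the average squared
# residual over all 2n observations.
sigma2_hat = sum((x - (x + y) / 2) ** 2 + (y - (x + y) / 2) ** 2
                 for x, y in pairs) / (2 * n)

print(s2)          # close to sigma^2 = 4
print(sigma2_hat)  # close to sigma^2 / 2 = 2
```

Pair by pair, $(x - \bar{m})^2 + (y - \bar{m})^2 = (x-y)^2/2$ when $\bar{m} = (x+y)/2$, so the two estimates differ by an exact factor of two; increasing $n$ does not wash the bias out.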
This gives

$$\hat{\sigma}^2 = \tfrac{1}{2} s^2.$$

Now, we did not take logarithms, nothing was contaminated, there was no infinity involved. In fact nothing seems amiss. So the best estimate must be not the intuitive $s^2$ but $\hat{\sigma}^2 = \tfrac{1}{2}s^2$.

The usual explanation for this discrepancy is that Neyman and Scott had too many parameters. This may be, but how many is too many? When there are too many, should one correct the m.l.e. by a factor of two or $(n+2)/(n+1)$ as in Example 5, or by taking a square root as in the m.l.e. for a star-like distribution? For this latter case, see Barlow et al. (1972).

The number of parameters, by itself, does not seem to be that relevant. Take, for instance, i.i.d. observations $X_1, X_2, \ldots, X_n$ on the line with a totally unknown distribution function $F$. The m.l.e. of $F$ is the empirical cumulative $\hat{F}_n$. It is not that bad. Yet, a crude evaluation shows that $\hat{F}_n$ depends on very many parameters indeed, perhaps even more than Barlow et al. had for their star-like distributions.

Note that in the above examples we did not let $n$ tend to infinity. It would not have helped, but now let us consider some examples where the misbehavior will be described as $n \to \infty$.

4 An Example of Bahadur

The following is a slight modification of an example given by Bahadur in 1958. The modification does not have the purity of the original but it is more transparent and the purity can be recovered.

Take a function, say $h$, defined on $(0, 1]$. Assume that $h$ is decreasing, that $h(x) > 1$ for all $x \in (0, 1]$ and that

$$\int_0^1 h(x)\,dx = \infty.$$

Select a number $c$, $c \in (0, 1)$ and proceed as follows. One probability measure, say $p_0$, on $[0, 1]$ is the Lebesgue measure $\lambda$ itself. Define a number $a_1$ by the property

$$\int_{a_1}^1 [h(x) - c]\,dx = 1 - c.$$

Take for $p_1$ the measure whose density with respect to $\lambda$ is $c$ for $0 \le x \le a_1$ and $h(x)$ for $a_1 < x \le 1$.

If $a_1, a_2, \ldots, a_{k-1}$ have been determined, define $a_k$ by the relation

$$\int_{a_k}^{a_{k-1}} [h(x) - c]\,dx = 1 - c$$

and take for $p_k$ the measure whose density with respect to $\lambda$ on $[0, 1]$ is $c$ for $x \notin (a_k, a_{k-1}]$ and $h(x)$ for $x \in (a_k, a_{k-1}]$.
Since

$$\int_0^1 h(x)\,dx = \infty,$$

the process can be continued indefinitely, giving a countable family of measures $p_k$, $k = 1, 2, \ldots$. Note that any two of them, say $p_j$ and $p_k$ with $j < k$, are mutually absolutely continuous.
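To see that the construction is perfectly concrete, here is a small numerical sketch. The choices $h(x) = 2/x$ and $c = \tfrac{1}{2}$ are mine, not from the text; they merely satisfy the stated assumptions ($h$ decreasing, $h > 1$ on $(0,1]$, non-integrable at $0$). The code computes the first few cutpoints $a_1 > a_2 > \cdots$ by bisection, using the closed-form antiderivative of $h - c$, and checks that each $p_k$ has total mass $1$.

```python
import math

# Illustrative choices not made in the text: h(x) = 2/x and c = 0.5.
c = 0.5

def H(x):
    """Antiderivative of h(x) - c for h(x) = 2/x."""
    return 2.0 * math.log(x) - c * x

def next_cut(a_prev):
    """Solve int_{a}^{a_prev} [h(x) - c] dx = 1 - c for a by bisection."""
    lo, hi = 1e-300, a_prev
    for _ in range(200):
        mid = math.sqrt(lo * hi)      # geometric mean: cuts shrink fast
        if H(a_prev) - H(mid) < 1.0 - c:
            hi = mid                  # integral too small: move the cut left
        else:
            lo = mid
    return math.sqrt(lo * hi)

cuts = [1.0]                          # a_0 = 1 starts the recursion
for _ in range(5):
    cuts.append(next_cut(cuts[-1]))

# p_k has density c off (a_k, a_{k-1}] and h(x) on it; its total mass,
# c * (1 - (a_{k-1} - a_k)) + int_{a_k}^{a_{k-1}} h(x) dx, should be 1.
for a_prev, a_k in zip(cuts, cuts[1:]):
    mass = c * (1.0 - (a_prev - a_k)) + 2.0 * math.log(a_prev / a_k)
    print(round(mass, 10))            # 1.0 for every k
```

The defining relation forces $\int_{a_k}^{a_{k-1}} h = (1-c) + c(a_{k-1} - a_k)$, so the mass check succeeds identically; the bisection merely locates where that relation holds.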