DEPARTMENT OF ECONOMICS
UNIVERSITY OF CYPRUS

THE MM, ME, ML, EL, EF AND GMM APPROACHES TO ESTIMATION: A SYNTHESIS

Anil K. Bera and Yannis Bilias

Discussion Paper 2001-09

P.O. Box 20537, 1678 Nicosia, CYPRUS
Tel.: ++357-2-892430, Fax: ++357-2-892432
Web site: http://www.econ.ucy.ac.cy
Abstract

The 20th century began on an auspicious statistical note with the publication of Karl Pearson’s (1900) goodness-of-fit test, which is regarded as one of the most important scientific breakthroughs. The basic motivation behind this test was to see whether an assumed probability model adequately described the data at hand. Pearson (1894) also introduced a formal approach to statistical estimation through his method of moments (MM) estimation. Ronald A. Fisher, while he was a third-year undergraduate at Gonville and Caius College, Cambridge, suggested the maximum likelihood estimation (MLE) procedure as an alternative to Pearson’s MM approach. In 1922 Fisher published a monumental paper that introduced such basic concepts as consistency, efficiency, and sufficiency, and even the term “parameter” with its present meaning. Fisher (1922) provided the analytical foundation of MLE and studied its efficiency relative to the MM estimator. Fisher (1924a) established the asymptotic equivalence of minimum χ² and ML estimators and wrote in favor of using the minimum χ² method rather than Pearson’s MM approach. Recently, econometricians have found working under assumed likelihood functions restrictive and have suggested using a generalized version of Pearson’s MM approach, commonly known as the GMM estimation procedure, as advocated in Hansen (1982). Earlier, Godambe (1960) and Durbin (1960) developed the estimating function (EF) approach to estimation, which has proven very useful for many statistical models. A fundamental result is that the score is the optimum EF. Ferguson (1958) considered an approach very similar to GMM and showed that estimation based on the Pearson chi-squared statistic is equivalent to efficient GMM. Golan, Judge and Miller (1996) developed an entropy-based formulation that allowed them to solve a wide range of estimation and inference problems in econometrics. More recently, Imbens, Spady and Johnson (1998), Kitamura and Stutzer (1997) and Mittelhammer, Judge and Miller (2000) put GMM within the framework of empirical likelihood (EL) and maximum entropy (ME) estimation. It can be shown that many of these estimation techniques can be obtained as special cases of minimizing the Cressie and Read (1984) power divergence criterion, which comes directly from the Pearson (1900) chi-squared statistic. In this way we are able to assimilate a number of seemingly unrelated estimation techniques into a unified framework.
1 Prologue: Karl Pearson’s method of moment estimation and chi-squared test, and entropy

In this paper we are going to discuss various methods of estimation, especially those developed in the twentieth century, beginning with a review of some developments in statistics at the close of the nineteenth century. In 1892 W.F. Raphael Weldon, a zoologist turned statistician, requested Karl Pearson (1857-1936) to analyze a set of data on crabs. After some investigation Pearson realized that he could not fit the usual normal distribution to these data. By the early 1890s Pearson had developed a class of distributions that later came to be known as the Pearson system of curves, which is much broader than the normal distribution. However, for the crab data Pearson’s own system of curves was not good enough. He dissected this “abnormal frequency curve” into two normal curves as follows:

    f(y) = \alpha f_1(y) + (1 - \alpha) f_2(y),        (1)

where

    f_j(y) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left[-\frac{1}{2\sigma_j^2}(y - \mu_j)^2\right], \quad j = 1, 2.

This model has five parameters¹: $(\alpha, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$. Previously, there had been no method available to estimate such a model. Pearson quite unceremoniously suggested a method that simply equated the first five population moments to the respective sample counterparts. It was not easy to solve five highly nonlinear equations. Therefore, Pearson took an analytical approach of eliminating one parameter in each step. After considerable algebra he found a ninth-degree polynomial equation in one unknown. Then, after solving this equation and by repeated back-substitutions, he found solutions for the five parameters in terms of the first five sample moments. It was around the autumn of 1893 that he completed this work, and it appeared in 1894. This was the beginning of method of moment (MM) estimation. There is no general theory in Pearson (1894). The paper is basically a worked-out “example” (though a very difficult one as the first illustration of MM estimation) of a new estimation method.²

¹ The term “parameter” was introduced by Fisher (1922, p. 311) [also see footnote 16]. Karl Pearson described the “parameters” as “constants” of the “curve.” Fisher (1912) also used “frequency curve.” However, in Fisher (1922) he used the term “distribution” throughout. “Probability density function” came much later, in Wilks (1943, p. 8) [see David (1995)].

² Shortly after Karl Pearson’s death, his son Egon Pearson provided an account of the life and work of the elder Pearson [see Pearson (1936)]. He summarized (pp. 219-220) the contribution of Pearson (1894) stating, “The paper is particularly noteworthy for its introduction of the method of moments as a means of fitting a theoretical curve to observed data. This method is not claimed to be the best but is advocated from the utilitarian standpoint on the grounds that it appears to give excellent fits and provides algebraic solutions for calculating the constants of the curve which are analytically possible.”
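Before turning to Pearson’s later theoretical justification, it may help to see his moment-matching problem in modern computational form. The following is a minimal sketch, not Pearson’s method of solution: instead of his reduction to a ninth-degree polynomial, it solves the five moment equations for the mixture in (1) numerically. The simulated data (standing in for Weldon’s crab measurements) and the use of SciPy’s root finder are illustrative assumptions.

import numpy as np
from scipy.optimize import fsolve

def normal_raw_moments(mu, s2):
    # First five raw moments E[Y^j], j = 1..5, of N(mu, s2).
    return np.array([
        mu,
        mu**2 + s2,
        mu**3 + 3*mu*s2,
        mu**4 + 6*mu**2*s2 + 3*s2**2,
        mu**5 + 10*mu**3*s2 + 15*mu*s2**2,
    ])

def moment_equations(theta, m_sample):
    # Pearson's five conditions: mixture moments minus sample moments.
    a, mu1, s21, mu2, s22 = theta
    m_model = (a * normal_raw_moments(mu1, s21)
               + (1 - a) * normal_raw_moments(mu2, s22))
    return m_model - m_sample

# Simulated two-component data (illustrative, not the crab data).
rng = np.random.default_rng(0)
n = 5000
comp = rng.random(n) < 0.4
y = np.where(comp, rng.normal(0.0, 1.0, n), rng.normal(3.0, 1.5, n))

m_sample = np.array([np.mean(y**j) for j in range(1, 6)])

# Solve the five nonlinear equations; a sensible starting point matters,
# since the system can admit multiple roots.
start = np.array([0.5, y.mean() - y.std(), y.var(),
                  y.mean() + y.std(), y.var()])
theta_hat = fsolve(moment_equations, start, args=(m_sample,))
print(theta_hat)  # (alpha, mu1, sigma1^2, mu2, sigma2^2)

With a reasonable starting point the solver lands near the generating values (0.4, 0.0, 1.0, 3.0, 2.25); the sensitivity to starting values mirrors the difficulty Pearson faced with this highly nonlinear system.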
After an experience of "some eight years"in applying the MM to a vast range of physical and social data,Pearson (1902)provided some "theoretical"justification of his methodology. Suppose we want to estimate the parameter vector 6=(01,02,...,0p)'of the probability density function f(y;0).By a Taylor series expansion of f(y)=f(y;0)around y=0,we can write )=0+1划++ ”31++Φ知阶+2 (2) where 2,...,depends on 01,02,...,p and R is the remainder term.Let f(y)be the ordinate corresponding to y given by observations.Therefore,the problem is to fit a smooth curve f(y;0)to p histogram ordinates given by f(y).Then f(y)-f(y)denotes the distance between the theoretical and observed curve at the point corresponding to y,and our objective would be to make this distance as small as possible by a proper choice of ..[see Pearson (1902,p.268)].3 Although Pearson discussed the fit of f(y)to p histogram ordinates f(y),he proceeded to find a"theoretical"version of f(y)that minimizes [see Mensch(1980)] Lf(w)-F(u)Pdv. (3) Since f(.)is the variable,the resulting equation is /[f()-f)16fdy=0, (4) where,from(2),the differential of can be written as yi Bp 6f=∑60,7+00 (5) j=0 Therefore,we can write equation (4)as /ro-o空号+4海-上/o-号+,-0 (6) 1=0 Since the quantities 0,01,02,...,p are at our choice,for (6)to hold,each component should be independently zero,i.e.,we should have r)om ∂R dw=0, j=0,1,2,,p () on the grounds that it appears to give excellent fits and provides algebraic solutions for calculating the constants of the curve which are analytically possible." 3It is hard to trace the first use of smooth non-parametric density estimation in the statistics literature. Koenker(2000,p.349)mentioned Galton's(1885)illustration of "regression to the mean"where Galton averaged the counts from the four adjacent squares to achieve smoothness.Karl Pearson's minimization of the distance between f(y)and f(y)looks remarkably modern in terms of ideas and could be viewed as a modern-equivalent of smooth non-parametric density estimation [see also Mensch(1980)]. 2
After an experience of “some eight years” in applying the MM to a vast range of physical and social data, Pearson (1902) provided some “theoretical” justification of his methodology. Suppose we want to estimate the parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_p)'$ of the probability density function $f(y; \theta)$. By a Taylor series expansion of $f(y) \equiv f(y; \theta)$ around $y = 0$, we can write

    f(y) = \phi_0 + \phi_1 y + \phi_2 \frac{y^2}{2!} + \phi_3 \frac{y^3}{3!} + \cdots + \phi_p \frac{y^p}{p!} + R,        (2)

where $\phi_0, \phi_1, \phi_2, \ldots, \phi_p$ depend on $\theta_1, \theta_2, \ldots, \theta_p$ and $R$ is the remainder term. Let $\bar{f}(y)$ be the ordinate corresponding to $y$ given by the observations. Therefore, the problem is to fit a smooth curve $f(y; \theta)$ to $p$ histogram ordinates given by $\bar{f}(y)$. Then $f(y) - \bar{f}(y)$ denotes the distance between the theoretical and observed curves at the point corresponding to $y$, and our objective would be to make this distance as small as possible by a proper choice of $\phi_0, \phi_1, \phi_2, \ldots, \phi_p$ [see Pearson (1902, p. 268)].³ Although Pearson discussed the fit of $f(y)$ to $p$ histogram ordinates $\bar{f}(y)$, he proceeded to find a “theoretical” version of $f(y)$ that minimizes [see Mensch (1980)]

    \int [f(y) - \bar{f}(y)]^2 \, dy.        (3)

Since $f(\cdot)$ is the variable, the resulting equation is

    \int [f(y) - \bar{f}(y)] \, \delta f \, dy = 0,        (4)

where, from (2), the differential $\delta f$ can be written as

    \delta f = \sum_{j=0}^{p} \left( \delta\phi_j \frac{y^j}{j!} + \frac{\partial R}{\partial \phi_j} \delta\phi_j \right).        (5)

Therefore, we can write equation (4) as

    \int [f(y) - \bar{f}(y)] \sum_{j=0}^{p} \left( \delta\phi_j \frac{y^j}{j!} + \frac{\partial R}{\partial \phi_j} \delta\phi_j \right) dy = \sum_{j=0}^{p} \int [f(y) - \bar{f}(y)] \left( \frac{y^j}{j!} + \frac{\partial R}{\partial \phi_j} \right) dy \, \delta\phi_j = 0.        (6)

Since the quantities $\phi_0, \phi_1, \phi_2, \ldots, \phi_p$ are at our choice, for (6) to hold each component should be independently zero, i.e., we should have

    \int [f(y) - \bar{f}(y)] \left( \frac{y^j}{j!} + \frac{\partial R}{\partial \phi_j} \right) dy = 0, \quad j = 0, 1, 2, \ldots, p,        (7)

which is the same as

    \mu_j = m_j - j! \int [f(y) - \bar{f}(y)] \frac{\partial R}{\partial \phi_j} \, dy, \quad j = 0, 1, 2, \ldots, p.        (8)

Here $\mu_j$ and $m_j$ are, respectively, the $j$-th moment corresponding to the theoretical curve $f(y)$ and the observed curve $\bar{f}(y)$.⁴ Pearson (1902) then ignored the integral terms, arguing that they involve the small factor $f(y) - \bar{f}(y)$ and the remainder term $R$, which by “hypothesis” is small for a large enough sample size. After neglecting the integral terms in (8), Pearson obtained the equations

    \mu_j = m_j, \quad j = 0, 1, \ldots, p.        (9)

Then he stated the principle of the MM as [see Pearson (1902, p. 270)]: “To fit a good theoretical curve $f(y; \theta_1, \theta_2, \ldots, \theta_p)$ to an observed curve, express the area and moments of the curve for the given range of observations in terms of $\theta_1, \theta_2, \ldots, \theta_p$, and equate these to the like quantities for the observations.” Arguing that, if the first $p$ moments of two curves are identical, the higher moments of the curves become “ipso facto more and more nearly identical” for larger sample size, he concluded that the “equality of moments gives a good method of fitting curves to observations” [Pearson (1902, p. 271)]. We should add that much of his theoretical argument is not very rigorous, but the 1902 paper did provide a reasonable theoretical basis for the MM and illustrated its usefulness.⁵ For detailed discussion of the properties of the MM estimator, see Shenton (1950, 1958, 1959).

³ It is hard to trace the first use of smooth non-parametric density estimation in the statistics literature. Koenker (2000, p. 349) mentioned Galton’s (1885) illustration of “regression to the mean” where Galton averaged the counts from the four adjacent squares to achieve smoothness. Karl Pearson’s minimization of the distance between $f(y)$ and $\bar{f}(y)$ looks remarkably modern in terms of ideas and could be viewed as a modern equivalent of smooth non-parametric density estimation [see also Mensch (1980)].

⁴ It should be stressed that $m_j = \int y^j \bar{f}(y)\, dy = \sum_i y_i^j \pi_i$, with $\pi_i$ denoting the area of the bin of the $i$-th observation; this is not necessarily equal to the sample moment $n^{-1} \sum_i y_i^j$ that is used in today’s MM. Rather, Pearson’s formulation of empirical moments uses the efficient weighting $\pi_i$ under a multinomial probability framework, an idea which is used in the literature of empirical likelihood and maximum entropy and is to be described later in this paper.

⁵ One of the first and possibly most important applications of the MM idea is the derivation of the t-distribution in Student (1908), which was a major breakthrough in introducing the concept of the finite sample (exact) distribution in statistics. Student (1908) obtained the first four moments of the sample variance $S^2$, matched them with those of the Pearson type III distribution, and concluded (p. 4) “a curve of Professor Pearson’s type III may be expected to fit the distribution of $S^2$.” Student, however, was very cautious and quickly added (p. 5), “it is probable that the curve found represents the theoretical distribution of $S^2$ so that although we have no actual proof we shall assume it to do so in what follows.” And this was the basis of his derivation of the t-distribution. The name t-distribution was given by Fisher (1924b).
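As an even simpler concrete instance of the principle in (9), consider matching the first two theoretical moments of a gamma density to the sample mean and variance. The gamma family and the simulated data here are our illustrative choices, not an example from Pearson; a minimal sketch under those assumptions:

import numpy as np

# For Y ~ Gamma(shape k, scale s): E[Y] = k*s and Var(Y) = k*s^2.
# Equating these to the sample mean and variance (equation (9)) gives
# closed-form MM estimators:
#   s_hat = sample_variance / sample_mean
#   k_hat = sample_mean^2 / sample_variance
rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=3.0, size=10_000)

ybar, s2 = y.mean(), y.var()
scale_hat = s2 / ybar
shape_hat = ybar**2 / s2
print(shape_hat, scale_hat)  # should land near the true values (2.0, 3.0)

This closed-form case illustrates why the MM was advocated “from the utilitarian standpoint”: the estimates are simple algebraic functions of the sample moments, with no iterative fitting required.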
which is same as µj = mj − j! Z [f(y) − ¯f(y)]( ∂R ∂φj )dy, j = 0, 1, 2, . . . , p. (8) Here µj and mj are, respectively, the j-th moment corresponding to the theoretical curve f(y) and the observed curve ¯f(y).4 Pearson (1902) then ignored the integral terms arguing that they involve the small factor f(y) − ¯f(y), and the remainder term R, which by “hypothesis” is small for large enough sample size. After neglecting the integral terms in (8), Pearson obtained the equations µj = mj , j = 0, 1, . . . , p. (9) Then, he stated the principle of the MM as [see Pearson (1902, p.270)]: “To fit a good theoretical curve f(y; θ1, θ2, . . . , θp) to an observed curve, express the area and moments of the curve for the given range of observations in terms of θ1, θ2, . . . , θp, and equate these to the like quantities for the observations.” Arguing that, if the first p moments of two curves are identical, the higher moments of the curves becomes “ipso facto more and more nearly identical” for larger sample size, he concluded that the “equality of moments gives a good method of fitting curves to observations” [Pearson (1902, p.271)]. We should add that much of his theoretical argument is not very rigorous, but the 1902 paper did provide a reasonable theoretical basis for the MM and illustrated its usefulness.5 For detailed discussion on the properties of the MM estimator see Shenton (1950, 1958, 1959). After developing his system of curves [Pearson (1895)], Pearson and his associates were fitting this system to a large number of data sets. Therefore, there was a need to formulate a test to check whether an assumed probability model adequately explained the data at hand. He succeeded in doing that and the result was Pearson’s celebrated (1900) χ 2 goodness-of- fit test. To describe the Pearson test let us consider a distribution with k classes with the 4 It should be stressed that mj = R y j ¯f(y)dy = Pn i y j i πi with πi denoting the area of the bin of the ith observation; this is not necessarily equal to the sample moment n −1 P i y j i that is used in today’s MM. Rather, Pearson’s formulation of empirical moments uses the efficient weighting πi under a multinomial probability framework, an idea which is used in the literature of empirical likelihood and maximum entropy and to be described later in this paper. 5One of the first and possibly most important applications of MM idea is the derivation of t-distribution in Student (1908) which was major breakthrough in introducing the concept of finite sample (exact) distribution in statistics. Student (1908) obtained the first four moments of the sample variance S 2 , matched them with those of the Pearson type III distribution, and concluded (p.4) “a curve of Professor Pearson’s type III may be expected to fit the distribution of S 2 .” Student, however, was very cautious and quickly added (p.5), “it is probable that the curve found represents the theoretical distribution of S 2 so that although we have no actual proof we shall assume it to do so in what follows.” And this was the basis of his derivation of the t-distribution. The name t-distribution was given by Fisher (1924b). 3