probability of the j-th class being q_j (≥ 0), j = 1, 2, ..., k, and \sum_{j=1}^{k} q_j = 1. Suppose that according to the assumed probability model q_j = q_{j0}; therefore, one would be interested in testing the hypothesis H_0 : q_j = q_{j0}, j = 1, 2, ..., k. Let n_j denote the observed frequency of the j-th class, with \sum_{j=1}^{k} n_j = N. Pearson (1900) suggested the goodness-of-fit statistic^6

P = \sum_{j=1}^{k} \frac{(n_j - N q_{j0})^2}{N q_{j0}} = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j},    (10)

where O_j and E_j denote, respectively, the observed and expected frequencies of the j-th class. This is the first constructive test in the statistics literature. Broadly speaking, P is essentially a distance measure between the observed and expected frequencies.

It is quite natural to question the relevance of this test statistic in the context of estimation. Let us note that P could be used to measure the distance between any two sets of probabilities, say (p_j, q_j), j = 1, 2, ..., k, by simply writing p_j = n_j/N and q_j = q_{j0}, i.e.,

P = N \sum_{j=1}^{k} \frac{(p_j - q_j)^2}{q_j}.    (11)

As we will see shortly, a simple transformation of P could generate a broad class of distance measures. And later, in Section 5, we will demonstrate that many of the current estimation procedures in econometrics can be cast in terms of minimizing the distance between two sets of probabilities subject to certain constraints. In this way, we can tie and assimilate many estimation techniques together using Pearson's MM and χ²-statistic as the unifying themes.

We can write P as

P = N \sum_{j=1}^{k} \frac{p_j (p_j - q_j)}{q_j} = N \sum_{j=1}^{k} p_j \left( \frac{p_j}{q_j} - 1 \right).    (12)

Therefore, the essential quantity in measuring the divergence between two probability distributions is the ratio p_j/q_j.

^6 This test is regarded as one of the 20 most important scientific breakthroughs of this century, along with advances and discoveries like the theory of relativity, the IQ test, hybrid corn, antibiotics, television, the transistor and the computer [see Hacking (1984)]. In his editorial in the inaugural issue of Sankhyā, The Indian Journal of Statistics, Mahalanobis (1933) wrote, "... the history of modern statistics may be said to have begun from Karl Pearson's work on the distribution of χ² in 1900. The Chi-square test supplied for the first time a tool by which the significance of the agreement or discrepancy between theoretical expectations and actual observations could be judged with precision." Even Pearson's lifelong arch-rival Ronald A. Fisher (1922, p.314) conceded, "Nor is the introduction of the Pearsonian system of frequency curves the only contribution which their author has made to the solution of problems of specification: of even greater importance is the introduction of an objective criterion of goodness of fit." For more on this see Bera (2000) and Bera and Bilias (2001).
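As a small numerical check of the algebra linking (10), (11) and (12), the following sketch (ours, not from the original paper; the class counts and null probabilities are hypothetical) computes P in its three equivalent forms.

```python
# Minimal sketch: Pearson's goodness-of-fit statistic P in the three equivalent
# forms (10), (11) and (12).  Counts and null probabilities are made up.
import numpy as np

obs = np.array([18, 29, 62, 41])         # observed frequencies n_j (hypothetical)
q0  = np.array([0.1, 0.2, 0.4, 0.3])     # hypothesized probabilities q_j0
N   = obs.sum()
p   = obs / N                            # empirical probabilities p_j = n_j / N

P_counts = np.sum((obs - N * q0) ** 2 / (N * q0))   # form (10): observed vs expected counts
P_probs  = N * np.sum((p - q0) ** 2 / q0)           # form (11): distance between probabilities
P_ratio  = N * np.sum(p * (p / q0 - 1.0))           # form (12): driven by the ratio p_j / q_j

assert np.allclose([P_counts, P_probs], P_ratio)    # the three forms coincide
print(P_counts)
```

Under H_0 the statistic is compared with a χ² distribution with k − 1 degrees of freedom; for reference, the same statistic and its p-value are returned by scipy.stats.chisquare(obs, f_exp=N*q0).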
Using Stevens' (1975) idea on "visual perception," Cressie and Read (1984) suggested using the relative difference between the perceived probabilities, (p_j/q_j)^λ − 1, where λ "typically lies in the range from 0.6 to 0.9" but could theoretically be any real number [see also Read and Cressie (1988, p.17)]. Weighting this quantity in proportion to p_j and summing over all the classes leads to the following measure of divergence:

\sum_{j=1}^{k} p_j \left[ \left( \frac{p_j}{q_j} \right)^{\lambda} - 1 \right].    (13)

This is approximately proportional to the Cressie and Read (1984) power divergence family of statistics^7

I_{\lambda}(p, q) = \frac{2}{\lambda(\lambda + 1)} \sum_{j=1}^{k} p_j \left[ \left( \frac{p_j}{q_j} \right)^{\lambda} - 1 \right] = \frac{2}{\lambda(\lambda + 1)} \sum_{j=1}^{k} q_j \left\{ \left[ 1 + \left( \frac{p_j}{q_j} - 1 \right) \right]^{\lambda + 1} - 1 \right\},    (14)

where p = (p_1, p_2, ..., p_k)' and q = (q_1, q_2, ..., q_k)'. Lindsay (1994, p.1085) calls δ_j = (p_j/q_j) − 1 the "Pearson" residual, since we can express the Pearson statistic in (11) as P = N \sum_{j=1}^{k} q_j δ_j². From this, it is immediately seen that when λ = 1, I_λ(p, q) reduces to P/N. In fact, a number of well-known test statistics can be obtained from I_λ(p, q). When λ → 0, we have the likelihood ratio (LR) test statistic, which, as an alternative to (10), can be written as

LR = 2 \sum_{j=1}^{k} n_j \ln\left( \frac{n_j}{N q_{j0}} \right) = 2 \sum_{j=1}^{k} O_j \ln\left( \frac{O_j}{E_j} \right).    (15)

Similarly, λ = −1/2 gives the Freeman and Tukey (1950) (FT) statistic, or Hellinger distance,

FT = 4 \sum_{j=1}^{k} \left( \sqrt{n_j} - \sqrt{N q_{j0}} \right)^2 = 4 \sum_{j=1}^{k} \left( \sqrt{O_j} - \sqrt{E_j} \right)^2.    (16)

All these test statistics are just different measures of distance between the observed and expected frequencies. Therefore, I_λ(p, q) provides a very rich class of divergence measures.

^7 In the entropy literature this is known as Rényi's (1961) α-class of generalized measures of entropy [see Maasoumi (1993, p.144), Ullah (1996, p.142) and Mittelhammer, Judge and Miller (2000, p.328)]. Golan, Judge and Miller (1996, p.36) referred to Schützenberger (1954) as well. This formulation has also been used extensively as a general class of decomposable income inequality measures [see, for example, Cowell (1980) and Shorrocks (1980)] and in time-series analysis to distinguish chaotic data from random data [Pompe (1994)].
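The special cases quoted above are easy to verify numerically. The sketch below (ours; it reuses the hypothetical counts from the previous example) implements I_λ(p, q) from (14) and checks that N·I_1 equals P, that the λ → 0 limit gives LR in (15), and that λ = −1/2 gives FT in (16).

```python
# Sketch: the Cressie-Read power divergence (14) and its special cases.
# N*I_1 = P (Pearson), N*I_0 = LR in (15), N*I_{-1/2} = FT in (16).
import numpy as np

def power_divergence(p, q, lam):
    """I_lambda(p, q) of (14); lambda = 0 is handled as the limiting case."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(lam, 0.0):                     # lim_{lambda -> 0}: 2 * sum p_j ln(p_j/q_j)
        return 2.0 * np.sum(p * np.log(p / q))
    return (2.0 / (lam * (lam + 1.0))) * np.sum(p * ((p / q) ** lam - 1.0))

obs = np.array([18, 29, 62, 41])                 # hypothetical counts, as before
q0  = np.array([0.1, 0.2, 0.4, 0.3])
N   = obs.sum()
p   = obs / N

P  = np.sum((obs - N * q0) ** 2 / (N * q0))                # (10)
LR = 2.0 * np.sum(obs * np.log(obs / (N * q0)))            # (15)
FT = 4.0 * np.sum((np.sqrt(obs) - np.sqrt(N * q0)) ** 2)   # (16)

assert np.isclose(N * power_divergence(p, q0,  1.0), P)
assert np.isclose(N * power_divergence(p, q0,  0.0), LR)
assert np.isclose(N * power_divergence(p, q0, -0.5), FT)
```

For count data the same family is also available as scipy.stats.power_divergence, whose lambda_ argument plays the role of λ here.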
Any probability distribution p_i, i = 1, 2, ..., n (say), of a random variable taking n values provides a measure of uncertainty regarding that random variable. In the information theory literature, this measure of uncertainty is called entropy. The origin of the term "entropy" goes back to thermodynamics. The second law of thermodynamics states that there is an inherent tendency for disorder to increase. A probability distribution gives us a measure of disorder. Entropy is generally taken as a measure of expected information, that is, how much information we have in the probability distribution p_i, i = 1, 2, ..., n. Intuitively, information should be a decreasing function of p_i, i.e., the more unlikely an event, the more interesting it is to know that it can happen [see Shannon and Weaver (1949, p.105) and Sen (1975, pp.34-35)]. A simple choice for such a function is −ln p_i. Entropy H(p) is defined as a weighted sum of the information −ln p_i, i = 1, 2, ..., n, with the respective probabilities as weights, namely,

H(p) = - \sum_{i=1}^{n} p_i \ln p_i.    (17)

If p_i = 0 for some i, then p_i ln p_i is taken to be zero. When p_i = 1/n for all i, H(p) = ln n; then we have the maximum value of the entropy and consequently the least information available from the probability distribution. The other extreme case occurs when p_i = 1 for one i and 0 for the rest; then H(p) = 0. If we do not weight each −ln p_i by p_i and simply take the sum, another measure of entropy would be

H'(p) = - \sum_{i=1}^{n} \ln p_i.    (18)

Following (17), the cross-entropy of one probability distribution p = (p_1, p_2, ..., p_n)' with respect to another distribution q = (q_1, q_2, ..., q_n)' can be defined as

C(p, q) = \sum_{i=1}^{n} p_i \ln(p_i / q_i) = E[\ln p] - E[\ln q],    (19)

which is yet another measure of distance between two distributions. It is easy to see the link between C(p, q) and the Cressie and Read (1984) power divergence family. If we choose q = (1/n, 1/n, ..., 1/n)' = i/n, where i is an n × 1 vector of ones, C(p, q) reduces to

C(p, i/n) = \sum_{i=1}^{n} p_i \ln p_i + \ln n = \ln n - H(p).    (20)

Therefore, entropy maximization is a special case of cross-entropy minimization with respect to the uniform distribution. For more on entropy, cross-entropy and their uses in econometrics see Maasoumi (1993), Ullah (1996), Golan, Judge and Miller (1996, 1997 and 1998), Zellner and Highfield (1988), Zellner (1991) and other papers in Grandy and Schick (1991), Zellner (1997) and Mittelhammer, Judge and Miller (2000).
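A small sketch (ours; the distribution p is made up) verifies the identities behind (17)-(20): the uniform distribution attains the maximum entropy ln n, and the cross-entropy against the uniform distribution equals ln n − H(p), so minimizing one amounts to maximizing the other.

```python
# Sketch: entropy (17), cross-entropy (19), and the identity C(p, i/n) = ln n - H(p) in (20).
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]                                  # p_i ln p_i is taken as 0 when p_i = 0
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    keep = p > 0
    return np.sum(p[keep] * np.log(p[keep] / q[keep]))

p = np.array([0.5, 0.25, 0.15, 0.10])             # hypothetical distribution
n = len(p)
uniform = np.full(n, 1.0 / n)

assert np.isclose(entropy(uniform), np.log(n))                          # maximum entropy ln n
assert np.isclose(cross_entropy(p, uniform), np.log(n) - entropy(p))    # identity (20)
assert cross_entropy(p, uniform) >= 0.0           # zero only when p is uniform
```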
If we try to find a probability distribution that maximizes the entropy H(p) in (17), the optimal solution is the uniform distribution, i.e., p* = i/n. In the Bayesian literature, it is common to maximize an entropy measure to find non-informative priors. Jaynes (1957) was the first to consider the problem of finding a prior distribution that maximizes H(p) subject to certain side conditions, which could be given in the form of some moment restrictions. Jaynes' problem can be stated as follows. Suppose we want to find a least informative probability distribution p_i = Pr(Y = y_i), i = 1, 2, ..., n, of a random variable Y satisfying, say, m moment restrictions E[h_j(Y)] = μ_j with known μ_j's, j = 1, 2, ..., m. Jaynes (1957, p.623) found an explicit solution to the problem of maximizing H(p) subject to the above moment conditions and \sum_{i=1}^{n} p_i = 1 [for a treatment of this problem under very general conditions, see Haberman (1984)]. We can always find some (in fact, many) solutions just by satisfying the constraints; however, maximization of (17) makes the resulting probabilities p_i (i = 1, 2, ..., n) as smooth as possible. Jaynes' (1957) formulation has been extensively used in the Bayesian literature to find priors that are as noninformative as possible given some prior partial information [see Berger (1985, pp.90-94)].
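Jaynes' explicit solution is an exponential tilt, p_i ∝ exp(\sum_j λ_j h_j(y_i)), with the Lagrange multipliers λ_j chosen so that the moment restrictions hold. The sketch below (ours, not Jaynes' own computation; the support of Y and the single restriction E[Y] = 4.5 are chosen purely for illustration) recovers such a distribution by minimizing the convex dual with scipy.

```python
# Sketch: maximum entropy subject to moment restrictions E[h_j(Y)] = mu_j.
# The solution is p_i proportional to exp(sum_j lam_j h_j(y_i)); the multipliers
# are found by minimizing the convex dual  log sum_i exp(h(y_i)'lam) - lam'mu.
import numpy as np
from scipy.optimize import minimize

y  = np.arange(1, 7, dtype=float)       # support of Y (hypothetical)
h  = y.reshape(-1, 1)                   # moment functions h_j(y_i); here m = 1 with h_1(y) = y
mu = np.array([4.5])                    # target moment(s), here E[Y] = 4.5 (hypothetical)

def tilt(lam):                          # exponential tilting of the uniform weights
    w = np.exp(h @ lam)
    return w / w.sum()

def dual(lam):                          # log-partition minus lam'mu (convex in lam)
    return np.log(np.sum(np.exp(h @ lam))) - lam @ mu

def dual_grad(lam):                     # gradient = E_p[h] - mu
    return h.T @ tilt(lam) - mu

res = minimize(dual, x0=np.zeros(len(mu)), jac=dual_grad, method="BFGS")
p = tilt(res.x)                         # maximum-entropy probabilities

assert np.allclose(h.T @ p, mu, atol=1e-4)   # moment restriction satisfied
print(np.round(p, 4))                        # tilted away from uniform toward larger y
```

Loosely speaking, the information-theoretic estimators mentioned below exploit the same structure, with the moment restrictions replaced by estimating equations involving the parameter of interest.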
In recent years econometricians have tried to estimate the parameter(s) of interest, say θ, utilizing only certain moment conditions satisfied by the underlying probability distribution; this is known as generalized method of moments (GMM) estimation. The GMM procedure is an extension of Pearson's (1895, 1902) MM to the case where there are more moment restrictions than the dimension of the unknown parameter vector. The GMM estimation technique can also be cast in the information theoretic framework of entropy maximization following the empirical likelihood (EL) method of Owen (1988, 1990, 1991) and Qin and Lawless (1994). Back and Brown (1993), Kitamura and Stutzer (1997) and Imbens, Spady and Johnson (1998) developed information theoretic, entropy-maximization estimation procedures that include GMM as a special case. Therefore, we observe how seemingly distinct ideas of Pearson's χ² test statistic and GMM estimation are tied to the common principle of measuring distance between two probability distributions through the entropy measure.

The modest aim of this review paper is essentially this idea of assimilating distinct estimation methods. In the following two sections we discuss Fisher's (1912, 1922) maximum likelihood estimation (MLE) approach and its efficiency relative to the MM estimation method. The MLE is the forerunner of the currently popular EL approach. We also discuss the minimum χ² method of estimation, which is based on the minimization of the Pearson χ² statistic. Section 4 proceeds with optimal estimation using an estimating function (EF) approach. In Section 5, we discuss the instrumental variable (IV) and GMM estimation procedures along with their recent variants. Both the EF and GMM approaches were devised to handle method-of-moments estimation problems in which the number of moment restrictions is larger than the number of parameters. The last section provides some concluding remarks. While doing the survey, we also try to provide some personal perspectives on the researchers who contributed to the amazing progress in statistical and econometric estimation techniques that we have witnessed in the last 100 years. We do this since in many instances the original motivation and philosophy of various statistical techniques have become clouded over time, and to the best of our knowledge these materials have not found a place in econometric textbooks.

2 Fisher's (1912) maximum likelihood, and the minimum chi-squared methods of estimation

In 1912, when R. A. Fisher published his first mathematical paper, he was a third and final year undergraduate in mathematics and mathematical physics at Gonville and Caius College, Cambridge. It is now hard to envision exactly what prompted Fisher to write this paper. Possibly his tutor, the astronomer F. J. M. Stratton (1881-1960), who lectured on the theory of errors, was the instrumental factor. About Stratton's role, Edwards (1997a, p.36) wrote: "In the Easter Term 1911 he had lectured at the observatory on Calculation of Orbits from Observations, and during the next academic year on Combination of Observations in the Michaelmas Term (1911), the first term of Fisher's third and final undergraduate year. It is very likely that Fisher attended Stratton's lectures and subsequently discussed statistical questions with him during mathematics supervision in College, and he wrote the 1912 paper as a result."^8

The paper started with a criticism of two known methods of curve fitting, least squares and Pearson's MM. In particular, regarding MM, Fisher (1912, p.156) stated that "a choice has been made without theoretical justification in selecting r equations ..." Fisher was referring to the equations in (9), though Pearson (1902) defended his choice on the ground that these lower-order moments have the smallest relative variance [see Hald (1998, p.708)].

After disposing of these two methods, Fisher stated "we may solve the real problem directly" and set out to discuss his absolute criterion for fitting frequency curves. He took the probability density function (p.d.f.) f(y; θ) (using our notation) as an ordinate of the theoretical curve of unit area and, hence, interpreted f(y; θ)δy as the chance of an observation falling within the

^8 Fisher (1912) ends with "In conclusion I should like to acknowledge the great kindness of Mr. J.F.M. Stratton, to whose criticism and encouragement the present form of this note is due." It may not be out of place to add that in 1912 Stratton also prodded his young pupil to write directly to Student (William S. Gosset, 1876-1937), and Fisher sent Gosset a rigorous proof of the t-distribution. Gosset was sufficiently impressed to send the proof to Karl Pearson with a covering letter urging him to publish it in Biometrika as a note. Pearson, however, was not impressed and nothing more was heard of Fisher's proof [see Box (1978, pp.71-73) and Lehmann (1999, pp.419-420)]. This correspondence between Fisher and Gosset was the beginning of a lifelong mutual respect and friendship until the death of Gosset.