Practical Nonparametric Statistics, course teaching resources (reading material): A Review on Empirical Likelihood Methods for Regression, pp. 6-10

requires the estimation of the variance $\Sigma(\beta_0)$, whereas the EL method does not require any explicit variance estimation. This is because the studentization is carried out internally via the optimization procedure.

In addition to the first order analogue between the parametric and the empirical likelihood, there is a second order analogue between them in the form of the Bartlett correction. The Bartlett correction is an elegant second order property of parametric likelihood ratios, which was conjectured and proposed in Bartlett (1937). It was formally established and studied in a series of papers including Lawley (1956), Hayakawa (1977), Barndorff-Nielsen and Cox (1984) and Barndorff-Nielsen and Hall (1988).

Let $w_i = \Sigma(\beta_0)^{-1/2} Z_{ni} = (w_i^1, \ldots, w_i^p)^T$ and, for $j_l \in \{1, \ldots, p\}$, $l = 1, \ldots, k$, define $\alpha^{j_1 \cdots j_k} = E(w_i^{j_1} \cdots w_i^{j_k})$, the $k$-th order multivariate cross moments of $w_i$. Assuming the existence of higher order moments of $Z_{ni}$, it may be shown via Edgeworth expansions that the distribution of the empirical likelihood ratio admits the following expansion:

$$P\{r_n(\beta_0) \le \chi^2_{p,1-\alpha}\} = 1 - \alpha - a\,\chi^2_{p,1-\alpha}\, g_p(\chi^2_{p,1-\alpha})\, n^{-1} + O(n^{-3/2}), \tag{16}$$

where $g_p$ is the density of the $\chi^2_p$ distribution, and

$$a = p^{-1}\Big( \tfrac{1}{2} \sum_{j,m=1}^{p} \alpha^{jjmm} - \tfrac{1}{3} \sum_{j,k,m=1}^{p} \alpha^{jkm}\alpha^{jkm} \Big). \tag{17}$$

This means that for the parametric regression both parametric and empirical likelihood ratio confidence regions $I_{1-\alpha}$ have coverage error of order $n^{-1}$. Part of the coverage error is due to the fact that the mean of $r_n(\beta_0)$ does not agree with $p$, the mean of $\chi^2_p$; that is, $E\{r_n(\beta_0)\} \ne p$, but rather

$$E\{r_n(\beta_0)\} = p(1 + a n^{-1}) + O(n^{-2}),$$

where $a$ has been given above.

The idea of the Bartlett correction is to adjust the EL ratio $r_n(\beta_0)$ to $r_n^*(\beta_0) = r_n(\beta_0)/(1 + a n^{-1})$ so that $E\{r_n^*(\beta_0)\} = p + O(n^{-2})$.
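As a rough illustration of (17) and the adjustment $r_n^* = r_n/(1 + a n^{-1})$, the Python sketch below handles the scalar case $p = 1$, where (17) reduces to $a = \frac{1}{2}\alpha^{1111} - \frac{1}{3}(\alpha^{111})^2$. Replacing the population moments by sample averages is an assumption made here for illustration, not a procedure given in the text, and the function names `bartlett_factor_scalar` and `bartlett_adjust` are hypothetical.

```python
import math


def bartlett_factor_scalar(z):
    """Plug-in version of a in (17) for p = 1.

    z holds scalar values of Z_ni, assumed to have mean zero at beta_0.
    Sample moments stand in for the population moments (an assumption).
    """
    n = len(z)
    second_moment = sum(v * v for v in z) / n
    scale = math.sqrt(second_moment)
    w = [v / scale for v in z]               # w_i = Sigma^{-1/2} Z_ni
    alpha3 = sum(v ** 3 for v in w) / n      # alpha^{111}
    alpha4 = sum(v ** 4 for v in w) / n      # alpha^{1111}
    return 0.5 * alpha4 - alpha3 ** 2 / 3.0


def bartlett_adjust(r, a, n):
    """Bartlett-corrected ratio r* = r / (1 + a / n)."""
    return r / (1.0 + a / n)


# Toy input: a symmetric two-point sample, chosen only so the moments are easy to check.
z_demo = [-1.0, 1.0, -1.0, 1.0]
a_hat = bartlett_factor_scalar(z_demo)
r_star = bartlett_adjust(3.0, a_hat, n=100)
```

The corrected ratio `r_star` would then be compared with the same $\chi^2_{p,1-\alpha}$ quantile as the uncorrected ratio.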
Remarkably, this simple adjustment to the mean improves (16) by one order of magnitude (DiCiccio, Hall and Romano, 1991; Chen, 1993; Chen and Cui, 2007), so that

$$P\{r_n^*(\beta_0) \le \chi^2_{p,1-\alpha}\} = 1 - \alpha + O(n^{-2}). \tag{18}$$

3 Nonparametric regression

Consider in this section the nonparametric regression model

$$Y_i = m(X_i) + \varepsilon_i, \tag{19}$$

where the regression function $m(x) = E(Y_i \mid X_i = x)$ is nonparametric, and $X_i$ is $d$-dimensional. We assume the regression can be heteroscedastic in that $\sigma^2(x) = \mathrm{Var}(Y_i \mid X_i = x)$, the conditional variance of $Y_i$ given $X_i = x$, may depend on $x$.

The kernel smoothing method is a popular method for estimating $m(x)$ nonparametrically. See Härdle (1990) and Fan and Gijbels (1996) for comprehensive overviews. Other nonparametric methods for estimating $m(x)$ include splines, orthogonal series
and wavelet methods. The simplest kernel regression estimator for $m(x)$ is the following Nadaraya-Watson estimator:

$$\hat m(x) = \frac{\sum_{i=1}^{n} K_h(x - X_i)\, Y_i}{\sum_{i=1}^{n} K_h(x - X_i)}, \tag{20}$$

where $K_h(t) = K(t/h)/h^d$, $K$ is a $d$-dimensional kernel function and $h$ is a bandwidth. The above kernel estimator can be obtained by minimizing the following locally weighted sum of least squares:

$$\sum_{i=1}^{n} K_h(x - X_i)\{Y_i - m(x)\}^2$$

with respect to $m(x)$. It is effectively the solution of the following estimating equation:

$$\sum_{i=1}^{n} K_h(x - X_i)\{Y_i - m(x)\} = 0. \tag{21}$$

Under the nonparametric regression model, the unknown ‘parameter’ is the regression function $m(x)$ itself. The empirical likelihood for $m(x)$ at a fixed $x$ can be formulated in a fashion similar to the parametric regression setting considered in the previous section. Alternatively, since the empirical likelihood is being applied to the weighted average $\sum_{i=1}^{n} K_h(x - X_i)\, m(x)$, it is also similar to the EL of a mean.

Let $p_1, \ldots, p_n$ be probability weights adding to one. The empirical likelihood evaluated at $\theta(x)$, a candidate value of $m(x)$, is

$$L_n\{\theta(x)\} = \max \prod_{i=1}^{n} p_i, \tag{22}$$

where the maximization is subject to $\sum_{i=1}^{n} p_i = 1$ and

$$\sum_{i=1}^{n} p_i K_h(x - X_i)\{Y_i - \theta(x)\} = 0. \tag{23}$$

By comparing this formulation of the EL with that for the parametric regression, we see that the two formulations are largely similar except that (23) is used as the structural constraint instead of (5). This comparison highlights the role played by the structural constraint in the EL formulation. Indeed, different structural constraints give rise to EL for different ‘parameters’ (quantities of interest), just as different densities give rise to different parametric likelihoods. In general, the empirical likelihood is formulated based on the parameters of interest via the structural constraints, whereas the parametric likelihood is fully based on a parametric model.
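The estimator (20) and the estimating equation (21) can be sketched in a few lines of Python for the scalar case $d = 1$. The Gaussian kernel, the toy data and the bandwidth $h = 0.5$ are assumptions chosen for illustration only, and the function names are hypothetical.

```python
import math


def gaussian_kernel(t):
    """Standard normal density, used here as the kernel K (an assumed choice)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def kernel_weights(x, xs, h):
    """K_h(x - X_i) = K((x - X_i) / h) / h for d = 1."""
    return [gaussian_kernel((x - xi) / h) / h for xi in xs]


def nadaraya_watson(x, xs, ys, h):
    """Nadaraya-Watson estimate of m(x), equation (20)."""
    k = kernel_weights(x, xs, h)
    return sum(ki * yi for ki, yi in zip(k, ys)) / sum(k)


def estimating_equation(theta, x, xs, ys, h):
    """Left-hand side of (21), evaluated at a candidate value theta."""
    k = kernel_weights(x, xs, h)
    return sum(ki * (yi - theta) for ki, yi in zip(k, ys))


# Toy data, for illustration only.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.1, 0.4, 1.1, 2.2, 3.9]
m_hat = nadaraya_watson(1.0, xs, ys, h=0.5)
residual = estimating_equation(m_hat, 1.0, xs, ys, h=0.5)  # zero up to rounding
```

Replacing the equal weights implicit in (21) by general probability weights $p_i$ gives the structural constraint (23).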
The algorithm for solving the above optimization problem (22)-(23) is similar to the EL algorithm for the parametric regression given under (4) and (5), except that it may be viewed as easier because the ‘parameter’ is one-dimensional, if we ignore the issue of bandwidth selection for nonparametric regression. By introducing Lagrange multipliers as we did in (6) in the previous section, the optimal EL weights for the above optimization problem at $\theta(x)$ are given by

$$p_i = \frac{1}{n}\, \frac{1}{1 + \lambda(x) K_h(x - X_i)\{Y_i - \theta(x)\}}$$

where $\lambda(x)$ is a univariate Lagrange multiplier that satisfies

$$\sum_{i=1}^{n} \frac{K_h(x - X_i)\{Y_i - \theta(x)\}}{1 + \lambda(x) K_h(x - X_i)\{Y_i - \theta(x)\}} = 0. \tag{24}$$

Substituting the optimal weights into the empirical likelihood in (22), the empirical likelihood evaluated at $\theta(x)$ is

$$L_n\{\theta(x)\} = \prod_{i=1}^{n} \frac{1}{n}\, \frac{1}{1 + \lambda(x) K_h(x - X_i)\{Y_i - \theta(x)\}}$$

and the log empirical likelihood is

$$\ell_n\{\theta(x)\} =: \log[L_n\{\theta(x)\}] = -\sum_{i=1}^{n} \log[1 + \lambda(x) K_h(x - X_i)\{Y_i - \theta(x)\}] - n \log(n). \tag{25}$$

The overall EL is maximized at $p_i = n^{-1}$, which corresponds to $\theta(x)$ being the Nadaraya-Watson estimator $\hat m(x)$ in (20). Hence, we can define the log EL ratio at $\theta(x)$ as

$$r_n\{\theta(x)\} = -2 \log[L_n\{\theta(x)\}/n^{-n}] = 2 \sum_{i=1}^{n} \log[1 + \lambda(x) K_h(x - X_i)\{Y_i - \theta(x)\}]. \tag{26}$$

The above EL is not actually for $m(x)$, the true underlying function value at $x$, but rather for $E\{\hat m(x)\}$. This can be seen from the form of the structural constraint (23). It is well known in kernel estimation that $\hat m(x)$ is not an unbiased estimator of $m(x)$, as is the case for almost all nonparametric estimators. For the Nadaraya-Watson estimator,

$$E\{\hat m(x)\} = m(x) + b(x) + o(h^2),$$

where $b(x) = \frac{1}{2} h^2 \{m''(x) + 2 m'(x) f'(x)/f(x)\}$ is the leading bias of the kernel estimator, and $f$ is the density of $X_i$. Thus the EL is actually evaluated at $\theta(x)$ as a candidate value of $m(x) + b(x)$ instead of $m(x)$.

There are two strategies to reduce the effect of the bias (Hall, 1991). One is to undersmooth with a bandwidth $h = o(n^{-1/(4+d)})$, where $n^{-1/(4+d)}$ is the optimal order of bandwidth that minimizes the mean squared error of estimation with a second order kernel ($d$ is the dimension of $X$). Another is to explicitly estimate the bias and then subtract it from the kernel estimate. We consider the first approach of undersmoothing here for reasons of simplicity. When undersmoothing so that $n^{2/(4+d)} h^2 \to 0$, Wilks' theorem is valid for the EL under the current nonparametric regression in that

$$r_n\{m(x)\} \xrightarrow{d} \chi^2_1 \quad \text{as } n \to \infty.$$
This means that an empirical likelihood confidence interval with nominal coverage equal to $1 - \alpha$, denoted as $I_{1-\alpha,el}$, is given by

$$I_{1-\alpha,el} = \{\theta(x) : r_n\{\theta(x)\} \le \chi^2_{1,1-\alpha}\}. \tag{27}$$
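The construction in (24)-(27) can be sketched as follows for $d = 1$. The Lagrange multiplier is found by bisection, using the fact that the left-hand side of (24) is strictly decreasing in $\lambda(x)$ on the range where all weights stay positive; the interval endpoints are then found by a second bisection on $\theta(x)$. The Gaussian kernel, the simulated data, the bandwidth $h = 0.1$ and all function names are assumptions for illustration, and bisection is one convenient solver rather than an algorithm prescribed by the text.

```python
import math
import random
from statistics import NormalDist


def gaussian_kernel(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x, xs, ys, h):
    k = [gaussian_kernel((x - xi) / h) / h for xi in xs]
    return sum(ki * yi for ki, yi in zip(k, ys)) / sum(k)


def el_log_ratio(theta, x, xs, ys, h):
    """r_n{theta(x)} in (26); lambda(x) solves (24) by bisection."""
    g = [gaussian_kernel((x - xi) / h) / h * (yi - theta)
         for xi, yi in zip(xs, ys)]
    if min(g) >= 0.0 or max(g) <= 0.0:
        return math.inf  # constraint (23) cannot hold with positive weights
    # Positive weights require 1 + lambda * g_i > 0 for every i.
    lo, hi = -1.0 / max(g), -1.0 / min(g)
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps
    for _ in range(100):
        lam = 0.5 * (lo + hi)
        score = sum(gi / (1.0 + lam * gi) for gi in g)
        if score > 0.0:   # score is decreasing in lambda, so the root lies to the right
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * gi) for gi in g)


def el_interval(x, xs, ys, h, alpha=0.05):
    """Endpoints of the interval (27), located by bisection on theta."""
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2  # chi^2_{1, 1-alpha}
    centre = nadaraya_watson(x, xs, ys, h)               # r_n is zero here

    def endpoint(outer):
        inner = centre
        for _ in range(60):
            mid = 0.5 * (inner + outer)
            if el_log_ratio(mid, x, xs, ys, h) <= crit:
                inner = mid
            else:
                outer = mid
        return inner

    # r_n is infinite at min(ys) and max(ys), so they bracket the endpoints.
    return endpoint(min(ys)), endpoint(max(ys))


# Simulated data, for illustration only.
rng = random.Random(0)
xs = [rng.random() for _ in range(80)]
ys = [math.sin(2.0 * math.pi * xi) + 0.3 * rng.gauss(0.0, 1.0) for xi in xs]
centre = nadaraya_watson(0.5, xs, ys, h=0.1)
lower, upper = el_interval(0.5, xs, ys, h=0.1)
```

No variance estimate appears anywhere in the sketch. Because no bias correction is applied, the resulting interval targets $m(x)$ itself only when $h$ is undersmoothed as described above.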

A special feature of the empirical likelihood confidence interval is that no explicit variance estimator is required in its construction, as the studentization is carried out internally via the optimization procedure.

Define $\omega_i = K_h(x - X_i)\{Y_i - m(x)\}$ and, for positive integers $j$,

$$\bar\omega_j = n^{-1} \sum_{i=1}^{n} \omega_i^j, \quad \mu_j = E(\bar\omega_j) \quad \text{and} \quad R_j(K) = \int K^j(u)\,du.$$

We note here that the bias in the kernel smoothing makes $\mu_1 = O(h^2)$, while in the parametric regression case $\mu_1 = 0$. It is shown in Chen and Qin (2003) that the coverage probability of $I_{1-\alpha,el}$ admits the following Edgeworth expansion:

$$\begin{aligned}
P\{m(x) \in I_{1-\alpha,el}\} = {} & 1 - \alpha - \big\{ n h^d \mu_1^2 \mu_2^{-1} + \big(\tfrac{1}{2}\mu_2^{-2}\mu_4 - \tfrac{1}{3}\mu_2^{-3}\mu_3^2\big)(n h^d)^{-1} \big\}\, z_{1-\alpha/2}\, \phi(z_{1-\alpha/2}) \\
& + O\{n h^{d+6} + h^4 + (n h^d)^{-1} h^2 + (n h^d)^{-2}\},
\end{aligned} \tag{28}$$

where $\phi$ and $z_{1-\alpha/2}$ are the density and the $(1 - \alpha/2)$-quantile of a standard normal random variable. The above expansion is non-standard in that the leading coverage error consists of two terms. The first term, $n h^d \mu_1^2 \mu_2^{-1}$, of order $n h^{d+4}$, is due to the bias in the kernel smoothing. The second term, of order $(n h^d)^{-1}$, is largely similar to the leading coverage error for parametric regression in (16). We note that in the second term, the effective sample size in the nonparametric estimation near $x$ is $n h^d$ instead of $n$, the effective sample size in the parametric regression.

The next question is whether the Bartlett correction is still valid under the nonparametric regression. The answer is yes. It may be shown that

$$E[r_n\{m(x)\}] = 1 + (n h^d)^{-1}\gamma + o\{n h^{d+4} + (n h^d)^{-1}\},$$

where

$$\gamma = \mu_2^{-1}(n h^d \mu_1)^2 + \tfrac{1}{2}\mu_2^{-2}\mu_4 - \tfrac{1}{3}\mu_2^{-3}\mu_3^2. \tag{29}$$

Note that $\gamma$ appears in the leading coverage error term in (28).
Based on (28) and choosing $h = O(n^{-1/(d+2)})$, we have, with $c_\alpha = \chi^2_{1,1-\alpha}$,

$$\begin{aligned}
P\big[r_n\{m(x)\} \le c_\alpha\{1 + \gamma (n h^d)^{-1}\}\big]
&= P\big[\chi^2_1 \le c_\alpha\{1 + \gamma (n h^d)^{-1}\}\big] \\
&\quad - (n h^d)^{-1} \gamma\, c_\alpha^{1/2}\{1 + \gamma (n h^d)^{-1}\}^{1/2}\, \phi\big[c_\alpha^{1/2}\{1 + \gamma (n h^d)^{-1}\}^{1/2}\big] + O\{(n h^d)^{-2}\} \\
&= P(\chi^2_1 \le c_\alpha) + (n h^d)^{-1} \gamma\, z_{1-\alpha/2}\, \phi(z_{1-\alpha/2}) - (n h^d)^{-1} \gamma\, z_{1-\alpha/2}\, \phi(z_{1-\alpha/2}) + O\{(n h^d)^{-2}\} \\
&= 1 - \alpha + O(n^{-4/(d+2)}).
\end{aligned} \tag{30}$$

Therefore, the empirical likelihood is Bartlett correctable in the current context of nonparametric regression. In practice, the Bartlett factor $\gamma$ has to be estimated, say by a consistent estimator $\hat\gamma$. Chen and Qin (2003) gave more details on practical implementation; see also Chen (1996) for an implementation in the case of density estimation