Chapter 7
Statistical Functionals and the Delta Method

1. Estimators as Functionals of $\mathbb{F}_n$ or $\mathbb{P}_n$
2. Continuity of Functionals of $F$ or $P$
3. Metrics for Distribution Functions $F$ and Probability Distributions $P$
4. Differentiability of Functionals of $F$ or $P$: Gateaux, Hadamard, and Fréchet Derivatives
5. Higher Order Derivatives
1 Estimates as Functionals of $\mathbb{F}_n$ or $\mathbb{P}_n$

Often the quantity we want to estimate can be viewed as a functional $T(F)$ or $T(P)$ of the underlying distribution function $F$ or probability measure $P$ generating the data. Then a simple nonparametric estimator is simply $T(\mathbb{F}_n)$ or $T(\mathbb{P}_n)$, where $\mathbb{F}_n$ and $\mathbb{P}_n$ denote the empirical distribution function and empirical measure of the data.

Notation. Suppose that $X_1, \ldots, X_n$ are i.i.d. $P$ on $(\mathcal{X}, \mathcal{A})$. We let
$$\mathbb{P}_n \equiv \frac{1}{n} \sum_{i=1}^n \delta_{X_i} \equiv \text{the empirical measure of the sample},$$
where $\delta_x \equiv$ the measure with mass one at $x$ (so $\delta_x(A) = 1_A(x)$ for $A \in \mathcal{A}$). When $\mathcal{X} = \mathbb{R}^k$, especially when $k = 1$, we will write
$$\mathbb{F}_n(x) = \frac{1}{n} \sum_{i=1}^n 1_{(-\infty, x]}(X_i) = \mathbb{P}_n(-\infty, x], \qquad F(x) = P(-\infty, x].$$

Here is a list of examples.

Example 1.1 The mean: $T(F) = \int x \, dF(x)$; $T(\mathbb{F}_n) = \int x \, d\mathbb{F}_n(x)$.

Example 1.2 The $r$-th moment: for $r$ an integer, $T(F) = \int x^r \, dF(x)$, and $T(\mathbb{F}_n) = \int x^r \, d\mathbb{F}_n(x)$.

Example 1.3 The variance:
$$T(F) = \mathrm{Var}_F(X) = \int \Big( x - \int x \, dF(x) \Big)^2 dF(x) = \frac{1}{2} \iint (x - y)^2 \, dF(x) \, dF(y),$$
$$T(\mathbb{F}_n) = \mathrm{Var}_{\mathbb{F}_n}(X) = \int \Big( x - \int x \, d\mathbb{F}_n(x) \Big)^2 d\mathbb{F}_n(x) = \frac{1}{2} \iint (x - y)^2 \, d\mathbb{F}_n(x) \, d\mathbb{F}_n(y).$$

Example 1.4 The median: $T(F) = F^{-1}(1/2)$; $T(\mathbb{F}_n) = \mathbb{F}_n^{-1}(1/2)$.

Example 1.5 The $\alpha$-trimmed mean: $T(F) = (1 - 2\alpha)^{-1} \int_{\alpha}^{1-\alpha} F^{-1}(u) \, du$ for $0 < \alpha < 1/2$; $T(\mathbb{F}_n) = (1 - 2\alpha)^{-1} \int_{\alpha}^{1-\alpha} \mathbb{F}_n^{-1}(u) \, du$.
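Examples 1.1 through 1.5 are all plug-in estimators that can be computed directly from the sorted sample. The following sketch (in Python with NumPy; the function names are mine, not standard) makes the constructions concrete. Note the $1/n$ factor in the plug-in variance, the left-continuous inverse $\mathbb{F}_n^{-1}(1/2)$ for the median, and the exact evaluation of the trimmed-mean integral using the fact that $\mathbb{F}_n^{-1}(u) = X_{(i)}$ for $u \in ((i-1)/n, i/n]$.

```python
import numpy as np

def plug_in_mean(x):
    # T(F_n) = integral of x dF_n(x) = sample mean
    return np.mean(x)

def plug_in_variance(x):
    # T(F_n) = integral of (x - mean)^2 dF_n(x); note the 1/n (not 1/(n-1)) factor
    return np.mean((x - np.mean(x)) ** 2)

def plug_in_median(x):
    # T(F_n) = F_n^{-1}(1/2) with the left-continuous inverse
    # F^{-1}(u) = inf{x : F(x) >= u}, i.e. the ceil(n/2)-th order statistic
    xs = np.sort(x)
    n = len(xs)
    return xs[int(np.ceil(n / 2)) - 1]

def trimmed_mean(x, alpha):
    # T(F_n) = (1 - 2a)^{-1} * integral over (a, 1-a) of F_n^{-1}(u) du,
    # computed exactly: F_n^{-1}(u) = X_(i) on ((i-1)/n, i/n]
    xs = np.sort(x)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        lo = max((i - 1) / n, alpha)
        hi = min(i / n, 1 - alpha)
        if hi > lo:
            total += xs[i - 1] * (hi - lo)
    return total / (1 - 2 * alpha)
```

For example, `trimmed_mean(np.array([1.0, 2.0, 3.0, 4.0]), 0.25)` integrates $\mathbb{F}_n^{-1}$ over $(1/4, 3/4)$, which weights only the two middle order statistics.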
Example 1.6 The Hodges-Lehmann functional: $T(F) = (1/2)\{F \star F\}^{-1}(1/2)$, where $\star$ denotes convolution. Then $T(\mathbb{F}_n) = (1/2)\{\mathbb{F}_n \star \mathbb{F}_n\}^{-1}(1/2) = \mathrm{median}\{(X_i + X_j)/2\}$.

Example 1.7 The Mann-Whitney functional. For $X, Y$ independent with distribution functions $F$ and $G$ respectively, $T(F, G) = \int F \, dG = P_{F,G}(X \le Y)$. Then $T(\mathbb{F}_m, \mathbb{G}_n) = \int \mathbb{F}_m \, d\mathbb{G}_n$ (based on two independent samples $X_1, \ldots, X_m$ i.i.d. $F$ with empirical df $\mathbb{F}_m$ and $Y_1, \ldots, Y_n$ i.i.d. $G$ with empirical df $\mathbb{G}_n$).

Example 1.8 Multivariate mean: for $P$ on $(\mathbb{R}^k, \mathcal{B}^k)$: $T(P) = \int x \, dP(x)$ (with values in $\mathbb{R}^k$), $T(\mathbb{P}_n) = \int x \, d\mathbb{P}_n(x) = n^{-1} \sum_{i=1}^n X_i$.

Example 1.9 Multivariate cross second moments: for $P$ on $(\mathbb{R}^k, \mathcal{B}^k)$:
$$T(P) = \int x x^T \, dP(x) = \int x^{\otimes 2} \, dP(x);$$
$$T(\mathbb{P}_n) = \int x x^T \, d\mathbb{P}_n(x) = \int x^{\otimes 2} \, d\mathbb{P}_n(x) = n^{-1} \sum_{i=1}^n X_i X_i^T.$$
Note that $T(P)$ and $T(\mathbb{P}_n)$ take values in $\mathbb{R}^{k \times k}$.

Example 1.10 Multivariate covariance matrix: for $P$ on $(\mathbb{R}^k, \mathcal{B}^k)$:
$$T(P) = \int \Big( x - \int y \, dP(y) \Big) \Big( x - \int y \, dP(y) \Big)^T dP(x) = \frac{1}{2} \iint (x - y)(x - y)^T \, dP(x) \, dP(y),$$
$$T(\mathbb{P}_n) = \int \Big( x - \int y \, d\mathbb{P}_n(y) \Big) \Big( x - \int y \, d\mathbb{P}_n(y) \Big)^T d\mathbb{P}_n(x) = \frac{1}{2} \iint (x - y)(x - y)^T \, d\mathbb{P}_n(x) \, d\mathbb{P}_n(y) = n^{-1} \sum_{i=1}^n (X_i - \overline{X}_n)(X_i - \overline{X}_n)^T.$$

Example 1.11 $k$-means clustering functional: $T(P) = (T_1(P), \ldots, T_k(P))$ where the $T_i(P)$'s minimize
$$\int |x - t_1|^2 \wedge \cdots \wedge |x - t_k|^2 \, dP(x) = \sum_{i=1}^k \int_{C_i} |x - t_i|^2 \, dP(x)$$
where $C_i = \{ x \in \mathbb{R}^m : t_i \text{ minimizes } |x - t|^2 \text{ over } \{t_1, \ldots, t_k\} \}$. Then $T(\mathbb{P}_n) = (T_1(\mathbb{P}_n), \ldots, T_k(\mathbb{P}_n))$ where the $T_i(\mathbb{P}_n)$'s minimize
$$\int |x - t_1|^2 \wedge \cdots \wedge |x - t_k|^2 \, d\mathbb{P}_n(x).$$

Example 1.12 The simplicial depth function: for $P$ on $\mathbb{R}^k$ and $x \in \mathbb{R}^k$, set $T(P) \equiv T(P)(x) = \Pr_P(x \in S(X_1, \ldots, X_{k+1}))$ where $X_1, \ldots, X_{k+1}$ are i.i.d. $P$ and $S(x_1, \ldots, x_{k+1})$ is the simplex in $\mathbb{R}^k$ determined by $x_1, \ldots, x_{k+1}$; e.g. for $k = 2$, the simplex determined by $x_1, x_2, x_3$ is just a triangle. Then $T(\mathbb{P}_n) = \Pr_{\mathbb{P}_n}(x \in S(X_1, \ldots, X_{k+1}))$. Note that in this example $T(P)$ is a function from $\mathbb{R}^k$ to $\mathbb{R}$.
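Examples 1.6 and 1.7 also reduce to finite computations over pairs, since $\mathbb{F}_n \star \mathbb{F}_n$ puts mass $1/n^2$ on each pairwise sum $X_i + X_j$ (including $i = j$), and $\int \mathbb{F}_m \, d\mathbb{G}_n$ counts pairs with $X_i \le Y_j$. A rough numerical sketch (function names are mine; note that `np.median` averages the two middle values when $n^2$ is even, which can differ slightly from the left-continuous inverse $\{\mathbb{F}_n \star \mathbb{F}_n\}^{-1}(1/2)$):

```python
import numpy as np

def hodges_lehmann(x):
    # T(F_n) = (1/2){F_n * F_n}^{-1}(1/2): median of all n^2 pairwise
    # averages (X_i + X_j)/2, including the diagonal pairs i = j
    x = np.asarray(x, dtype=float)
    avgs = (x[:, None] + x[None, :]) / 2.0
    return np.median(avgs)

def mann_whitney(x, y):
    # T(F_m, G_n) = integral of F_m dG_n
    #             = (1/(m*n)) * #{(i, j) : X_i <= Y_j}
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean(x[:, None] <= y[None, :])
```

Both are quadratic-cost computations; for large samples the Hodges-Lehmann estimate is usually obtained by selection algorithms rather than by materializing all $n^2$ averages.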
Example 1.13 (Z-functional derived from likelihood). A maximum likelihood estimator: for $P$ on $(\mathcal{X}, \mathcal{A})$, suppose that $\mathcal{P} = \{ P_\theta : \theta \in \Theta \subset \mathbb{R}^k \}$ is a regular parametric model with vector scores function $\dot{l}_\theta(\cdot \,; \theta)$. Then for general $P$, not necessarily in the model $\mathcal{P}$, consider $T$ defined by
$$\int \dot{l}_\theta(x; T(P)) \, dP(x) = 0. \tag{1}$$
Then
$$\int \dot{l}_\theta(x; T(\mathbb{P}_n)) \, d\mathbb{P}_n(x) = 0$$
defines $T(\mathbb{P}_n)$. For estimation of location in one dimension with $\dot{l}(x; \theta) = \psi(x - \theta)$ and $\psi \equiv -f'/f$, these become
$$\int \psi(x - T(F)) \, dF(x) = 0 \quad \text{and} \quad \int \psi(x - T(\mathbb{F}_n)) \, d\mathbb{F}_n(x) = 0.$$

We expect that often the value $T(P) \in \Theta$ satisfying (1) also satisfies $T(P) = \mathrm{argmin}_{\theta \in \Theta} K(P, P_\theta)$, where $K(P, P_\theta) = P \log(p / p_\theta)$ is the Kullback-Leibler divergence. Here is a heuristic argument showing why this should be true: note that for many cases we have
$$\hat{\theta}_n = \mathrm{argmax}_\theta \, n^{-1} l_n(\theta) = \mathrm{argmax}_\theta \, \mathbb{P}_n(\log p_\theta) \to_p \mathrm{argmax}_\theta \, P(\log p_\theta) = \mathrm{argmax}_\theta \int \log p_\theta(x) \, dP(x).$$
Now
$$P(\log p_\theta) = P(\log p) + P \log \Big( \frac{p_\theta}{p} \Big) = P(\log p) - P \log \Big( \frac{p}{p_\theta} \Big) = P(\log p) - K(P, P_\theta).$$
Thus
$$\mathrm{argmax}_\theta \int \log p_\theta(x) \, dP(x) = \mathrm{argmin}_\theta \, K(P, P_\theta) \equiv \theta(P).$$
If we can interchange differentiation and integration it follows that
$$\nabla_\theta K(P, P_\theta) = -\int p(x) \, \dot{l}_\theta(x; \theta) \, d\mu(x) = -\int \dot{l}_\theta(x; \theta) \, dP(x),$$
so the relation (1) is obtained by setting this gradient vector equal to $0$.

Example 1.14 A bootstrap functional: let $T(F)$ be a functional with estimator $T(\mathbb{F}_n)$, and consider estimating the distribution function of $\sqrt{n}(T(\mathbb{F}_n) - T(F))$,
$$H_n(F; \cdot) = P_F\big( \sqrt{n}(T(\mathbb{F}_n) - T(F)) \le \cdot \big).$$
A natural estimator is $H_n(\mathbb{F}_n, \cdot)$.
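The estimator $H_n(\mathbb{F}_n, \cdot)$ of Example 1.14 is exactly what the nonparametric bootstrap approximates by Monte Carlo: draw resamples from $\mathbb{F}_n$, i.e. sample the $X_i$ with replacement, and tabulate the roots $\sqrt{n}(T(\mathbb{F}_n^*) - T(\mathbb{F}_n))$. A minimal sketch (the function name and interface are mine, not standard):

```python
import numpy as np

def bootstrap_roots(x, T, B=1000, seed=None):
    """Monte Carlo approximation of H_n(F_n, .): the empirical df of the
    returned roots estimates the df of sqrt(n)*(T(F_n) - T(F))."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    t_hat = T(x)  # T(F_n), the plug-in estimate from the observed sample
    roots = np.empty(B)
    for b in range(B):
        # sampling with replacement from the data = sampling from F_n
        x_star = rng.choice(x, size=n, replace=True)
        roots[b] = np.sqrt(n) * (T(x_star) - t_hat)
    return roots
```

Quantiles of the returned roots (e.g. `np.quantile(roots, [0.025, 0.975])`) then yield bootstrap confidence intervals for $T(F)$, with `T` any plug-in functional such as `np.mean` or `np.median`.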