Statistical Science
2004, Vol. 19, No. 4, 588–597
DOI 10.1214/088342304000000297
© Institute of Mathematical Statistics, 2004

Density Estimation

Simon J. Sheather

Abstract. This paper provides a practical description of density estimation based on kernel methods. An important aim is to encourage practicing statisticians to apply these methods to data. As such, reference is made to implementations of these methods in R, S-PLUS and SAS.

Key words and phrases: Kernel density estimation, bandwidth selection, local likelihood density estimates, data sharpening.

Simon J. Sheather is Professor of Statistics, Australian Graduate School of Management, University of New South Wales and the University of Sydney, Sydney, NSW 2052, Australia (e-mail: simonsh@agsm.edu.au).

1. INTRODUCTION

Density estimation has experienced a wide explosion of interest over the last 20 years. Silverman's (1986) book on this topic has been cited over 2000 times. Recent texts on smoothing which include detailed density estimation include Bowman and Azzalini (1997), Simonoff (1996) and Wand and Jones (1995). Density estimation has been applied in many fields, including archaeology (e.g., Baxter, Beardah and Westwood, 2000), banking (e.g., Tortosa-Ausina, 2002), climatology (e.g., Ferreyra et al., 2001), economics (e.g., DiNardo, Fortin and Lemieux, 1996), genetics (e.g., Segal and Wiemels, 2002), hydrology (e.g., Kim and Heo, 2002) and physiology (e.g., Paulsen and Heggelund, 1996).

This paper provides a practical description of density estimation based on kernel methods. An important aim is to encourage practicing statisticians to apply these methods to data. As such, reference is made to implementations of these methods in R, S-PLUS and SAS.

Section 2 provides a description of the basic properties of kernel density estimators. It is well known that the performance of kernel density estimators depends crucially on the value of the smoothing parameter, commonly referred to as the bandwidth. We describe methods for selecting the value of the bandwidth in Section 3. In Section 4, we describe two recent important improvements to kernel methods, namely, local likelihood density estimates and data sharpening. We compare the performance of some of the methods that have been discussed using a new example involving data from the U.S. PGA tour in Section 5. Finally, in Section 6 we provide some overall recommendations.

2. THE BASICS OF KERNEL DENSITY ESTIMATION

Let X_1, X_2, \ldots, X_n denote a sample of size n from a random variable with density f. The kernel density estimate of f at the point x is given by

(1)   \hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),

where the kernel K satisfies \int K(x)\,dx = 1 and the smoothing parameter h is known as the bandwidth. In practice, the kernel K is generally chosen to be a unimodal probability density symmetric about zero. In this case, K satisfies the conditions

\int K(y)\,dy = 1, \qquad \int y K(y)\,dy = 0, \qquad \int y^2 K(y)\,dy = \mu_2(K) > 0.

A popular choice for K is the Gaussian kernel, namely,

(2)   K(y) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right).

Throughout this section we consider a small generated data set to illustrate the ideas presented. The data consist of a random sample of size n = 10 from a normal mixture distribution made up of observations from N(\mu = -1, \sigma^2 = (1/3)^2) and N(\mu = 1, \sigma^2 = (1/3)^2), each with probability 0.5.
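To make (1) and (2) concrete, here is a minimal R sketch that generates a sample like the one just described and evaluates the kernel estimate directly from the definition; the seed, evaluation grid and bandwidth h = 0.3 are illustrative assumptions, not the values behind Figure 1.

set.seed(1)                                 # hypothetical seed, for reproducibility
n <- 10
comp <- rbinom(n, 1, 0.5)                   # pick a mixture component for each draw
x <- rnorm(n, mean = ifelse(comp == 1, -1, 1), sd = 1/3)

# Equation (1): the average of scaled Gaussian kernels centered at the data
kde <- function(t, x, h) {
  sapply(t, function(t0) mean(dnorm((t0 - x) / h)) / h)
}

grid <- seq(-2, 2, length.out = 401)
fhat <- kde(grid, x, h = 0.3)
# Built-in equivalent: density(x, bw = 0.3, kernel = "gaussian")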
Figure 1 shows a kernel estimate of the density for these data using the Gaussian kernel with bandwidth h = 0.3 (the dashed curve) along with the true underlying density (the solid curve). The 10 data points are marked by vertical lines on the horizontal axis. Centered at each data point is each point's contribution to the overall density estimate, namely, (1/(nh)) K((x - X_i)/h) (i.e., 1/n times a normal density with mean X_i and standard deviation h). The density estimate (the dashed curve) is the sum of these scaled normal densities. Increasing the value of h widens each normal curve, smoothing out the two modes currently apparent in the estimate.

FIG. 1. Kernel density estimate and contributions from each data point (dashed curve) along with the true underlying density (solid curve).

A Java applet that allows the user to watch the effects of changing the bandwidth and the shape of the kernel function on the resulting density estimate can be found at http://www-users.york.ac.uk/~jb35/mygr2.htm. It is well known that the value of the bandwidth is of critical importance, while the shape of the kernel function has little practical impact.

Assuming that the underlying density is sufficiently smooth and that the kernel has finite fourth moment, it can be shown using Taylor series that

Bias\{\hat{f}_h(x)\} = \frac{h^2}{2} \mu_2(K) f''(x) + o(h^2),

Var\{\hat{f}_h(x)\} = \frac{1}{nh} R(K) f(x) + o\left(\frac{1}{nh}\right),

where R(K) = \int K^2(y)\,dy (e.g., Wand and Jones, 1995, pages 20–21). In addition to the visual advantage of being a smooth curve, the kernel estimate has an advantage over the histogram in terms of bias. The bias of a histogram estimator with bin width h is of order h, whereas centering the kernel at each data point and using a symmetric kernel zeroes this term and as such produces a leading bias term for the kernel estimate of order h^2.

Adding the leading variance and squared bias terms produces the asymptotic mean squared error (AMSE)

AMSE\{\hat{f}_h(x)\} = \frac{1}{nh} R(K) f(x) + \frac{h^4}{4} \mu_2(K)^2 [f''(x)]^2.

A widely used choice of an overall measure of the discrepancy between \hat{f} and f is the mean integrated squared error (MISE), which is given by

MISE(\hat{f}_h) = E \int \{\hat{f}_h(y) - f(y)\}^2\,dy = \int Bias(\hat{f}_h(y))^2\,dy + \int Var(\hat{f}_h(y))\,dy.

Under an integrability assumption on f, integrating the expression for AMSE gives the expression for the asymptotic mean integrated squared error (AMISE), that is,

(3)   AMISE\{\hat{f}_h\} = \frac{1}{nh} R(K) + \frac{h^4}{4} \mu_2(K)^2 R(f''),

where

R(f'') = \int [f''(y)]^2\,dy.

The value of the bandwidth that minimizes the AMISE is given by

(4)   h_{AMISE} = \left[\frac{R(K)}{\mu_2(K)^2 R(f'')}\right]^{1/5} n^{-1/5}.

Assuming that f is sufficiently smooth, we can use integration by parts to show that

R(f'') = \int [f''(y)]^2\,dy = \int f^{(4)}(y) f(y)\,dy.

Thus, the functional R(f'') is a measure of the underlying roughness or curvature. In particular, the larger the value of R(f'') is, the larger is the value of AMISE (i.e., the more difficult it is to estimate f) and the smaller is the value of h_{AMISE} (i.e., the smaller the bandwidth needed to capture the curvature in f).

3. BANDWIDTH SELECTION FOR KERNEL DENSITY ESTIMATES

In this section, we briefly review methods for choosing a global value of the bandwidth h. Where applicable, reference is made to implementations of these methods in R, S-PLUS and SAS.
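As a check on (3) and (4), the following R sketch computes h_{AMISE} for the case where f is a N(0, \sigma^2) density and K is the Gaussian kernel, using the standard closed forms R(K) = 1/(2\sqrt{\pi}), \mu_2(K) = 1 and R(f'') = 3/(8\sqrt{\pi}\sigma^5); these closed forms are assumptions of the sketch, and the resulting constant (4/3)^{1/5} \approx 1.06 is exactly the one appearing in the normal reference rules of Section 3.1.

# h_AMISE from (4) when f is N(0, sigma^2) and K is the Gaussian kernel
h_amise_normal <- function(n, sigma = 1) {
  RK  <- 1 / (2 * sqrt(pi))             # R(K) for the Gaussian kernel
  mu2 <- 1                              # second moment of the Gaussian kernel
  Rf2 <- 3 / (8 * sqrt(pi) * sigma^5)   # R(f'') for a N(0, sigma^2) density
  (RK / (mu2^2 * Rf2))^(1/5) * n^(-1/5)
}
h_amise_normal(100)                     # (4/3)^(1/5) * 100^(-1/5), about 0.42
(4/3)^(1/5)                             # about 1.06, the normal reference constant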
In SAS, PROC KDE produces kernel density estimates based on the usual Gaussian kernel (i.e., the Gaussian density with mean 0 and standard deviation 1), whereas S-PLUS has a function density which produces kernel density estimates with a default kernel, the Gaussian density with mean 0 and standard deviation 1/4. Thus, the bandwidths described in what follows must be multiplied by 4 when used in S-PLUS. The program R also has a function density which produces kernel density estimates with a default kernel, the Gaussian density with mean 0 and standard deviation 1.

3.1 Rules of Thumb

The computationally simplest method for choosing a global bandwidth h is based on replacing R(f''), the unknown part of h_{AMISE}, by its value for a parametric family expressed as a multiple of a scale parameter, which is then estimated from the data. The method seems to date back to Deheuvels (1977) and Scott (1979), who each proposed it for histograms. However, the method was popularized for kernel density estimates by Silverman (1986, Section 3.2), who used the normal distribution as the parametric family.

Let \sigma and IQR denote the standard deviation and interquartile range of X, respectively. Take the kernel K to be the usual Gaussian kernel. Assuming that the underlying distribution is normal, Silverman (1986, pages 45 and 47) showed that (4) reduces to

h_{AMISE,NORMAL} = 1.06\,\sigma\,n^{-1/5} \quad and \quad h_{AMISE,NORMAL} = 0.79\,IQR\,n^{-1/5}.

Jones, Marron and Sheather (1996) studied the Monte Carlo performance of the normal reference bandwidth based on the standard deviation, that is, they considered

h_{SNR} = 1.06\,S\,n^{-1/5},

where S is the sample standard deviation. In SAS PROC KDE, this method is called the simple normal reference (METHOD = SNR). Jones, Marron and Sheather (1996) found that h_{SNR} had a mean that was usually unacceptably large and thus often produced oversmoothed density estimates.

Furthermore, Silverman (1986, page 48) recommended reducing the factor 1.06 in the previous equation to 0.9 in an attempt not to miss bimodality and using the smaller of two scale estimates. This rule is commonly used in practice and it is often referred to as Silverman's reference bandwidth or Silverman's rule of thumb. It is given by

h_{SROT} = 0.9\,A\,n^{-1/5},

where A = min{sample standard deviation, (sample interquartile range)/1.34}. In SAS PROC KDE, this method is called Silverman's rule of thumb (METHOD = SROT). In R, Silverman's bandwidth is invoked by bw = "nrd0". In S-PLUS, Silverman's bandwidth with constant 1.06 rather than 0.9 is invoked by width = "nrd".

Terrell and Scott (1985) and Terrell (1990) developed a bandwidth selection method based on the maximal smoothing principle so as to produce oversmoothed density estimates. The method is based on choosing the "largest degree of smoothing compatible with the estimated scale of the density" (Terrell, 1990, page 470). Looking back at (3), this amounts to finding, for a given value of scale, the density f with the smallest value of R(f''). Taking the variance \sigma^2 as the scale parameter, Terrell (1990, page 471) found that the beta(4, 4) family of distributions with variance \sigma^2 minimizes R(f''). For the standard Gaussian kernel this leads to the oversmoothed bandwidth

h_{OS} = 1.144\,S\,n^{-1/5}.

In SAS PROC KDE, this method is called oversmoothed (METHOD = OS). Comparing the oversmoothed bandwidth with the normal reference bandwidth h_{SNR}, we see that the oversmoothed bandwidth is 1.08 times larger.
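The sketch below computes the three rules of thumb side by side on a single simulated Gaussian sample (an illustrative assumption); note that R's bw.nrd0 implements essentially h_{SROT} (R divides the IQR by 1.349 rather than 1.34), and that h_{OS}/h_{SNR} = 1.144/1.06 \approx 1.08.

set.seed(2)                                # hypothetical seed
x <- rnorm(100)
n <- length(x)
A <- min(sd(x), IQR(x) / 1.34)
h_SROT <- 0.9 * A * n^(-1/5)               # Silverman's rule of thumb
h_SNR  <- 1.06 * sd(x) * n^(-1/5)          # simple normal reference
h_OS   <- 1.144 * sd(x) * n^(-1/5)         # Terrell's oversmoothed bandwidth
c(h_SROT, h_SNR, h_OS, bw.nrd0(x))         # bw.nrd0(x) is essentially h_SROT
h_OS / h_SNR                               # 1.144/1.06, about 1.08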
Thus, in practice there is often very little visual difference between density estimates produced using either the oversmoothed bandwidth or the normal reference bandwidth.

3.2 Cross-Validation Methods

A measure of the closeness of \hat{f} and f for a given sample is the integrated squared error (ISE), which is given by

ISE(\hat{f}_h) = \int \{\hat{f}_h(y) - f(y)\}^2\,dy = \int (\hat{f}_h(y))^2\,dy - 2 \int \hat{f}_h(y) f(y)\,dy + \int f^2(y)\,dy.

Notice that the last term on the right-hand side of the previous expression does not involve h.
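As a numerical sanity check on this decomposition, the following sketch evaluates the ISE and its three terms by numerical integration for a Gaussian-kernel estimate of a standard normal sample; the sample, bandwidth and use of integrate() are illustrative assumptions.

set.seed(3)                                # hypothetical seed
x <- rnorm(50); h <- 0.4
fhat <- function(t) sapply(t, function(t0) mean(dnorm((t0 - x) / h)) / h)
ise   <- integrate(function(t) (fhat(t) - dnorm(t))^2, -Inf, Inf)$value
term1 <- integrate(function(t) fhat(t)^2, -Inf, Inf)$value
term2 <- integrate(function(t) fhat(t) * dnorm(t), -Inf, Inf)$value
term3 <- 1 / (2 * sqrt(pi))                # integral of the squared N(0,1) density
all.equal(ise, term1 - 2 * term2 + term3)  # the three-term expansion matches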
Bowman (1984) proposed choosing the bandwidth as the value of h that minimizes the estimate of the two other terms in the expansion of ISE above, namely

(5)   \frac{1}{n} \sum_{i=1}^{n} \int (\hat{f}_{-i}(y))^2\,dy - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(X_i),

where \hat{f}_{-i}(y) denotes the kernel estimator constructed from the data without the observation X_i. The method is commonly referred to as least squares cross-validation, since it is based on the so-called leave-one-out density estimator \hat{f}_{-i}(y). Rudemo (1982) proposed the same technique from a slightly different viewpoint. Bowman and Azzalini (1997, page 37) provided an explicit expression for (5) for the Gaussian kernel.

Stone (1984, pages 1285–1286) provided the following straightforward demonstration that the second term in (5) is an unbiased estimate of the second term in ISE. Observe

E\left[\int \hat{f}_h(y) f(y)\,dy\right] = \int\!\!\int \frac{1}{h} K\left(\frac{y - x}{h}\right) f(x)\,dx\, f(y)\,dy = E\left[\frac{1}{h} K\left(\frac{Y - X}{h}\right)\right].

This leads to the unbiased estimate of \int \hat{f}_h(y) f(y)\,dy:

\frac{1}{n(n-1)h} \sum_{i \neq j} K\left(\frac{X_i - X_j}{h}\right) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}_{-i}(X_i).

Hall (1983, page 1157) showed that

\frac{1}{n} \sum_{i=1}^{n} \int (\hat{f}_{-i}(y))^2\,dy = \int (\hat{f}_h(y))^2\,dy + O_p\left(\frac{1}{n^2 h}\right)

and hence changed the least squares cross-validation (LSCV) based criterion from (5) to

LSCV(h) = \int (\hat{f}_h(y))^2\,dy - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(X_i),

since "it is slightly simpler to compute, without affecting the asymptotics." This version is the one used by most authors. We denote the value of h that minimizes LSCV(h) by h_{LSCV}. Least squares cross-validation is also referred to as unbiased cross-validation since

E[LSCV(h)] = E\left[\int \{\hat{f}_h(y) - f(y)\}^2\,dy - \int f^2(y)\,dy\right] = MISE - \int f^2(y)\,dy.

In S-PLUS, h_{LSCV} is invoked by width = "bandwidth.ucv", while in R it is invoked by bw = "ucv". The least squares cross-validation function LSCV(h) can have more than one local minimum (Hall and Marron, 1991). Thus, in practice, it is prudent to plot LSCV(h) and not just rely on the result of a minimization routine. Jones, Marron and Sheather (1996) recommended that the largest local minimizer of LSCV(h) be used as h_{LSCV}, since this value produces better empirical performance than the global minimizer. The Bowman and Azzalini (1997) library of S-PLUS functions contains the function cv(y, h) which produces values of LSCV(h) for the data set y over the vector of different bandwidth values in h.

For the least squares cross-validation based criterion, by using the representation

(6)   LSCV(h) = \frac{1}{nh} R(K) + \frac{2}{n^2 h} \sum_{i<j} \gamma\left(\frac{X_i - X_j}{h}\right),

where \gamma(c) = \int K(w) K(w + c)\,dw - 2K(c), Scott and Terrell (1987) showed that

E[LSCV(h)] = \frac{1}{nh} R(K) + \frac{h^4}{4} \mu_2(K)^2 R(f'') - R(f) + O(n^{-1}) = AMISE\{\hat{f}_h\} - R(f) + O(n^{-1}).

Thus, least squares cross-validation essentially provides estimates of R(f''), the only unknown quantity in AMISE\{\hat{f}_h\}.

For a given set of data, denote the bandwidth that minimizes ISE(\hat{f}_h) by \hat{h}_{ISE}. A number of authors (e.g., Gu, 1998) argued that the ideal bandwidth is the random quantity \hat{h}_{ISE}, since it minimizes the ISE for the given sample. However, \hat{h}_{ISE} is an inherently difficult quantity to estimate. In particular, Hall and Marron (1987a) showed that the smallest possible relative error for any data based bandwidth \hat{h} is

\frac{\hat{h}}{\hat{h}_{ISE}} - 1 = O_p(n^{-1/10}).

Hall and Marron (1987b) and Scott and Terrell (1987) showed that the least squares cross-validation bandwidth h_{LSCV} achieves this best possible convergence rate. In particular, they showed that

n^{1/10} \left(\frac{h_{LSCV}}{\hat{h}_{ISE}} - 1\right)

has an asymptotic N(0, \sigma^2_{LSCV}) distribution.
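Representation (6) makes LSCV(h) easy to compute for the Gaussian kernel, for which \int K(w)K(w+c)\,dw is the N(0, 2) density evaluated at c. The sketch below is an illustrative implementation under that assumption, with an arbitrary simulated sample and bandwidth grid; R's bw.ucv is the built-in version.

lscv <- function(h, x) {
  n <- length(x)
  d <- as.numeric(dist(x))                 # pairwise |X_i - X_j|, i < j
  # gamma(c) from (6); gamma is even in c, so absolute differences suffice
  gam <- dnorm(d / h, sd = sqrt(2)) - 2 * dnorm(d / h)
  1 / (2 * sqrt(pi) * n * h) + (2 / (n^2 * h)) * sum(gam)
}
set.seed(4); x <- rnorm(500)               # hypothetical sample
hs  <- seq(0.05, 0.6, by = 0.005)
obj <- sapply(hs, lscv, x = x)
hs[which.min(obj)]                         # compare with bw.ucv(x)
plot(hs, obj, type = "l")                  # a plot in the spirit of Figure 3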
592 S. J. SHEATHER has an asymptotic N (0, σ2 LSCV) distribution. The slow n−1/10 rate of convergence means that hLSCV is highly variable in practice, a fact that has been demonstrated in many simulation studies (see Simonoff, 1996, page 76 for references). In addition to high variability, least squares cross-validation “often undersmooths in practice, in that it leads to spurious bumpiness in the underlying density” (Simonoff, 1996, page 76). On the other hand, a major advantage of least squares cross-validation over other methods is that it is widely applicable. Figure 2 shows Gaussian kernel density estimates based on two different bandwidths for a sample of 500 data points from the standard normal distribution. The 500 data points are marked as vertical bars above the horizontal axis in Figure 2. The dramatically undersmoothed density estimate (depicted by the dashed line in Figure 2) is based on the bandwidth obtained from least squares cross-validation, in this case hLSCV = 0.059, while the density estimate depicted by the solid curve in Figure 2 is based on the Sheather– Jones plug-in bandwidth, hSJ (which is discussed below). In this case, hSJ = 0.277. Since both the data and the kernel in this example are Gaussian, it is possible to perform exact MISE calculations (see Wand and Jones, 1995, pages 24–25 for details). The bandwidth which minimizes MISE in this case is hMISE = 0.315. Following the advice given above, Figure 3 contains a plot of the least squares cross-validation function LSCV(h) against h. It is clear from this figure that the value hLSCV = 0.059 is the unambiguous minimizer of LSCV(h). Thus, least squares cross-validation has performed poorly by depicting many modes in a situation in which the underlying density is easy to estimate FIG. 2. Kernel density estimates based on LSCV (dashed curve) and the Sheather–Jones plug-in (solid curve) for 500 data points from a standard normal distribution. FIG. 3. Plot of the least squares cross-validation function LSCV(h) against h. (see Wand and Jones, 1995, pages 36–39 for further material on measuring how difficult a density is to estimate). Scott and Terrell (1987) proposed a method called biased cross-validation (BCV), which is based on choosing the bandwidth that minimizes an estimate of AMISE rather than an estimate of ISE. The BCV objective function is just the estimate of AMISE obtained by replacing R(f ) in (3) by R(fˆ h ) − 1 nh5 R(K (7) ), where fˆ h is the second derivative of the kernel density estimate (1) and the subscript h denotes the fact that the bandwidth used for this estimate is the same one used to estimate the density f itself. The reason for subtracting the second term in (6) is that this term is the positive constant bias term that corresponds to the diagonal terms in R(fˆ h ). The BCV objective function is thus given by BCV(h) = 1 nhR(K) + h4 4 µ2(K)2 · R(fˆ h ) − 1 nh5 R(K) = 1 nhR(K) + µ2(K)2 2n2h i<j φ Xi − Xj h , where φ(c)= K(w)K(w +c) dw (Scott and Terrell, 1987). Notice the similarity of this last equation and the version of least squares cross-validation given in (6). We denote the bandwidth that minimizes BCV(h) by hBCV. In S-PLUS, hBCV is invoked by width =