Figure 1.2: Cubic B-splines on [0, 1] corresponding to knots at .3, .6 and .9.

From this figure, it can be seen that the $i$th cubic B-spline is nonzero only on the interval $[t_i, t_{i+4}]$. In general, the $i$th degree-$p$ B-spline is nonzero only on the interval $[t_i, t_{i+p+1}]$. This property ensures that the $i$th and $(i+j+1)$st B-splines are orthogonal for $j \geq p$. B-splines whose supports overlap are linearly independent.

1.1.2 Least-Squares Splines

Fitting a cubic spline to bivariate data can be done using least-squares. Using the truncated power basis, the model to be fit is of the form
$$y_j = \beta_0 + \beta_1 x_j + \cdots + \beta_p x_j^p + \beta_{p+1}(x_j - t_1)_+^p + \cdots + \beta_{p+k}(x_j - t_k)_+^p + \varepsilon_j, \qquad j = 1, 2, \ldots, n,$$
where $\varepsilon_j$ satisfies the usual conditions. In vector-matrix form, we may write
$$y = T\beta + \varepsilon \qquad (1.5)$$
where $T$ is an $n \times (p + k + 1)$ matrix whose first $p + 1$ columns correspond to the model matrix for $p$th degree polynomial regression, and whose $(j, p + 1 + i)$ element is $(x_j - t_i)_+^p$. Applying least-squares to (1.5), we see that
$$\hat{\beta} = (T^T T)^{-1} T^T y.$$
Thus, all of the usual linear regression technology is at our disposal here, including standard error estimates for coefficients and confidence and prediction intervals. Even regression diagnostics are applicable in the usual manner.
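To make the computation concrete, here is a minimal sketch in R (not taken from the text) that builds the matrix $T$ for a cubic spline ($p = 3$) with the knots of Figure 1.2 and fits the model with lm(). The simulated data and the names x, y, p, knots, and Tmat are illustrative assumptions only.

# Sketch: least-squares cubic spline via the truncated power basis (illustrative data)
p <- 3
knots <- c(0.3, 0.6, 0.9)
set.seed(1)
x <- sort(runif(100))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.1)
# Columns 1, x, ..., x^p, followed by (x - t)_+^p for each knot t
Tmat <- outer(x, 0:p, `^`)
for (t in knots) Tmat <- cbind(Tmat, pmax(x - t, 0)^p)
# lm() computes beta-hat = (T'T)^{-1} T'y; the "-1" avoids a duplicate intercept column
fit <- lm(y ~ Tmat - 1)
plot(x, y)
lines(x, fitted(fit))

In practice, as discussed next, one would work with the B-spline form of the model, which is numerically better behaved.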
The only difficulty is the poor conditioning of the truncated power basis, which will result in inaccuracies in the calculation of $\hat{\beta}$. It is for this reason that the B-spline basis was introduced. Using this basis, we re-formulate the regression model as
$$y_j = \sum_{i=0}^{p+k} \beta_i B_{i,p}(x_j) + \varepsilon_j \qquad (1.6)$$
or, in vector-matrix form,
$$y = B\beta + \varepsilon$$
where the $(j, i)$ element of $B$ is $B_{i,p}(x_j)$. The least-squares estimate of $\beta$ is then
$$\hat{\beta} = (B^T B)^{-1} B^T y.$$
The orthogonality of B-splines that are far enough apart results in a banded matrix $B^T B$, which has better conditioning properties than the matrix $T^T T$. The bandedness property actually allows for the use of more efficient numerical techniques in computing $\hat{\beta}$. Again, all of the usual regression techniques are available. The only drawback of this model is that the coefficients are uninterpretable, and the B-splines are a little less intuitive than the truncated power functions.

We have been assuming that the knots are known. In general, they are unknown and must be chosen. Badly chosen knots can result in bad approximations. Because the spline regression problem can be formulated as an ordinary regression problem with a transformed predictor, it is possible to apply variable selection techniques such as backward selection to choose a set of knots. The usual approach is to start with a set of knots located at a subset of the order statistics of the predictor. Then backward selection is applied, using the truncated power basis form of the model. Each time a basis function is eliminated, the corresponding knot is eliminated. The method has drawbacks, notably the ill-conditioning of the basis mentioned earlier.

Figure 1.3 exhibits an example of a least-squares spline with automatically generated knots, applied to a data set consisting of titanium measurements.³ A version of backward selection was used to generate these knots; the stopping rule used was similar to the Akaike Information Criterion (AIC) discussed in Chapter 6. Although this least-squares spline fit to these data is better than what could be obtained using polynomial regression, it is unsatisfactory in many ways. The flat regions are not modelled smoothly enough, and the peak is cut off.

³To obtain Figure 1.3, type

library(splines)   # bs() is provided by the splines package
attach(titanium)
y.lm <- lm(g ~ bs(temperature, knots=c(755, 835, 905, 975),
                  Boundary.knots=c(550, 1100)))
plot(titanium)
lines(temperature, predict(y.lm))
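The banded structure of $B^T B$ noted above can be checked directly. The following short sketch (an illustration under assumed inputs, not code from the text) uses splines::bs() to build the cubic B-spline design matrix on an evenly spaced grid over [0, 1] with the knots of Figure 1.2 and prints its cross-product matrix; entries corresponding to B-splines whose supports do not overlap are exactly zero.

library(splines)                                   # provides bs()
x <- seq(0, 1, length.out = 200)                   # assumed evaluation grid
B <- bs(x, knots = c(0.3, 0.6, 0.9), degree = 3)   # cubic B-spline basis matrix
round(crossprod(B), 3)                             # B'B is banded: zeros away from the band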
Figure 1.3: A least-squares spline fit to the titanium heat data using automatically generated knots. The knots used were 755, 835, 905, and 975.

Figure 1.4: A least-squares spline fit to the titanium heat data using manually-selected knots.

A substantial improvement can be obtained by manually selecting additional knots, and removing some of the automatically generated knots. In particular, we can render the peak more effectively by adding an additional knot in its vicinity. Adjusting the knot
that was already there improves the fit as well.⁴

1.1.3 Smoothing Splines

One way around the problem of choosing knots is to use lots of them. A result analogous to the Weierstrass approximation theorem says that any sufficiently smooth function can be approximated arbitrarily well by spline functions with enough knots.

The use of a large number of knots alone is not sufficient to avoid trouble, since we will over-fit the data if the number of knots $k$ is taken so large that $p + k + 1 > n$. In that case, we would have no degrees of freedom left for estimating the residual variance. A standard way of coping with this over-fitting problem is to add a penalty term to the least-squares criterion. One requires that the resulting spline regression estimate have low curvature, as measured by the square of the second derivative.

More precisely, one may try to minimize (for a given constant $\lambda$)
$$\sum_{j=1}^{n} (y_j - S(x_j))^2 + \lambda \int_a^b (S''(x))^2 \, dx$$
over the set of all functions $S(x)$ which are twice continuously differentiable. The solution to this minimization problem has been shown to be a cubic spline which is surprisingly easy to calculate.⁵ Thus, the problem of choosing a set of knots is replaced by selecting a value for the smoothing parameter $\lambda$. Note that if $\lambda$ is small, the solution will be a cubic spline which almost interpolates the data; increasing values of $\lambda$ render increasingly smooth approximations.

The usual way of choosing $\lambda$ is by cross-validation. The ordinary cross-validation choice of $\lambda$ minimizes
$$\mathrm{CV}(\lambda) = \sum_{j=1}^{n} (y_j - \hat{S}_{\lambda,(j)}(x_j))^2$$
where $\hat{S}_{\lambda,(j)}(x)$ is the smoothing spline obtained using parameter $\lambda$ and all of the data except the $j$th observation. Note that the CV function is similar in spirit to the PRESS statistic, but

⁴The plot in Figure 1.4 can be generated using

y.lm <- lm(g ~ bs(temperature, knots=c(755, 835, 885, 895, 915, 975),
                  Boundary.knots=c(550, 1100)))
plot(titanium)
lines(spline(temperature, predict(y.lm)))

⁵The B-spline coefficients for this spline can be obtained from an expression of the form
$$\hat{\beta} = (B^T B + \lambda D^T D)^{-1} B^T y$$
where $B$ is the matrix used for least-squares regression splines and $D$ is a matrix that arises in the calculation involving the squared second derivatives of the spline. Details can be found in de Boor (1978). It is sufficient to note here that this approach has similarities with ridge regression, and that the estimated regression is a linear function of the responses.
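As a concrete illustration (not code from the text), R's built-in smooth.spline() fits a penalized smoothing spline of exactly this kind; setting cv = TRUE chooses the smoothing parameter by the ordinary leave-one-out cross-validation criterion above, rather than by generalized cross-validation. The sketch assumes the same titanium data frame, with columns temperature and g, used in footnote 3.

# Sketch: smoothing spline for the titanium data, lambda chosen by ordinary CV
fit <- smooth.spline(titanium$temperature, titanium$g, cv = TRUE)
plot(titanium)
lines(fit)          # fitted smoothing spline
fit$lambda          # the value of the smoothing parameter selected by CV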