that can easily occur. A careless statistician might overlook these presumed missing values and complete an analysis assuming that these were real observed zeroes. If the error was later discovered, they might then blame the researchers for using 0 as a missing value code (not a good choice since it is a valid value for some of the variables) and not mentioning it in their data description. Unfortunately such oversights are not uncommon, particularly with datasets of any size or complexity. The statistician bears some share of responsibility for spotting these mistakes.

We set all zero values of the five variables to NA, which is the missing value code used by R:

> pima$diastolic[pima$diastolic == 0] <- NA
> pima$glucose[pima$glucose == 0] <- NA
> pima$triceps[pima$triceps == 0] <- NA
> pima$insulin[pima$insulin == 0] <- NA
> pima$bmi[pima$bmi == 0] <- NA

The variable test is not quantitative but categorical. Such variables are also called factors. However, because of the numerical coding, this variable has been treated as if it were quantitative. It's best to designate such variables as factors so that they are treated appropriately. Sometimes people forget this and compute stupid statistics such as "average zip code".

> pima$test <- factor(pima$test)
> summary(pima$test)
  0   1
500 268

We now see that 500 cases were negative and 268 positive. Even better is to use descriptive labels:

> levels(pima$test) <- c("negative","positive")
> summary(pima)
    pregnant        glucose       diastolic        triceps          insulin
 Min.   : 0.00   Min.   : 44   Min.   : 24.0   Min.   :  7.0   Min.   : 14.0
 1st Qu.: 1.00   1st Qu.: 99   1st Qu.: 64.0   1st Qu.: 22.0   1st Qu.: 76.2
 Median : 3.00   Median :117   Median : 72.0   Median : 29.0   Median :125.0
 Mean   : 3.85   Mean   :122   Mean   : 72.4   Mean   : 29.2   Mean   :155.5
 3rd Qu.: 6.00   3rd Qu.:141   3rd Qu.: 80.0   3rd Qu.: 36.0   3rd Qu.:190.0
 Max.   :17.00   Max.   :199   Max.   :122.0   Max.   : 99.0   Max.   :846.0
                 NA's   :  5   NA's   : 35.0   NA's   :227.0   NA's   :374.0
      bmi           diabetes          age            test
 Min.   :18.2   Min.   :0.078   Min.   :21.0   negative:500
 1st Qu.:27.5   1st Qu.:0.244   1st Qu.:24.0   positive:268
 Median :32.3   Median :0.372   Median :29.0
 Mean   :32.5   Mean   :0.472   Mean   :33.2
 3rd Qu.:36.6   3rd Qu.:0.626   3rd Qu.:41.0
 Max.   :67.1   Max.   :2.420   Max.   :81.0
 NA's   :11.0

Now that we've cleared up the missing values and coded the data appropriately, we are ready to do some plots. Perhaps the most well-known univariate plot is the histogram:

> hist(pima$diastolic)
Figure 1.1: First panel shows histogram of the diastolic blood pressures, the second shows a kernel density estimate of the same while the third shows an index plot of the sorted values.

as shown in the first panel of Figure 1.1. We see a bell-shaped distribution for the diastolic blood pressures centered around 70. The construction of a histogram requires the specification of the number of bins and their spacing on the horizontal axis. Some choices can lead to histograms that obscure some features of the data. R attempts to specify the number and spacing of bins given the size and distribution of the data, but this choice is not foolproof and misleading histograms are possible. For this reason, I prefer to use Kernel Density Estimates, which are essentially a smoothed version of the histogram (see Simonoff (1996) for a discussion of the relative merits of histograms and kernel estimates).

> plot(density(pima$diastolic,na.rm=TRUE))

The kernel estimate may be seen in the second panel of Figure 1.1. We see that it avoids the distracting blockiness of the histogram. An alternative is to simply plot the sorted data against its index:

> plot(sort(pima$diastolic),pch=".")

The advantage of this is that we can see all the data points themselves. We can see the distribution and possible outliers. We can also see the discreteness in the measurement of blood pressure: values are rounded to the nearest even number and hence we see the "steps" in the plot.

Now a couple of bivariate plots as seen in Figure 1.2:

> plot(diabetes ~ diastolic,pima)
> plot(diabetes ~ test,pima)

First, we see the standard scatterplot showing two quantitative variables. Second, we see a side-by-side boxplot suitable for showing a quantitative and a qualitative variable. Also useful is a scatterplot matrix, not shown here, produced by
Figure 1.2: First panel shows scatterplot of the diastolic blood pressures against diabetes function and the second shows boxplots of diastolic blood pressure broken down by test result.

> pairs(pima)

We will be seeing more advanced plots later but the numerical and graphical summaries presented here are sufficient for a first look at the data.

1.2 When to use Regression Analysis

Regression analysis is used for explaining or modeling the relationship between a single variable Y, called the response, output or dependent variable, and one or more predictor, input, independent or explanatory variables, X1, ..., Xp. When p = 1, it is called simple regression but when p > 1 it is called multiple regression or sometimes multivariate regression. When there is more than one Y, then it is called multivariate multiple regression, which we won't be covering here.

The response must be a continuous variable but the explanatory variables can be continuous, discrete or categorical, although we leave the handling of categorical explanatory variables to later in the course. Taking the example presented above, a regression of diastolic and bmi on diabetes would be a multiple regression involving only quantitative variables, which we shall be tackling shortly. A regression of diastolic and bmi on test would involve one predictor which is quantitative, which we will consider later in the chapter on Analysis of Covariance. A regression of diastolic on just test would involve just qualitative predictors, a topic called Analysis of Variance or ANOVA, although this would just be a simple two sample situation. A regression of test (the response) on diastolic and bmi (the predictors) would involve a qualitative response. A logistic regression could be used but this will not be covered in this book.

Regression analyses have several possible objectives including

1. Prediction of future observations.
2. Assessment of the effect of, or relationship between, explanatory variables on the response.
3. A general description of data structure.
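As a rough illustration, models of each type mentioned above might be specified in R along the following lines, using the pima data frame prepared earlier. The particular variable choices are only suggestive and the fits themselves are not examined here; only the form of the model formulas matters at this point.

> # multiple regression: quantitative response, quantitative predictors
> lm(diabetes ~ diastolic + bmi, data=pima)
> # analysis of covariance: a quantitative and a qualitative predictor
> lm(diastolic ~ bmi + test, data=pima)
> # analysis of variance (here just a two sample comparison): qualitative predictor only
> lm(diastolic ~ test, data=pima)
> # qualitative (binary) response: logistic regression, not covered in this book
> glm(test ~ diastolic + bmi, family=binomial, data=pima)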
Extensions exist to handle multivariate responses, binary responses (logistic regression analysis) and count responses (Poisson regression).

1.3 History

Regression-type problems were first considered in the 18th century concerning navigation using astronomy. Legendre developed the method of least squares in 1805. Gauss claimed to have developed the method a few years earlier and in 1809 showed that least squares was the optimal solution when the errors are normally distributed. The methodology was used almost exclusively in the physical sciences until later in the 19th century. Francis Galton coined the term regression to mediocrity in 1875 in reference to the simple regression equation in the form

\[ \frac{y - \bar{y}}{SD_y} = r \, \frac{x - \bar{x}}{SD_x} \]

Galton used this equation to explain the phenomenon that sons of tall fathers tend to be tall but not as tall as their fathers while sons of short fathers tend to be short but not as short as their fathers. This effect is called the regression effect.

We can illustrate this effect with some data on scores from a course taught using this book. In Figure 1.3, we see a plot of midterm against final scores. We scale each variable to have mean 0 and SD 1 so that we are not distracted by the relative difficulty of each exam and the total number of points possible. Furthermore, this simplifies the regression equation to y = rx.

> data(stat500)
> stat500 <- data.frame(scale(stat500))
> plot(final ~ midterm,stat500)
> abline(0,1)

Figure 1.3: Final and midterm scores in standard units. Least squares fit is shown with a dotted line while y = x is shown as a solid line.
We have added the y = x (solid) line to the plot. Now a student scoring, say, one standard deviation above average on the midterm might reasonably expect to do equally well on the final. We compute the least squares regression fit and plot the regression line (more on the details later). We also compute the correlations.

> g <- lm(final ~ midterm,stat500)
> abline(g$coef,lty=5)
> cor(stat500)
        midterm    final       hw   total
midterm 1.00000 0.545228 0.272058 0.84446
final   0.54523 1.000000 0.087338 0.77886
hw      0.27206 0.087338 1.000000 0.56443
total   0.84446 0.778863 0.564429 1.00000

We see that the student scoring 1 SD above average on the midterm is predicted to score somewhat less above average on the final (see the dotted regression line): 0.54523 SDs above average, to be exact. Correspondingly, a student scoring below average on the midterm might expect to do relatively better in the final, although still below average.

If exams managed to measure the ability of students perfectly, then provided that ability remained unchanged from midterm to final, we would expect to see a perfect correlation. Of course, it's too much to expect such a perfect exam and some variation is inevitably present. Furthermore, individual effort is not constant. Getting a high score on the midterm can partly be attributed to skill but also to a certain amount of luck. One cannot rely on this luck to be maintained in the final. Hence we see the "regression to mediocrity".

Of course this applies to any (x, y) situation like this; an example is the so-called sophomore jinx in sports, when a rookie star has a so-so second season after a great first year. Although in the father-son example it does predict that successive descendants will come closer to the mean, it does not imply the same of the population in general, since random fluctuations will maintain the variation. In many other applications of regression, the regression effect is not of interest, so it is unfortunate that we are now stuck with this rather misleading name.

Regression methodology developed rapidly with the advent of high-speed computing. Just fitting a regression model used to require extensive hand calculation. As computing hardware has improved, the scope for analysis has widened.
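Returning to the exam scores, the 0.54523 figure quoted above can be checked directly from the fitted model g. This is a minimal sketch using the objects created earlier; because both variables have been standardized, the intercept is essentially zero and the prediction reduces to the slope, which equals the midterm-final correlation.

> # predicted final score, in SD units, for a student 1 SD above average on the midterm
> predict(g, data.frame(midterm=1))
> # slope equals the midterm-final correlation since both variables are standardized
> coef(g)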