CONTENTS

10.4 Summary 133

11 Statistical Strategy and Model Uncertainty 134
11.1 Strategy 134
11.2 Experiment 135
11.3 Discussion 136

12 Chicago Insurance Redlining - a complete example 138

13 Robust and Resistant Regression 150

14 Missing Data 156

15 Analysis of Covariance 160
15.1 A two-level example 161
15.2 Coding qualitative predictors 164
15.3 A three-level example 165

16 ANOVA 168
16.1 One-Way Anova 168
16.1.1 The model 168
16.1.2 Estimation and testing 168
16.1.3 An example 169
16.1.4 Diagnostics 171
16.1.5 Multiple Comparisons 172
16.1.6 Contrasts 177
16.1.7 Scheffé's theorem for multiple comparisons 177
16.1.8 Testing for homogeneity of variance 179
16.2 Two-Way Anova 179
16.2.1 One observation per cell 180
16.2.2 More than one observation per cell 180
16.2.3 Interpreting the interaction effect 180
16.2.4 Replication 184
16.3 Blocking designs 185
16.3.1 Randomized Block design 185
16.3.2 Relative advantage of RCBD over CRD 190
16.4 Latin Squares 191
16.5 Balanced Incomplete Block design 195
16.6 Factorial experiments 200

A Recommended Books 204
A.1 Books on R 204
A.2 Books on Regression and Anova 204

B R functions and data 205
C Quick introduction to R 207
C.1 Reading the data in 207
C.2 Numerical Summaries 207
C.3 Graphical Summaries 209
C.4 Selecting subsets of the data 209
C.5 Learning more about R 210
Chapter 1

Introduction

1.1 Before you start

Statistics starts with a problem, continues with the collection of data, proceeds with the data analysis and finishes with conclusions. It is a common mistake of inexperienced Statisticians to plunge into a complex analysis without paying attention to what the objectives are or even whether the data are appropriate for the proposed analysis. Look before you leap!

1.1.1 Formulation

"The formulation of a problem is often more essential than its solution which may be merely a matter of mathematical or experimental skill." - Albert Einstein

To formulate the problem correctly, you must

1. Understand the physical background. Statisticians often work in collaboration with others and need to understand something about the subject area. Regard this as an opportunity to learn something new rather than a chore.

2. Understand the objective. Again, you will often be working with a collaborator who may not be clear about what the objectives are. Beware of "fishing expeditions" - if you look hard enough, you'll almost always find something, but that something may just be a coincidence.

3. Make sure you know what the client wants. Sometimes Statisticians perform an analysis far more complicated than the client really needed. You may find that simple descriptive statistics are all that are needed.

4. Put the problem into statistical terms. This is a challenging step and one where irreparable errors are sometimes made. Once the problem is translated into the language of Statistics, the solution is often routine. Difficulties with this step explain why Artificial Intelligence techniques have yet to make much impact on applications of Statistics. Defining the problem is hard to program.

That a statistical method can read in and process the data is not enough. The results may be totally meaningless.
1.1.2 Data Collection

It is important to understand how the data were collected.

- Are the data observational or experimental? Are the data a sample of convenience or were they obtained via a designed sample survey? How the data were collected has a crucial impact on what conclusions can be made.

- Is there non-response? The data you don't see may be just as important as the data you do see.

- Are there missing values? This is a common problem that is troublesome and time consuming to deal with.

- How are the data coded? In particular, how are the qualitative variables represented?

- What are the units of measurement? Sometimes data are collected or represented with far more digits than are necessary. Consider rounding if this will help with the interpretation or storage costs.

- Beware of data entry errors. This problem is all too common, almost a certainty in any real dataset of at least moderate size. Perform some data sanity checks.

1.1.3 Initial Data Analysis

This is a critical step that should always be performed. It looks simple but it is vital.

- Numerical summaries: means, standard deviations, five-number summaries, correlations.

- Graphical summaries:
  - One variable: boxplots, histograms, etc.
  - Two variables: scatterplots.
  - Many variables: interactive graphics.

Look for outliers, data-entry errors and skewed or unusual distributions. Are the data distributed as you expect?

Getting data into a form suitable for analysis by cleaning out mistakes and aberrations is often time consuming. It often takes more time than the data analysis itself. In this course, all the data will be ready to analyze, but you should realize that in practice this is rarely the case.

Let's look at an example. The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The following variables were recorded: Number of times pregnant, Plasma glucose concentration at 2 hours in an oral glucose tolerance test, Diastolic blood pressure (mm Hg), Triceps skin fold thickness (mm), 2-Hour serum insulin (mu U/ml), Body mass index (weight in kg/(height in m²)), Diabetes pedigree function, Age (years) and a test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive). The data may be obtained from the UCI Repository of machine learning databases at http://www.ics.uci.edu/~mlearn/MLRepository.html.

Of course, before doing anything else, one should find out what the purpose of the study was and more about how the data were collected. But let's skip ahead to a look at the data:
> library(faraway)
> data(pima)
> pima
    pregnant glucose diastolic triceps insulin  bmi diabetes age test
1          6     148        72      35       0 33.6    0.627  50    1
2          1      85        66      29       0 26.6    0.351  31    0
3          8     183        64       0       0 23.3    0.672  32    1
... much deleted ...
768        1      93        70      31       0 30.4    0.315  23    0

The library(faraway) makes the data used in this book available while data(pima) calls up this particular dataset. Simply typing the name of the data frame, pima, prints out the data. It's too long to show it all here. For a dataset of this size, one can just about visually skim over the data for anything out of place, but it is certainly easier to use more direct methods. We start with some numerical summaries:

> summary(pima)
    pregnant        glucose      diastolic       triceps        insulin
 Min.   : 0.00   Min.   :  0   Min.   :  0.0   Min.   : 0.0   Min.   :  0.0
 1st Qu.: 1.00   1st Qu.: 99   1st Qu.: 62.0   1st Qu.: 0.0   1st Qu.:  0.0
 Median : 3.00   Median :117   Median : 72.0   Median :23.0   Median : 30.5
 Mean   : 3.85   Mean   :121   Mean   : 69.1   Mean   :20.5   Mean   : 79.8
 3rd Qu.: 6.00   3rd Qu.:140   3rd Qu.: 80.0   3rd Qu.:32.0   3rd Qu.:127.2
 Max.   :17.00   Max.   :199   Max.   :122.0   Max.   :99.0   Max.   :846.0
      bmi          diabetes          age            test
 Min.   : 0.0   Min.   :0.078   Min.   :21.0   Min.   :0.000
 1st Qu.:27.3   1st Qu.:0.244   1st Qu.:24.0   1st Qu.:0.000
 Median :32.0   Median :0.372   Median :29.0   Median :0.000
 Mean   :32.0   Mean   :0.472   Mean   :33.2   Mean   :0.349
 3rd Qu.:36.6   3rd Qu.:0.626   3rd Qu.:41.0   3rd Qu.:1.000
 Max.   :67.1   Max.   :2.420   Max.   :81.0   Max.   :1.000

The summary() command is a quick way to get the usual univariate summary information. At this stage, we are looking for anything unusual or unexpected, perhaps indicating a data entry error. For this purpose, a close look at the minimum and maximum values of each variable is worthwhile. Starting with pregnant, we see a maximum value of 17. This is large but perhaps not impossible. However, we then see that the next five variables have minimum values of zero. No blood pressure is not good for the health; something must be wrong. Let's look at the sorted values:

> sort(pima$diastolic)
  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [19]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 24
 [37] 30 30 38 40 44 44 44 44 46 46 48 48 48 48 48 50 50 50
...etc...

We see that the first 35 values are zero. The description that comes with the data says nothing about it, but it seems likely that zero has been used as a missing value code. For one reason or another, the researchers did not obtain the blood pressures of these 35 patients. In a real investigation, one would likely be able to question the researchers about what really happened. Nevertheless, this does illustrate the kind of misunderstanding that can easily occur.
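If the zeros really are missing value codes, leaving them in place would be risky, since any later summary or regression would treat them as genuine measurements. The following is a minimal sketch, not part of the original analysis, of how one might recode such zeros as NA and then repeat the numerical and graphical checks described in Section 1.1.3. The choice of which variables to recode (assuming that glucose, diastolic, triceps, insulin and bmi cannot truly be zero) is an assumption based on the summaries above, not something documented with the data.

library(faraway)
data(pima)

# Variables for which a zero is physiologically implausible and is assumed
# (not documented) to be a missing value code
zero.codes <- c("glucose", "diastolic", "triceps", "insulin", "bmi")

# Replace the zeros in these columns with NA
for (v in zero.codes) {
  pima[pima[, v] == 0, v] <- NA
}

# Re-run the numerical summaries: the minima should now look plausible
# and the number of NA's in each variable is reported
summary(pima[, zero.codes])

# A quick graphical summary of one variable after recoding
hist(pima$diastolic, xlab = "Diastolic blood pressure (mm Hg)", main = "")

Whether a zero insulin value should really be treated as missing is more debatable than for blood pressure; in practice, one would check with whoever collected the data before recoding anything.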