CONTENTS

10.4 Summary 133

11 Statistical Strategy and Model Uncertainty 134
11.1 Strategy 134
11.2 Experiment 135
11.3 Discussion 136

12 Chicago Insurance Redlining - a complete example 138

13 Robust and Resistant Regression 150

14 Missing Data 156

15 Analysis of Covariance 160
15.1 A two-level example 161
15.2 Coding qualitative predictors 164
15.3 A three-level example 165

16 ANOVA 168
16.1 One-Way Anova 168
16.1.1 The model 168
16.1.2 Estimation and testing 168
16.1.3 An example 169
16.1.4 Diagnostics 171
16.1.5 Multiple Comparisons 172
16.1.6 Contrasts 177
16.1.7 Scheffé's theorem for multiple comparisons 177
16.1.8 Testing for homogeneity of variance 179
16.2 Two-Way Anova 179
16.2.1 One observation per cell 180
16.2.2 More than one observation per cell 180
16.2.3 Interpreting the interaction effect 180
16.2.4 Replication 184
16.3 Blocking designs 185
16.3.1 Randomized Block design 185
16.3.2 Relative advantage of RCBD over CRD 190
16.4 Latin Squares 191
16.5 Balanced Incomplete Block design 195
16.6 Factorial experiments 200

A Recommended Books 204
A.1 Books on R 204
A.2 Books on Regression and Anova 204

B R functions and data 205
C Quick introduction to R 207
C.1 Reading the data in 207
C.2 Numerical Summaries 207
C.3 Graphical Summaries 209
C.4 Selecting subsets of the data 209
C.5 Learning more about R 210
Chapter 1

Introduction

1.1 Before you start

Statistics starts with a problem, continues with the collection of data, proceeds with the data analysis and finishes with conclusions. It is a common mistake of inexperienced Statisticians to plunge into a complex analysis without paying attention to what the objectives are or even whether the data are appropriate for the proposed analysis. Look before you leap!

1.1.1 Formulation

"The formulation of a problem is often more essential than its solution which may be merely a matter of mathematical or experimental skill." - Albert Einstein

To formulate the problem correctly, you must

1. Understand the physical background. Statisticians often work in collaboration with others and need to understand something about the subject area. Regard this as an opportunity to learn something new rather than a chore.

2. Understand the objective. Again, you will often be working with a collaborator who may not be clear about what the objectives are. Beware of "fishing expeditions" - if you look hard enough, you'll almost always find something, but that something may just be a coincidence.

3. Make sure you know what the client wants. Sometimes Statisticians perform an analysis far more complicated than the client really needed. You may find that simple descriptive statistics are all that are needed.

4. Put the problem into statistical terms. This is a challenging step and one where irreparable errors are sometimes made. Once the problem is translated into the language of Statistics, the solution is often routine. Difficulties with this step explain why Artificial Intelligence techniques have yet to make much impact on applications of Statistics. Defining the problem is hard to program.

That a statistical method can read in and process the data is not enough. The results may be totally meaningless.
1.1.2 Data Collection

It is important to understand how the data were collected.

- Are the data observational or experimental? Are the data a sample of convenience or were they obtained via a designed sample survey? How the data were collected has a crucial impact on what conclusions can be made.

- Is there non-response? The data you don't see may be just as important as the data you do see.

- Are there missing values? This is a common problem that is troublesome and time consuming to deal with.

- How are the data coded? In particular, how are the qualitative variables represented?

- What are the units of measurement? Sometimes data are collected or represented with far more digits than are necessary. Consider rounding if this will help with the interpretation or storage costs.

- Beware of data entry errors. This problem is all too common, almost a certainty in any real dataset of at least moderate size. Perform some data sanity checks.

1.1.3 Initial Data Analysis

This is a critical step that should always be performed. It looks simple but it is vital.

- Numerical summaries: means, standard deviations, five-number summaries, correlations.

- Graphical summaries:
  - One variable: boxplots, histograms, etc.
  - Two variables: scatterplots.
  - Many variables: interactive graphics.

Look for outliers, data-entry errors and skewed or unusual distributions. Are the data distributed as you expect?

Getting data into a form suitable for analysis by cleaning out mistakes and aberrations is often time consuming. It often takes more time than the data analysis itself. In this course, all the data will be ready to analyze, but you should realize that in practice this is rarely the case.

Let's look at an example. The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The following variables were recorded: Number of times pregnant, Plasma glucose concentration at 2 hours in an oral glucose tolerance test, Diastolic blood pressure (mm Hg), Triceps skin fold thickness (mm), 2-Hour serum insulin (mu U/ml), Body mass index (weight in kg/(height in m²)), Diabetes pedigree function, Age (years) and a test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive). The data may be obtained from the UCI Repository of machine learning databases at http://www.ics.uci.edu/~mlearn/MLRepository.html.

Of course, before doing anything else, one should find out what the purpose of the study was and more about how the data were collected. But let's skip ahead to a look at the data:
> library(faraway)
> data(pima)
> pima
    pregnant glucose diastolic triceps insulin  bmi diabetes age test
1          6     148        72      35       0 33.6    0.627  50    1
2          1      85        66      29       0 26.6    0.351  31    0
3          8     183        64       0       0 23.3    0.672  32    1
... much deleted ...
768        1      93        70      31       0 30.4    0.315  23    0

The library(faraway) makes the data used in this book available while data(pima) calls up this particular dataset. Simply typing the name of the data frame, pima, prints out the data. It's too long to show it all here. For a dataset of this size, one can just about visually skim over the data for anything out of place, but it is certainly easier to use more direct methods. We start with some numerical summaries:

> summary(pima)
    pregnant        glucose      diastolic       triceps        insulin
 Min.   : 0.00   Min.   :  0   Min.   :  0.0   Min.   : 0.0   Min.   :  0.0
 1st Qu.: 1.00   1st Qu.: 99   1st Qu.: 62.0   1st Qu.: 0.0   1st Qu.:  0.0
 Median : 3.00   Median :117   Median : 72.0   Median :23.0   Median : 30.5
 Mean   : 3.85   Mean   :121   Mean   : 69.1   Mean   :20.5   Mean   : 79.8
 3rd Qu.: 6.00   3rd Qu.:140   3rd Qu.: 80.0   3rd Qu.:32.0   3rd Qu.:127.2
 Max.   :17.00   Max.   :199   Max.   :122.0   Max.   :99.0   Max.   :846.0
      bmi          diabetes          age            test
 Min.   : 0.0   Min.   :0.078   Min.   :21.0   Min.   :0.000
 1st Qu.:27.3   1st Qu.:0.244   1st Qu.:24.0   1st Qu.:0.000
 Median :32.0   Median :0.372   Median :29.0   Median :0.000
 Mean   :32.0   Mean   :0.472   Mean   :33.2   Mean   :0.349
 3rd Qu.:36.6   3rd Qu.:0.626   3rd Qu.:41.0   3rd Qu.:1.000
 Max.   :67.1   Max.   :2.420   Max.   :81.0   Max.   :1.000

The summary() command is a quick way to get the usual univariate summary information. At this stage, we are looking for anything unusual or unexpected, perhaps indicating a data entry error. For this purpose, a close look at the minimum and maximum values of each variable is worthwhile. Starting with pregnant, we see a maximum value of 17. This is large but perhaps not impossible. However, we then see that the next five variables have minimum values of zero. No blood pressure is not good for the health; something must be wrong. Let's look at the sorted values:

> sort(pima$diastolic)
  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [19]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 24
 [37] 30 30 38 40 44 44 44 44 46 46 48 48 48 48 48 50 50 50
...etc...

We see that the first 35 values are zero. The description that comes with the data says nothing about it, but it seems likely that zero has been used as a missing value code. For one reason or another, the researchers did not obtain the blood pressures of these 35 patients. In a real investigation, one would likely be able to question the researchers about what really happened. Nevertheless, this does illustrate the kind of misunderstanding that can easily occur.
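If the zeros really are missing value codes, leaving them in place would be risky, since any later summary or regression would treat them as genuine measurements. The following is a minimal sketch, not part of the original analysis, of how one might recode such zeros as NA and then repeat the numerical and graphical checks described in Section 1.1.3. The choice of which variables to recode (assuming that glucose, diastolic, triceps, insulin and bmi cannot truly be zero) is an assumption based on the summaries above, not something documented with the data.

library(faraway)
data(pima)

# Variables for which a zero is physiologically implausible and is assumed
# (not documented) to be a missing value code
zero.codes <- c("glucose", "diastolic", "triceps", "insulin", "bmi")

# Replace the zeros in these columns with NA
for (v in zero.codes) {
  pima[pima[, v] == 0, v] <- NA
}

# Re-run the numerical summaries: the minima should now look plausible
# and the number of NA's in each variable is reported
summary(pima[, zero.codes])

# A quick graphical summary of one variable after recoding
hist(pima$diastolic, xlab = "Diastolic blood pressure (mm Hg)", main = "")

Whether a zero insulin value should really be treated as missing is more debatable than for blood pressure; in practice, one would check with whoever collected the data before recoding anything.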