List of Figures

3.1.1 Strip charts of the precip, rivers, and discoveries data . . . 22
3.1.2 (Relative) frequency histograms of the precip data . . . 23
3.1.3 More histograms of the precip data . . . 24
3.1.4 Index plots of the LakeHuron data . . . 27
3.1.5 Bar graphs of the state.region data . . . 29
3.1.6 Pareto chart of the state.division data . . . 31
3.1.7 Dot chart of the state.region data . . . 32
3.6.1 Boxplots of weight by feed type in the chickwts data . . . 50
3.6.2 Histograms of age by education level from the infert data . . . 50
3.6.3 An xyplot of Petal.Length versus Petal.Width by Species in the iris data . . . 51
3.6.4 A coplot of conc versus uptake by Type and Treatment in the CO2 data . . . 52
4.5.1 The birthday problem . . . 89
5.3.1 Graph of the binom(size = 3, prob = 1/2) CDF . . . 115
5.3.2 The binom(size = 3, prob = 0.5) distribution from the distr package . . . 116
5.5.1 The empirical CDF . . . 122
6.5.1 Chi square distribution for various degrees of freedom . . . 152
6.5.2 Plot of the gamma(shape = 13, rate = 1) MGF . . . 155
7.6.1 Graph of a bivariate normal PDF . . . 173
7.9.1 Plot of a multinomial PMF . . . 180
8.2.1 Student’s t distribution for various degrees of freedom . . . 185
8.5.1 Plot of simulated IQRs . . . 190
8.5.2 Plot of simulated MADs . . . 190
9.1.1 Capture-recapture experiment . . . 195
9.1.2 Assorted likelihood functions for fishing, part two . . . 196
9.1.3 Species maximum likelihood . . . 198
9.2.1 Simulated confidence intervals . . . 204
9.2.2 Confidence interval plot for the PlantGrowth data . . . 206
10.2.1 Hypothesis test plot based on normal.and.t.dist from the HH package . . . 223
10.3.1 Hypothesis test plot based on normal.and.t.dist from the HH package . . . 226
10.6.1 Between group versus within group variation . . . 231
10.6.2 Between group versus within group variation . . . 232
10.6.3 Some F plots from the HH package . . . 233
10.7.1 Plot of significance level and power . . . 234
11.1.1 Philosophical foundations of SLR . . . 237
11.1.2 Scatterplot of dist versus speed for the cars data . . . 238
11.2.1 Scatterplot with added regression line for the cars data . . . 241
11.2.2 Scatterplot with confidence/prediction bands for the cars data . . . 248
11.4.1 Normal q-q plot of the residuals for the cars data . . . 253
11.4.2 Plot of standardized residuals against the fitted values for the cars data . . . 255
11.4.3 Plot of the residuals versus the fitted values for the cars data . . . 257
11.5.1 Cook’s distances for the cars data . . . 263
11.5.2 Diagnostic plots for the cars data . . . 265
12.1.1 Scatterplot matrix of trees data . . . 269
12.1.2 3D scatterplot with regression plane for the trees data . . . 270
12.4.1 Scatterplot of Volume versus Girth for the trees data . . . 280
12.4.2 A quadratic model for the trees data . . . 282
12.6.1 A dummy variable model for the trees data . . . 288
13.2.1 Bootstrapping the standard error of the mean, simulated data . . . 300
13.2.2 Bootstrapping the standard error of the median for the rivers data . . . 302
List of Tables

4.1 Sampling k from n objects with urnsamples . . . 86
4.2 Rolling two dice . . . 90
5.1 Correspondence between stats and distr . . . 116
7.1 Maximum U and sum V of a pair of dice rolls (X, Y) . . . 160
7.2 Joint values of U = max(X, Y) and V = X + Y . . . 160
7.3 The joint PMF of (U, V) . . . 160
E.1 Set operations . . . 339
E.2 Differentiation rules . . . 341
E.3 Some derivatives . . . 341
E.4 Some integrals (constants of integration omitted) . . . 342
Chapter 1
An Introduction to Probability and Statistics

This chapter has proved to be the hardest to write, by far. The trouble is that there is so much to say, and so many people have already said it so much better than I could. When I get something I like I will release it here.

In the meantime, there is a lot of information already available to a person with an Internet connection. I recommend starting at Wikipedia, which is not a flawless resource but covers the main ideas and links to reputable sources.

In my lectures I usually tell stories about Fisher, Galton, Gauss, Laplace, Quetelet, and the Chevalier de Mere.

1.1 Probability

The common folklore is that probability has been around for millennia but did not gain the attention of mathematicians until approximately 1654, when the Chevalier de Mere had a question about how to divide a game’s payoff fairly between the two players if the game had to end prematurely.

1.2 Statistics

Statistics concerns data: their collection, analysis, and interpretation. In this book we distinguish between two types of statistics: descriptive and inferential.

Descriptive statistics concerns the summarization of data. We have a data set and we would like to describe the data set in multiple ways. Usually this entails calculating numbers from the data, called descriptive measures, such as percentages, sums, averages, and so forth.

Inferential statistics does more. There is an inference associated with the data set, a conclusion drawn about the population from which the data originated.

I would like to mention that there are two schools of thought in statistics: frequentist and Bayesian. The difference between the schools is related to how the two groups interpret the underlying probability (see Section 4.3). The frequentist school gained a lot of ground among statisticians due in large part to the work of Fisher, Neyman, and Pearson in the early twentieth century. That dominance lasted until inexpensive computing power became widely available; nowadays the Bayesian school is garnering more attention, and at an increasing rate.
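As a small illustration of the descriptive measures mentioned in Section 1.2, the following R sketch computes a sum, an average, and a percentage. The data values and the variable names (scores, total, average, pct_80) are invented for this example and do not come from any data set used in the book.

```r
# Hypothetical data: six made-up scores, used only for illustration.
scores <- c(72, 85, 90, 66, 85, 78)

total   <- sum(scores)                # sum of the observations
average <- mean(scores)               # arithmetic mean
pct_80  <- 100 * mean(scores >= 80)   # percentage of scores at or above 80

cat("sum:", total, "\n")
cat("average:", average, "\n")
cat("percent at or above 80:", pct_80, "\n")
```

Each of these numbers only summarizes the six values at hand; saying anything about a larger population from which they might have been drawn would be a matter for inferential statistics.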
Chapter 2
An Introduction to R

2.1 Downloading and Installing R

The instructions for obtaining R largely depend on the user’s hardware and operating system. The R Project has written an R Installation and Administration manual with complete, precise instructions about what to do, together with all sorts of additional information. The following is just a primer to get a person started.

2.1.1 Installing R

Visit one of the links below to download the latest version of R for your operating system:

Microsoft Windows: http://cran.r-project.org/bin/windows/base/
MacOS: http://cran.r-project.org/bin/macosx/
Linux: http://cran.r-project.org/bin/linux/

On Microsoft Windows, click the R-x.y.z.exe installer to start installation. When it asks for "Customized startup options", specify Yes. In the next window, be sure to select the SDI (single document interface) option; this is useful later when we discuss three-dimensional plots with the rgl package [1].

Installing R on a USB drive (Windows)

With this option you can use R portably and without administrative privileges. There is an entry in the R for Windows FAQ about this. Here is the procedure I use:

1. Download the Windows installer above and start installation as usual. When it asks where to install, navigate to the top-level directory of the USB drive instead of the default C drive.
2. When it asks whether to modify the Windows registry, uncheck the box; we do NOT want to tamper with the registry.
3. After installation, change the name of the folder from R-x.y.z to just plain R. (Even quicker: do this in step 1.)
4. Download the following shortcut to the top-level directory of the USB drive, right beside the R folder, not inside the folder.