Multivariate Linear Models in R

An Appendix to An R Companion to Applied Regression, Second Edition

John Fox & Sanford Weisberg

last revision: 28 July 2011

Abstract

The multivariate linear model is

    Y (n×m) = X (n×(k+1)) B ((k+1)×m) + E (n×m)

where Y is a matrix of n observations on m response variables; X is a model matrix with columns for k + 1 regressors, typically including an initial column of 1s for the regression constant; B is a matrix of regression coefficients, one column for each response variable; and E is a matrix of errors. This model can be fit with the lm function in R, where the left-hand side of the model comprises a matrix of response variables and the right-hand side is specified exactly as for a univariate linear model (i.e., one with a single response variable). This appendix to Fox and Weisberg (2011) explains how to use the Anova and linearHypothesis functions in the car package to test hypotheses for parameters in multivariate linear models, including models for repeated-measures data.

1 Basic Ideas

The multivariate linear model accommodates two or more response variables. The theory of multivariate linear models is developed very briefly in this section. Much more extensive treatments may be found in the recommended reading for this appendix.

The multivariate general linear model is

    Y (n×m) = X (n×(k+1)) B ((k+1)×m) + E (n×m)

where Y is a matrix of n observations on m response variables; X is a model matrix with columns for k + 1 regressors, typically including an initial column of 1s for the regression constant; B is a matrix of regression coefficients, one column for each response variable; and E is a matrix of errors.¹

¹A typographical note: B and E are, respectively, the upper-case Greek letters Beta and Epsilon. Because these are indistinguishable from the corresponding Roman letters B and E, we will denote the estimated regression coefficients as B̂ and the residuals as Ê.

The contents of the model matrix are exactly as in the univariate linear model (as described in Ch. 4 of An R Companion to Applied Regression, Fox and Weisberg, 2011, hereafter the "R Companion"), and may therefore contain dummy regressors representing factors, polynomial or regression-spline terms, interaction regressors, and so on.
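For example (an illustration of our own, anticipating the iris data used in Section 2), the model.matrix function displays the regressors that R generates from a model formula; here, a column of 1s plus dummy regressors for the factor Species:

> head(model.matrix(~ Species, data=iris))  # first 6 rows: all setosa
  (Intercept) Speciesversicolor Speciesvirginica
1           1                 0                0
2           1                 0                0
3           1                 0                0
4           1                 0                0
5           1                 0                0
6           1                 0                0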
The assumptions of the multivariate linear model concern the behavior of the errors: Let ε′ᵢ represent the ith row of E. Then ε′ᵢ ∼ N_m(0, Σ), where Σ is a nonsingular error-covariance matrix, constant across observations; ε′ᵢ and ε′ⱼ are independent for i ≠ j; and X is fixed or independent of E. We can write more compactly that vec(E) ∼ N_nm(0, I_n ⊗ Σ). Here, vec(E) ravels the error matrix row-wise into a vector, and ⊗ is the Kronecker-product operator.

The maximum-likelihood estimator of B in the multivariate linear model is equivalent to equation-by-equation least squares for the individual responses:

    B̂ = (X′X)⁻¹X′Y

Procedures for statistical inference in the multivariate linear model, however, take account of the fact that there are several, generally correlated, responses.

Paralleling the decomposition of the total sum of squares into regression and residual sums of squares in the univariate linear model, there is in the multivariate linear model a decomposition of the total sum-of-squares-and-cross-products (SSP) matrix into regression and residual SSP matrices. We have

    SSP_T (m×m) = Y′Y − n ȳ ȳ′
                = Ê′Ê + (Ŷ′Ŷ − n ȳ ȳ′)
                = SSP_R + SSP_Reg

where ȳ is the (m × 1) vector of means for the response variables; Ŷ = XB̂ is the matrix of fitted values; and Ê = Y − Ŷ is the matrix of residuals.

Many hypothesis tests of interest can be formulated by taking differences in SSP_Reg (or, equivalently, SSP_R) for nested models. Let SSP_H represent the incremental SSP matrix for a hypothesis. Multivariate tests for the hypothesis are based on the m eigenvalues λⱼ of SSP_H SSP_R⁻¹ (the hypothesis SSP matrix "divided by" the residual SSP matrix), that is, the values of λ for which

    det(SSP_H SSP_R⁻¹ − λI_m) = 0

The several commonly employed multivariate test statistics are functions of these eigenvalues, with the sums and product taken over j = 1, …, m:

    Pillai-Bartlett Trace,    T_PB = Σⱼ λⱼ/(1 + λⱼ)
    Hotelling-Lawley Trace,   T_HL = Σⱼ λⱼ
    Wilks's Lambda,           Λ = Πⱼ 1/(1 + λⱼ)                        (1)
    Roy's Maximum Root,       λ₁

By convention, the eigenvalues of SSP_H SSP_R⁻¹ are arranged in descending order, and so λ₁ is the largest eigenvalue. There are F approximations to the null distributions of these test statistics.
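To make the computations concrete, here is a minimal sketch of our own (not the implementation used by car), which verifies the equation-by-equation least-squares result and computes the statistics of Equation 1 for the hypothesis that all coefficients apart from the regression constants are zero. It uses the iris data introduced in Section 2:

> Y <- as.matrix(iris[, 1:4])               # (n×m) response matrix
> X <- model.matrix(~ Species, data=iris)   # (n×(k+1)) model matrix
> B <- solve(t(X) %*% X) %*% t(X) %*% Y     # (X'X)^{-1} X'Y
> all.equal(B, coef(lm(Y ~ Species, data=iris)), check.attributes=FALSE)
[1] TRUE
> E <- Y - X %*% B                          # matrix of residuals, E-hat
> SSP.R <- t(E) %*% E                       # residual SSP matrix
> ybar <- colMeans(Y)
> SSP.T <- t(Y) %*% Y - nrow(Y) * outer(ybar, ybar)   # total SSP matrix
> SSP.H <- SSP.T - SSP.R                    # hypothesis SSP for the Species term
> lambda <- Re(eigen(SSP.H %*% solve(SSP.R))$values)  # in descending order
> c(Pillai=sum(lambda/(1 + lambda)), Hotelling.Lawley=sum(lambda),
+   Wilks=prod(1/(1 + lambda)), Roy=lambda[1])

Because the Species term has s = 2 degrees of freedom, only p = min(2, m) = 2 of the λⱼ are nonzero here; the remaining eigenvalues are numerically negligible and do not affect the sums or the product.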
For example, for Wilks's Lambda, let s represent the degrees of freedom for the term that we are testing (i.e., the number of columns of the model matrix X pertaining to the term), and define

    r = n − k − 1 − (m − s + 1)/2
    u = (ms − 2)/4                                                     (2)
    t = √[(m²s² − 4)/(m² + s² − 5)]  for m² + s² − 5 > 0, and t = 1 otherwise

Rao (1973, p. 556) shows that under the null hypothesis,

    F₀ = (1 − Λ^(1/t))/Λ^(1/t) × (rt − 2u)/(ms)                        (3)

follows an approximate F-distribution with ms and rt − 2u degrees of freedom, and that this result is exact if min(m, s) ≤ 2 (a circumstance under which all four test statistics are equivalent).

Even more generally, suppose that we want to test the linear hypothesis

    H₀: L (q×(k+1)) B ((k+1)×m) = C (q×m)                              (4)

where L is a hypothesis matrix of full row rank q ≤ k + 1, and the right-hand-side matrix C consists of constants (usually 0s).² Then the SSP matrix for the hypothesis is

    SSP_H = (B̂′L′ − C′) [L(X′X)⁻¹L′]⁻¹ (LB̂ − C)

and the various test statistics are based on the p = min(q, m) nonzero eigenvalues of SSP_H SSP_R⁻¹ (the formulas in Equations 1, 2, and 3 are then adjusted by substituting p for m).

²Cf. Sec. 4.4.5 of the R Companion for linear hypotheses in univariate linear models.

When a multivariate response arises because a variable is measured on different occasions, or under different circumstances (but for the same individuals), it is also of interest to formulate hypotheses concerning comparisons among the responses. This situation, called a repeated-measures design, can be handled by linearly transforming the responses using a suitable model matrix, for example extending the linear hypothesis in Equation 4 to

    H₀: L (q×(k+1)) B ((k+1)×m) P (m×v) = C (q×v)                      (5)

Here, the response-transformation matrix P provides contrasts in the responses (see, e.g., Hand and Taylor, 1987, or O'Brien and Kaiser, 1985). The SSP matrix for the hypothesis is the (v × v) matrix

    SSP_H = (P′B̂′L′ − C′) [L(X′X)⁻¹L′]⁻¹ (LB̂P − C)

and test statistics are based on the p = min(q, v) nonzero eigenvalues of SSP_H (P′SSP_R P)⁻¹.

2 Fitting and Testing Multivariate Linear Models in R

Multivariate linear models are fit in R with the lm function. The procedure is the essence of simplicity: The left-hand side of the model is a matrix of responses, with each column representing a response variable and each row an observation; the right-hand side of the model and all other arguments to lm are precisely the same as for a univariate linear model (as described in Chap. 4 of the R Companion). Typically, the response matrix is composed from individual response variables via the cbind function.

The anova function in the standard R distribution is capable of handling multivariate linear models (see Dalgaard, 2007), but the Anova and linearHypothesis functions in the car package may also be employed, in a manner entirely analogous to that described in the R Companion (Sec. 4.4) for univariate linear models. We briefly demonstrate the use of these functions in this section.
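As a minimal sketch of this workflow (using the iris data introduced below, with a model name of our choosing; output omitted):

> mod.iris <- lm(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
+     ~ Species, data=iris)    # matrix response composed with cbind
> Anova(mod.iris)              # multivariate tests for the Species term
> # A linear hypothesis in the form of Equation 4: the contrast between the
> # versicolor and virginica coefficients, for all four responses at once:
> linearHypothesis(mod.iris, "Speciesversicolor = Speciesvirginica")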
Figure 1: Three species of irises in the Anderson/Fisher data set: setosa (left), versicolor (center), and virginica (right). Source: The photographs are respectively by Radomil Binek, Danielle Langlois, and Frank Mayfield, and are distributed under the Creative Commons Attribution-Share Alike 3.0 Unported license (first and second images) or the Creative Commons Attribution-Share Alike 2.0 Generic license (third image); they were obtained from the Wikimedia Commons.

To illustrate multivariate linear models, we will use data collected by Anderson (1935) on three species of irises in the Gaspé Peninsula of Quebec, Canada. The data are of historical interest in statistics, because they were employed by R. A. Fisher (1936) to introduce the method of discriminant analysis. The data frame iris is part of the standard R distribution:

> library(car)
> some(iris)
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
25           4.8         3.4          1.9         0.2     setosa
47           5.1         3.8          1.6         0.2     setosa
67           5.6         3.0          4.5         1.5 versicolor
73           6.3         2.5          4.9         1.5 versicolor
104          6.3         2.9          5.6         1.8  virginica
109          6.7         2.5          5.8         1.8  virginica
113          6.8         3.0          5.5         2.1  virginica
131          7.4         2.8          6.1         1.9  virginica
140          6.9         3.1          5.4         2.1  virginica
149          6.2         3.4          5.4         2.3  virginica

The first four variables in the data set represent measurements (in cm) of parts of the flowers, while the final variable specifies the species of iris. (Sepals are the green leaves that comprise the calyx of the plant, which encloses the flower.) Photographs of examples of the three species of irises (setosa, versicolor, and virginica) appear in Figure 1. Figure 2 is a scatterplot matrix of the four measurements classified by species, showing within-species 50 and 95% concentration ellipses (see Sec. 4.3.8 of the R Companion); Figure 3 shows boxplots for each of the responses by species:

> scatterplotMatrix(~ Sepal.Length + Sepal.Width + Petal.Length +
+     Petal.Width | Species,
+     data=iris, smooth=FALSE, reg.line=FALSE, ellipse=TRUE,
+     by.groups=TRUE, diagonal="none")
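The code that produces Figure 3 falls outside this excerpt; a minimal sketch of our own that draws one boxplot per response (the exact call used by the authors is not shown here) is:

> par(mfrow=c(2, 2))    # 2-by-2 grid of panels, one per response
> for (response in c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))
+     boxplot(iris[, response] ~ iris$Species, ylab=response)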
Figure 2: Scatterplot matrix for the Anderson/Fisher iris data, showing within-species 50 and 95% concentration ellipses.