TABLE 2  Eigenvalues and Percentage of Explained Inertia by Each Component

Component   λ (eigenvalue)   Cumulated (eigenvalues)   Percent of Inertia   Cumulated (percentage)
1           392              392                       88.29                88.29
2           52               444                       11.71                100.00

The squared cosine indicates the contribution of a component to the squared distance of the observation to the origin. It corresponds to the square of the cosine of the angle from the right triangle made with the origin, the observation, and its projection on the component, and is computed as:

$$\cos^2_{i,\ell} = \frac{f^2_{i,\ell}}{\sum_{\ell} f^2_{i,\ell}} = \frac{f^2_{i,\ell}}{d^2_{i,g}} \qquad (11)$$

where $d^2_{i,g}$ is the squared distance of a given observation to the origin. The squared distance, $d^2_{i,g}$, is computed (thanks to the Pythagorean theorem) as the sum of the squared values of all the factor scores of this observation (cf. Eq. 4). Components with a large value of $\cos^2_{i,\ell}$ contribute a relatively large portion to the total distance and therefore these components are important for that observation.

The distance to the center of gravity is also defined for supplementary observations, and the squared cosine can be computed and is meaningful. Therefore, the value of $\cos^2$ can help find the components that are important to interpret both active and supplementary observations.

Loading: Correlation of a Component and a Variable

The correlation between a component and a variable estimates the information they share. In the PCA framework, this correlation is called a loading. Note that the sum of the squared coefficients of correlation between a variable and all the components is equal to 1. As a consequence, the squared loadings are easier to interpret than the loadings (because the squared loadings give the proportion of the variance of the variables explained by the components). Table 3 gives the loadings as well as the squared loadings for the word length and definition example.

TABLE 3  Loadings (i.e., Coefficients of Correlation between Variables and Components) and Squared Loadings

            Loadings             Squared Loadings     Q
Component   Y        W           Y        W           Y        W
1           −0.9927  −0.9810     0.9855   0.9624      −0.5369  0.8437
2           0.1203   −0.1939     0.0145   0.0376      0.8437   0.5369
Σ                                1.0000   1.0000

The elements of matrix Q are also provided. Y is the number of lines of the definition and W is the length of the word (in letters).

It is worth noting that the term 'loading' has several interpretations. For example, as previously mentioned, the elements of matrix Q (cf. Eq. B.1) are also called loadings. This polysemy is a potential source of confusion, and therefore it is worth checking what specific meaning of the word 'loadings' has been chosen when looking at the outputs of a program or when reading papers on PCA. In general, however, different meanings of 'loadings' lead to equivalent interpretations of the components. This happens because the different types of loadings differ mostly by their type of normalization. For example, the correlations of the variables with the components are normalized such that the sum of the squared correlations of a given variable is equal to one; by contrast, the elements of Q are normalized such that the sum of the squared elements of a given component is equal to one.
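To make these formulas concrete, here is a minimal numpy sketch (our own toy data and variable names, not code from this article) that computes the loadings as correlations and the squared cosines of Eq. 11:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
X -= X.mean(axis=0)                       # center the columns (covariance PCA)

# PCA via the SVD: X = P @ diag(delta) @ Qt; factor scores F = P @ diag(delta)
P, delta, Qt = np.linalg.svd(X, full_matrices=False)
F = P * delta

# Loadings as correlations between each variable and each component
loadings = np.array([[np.corrcoef(X[:, j], F[:, l])[0, 1]
                      for l in range(F.shape[1])]
                     for j in range(X.shape[1])])
print((loadings ** 2).sum(axis=1))        # sums to 1 for each variable

# Squared cosines (Eq. 11): share of each observation's squared distance
# to the origin captured by each component
d2 = (F ** 2).sum(axis=1, keepdims=True)  # d_{i,g}^2, via the Pythagorean theorem (Eq. 4)
cos2 = F ** 2 / d2
print(cos2.sum(axis=1))                   # sums to 1 for each observation
```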
Plotting the Correlations/Loadings of the Variables with the Components

The variables can be plotted as points in the component space using their loadings as coordinates. This representation differs from the plot of the observations: the observations are represented by their projections, but the variables are represented by their correlations. Recall that the sum of the squared loadings for a variable is equal to one. Remember, also, that a circle is defined as the set of points with the property that the sum of their squared coordinates is equal to a constant. As a consequence, when the data are perfectly represented by only two components, the sum of the squared loadings is equal to one, and therefore, in this case, the loadings will be positioned on a circle which is called the circle of correlations. When more than two components are needed to represent the data perfectly, the variables will be positioned inside the circle of correlations.

The closer a variable is to the circle of correlations, the better we can reconstruct this variable from the first two components (and the more important it is to interpret these components); the closer a variable is to the center of the plot, the less important it is for the first two components.

Figure 4 shows the plot of the loadings of the variables on the components. Each variable is a point whose coordinates are given by its loadings on the principal components.

[FIGURE 4 | Circle of correlations and plot of the loadings of (a) the variables (Length in number of letters; Number of lines of the definition) with principal components 1 and 2, and (b) the variables and supplementary variables (Frequency; # Entries) with principal components 1 and 2. Note that the supplementary variables are not positioned on the unit circle.]
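A short matplotlib sketch of such a circle-of-correlations plot follows (our own illustration, not the article's figure; the loadings values are transcribed from Table 3):

```python
import numpy as np
import matplotlib.pyplot as plt

loadings = np.array([[-0.9927, 0.1203],   # values transcribed from Table 3
                     [-0.9810, -0.1939]])
names = ['Y (lines of definition)', 'W (length)']

fig, ax = plt.subplots(figsize=(4, 4))
theta = np.linspace(0, 2 * np.pi, 200)
ax.plot(np.cos(theta), np.sin(theta), color='gray')   # the unit circle
for (x, y), name in zip(loadings, names):
    ax.plot(x, y, 'o')
    ax.annotate(name, (x, y))
ax.axhline(0, lw=0.5, color='gray')
ax.axvline(0, lw=0.5, color='gray')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_aspect('equal')
plt.show()
```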
We can also use supplementary variables to enrich the interpretation. A supplementary variable should be measured for the same observations used for the analysis (for all of them or for part of them, because we only need to compute a coefficient of correlation). After the analysis has been performed, the coefficients of correlation (i.e., the loadings) between the supplementary variables and the components are computed. Then the supplementary variables are displayed in the circle of correlations using the loadings as coordinates.

For example, we can add two supplementary variables to the word length and definition example. These data are shown in Table 4.

TABLE 4  Supplementary Variables for the Example Length of Words and Number of Lines

Word          Frequency   # Entries
Bag           8           6
Across        230         3
On            700         12
Insane        1           2
By            500         7
Monastery     1           1
Relief        9           1
Slope         2           6
Scoundrel     1           1
With          700         5
Neither       7           2
Pretentious   1           1
Solid         4           5
This          500         9
For           900         7
Therefore     3           1
Generality    1           1
Arise         10          4
Blot          1           4
Infectious    1           2

'Frequency' is expressed as number of occurrences per 100,000 words; '# Entries' is obtained by counting the number of entries for the word in the dictionary.

A table of loadings for the supplementary variables can be computed from the coefficients of correlation between these variables and the components (see Table 5). Note that, contrary to the active variables, the squared loadings of the supplementary variables do not add up to 1.

TABLE 5  Loadings (i.e., Coefficients of Correlation) and Squared Loadings between Supplementary Variables and Components

            Loadings                Squared Loadings
Component   Frequency   # Entries   Frequency   # Entries
1           −0.3012     0.6999      0.0907      0.4899
2           −0.7218     −0.4493     0.5210      0.2019
Σ                                   0.6117      0.6918
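The computation is a plain correlation, as the following sketch illustrates (our own toy data standing in for Table 4; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))             # active variables
X_sup = rng.normal(size=(20, 2))         # supplementary variables, e.g., the
                                         # Frequency and # Entries columns of Table 4
X -= X.mean(axis=0)
P, delta, Qt = np.linalg.svd(X, full_matrices=False)
F = P * delta                            # factor scores of the active analysis

# Supplementary loadings: correlations of each supplementary variable
# with the factor scores (the supplementary data play no role in the SVD)
sup_loadings = np.array([[np.corrcoef(X_sup[:, j], F[:, l])[0, 1]
                          for l in range(F.shape[1])]
                         for j in range(X_sup.shape[1])])
print(sup_loadings)
print((sup_loadings ** 2).sum(axis=1))   # need not sum to 1, unlike active variables
```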
STATISTICAL INFERENCE: EVALUATING THE QUALITY OF THE MODEL

Fixed Effect Model

The results of PCA so far correspond to a fixed effect model (i.e., the observations are considered to be the population of interest, and conclusions are limited to these specific observations). In this context, PCA is descriptive and the amount of the variance of X explained by a component indicates its importance.

For a fixed effect model, the quality of the PCA model using the first M components is obtained by first computing the estimated matrix, denoted $\hat{X}^{[M]}$, which is matrix X reconstituted with the first M components. The formula for this estimation is obtained by combining Eqs 1, 5, and 6 in order to obtain

$$X = FQ^T = XQQ^T. \qquad (12)$$

Then, the matrix $\hat{X}^{[M]}$ is built back using Eq. 12, keeping only the first M components:

$$\hat{X}^{[M]} = P^{[M]}\Delta^{[M]}Q^{[M]T} = F^{[M]}Q^{[M]T} = XQ^{[M]}Q^{[M]T} \qquad (13)$$

where $P^{[M]}$, $\Delta^{[M]}$, and $Q^{[M]}$ represent, respectively, the matrices P, $\Delta$, and Q with only their first M components. Note, incidentally, that Eq. 7 can be rewritten in the current context as:

$$X = \hat{X}^{[M]} + E = F^{[M]}Q^{[M]T} + E \qquad (14)$$

(where E is the error matrix, which is equal to $X - \hat{X}^{[M]}$).

To evaluate the quality of the reconstitution of X with M components, we evaluate the similarity between X and $\hat{X}^{[M]}$. Several coefficients can be used for this task [see, e.g., Refs 16-18]. The squared coefficient of correlation is sometimes used, as well as the $R_V$ coefficient.18,19 The most popular coefficient, however, is the residual sum of squares (RESS). It is computed as:

$$\mathrm{RESS}_M = \|X - \hat{X}^{[M]}\|^2 = \mathrm{trace}\left(E^T E\right) = \mathcal{I} - \sum_{\ell=1}^{M} \lambda_\ell \qquad (15)$$

where $\|\,\|$ denotes the norm (i.e., the square root of the sum of all the squared elements of a matrix), the trace of a matrix is the sum of its diagonal elements, and $\mathcal{I}$ is the total inertia of the data table. The smaller the value of RESS, the better the PCA model. For a fixed effect model, a larger M gives a better estimation of $\hat{X}^{[M]}$, and the matrix X is always perfectly reconstituted with L components (recall that L is the rank of X).

In addition, Eq. 12 can be adapted to compute the estimation of the supplementary observations as

$$\hat{x}^{[M]}_{\sup} = x_{\sup} Q^{[M]} Q^{[M]T}. \qquad (16)$$
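The following numpy sketch (our own code on toy data, not the article's) checks Eqs 13 and 15 numerically, including the identity between the residual norm and the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
X -= X.mean(axis=0)

P, delta, Qt = np.linalg.svd(X, full_matrices=False)
Q = Qt.T
M = 2
X_hat = X @ Q[:, :M] @ Q[:, :M].T         # Eq. 13: X-hat^[M] = X Q^[M] Q^[M]T

ress = ((X - X_hat) ** 2).sum()           # Eq. 15, left-hand form
eigenvalues = delta ** 2                  # eigenvalues of X^T X
# Eq. 15, right-hand form: total inertia minus the first M eigenvalues
print(np.isclose(ress, eigenvalues.sum() - eigenvalues[:M].sum()))
```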
Random Effect Model

In most applications, the set of observations represents a sample from a larger population. In this case, the goal is to estimate the value of new observations from this population. This corresponds to a random effect model. In order to estimate the generalization capacity of the PCA model, we cannot use standard parametric procedures. Therefore, the performance of the PCA model is evaluated using computer-based resampling techniques, such as the bootstrap and cross-validation techniques, where the data are separated into a learning set and a testing set. A popular cross-validation technique is the jackknife (aka the 'leave one out' procedure). In the jackknife,20-22 each observation is dropped from the set in turn and the remaining observations constitute the learning set. The learning set is then used to estimate (using Eq. 16) the left-out observation, which constitutes the testing set. Using this procedure, each observation is estimated according to a random effect model. The predicted observations are then stored in a matrix denoted $\tilde{X}$.

The overall quality of the PCA random effect model using M components is evaluated as the similarity between X and $\tilde{X}^{[M]}$. As with the fixed effect model, this can also be done with a squared coefficient of correlation or (better) with the $R_V$ coefficient.
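A minimal sketch of this jackknife loop follows (our own implementation on toy data; centering within each learning set is one reasonable convention, not prescribed by the article):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
M = 2

X_tilde = np.empty_like(X)
for i in range(X.shape[0]):
    X_train = np.delete(X, i, axis=0)            # learning set without observation i
    mean = X_train.mean(axis=0)                  # preprocessing from the learning set only
    _, _, Qt = np.linalg.svd(X_train - mean, full_matrices=False)
    Q_M = Qt[:M].T                               # first M loadings, Q^[M]
    # Eq. 16 applied to the left-out row (the testing set)
    X_tilde[i] = mean + (X[i] - mean) @ Q_M @ Q_M.T

# Residual sum of squares of the jackknifed reconstruction;
# this is the PRESS criterion defined next (Eq. 17)
print(((X - X_tilde) ** 2).sum())
```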
Similar to RESS, one can use the predicted residual sum of squares (PRESS). It is computed as:

$$\mathrm{PRESS}_M = \|X - \tilde{X}^{[M]}\|^2. \qquad (17)$$

The smaller the PRESS, the better the quality of the estimation for a random model.

Contrary to what happens with the fixed effect model, the matrix X is not always perfectly reconstituted with all L components. This is particularly the case when the number of variables is larger than the number of observations (a configuration known as the 'small N, large P' problem in the literature).

How Many Components?

Often, only the important information needs to be extracted from a data matrix. In this case, the problem is to figure out how many components need to be considered. This problem is still open, but there are some guidelines [see, e.g., Refs 8, 9, 23]. A first procedure is to plot the eigenvalues according to their size [the so-called 'scree'; see Refs 8, 24 and Table 2] and to see if there is a point in this graph (often called an 'elbow') such that the slope of the graph goes from 'steep' to 'flat'; only the components before the elbow are kept. This procedure, somewhat subjective, is called the scree or elbow test.

Another standard tradition is to keep only the components whose eigenvalue is larger than the average. Formally, this amounts to keeping the $\ell$-th component if

$$\lambda_\ell > \frac{1}{L}\sum_{\ell=1}^{L}\lambda_\ell = \frac{1}{L}\,\mathcal{I} \qquad (18)$$

(where L is the rank of X and $\mathcal{I}$ is the total inertia). For a correlation PCA, this rule boils down to the standard advice to 'keep only the eigenvalues larger than 1' [see, e.g., Ref 25]. However, this procedure can lead to ignoring important information [see Ref 26 for an example of this problem].
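Both procedures are easy to apply in practice; the sketch below (toy data, our own code) applies the average-eigenvalue rule of Eq. 18 to a correlation PCA, where the average eigenvalue is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized columns: correlation PCA

# Eigenvalues of the correlation matrix via the SVD of Z
eigenvalues = np.linalg.svd(Z, compute_uv=False) ** 2 / (X.shape[0] - 1)
print(eigenvalues)                        # inspect for an 'elbow' (scree test)
keep = eigenvalues > eigenvalues.mean()   # Eq. 18; the mean is 1 for a correlation PCA
print(int(keep.sum()), 'component(s) kept')
```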
Random Model

As mentioned earlier, when using a random model, the quality of the prediction does not always increase with the number of components of the model. In fact, when the number of variables exceeds the number of observations, quality typically increases and then decreases. When the quality of the prediction decreases as the number of components increases, this is an indication that the model is overfitting the data (i.e., the information in the learning set is not useful to fit the testing set). Therefore, it is important to determine the optimal number of components to keep when the goal is to generalize the conclusions of an analysis to new data.

A simple approach stops adding components when PRESS stops decreasing. A more elaborate approach [see, e.g., Refs 27-31] begins by computing, for each component $\ell$, a quantity denoted $Q^2_\ell$, defined as:

$$Q^2_\ell = 1 - \frac{\mathrm{PRESS}_\ell}{\mathrm{RESS}_{\ell-1}} \qquad (19)$$

with $\mathrm{PRESS}_\ell$ ($\mathrm{RESS}_\ell$) being the value of PRESS (RESS) for the $\ell$-th component (where $\mathrm{RESS}_0$ is equal to the total inertia). Only the components with $Q^2_\ell$ greater than or equal to an arbitrary critical value (usually $1 - 0.95^2 = 0.0975$) are kept [an alternative set of critical values sets the threshold to 0.05 when $I \le 100$ and to 0 when $I > 100$; see Ref 28].
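The sketch below applies Eq. 19 to hypothetical RESS and PRESS values (illustration only; these numbers are not from the example data):

```python
import numpy as np

total_inertia = 100.0                     # RESS_0 is the total inertia
ress = np.array([20.0, 5.0])              # hypothetical RESS_1, RESS_2
press = np.array([30.0, 18.0])            # hypothetical PRESS_1, PRESS_2

ress_prev = np.concatenate(([total_inertia], ress[:-1]))  # RESS_0, RESS_1
q2 = 1 - press / ress_prev                # Eq. 19
keep = q2 >= 1 - 0.95 ** 2                # usual critical value, 0.0975
print(q2, keep)
```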
Another approach to deciding upon the number of components to keep, also based on cross-validation, uses the index $W_\ell$ derived from Refs 32 and 33. In contrast to $Q^2_\ell$, which depends on both RESS and PRESS, the index $W_\ell$ depends only upon PRESS. It is computed for the $\ell$-th component as

$$W_\ell = \frac{\mathrm{PRESS}_{\ell-1} - \mathrm{PRESS}_\ell}{\mathrm{PRESS}_\ell} \times \frac{df_{\mathrm{residual},\,\ell}}{df_\ell}, \qquad (20)$$

where $\mathrm{PRESS}_0$ is the inertia of the data table and $df_\ell$ is the number of degrees of freedom for the $\ell$-th component, equal to

$$df_\ell = I + J - 2\ell, \qquad (21)$$

and $df_{\mathrm{residual},\,\ell}$ is the residual number of degrees of freedom, which is equal to the total number of degrees of freedom of the table [equal to $J(I-1)$] minus the number of degrees of freedom used by the previous components. The value of $df_{\mathrm{residual},\,\ell}$ is obtained as:

$$df_{\mathrm{residual},\,\ell} = J(I-1) - \sum_{k=1}^{\ell}(I + J - 2k) = J(I-1) - \ell\,(I + J - \ell - 1). \qquad (22)$$

Most of the time, $Q^2_\ell$ and $W_\ell$ will agree on the number of components to keep; however, $W_\ell$ can give a more conservative estimate than $Q^2_\ell$. When J is smaller than I, the values of both $Q^2_L$ and $W_L$ are meaningless because they both involve a division by zero.
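The following sketch applies Eqs 20-22 to a hypothetical PRESS sequence (the numbers are ours, for illustration):

```python
import numpy as np

I, J = 20, 4                                        # observations, variables
press = np.array([100.0, 55.0, 40.0, 36.0, 35.0])   # hypothetical PRESS_0 (total
                                                    # inertia) through PRESS_4
for l in range(1, J + 1):
    df_l = I + J - 2 * l                            # Eq. 21
    df_res = J * (I - 1) - l * (I + J - l - 1)      # Eq. 22, closed form
    W = (press[l - 1] - press[l]) / press[l] * df_res / df_l   # Eq. 20
    print(f'component {l}: W = {W:.2f}')
```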