COMPUTERS GEOSCIENCES ELSEVIER Computers Geosciences 31 (2005)579-587 www.elsevier.com/locate/cageo Multivariate outlier detection in exploration geochemistry* Peter Filzmoser*,Robert G.Garrett,Clemens Reimann "Institute of Statistics and Probability Theory.Vienna University of Technology.Wiedner Hauptstr.8-10.A-1040 Wien.Austria bGeological Survey of Canada,Natural Resources Canada,601 Booth Street,Ottawa,Ontario,Canada,KIA 0E8 Geological Survey of Norway.N-7491 Trondheim.Norway Received 16 November 2004:accepted 16 November 2004 Abstract A new method for multivariate outlier detection able to distinguish between extreme values of a normal distribution and values originating from a different distribution (outliers)is presented.To facilitate visualising multivariate outliers spatially on a map,the multivariate outlier plot,is introduced.In this plot different symbols refer to a distance measure from the centre of the distribution,taking into account the shape of the distribution,and different colours are used to signify the magnitude of the values for each variable.The method is illustrated using a real geochemical data set from far-northern Europe.It is demonstrated that important processes such as the input of metals from contamination sources and the contribution of sea-salts via marine aerosols to the soil can be identified and separated. C 2004 Elsevier Ltd.All rights reserved. Keywords:Multivariate outliers:Robust statistics:Exploration geochemistry:Background 1.Introduction tances.The definition of an outlier limit or threshold, dividing background data from outliers.has found The detection of data outliers and unusual data much attention in the geochemical literature and to date structures is one of the main tasks in the statistical no universally applicable method of identifying outliers analysis of geochemical data.Traditionally,despite the has been proposed (see discussion in Reimann et al., fact that geochemistry data sets are almost always 2005).In this context,background is defined by the multivariate,outliers are most frequently sought for properties,location and spread,of geochemical samples each single variable in a given data set(Reimann et al., that represent the natural variation of the material being 2005).The search for outliers is usually based on studied in a specific area that are uninfluenced by location and spread of the data.The higher (lower)the extraneous and exotic processes such as those related to analytical result of a sample,the greater is the distance rare rock types,mineral deposit forming processes,or of the observation from the central location of all anthropogenic contamination.In geochemistry,outliers observations;outliers thus,typically,have large dis- are generally observations resulting from a secondary process and not extreme values from the background *Code available from server at http://cran.r-project.org/ distribution.Samples where the analytical values are *Corresponding author.Tel.:+431 58801 10733; derived from a secondary process-be it mineralisation fax:+4315880110799. or contamination-do not need to be especially high E-mail addresses:p.filzmoser@tuwien.ac.at (P.Filzmoser). (or low)in relation to all values of a variable in a data garrett@gsc.NRCan.gc.ca(R.G.Garrett), set,and thus attempts to identify these samples with Clemens.Reimann@ngu.no(C.Reimann). classical univariate methods commonly fail.However, 0098-3004/S-see front matter 2004 Elsevier Ltd.All rights reserved. doi:10.1016j.cageo.2004.11.013
Computers & Geosciences 31 (2005) 579–587 Multivariate outlier detection in exploration geochemistry$ Peter Filzmosera,, Robert G. Garrettb , Clemens Reimannc a Institute of Statistics and Probability Theory, Vienna University of Technology, Wiedner Hauptstr. 8-10, A-1040 Wien, Austria b Geological Survey of Canada, Natural Resources Canada, 601 Booth Street, Ottawa, Ontario, Canada, K1A 0E8 c Geological Survey of Norway, N-7491 Trondheim, Norway Received 16 November 2004; accepted 16 November 2004 Abstract A new method for multivariate outlier detection able to distinguish between extreme values of a normal distribution and values originating from a different distribution (outliers) is presented. To facilitate visualising multivariate outliers spatially on a map, the multivariate outlier plot, is introduced. In this plot different symbols refer to a distance measure from the centre of the distribution, takinginto account the shape of the distribution, and different colours are used to signify the magnitude of the values for each variable. The method is illustrated using a real geochemical data set from far-northern Europe. It is demonstrated that important processes such as the input of metals from contamination sources and the contribution of sea-salts via marine aerosols to the soil can be identified and separated. r 2004 Elsevier Ltd. All rights reserved. Keywords: Multivariate outliers; Robust statistics; Exploration geochemistry; Background 1. Introduction The detection of data outliers and unusual data structures is one of the main tasks in the statistical analysis of geochemical data. Traditionally, despite the fact that geochemistry data sets are almost always multivariate, outliers are most frequently sought for each single variable in a given data set (Reimann et al., 2005). The search for outliers is usually based on location and spread of the data. The higher (lower) the analytical result of a sample, the greater is the distance of the observation from the central location of all observations; outliers thus, typically, have large distances. The definition of an outlier limit or threshold, dividingbackground data from outliers, has found much attention in the geochemical literature and to date no universally applicable method of identifyingoutliers has been proposed (see discussion in Reimann et al., 2005). In this context, background is defined by the properties, location and spread, of geochemical samples that represent the natural variation of the material being studied in a specific area that are uninfluenced by extraneous and exotic processes such as those related to rare rock types, mineral deposit formingprocesses, or anthropogenic contamination. In geochemistry, outliers are generally observations resulting from a secondary process and not extreme values from the background distribution. Samples where the analytical values are derived from a secondary process—be it mineralisation or contamination—do not need to be especially high (or low) in relation to all values of a variable in a data set, and thus attempts to identify these samples with classical univariate methods commonly fail. However, ARTICLE IN PRESS www.elsevier.com/locate/cageo 0098-3004/$ - see front matter r 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.cageo.2004.11.013 $Code available from server at http://cran.r-project.org/. Correspondingauthor. Tel.: +43 1 58801 10733; fax: +43 1 58801 10799. E-mail addresses: p.filzmoser@tuwien.ac.at (P. Filzmoser), garrett@gsc.NRCan.gc.ca (R.G. Garrett), Clemens.Reimann@ngu.no (C. Reimann)
580 P.Filzmoser et al.Computers Geosciences 31 (2005)579-587 this problem often may be overcome by utilising the measure which takes into account the covariance matrix multivariate nature of most geochemical data sets. is the Mahalanobis distance.For a p-dimensional In the multivariate case not only the distance of an multivariate sample x,...,the Mahalanobis distance observation from the centroid of the data but also the is defined as: shape of the data have to be considered.To illustrate this,two variables with normal distributions having a MD=((i-1)TC-l(1))2 for i=1,....n,(1) defined correlation (Fig.1)are simulated.The estimated where t is the estimated multivariate location and C the central location of each variable is indicated by dashed estimated covariance matrix.Usually,t is the multivariate lines (their intersection marks the multivariate centre or arithmetic mean.the centroid.and C is the sample centroid of the data). covariance matrix.For multivariate normally distributed In the absence of a prior threshold (Rose et al..1979) data the values MD are approximately chi-square a common practice of geochemists is to identify some distributed with p degrees of freedom ()By setting fraction,often 2%,of the data at the upper and lower the (squared)Mahalanobis distance equal to a certain extremes for further investigation.Today this is constant,ie.to a certain quantile of it is possible to achieved by direct estimation of the percentiles and define ellipsoids having the same Mahalanobis distance visual (EDA)inspection of the data.When computers from the centroid (e.g,Gnanadesikan,1977). were not widely available an approximation of the Fig.I illustrates this for the bivariate normally 97.5th percentile was obtained by estimating the mean distributed data.The ellipses correspond to the quantiles and standard deviation (SD)for each variate and 0.25.0.50,0.75 and 0.98 of Points lying on an ellipse computing the value of mean +2SD.The 2%limits thus have the same distance from the centroid.This are indicated by dotted lines on Fig.1.If candidates for distance measure takes the shape of the data cloud into outliers are defined to be observations falling in the account and has potential for more reliably identifying extreme 2%fractions of the univariate data for each extreme values. variable,the rectangle visualised with bold dots sepa- Multivariate outliers can now simply be defined as rates potential outliers from non-outliers.This proce- observations having a large (squared)Mahalanobis dure ignores the elliptical shape of the bivariate data and distance.As noted above for the univariate case.when therefore it is not effective. no prior threshold is available a certain proportion of the The shape and size of multivariate data are quantified data or quantile of the normal distribution is selected for by the covariance matrix.A well-known distance identifying extreme samples for further study.Similarly, in the multivariate case a quantile of the chi-squared 0 distribution (e.g.,the 98%quantile 2:9s)could be considered for this purpose.However,this approach has 90 several shortcomings that will be investigated in this 0 6◆·意归 $°。g paper.The Mahalanobis distances need to be estimated by a robust procedure in order to provide reliable measures for the recognition of outliers.In the geochem- ical context what is required is a reliable estimate of the 0 statistical properties of natural background.Using robust estimates that remove (trim)or downweight extreme 0 values in a population is an effective,if conservative, 8 8 solution.It is conservative to the extent that if there are in o:8 8808 fact no outliers the only consequence is that the true variability (variance-covariance)of the data will be 4 8 o underestimated.Furthermore,by selecting a fixed quan- tile for outlier identification there is no adjustment for ● different sample sizes.To address this situation an adaptive outlier identification method has been devel- -2 0 oped.Finally,the multivariate outlier plot is introduced as Fig.1.Simulated standard normally distributed data with a a helpful tool for the interpretation of multivariate data. predetermined correlation.Dashed lines mark locations (means)of variates,ellipses correspond to 0.25,0.50,0.75 and 0.98 quantiles of chi-squared distribution,and bold dotted lines to 2nd and 98th empirical percentiles for individual variables. 2.The robust distance(RD) Hence,inner rectangular(bold dotted lines)can be considered for univariate outlier recognition,outer ellipse for multivariate The Mahalanobis distance is very sensitive to the outlier identification. presence of outliers (Rousseeuw and Van Zomeren
this problem often may be overcome by utilisingthe multivariate nature of most geochemical data sets. In the multivariate case not only the distance of an observation from the centroid of the data but also the shape of the data have to be considered. To illustrate this, two variables with normal distributions havinga defined correlation (Fig. 1) are simulated. The estimated central location of each variable is indicated by dashed lines (their intersection marks the multivariate centre or centroid of the data). In the absence of a prior threshold (Rose et al., 1979) a common practice of geochemists is to identify some fraction, often 2%, of the data at the upper and lower extremes for further investigation. Today this is achieved by direct estimation of the percentiles and visual (EDA) inspection of the data. When computers were not widely available an approximation of the 97.5th percentile was obtained by estimatingthe mean and standard deviation (SD) for each variate and computingthe value of mean 72SD. The 2% limits are indicated by dotted lines on Fig. 1. If candidates for outliers are defined to be observations fallingin the extreme 2% fractions of the univariate data for each variable, the rectangle visualised with bold dots separates potential outliers from non-outliers. This procedure ignores the elliptical shape of the bivariate data and therefore it is not effective. The shape and size of multivariate data are quantified by the covariance matrix. A well-known distance measure which takes into account the covariance matrix is the Mahalanobis distance. For a p-dimensional multivariate sample x1; ... ; xn the Mahalanobis distance is defined as: MDi :¼ ððxi tÞ TC1 ðxi tÞÞ1=2 for i ¼ 1; ... ; n, (1) where t is the estimated multivariate location and C the estimated covariance matrix. Usually, t is the multivariate arithmetic mean, the centroid, and C is the sample covariance matrix. For multivariate normally distributed data the values MD2 i are approximately chi-square distributed with p degrees of freedom ðw2 pÞ: By setting the (squared) Mahalanobis distance equal to a certain constant, i.e. to a certain quantile of w2 p; it is possible to define ellipsoids havingthe same Mahalanobis distance from the centroid (e.g, Gnanadesikan, 1977). Fig. 1 illustrates this for the bivariate normally distributed data. The ellipses correspond to the quantiles 0.25, 0.50, 0.75 and 0.98 of w2 2: Points lyingon an ellipse thus have the same distance from the centroid. This distance measure takes the shape of the data cloud into account and has potential for more reliably identifying extreme values. Multivariate outliers can now simply be defined as observations havinga large (squared) Mahalanobis distance. As noted above for the univariate case, when no prior threshold is available a certain proportion of the data or quantile of the normal distribution is selected for identifyingextreme samples for further study. Similarly, in the multivariate case a quantile of the chi-squared distribution (e.g., the 98% quantile w2 p;0:98) could be considered for this purpose. However, this approach has several shortcomings that will be investigated in this paper. The Mahalanobis distances need to be estimated by a robust procedure in order to provide reliable measures for the recognition of outliers. In the geochemical context what is required is a reliable estimate of the statistical properties of natural background. Using robust estimates that remove (trim) or downweight extreme values in a population is an effective, if conservative, solution. It is conservative to the extent that if there are in fact no outliers the only consequence is that the true variability (variance–covariance) of the data will be underestimated. Furthermore, by selectinga fixed quantile for outlier identification there is no adjustment for different sample sizes. To address this situation an adaptive outlier identification method has been developed. Finally, the multivariate outlier plot is introduced as a helpful tool for the interpretation of multivariate data. 2. The robust distance (RD) The Mahalanobis distance is very sensitive to the presence of outliers (Rousseeuw and Van Zomeren, ARTICLE IN PRESS -3 -2 -1 0 1 -3 -2 -1 0 1 2 3 2 3 Fig. 1. Simulated standard normally distributed data with a predetermined correlation. Dashed lines mark locations (means) of variates, ellipses correspond to 0.25, 0.50, 0.75 and 0.98 quantiles of chi-squared distribution, and bold dotted lines to 2nd and 98th empirical percentiles for individual variables. Hence, inner rectangular (bold dotted lines) can be considered for univariate outlier recognition, outer ellipse for multivariate outlier identification. 580 P. Filzmoser et al. / Computers & Geosciences 31 (2005) 579–587
P.Filzmoser et al.Computers Geosciences 31 (2005)579-587 581 1990).Single extreme observations,or groups of Classical cor 0.66 Robust cor=0.18 observations,departing from the main data structure can have a severe influence on this distance measure. 000 This is somewhat obscure because the Mahalanobis distance should be able to detect outliers,but the same o60 0°。 outliers can heavily affect the Mahalanobis distance. 0 The reason is the sensitivity of arithmetic mean and 00 sample covariance matrix to outliers (Hampel et al., 4 1986).A solution to this problem is well-known in robust statistics:t and Cin Eq.(1)have to be estimated N in a robust manner.where the expression 'robust'means resistance against the influence of outlying observations. Many robust estimators for location and covariance have been introduced in the literature,for a review see Maronna and Yohai (1998).The minimum covariance determinant (MCD)estimator (Rousseeuw,1985)is probably most frequently used in practice,partly because it is a computationally fast algorithm (Rous- -3 -2 7 seeuw and Van Driessen,1999). log (Be)in Chorizon The MCD estimator is determined by that subset of Fig.2.Scatterplot of log (Be)and log (Sr).Covariance is observations of size h which minimises the determinant visualised by tolerance ellipses.Non-robust estimation (dotted of the sample covariance matrix,computed from only ellipse)leads to a Pearson correlation coefficient of 0.66,robust these h points.The location estimator is the average of procedure (solid ellipse)estimates a Pearson correlation of 0.18 these h points,whereas the scatter estimator is propor- for core population,i.e.weight of 1.identified by MCD tional to their covariance matrix.As a compromise procedure. between robustness and efficiency,a value of h0.75n (n is the sample size)will be employed in this study. through their influence on the classical non-robust The choice of h also determines the robustness of the computation.This influence is also reflected in the estimator.The breakdown value of the MCD estimator resulting correlation coefficients.Whereas the Pearson is approximately (n-h)/n,with h0.75n the break- correlation based on the classical estimates is 0.66.the down is approximately 25%.The breakdown value is robust correlation based on the MCD estimator is only the fraction of outliers that when exceeded will lead to 0.18.The next step would be an appropriate visualisa- completely biased estimates(Hampel et al.,1986). tion of the outliers in a map in order to support the Using robust estimators of location and scatter in the geochemical interpretation of the observations.This will formula for the Mahalanobis distance Eq.(1)leads to be demonstrated later for other examples.The high the so-called robust distances (RDs).Rousseeuw and correlation of Be and Sr in Fig.2 is due to a few samples Van Zomeren (1990)used these RDs for multivariate of soil developed on alkaline rocks that display outlier detection.If the squared RD for an observation unusually high concentrations of both these elements. is larger than,say,it can be declared a candidate The high non-robust correlation coefficient is thus an outlier. inappropriate estimate for the majority of the data as it This procedure is illustrated using real data from the is unduly influenced by true outliers (due to completely Kola project (Reimann et al.,1998).Fig.2 shows the different geology)】 plot of Be and Sr determined in C-horizon soils.Using the arithmetic mean and the sample covariance matrix in Eq.(1)it is possible to construct the ellipse correspond- ing to the squared Mahalanobis distance equal to 72.0.9s. 3.Multivariate outliers or extremes? This ellipse (often called a tolerance ellipse)is visualised as a dotted line in Fig.2.It identifies the extreme In the univariate case,Reimann et al.(2005)pointed members of the bivariate population and its shape out the difference between extremes of a distribution reflects the structure of the covariance matrix.By and true outliers.Outliers are thought to be observa- computing the RDs with the MCD estimator another tions coming from one or more different distributions, tolerance ellipse (solid line in Fig.2)can be constructed and extremes are values that are far away from the using the same quantile,29s.It is clearly apparent that centre but which belong to the same distribution.In an many more points in the upper right of Fig.2 are exploratory univariate data analysis it is convenient to identified as candidate outliers.These outliers cause the start with simply identifying all extreme observations as elongated orientation and shape of the dotted ellipse extreme.It is an important aim of data interpretation to
1990). Single extreme observations, or groups of observations, departingfrom the main data structure can have a severe influence on this distance measure. This is somewhat obscure because the Mahalanobis distance should be able to detect outliers, but the same outliers can heavily affect the Mahalanobis distance. The reason is the sensitivity of arithmetic mean and sample covariance matrix to outliers (Hampel et al., 1986). A solution to this problem is well-known in robust statistics: t and C in Eq. (1) have to be estimated in a robust manner, where the expression ‘robust’ means resistance against the influence of outlying observations. Many robust estimators for location and covariance have been introduced in the literature, for a review see Maronna and Yohai (1998). The minimum covariance determinant (MCD) estimator (Rousseeuw, 1985) is probably most frequently used in practice, partly because it is a computationally fast algorithm (Rousseeuw and Van Driessen, 1999). The MCD estimator is determined by that subset of observations of size h which minimises the determinant of the sample covariance matrix, computed from only these h points. The location estimator is the average of these h points, whereas the scatter estimator is proportional to their covariance matrix. As a compromise between robustness and efficiency, a value of h 0:75n (n is the sample size) will be employed in this study. The choice of h also determines the robustness of the estimator. The breakdown value of the MCD estimator is approximately ðn hÞ=n; with h 0:75n the breakdown is approximately 25%. The breakdown value is the fraction of outliers that when exceeded will lead to completely biased estimates (Hampel et al., 1986). Usingrobust estimators of location and scatter in the formula for the Mahalanobis distance Eq. (1) leads to the so-called robust distances (RDs). Rousseeuw and Van Zomeren (1990) used these RDs for multivariate outlier detection. If the squared RD for an observation is larger than, say, w2 2;0:98; it can be declared a candidate outlier. This procedure is illustrated usingreal data from the Kola project (Reimann et al., 1998). Fig. 2 shows the plot of Be and Sr determined in C-horizon soils. Using the arithmetic mean and the sample covariance matrix in Eq. (1) it is possible to construct the ellipse correspondingto the squared Mahalanobis distance equal to w2 2;0:98: This ellipse (often called a tolerance ellipse) is visualised as a dotted line in Fig. 2. It identifies the extreme members of the bivariate population and its shape reflects the structure of the covariance matrix. By computingthe RDs with the MCD estimator another tolerance ellipse (solid line in Fig. 2) can be constructed usingthe same quantile, w2 2;0:98: It is clearly apparent that many more points in the upper right of Fig. 2 are identified as candidate outliers. These outliers cause the elongated orientation and shape of the dotted ellipse through their influence on the classical non-robust computation. This influence is also reflected in the resultingcorrelation coefficients. Whereas the Pearson correlation based on the classical estimates is 0.66, the robust correlation based on the MCD estimator is only 0.18. The next step would be an appropriate visualisation of the outliers in a map in order to support the geochemical interpretation of the observations. This will be demonstrated later for other examples. The high correlation of Be and Sr in Fig. 2 is due to a few samples of soil developed on alkaline rocks that display unusually high concentrations of both these elements. The high non-robust correlation coefficient is thus an inappropriate estimate for the majority of the data as it is unduly influenced by true outliers (due to completely different geology). 3. Multivariate outliers or extremes? In the univariate case, Reimann et al. (2005) pointed out the difference between extremes of a distribution and true outliers. Outliers are thought to be observations comingfrom one or more different distributions, and extremes are values that are far away from the centre but which belongto the same distribution. In an exploratory univariate data analysis it is convenient to start with simply identifyingall extreme observations as extreme. It is an important aim of data interpretation to ARTICLE IN PRESS -3 -2 -1 0 2 0 2 log (Be) in Chorizon log (Sr) in Chorizon Classical cor = 0.66 Robust cor = 0.18 4 6 1 Fig. 2. Scatterplot of loge(Be) and loge(Sr). Covariance is visualised by tolerance ellipses. Non-robust estimation (dotted ellipse) leads to a Pearson correlation coefficient of 0.66, robust procedure (solid ellipse) estimates a Pearson correlation of 0.18 for core population, i.e. weight of 1, identified by MCD procedure. P. Filzmoser et al. / Computers & Geosciences 31 (2005) 579–587 581
582 P.Filzmoser et al.Computers Geosciences 31 (2005)579-587 identify the different geochemical processes that influence detect outliers.The tails will be defined by =for the data.Only in doing so can the true outliers be a certain small a(e.g.,=0.02),and identified and differentiated from extreme members of the one or more background populations in the data.This pn(=sup(G()-Gn()计 (2) u≥6 distinction should also be made in the multivariate case. In the previous section the assumption of multivariate is considered,where +indicates the positive differences. normality was implicitly used because this led to chi- In this way,p()measures the departure of the square distributed Mahalanobis distances.Also for the empirical from the theoretical distribution only in the RD this assumption was used,at least for the majority tails,defined by the value of p()can be considered as of data (depending on the choice of h for the MCD a measure of outliers in the sample.Gervini (2003)used estimator).Defining outliers by using a fixed threshold this idea as a reweighting step for the robust estimation value (e.g.,:9s)is rather subjective because of multivariate location and scatter.In this way,the efficiency (in terms of statistical precision)of the estimator could be improved considerably. (1)If the data should indeed come from a single P()will not be directly used as a measure of outliers. multivariate normal distribution.the threshold As mentioned in the previous section,the threshold would be infinity because there are no observations should be infinity in case of multivariate normally from a different distribution (only extremes); distributed background data.This means,that if the (2)There is no reason why this fixed threshold should be data are coming from a multivariate normal distribu- appropriate for every data set;and tion,no observation should be declared as an outlier. (3)The threshold has to be adjusted to the sample size Instead.observations with a large RD should be seen as (see Reimann et al.,2005;and simulations below). extremes of the distribution.Therefore a critical value Pr is introduced,which helps to distinguish between A better procedure than using a fixed threshold is to outliers and extremes.The measure of outliers in the adjust the threshold to the data set at hand.Garrett sample is then defined as (1989)used the chi-square plot for this purpose,by plotting the squared Mahalanobis distances(which have 0 ifPm()≤Peit(⑥,n,p, 以a()= Pn(5)if pn(5)>Peru(6.n.p). (3) to be computed on the basis of robust estimations of location and scatter)against the quantiles of the most extreme points are deleted until the remaining points The threshold value is then determined as c()= follow a straight line.The deleted points are the G(1-xn(). identified outliers,the multivariate threshold corre- The critical value porir for distinguishing between sponds to the distance of the closest outlier,the farthest outliers and extremes can be derived by simulation.For background individual,or some intermediate distance. different sample sizes n and different dimensions(num- Alternately,the cube root of the squared Mahalanobis bers of variables)p data from a multivariate normal distances may be plotted against normal quantiles (e.g., distribution are simulated.Then Eq.(2)is applied for Chork.1990).This procedure (Garrett.1989)is not computing the value p()for a fixed value (in the automatic,it needs user interaction and experience on simulations =72:0.9s is used).The procedure is repeated the part of the analyst.Moreover,especially for large 1000 times for every considered value of n and p. data sets,it can be time consuming,and also to some To directly compute the limiting distribution of the extent it is subjective.In the next section a procedure statistic defined by Eq.(2)would be a more elegant way that does not require analyst intervention,is reprodu- for determining the critical value.However,even for cible and therefore objective,and takes the above points, related simpler problems Csorgo and Revesz (1981, (1)-(3),into consideration is introduced. Chapter 5)note that this is analytically extremely difficult and they recommend simulation. The resulting values give an indication of the differences between the theoretical and the empirical 4.Adaptive outlier detection distributions,G(u)-G(u),if the data are sampled from multivariate normal distributions.To be on the safe side. The chi-square plot is useful for visualising the the 95%percentile of the 1000 simulated values can be deviation of the data distribution from multivariate used for every n and p,and these percentiles are shown normality in the tails.This principle is used in the for p=2,4,6,8,10 by different symbols in Fig.3.By following.Let G(u)denote the empirical distribution transforming the x-axis by the inverse of n it can be function of the squared robust distances RD,and let seen that-at least for larger sample size-the points lie G(u)be the distribution function of For multivariate on a line(see Fig.3).The lines in Fig.3 are estimated by normally distributed samples,Gn converges to G. least trimmed sum of squares (LTS)regression (Rous- Therefore the tails of G and G can be compared to seeuw.1984).Using LTS regression the less precise
identify the different geochemical processes that influence the data. Only in doingso can the true outliers be identified and differentiated from extreme members of the one or more background populations in the data. This distinction should also be made in the multivariate case. In the previous section the assumption of multivariate normality was implicitly used because this led to chisquare distributed Mahalanobis distances. Also for the RD this assumption was used, at least for the majority of data (dependingon the choice of h for the MCD estimator). Definingoutliers by usinga fixed threshold value (e.g., w2 p;0:98) is rather subjective because (1) If the data should indeed come from a single multivariate normal distribution, the threshold would be infinity because there are no observations from a different distribution (only extremes); (2) There is no reason why this fixed threshold should be appropriate for every data set; and (3) The threshold has to be adjusted to the sample size (see Reimann et al., 2005; and simulations below). A better procedure than usinga fixed threshold is to adjust the threshold to the data set at hand. Garrett (1989) used the chi-square plot for this purpose, by plottingthe squared Mahalanobis distances (which have to be computed on the basis of robust estimations of location and scatter) against the quantiles of w2 p; the most extreme points are deleted until the remainingpoints follow a straight line. The deleted points are the identified outliers, the multivariate threshold corresponds to the distance of the closest outlier, the farthest background individual, or some intermediate distance. Alternately, the cube root of the squared Mahalanobis distances may be plotted against normal quantiles (e.g., Chork, 1990). This procedure (Garrett, 1989) is not automatic, it needs user interaction and experience on the part of the analyst. Moreover, especially for large data sets, it can be time consuming, and also to some extent it is subjective. In the next section a procedure that does not require analyst intervention, is reproducible and therefore objective, and takes the above points, (1)–(3), into consideration is introduced. 4. Adaptive outlier detection The chi-square plot is useful for visualisingthe deviation of the data distribution from multivariate normality in the tails. This principle is used in the following. Let Gnð Þ u denote the empirical distribution function of the squared robust distances RD2 i ; and let G uð Þ be the distribution function of w2 p: For multivariate normally distributed samples, Gn converges to G. Therefore the tails of Gn and G can be compared to detect outliers. The tails will be defined by d ¼ w2 p;1a for a certain small a (e.g., a ¼ 0:02), and pnð Þ¼ d sup uXd ð Þ G uð Þ Gnð Þ u þ (2) is considered, where + indicates the positive differences. In this way, pnð Þ d measures the departure of the empirical from the theoretical distribution only in the tails, defined by the value of d: pnð Þ d can be considered as a measure of outliers in the sample. Gervini (2003) used this idea as a reweighting step for the robust estimation of multivariate location and scatter. In this way, the efficiency (in terms of statistical precision) of the estimator could be improved considerably. pnð Þ d will not be directly used as a measure of outliers. As mentioned in the previous section, the threshold should be infinity in case of multivariate normally distributed background data. This means, that if the data are comingfrom a multivariate normal distribution, no observation should be declared as an outlier. Instead, observations with a large RD should be seen as extremes of the distribution. Therefore a critical value pcrit is introduced, which helps to distinguish between outliers and extremes. The measure of outliers in the sample is then defined as anðdÞ ¼ 0 if pnðdÞppcritðd; n; pÞ; pnðdÞ if pnðdÞ4pcritðd; n; pÞ: ( (3) The threshold value is then determined as cnð Þ¼ d G1 n ð Þ 1 anð Þ d : The critical value pcrit for distinguishing between outliers and extremes can be derived by simulation. For different sample sizes n and different dimensions (numbers of variables) p data from a multivariate normal distribution are simulated. Then Eq. (2) is applied for computingthe value pnð Þ d for a fixed value d (in the simulations d ¼ w2 p;0:98 is used). The procedure is repeated 1000 times for every considered value of n and p. To directly compute the limitingdistribution of the statistic defined by Eq. (2) would be a more elegant way for determiningthe critical value. However, even for related simpler problems Cso¨rgo+ and Re´ve´sz (1981, Chapter 5) note that this is analytically extremely difficult and they recommend simulation. The resultingvalues give an indication of the differences between the theoretical and the empirical distributions, G uð Þ Gnð Þ u ; if the data are sampled from multivariate normal distributions. To be on the safe side, the 95% percentile of the 1000 simulated values can be used for every n and p, and these percentiles are shown for p ¼ 2; 4, 6, 8, 10 by different symbols in Fig. 3. By transformingthe x-axis by the inverse of ffiffiffi n p it can be seen that—at least for larger sample size—the points lie on a line (see Fig. 3). The lines in Fig. 3 are estimated by least trimmed sum of squares (LTS) regression (Rousseeuw, 1984). UsingLTS regression the less precise ARTICLE IN PRESS 582 P. Filzmoser et al. / Computers & Geosciences 31 (2005) 579–587
P.Filzmoser et al.Computers Geosciences 31(2005)579-587 583 simulation results for smaller sample sizes have less dimension and low sample size.The estimated slopes influence.The slopes of the different lines(the intercept form a linear trend(Fig.6)and the resulting approx- is 0 because for n tending to infinity the difference imative formula is between empirical and theoretical distribution is 0)are shown in Fig.4.The resulting points can again be P(6,np)= 0.252-0.0018p for p>10. (5) approximated by a straight line,which allows definition √m of the critical value as a function of n and p: Peru(o.n,p)= 0.24-0.003p forp≤10. (4) √i 5.Example For larger dimension(p>10)the same procedure can be To test the procedure.data from the Kola project applied.The 95%percentiles of 1000 simulated values for different sample sizes and dimensions are shown in (Reimann et al.,1998)are again used.The objective is to Fig.5.The linear dependency becomes worse for high identify outliers in the O-horizon (organic surface soil) 0.020 p=2 0.012 p=15 -- P=4 p=20 90.015 p=6 0.010 p=30 p=8 -X- p=50 d 0.008 p=70 p=10 0.010 0.006 0.005 0.004 0.002 0.000 0.000 100001000500300200150 100 Sample size 10000 1000 500 300 200 Sample size Fig.3.Simulated critical values according to Eq.(2)for multivariate normal distributions with different sample sizes(x- Fig.5.Simulated critical values analogous to Fig.3.but for axis)and dimensions p.Linear trends for dimensions plotted. higher dimensions (p>10) and increasing sample size,are indicated by lines. 0.230 0.22 0.225 0.20 f0230 复018 0.215 0.16 0.210 0.14 0 2 4 6 8 10 20 30 4050 60 70 Dimension p Dimension p Fig.4.Slopes of lines from Fig.3 plotted against dimension p. Fig.6.Slopes of lines from Fig.5 plotted against dimension p. Line is an estimation of linear trend,and leads to Eg.(4). Line is an estimation of linear trend,and leads to Eq.(5)
simulation results for smaller sample sizes have less influence. The slopes of the different lines (the intercept is 0 because for n tendingto infinity the difference between empirical and theoretical distribution is 0) are shown in Fig. 4. The resultingpoints can again be approximated by a straight line, which allows definition of the critical value as a function of n and p: pcritðd; n; pÞ ¼ 0:24 0:003p ffiffiffi n p for pp10. (4) For larger dimension (p410) the same procedure can be applied. The 95% percentiles of 1000 simulated values for different sample sizes and dimensions are shown in Fig. 5. The linear dependency becomes worse for high dimension and low sample size. The estimated slopes form a linear trend (Fig. 6) and the resultingapproximative formula is pcritðd; n; pÞ ¼ 0:252 0:0018p ffiffiffi n p for p410. (5) 5. Example To test the procedure, data from the Kola project (Reimann et al., 1998) are again used. The objective is to identify outliers in the O-horizon (organic surface soil) ARTICLE IN PRESS Max. difference between distribution functions 10000 1000 500 300 200 150 100 Sample size 0.000 0.005 0.010 0.015 0.020 p=2 p=4 p=6 p=8 p=10 Fig. 3. Simulated critical values according to Eq. (2) for multivariate normal distributions with different sample sizes (xaxis) and dimensions p. Linear trends for dimensions plotted, and increasingsample size, are indicated by lines. 2 8 10 0.210 0.215 0.220 0.225 0.230 Dimension p Slopes from simulation 4 6 Fig. 4. Slopes of lines from Fig. 3 plotted against dimension p. Line is an estimation of linear trend, and leads to Eq. (4). Max. difference between distribution functions 10000 1000 500 300 200 Sample size 0.000 0.002 0.004 0.006 0.008 0.010 0.012 p=15 p=20 p=30 p=50 p=70 Fig. 5. Simulated critical values analogous to Fig. 3, but for higher dimensions (p410). 20 30 40 50 60 70 0.14 0.16 0.18 0.20 0.22 Dimension p Slopes from simulation Fig. 6. Slopes of lines from Fig. 5 plotted against dimension p. Line is an estimation of linear trend, and leads to Eq. (5). P. Filzmoser et al. / Computers & Geosciences 31 (2005) 579–587 583