30 Years of Multidimensional Multivariate Visualization

Pak Chung Wong, pcw@cs.unh.edu
R. Daniel Bergeron, rdb@cs.unh.edu

Department of Computer Science
University of New Hampshire
Durham, New Hampshire 03824, USA

Abstract

We present a survey of multidimensional multivariate (mdmv) visualization techniques developed during the last three decades. This subfield of scientific visualization deals with the analysis of data with multiple parameters or factors, and the key relationships among them. The course of development is roughly organized into four stages, within which major milestones are discussed. Recently developed techniques are explored with examples.

1 Introduction

Multidimensional multivariate visualization is an important subfield of scientific visualization. It was studied separately by statisticians and psychologists long before computer science was deemed a discipline. The appearance of low-priced personal computers and workstations during the 1980s breathed new life into graphical analysis of mdmv data. This research topic was among the short-term goals included in the 1987 National Science Foundation (NSF) sponsored workshop on Visualization in Scientific Computing [MDB87]. The quest for effective and efficient mdmv visualization techniques has expanded since then.

This paper attempts to trace three decades of intensive development in this visualization field. It is by no means a comprehensive survey. We provide a brief history along with a description of the principal concepts of some mdmv visualization techniques. Recently developed mdmv visualization techniques are discussed in detail with examples. Remarks on the trends of mdmv visualization research are also given.

2 Four Stages of Multidimensional Multivariate Visualization Development

The last three decades of mdmv visualization development can be roughly divided into four stages. The classic exploratory data analysis (EDA) book by Tukey [Tuk77], the 1987 NSF workshop on Visualization in
Scientific Computing [MDB87], and the IEEE Visualization ’91 conference [NR91] are the watersheds defining these stages. The first stage was primarily concerned with the graphical presentation of one- or two-variate data. The second stage was dominated by Tukey’s exploratory data analysis. Scientists started looking at graphical data from a different perspective. Although most of the graphics was still two dimensional, scientists were able to encode data with multiple parameters, i.e., multivariate data, into meaningful two dimensional plots. The momentum of this work carried on through the next stage, when NSF recognized the importance of mdmv data visualization. The involvement of computer scientists accelerated the growth of the research by computerizing many of the old ideas and developing many new ones. The mission was formally defined and many promising concepts were developed during the following few years. The final (current) stage is concerned with the elaboration and assessment of mdmv visualization techniques. It remains to be seen whether the existing mdmv visualization concepts can lead to better visualization of a problem and better understanding of the underlying science.

This discussion of mdmv visualization is far from complete. There are other important topics, including volume visualization and vector/tensor field visualization, that are not covered. The principal concepts and research issues related to these subjects can be found in [Nie92, PvW92, KHK+94, HPvW94].

2.1 Pre–1976 The Searching Stage

Scientists have studied multivariate visualization since 1782, when Crome used point symbols to show the geographical distribution in Europe of 56 commodities [Col93]. In 1950, Gibson [Gib50] started the research on visual texture perception. Later, Pickett and White [PW66] proposed mapping data sets onto artificial graphical objects composed of lines.
This texture mapping work was further investigated by Pickett [Pic70], and was eventually computerized [PG88]. Chernoff [Che73] presented his arrays of cartoon faces for multivariate data in 1973. In this well-known technique, variates are mapped to the shape of the cartoon faces and their facial features including nose, mouth, and eyes. These faces are then displayed in a two dimensional graph.

The searching stage can be characterized by relatively small data sets, and tools for data visualization that usually consisted of colored pencils and graph paper. The graphical output was mostly two dimensional xy-displays. Statisticians were the dominant research force during this period. Graphics was used to bring out the key features of the data, suggest the statistical analysis methods to be applied to the data, and present the conclusions [Fis70].

2.2 1977–1985 The Awakening Stage

Tukey’s exploratory data analysis signified a new era of scientific data visualization. Exploratory data analysis is more than a tool; it is a way of thinking. It teaches people how to visually decode information from the data. When the personal computer arrived, it became the scientist’s most powerful tool ever. Now scientists could visualize data beyond two dimensions interactively. Painfully long calculations could suddenly be carried out in real time. Statisticians could visualize data during each stage of the analysis instead of waiting until the final results were available. The availability of other computer hardware such as high resolution color displays also gave the study of mdmv visualization new opportunities.

During this stage, two and three dimensional spatial data were the most common data types being studied,
although multivariate data started gaining more attention. Asimov [Asi85] presented the grand tour technique for viewing projections of multivariate data on two dimensional planes. Earth resource satellites sent out decades ago are still transmitting data continuously. Gigabyte-sized multivariate data had arrived.

2.3 1986–1991 The Discovery Stage

The 1987 NSF workshop formally declared the need for two and three dimensional spatial object visualization. The two dimensional projection of multivariate data sets was also included as one of the short-term potential targets for scientific visualization research. Once the mission was defined, scientists started pushing hard on the representation and visualization of mdmv data. The limited availability of high speed graphics hardware during the previous stage was gradually overcome. A majority of research was directed away from the development of exploratory data analysis tools, which relied heavily on statistical measures, towards colorful high dimensional graphics that required high speed computations. Some of the mdmv visualization concepts developed during this stage include: grand tour methods [BA86], parallel coordinates [IRC87, ID87, ID90], iconography [PG88, BG89b, Bed90, Lev91], worlds within worlds [FB90a, FB90b], dimension stacking [LWW90], hierarchical axis [MGTS90, MTS91a, MTS91b], hyperbox [AC91], and various ideas collected in [Cle93, CMM93]. Some of these techniques attempt to show all dimensions and all variates visually as one display, whereas others aim at direct manipulation graphics, in which the user interactively selects subsets for display by using an input device such as a mouse. Virtual reality [FB90a, FB90b] began to appear in the mdmv visualization literature.

2.4 1992–present The Elaboration and Assessment Stage

In 1990 and 1991, there were at least fourteen mdmv related papers published in the IEEE Visualization conferences.
A total of four have been published in the three visualization conferences since then. This stage so far has been a period of retrenchment in the development of new mdmv visualization techniques. Some of the most recently developed tools are, each in a different way, elaborations of work done in previous stages. For example, HyperSlice [vWvL93] is an attempt to combine the panel matrix of the scatterplot matrix with the direct manipulation of brushing [BC87]. AutoVisual [BF92, BF93] is an extended version of worlds within worlds with a new rule-based interface. XmdvTool [War94] integrates four existing mdmv visualization tools (dimension stacking, scatterplot matrix, glyphs, and parallel coordinates) into one system with enhanced n-dimensional brushing.

The research in mdmv visualization has also diversified into multidisciplinary collaborations. Attempts to combine sound with graphics [SPW92, SBG92] are currently being made. The concept of a rule-based queue [BF92, BF93] was also introduced. One of the latest research issues of mdmv visualization is the need to evaluate the correctness, effectiveness, and usefulness of mdmv visualization techniques. Similar concerns also appear in the other fields of visualization research [RET+94, HPvW94].
3 Terminology

Unfortunately, the mdmv literature suffers from ill-defined and inconsistent terminology. The term dimensionality is especially overloaded. Mathematicians consider dimension as the number of independent variables in an algebraic equation. Engineers take dimension to mean measurements of any sort (breadth, length, height, and thickness). Even the prefix multi is frequently interchanged with another prefix, hyper. In the statistics literature, the prefix multi means two or more, indicating a natural breakpoint between one and two dimensions in probabilistic methods. For the breakpoint between three and four (or beyond), the prefix hyper is used [Cle93]. We use the prefix multi to refer to dimensionality of two or more.

Beddow [Bed92] points out the difference between multidimensional objects and multidimensional data. Multidimensional objects are spatial objects, and our goal is to understand their geometry. The most common forms are two dimensional images and three dimensional volumes. They can best be described as n-dimensional Euclidean spaces R^n. Multidimensional data, on the other hand, refers to the study of relationships among multiple parameters. Mathematically these parameters can be classified into two categories: dependent and independent [KK93]. Some statisticians prefer the terms factor and response [Cle93]. A variable is said to be dependent if it is a function of another variable, the independent variable. The relationship of an independent variable x and a dependent variable y can best be described by the mathematical equation y = f(x). We adopt the convention that the term multidimensional refers to the dimensionality of the independent variables, while the term multivariate refers to the dimensionality of the dependent variables [BCH+94]. This is by far the most popular way to describe the dimensionality of mdmv data sets in scientific visualization literature.
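As a minimal sketch of this naming convention, the Python snippet below counts the independent and dependent variables of a data set and builds the corresponding shorthand label. The class name `MdmvSchema`, its fields, and the variable names are hypothetical and chosen only for illustration; they do not come from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MdmvSchema:
    """Hypothetical description of an mdmv data set (illustration only)."""
    independent: Tuple[str, ...]  # the "dimensions" (independent variables)
    dependent: Tuple[str, ...]    # the "variates" (dependent variables)

    def label(self) -> str:
        # Shorthand of the form <dimensions>d<variates>v
        return f"{len(self.independent)}d{len(self.dependent)}v"


# Two quantities recorded at positions in a volume:
# three independent variables, two dependent variables.
volume = MdmvSchema(
    independent=("x", "y", "z"),
    dependent=("temperature", "pressure"),
)
print(volume.label())  # 3d2v
```

Keeping the two groups of variables separate makes the shorthand a direct count of each group, which is all the convention requires.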
For example, a three dimensional volume space in which both temperature and pressure are observed and recorded in various locations produces 3d2v data.

Beddow [Bed92] argues that analytic methods used to explore n-dimensional Euclidean spaces R^n are not appropriate for general multivariate analysis. In mdmv visualization research, the emphasis shifts away from the strong mathematical definition of dependent and independent variates towards the broader definition of multiple variables or factors. This happens not only in mdmv scientific visualization research but also in statistical studies. The tools are different, but the goal is the same: to find the hidden relationships between the variables (also known as fitting in statistics).

In general, raw scientific data can be categorized into a hierarchy of data types. The most general type, at the bottom of the hierarchy, is nominal data, whose values have no inherent ordering. For example, the names of the fifty states are nominal data. The next higher type in the hierarchy is ordinal data, whose values are ordered, but for which no meaningful distance metric exists. The seven rainbow colors (i.e., red, orange, ...) belong to this category. The highest type in the hierarchy is metric data, which has a meaningful distance metric between any two values. Times, distances, and temperatures are examples. If we bin metric data into ranges, it becomes ordinal data. If we further remove the ordering constraints, the data is nominal. Some of the visualization techniques included in this survey are specially designed to handle metric data (see Sections 5.2.2 and 5.2.9).

The above 3d2v temperature/pressure example more or less implies that each three dimensional coordinate contains simple (i.e., neither a set nor an interval) and atomic (i.e., not composite) values of pressure and temperature. This is different from the case when we measure, for example, the chemical contents of a volume. Each coordinate
now has a set (instead of a simple value) of composite data (i.e., chemical elements). The number of values of a variate recorded at any single point, which may vary from point to point, is known as the density of that coordinate.

4 Fundamental Objective and Approach

The main objectives of mdmv visualization are to visually summarize an mdmv data set, and find key trends and relationships among the variates. Different properties and characteristics of the data may change the way we carry out visualization, but not its goals.

The traditional two dimensional point and line plots are among the most commonly used visualization techniques for data with a small number of variates. This technique can be enhanced by putting an array of plots into one display, so as to add another variate to the visual presentation. This approach is discussed in Sections 5.1.5 and 5.2.2.

We can also map the variates of the data into graphical primitives of different colors, sizes, shapes, and locations (see Sections 5.2.4, 5.2.5, and 5.2.6). The display of all dimensions and all variates creates a kind of texture pattern and provides critical insights needed for scientific discovery.

For large (larger than the number of pixels of a display) scientific data, we can display a certain portion of data and allow the user to navigate the rest interactively. This is described in Sections 5.2.2, 5.2.7, 5.2.8, and 5.2.9.

Most of the visualization techniques assume a Euclidean space environment. Orthogonal axes, however, are not always the best choice to plot data. Sections 5.2.7, 5.2.10, and 5.2.11 give some alternatives.

A powerful visualization technique is to display the data frame by frame according to a time variate. This animation approach is discussed in Sections 5.3.1, 5.3.2, and 5.3.3.

5 Multidimensional Multivariate Visualization and Concepts

The body of this paper covers the principal concepts and brief history of some of the popular mdmv visualization techniques.
During the last decade, hundreds of so-called new mdmv visualization techniques have been invented. (Refer to [KK93] for more details in this regard.) A majority of them are designed for special purposes such as volume visualization and vector/tensor field visualization, which are not covered in our discussion. Some of the rest are merely ad hoc tools that produce pretty pictures. They are difficult to create and their results are hard to interpret. We are interested in techniques that are founded on a solid basis and that have potential for practical value.

Categorizing mdmv visualization techniques is a difficult task. Possible criteria for such a categorization include the goal of the visualization, the type and/or dimensionality of the data, the dimensionality of the visualization technique, etc. We have not found a convincing set of criteria that cleanly differentiates the visualization techniques we wish to describe. We have chosen to group the techniques into those based on 2-variate displays, those based on multivariate displays, and those using time as an animation parameter:

Techniques based on 2-variate displays include the fundamental 2-variate displays and simultaneous views