Chapter 1 The Nature of Econometrics and Economic Data
observation. When econometric methods are used to analyze time series data, the data should be stored in chronological order.

The variable avgmin refers to the average minimum wage for the year, avgcov is the average coverage rate (the percentage of workers covered by the minimum wage law), unemp is the unemployment rate, and gnp is the gross national product. We will use these data later in a time series analysis of the effect of the minimum wage on employment.

Pooled Cross Sections

Some data sets have both cross-sectional and time series features. For example, suppose that two cross-sectional household surveys are taken in the United States, one in 1985 and one in 1990. In 1985, a random sample of households is surveyed for variables such as income, savings, family size, and so on. In 1990, a new random sample of households is taken using the same survey questions. In order to increase our sample size, we can form a pooled cross section by combining the two years. Because random samples are taken in each year, it would be a fluke if the same household appeared in the sample during both years. (The size of the sample is usually very small compared with the number of households in the United States.) This important factor distinguishes a pooled cross section from a panel data set.

Pooling cross sections from different years is often an effective way of analyzing the effects of a new government policy. The idea is to collect data from the years before and after a key policy change. As an example, consider the following data set on housing prices taken in 1993 and 1995, when there was a reduction in property taxes in 1994. Suppose we have data on 250 houses for 1993 and on 270 houses for 1995. One way to store such a data set is given in Table 1.4. Observations 1 through 250 correspond to the houses sold in 1993, and observations 251 through 520 correspond to the 270 houses sold in 1995. While the order in which we store the data turns out not to be crucial, keeping track of the year for each observation is usually very important. This is why we enter year as a separate variable.

A pooled cross section is analyzed much like a standard cross section, except that we often need to account for secular differences in the variables across time. In fact, in addition to increasing the sample size, the point of a pooled cross-sectional analysis is often to see how a key relationship has changed over time.

Panel or Longitudinal Data

A panel data (or longitudinal data) set consists of a time series for each cross-sectional member in the data set. As an example, suppose we have wage, education, and employment history for a set of individuals followed over a ten-year period. Or we might collect information, such as investment and financial data, about the same set of firms over a five-year time period. Panel data can also be collected on geographical units. For example, we can collect data for the same set of counties in the United States on immigration flows, tax rates, wage rates, government expenditures, etc., for the years 1980, 1985, and 1990.

The key feature of panel data that distinguishes it from a pooled cross section is the fact that the same cross-sectional units (individuals, firms, or counties in the above
examples) are followed over a given time period. The data in Table 1.4 are not considered a panel data set because the houses sold are likely to be different in 1993 and 1995; if there are any duplicates, the number is likely to be so small as to be unimportant. In contrast, Table 1.5 contains a two-year panel data set on crime and related statistics for 150 cities in the United States.

There are several interesting features in Table 1.5. First, each city has been given a number from 1 through 150. Which city we decide to call city 1, city 2, and so on, is irrelevant. As with a pure cross section, the ordering in the cross section of a panel data set does not matter. We could use the city name in place of a number, but it is often useful to have both.

Table 1.4 Pooled Cross Sections: Two Years of Housing Prices

obsno  year  hprice  proptax  sqrft  bdrms  bthrms
    1  1993   85500       42   1600      3     2.0
    2  1993   67300       36   1440      3     2.5
    3  1993  134000       38   2000      4     2.5
  ...
  250  1993  243600       41   2600      4     3.0
  251  1995   65000       16   1250      2     1.0
  252  1995  182400       20   2200      4     2.0
  253  1995   97500       15   1540      3     2.0
  ...
  520  1995   57200       16   1100      2     1.5
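As a rough illustration of the distinction between a pooled cross section and a panel, the following Python sketch stores the rows printed in Table 1.4 and the rows printed in the two-year city panel (Table 1.5) as lists of records. The helper names mean_by_year and change_by_unit are hypothetical, only the handful of printed rows is used, and the resulting numbers say nothing about the full samples of 520 houses or 150 cities.

```python
from collections import defaultdict
from statistics import mean

# Rows printed in Table 1.4 (pooled cross section): every row is a different house.
houses = [
    {"obsno": 1, "year": 1993, "hprice": 85500, "proptax": 42},
    {"obsno": 2, "year": 1993, "hprice": 67300, "proptax": 36},
    {"obsno": 3, "year": 1993, "hprice": 134000, "proptax": 38},
    {"obsno": 250, "year": 1993, "hprice": 243600, "proptax": 41},
    {"obsno": 251, "year": 1995, "hprice": 65000, "proptax": 16},
    {"obsno": 252, "year": 1995, "hprice": 182400, "proptax": 20},
    {"obsno": 253, "year": 1995, "hprice": 97500, "proptax": 15},
    {"obsno": 520, "year": 1995, "hprice": 57200, "proptax": 16},
]

# Rows printed in Table 1.5 (panel): the same city appears in 1986 and in 1990.
cities = [
    {"city": 1, "year": 1986, "murders": 5},
    {"city": 1, "year": 1990, "murders": 8},
    {"city": 2, "year": 1986, "murders": 2},
    {"city": 2, "year": 1990, "murders": 1},
    {"city": 149, "year": 1986, "murders": 10},
    {"city": 149, "year": 1990, "murders": 6},
    {"city": 150, "year": 1986, "murders": 25},
    {"city": 150, "year": 1990, "murders": 32},
]


def mean_by_year(rows, var):
    """Average of `var` within each year; works for pooled or panel data."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["year"]].append(row[var])
    return {year: mean(values) for year, values in sorted(groups.items())}


def change_by_unit(rows, unit, var, first, last):
    """Change in `var` between two years for each unit observed in both years.

    This is only meaningful for panel data, where the same unit is followed over time.
    """
    by_unit = defaultdict(dict)
    for row in rows:
        by_unit[row[unit]][row["year"]] = row[var]
    return {
        u: years[last] - years[first]
        for u, years in sorted(by_unit.items())
        if first in years and last in years
    }


# Pooled cross section: the year variable lets us compare the two samples.
print(mean_by_year(houses, "proptax"))

# Panel: because each city is observed twice, we can compute within-city changes.
print(change_by_unit(cities, "city", "murders", 1986, 1990))
```

The year variable is what makes the first comparison possible, which is why it is stored as a separate column. The second calculation relies on a unit identifier (city) that repeats across years, something the housing data do not have.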
Table 1.5 A Two-Year Panel Data Set on City Crime Statistics

obsno  city  year  murders  population  unem  police
    1     1  1986        5      350000   8.7     440
    2     1  1990        8      359200   7.2     471
    3     2  1986        2       64300   5.4      75
    4     2  1990        1       65100   5.5      75
  ...
  297   149  1986       10      260700   9.6     286
  298   149  1990        6      245000   9.8     334
  299   150  1986       25      543000   4.3     520
  300   150  1990       32      546200   5.2     493

A second useful point is that the two years of data for city 1 fill the first two rows or observations. Observations 3 and 4 correspond to city 2, and so on. Since each of the 150 cities has two rows of data, any econometrics package will view this as 300 observations. This data set can be treated as two pooled cross sections, where the same cities happen to show up in both years. But, as we will see in Chapters 13 and 14, we can also use the panel structure to respond to questions that cannot be answered by simply viewing this as a pooled cross section.

In organizing the observations in Table 1.5, we place the two years of data for each city adjacent to one another, with the first year coming before the second in all cases. For just about every practical purpose, this is the preferred way for ordering panel data sets. Contrast this organization with the way the pooled cross sections are stored in Table 1.4. In short, the reason for ordering panel data as in Table 1.5 is that we will need to perform data transformations for each city across the two years.

Because panel data require replication of the same units over time, panel data sets, especially those on individuals, households, and firms, are more difficult to obtain than pooled cross sections. Not surprisingly, observing the same units over time leads to several advantages over cross-sectional data or even pooled cross-sectional data. The benefit that we will focus on in this text is that having multiple observations on the same units allows us to control for certain unobserved characteristics of individuals, firms, and so on. As we will see, the use of more than one observation can facilitate causal inference in situations where inferring causality would be very difficult if only a single cross section were available. A second advantage of panel data is that it often allows us to study the importance of lags in behavior or the result of decision making. This information can be significant since many economic policies can be expected to have an impact only after some time has passed.

Most books at the undergraduate level do not contain a discussion of econometric methods for panel data. However, economists now recognize that some questions are difficult, if not impossible, to answer satisfactorily without panel data. As you will see, we can make considerable progress with simple panel data analysis, a method which is not much more difficult than dealing with a standard cross-sectional data set.

A Comment on Data Structures

Part 1 of this text is concerned with the analysis of cross-sectional data, as this poses the fewest conceptual and technical difficulties. At the same time, it illustrates most of the key themes of econometric analysis. We will use the methods and insights from cross-sectional analysis in the remainder of the text.

While the econometric analysis of time series uses many of the same tools as cross-sectional analysis, it is more complicated due to the trending, highly persistent nature of many economic time series. Examples that have been traditionally used to illustrate the manner in which econometric methods can be applied to time series data are now widely believed to be flawed. It makes little sense to use such examples initially, since this practice will only reinforce poor econometric practice.
Therefore, we will postpone the treatment of time series econometrics until Part 2, when the important issues concerning trends, persistence, dynamics, and seasonality will be introduced.

In Part 3, we treat pooled cross sections and panel data explicitly. The analysis of independently pooled cross sections and simple panel data analysis are fairly straightforward extensions of pure cross-sectional analysis. Nevertheless, we will wait until Chapter 13 to deal with these topics.

1.4 CAUSALITY AND THE NOTION OF CETERIS PARIBUS IN ECONOMETRIC ANALYSIS

In most tests of economic theory, and certainly for evaluating public policy, the economist’s goal is to infer that one variable has a causal effect on another variable (such as crime rate or worker productivity). Simply finding an association between two or more variables might be suggestive, but unless causality can be established, it is rarely compelling.

The notion of ceteris paribus, which means “other (relevant) factors being equal,” plays an important role in causal analysis. This idea has been implicit in some of our earlier discussion, particularly Examples 1.1 and 1.2, but thus far we have not explicitly mentioned it.
You probably remember from introductory economics that most economic questions are ceteris paribus by nature. For example, in analyzing consumer demand, we are interested in knowing the effect of changing the price of a good on its quantity demanded, while holding all other factors (such as income, prices of other goods, and individual tastes) fixed. If other factors are not held fixed, then we cannot know the causal effect of a price change on quantity demanded.

Holding other factors fixed is critical for policy analysis as well. In the job training example (Example 1.2), we might be interested in the effect of another week of job training on wages, with all other components being equal (in particular, education and experience). If we succeed in holding all other relevant factors fixed and then find a link between job training and wages, we can conclude that job training has a causal effect on worker productivity. While this may seem pretty simple, even at this early stage it should be clear that, except in very special cases, it will not be possible to literally hold all else equal.

The key question in most empirical studies is: Have enough other factors been held fixed to make a case for causality? Rarely is an econometric study evaluated without raising this issue.

In most serious applications, the number of factors that can affect the variable of interest (such as criminal activity or wages) is immense, and the isolation of any particular variable may seem like a hopeless effort. However, we will eventually see that, when carefully applied, econometric methods can simulate a ceteris paribus experiment.

At this point, we cannot yet explain how econometric methods can be used to estimate ceteris paribus effects, so we will consider some problems that can arise in trying to infer causality in economics. We do not use any equations in this discussion. For each example, the problem of inferring causality disappears if an appropriate experiment can be carried out. Thus, it is useful to describe how such an experiment might be structured, and to observe that, in most cases, obtaining experimental data is impractical. It is also helpful to think about why the available data fail to have the important features of an experimental data set.

We rely for now on your intuitive understanding of terms such as random, independence, and correlation, all of which should be familiar from an introductory probability and statistics course. (These concepts are reviewed in Appendix B.) We begin with an example that illustrates some of these important issues.

EXAMPLE 1.3 (Effects of Fertilizer on Crop Yield)

Some early econometric studies [for example, Griliches (1957)] considered the effects of new fertilizers on crop yields. Suppose the crop under consideration is soybeans. Since fertilizer amount is only one factor affecting yields (others include rainfall, quality of land, and presence of parasites), this issue must be posed as a ceteris paribus question. One way to determine the causal effect of fertilizer amount on soybean yield is to conduct an experiment, which might include the following steps. Choose several one-acre plots of land. Apply different amounts of fertilizer to each plot and subsequently measure the yields; this gives us a cross-sectional data set. Then, use statistical methods (to be introduced in Chapter 2) to measure the association between yields and fertilizer amounts.
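The experimental steps in Example 1.3 can be sketched as a small Python simulation. Everything numeric here is an assumption made purely for illustration and is not taken from the text: the linear yield equation, the value of true_effect, the unobserved land_quality term, the list of fertilizer amounts, and the decision to draw each plot's amount at random. The function names run_experiment and ols_slope are likewise hypothetical. The slope computed at the end is the ordinary least squares measure of association between yields and fertilizer amounts.

```python
import random


def ols_slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def run_experiment(n_plots=200, true_effect=0.5, seed=0):
    """Simulate the steps of the experiment on hypothetical one-acre plots."""
    rng = random.Random(seed)

    # Step 1: choose the plots; each has a land quality we never observe (assumed).
    land_quality = [rng.gauss(0, 1) for _ in range(n_plots)]

    # Step 2: apply different fertilizer amounts (drawn at random here, an assumption).
    fertilizer = [rng.choice([0, 20, 40, 60, 80]) for _ in range(n_plots)]

    # Step 3: "measure" yields using an assumed linear model plus noise.
    yields = [
        30 + true_effect * f + 3 * q + rng.gauss(0, 2)
        for f, q in zip(fertilizer, land_quality)
    ]
    return fertilizer, yields


# Step 4: measure the association between yields and fertilizer amounts.
fertilizer, yields = run_experiment()
print(round(ols_slope(fertilizer, yields), 3))
```

Because the simulated fertilizer amounts are drawn without reference to land quality, the estimated slope should land close to the assumed true_effect. The sketch does not address what happens when that condition fails.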