Chapter 1 The Nature of Econometrics and Economic Data
served factors, such as the wage for criminal activity, moral character, family background, and errors in measuring things like criminal activity and the probability of arrest. We could add family background variables to the model, such as number of siblings, parents’ education, and so on, but we can never eliminate $u$ entirely. In fact, dealing with this error term or disturbance term is perhaps the most important component of any econometric analysis.

The constants $\beta_0, \beta_1, \ldots, \beta_6$ are the parameters of the econometric model, and they describe the directions and strengths of the relationship between crime and the factors used to determine crime in the model.

A complete econometric model for Example 1.2 might be

$$\mathit{wage} = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{training} + u, \qquad (1.4)$$

where the term $u$ contains factors such as “innate ability,” quality of education, family background, and the myriad other factors that can influence a person’s wage. If we are specifically concerned about the effects of job training, then $\beta_3$ is the parameter of interest.

For the most part, econometric analysis begins by specifying an econometric model, without consideration of the details of the model’s creation. We generally follow this approach, largely because careful derivation of something like the economic model of crime is time-consuming and can take us into some specialized and often difficult areas of economic theory. Economic reasoning will play a role in our examples, and we will merge any underlying economic theory into the econometric model specification. In the economic model of crime example, we would start with an econometric model such as (1.3) and use economic reasoning and common sense as guides for choosing the variables. While this approach loses some of the richness of economic analysis, it is commonly and effectively applied by careful researchers.

Once an econometric model such as (1.3) or (1.4) has been specified, various hypotheses of interest can be stated in terms of the unknown parameters. For example, in equation (1.3) we might hypothesize that $\mathit{wage}_m$, the wage that can be earned in legal employment, has no effect on criminal behavior. In the context of this particular econometric model, the hypothesis is equivalent to $\beta_1 = 0$.

An empirical analysis, by definition, requires data. After data on the relevant variables have been collected, econometric methods are used to estimate the parameters in the econometric model and to formally test hypotheses of interest. In some cases, the econometric model is used to make predictions in either the testing of a theory or the study of a policy’s impact.

Because data collection is so important in empirical work, Section 1.3 will describe the kinds of data that we are likely to encounter.

1.3 THE STRUCTURE OF ECONOMIC DATA

Economic data sets come in a variety of types. While some econometric methods can be applied with little or no modification to many different kinds of data sets, the special features of some data sets must be accounted for or should be exploited. We next describe the most important data structures encountered in applied work.
Cross-Sectional Data

A cross-sectional data set consists of a sample of individuals, households, firms, cities, states, countries, or a variety of other units, taken at a given point in time. Sometimes the data on all units do not correspond to precisely the same time period. For example, several families may be surveyed during different weeks within a year. In a pure cross-section analysis we would ignore any minor timing differences in collecting the data. If a set of families was surveyed during different weeks of the same year, we would still view this as a cross-sectional data set.

An important feature of cross-sectional data is that we can often assume that they have been obtained by random sampling from the underlying population. For example, if we obtain information on wages, education, experience, and other characteristics by randomly drawing 500 people from the working population, then we have a random sample from the population of all working people. Random sampling is the sampling scheme covered in introductory statistics courses, and it simplifies the analysis of cross-sectional data. A review of random sampling is contained in Appendix C.

Sometimes random sampling is not appropriate as an assumption for analyzing cross-sectional data. For example, suppose we are interested in studying factors that influence the accumulation of family wealth. We could survey a random sample of families, but some families might refuse to report their wealth. If, for example, wealthier families are less likely to disclose their wealth, then the resulting sample on wealth is not a random sample from the population of all families. This is an illustration of a sample selection problem, an advanced topic that we will discuss in Chapter 17.

Another violation of random sampling occurs when we sample from units that are large relative to the population, particularly geographical units. The potential problem in such cases is that the population is not large enough to reasonably assume the observations are independent draws. For example, if we want to explain new business activity across states as a function of wage rates, energy prices, corporate and property tax rates, services provided, quality of the workforce, and other state characteristics, it is unlikely that business activities in states near one another are independent. It turns out that the econometric methods that we discuss do work in such situations, but they sometimes need to be refined. For the most part, we will ignore the intricacies that arise in analyzing such situations and treat these problems in a random sampling framework, even when it is not technically correct to do so.

Cross-sectional data are widely used in economics and other social sciences. In economics, the analysis of cross-sectional data is closely aligned with the applied microeconomics fields, such as labor economics, state and local public finance, industrial organization, urban economics, demography, and health economics. Data on individuals, households, firms, and cities at a given point in time are important for testing microeconomic hypotheses and evaluating economic policies.

The cross-sectional data used for econometric analysis can be represented and stored in computers. Table 1.1 contains, in abbreviated form, a cross-sectional data set on 526 working individuals for the year 1976. (This is a subset of the data in the file WAGE1.RAW.) The variables include wage (in dollars per hour), educ (years of education), exper (years of potential labor force experience), female (an indicator for gender), and married (marital status). These last two variables are binary (zero-one) in nature
and serve to indicate qualitative features of the individual. (The person is female or not; the person is married or not.) We will have much to say about binary variables in Chapter 7 and beyond.

The variable obsno in Table 1.1 is the observation number assigned to each person in the sample. Unlike the other variables, it is not a characteristic of the individual. All econometrics and statistics software packages assign an observation number to each data unit. Intuition should tell you that, for data such as that in Table 1.1, it does not matter which person is labeled as observation one, which person is called observation two, and so on. The fact that the ordering of the data does not matter for econometric analysis is a key feature of cross-sectional data sets obtained from random sampling.

Different variables sometimes correspond to different time periods in cross-sectional data sets. For example, in order to determine the effects of government policies on long-term economic growth, economists have studied the relationship between growth in real per capita gross domestic product (GDP) over a certain period (say 1960 to 1985) and variables determined in part by government policy in 1960 (government consumption as a percentage of GDP and adult secondary education rates). Such a data set might be represented as in Table 1.2, which constitutes part of the data set used in the study of cross-country growth rates by De Long and Summers (1991).

Table 1.1 A Cross-Sectional Data Set on Wages and Other Individual Characteristics

obsno   wage    educ   exper   female   married
    1    3.10     11       2        1         0
    2    3.24     12      22        1         1
    3    3.00     11       2        0         0
    4    6.00      8      44        0         1
    5    5.30     12       7        0         1
  ...     ...    ...     ...      ...       ...
  525   11.56     16       5        0         1
  526    3.50     14       5        1         0
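The point that ordering does not matter in a cross-sectional data set can be checked directly on the rows shown in Table 1.1. The sketch below is an illustration added here, not a procedure from the text. It stores only the seven displayed rows (not the full 526 observations), computes a few summary statistics, and confirms that they are unchanged after the rows are shuffled. The helper names `slope_wage_on_educ` and `summarize` are hypothetical, and with seven rows the numbers themselves should not be interpreted.

```python
import math
import random
from statistics import mean

# The seven rows displayed in Table 1.1 (the full data set has 526 rows).
# Columns: obsno, wage, educ, exper, female, married
ROWS = [
    (1, 3.10, 11, 2, 1, 0),
    (2, 3.24, 12, 22, 1, 1),
    (3, 3.00, 11, 2, 0, 0),
    (4, 6.00, 8, 44, 0, 1),
    (5, 5.30, 12, 7, 0, 1),
    (525, 11.56, 16, 5, 0, 1),
    (526, 3.50, 14, 5, 1, 0),
]


def slope_wage_on_educ(rows):
    """Simple-regression slope of wage on educ: cov(educ, wage) / var(educ)."""
    wages = [r[1] for r in rows]
    educs = [r[2] for r in rows]
    w_bar, e_bar = mean(wages), mean(educs)
    num = sum((e - e_bar) * (w - w_bar) for e, w in zip(educs, wages))
    den = sum((e - e_bar) ** 2 for e in educs)
    return num / den


def summarize(rows):
    """Statistics that use the values in each row but never the row position."""
    return {
        "mean_wage": mean(r[1] for r in rows),
        "share_female": mean(r[4] for r in rows),
        "slope": slope_wage_on_educ(rows),
    }


original = summarize(ROWS)

# Relabel the observations by shuffling the rows (fixed seed for repeatability).
shuffled_rows = ROWS[:]
random.Random(0).shuffle(shuffled_rows)
shuffled = summarize(shuffled_rows)

# Equal up to floating-point rounding: the ordering carries no information.
same = all(math.isclose(original[k], shuffled[k], rel_tol=1e-9) for k in original)
print("statistics unchanged by reordering:", same)
```

Note that obsno is never used in the calculations, which matches the remark that it is a label rather than a characteristic of the individual.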
The variable gpcrgdp represents average growth in real per capita GDP over the period 1960 to 1985. The fact that govcons60 (government consumption as a percentage of GDP) and second60 (percent of adult population with a secondary education) correspond to the year 1960, while gpcrgdp is the average growth over the period from 1960 to 1985, does not lead to any special problems in treating this information as a cross-sectional data set. The order of the observations is listed alphabetically by country, but there is nothing about this ordering that affects any subsequent analysis.

Table 1.2 A Data Set on Economic Growth Rates and Country Characteristics

obsno   country     gpcrgdp   govcons60   second60
    1   Argentina      0.89           9         32
    2   Austria        3.32          16         50
    3   Belgium        2.56          13         69
    4   Bolivia        1.24          18         12
  ...   ...             ...         ...        ...
   61   Zimbabwe       2.30          17          6

Time Series Data

A time series data set consists of observations on a variable or several variables over time. Examples of time series data include stock prices, money supply, consumer price index, gross domestic product, annual homicide rates, and automobile sales figures. Because past events can influence future events and lags in behavior are prevalent in the social sciences, time is an important dimension in a time series data set. Unlike the arrangement of cross-sectional data, the chronological ordering of observations in a time series conveys potentially important information.

A key feature of time series data that makes it more difficult to analyze than cross-sectional data is the fact that economic observations can rarely, if ever, be assumed to be independent across time. Most economic and other time series are related, often strongly related, to their recent histories. For example, knowing something about the gross domestic product from last quarter tells us quite a bit about the likely range of the GDP during this quarter, since GDP tends to remain fairly stable from one quarter to
the next. While most econometric procedures can be used with both cross-sectional and time series data, more needs to be done in specifying econometric models for time series data before standard econometric methods can be justified. In addition, modifications and embellishments to standard econometric techniques have been developed to account for and exploit the dependent nature of economic time series and to address other issues, such as the fact that some economic variables tend to display clear trends over time.

Another feature of time series data that can require special attention is the frequency at which the data are collected. In economics, the most common frequencies are daily, weekly, monthly, quarterly, and annually. Stock prices are recorded at daily intervals (excluding Saturday and Sunday). The money supply in the U.S. economy is reported weekly. Many macroeconomic series are tabulated monthly, including inflation and employment rates. Other macro series are recorded less frequently, such as every three months (every quarter). Gross domestic product is an important example of a quarterly series. Other time series, such as infant mortality rates for states in the United States, are available only on an annual basis.

Many weekly, monthly, and quarterly economic time series display a strong seasonal pattern, which can be an important factor in a time series analysis. For example, monthly data on housing starts differ across the months simply due to changing weather conditions. We will learn how to deal with seasonal time series in Chapter 10.

Table 1.3 contains a time series data set obtained from an article by Castillo-Freeman and Freeman (1992) on minimum wage effects in Puerto Rico. The earliest year in the data set is the first observation, and the most recent year available is the last

Table 1.3 Minimum Wage, Unemployment, and Related Data for Puerto Rico

obsno   year   avgmin   avgcov   unemp      gnp
    1   1950     0.20     20.1    15.4    878.7
    2   1951     0.21     20.7    16.0    925.0
    3   1952     0.23     22.6    14.8   1015.9
  ...    ...      ...      ...     ...      ...
   37   1986     3.35     58.1    18.9   4281.6
   38   1987     3.35     58.2    16.8   4496.7
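In contrast to a cross section, a time series calculation usually depends on which observation came immediately before which. As a minimal sketch, assuming only the five rows displayed in Table 1.3 (the rows between 1952 and 1986 are not shown), the code below sorts the rows by year and computes the year-over-year growth in gnp for each year whose previous year is present. The function name `gnp_growth` is a hypothetical helper, and growth is used here only as an example of a quantity built from a one-period lag.

```python
# Rows displayed in Table 1.3; the years between 1952 and 1986 are not shown.
# Columns: obsno, year, avgmin, avgcov, unemp, gnp
ROWS = [
    (1, 1950, 0.20, 20.1, 15.4, 878.7),
    (2, 1951, 0.21, 20.7, 16.0, 925.0),
    (3, 1952, 0.23, 22.6, 14.8, 1015.9),
    (37, 1986, 3.35, 58.1, 18.9, 4281.6),
    (38, 1987, 3.35, 58.2, 16.8, 4496.7),
]


def gnp_growth(rows):
    """Proportional change in gnp from year t-1 to year t, keyed by year t.

    The rows are put in chronological order first, and a growth rate is
    computed only when the previous calendar year is actually present.
    """
    ordered = sorted(rows, key=lambda r: r[1])
    growth = {}
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[1] - prev[1] == 1:  # skip the jump from 1952 to 1986
            growth[cur[1]] = (cur[5] - prev[5]) / prev[5]
    return growth


growth = gnp_growth(ROWS)
for year, g in sorted(growth.items()):
    print(f"{year}: gnp growth {g:.3f}")
```

The explicit sort and the check on consecutive years are the design choice worth noting: pairing each row with whatever row happens to precede it in storage would compare 1986 with 1952 here, and would give meaningless lags if the rows were ever reordered.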