Lecture 10
Logistic Regression

Why use logistic regression?

In linear regression:

Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn + e

the dependent variable, Y, is continuous and unbounded, and we want to identify a set of explanatory (independent, or X) variables that will assist us in predicting its mean value while explaining its observed variability.

In many situations in demography and the social sciences, however, we have a dependent variable, Y, that is dichotomous rather than continuous, e.g., whether or not a woman has had a second birth, whether the second birth is a male or a female baby, whether or not a woman uses any contraceptive method, whether or not a person has migrated in the last 5 years, whether or not a People’s University staff member uses public transportation to come to work, whether or not an undergraduate student who completed study last year in the Demography Department at People’s University was awarded a BA degree, etc.

In all these situations, the outcome of Y assumes only two forms; usually, the value 1 represents yes, or a “success,” and the value 0 represents no, or a “failure.” The mean of this dichotomous (also referred to as binary) dependent variable, designated p, is the proportion of times that it takes the value 1.
Example: Obtaining Abortion

• The data in the following tables were derived from the 1997 survey, which contains information on abortion use and associated information. The tables and the chart show the incidence of abortion according to the number of pregnancies. It can be seen that the proportion of women obtaining abortion increases rapidly from a very low proportion at the first pregnancy to a large proportion among women having 5 or more pregnancies.

PREG5 * whether abortion Crosstabulation (Count)

PREG5     no      yes     Total
1         880       9       889
2         950     418      1368
3         499     461       960
4         230     271       501
5         102     193       295
Total    2661    1352      4013

PREG5 * whether abortion Crosstabulation (% within PREG5)

PREG5     no       yes      Total
1.00      99.0%     1.0%    100.0%
2.00      69.4%    30.6%    100.0%
3.00      52.0%    48.0%    100.0%
4.00      45.9%    54.1%    100.0%
5.00      34.6%    65.4%    100.0%
Total     66.3%    33.7%    100.0%

[Chart: mean of "whether abortion" (vertical axis, 0.0 to 0.7) by PREG5 (horizontal axis, 1 to 5)]

• To make a statistical model of this relationship, we could feasibly fit a linear regression line to the cases, with pregnancy number as the explanatory variable and a dichotomous dependent variable (0 = not having abortion, 1 = having abortion). There are two main problems with this approach.

• The first problem is that it is possible, and indeed happens in this case, that the fitted regression line will cross below zero and/or above one, right in the range where we do not want that to occur. The fitted regression line can be shown to have the form

p = -0.01314 + 0.13798*PREG

where p is the proportion having abortion and PREG is the number of pregnancies.
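As a rough check on the first problem, the fitted line quoted above can be evaluated at a few pregnancy counts. The sketch below uses only the two coefficients given in the text; the evaluation grid of 0 to 10 pregnancies and the function names are assumptions chosen for illustration.

```python
# Coefficients of the fitted straight line quoted in the text.
B0 = -0.01314
B1 = 0.13798


def linear_p(preg):
    """Predicted proportion from the linear fit p = B0 + B1 * PREG."""
    return B0 + B1 * preg


def out_of_range(values):
    """Return the PREG values whose predicted proportion is not in [0, 1]."""
    return [x for x in values if not 0.0 <= linear_p(x) <= 1.0]


# Hypothetical evaluation grid (an assumption): 0 to 10 pregnancies.
grid = list(range(0, 11))
for preg in grid:
    print(preg, round(linear_p(preg), 4))

print("outside [0, 1]:", out_of_range(grid))
```

On this grid the straight line gives a negative value at 0 pregnancies and values above 1 from the 8th pregnancy onward, which is the behaviour a model for a proportion should not have.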
The estimated probability can be greater than 1 or less than 0

• The fitted line stays below 1 up to 7 pregnancies but rises above 1 from the 8th pregnancy onward (p = 1.09 at PREG = 8), meaning that more than 100 per cent of pregnancies at that order would be aborted. And there are many other cases where the predicted proportions are negative. Although such results are impossible, we might nevertheless be inclined to accept the fitted line in the limited range where its predictions are valid.

The linearity assumption is seriously violated

• This would be very dangerous to do, because of the second problem, which is that the assumptions of linear regression are violated badly in this case. This can be seen clearly in the plots obtained with the REGRESSION sub-commands, particularly the final scatterplot of the standardized residuals against the predicted values:

[Figure: Residual plot. Standardized Residual plotted against Standardized Predicted Value]

• Recall that this scatter diagram should show no pattern at all, as if a handful of stones were dropped at the centre of the diagram. On the contrary, in this case it could not show a pattern more clearly! This pattern of two lines across the diagram is caused by the fact that the dependent variable can only take two values (1 for having abortion and 0 otherwise), and the residuals consequently have a binomial distribution, not a normal distribution.
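The two-line pattern can be reproduced with simple arithmetic. The sketch below is an illustration only: it uses raw (not standardized) residuals and the linear coefficients quoted earlier, and the function name is hypothetical. At any predicted value, the residual can take just two values, one for each possible outcome.

```python
# Linear fit quoted in the text: p = B0 + B1 * PREG.
B0, B1 = -0.01314, 0.13798


def residuals_by_outcome(preg):
    """Raw residuals y - p_hat for the two possible outcomes at one PREG value."""
    p_hat = B0 + B1 * preg
    return {
        "p_hat": p_hat,
        "resid_if_y0": 0 - p_hat,  # woman did not have an abortion
        "resid_if_y1": 1 - p_hat,  # woman had an abortion
    }


for preg in range(1, 6):
    r = residuals_by_outcome(preg)
    gap = r["resid_if_y1"] - r["resid_if_y0"]
    print(preg, round(r["p_hat"], 3), round(r["resid_if_y0"], 3),
          round(r["resid_if_y1"], 3), round(gap, 3))
```

Both residuals fall as the predicted value rises, and the gap between them is always 1, so the points in a residual plot can only lie on two parallel lines.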
• Because we break the linearity assumption, the usual hypothesis testing procedures are invalid.

• R square tends to be very low. The fit of the line is poor because the response can only be 0 or 1, so the values do not cluster around the line.

Logistic Regression

• To get around both problems, we will instead fit a curve of a particular form to the data. This type of curve, known as a logistic curve, has the following general form:

p = exp(b0+b1*X)/(1+exp(b0+b1*X))

where p is the proportion at each value of the explanatory variable X, b0 and b1 are numerical constants to be estimated, and exp is the exponential function.

• This curve has the following form when the parameter b1 is positive, or its mirror image when the parameter is negative.

[Figure: logistic curve, with p from 0.0 to 1.0 on the vertical axis and X from -4 to 12 on the horizontal axis]

Logistic Function

• The logistic curve has the property that it never takes values less than zero or greater than one. The way to fit it is to transform the definition of the logistic curve given above into a linear form:

loge(p/(1-p)) = b0+b1*X

• The function on the left-hand side of this equation has various names, of which the most common are the 'logit function' and the 'log-odds function'. The log equation has the general form of a linear model.
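A minimal sketch of the two forms may help here. The coefficients b0 = -2.0 and b1 = 0.5 are hypothetical values chosen only for illustration; they are not estimates from the survey. The code evaluates the logistic curve and then applies the log transformation to recover the linear form.

```python
import math


def logistic(b0, b1, x):
    """p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))


def logit(p):
    """Log-odds loge(p / (1 - p)); defined only for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients (an assumption, not fitted values).
b0, b1 = -2.0, 0.5

for x in range(-4, 13, 4):
    p = logistic(b0, b1, x)
    # The last column reproduces b0 + b1*x, up to rounding error.
    print(x, round(p, 4), round(logit(p), 4))
```

Every p printed lies strictly between 0 and 1, while the transformed values increase in equal steps with X, which is what makes the transformed equation a linear model.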
Probability and Odds

• A probability is the likelihood that a given event will occur. It is the frequency of a given outcome divided by the total number of all possible outcomes.

• A definition of “odds” is the likelihood of a given event occurring, compared to the likelihood of the same event not occurring.

Probability:

p = exp(b0+b1*X)/(1+exp(b0+b1*X))

Odds:

p/(1-p) = exp(b0+b1*X)

Taking the natural logarithm of each side of the odds equation yields the following:

loge(p/(1-p)) = b0+b1*X

The above equation has the logit on the left-hand side.

• The logit is a linear function of the X variables.

• The probability is a non-linear function of the X variables.
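To make the three scales concrete, the counts from the crosstabulation earlier in the lecture can be converted to probabilities, odds and log-odds. This is a sketch only; the dictionary layout and the function name are assumptions made for the example.

```python
import math

# Counts (no abortion, abortion) by PREG5, from the crosstabulation in the lecture.
counts = {1: (880, 9), 2: (950, 418), 3: (499, 461), 4: (230, 271), 5: (102, 193)}


def summarize(no, yes):
    """Return (probability, odds, log-odds) of abortion for one row of the table."""
    p = yes / (no + yes)
    odds = p / (1.0 - p)  # algebraically equal to yes / no
    return p, odds, math.log(odds)


for preg, (no, yes) in counts.items():
    p, odds, log_odds = summarize(no, yes)
    print(preg, round(p, 3), round(odds, 3), round(log_odds, 3))
```

The odds equal the number of women who had an abortion divided by the number who did not, and the log-odds are negative whenever the probability is below one half and positive whenever it is above.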