1 Model Selection in Linear Regression

Basic Ideas

“Model selection” in linear regression attempts to suggest the best model for a given purpose. Recall that the two main purposes of linear regression models are:

• Estimation of the effect of one or more covariates while adjusting for the possible confounding effects of other variables.

• Prediction of the outcome for the next set of similar subjects.

Also, we must again recall the famous quote from George Box:

“All models are wrong, but some are useful.”

The above quote is important here for two reasons:

• If we are looking to estimate effects, it reminds us that we do not need to find a single “final model” and draw all conclusions from that model alone. Since no single model is correct, we are free to estimate a large number of different models, and draw overall conclusions about the effect of a variable, confounding, and so on, from that collection of models. So, in some sense, model selection is not relevant to estimating effects, where we will almost always want to look at more than one model; sometimes as many as 20 or 30 models will be helpful in drawing overall conclusions. We have already seen examples of this when we investigated confounding. Nevertheless, it turns out that a Bayesian program for model selection will, because of the way its output is formatted, be very useful for investigating effects. Thus, in learning a program for model selection, we will simultaneously see a way to easily draw overall conclusions for parameter estimation in the face of confounding.
• If we are looking at prediction, then it may make sense to ask “what is the best model that gives ‘optimal’ predictions for future subjects?” Here again, we will see that finding a single model is often sub-optimal, and in fact, taking a Bayesian average over many models produces better predictions than any single model.

There is no widely accepted model building strategy. We will look at some of the most common methods.

Model Selection for Future Predictions

Problem: We wish to predict Y using potential predictor variables X1, X2, ..., Xp.

Challenge: Which subset of the X1, X2, ..., Xp potential predictors should be included in the model for “best” predictions?

Frequentist Approaches:

• Backwards selection
• Forwards selection
• Backwards and Forwards selection
• All subsets selection
• AIC criterion
• Mallows Cp criterion
• R2 criterion
• Adjusted R2 criterion
• PRESSp criterion

Bayesian Approaches:

• Bayes Factors
• BIC criterion
• DIC criterion

We will now take each of these and define them. Following that, we will compare how they each actually work in practice, including programming them in R.

Backwards Selection: First run a model with all covariates, X1, X2, ..., Xp, included in the model. Then check which of the covariates has the largest p-value, and eliminate it from the model, leaving p − 1 independent variables in the model. Repeat this procedure with the variables that are left, continually dropping variables until some stopping criterion is met. A typical criterion is to stop once no remaining p-value is above some threshold, like p > 0.1.
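To make the procedure concrete, here is a minimal sketch of p-value-based backward elimination in R. The data frame dat, the outcome name "y", and the 0.1 threshold are hypothetical placeholders, not part of the notes; the sketch also assumes numeric predictors, so that coefficient names match column names.

# Backward elimination driven by p-values (illustrative sketch).
# 'dat' is assumed to contain the outcome 'y' plus the candidate predictors.
backward_select <- function(dat, outcome = "y", threshold = 0.1) {
  predictors <- setdiff(names(dat), outcome)   # start with all candidate covariates
  repeat {
    terms <- if (length(predictors)) predictors else "1"
    fit   <- lm(reformulate(terms, response = outcome), data = dat)
    coefs <- summary(fit)$coefficients
    pvals <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)", drop = FALSE]
    # Stop once no remaining p-value exceeds the threshold
    if (nrow(pvals) == 0 || max(pvals) <= threshold) return(fit)
    # Otherwise drop the covariate with the largest p-value and refit
    predictors <- setdiff(predictors, rownames(pvals)[which.max(pvals)])
  }
}

# Usage (hypothetical): final_model <- backward_select(dat, outcome = "y", threshold = 0.1)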
Forwards Selection: First run a model with no covariates included, i.e., an intercept-only model. Then run p separate models, one for each of the possible independent variables, keeping track of the p-values each time. The first variable entered is the one with the lowest p-value at this step. Repeat this procedure: at the second step, consider all models containing two variables, the one selected at the first step plus each of the others in turn, and add the variable with the smallest p-value, and so on. Continue to add variables until some stopping criterion is met. A typical criterion is that the p-values of all candidate variables not yet in the model are above some threshold, like p > 0.15, so no new variables are added.

Backwards and Forwards Selection: At each stage, consider both dropping and/or adding variables, checking some criterion (e.g., based again on some p-value thresholds). A combination of the above two strategies.
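R's built-in step() function automates this kind of combined backwards-and-forwards search, although it compares models by AIC rather than by individual p-values. A minimal sketch follows; the data frame dat and outcome y are hypothetical placeholders.

# Stepwise selection in both directions with step(), which uses AIC as its criterion.
null_fit <- lm(y ~ 1, data = dat)   # intercept-only starting model
full_fit <- lm(y ~ ., data = dat)   # model containing all candidate predictors

stepwise_fit <- step(null_fit,
                     scope     = list(lower = null_fit, upper = full_fit),
                     direction = "both",    # "forward" or "backward" also allowed
                     trace     = FALSE)
summary(stepwise_fit)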
All subsets regression: An alternative to backwards/forwards procedures; a generic term describing the idea of calculating some “fit” criterion over all possible models. We will see some of these criteria below. In general, if there are p potential predictor variables, there will be 2^p possible models. For example, if there are five possible X variables, there will be 2^5 = 32 possible models, and so on.

AIC criterion: Stands for Akaike's Information Criterion. For each model, calculate:

AIC = n ln(SSE) − n ln(n) + 2p

where SSE is the usual residual sum of squares from that model, p is the number of parameters in the current model, and n is the sample size. After doing this for all possible models, the “best” model is the one with the smallest AIC.

Note that the AIC is formed from three terms. The first, n ln(SSE), is a measure of fit, since it increases with the residual sum of squares. The second term, n ln(n), is a constant, and really plays no role in selecting the model. The third term, 2p, is a “penalty” for adding more terms to the model; it is needed for “balance”, because the first term always decreases as more terms are added into the model.
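To illustrate the all-subsets idea together with the AIC formula above, here is a brief sketch that enumerates every subset of a small set of candidate predictors and computes AIC = n ln(SSE) − n ln(n) + 2p for each. The data frame dat, outcome y, and predictor names x1, x2, x3 are hypothetical placeholders.

# Score every non-empty subset of candidate predictors with the AIC formula above.
predictors <- c("x1", "x2", "x3")
n <- nrow(dat)

aic_notes <- function(fit) {
  sse <- sum(residuals(fit)^2)
  p   <- length(coef(fit))              # parameters in the current model
  n * log(sse) - n * log(n) + 2 * p
}

# 2^3 - 1 = 7 models with at least one predictor (add y ~ 1 by hand if wanted)
subsets <- unlist(lapply(seq_along(predictors),
                         function(k) combn(predictors, k, simplify = FALSE)),
                  recursive = FALSE)

aics <- sapply(subsets, function(s)
  aic_notes(lm(reformulate(s, response = "y"), data = dat)))

data.frame(model = sapply(subsets, paste, collapse = " + "),
           AIC   = aics)[order(aics), ]         # smallest AIC first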
Mallows Cp criterion: Attempts to measure the bias from a regression model. See the textbook for a full description; we omit the details here.

R2 criterion: Choose the model with the largest R2. In general, this will simply be the largest model, so it is not a very useful criterion. It can, however, be helpful in choosing among models with the same number of included parameters. Not further discussed here.

Adjusted R2 criterion: As above, but Adjusted R2 penalizes for the number of parameters, so the largest model is not necessarily always best. It generally selects models that are too large, because the “penalty” is too small. Not further discussed here.

PRESSp criterion: PRESS stands for Prediction Sum of Squares. Similar to the SSE, but the residuals are now calculated from a model that leaves out each observation in turn, so that each observation is predicted using a model fit without it. There is no explicit penalty for having more parameters, but the largest model is not always chosen. Still, other criteria tend to have better properties, so it is not discussed further here.
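Although PRESS is not pursued further in these notes, it is cheap to compute for a linear model: the leave-one-out residuals can be obtained from the ordinary residuals and the leverage (hat) values without refitting the model n times. A small sketch, where dat and y are again hypothetical placeholders:

# PRESS for a fitted linear model: sum of squared leave-one-out prediction errors.
press <- function(fit) {
  loo_resid <- residuals(fit) / (1 - hatvalues(fit))   # e_i / (1 - h_ii)
  sum(loo_resid^2)
}

press(lm(y ~ ., data = dat))   # smaller values indicate better predictive performance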
Bayes Factors: Consider the following picture:

[Figure not reproduced: the data under H0 versus HA, with the observed Data X marked; here the Bayes Factor = 1, and one may ask what the p-value is.]

The Bayes Factor = 1 in the figure because it is defined as

BF = Pr{data | model1} / Pr{data | model2}

If model i is not fully determined because of unknown parameters, then the Bayes Factor is still defined as

BF = Pr{data | model1} / Pr{data | model2}

but now where Pr{data | model i} is defined as

Pr{data | model i} = ∫ (likelihood × prior) dθi

where θi represents the vector of all unknown parameters for model i. What the above integral really means is that when there are unknown parameters in the terms in the definition of the Bayes Factor, we integrate them out (i.e., like an average) over the prior distribution for these parameters.

Problem: These can be hard to calculate, as they can involve high-dimensional integrals. So, we can approximate Bayes Factors by the BIC; see below.
BIC criterion: Stands for Bayesian Information Criterion, sometimes called the SBC for Schwarz's Bayesian Criterion. For each model, calculate:

BIC = n ln(SSE) − n ln(n) + ln(n) p

where SSE is the usual residual sum of squares from that model, p is the number of parameters in the current model, and n is the sample size. After doing this for all possible models, the “best” model is the one with the smallest BIC.

Note the similarity between the AIC and the BIC: only the last (penalty) term changes, from 2p to ln(n) p. We will compare the properties of these two criteria in detail later.

Details are omitted here (see the article by Raftery), but it can be shown that the BIC is related to an approximate Bayes Factor, from a model with low-information prior distributions (equal to one prior observation centered at the null value of zero for each coefficient).

The BIC is in some ways the best criterion to use for predictions, in large part because it leads to model averaging; see Raftery for details. We will extensively use Raftery's program called bic.glm for the rest of the year; it is extremely useful both for model selection for prediction and for estimating effects.
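As a rough sketch of how these quantities can be obtained in R: the BIC formula above can be computed directly from any fitted model, and Raftery's model-averaging routines are available in the BMA package (assumed installed here), where bicreg handles linear regression and bic.glm handles generalized linear models. The data frame dat and outcome y are hypothetical placeholders.

# BIC = n*ln(SSE) - n*ln(n) + ln(n)*p, computed directly from a fitted linear model.
bic_notes <- function(fit) {
  n   <- length(residuals(fit))
  sse <- sum(residuals(fit)^2)
  p   <- length(coef(fit))
  n * log(sse) - n * log(n) + log(n) * p
}
bic_notes(lm(y ~ ., data = dat))

# Bayesian model averaging over all subsets with the BMA package.
# bicreg() uses the BIC approximation to the Bayes Factor and reports, for each
# covariate, the posterior probability that its coefficient is non-zero.
library(BMA)
X       <- dat[, setdiff(names(dat), "y"), drop = FALSE]
fit_bma <- bicreg(X, dat$y)
summary(fit_bma)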
DIC criterion: Stands for “Deviance Information Criterion”. Similar to the BIC, but designed for hierarchical models; it estimates an “effective number of parameters” for such models. Beyond the scope of this course, but be aware of its existence if you need to select a hierarchical model.