Prof. Bryan Caplan
Weeks 6-7: Multiple Regression
Intuition Behind Multiple Regression
- The simple regression model estimates a dependent variable as a function of ONE independent variable.
- This is unsatisfactory because frequently more than one factor matters!
- A multiple regression equation estimates the impact of two or more variables on the dependent variable:
Y = a + b1*X1 + b2*X2 + ... + bk*Xk + e
- In principle, you figure out your optimal coefficients by just re-doing the "minimize SSE" procedure we did for simple regressions.
- In practice, this is a big pain, but fortunately computers can do it for us very easily. In Eviews, just use the Equation menu to regress Y on X1 X2 X3 X4 etc. This will give you coefficients, standard errors, t-stats, etc., for the impact of EACH variable on Y.
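In a modern scripting language the same "minimize SSE" estimation is a few lines. A minimal sketch in Python with NumPy (rather than Eviews), using made-up data where we know the true coefficients because we simulated them ourselves:

```python
import numpy as np

# Hypothetical data: y depends on two regressors with known true coefficients.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

# "Minimize SSE" for multiple regression = least squares on a design matrix
# with a constant column plus one column per independent variable.
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)  # roughly [2.0, 1.5, -0.5]
```

Eviews (or any statistics package) reports exactly these least-squares coefficients, plus the standard errors and t-stats.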
Controlling for Other Variables
- Sometimes you want to do multiple regression because you think two or more variables matter. For example, you initially regress inflation on money growth (getting a positive sign). But you suspect that something else (oil prices) also increases inflation. So you regress inflation on both money growth and the change in oil prices, probably getting positive coefficients on both.
- Other times, you want to do multiple regression because you think that one variable really DOESN'T matter.
- Ex: You first regress income on education, and find a positive impact of education. But maybe high IQ people get a lot of education, even though education doesn't increase earnings. In this case, you might regress income on BOTH education and IQ, and see if the coefficient on education declines or becomes insignificant.
- Still other times, you want to do multiple regression because you think that one variable really has the OPPOSITE of the effect you estimate in a simple regression. For example, if you regress crime on per-capita police expenditures, and get a positive coefficient, you may doubt that this shows that more police yield more crime. So you regress crime on per-capita police expenditures and lagged crime. It is quite possible that now the coefficient on police expenditures will turn negative (and lagged crime will almost surely have a positive coefficient).
- When you add another variable to a regression and re-run it, people often say that you have "controlled" for the new variable. In other words, you have looked at the impact of the original variables, taking the extra one into account as well.
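The education/IQ example above can be illustrated with simulated data (all numbers hypothetical): here income depends only on IQ, yet the simple regression makes education look lucrative, and controlling for IQ makes the education coefficient collapse toward zero:

```python
import numpy as np

# Hypothetical simulation: income is driven ENTIRELY by IQ,
# but education is correlated with IQ.
rng = np.random.default_rng(1)
n = 2000
iq = rng.normal(size=n)
education = 0.8 * iq + rng.normal(scale=0.6, size=n)  # high-IQ people get more schooling
income = 2.0 * iq + rng.normal(size=n)                # education itself does nothing

def ols(y, *xs):
    # Least-squares coefficients: [intercept, b1, b2, ...]
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_simple = ols(income, education)[1]          # looks like a big education payoff
b_controlled = ols(income, education, iq)[1]  # shrinks toward zero once IQ is controlled
print(b_simple, b_controlled)
```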
R2 and Multiple Regression
- R2 is still defined the same as before: R2=1-(SSE/TSS)
- Important point: Adding more variables to a regression equation cannot decrease the R2.
- If the coefficient on the new variable is zero, then the R2 is the same as before (the error is the same and TSS is the same).
- If the coefficient on the new variable is not zero, then the R2 must be greater than before (the error must have fallen somewhat, and TSS is the same).
- Easy way to get a high R2: use as many variables as possible. However, since it is a mathematical truism that more variables improve your fit, people will be skeptical of your results if you use too many variables.
- One rule of thumb: Don't keep statistically insignificant variables in your specification.
- Fun fact: How many variables do you need to always get a perfect fit, with R2=1? For N observations, you need N variables, counting the intercept (almost any variables will work, so long as none is an exact linear combination of the others!).
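A quick numerical check of these claims about R2, on made-up data: adding a totally irrelevant variable cannot lower R2, and with N observations and N (linearly independent) variables the fit is perfect:

```python
import numpy as np

rng = np.random.default_rng(2)

def r2(y, X):
    # R2 = 1 - SSE/TSS, exactly as defined in the notes.
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ coefs) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - sse / tss

n = 50
x1 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)
junk = rng.normal(size=n)  # a variable with nothing to do with y

r2_one = r2(y, np.column_stack([np.ones(n), x1]))
r2_two = r2(y, np.column_stack([np.ones(n), x1, junk]))
print(r2_one, r2_two)  # adding a variable cannot lower R2

# N observations, N variables (intercept + 49 random columns): perfect fit.
Xfull = np.column_stack([np.ones(n), rng.normal(size=(n, n - 1))])
r2_full = r2(y, Xfull)
print(r2_full)  # 1.0 up to rounding error
```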
Omitted Variable Bias
- Your coefficients are biased if they are systematically wrong. If your estimates are unbiased, they may not be exactly right, but they will be too high about as often as they are too low.
- Your results are biased, however, if your coefficients are wrong for reasons OTHER than random error.
- The most common bias is called "Omitted Variable Bias."
- This means that you "omitted" (left out) an important variable from your regression equation. This makes all of the remaining coefficients suspect.
- The best way to handle possible omitted variable bias: Put the omitted variable into the equation, and re-run the regression.
- If the coefficient on the previously omitted variable is 0, the other coefficients won't change. Otherwise, they may.
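A sketch of this last point with simulated data (all numbers hypothetical): adding an irrelevant variable leaves the original coefficient almost untouched, while adding a genuine confounder moves it a lot:

```python
import numpy as np

# Hypothetical data: a confounder is related to both x1 and y;
# an irrelevant variable is related to neither.
rng = np.random.default_rng(3)
n = 2000
x1 = rng.normal(size=n)
irrelevant = rng.normal(size=n)              # unrelated to y and to x1
confounder = 0.7 * x1 + rng.normal(size=n)   # related to x1 and to y
y = 1.0 * x1 + 2.0 * confounder + rng.normal(size=n)

def ols(y, *xs):
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(y, x1)[1]                          # biased upward by the omitted confounder
b_with_irrelevant = ols(y, x1, irrelevant)[1]    # barely changes
b_with_confounder = ols(y, x1, confounder)[1]    # changes a lot
print(b_short, b_with_irrelevant, b_with_confounder)
```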
Examples of Omitted Variable Bias
- Discrimination. Regress income on a variable that equals 1 if a person is black, and 0 otherwise. You will find that there is a BIG negative impact of being black - i.e., being black seems to "cause" lower income.
- But: blacks also have lower education levels, lower average age, etc. You need to control for these before presuming discrimination.
- Gender discrimination. Regress income on a variable that equals 1 if a person is a woman, and 0 otherwise. You will find a big negative impact of being a woman.
- But: women also have fewer years of experience, take time off to have kids, are less likely to take jobs requiring long absences from home, more likely to work part-time, etc. You need to control for these before inferring discrimination.
- Does government spending increase output during war? Regress output on government spending (or rate of growth of output on rate of growth of government spending). You may find a significant positive impact of G on Y.
- But: money supply growth is also high during war. Could this be the real reason for the correlation?
- Countries with big governments relative to GDP are on average richer than ones with small G/GDP. But is this causal?
- In the data, you will notice that Third World countries have small G/GDP, and First World countries have large G/GDP. But if you look within groups of similar countries, the small-government ones on average are richer. Again, you need to control for other factors.
- If you regress earnings on years of participation in government training programs, you often (particularly for males) find a negative impact. Can you think of other factors to control for?
- Some studies of the minimum wage find little impact of a higher minimum wage on employment. Can you think of some factors you should control for? (Do governments tend to raise the minimum wage during boom periods when the employment effect will be hard to detect?)
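The race and gender regressions above use 0/1 "dummy" variables. One useful fact: in a simple regression on a constant and a dummy, the dummy's coefficient is exactly the difference between the two group means. A sketch with made-up wage numbers:

```python
import numpy as np

# Hypothetical wage data with a built-in gap between two groups.
rng = np.random.default_rng(4)
n = 1000
woman = rng.integers(0, 2, size=n).astype(float)  # dummy: 1 if woman, 0 otherwise
income = 40.0 - 8.0 * woman + rng.normal(scale=5.0, size=n)

# Regress income on a constant and the dummy.
X = np.column_stack([np.ones(n), woman])
intercept, gap = np.linalg.lstsq(X, income, rcond=None)[0]

# The coefficient equals the raw difference in group means.
raw_gap = income[woman == 1].mean() - income[woman == 0].mean()
print(gap, raw_gap)
```

This is why the simple regression only shows the raw gap; attributing it to discrimination requires controlling for the other factors listed above.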
Correlation vs. Causation, Again
- Controlling for extra variables can help us avoid confusing correlation and causation. But the danger is still there.
- Problem #1: Many variables are hard to quantify. Suppose, for example, that people get paid more for high ability, not education. But how do you measure "ability"? Just because you can't measure it doesn't mean it isn't important.
- Problem #2: Data limitations. Even if you can quantify something, it may be hard to get the data. It may be easy to get data on loan approval and race, but hard to get data on credit-worthiness.
- Problem #3: "Endogeneity." Data is "exogenous" when it is generated by a randomized double-blind experiment or its equivalent. Otherwise it is called "endogenous." This can generate serious problems:
- Regress cancer treatment on cancer severity. You'll see a positive correlation if you don't use a randomized selection procedure, since very sick people will also get more powerful medicine.
- Suppose that the central bank increases the money supply in order to "accommodate the needs of trade." You observe a positive correlation between money and output. But does money cause the increase in output? You need a randomized selection procedure to get good results.
- Married males make more money than unmarried males. Does this mean you will get a raise if you get married?
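The money-accommodation story can be simulated (all numbers hypothetical): here money growth has NO true effect on output, but because the central bank raises money growth whenever a positive output shock is coming, a regression reports a positive "effect":

```python
import numpy as np

# Hypothetical "accommodating the needs of trade": policy is endogenous.
rng = np.random.default_rng(5)
n = 2000
shock = rng.normal(size=n)                # output shock the central bank anticipates
money = 1.0 * shock + rng.normal(size=n)  # money growth responds to the shock
output = 0.0 * money + shock              # money itself does nothing to output

# Regress output on a constant and money growth.
X = np.column_stack([np.ones(n), money])
b_money = np.linalg.lstsq(X, output, rcond=None)[0][1]
print(b_money)  # positive, despite a true effect of zero
```

With endogenous data, no amount of controlling fixes this by itself; that is why a randomized selection procedure (or its equivalent) matters.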