Regression Details

This notebook will explore a few additional facets of linear regression with a single variable. Begin by loading the CEOs data from the wooldridge model. We’ll print out a head of the DataFrame to see a snapshot of what the data looks like.

import wooldridge as woo
ceos = woo.data('ceosal1')
ceos.describe()
salary pcsalary sales roe pcroe ros indus finance consprod utility lsalary lsales
count 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000
mean 1281.119617 13.282297 6923.793282 17.184211 10.800478 61.803828 0.320574 0.220096 0.287081 0.172249 6.950386 8.292265
std 1372.345308 32.633921 10633.271088 8.518509 97.219399 68.177052 0.467818 0.415306 0.453486 0.378503 0.566374 1.013161
min 223.000000 -61.000000 175.199997 0.500000 -98.900002 -58.000000 0.000000 0.000000 0.000000 0.000000 5.407172 5.165928
25% 736.000000 -1.000000 2210.300049 12.400000 -21.200001 21.000000 0.000000 0.000000 0.000000 0.000000 6.601230 7.700883
50% 1039.000000 9.000000 3705.199951 15.500000 -3.000000 52.000000 0.000000 0.000000 0.000000 0.000000 6.946014 8.217492
75% 1407.000000 20.000000 7177.000000 20.000000 19.500000 81.000000 1.000000 0.000000 1.000000 0.000000 7.249215 8.878636
max 14822.000000 212.000000 97649.898438 56.299999 977.000000 418.000000 1.000000 1.000000 1.000000 1.000000 9.603868 11.489144

We’ve already computed ordinary least squares regression coefficients “by hand” once, which is a rite of passage that will surely stay with you for the rest of your life. But now that we’ve seen how it’s done, it’s time to speed things up by asking Python to do the regression for us.

The module we’ll use for this is statsmodels. Within statsmodels, the forumla.api suite of tools will allow us to perform linear regression with ease. The canonical import for statsmodels.formula.api is smf.

import statsmodels.formula.api as smf

Regression in statsmodels happens in two steps. In the first step, we tell Python about a regression formula that we want to use. The linear equation

\[ y = \beta_0 + \beta_1 x + u \]

is described as y ~ x in smf.

reg = smf.ols(formula='salary ~ roe', data=ceos)

Note that we do not need to specify an intercept term. The statsmodels module will add one in by default. The reason why will be considered later.

After creating a variable that holds the regression information, we must then instruct statsmodels to actually perform the regression. The reason for why this task is split into two steps will be shown later when we address heteroskedasticity.

res = reg.fit()

The output from a .fit() call will be a variable that stores all kinds of useful information.

For an overview of the regression estimation, use the .summary() command on the output variable. If you put the .summary() call inside of a print() statement then the output will be displayed on your screen a bit better.

print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 27 Sep 2021   Prob (F-statistic):             0.0978
Time:                        19:27:25   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

There’s a lot of output here, some of it deals with topics covered while much of it deals with things we have not yet discussed.

The Intercept (\(\beta_0\)) is estimated to be \(963.1913\). The slope coefficient on roe (\(\beta_1\)) is estimated to be \(18.5012\).

Along with the ability to display summary output, the output variable also stores information about \(\hat{y}\) and \(\hat{u}\). The information on \(\hat{y}\) is in .fittedvalues while the information about \(\hat{u}\) is in .resid. Of course, \(\hat{u}\) is merely \(y - \hat{y}\), so once we have .fittedvalues we could easily calculate \(\hat{u}\) ourselves without getting the data from .resid (but statsmodels is just super generous).

ceos['salary_hat'] = res.fittedvalues # creates a column of y hat values for our DataFrame
ceos['u_hat'] = res.resid # creates a column of u hat values for our DataFrame
ceos.head()
salary pcsalary sales roe pcroe ros indus finance consprod utility lsalary lsales salary_hat u_hat
0 1095 20 27595.000000 14.1 106.400002 191 1 0 0 0 6.998509 10.225389 1224.058071 -129.058071
1 1001 32 9958.000000 10.9 -30.600000 13 1 0 0 0 6.908755 9.206132 1164.854261 -163.854261
2 1122 9 6125.899902 23.5 -16.299999 14 1 0 0 0 7.022868 8.720281 1397.969216 -275.969216
3 578 -9 16246.000000 5.9 -25.700001 -21 1 0 0 0 6.359574 9.695602 1072.348338 -494.348338
4 1368 7 21783.199219 13.8 -3.000000 56 1 0 0 0 7.221105 9.988894 1218.507712 149.492288

Let’s turn our attention to \(R^2\), a measure for the goodness of fit for a linear regression. Define:

\[ \text{Sum of Squares, Total (SST)} = \sum_{i=1}^n (y_i - \bar{y})^2 \]

to be the total variation in the dependent variable. Note that the variance of the dependent variable is SST divided by \(n\). Also define

\[ \text{Sum of Squares, Explained (SSE)} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \]

to be the variation explained by the regression. Finally, let

\[ \text{Sum of Squares, Residual (SSR)} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

to be the variation in the dependent variable that is not explained by the regression.

Then, the \(R^2\) of the regression model is

\[ R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}. \]
ceos['ST'] = (ceos['salary'] - ceos['salary'].mean())**2 # calculate R-squared
ceos['SE'] = (ceos['salary_hat'] - ceos['salary'].mean())**2
ceos['SR'] = (ceos['u_hat'])**2
ceos[['ST','SE','SR']].describe()
ST SE SR
count 2.090000e+02 209.000000 2.090000e+02
mean 1.874320e+06 24719.708325 1.849601e+06
std 1.449754e+07 59047.801988 1.438029e+07
min 3.535839e+00 2.427324 3.847426e-01
25% 4.457149e+04 1532.299834 4.076691e+04
50% 1.773417e+05 6882.895491 1.834925e+05
75% 4.544373e+05 20740.935509 3.904399e+05
max 1.833554e+08 523725.039775 1.822469e+08
print(ceos['SE'].sum() / ceos['ST'].sum())
print(1 - ceos['SR'].sum() / ceos['ST'].sum())
0.013188624081034115
0.01318862408103405

This value for \(R^2\) matches the one shown in the .summary() output above.

Likewise, if we assume that the variance of \(u\) is independent of \(x\) (meaning, in this instance, that the regression error has variance that does not depend on the level of roe), then the formula for \(var(\hat{\beta}_1)\) is given by

\[ var(\hat{\beta}_1) = \frac{var(u)}{\sum_{i=1}^n(x_i - \bar{x})^2} \]

where we can estimate \(var(u)\) with

\[ \hat{var}(u) = \frac{1}{n-2}\sum_{i=1}^n \hat{u}_i^2 \]

Often, we report the square root of \(var(\hat{\beta}_1)\), which is referred to as the standard error of the regression model.

import numpy as np
se_b1 = np.sqrt( ((res.resid**2).sum()/(len(ceos['SR'])-2)) / ((ceos['roe'] - ceos['roe'].mean())**2).sum() )
print(se_b1)
11.123250903287637

This output again matches the regression output from .summary() above.

To see it a bit more slowly:

n = len(ceos['SR'])
varHat_u = (1/(n-2)) * (res.resid**2).sum()
var_b1 = varHat_u / ((ceos['roe'] - ceos['roe'].mean())**2).sum()
se_b1 = np.sqrt(var_b1)
print(se_b1)
11.123250903287637

Knowledge of the standard error is important.

The formulas shown above assume that the variance of \(u\) is independent of \(x\). We can allow for the two to be related with the cov_type_'HC3' option to .fit().

res2 = reg.fit(cov_type='HC3')
print(res2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     7.191
Date:                Mon, 27 Sep 2021   Prob (F-statistic):            0.00792
Time:                        19:27:25   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:                  HC3                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    122.072      7.890      0.000     723.935    1202.448
roe           18.5012      6.899      2.682      0.007       4.979      32.024
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)

Note here that, unlike before, the 95% confidence interval for roe excludes \(0\).