PSA: Python, OLS and perfectly collinear variables

Unlike most implementations of linear models (Stata omits perfectly collinear variables with a note; R's lm() reports NA for them), Python packages don't usually drop perfectly collinear variables. They fit the model anyway and silently return one of the infinitely many least-squares solutions.

Here’s statsmodels as a first example (see https://github.com/statsmodels/statsmodels/issues/3824):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

e = np.random.normal(size=30)

# create two perfectly collinear variables x1 and x2,
# where x2 is just 2 times x1
x1 = np.arange(30)
x2 = 2 * x1

y = 2 * x1 + e
data = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
model = smf.ols("y ~ x1 + x2", data=data)
res = model.fit()
res.summary()



                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 1.354e+04
Date:                Tue, 04 Jun 2019   Prob (F-statistic):           3.80e-39
Time:                        20:23:08   Log-Likelihood:                -35.185
No. Observations:                  30   AIC:                             74.37
Df Residuals:                      28   BIC:                             77.17
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3681      0.288      1.277      0.212      -0.222       0.959
x1             0.3973      0.003    116.364      0.000       0.390       0.404
x2             0.7946      0.007    116.364      0.000       0.781       0.809
==============================================================================
Omnibus:                        3.352   Durbin-Watson:                   2.071
Prob(Omnibus):                  0.187   Jarque-Bera (JB):                2.993
Skew:                          -0.715   Prob(JB):                        0.224
Kurtosis:                       2.407   Cond. No.                     7.31e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.01e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
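
Why these particular numbers? statsmodels' default OLS solver uses the Moore-Penrose pseudoinverse, which picks the minimum-norm solution out of the infinitely many that fit equally well. Since y is roughly 2 * x1 and x2 = 2 * x1, any pair satisfying b1 + 2 * b2 = 2 fits; the minimum-norm pair is (0.4, 0.8), which is what the table reports up to noise. A quick pre-flight check is to compare the rank of the design matrix to its number of columns. A minimal sketch (my variable names, reusing data from above):

import numpy as np

X = data[["x1", "x2"]].to_numpy()

# rank < number of columns means at least one column is an exact
# linear combination of the others, i.e. the design matrix is singular
print(np.linalg.matrix_rank(X), X.shape[1])  # prints: 1 2

# np.linalg.lstsq uses the same SVD-based machinery and likewise
# returns the minimum-norm solution for a rank-deficient design
X_c = np.column_stack([np.ones(len(X)), X])  # add an intercept column
beta, *_ = np.linalg.lstsq(X_c, data.y.to_numpy(), rcond=None)
print(beta)  # roughly [0.37, 0.40, 0.79], matching the summary above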

Neither does the popular machine-learning package scikit-learn:

from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X=data[["x1", "x2"]], y=data.y)
lm.coef_
array([0.39728922, 0.79457845])
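
If you want the Stata/R behaviour of dropping redundant columns, you have to do it yourself before fitting. One standard trick (a sketch, not the only option) is a pivoted QR decomposition, which orders the columns so that a linearly independent subset comes first:

import numpy as np
from scipy.linalg import qr
from sklearn.linear_model import LinearRegression

X = data[["x1", "x2"]].to_numpy().astype(float)

# with pivoting, the diagonal of R decays to ~0 once the remaining
# columns are linear combinations of the columns already picked
_, R, piv = qr(X, pivoting=True)
tol = 1e-10 * abs(R[0, 0])
rank = int(np.sum(np.abs(np.diag(R)) > tol))
keep = sorted(piv[:rank])  # column indices of an independent subset

# here the pivot keeps x2 (the larger-norm column) and drops x1
lm = LinearRegression()
lm.fit(X=data[["x1", "x2"]].iloc[:, keep], y=data.y)
lm.coef_  # a single coefficient of about 1.0 on x2 (i.e. 2.0 on x1)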

(Meme image: "yamero"; source: https://knowyourmeme.com/photos/1250147-yamero)

Philip Khor

Data scientist with a background in economics.
