PSA: Python, OLS and perfectly collinear variables

Unlike most implementations of linear models (e.g. Stata, R), Python packages don’t usually drop perfectly collinear variables. They fit the model anyway and return a coefficient for every regressor.

Here’s Statsmodels as a first example (see https://github.com/statsmodels/statsmodels/issues/3824):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# noise term
e = np.random.normal(size=30)

# create two perfectly collinear regressors, x1 and x2,
# where x2 is just 2 times x1
x1 = np.arange(30)
x2 = 2 * x1

y = 2 * x1 + e
data = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# the model fits without error and reports a coefficient for both x1 and x2
model = smf.ols("y ~ x1 + x2", data=data)
res = model.fit()
res.summary()
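
Since nothing gets dropped for you, one way to catch the problem yourself is to compare the rank of the design matrix with its number of columns. This is only a minimal sketch: it rebuilds the design matrix by hand with NumPy rather than pulling it out of Statsmodels, and the seed and the names `X`, `rank`, `n_cols` and `is_rank_deficient` are my own additions, not part of the example above.

```python
import numpy as np

# Same toy data as above; the fixed seed is an assumption added for reproducibility.
rng = np.random.default_rng(0)
x1 = np.arange(30)
x2 = 2 * x1
y = 2 * x1 + rng.normal(size=30)

# Design matrix with an intercept column, mirroring the formula "y ~ x1 + x2".
X = np.column_stack([np.ones(30), x1, x2])

# With perfect collinearity the rank is lower than the number of columns.
rank = np.linalg.matrix_rank(X)
n_cols = X.shape[1]
is_rank_deficient = rank < n_cols

print(f"rank = {rank}, columns = {n_cols}, rank deficient: {is_rank_deficient}")
```

If the rank falls short of the column count, at least one regressor is a linear combination of the others and its coefficient is not separately identified.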

The popular machine learning package Scikit-Learn doesn’t drop the collinear variable either:

from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X=data[["x1", "x2"]], y=data["y"])
lm.coef_
array([0.39728922, 0.79457845])
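
Notice that the second coefficient is exactly twice the first. That pattern is what you would expect from a minimum-norm least-squares solution, which splits the effect across the collinear columns instead of dropping one. The sketch below reproduces the pattern with `np.linalg.lstsq` on centred data. It is an illustration under my own assumptions (the seed, the manual centring in place of an intercept, and the variable names), not a trace of what Scikit-Learn does internally.

```python
import numpy as np

# Same toy data as above; the fixed seed is an assumption added for reproducibility.
rng = np.random.default_rng(0)
x1 = np.arange(30, dtype=float)
x2 = 2 * x1
y = 2 * x1 + rng.normal(size=30)

# Centre X and y so that no intercept column is needed.
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# For a rank-deficient matrix, lstsq returns the minimum-norm solution.
coef, _, rank, _ = np.linalg.lstsq(Xc, yc, rcond=None)

# Slope from regressing y on x1 alone, for comparison.
slope = np.polyfit(x1, y, 1)[0]

print("coefficients:", coef)
print("rank of centred X:", rank)
print("coef[0] + 2 * coef[1]:", coef[0] + 2 * coef[1], "vs slope on x1 alone:", slope)
```

Only the combination `coef[0] + 2 * coef[1]` is pinned down by the data; it matches the slope you get from regressing y on x1 alone. The individual coefficients are just one of infinitely many equally good fits.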

source: https://knowyourmeme.com/photos/1250147-yamero

Philip Khor

Data scientist with a background in economics.
