PSA: Python, OLS and perfectly collinear variables

Unlike most implementations of linear models (e.g. Stata, R), Python packages don’t usually drop perfectly collinear variables.
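To make "perfectly collinear" concrete: a design matrix has perfectly collinear columns when its rank is lower than its number of columns. Here is a minimal sketch of that check using NumPy only. The helper name `has_perfect_collinearity` is hypothetical, made up for illustration, and not part of any package discussed here.

```python
import numpy as np

def has_perfect_collinearity(X):
    """Return True if the columns of X are linearly dependent.

    Hypothetical helper for illustration; relies on NumPy's default
    rank tolerance.
    """
    X = np.asarray(X, dtype=float)
    return np.linalg.matrix_rank(X) < X.shape[1]

x1 = np.arange(30)
x2 = 2 * x1  # exact multiple of x1
X = np.column_stack([np.ones(30), x1, x2])  # intercept, x1, x2

print(np.linalg.matrix_rank(X))          # rank is below the 3 columns
print(has_perfect_collinearity(X))       # True
print(has_perfect_collinearity(X[:, :2]))  # False once x2 is left out
```

A check like this can be run before fitting, since the packages below won't do it for you.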

Here’s Statsmodels as a first example:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

e = np.random.normal(size=30)

# create two regressors, x1 and x2,
# where x2 is just 2 times x1 (perfectly collinear)
x1 = np.arange(30)
x2 = 2 * x1

y = 2 * x1 + e
data = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
model = smf.ols("y ~ x1 + x2", data=data)
res = model.fit()
# res.params reports a coefficient for both x1 and x2; neither is dropped

Neither does the popular machine learning package Scikit-Learn:

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X = data[["x1", "x2"]], y = data.y)
lm.coef_
array([0.39728922, 0.79457845])
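Note that the two coefficients above satisfy b1 + 2 * b2 ≈ 2: the true slope on x1 is split across both columns instead of one column being dropped. This is what a minimum-norm least-squares solution looks like. The sketch below reproduces the pattern with `np.linalg.lstsq`. It is an illustration under assumptions, not Scikit-Learn's internal code: the data are a noise-free version of the example, and centering stands in for fitting an intercept.

```python
import numpy as np

x1 = np.arange(30, dtype=float)
x2 = 2 * x1
y = 2 * x1  # noise-free version of the example, assumed for illustration

# Center the data, which plays the role of fitting an intercept
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# lstsq returns the minimum-norm solution when X is rank deficient
coef, _, rank, _ = np.linalg.lstsq(Xc, yc, rcond=None)
print(rank)                   # 1: only one independent column
print(coef)                   # approximately [0.4, 0.8]
print(coef[0] + 2 * coef[1])  # approximately 2.0, the true slope on x1

# Dropping the redundant column by hand recovers a single slope
coef_x1, *_ = np.linalg.lstsq(Xc[:, :1], yc, rcond=None)
print(coef_x1)                # approximately [2.0]
```

The predictions are the same either way; only the individual coefficients become uninterpretable, which is why Stata and R drop the redundant column for you.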


Philip Khor

Data scientist with a background in economics.
