PSA: Python, OLS and perfectly collinear variables

Unlike most implementations of linear models (e.g. Stata, which omits an offending variable with a note, or R, which reports an NA coefficient for it), Python packages don't usually drop perfectly collinear variables.

Here's statsmodels as a first example (see https://github.com/statsmodels/statsmodels/issues/3824):

import numpy as np
import statsmodels.formula.api as smf
import pandas as pd

e = np.random.normal(size = 30)

# create two variables x1 and x2,
# where x2 is just 2 times x1 (perfectly collinear)
x1 = np.arange(30)
x2 = 2 * x1

y = 2 * x1 + e
data = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

model = smf.ols("y ~ x1 + x2", data = data)
res = model.fit()
res.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 1.354e+04
Date:                Tue, 04 Jun 2019   Prob (F-statistic):           3.80e-39
Time:                        20:23:08   Log-Likelihood:                -35.185
No. Observations:                  30   AIC:                             74.37
Df Residuals:                      28   BIC:                             77.17
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3681      0.288      1.277      0.212      -0.222       0.959
x1             0.3973      0.003    116.364      0.000       0.390       0.404
x2             0.7946      0.007    116.364      0.000       0.781       0.809
==============================================================================
Omnibus:                        3.352   Durbin-Watson:                   2.071
Prob(Omnibus):                  0.187   Jarque-Bera (JB):                2.993
Skew:                          -0.715   Prob(JB):                        0.224
Kurtosis:                       2.407   Cond. No.                     7.31e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.01e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
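So statsmodels fits the model anyway and only hints at the problem via the eigenvalue warning and the huge condition number. One way to catch this yourself before fitting (a sketch, not anything statsmodels does automatically) is to compare the rank of the design matrix to its number of columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = np.arange(30)
x2 = 2 * x1  # perfectly collinear with x1
y = 2 * x1 + rng.normal(size=30)
data = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# Build the design matrix (intercept plus regressors) and compare its
# rank to its column count: any deficit means perfect collinearity.
X = np.column_stack([np.ones(len(data)), data["x1"], data["x2"]])
rank = np.linalg.matrix_rank(X)
print(rank, X.shape[1])  # rank 2 < 3 columns: the design matrix is singular
```

A rank deficit tells you the matrix is singular, though not which columns are responsible; for that you'd inspect the columns pairwise or look at the null space.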

Neither does the popular machine learning package scikit-learn:

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X = data[["x1", "x2"]], y = data.y)
lm.coef_

array([0.39728922, 0.79457845])
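Note what happened to the coefficients: rather than dropping a column, the least-squares solver returned the minimum-norm solution, splitting the true effect of 2 across the two collinear columns. A quick sketch (variable names are mine, and the seed is arbitrary) showing that the combined slope still recovers the data-generating coefficient:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = np.arange(30)
x2 = 2 * x1  # perfectly collinear with x1
y = 2 * x1 + rng.normal(size=30)
X = np.column_stack([x1, x2])

lm = LinearRegression().fit(X, y)

# Since x2 = 2 * x1, the fitted values depend only on the combination
# coef_[0] + 2 * coef_[1], which should be close to the true slope of 2;
# the individual coefficients are just one of infinitely many solutions.
combined = lm.coef_[0] + 2 * lm.coef_[1]
print(lm.coef_, combined)
```

This is why the individual coefficients (and their standard errors in the statsmodels output above) are meaningless here, even though predictions from the model are fine.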