PSA: Python, OLS and perfectly collinear variables

Unlike most implementations of linear models (e.g. Stata, R), Python packages don’t usually drop perfectly collinear variables. They fit the model anyway and report a coefficient for every column.

Here’s Statsmodels as a first example (see https://github.com/statsmodels/statsmodels/issues/3824):

import numpy as np 
import statsmodels.formula.api as smf
import pandas as pd

e = np.random.normal(size = 30)

# creating two variables x1 and x2
# where x2 is just 2 times x1, i.e. perfectly collinear
x1 = np.arange(30)
x2 = 2 * x1

y = 2 * x1 + e
data = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

model = smf.ols("y ~ x1 + x2", data = data)
res = model.fit()
res.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 1.354e+04
Date:                Tue, 04 Jun 2019   Prob (F-statistic):           3.80e-39
Time:                        20:23:08   Log-Likelihood:                -35.185
No. Observations:                  30   AIC:                             74.37
Df Residuals:                      28   BIC:                             77.17
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3681      0.288      1.277      0.212      -0.222       0.959
x1             0.3973      0.003    116.364      0.000       0.390       0.404
x2             0.7946      0.007    116.364      0.000       0.781       0.809
==============================================================================
Omnibus:                        3.352   Durbin-Watson:                   2.071
Prob(Omnibus):                  0.187   Jarque-Bera (JB):                2.993
Skew:                          -0.715   Prob(JB):                        0.224
Kurtosis:                       2.407   Cond. No.                     7.31e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.01e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
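The reported coefficients are not arbitrary. The x2 coefficient is twice the x1 coefficient, and 0.3973 + 2 × 0.7946 ≈ 1.99 is close to the true slope of 2. That is the pattern a minimum-norm least-squares solution produces, which is consistent with the pseudoinverse-based default (`method="pinv"`) of `OLS.fit`. Here is a minimal NumPy-only sketch of the same effect. The seed is an assumption made for reproducibility, so the numbers will differ slightly from the run above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen for this sketch only

x1 = np.arange(30, dtype=float)
x2 = 2 * x1                      # perfectly collinear with x1
y = 2 * x1 + rng.normal(size=30)

# Design matrix with an intercept column, like the formula "y ~ x1 + x2"
X = np.column_stack([np.ones_like(x1), x1, x2])

# Minimum-norm least-squares solution; lstsq also returns the numerical rank
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Reference fit on the identified model "y ~ x1"
beta_ref, *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)

print("rank:", rank, "for", X.shape[1], "columns")
print("coefficients (Intercept, x1, x2):", beta)
print("x1 + 2 * x2 effect:", beta[1] + 2 * beta[2], "vs. slope of y ~ x1:", beta_ref[1])
```

The rank is lower than the number of columns, which is the signal that Stata or R would act on by dropping a variable. Only the combined effect of x1 and x2 is identified; how it is split between the two is a by-product of the solver.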

The popular machine learning package scikit-learn doesn’t drop the collinear variable either:

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X = data[["x1", "x2"]], y = data.y)
lm.coef_

array([0.39728922, 0.79457845])
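If you want the Stata/R behaviour, one option is to drop redundant columns yourself before fitting. Below is a rough sketch, not a library feature: `drop_collinear` is a hypothetical helper that keeps a column only if it raises the rank of the columns kept so far (including an implicit intercept). It relies on the default tolerance of `np.linalg.matrix_rank`, so it targets perfect collinearity, not merely high correlation. The data are regenerated with an assumed seed so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def drop_collinear(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep the first column of each collinear set."""
    ones = np.ones((len(X), 1))   # implicit intercept column
    kept = []
    rank = 1                      # rank of the intercept column alone
    for col in X.columns:
        candidate = np.column_stack(
            [ones, X[kept + [col]].to_numpy(dtype=float)]
        )
        new_rank = np.linalg.matrix_rank(candidate)
        if new_rank > rank:       # column adds new information, keep it
            kept.append(col)
            rank = new_rank
    return X[kept]


rng = np.random.default_rng(0)    # seed chosen for this sketch only
x1 = np.arange(30)
data = pd.DataFrame({"x1": x1, "x2": 2 * x1, "y": 2 * x1 + rng.normal(size=30)})

X_reduced = drop_collinear(data[["x1", "x2"]])
lm = LinearRegression().fit(X_reduced, data["y"])

print(list(X_reduced.columns))    # columns that survived
print(lm.coef_)                   # a single slope, close to the true value of 2
```

Because columns are checked in order, the result depends on column order: putting x2 first would keep x2 and drop x1.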
