PSA: Python, OLS and perfectly collinear variables

econometrics, python

Author: Philip Khor
Published: June 5, 2019

Unlike most implementations of linear models (e.g. Stata, R), Python packages don’t usually drop perfectly collinear variables.

Here’s statsmodels as a first example (see https://github.com/statsmodels/statsmodels/issues/3824):

import numpy as np
import statsmodels.formula.api as smf
import pandas as pd

# no seed is set, so the numbers below will differ slightly on each run
e = np.random.normal(size=30)

# create two perfectly collinear variables x1 and x2,
# where x2 is just 2 times x1
x1 = np.arange(30)
x2 = 2 * x1

y = 2 * x1 + e
data = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
model = smf.ols("y ~ x1 + x2", data=data)
res = model.fit()
res.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 1.354e+04
Date:                Tue, 04 Jun 2019   Prob (F-statistic):           3.80e-39
Time:                        20:23:08   Log-Likelihood:                -35.185
No. Observations:                  30   AIC:                             74.37
Df Residuals:                      28   BIC:                             77.17
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3681      0.288      1.277      0.212      -0.222       0.959
x1             0.3973      0.003    116.364      0.000       0.390       0.404
x2             0.7946      0.007    116.364      0.000       0.781       0.809
==============================================================================
Omnibus:                        3.352   Durbin-Watson:                   2.071
Prob(Omnibus):                  0.187   Jarque-Bera (JB):                2.993
Skew:                          -0.715   Prob(JB):                        0.224
Kurtosis:                       2.407   Cond. No.                     7.31e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.01e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
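
The giveaway is at the bottom of the summary: the condition number is 7.31e+16 and statsmodels warns that the design matrix might be singular, yet it still reports a coefficient for every term. As a quick sanity check (a minimal sketch reusing the model object from above; model.exog is the design matrix statsmodels built from the formula), you can confirm the rank deficiency directly:

import numpy as np

# the design matrix has three columns (Intercept, x1, x2) but only rank 2,
# since x2 = 2 * x1
X = model.exog
print(X.shape[1])                # 3 columns
print(np.linalg.matrix_rank(X))  # rank 2: one column is redundant
print(np.linalg.cond(X))         # enormous condition number, as in the summary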

Neither does the popular machine learning package scikit-learn:

from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X=data[["x1", "x2"]], y=data.y)
lm.coef_
array([0.39728922, 0.79457845])
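
It’s worth seeing what both libraries are actually doing. Rather than dropping a column, they solve the rank-deficient least-squares problem with a minimum-norm solution (statsmodels fits via the Moore-Penrose pseudoinverse by default, and scikit-learn’s least-squares solver returns the same answer here), so the true coefficient of 2 on x1 gets split across the two collinear columns: 0.3973 + 2 × 0.7946 ≈ 2. A minimal sketch, reusing x1, x2 and y from above, to verify:

import numpy as np

# minimum-norm least-squares solution via the pseudoinverse;
# this should reproduce both sets of estimates up to floating-point noise
X = np.column_stack([np.ones_like(x1), x1, x2])
beta = np.linalg.pinv(X) @ y
print(beta)                    # [intercept, b1, b2]
print(beta[1] + 2 * beta[2])   # ~2: the collinear columns share the true effect

If you want the R/Stata behaviour, you have to drop the redundant columns yourself. One way (a sketch, not a built-in of either library) is a QR decomposition with column pivoting, which is essentially how R’s lm() decides which coefficients to mark as aliased:

import numpy as np
from scipy.linalg import qr

features = data[["x1", "x2"]].to_numpy().astype(float)
_, R, piv = qr(features, pivoting=True)
# diagonal entries of R that are numerically nonzero mark independent columns
rank = int(np.sum(np.abs(np.diag(R)) > 1e-10 * abs(R[0, 0])))
keep = sorted(piv[:rank])
print(data[["x1", "x2"]].columns[keep])  # one of the two collinear columns survives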

(Image: the “yamero” meme. Source: https://knowyourmeme.com/photos/1250147-yamero)