Linear Regression

Linear regression models the relationship between two or more variables. Before fitting a model, we measure the association between the variables with covariance and correlation, and we plot them to look for a linear relationship.

import pandas as pd
import matplotlib.pyplot as plt

# Import the housing information for analysis
# (pd.read_csv replaces DataFrame.from_csv, which was removed from pandas)
housing = pd.read_csv('../data/housing.csv', index_col=0)
housing.head()

# Use covariance to calculate the association
housing.cov()

# Correlation is a more appropriate measure of association in this case,
# because it does not depend on the units of the variables
housing.corr()
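To see why correlation is the more convenient measure, here is a minimal sketch on synthetic data (the columns `x` and `y` are made up for illustration and are not part of the housing data). Rescaling one variable changes the covariance but leaves the correlation unchanged.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the housing data): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
df = pd.DataFrame({'x': x, 'y': y})

# Rescale x, as a change of units would (for example metres to centimetres).
df_scaled = df.assign(x=df['x'] * 100)

cov_before = df.cov().loc['x', 'y']
cov_after = df_scaled.cov().loc['x', 'y']
corr_before = df.corr().loc['x', 'y']
corr_after = df_scaled.corr().loc['x', 'y']

print('covariance: ', cov_before, '->', cov_after)    # scaled by a factor of 100
print('correlation:', corr_before, '->', corr_after)  # unchanged by the rescaling
```

Because the housing columns are measured in very different units, their covariances are hard to compare with each other, while correlations always lie between -1 and 1.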

# Scatter matrix plot
# (scatter_matrix now lives in pandas.plotting; pandas.tools.plotting was removed)
from pandas.plotting import scatter_matrix
sm = scatter_matrix(housing, figsize=(10, 10))

# This time we take a closer look at MEDV vs LSTAT. What association between MEDV and LSTAT do you observe?
housing.plot(kind='scatter', x='LSTAT', y='MEDV', figsize=(10, 10))

Simple linear regression

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

# Let's try to guess the real values of the intercept and slope.
# We call our guesses b0 and b1.
# Try assigning values to b0 and b1 to get a straight line that describes our data.
b0 = 0.1
b1 = 1
housing['GuessResponse'] = b0 + b1*housing['RM']

# We also want to know the error of our guess.
# This shows how far our guessed response is from the true response.
housing['observederror'] = housing['MEDV'] - housing['GuessResponse']


# plot your estimated line together with the points
plt.figure(figsize=(10, 10))
plt.title('Sum of squared error is {}'.format((((housing['observederror'])**2)).sum()))
plt.scatter(housing['RM'], housing['MEDV'], color='g', label='Observed')
plt.plot(housing['RM'], housing['GuessResponse'], color='red', label='GuessResponse')
plt.legend()
plt.xlim(housing['RM'].min()-2, housing['RM'].max()+2)
plt.ylim(housing['MEDV'].min()-2, housing['MEDV'].max()+2)
plt.show()
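Instead of guessing b0 and b1 by hand, we can compute the pair that minimizes the sum of squared errors directly. The sketch below uses the standard closed-form least-squares solution for one predictor. The helper names `fit_simple_ols` and `sse` are hypothetical, and the data is a synthetic stand-in, not the housing data.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing the sum of squared errors of y ~ b0 + b1*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared errors for the line b0 + b1*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return ((y - (b0 + b1 * x)) ** 2).sum()

# Synthetic stand-in for RM and MEDV (made-up numbers, not the housing data).
rng = np.random.default_rng(1)
x = rng.uniform(4, 9, size=100)
y = -30.0 + 9.0 * x + rng.normal(scale=3.0, size=100)

b0_hat, b1_hat = fit_simple_ols(x, y)
print('least-squares estimates:', b0_hat, b1_hat)
print('SSE of the guess b0=0.1, b1=1:', sse(x, y, 0.1, 1))
print('SSE of the least-squares line:', sse(x, y, b0_hat, b1_hat))
```

With the housing data loaded, the same helper could be called as `fit_simple_ols(housing['RM'], housing['MEDV'])`, and the resulting line should have a smaller sum of squared errors than any hand-picked guess.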

Validate Models

  • Linearity
  • Independence
  • Normality
  • Equal Variance
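Linearity and equal variance are usually judged from a plot of the residuals, and normality from a QQ plot. For independence, one common numeric check is the Durbin-Watson statistic, where values near 2 suggest no first-order autocorrelation in the residuals. The sketch below computes it by hand on synthetic residuals. The helper name `durbin_watson_stat` is hypothetical and the residuals are simulated, not taken from a fitted housing model.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return (np.diff(e) ** 2).sum() / (e ** 2).sum()

rng = np.random.default_rng(2)

# Independent residuals: plain white noise.
independent = rng.normal(size=2000)

# Dependent residuals: each value carries over 80% of the previous one.
correlated = np.empty(2000)
correlated[0] = independent[0]
for t in range(1, 2000):
    correlated[t] = 0.8 * correlated[t - 1] + independent[t]

dw_independent = durbin_watson_stat(independent)
dw_correlated = durbin_watson_stat(correlated)
print('independent residuals:', dw_independent)  # close to 2
print('correlated residuals: ', dw_correlated)   # well below 2
```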

Normality Validation – Use QQ plot

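import scipy.stats as stats
import matplotlib.pyplot as plt

# Standardize the residuals; housing['error'] is assumed to hold the residuals
# (observed MEDV minus the fitted response) of the regression model
z = (housing['error'] - housing['error'].mean())/housing['error'].std(ddof=1)

# Compare the standardized residuals with the quantiles of a normal distribution
stats.probplot(z, dist='norm', plot=plt)
plt.title('Normal QQ plot')
plt.show()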
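A QQ plot is read by eye: the closer the points lie to the straight line, the more plausible the normality assumption. As a rough numeric companion, `stats.probplot` also returns the correlation of the points with the fitted line when it is called without `plot`. The sketch below compares simulated normal and skewed residuals. The helper name `qq_correlation` is hypothetical and the data is synthetic.

```python
import numpy as np
import scipy.stats as stats

def qq_correlation(residuals):
    """Correlation between the ordered standardized residuals and normal quantiles."""
    e = np.asarray(residuals, dtype=float)
    z = (e - e.mean()) / e.std(ddof=1)
    # Without plot=..., probplot only returns the QQ points and the line fit.
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(z, dist='norm')
    return r

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=500)       # satisfies the normality assumption
skewed_resid = rng.exponential(size=500)  # strongly right-skewed, violates it

r_normal = qq_correlation(normal_resid)
r_skewed = qq_correlation(skewed_resid)
print('normal residuals:', r_normal)  # very close to 1
print('skewed residuals:', r_skewed)  # noticeably lower
```

This number is only a summary. The plot itself still shows where the departure from normality occurs, for example in the tails.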

Perform OLS regression based on the model

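import statsmodels.formula.api as smf

# Regress SPY on its lag-1 value and on the other market indices
# (Train is assumed to be the training DataFrame containing these columns)
formula = 'spy~spy_lag1+sp500+nasdaq+dji+cac40+aord+daxi+nikkei+hsi'
lm = smf.ols(formula=formula, data=Train).fit()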
lm.summary()

We can measure the performance of our model using some statistical metrics: RMSE and adjusted R².
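As a minimal sketch of how these two metrics are computed, the helpers below use the textbook definitions. The function names `rmse` and `adjusted_r2` are hypothetical, and the numbers are a tiny made-up example rather than output of the model above. Note that some courses divide the squared error by n - k - 1 instead of n when computing RMSE; this sketch uses the plain mean.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(((y_true - y_pred) ** 2).mean())

def adjusted_r2(y_true, y_pred, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    sse = ((y_true - y_pred) ** 2).sum()         # unexplained variation
    sst = ((y_true - y_true.mean()) ** 2).sum()  # total variation
    r2 = 1 - sse / sst
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# Made-up observed and predicted values for a model with one predictor.
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.5, 11.5])

model_rmse = rmse(y_true, y_pred)
model_adj_r2 = adjusted_r2(y_true, y_pred, n_predictors=1)
print('RMSE:', model_rmse)
print('Adjusted R2:', model_adj_r2)
```

For a fitted statsmodels result such as `lm`, the adjusted R² is also available as `lm.rsquared_adj`, so the helper is mainly useful for scoring predictions on data the model was not trained on.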
