Part 3 of 3, Creating a Regression Model in Python
Using the past to predict the future! Say hello to part 3 of 3 in this series on regression modeling with python! In blog 1, I covered the important processing steps prior to creating a linear regression model. In blog 2, I showed you how to create the actual regression model along with demonstrating how to reuse the model with new data. In this blog I will cover how to check for post-linear model creation assumptions: Homoscedasticity & Normality in Residuals.
For a more thorough overview of the project related to this series of blogs see: https://github.com/rgpihlstrom/Phase2Project .
Now let’s get started!
Before diving into the topic at hand, lets review the 4 assumptions that need to hold true prior to releasing your model into a production environment.
Pre-Model Assumptions — Covered in Blog 1
1. Linearity between Target and Features
2. No Multicollinearity between Features
Post Model Assumptions- Covered here
3. Homoscedasticity in the Residual Error
4. Normality in the Residual Error.
We will be covering points 3 & 4 in this blog.
3. Homoscedasticity in the Residuals
When our model is created using linear regression, we are using OLS. For more information on OLS see here. The difference between the actual and predicted value is technically an error in our ability to predict the price accurately but we refer to those errors as residuals. When a residual does not have a constant random variance against the best fit line across your features (x axis), the model is said to be heteroscedastic. This explanation is certainly difficult to conceptualize without visuals — See the above theoretical examples.
Each blue dot represents the price of the home on the x and the residuals (actual — predicted) on the y axis. The goal is to have residuals = 0 as that would mean your model made a perfect prediction. When the residuals look like graphs A, or B you can see that the variance in residuals form a pattern. In graph A, as prices get larger so do the errors. In graph B, as prices gets larger the errors get smaller. In graph C, the errors are consistently random “the same” across the predictions. In graphs A and B there is most likely a missing feature from the model that we could use to account for those errors. We also saw an indication of this bias in our model in blog 2 with the points in the red circle on the scatter plot showing our predicted vs. actual prices.
Recognizing the violation of homoscedasticity is related to the confidence we have in the models ability to make accurate predictions for all data points in which the model was optimized. Remember the exercise here is linear regression, meaning as you increase or decrease your features, there is a clear corresponding relationship in output. Heteroscedasticity impacts the standard error that is built into creating the coefficients and pvalues associated with accepting your coefficients as significant. Said more plainly, heteroscedastic data should cause us to question our confidence in a model’s ability to consistently explain the variability in the data.
So what does our data look like? Is our data hetero or homoscedastic? See below:
From the above our data looks heteroscedastic. So how do we tell definitively if our data is hetero or homoscedastic?
Option 1. Plotting our data as shown above. With x axis as predicted price and the y axis as residuals:
#Graph our residuals against our predictions, this will give us a sense if our model is off for certain priced homes
plt.xticks(ticks=(300000, 500000, 700000, 900000, 1200000),labels= ('$300k', '$500k', '$700k', '$900k',$1.2M'))
plt.yticks(ticks=(-300000, -100000, 0, 300000, 500000),labels= ('$-300k', '$-100k', '$0k', '$300k','$500k'))
plt.plot(df_predictions["price_Predicted"], [0 for i in range(len(df_predictions["price_Predicted"]))],color="r");
Option 2. Breusch-Pagan or Goldfeld-Quandt. (Breusch-Pagan shown below)
We can also use two statistical tests: Breusch-Pagan or Goldfeld-Quandt. In either test the null hypothesis assumes homoscedasticity and a p-value below .05 indicates we should reject the null in favor of heteroscedasticity.
# het_breuschpagan suggests heteroscedasticity in the data P-value> .05, null = homoscedasticity vs. heteroscedasticity
from statsmodels.compat import lzip
import statsmodels.stats.api as sms
name = ['LM Statistic', 'LM-Test p-value', 'F-Statistic', 'F-Test p-value']
test = sms.het_breuschpagan(model.resid, predictors_int)
As you can see from the highlight, given our pvalue is much lower than .o5, we would reject the null that our data is homoscedastic and instead accept that our data is heteroscedastic. So what should we do? For the sake of brevity, I will not look to fix heteroskedasticity in this post. See the following post for options on how to fix heteroscedastic datasets.
4. Normality in Distribution of Residuals
The 4th assumption in linear regression modeling is that the residuals from our model are independent and normally distributed. If the residuals are not exactly normally distributed, but the sample size is large enough then the Central Limit Theorem says that predictions from our model will still be approximately correct. Therefore, this issue is really only relevant for small sample sizes. However, to test for normality in our residuals we can use formal tests such as Shapiro-Wilk and or visual tests such as the QQPlots. I will use the QQPlot for this blog. For details on the Shapiro-Wilk test in python see here.
What does QQ plot tell us?
A Q-Q plot tells us if our data comes from a specified distribution (normal in our case) by looking at your provided data, breaking it into quantiles (z scores) and then developing another “theoretical” set of data mimicking your data and then calculating the quantiles/ z-scores for that data, sort ordering from lowest to biggest and then using those points to create a scatter plot. I realize that is a mouth full. Our points, if normally distributed, will form a straight line on the 45-degree angle. Each point is paired between our data and the theoretical data. See the graphs below for examples of normally distributed vs. non-normally distributed datasets and their impact on the QQPlots. As you can see the ideal QQPlot is shown in graph B. The datapoints fall along the 45-degree. Each point in that dataset has a corresponding point coming from the other dataset. If our data does not match the theoretical in terms of points per quantile, you get a data point plotted beyond the 45-degree line (lower graphs A, C).
So how do our QQPlots look for our data and how do we create these using python? See below:
# Import appropriate libraries
import statsmodels.api as sm
import scipy.stats as stats
import pylab#Plot Histogram and QQP
sm.graphics.qqplot(data=model.resid, dist=stats.norm, line='45', fit=True, ax=axes)
axes.set_title('Histogram of Residuals')
axes.set_title('QQP of Residuals')
axes.set_ylabel("Residual Z Scores")
As you can see from the highlights, our data is not 100% normal, but overall our errors looks fairly normally distributed. I would say, for this assumption, we have passed and would be okay to proceed.
Above I covered the steps to review the post-model creation assumptions associated with linear regression modeling. I also provided additional resources to explore each of the topics further.
Through this 3 part series on linear regression I showed you how to process your data prior to model creation and check for the first two assumptions of linear regression. In blog 2, I showed you how to create a model using Statsmodel and how to reuse your model with new data. Lastly, here I finished the series with checking post model creation assumptions.
Hopefully you have benefited from this 3 part series, I know I have benefited from writing it! I look forward to seeing you soon!
Next Stop — DATA ALCHEMY!