Using The Past To Predict The Future

9 min readDec 9, 2020

Part 2 of 3, Creating a Regression Model in Python

Introduction

Using the past to predict the future! Say hello to part 2 of 3 in this series on regression modeling with python! In blog 1, I covered the important processing steps prior to creating a linear regression model. In this blog I build from that foundation by creating the actual regression model along with demonstrating how to reuse the model with new data. In blog 3 I will cover how to check for the post-linear model creation assumptions.

For a more thorough overview of the project related to this series of blogs see: https://github.com/rgpihlstrom/Phase2Project .

Now let’s get creating!

Refresher Of Our Data: A blend of King County Housing data with scraped data from the King County Tax Assessors Dept. For more details see blog 1.

King County Data + Scraped Data From King County Tax Assessors

1. Creating Our Model

With cleaned and assumption adhering data in hand, our first step is to separate our target from our features.

1. Our Actual Price — This is our Target (dependent variable)

2. Our Features (independent variables)

Remember in blog 1 we scaled our data. Therefore, we will use our scaled data to feed our model. I should also mention, that in this blog series I am using StatsModel to create our model. However, another popular regression model is available from SK Learn.

# Import appropriate Statsmodel Libraries
import statsmodels.api as sm# Separate Our Variables
# Get our Target Variable – Actual Price Reached
df_Regresion_Target = df_Regression_No_Outliers_With_ScaledData['log_price']#Get our Features - Remove any columns not used to predict our target.
df_Regresion_Features = df_Regression_No_Outliers_With_ScaledData.drop(columns=['id', 'date', 'price','log_price',‘sqft_above',  'TotalAppraisalValue'])# Use Features to create y intercept
predictors_int = sm.add_constant(df_Regresion_Features)# Feed Features and intercept to Model Method for Model creation
model = sm.OLS(df_Regresion_Target,predictors_int).fit()#View model
model.summary()

WOW! Just like that our model is MADE! Now lets view our results.

Adj R Squared — See red circle — This is your models overall score. It is a number between 0–1. A score of .8 says that 80% of the variability in the predicted price can be explained by the model. .8 is a good score.

Durbin-Watson — See blue circle — Tests autocorrelation in residuals. Values from 0 to less than 2 indicate positive autocorrelation and values from 2 to 4 indicate negative autocorrelation. If we get a value less than 2.5 we are in good shape.

Coefficients — See maroon square — These are the values that predict your target, in our case, price. While the actual values shown are not overly important, the comparisons between the values are important. Because we scaled our continuous values using minmax we can compare the coefficient of TotalAppraisalValue_Sc = 1.27 to sqft_above_sc = .131. The fact that TotalAppraisalValue is 800% higher than sqft_above_sc indicates the heavy reliance the model places on TotalAppraisalValue for making predictions. The coefficient of TotalAppraisalValue being so much higher than the other features is powerful, yet it does leave our model very susceptible to bias from that feature. Said differently, if our data in the column TotalAppraisalValue for an individual row has an error for any reason (bad appraisal, recording error, etc.) our error in the prediction will be large. If you have a model with a more balanced set of coefficients, an error introduced from any particular feature will not have as big of an impact on the predicted value, as the numbers used to generate the prediction are spread amongst a larger number of features.

Pvalues — See black circle — Shows the statistical significance of each of our model coefficients. Values less than .05 suggests our coefficients are significant and should remain in the model.

Great! Now we have our model but where are the predictions?

Seeing Our Predictions and Reviewing our Errors

#Run Predictions using the model
df_predictedPrice = model.predict(predictors_int)#Create Dataframe From Predictions
f_predictions = pd.DataFrame(df_predictedPrice, columns=["log_price_Predicted"])
df_predictions["price_Predicted"] = np.exp(df_predictions["log_price_Predicted"])
df_predictions["log_Residuals"] = model.resid#Combine Predictions with Original datafrome
df_predictions = df_predictions.reset_index(drop=True)
df_Regression_No_Outliers_With_ScaledData = df_Regression_No_Outliers_With_ScaledData.reset_index(drop=True)
df_Regression_No_Outliers_With_ScaledData = pd.concat([df_Regression_No_Outliers_With_ScaledData, df_predictions], axis=1)
df_Regression_No_Outliers_With_ScaledData['price'] = df_Regression_No_Outliers_With_ScaledData['price'].astype(float)
df_Regression_No_Outliers_With_ScaledData["price_Residuals"] =  df_Regression_No_Outliers_With_ScaledData.apply(lambda x: x['price_Predicted'] - x['price'], axis=1)#Reorder columns, put prices next to each other
df_Regression_No_Outliers_With_ScaledData = df_Regression_No_Outliers_With_ScaledData[['id', 'date', 'price', 'price_Predicted', 'price_Residuals', 'log_price', 'log_price_Predicted','log_Residuals','sqft_above', 'Sch_d_Top15',
'Sch_d_Top30', 'Sch_d_Top60', 'TotalAppraisalValue', 'sqft_above_sc','TotalAppraisalValue_sc',]]#Display Results
df_Regression_No_Outliers_With_ScaledData.head()#Run Predictions using the model
df_predictedPrice = model.predict(predictors_int)#Create Dataframe From Predictions
df_predictions = pd.DataFrame(df_predictedPrice, columns=["log_price_Predicted"])
df_predictions["price_Predicted"] = df_predictions.apply(lambda x: np.exp(x['log_price_Predicted']), axis=1)
df_predictions["log_Residuals"] = model.resid#Combine Predictions with Original datafrome, create Residuals column by substracting Actual From Predicted
df_predictions = df_predictions.reset_index(drop=True)
df_Regression_No_Outliers_With_ScaledData = df_Regression_No_Outliers_With_ScaledData.reset_index(drop=True)
df_Regression_No_Outliers_With_ScaledData = pd.concat([df_Regression_No_Outliers_With_ScaledData, df_predictions], axis=1)
df_Regression_No_Outliers_With_ScaledData['price'] = df_Regression_No_Outliers_With_ScaledData['price'].astype(float)
df_Regression_No_Outliers_With_ScaledData["price_Residuals"] =  df_Regression_No_Outliers_With_ScaledData.apply(lambda x: x['price_Predicted'] - x['price'], axis=1)
df_Regression_No_Outliers_With_ScaledData["price_Residuals_abs"] =  abs(df_Regression_No_Outliers_With_ScaledData["price_Residuals"])#Reorder columns, put prices next to each other
df_Regression_No_Outliers_With_ScaledData = df_Regression_No_Outliers_With_ScaledData[['id', 'date', 'price', 'price_Predicted', 'price_Residuals',
'price_Residuals_abs','log_price','log_price_Predicted','log_Residuals','sqft_above', 'Sch_d_Top15','Sch_d_Top30', 'Sch_d_Top60', 'TotalAppraisalValue','sqft_above_sc','TotalAppraisalValue_sc',]]#Display Results
df_Regression_No_Outliers_With_ScaledData.head()

See the highlighted cells. These are the columns that show our predictions for each home. Something to note. Given we scaled our data, the model made predictions in the scaled values. Therefore, we need to “reverse” our log transformation to make real sense of the data. We do that by using np.exp() on our log transformed data. See the line of code above, which we used to create the column “price_Predicted”. Also, I added in a few additional columns that will be nice to have in viewing our residuals — “price_Residuals”, and “price_Residuals_abs”. The column “price_Residuals_abs” can be used to sort and then examine the rows with the largest errors.

Lastly, to get a sense of all the predicted prices against the actual price, we can create a scatter plot with actual price on the y axis and predicted price on the x axis. Ideally the points will fall in a fairly straight 45-degree pattern.

#Graph our residuals against our predictions, this will give us a sense if our model is off for certain priced homes
plt.scatter(df_predictions["price_Predicted"], df_Regression_No_Outliers_With_ScaledData["price"])
plt.xlabel("Predicted Price")]
plt.xticks(ticks=(200000, 400000, 700000, 900000),
labels= ('$200k', '$400k', '$700k', '$900k'))
plt.ylabel("Residual")
plt.yticks(ticks=(200000, 400000, 700000, 900000),
labels= ('$200k', '$400k', '$700k', '$900k'))
plt.plot([200000, 900000], [200000, 900000], color = 'red', linewidth = 2)
plt.show()

Overall, Not bad! However, as indicated by the red circle our model is struggling some with homes that sold for higher prices, as our predicted prices are larger than the actual prices. Under normal circumstances we could go back and review those particular rows and see if there is something odd in the data and or if we are potentially missing a feature that would help us better predict the prices of homes that sell for > $700k.

Now that we have our predictions and residuals, we can graph them and examine our errors against our predictions. Remember our errors are the difference between the actual prices and our prices predicted from our model. More on visualizing these errors in blog 3. Here is a quick look. Additionally, you can sort your abs_price_predicted to get a sense of those rows. See below:

#Graph our residuals against our predictions, this will give us a sense if our modle is off for certain priced homes
plt.scatter(df_predictions["price_Predicted"], df_Regression_No_Outliers_With_ScaledData["price_Residuals"])# plt.plot(Model1.predict(df_X_train), [0 for i in range(len(df_X_train))]);
plt.xlabel("Predicted Price")
plt.xticks(ticks=(300000, 500000, 700000, 900000, 1200000),labels= ('$300k', '$500k', '$700k', '$900k','$1.2M'))
plt.ylabel("Residual")
plt.yticks(ticks=(-300000, -100000, 0, 300000, 500000),
labels= ('$-300k', '$-100k', '$0k', '$300k','$500k'))
plt.show()

See highlighted residuals. Those shown are those with the largest gap between actual and predicted. These are the points we circled in red above. Wow, it shows our model is off by $410,000 in the worst case. More on that in blog 3.

Checking Remaining “Post-Model” Assumptions

Now that our model has been created, and assuming we are happy with the Adj R Score, there are a few more things we need to do prior to declaring victory. Before we can be comfortable using our model in a production setting, we need to check the remaining two assumptions associated with linear regression modeling: Homoscedasticity and Normality in the Error term. Both of these topics are covered in blog 3. For the sake of this blog, we will move past these assumptions and I will quickly show you how to reuse your model after initial creation.

2. Using Your Model Again

What if we want to make new predictions? Do we have to add that new house into our original data source as another row and start over from step 1? Thankfully, no. All you need to do is the following:

1. Get access to your original scaler object

2. Get access to your original modeler object

3. Feed your new data into the scaler and modeler object and then run model.predict()

For this example, I created a fictitious row derived by taking the average of each of the features from our original dataset. Essentially this could be considered “The Average Home” in our dataset.

# Create the average home used the average from each feature
d = {'a_price'
[df_Regression_No_Outliers_With_ScaledData['price'].mean()]}
df_The_Average_Home = pd.DataFrame(data=d)
df_The_Average_Home['a_log_price'] = df_Regression_No_Outliers_With_ScaledData['log_price'].mean()
df_The_Average_Home['a_sqft_above'] = df_Regression_No_Outliers_With_ScaledData['sqft_above'].mean()
df_The_Average_Home['a_Sch_d_Top15'] = 0
df_The_Average_Home['a_Sch_d_Top30'] = 0
df_The_Average_Home['a_Sch_d_Top60'] = 1
df_The_Average_Home['a_TotalAppraisalValue'] = df_Regression_No_Outliers_With_ScaledData['TotalAppraisalValue'].mean()
df_The_Average_Home['a_sqft_above_sc'] = df_Regression_No_Outliers_With_ScaledData['sqft_above_sc'].mean()
df_The_Average_Home['a_TotalAppraisalValue_sc'] = df_Regression_No_Outliers_With_ScaledData['TotalAppraisalValue_sc'].mean()# Display the average home
df_The_Average_Home.head()

Now, we just need to feed our unscaled data into our scaler and then feed that output, along with our categorical features, into our previously created model. A note of caution! You need to make sure you are feeding your scaler the new data in the exact same order in which you fed the scaler in your original scaling. This also needs to hold true when using our model for making new predictions with our modeler object.

# Create dataframe to feed into scaler, make sure not to feed scaler your categorical data
df_New_Data_To_Be_Scaled = df_The_Average_Home[["a_sqft_above","a_TotalAppraisalValue"]]# Feed data into scales using the .transform() method vs. .fit_transform()
Newly_scaled_Data = scaler.transform(df_New_Data_To_Be_Scaled)# Create dataframe from newly scaled values
df_Newly_scaled_Data = pd.DataFrame(Newly_scaled_Data, columns=["sqft_above_sc","TotalAppraisalValue_sc"])
df_Newly_scaled_Data = df_Newly_scaled_Data.reset_index(drop=True)# Get caegorical values to feed into new prediction
df_Average_Categoricals = df_The_Average_Home[['a_Sch_d_Top15', 'a_Sch_d_Top30','a_Sch_d_Top60']]#Bring two sets of data together prior to feeding dataframe
df_Average_Categoricals = df_Average_Categoricals.reset_index(drop=True)
df_New_Row_For_Prediction = pd.concat([df_Newly_scaled_Data, df_Average_Categoricals], axis=1)#add contanct in position 1
df_New_Row_For_Prediction.insert (0, "constant", 1)# Use Features to create intercept
new_predictors_int = sm.add_constant(df_New_Row_For_Prediction)
Newpredicted_value_log = model.predict(df_New_Row_For_Prediction)
Newpredicted_value = np.exp(Newvalue_log)
Newpredicted_value

See highlighted. This is your newly predicted value. You can use the exact same process for any new row of data. Simply get the unscaled values, scale them, and feed both the continuous and categorical values into your model. You can determine the impact changing a feature has on the predicted value simply by noting the predicted value prior to making a change and then comparing that to the prediction after making the change. This would be the impact that changing a feature would have on the value of a home.

Conclusion & Preview of Blog 3

Above I covered the steps to create a model using Statsmodel. I also demonstrated how you can reuse your model with new data. In blog 1 I demonstrated some of the important processing steps required prior to model creation. In blog 3 I will finish my introduction to linear regression modeling by reviewing how to check for the remaining two assumptions. I look forward to seeing you soon!

Next Stop — DATA ALCHEMY!

Using The Past To Predict The Future

Introduction

1. Creating Our Model

2. Using Your Model Again

Conclusion & Preview of Blog 3

Written by Russell Pihlstrom