Multiple Linear Regression and Housing Prices
In this article, I’ll walk through the process I took to build a multiple linear regression model to predict house prices from various house and surrounding area features.
More specifically, I outline the major steps I took to build an Ordinary Least Squares (OLS) model to predict the prices of single-family homes in King County, WA. I also walk through the steps I took to satisfy the assumptions of OLS linear regression.
I was seeking answers to the following questions:
- What house features have the greatest effect on price?
- How much do each of these features influence house price?
- Are there common house features that are assumed a priori to affect house prices but don’t in this case?
- How should the results be interpreted in the context of building a house to optimize sale price?
Let’s jump in!
A link to the full notebook can be found here.
Data
The dataset I used here contains information about houses in King County, WA (Source). The variables and descriptions (Source) are listed here:
id — Unique ID for each home sold
date — Date of the home sale
price — Price of each home sold
bedrooms — Number of bedrooms
bathrooms — Number of bathrooms, where .5 accounts for a room with a toilet but no shower
sqft_living — Square footage of the house interior living space
sqft_lot — Square footage of the land space
floors — Number of floors
waterfront — A dummy variable for whether the apartment was overlooking the waterfront or not
view — An index from 0 to 4 of how good the view of the property was
condition — An index from 1 to 5 on the condition of the apartment
grade — An index from 1 to 13, where 1–3 falls short of building construction and design, 7 has an average level of construction and design, and 11–13 have a high-quality level of construction and design
sqft_above — The square footage of the interior housing space that is above ground level
sqft_basement — The square footage of the interior housing space that is below ground level
yr_built — The year the house was initially built
yr_renovated — The year of the house’s last renovation
zipcode — The zipcode area the house is in
lat — Latitude
long — Longitude
sqft_living15 — The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 — The square footage of the land lots of the nearest 15 neighbors
My journey into the world of King County housing prices began with exploring the data in Python using pandas.
Here, I was interested in understanding which of the various features might have a relationship with the price of the house.
A quick look at a scatterplot of all variables plotted against price gave me a first look into which features might be worth exploring further. Interestingly, the size of the lot (sqft_lot) and the number of bedrooms (bedrooms) did not seem to be correlated with price much at all.
import matplotlib.pyplot as plt
import seaborn as sns

# plot every feature (except id) against price in a 4x5 grid of scatterplots
fig, axes = plt.subplots(nrows=4, ncols=5, figsize=(20, 14))
features = df.drop('id', axis=1).columns
index = 0
for row in range(axes.shape[0]):
    for col in range(axes.shape[1]):
        try:
            g = sns.scatterplot(x=features[index], y='price', data=df,
                                ax=axes[row, col], alpha=0.4)
            g.set_xticks([])
            g.set_xticklabels([])
            index += 1
        except IndexError:
            # fewer features than subplots; leave the remaining axes empty
            break
Initial Glance
The average house in King County, WA sits on a 14,966-square-foot lot (0.34 acres), has 2,064 square feet of living space, 2.5 bathrooms, and 3 bedrooms, and costs $534,210.
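These figures come from the summary statistics; as a minimal sketch (assuming the raw data is loaded into a pandas DataFrame named df), something like the following produces them:
# summary statistics for the columns of interest
df[['price', 'sqft_lot', 'sqft_living', 'bathrooms', 'bedrooms']].describe()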
Null Values
In order to further analyze some of the data effectively, it had to be cleansed of outliers and null values.
Using the pandas isnull() function, I discovered that several of the variables contained null values:
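A quick check along those lines (again assuming the DataFrame is named df):
# count missing values in each column
df.isnull().sum()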
Executive decisions:
- drop yr_renovated: too many null values
- fill in waterfront with a 0: 0 indicated there was no waterfront view
- fill in view with a 0: 0 was by far the most commonly occurring value
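A minimal sketch of how these decisions might be applied in pandas (the column names come from the dataset; the DataFrame name df is an assumption):
# drop yr_renovated entirely, fill the remaining nulls with 0
df = df.drop('yr_renovated', axis=1)
df['waterfront'] = df['waterfront'].fillna(0)
df['view'] = df['view'].fillna(0)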
Outliers
Outliers are a source of inaccuracy and frustration in data analysis, and so I set out to locate and remove these before moving any further. This was particularly important because I was only interested in single-family homes.
I visualized a few of the predictors using boxplots to get a sense for any outliers.
Seeing some of the outliers here, and having never set foot in a single-family home with 30+ bedrooms, I decided to remove some rows.
I decided to remove houses with numbers of bedrooms and bathrooms more than 3 standard deviations above the mean. In this case, that meant keeping houses with fewer than 7 bedrooms and fewer than 5 bathrooms.
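A sketch of that filter, assuming the 3-standard-deviation cutoffs work out to the thresholds above:
# keep houses below the ~3-standard-deviation cutoffs for bedrooms and bathrooms
df = df[(df['bedrooms'] < 7) & (df['bathrooms'] < 5)]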
Discrete Variables
In addition to the juicy continuous data, I was interested in exploring several discrete variables further: view, condition, grade, latitude/longitude
In order to prepare categorical variables for an MLR model, they needed to be one-hot encoded. One-hot encoding creates a binary dummy variable for each category, so no artificial ordering is imposed and each category enters the model on equal footing.
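As an illustrative sketch (not necessarily the exact call used in the notebook), pandas can do this with get_dummies, dropping one level to serve as the reference category; condition is used here as an example column:
import pandas as pd

# one-hot encode a categorical column, dropping the first level as the reference category
df = pd.get_dummies(df, columns=['condition'], drop_first=True)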
Although I was interested in exploring the grade variable, I noticed there were 11 unique values, and it seemed like there might be a simpler way to analyze the data. Looking at the description of this feature and seeing the count of values for each of the 11 categories, it was clear ‘7’ was considered average and the most commonly occurring value, with a fair amount of data collectively above or below average as well. Given that this was an ordinal variable, I decided to encode this variable into three categories: above average, average and below average:
# 0 = below average (< 7), 1 = average (== 7), 2 = above average (> 7)
df['grade_encoded'] = df.grade.apply(lambda x: 0 if x < 7 else (1 if x == 7 else 2))
Last but not least, I was interested in exploring the latitude and longitude variables to see if there were any geographical relationships with price. Given the sheer number of unique coordinates, it made even less sense to one-hot encode each latitude and longitude. In visualizing the data, it seemed that the houses could be split into different regions.
import folium

# split the map into north and south of latitude 47.5 for a sample of houses
m = folium.Map(width=550, height=350,
               location=[df.lat.median(), df.long.median()], zoom_start=9)
for lat, long in zip(df.head(150).lat, df.head(150).long):
    loc = [lat, long]
    if lat >= 47.5:
        folium.CircleMarker(location=loc, color='red').add_to(m)
    else:
        # latitude less than 47.5
        folium.CircleMarker(location=loc, color='blue').add_to(m)
So rather than use the latitude and longitude data itself, I grouped the data into 4 regions: north, south, east and west. I kept the ‘west’ column as a reference category.
# region dummies based on the median latitude/longitude; 'west' (longitude below the median) is the implied reference category
df['north'] = df.lat.apply(lambda x: 1 if x >= df.lat.median() else 0)
df['south'] = df.lat.apply(lambda x: 1 if x < df.lat.median() else 0)
df['east'] = df.long.apply(lambda x: 1 if x >= df.long.median() else 0)
Correlation and Multicollinearity
Finally, I was ready to see which variables I might want to use in my model. I started with a heatmap:
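The heatmap itself isn't reproduced in code above; a plausible sketch with seaborn (the styling choices here are assumptions) would be:
import matplotlib.pyplot as plt
import seaborn as sns

# correlation heatmap across all numeric features
plt.figure(figsize=(14, 10))
sns.heatmap(df.corr(), cmap='coolwarm')
plt.show()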
And then to break things down further, I checked for variables most correlated with price, and with one another. One of the main assumptions in linear regression is that there is no significant multicollinearity between predictors in the model. Multicollinearity occurs when two or more predictors in a regression model are strongly correlated with one another. When predictors carry overlapping information, the individual coefficient estimates become unstable and hard to interpret, so multicollinearity should be avoided.
The code below gave me pairs of variables correlated with price (Figure 1) as well as variables correlated with one another (Figure 2). Variables in Figure 2 with correlation coefficients above .75 were considered multicollinear.
# check for multicollinearity
corr_df=df.iloc[:,1:].corr().abs().stack().reset_index().sort_values(0, ascending=False)
# zip the variable name columns (Which were only named level_0 and level_1 by default) in a new column named "pairs"
corr_df['pairs'] = list(zip(corr_df.level_0, corr_df.level_1))
corr_df.set_index(['pairs'], inplace=True)
corr_df.drop(columns=['level_1', 'level_0'], inplace = True)
corr_df.columns = ['cc']
corr_df.drop_duplicates(inplace=True)
corr_df = corr_df[corr_df.cc < 1]
multicollinear = corr_df[(corr_df.cc >= .75) & (corr_df.cc < 1)]

# check for correlations with price
corr_df = corr_df.reset_index()
price_correlations = corr_df[corr_df.pairs.apply(lambda x: True if 'price' in x else False)]
price_correlations = price_correlations.sort_values(by='cc',ascending=False)
price_correlations.pairs = price_correlations.pairs.apply(lambda x: list(filter(lambda i: i != 'price', list(x)))[0])
price_correlations.head(10)
Side by side, you can see variables correlated with price (Fig. 1), and with one another (Fig. 2). Although sqft_living15 and sqft_above were highly correlated with price, they were also highly correlated with sqft_living and so they were not included in the model. The final variables of choice: sqft_living, bathrooms, grade_encoded and north.
Feature Engineering
Before building the regression model, I decided to preprocess some of the data. Bringing skewed variables closer to a normal distribution tends to improve the linear fit, which shows up as a higher R-squared value.
Log Scaling
In plotting the continuous variables in my model, sqft_living and price, I noticed they were both right-skewed.
Nothing a log transformation couldn’t amend! Before and after the log scale:
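The transformation itself is a one-liner per column with numpy; a minimal sketch (the _log column names are my own placeholders):
import numpy as np

# log-transform the right-skewed variables
df['price_log'] = np.log(df['price'])
df['sqft_living_log'] = np.log(df['sqft_living'])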
Implementation
Now that my features were picked out, scaled and transformed, I was ready to fit my model and test whether I could reject my null hypothesis:
Null Hypothesis: All predictors (sqft_living, grade_encoded, bathrooms, north) have no effect on price.
To do both of these things at once, I used Python's statsmodels package, which provides a detailed summary of the regression results:
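The fitting code isn't reproduced above, but a sketch with statsmodels' formula API might look like this (the log-transformed column names follow the placeholders used earlier and are assumptions):
import statsmodels.formula.api as smf

# fit OLS of log price on the selected predictors and print the full summary
model = smf.ols('price_log ~ sqft_living_log + bathrooms + grade_encoded + north', data=df).fit()
print(model.summary())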
Overall, I was interested here in the Adjusted R-Squared and the F-Statistic, both seen on the top right of the regression results displayed above. My adjusted R-Squared of .63 means that roughly 63% of the variance in price is explained by the predictors in my model.
An F-Statistic of 9280 is sufficiently large to reject my null hypothesis and conclude that these predictors do have a significant effect on price.
Interpretation of coefficients:
- A 1% increase in sqft_living is associated with a 0.63% increase in price
- Going up a grade level is associated with a 12% increase in price
- Each additional bathroom is associated with a 2.5% increase in price on average
- Houses in the northern region are 41% more expensive on average
*Note that the interpretation of coefficients is slightly different when dependent and/or independent variables are log-transformed. For a more detailed explanation, please see here.
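As a quick sketch of how such percentages follow from a log-price model (the coefficient value below is illustrative, chosen to match the 41% figure above, not read from the actual output):
import numpy as np

# log-log term: the sqft_living coefficient is read directly as an elasticity (~0.63)
# log-level term: convert a dummy/level coefficient into a percent change in price
beta_north = 0.344                            # illustrative coefficient for the 'north' dummy
pct_change = (np.exp(beta_north) - 1) * 100
print(f"Northern houses: ~{pct_change:.0f}% higher price")   # ~41%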
Model Evaluation
Finally, it was time to test the model to ensure that the final assumptions for linear regression were met and that the model would perform well on other data.
Assumptions of Linear Regression
For an OLS linear regression model, there are four main assumptions that need to be checked:
- Linearity
- No Multicollinearity between predictors
- Normality of the error terms/residuals
- Homoscedasticity
The first two assumptions (linearity, no multicollinearity) can be checked before a linear regression model is fitted, whereas the final two assumptions (homoscedasticity and residual normality) are assessed afterward.
Linear regression assumes that the residuals of the model are normally distributed. To first test this visually, I created a Q-Q plot. A plot whose points roughly follow a 45-degree line indicates a model that meets this assumption:
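A sketch of that plot with statsmodels, assuming the fitted model object from earlier is named model:
import statsmodels.api as sm

# Q-Q plot of the residuals against a 45-degree reference line
sm.qqplot(model.resid, line='45', fit=True)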
Although the residuals somewhat followed the line, they veered off towards the top end of the plot, which indicates that some of the predictors may be non-normal and/or that outliers may exist in the data. To analyze further, I conducted an Anderson-Darling test for normality by importing the necessary function from statsmodels.
from statsmodels.stats.diagnostic import normal_ad
# perform the test on the residuals (assumed here to be model.resid from the fitted OLS model)
residuals = model.resid
p_value = normal_ad(residuals)[1]
The p-value was 0.0, which falls below the conventional 0.05 threshold, so the null hypothesis that the error terms are normally distributed is rejected; the residuals are not perfectly normal, consistent with the deviation seen at the top end of the Q-Q plot.
Homoscedasticity means that the error terms have constant variance across all levels of the fitted values. To test this, I plotted the fitted values against the residuals on a scatter plot. Models that meet this assumption produce a plot whose error terms are spread symmetrically around a horizontal line at y=0. As seen here, the model seems to meet this assumption. The Durbin-Watson statistic of 1.99 in the model summary also indicates that there is no meaningful autocorrelation in the residuals.
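A sketch of that diagnostic plot, again assuming the fitted statsmodels object is named model:
import matplotlib.pyplot as plt

# residuals vs. fitted values; a roughly even band around y=0 suggests homoscedasticity
plt.scatter(model.fittedvalues, model.resid, alpha=0.4)
plt.axhline(y=0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()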
Cross-Validation
To test the model to see how well it might perform when introducing other data, I was interested in its mean squared error (MSE).
To test this, I used k-fold cross-validation. This involves splitting the data into k equal-sized folds and then, for each fold, training the model on the remaining k-1 folds and testing it on the held-out fold. To do this, I took advantage of sklearn's cross_val_score function.
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score

# score the fitted linear regression (linreg) across 5 folds using MSE
mse = make_scorer(mean_squared_error)
cv_5_results = cross_val_score(linreg, X, y, cv=5, scoring=mse)
cv_5_results
Looking at the results, the MSEs for the individual folds are close to one another, so I concluded that the model should generalize well to other data.
array([0.1003476 , 0.10183708, 0.09580066, 0.1015491 , 0.08699936])
Conclusions
If you are buying a house in King County, the size of the house will be the most significant predictor of price. The number of bathrooms, the geographic area in the county where the house is located and the quality of the construction were also important factors.
A notable limitation with this model is the fact that the data used to produce it is specific to one county in a major metropolitan area. It may not generalize to other cities and counties.
For next steps, one could look to assess how well this model generalizes to other locations in the US. It would also be interesting to use the same data to predict outcomes other than price. Finally, exploring interactions and polynomial relationships could unlock further insights!
For a developer in King County, if you are looking to maximize sale price, build big, don't discount the importance of quality construction, and remember that location matters. As a buyer on a budget, you may want to look in the southern part of the county.