As the leading online real estate marketplace in the United States, Zillow must ensure the accuracy of the housing market predictions it offers as the “Zestimate,” which is essential to its services for buying, renting, selling, and financing homes. The company informed us that there was demand for improving its existing housing predictions, which it told us do not factor in enough local intelligence, and tasked us with creating a better predictive model of home prices at the local scale of the city of Philadelphia. This new model is intended to serve as a useful tool for both homeowners and buyers to obtain accurate house price information, helping sellers avoid underselling and buyers avoid overpaying, and ultimately securing Zillow’s credibility and competitive advantage. To increase the accuracy of housing market predictions in Philadelphia, the model therefore incorporates local characteristics when predicting house prices in the city.
Ordinary Least Squares (OLS) regression, a statistical approach for estimating the relationship between variables and measuring how one variable changes in response to others, will be used to structure this new housing prediction model. Using the OLS regression model, we will examine how each of the local characteristics (independent variables) influences local house prices (the dependent variable), determine how strong the correlations between the variables are, and assess how accurate our model is.
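Concretely, and as a sketch of the specification estimated later in this report, the OLS model takes the standard linear form, with the coefficients chosen to minimize the sum of squared residuals:

```latex
\text{Price}_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_k x_{i,k} + \varepsilon_i,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( \text{Price}_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{i,j} \Bigr)^2
```

Here each $x_{i,j}$ is one of the local characteristics described below for property $i$, and $\varepsilon_i$ is the error term capturing everything the predictors do not explain.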
However, there are still important considerations and difficulties we must address throughout the creation of the model. These include finding the most relevant data from reliable sources, trimming the different datasets, and creating visualizations that are both accurate and easy to read. The limitations of the OLS model and potential errors in the datasets could also affect the accuracy of our results. Since we are predicting house prices in the real world, where nothing is fully predictable, we must keep in mind that unforeseen and uncontrollable factors can impact our results.
To evaluate how different variables will impact house prices in Philadelphia, we used the data from the U.S. Census Bureau and OpenPhillyData, as well as some data from the original dataset we were provided with to construct our model. The collected data was then grouped into three different categories that people might consider when purchasing a house: the condition of the house, the demographics of the neighborhood, and other environmental factors.
For the “House Condition” category, we included the age of the house, the quality grade of its interior condition, and the size of both the indoor and outdoor areas of the property. For the “Demographic Characteristics” category, we included the median household income of the past 12 months, the percentage of foreign-born residents, and the geographic mobility in the neighborhood where the house is located. For the “Environmental Impact” category, we included the distance from the house to the nearest recreational area/park, commercial area, and school. We also included the number of fatal shooting victims within a 1,000-foot buffer of each house.
# Load Libraries
library(geos)
library(rsgeo)
library(tidyverse)
library(tidycensus)
library(sf)
library(kableExtra)
library(knitr)
library(dplyr)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(stargazer)
options(scipen=999)
options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
Since our project is heavily focused on geographic features, we retrieved the block group and neighborhood boundaries from OpenDataPhilly, and we filtered the variables we need from the raw housing price data.
To better assess the environmental impact on housing prices, we employed two different methods. For the shooting data, we considered the number of victims who suffered fatal gunshot wounds within a 1,000-foot radius of each property. Fatal incidents included gunshots to critical areas such as the head, neck, and chest. Additionally, since the location and timing of crimes are unpredictable, it is essential to use an appropriate buffer to evaluate the perceived safety of a neighborhood. Even a fatal crime occurring at a relatively greater distance can instill fear and influence housing values. Therefore, we used a buffer to aggregate crime incidents rather than relying on the average nearest neighbor distance.
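The buffer aggregation described above can be sketched with sf as follows. This is an illustrative sketch, assuming the fatal-shooting point layer `PhillyShootingFatalInjury.sf` used later in this report and a projected coordinate reference system measured in feet:

```r
# Draw a 1,000-foot buffer around each property, then count the fatal
# shooting victims whose points fall inside each buffer
property_buffers <- sf::st_buffer(Philly_price_clean, dist = 1000)
Philly_price_clean$shooting.Buffer <-
  lengths(sf::st_intersects(property_buffers, PhillyShootingFatalInjury.sf))
```

`st_intersects` returns, for each buffer, the indices of the shooting points it contains, so `lengths()` yields the per-property count.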
The second approach focuses on the proximity to schools, commercial corridors, and public parks and recreation centers. In this case, we utilized Euclidean distance, as the direct distance to these public resources is a key factor in determining the convenience of living in a particular location. This method captures the accessibility of amenities, with the variables for further analysis being the distances from each property to the nearest public facility.
nearest_school <- sf::st_nearest_feature(Philly_price_clean, schools.sf)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(schools.sf)
Philly_price_clean$dist_to_school <- rsgeo::distance_euclidean_pairwise(x, y[nearest_school])
nearest_Commercial <- sf::st_nearest_feature(Philly_price_clean, Commercial_Corridors)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(Commercial_Corridors)
Philly_price_clean$dist_to_commerce <- rsgeo::distance_euclidean_pairwise(x, y[nearest_Commercial])
nearest_PPR <- sf::st_nearest_feature(Philly_price_clean, PPR_Sites)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(PPR_Sites)
Philly_price_clean$dist_to_PPR <- rsgeo::distance_euclidean_pairwise(x, y[nearest_PPR])
For the census variables, we linked the data from each block group to the properties located within that group. Additionally, we assigned a neighborhood designation to each house for future analysis and categorized the quality condition into four levels: 4 for good condition, 3 for fair condition, 2 for poor condition, and 1 for others.
Philly_Housing_joined <-
st_join(Philly_price_clean, blockgroup21, join = st_within)
Philly_Housing_joined <-
st_join(Philly_Housing_joined, neighborhood, join = st_within)
Table 2.1 gives a comprehensive summary of the independent variables that might affect housing prices. The total sample size is 23,781. Here is a description of the variables:
House condition:
- total_area: the total area the owner holds, including the living area and the outdoor area of the property
- houseAge: the age of the house since it was first built
- interior_condition: the quality grade of the interior condition of the house

Demographic characteristics:
- MedHHInc: the median household income in the past 12 months (in 2021 inflation-adjusted dollars)
- pctForeign: the percentage of residents who were born outside of the United States
- pctMobility: the percentage of residents who lived in a different house in the United States 1 year ago

Environmental impact:
- shooting.Buffer: the number of victims who suffered fatal gunshot wounds within a 1,000-foot buffer of each house
- dist_to_school: the distance from a house to the nearest school
- dist_to_commerce: the distance from a house to the nearest commercial corridor
- dist_to_PPR: the distance from a house to the nearest public park and recreation center
The following summary statistics table describes each independent variable, with a total of 23,781 samples. The table reports the mean, standard deviation, minimum, and maximum values of each selected variable. We observed relatively high standard deviations for some of the variables, which indicates that some of our results might be affected by potential outliers in the dataset.
# Impute missing values in numeric columns with the column mean
numeric_columns <- sapply(Philly_Housing_joined, is.numeric)
for (col in names(Philly_Housing_joined)[numeric_columns]) {
col_mean <- mean(Philly_Housing_joined[[col]], na.rm = TRUE)
Philly_Housing_joined[[col]][is.na(Philly_Housing_joined[[col]])] <- col_mean
}
house_condition <- st_drop_geometry(Philly_Housing_joined) %>%
select(total_area, houseAge, interior_condition) %>%
gather(variable, value) %>%
group_by(variable) %>%
summarize(
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
Min = min(value, na.rm = TRUE),
Max = max(value, na.rm = TRUE)
)
demo <- st_drop_geometry(Philly_Housing_joined) %>%
select(MedHHInc, pctForeign, pctMobility) %>%
gather(variable, value) %>%
group_by(variable) %>%
summarize(
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
Min = min(value, na.rm = TRUE),
Max = max(value, na.rm = TRUE)
)
envi_impact <- st_drop_geometry(Philly_Housing_joined) %>%
select(shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR) %>%
gather(variable, value) %>%
group_by(variable) %>%
summarize(
mean = mean(value, na.rm = TRUE),
sd = sd(value, na.rm = TRUE),
Min = min(value, na.rm = TRUE),
Max = max(value, na.rm = TRUE)
)
summary_stats <- rbind(house_condition, demo, envi_impact)
summary_table <- knitr::kable(summary_stats, "html", digits = 1, caption = "Table of summary statistics") %>%
kable_styling(full_width = F)
grouped_table <- group_rows(summary_table, "House Condition", 1, 3)
grouped_table <- group_rows(grouped_table, "Demographic Characteristics", 4, 6)
grouped_table <- group_rows(grouped_table, "Environmental Impact", 7, 10)
grouped_table
| variable | mean | sd | Min | Max |
|---|---|---|---|---|
| House Condition | | | | |
| houseAge | 85.4 | 29.7 | 0.0 | 273.0 |
| interior_condition | 3.6 | 0.9 | 0.0 | 7.0 |
| total_area | 1828.9 | 3813.0 | 0.0 | 226512.0 |
| Demographic Characteristics | | | | |
| MedHHInc | 65581.2 | 31777.9 | 6302.0 | 250001.0 |
| pctForeign | 0.1 | 0.1 | 0.0 | 0.8 |
| pctMobility | 0.1 | 0.1 | 0.0 | 0.8 |
| Environmental Impact | | | | |
| dist_to_PPR | 1796.5 | 1032.0 | 79.9 | 10317.7 |
| dist_to_commerce | 670.0 | 638.3 | 0.0 | 7385.7 |
| dist_to_school | 1147.7 | 647.6 | 32.4 | 5900.8 |
| shooting.Buffer | 5.2 | 4.7 | 1.0 | 58.0 |
To better visualize the results, we created a correlation matrix graph representing the pairwise correlation between the variables on the x-axis and those on the y-axis. The relationship between two variables is represented by a correlation coefficient on a scale of -1.0 to 1.0. A coefficient of 0 indicates no correlation; the further the coefficient is from 0, the stronger the relationship between the variables. A positive number indicates a positive correlation between the two variables, while a negative number indicates a negative one. Based on this matrix, most of the variables are at least somewhat related to one another, except for a few blank squares where the correlation is not statistically significant. The number of positive correlations, shown in blue, is about the same as the number of negative correlations, shown in red. We observed the highest positive correlations among the distances to social amenities such as schools, parks, and commercial corridors, as well as between the age of a house and its interior condition. On the other hand, there are strong negative correlations between the interior condition of a house and both median household income and geographic mobility, suggesting inverse relationships.
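The coefficient plotted in each cell is the Pearson correlation, which for two variables $x$ and $y$ with means $\bar{x}$ and $\bar{y}$ over $n$ observations is:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```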
numericVars <- st_drop_geometry(Philly_Housing_joined)[, c("total_area", "houseAge", "interior_condition", "MedHHInc", "pctForeign", "pctMobility", "shooting.Buffer", "dist_to_school", "dist_to_commerce", "dist_to_PPR")]
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("#b2182b", "white", "#2166ac"), # Darker shades of red and blue
type = "lower",
insig = "blank",
lab = TRUE, # Adding labels inside the cells for better visibility
lab_col = "black" # Color of the labels for clear contrast
) +
labs(
title = "Correlation matrix across variables",
caption = "Figure 2.1",
x = NULL, # Remove x-axis label
y = NULL # Remove y-axis label
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
This scatterplot highlights the relationship between housing prices and median household income in the city of Philadelphia. As shown by the red line of best fit, which indicates the correlation between the two variables on the x and y axes, house prices and median household income are positively related, with a slightly increasing slope. As median household income increases, home prices also rise in response to the greater purchasing power of households. Most of the points in this scatterplot are clustered in the lower left of the graph, which suggests that people with lower household incomes are more likely to live in houses that are relatively low in price. On the right side of the plot, the points are more scattered, since wealthier residents have more housing options to choose from. There are also a few outliers with housing prices of around 6,000,000 dollars, much higher than the rest of the values in this dataset.
ggplot(Philly_Housing_joined, aes(x = MedHHInc, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
title = "Relationship Between Home Price and Median Household Income in Philadelphia",
x = "Median Household Income",
y = "Home Price",
caption = "Figure 2.2"
) +
theme_minimal()
This scatterplot highlights the relationship between housing prices and house age in the city of Philadelphia. As shown by the red line of best fit, house prices and house age are inversely related, with a slightly decreasing slope. The scatterplot divides horizontally at around 125 years on the x-axis: points are much more clustered on the left-hand side of the plot, while those on the right are far more scattered. This is because most houses in the city of Philadelphia are younger than 125 years, and even fewer of the older houses are available on the market. This suggests we should reconsider whether house age is a good predictor of price, since the relatively small number of old houses is affecting the slope of the line of best fit. However, even looking at the left side of the plot alone, there is still an obvious decline in house prices as house age increases, which makes sense, since most people would not go out of their way to buy an old house with potential structural and other problems.
ggplot(Philly_Housing_joined, aes(x = houseAge, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
title = "Relationship Between House Age and Home Price in Philadelphia ",
x = "House Age",
y = "Home Price",
caption = "Figure 2.3"
) +
theme_minimal()
This scatterplot highlights the relationship between the price of a property and its amount of living space in the city of Philadelphia. As shown by the red line of best fit, price and living space are positively related, with an increasing slope. However, most of the values in this dataset lie in the cluster of points in the bottom left corner of the graph, and very few points are found elsewhere. This means that this particular scatterplot might not be the most effective representation of the dataset, since the line of best fit is heavily influenced by outliers. We also observed great variation in the prices of houses with smaller total living space. A reasonable explanation is that these smaller houses are located in different neighborhoods of the city, where house prices are determined by other factors even though the houses are similar in size. Still, looking only at the bottom left corner of the graph where the points are clustered, it is reasonable to conclude that, in general, houses with larger living space tend to be more expensive, based on the increasing slope of the line of best fit.
ggplot(Philly_Housing_joined, aes(x = total_area, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
title = "Relationship Between Total Area and Home Price",
x = "Total Area",
y = "Home Price",
caption = "Figure 2.4"
) +
theme_minimal()
This scatterplot highlights the relationship between the price of a house and its distance to the nearest recreation area in the city of Philadelphia. As shown by the red line of best fit, the price of a house and its distance from a recreation site are positively related, with a slightly increasing slope. Most points on the plot are clustered along the line of best fit, meaning the correlation is less affected by outliers than in some of the scatterplots shown above. It is hard to conclude that the further a house is from a recreational site, the higher its price, since the slope of the line is so slight. There are also many more houses close to a recreational site than far from one. Since recreational sites such as parks in Philadelphia are located inside the city, our assumption is that households living further from the central area are more likely to be wealthier and own cars, meaning they can also afford more expensive housing.
ggplot(Philly_Housing_joined, aes(x = dist_to_PPR, y = sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
title = "Price as a function of distance to Parks and Recreation",
x = "Distance to Parks and Recreation",
y = "Home Price",
caption = "Figure 2.5"
) +
theme_minimal()
This choropleth map shows the distribution of mean house sale prices by block group in the city of Philadelphia. A darker color on the map is associated with a higher sale price, while a lighter color corresponds to a lower price. According to the map, most areas of Philadelphia have relatively low sale prices, except for the northwest and the lower middle parts of the city. The northwestern districts look much darker than most of the rest of the city, meaning there may be luxury houses located in those areas that can be potential outliers in our dataset. The relatively darker lower middle districts make up the Philadelphia downtown area, where housing is likely to be more expensive given the resources centralized in the city.
Philly_price_Model <- Philly_Housing_joined %>%
filter(toPredict == "MODELLING")
mean_sale_price <- Philly_price_Model %>%
group_by(GEOID) %>%
summarize(MeanSalePrice = mean(sale_price))
blockgroup21 <- dplyr::left_join(blockgroup21, st_drop_geometry(mean_sale_price), by = "GEOID") # attribute join on GEOID (st_join ignores `by` and joins spatially)
blockgroup21$MeanSalePrice[is.na(blockgroup21$MeanSalePrice)] <- 0
ggplot() +
geom_sf(data = blockgroup21, aes(fill = MeanSalePrice)) +
scale_fill_gradient(low = "white", high = "darkblue",
name = "Sale Price (U.S. Dollars)") +
labs(title = "Choropleth Map of Housing Sale Price by Block Group",
caption = "Figure 2.6") +
theme_minimal()
This choropleth map shows the distribution of the geographic mobility rate by block group in the city of Philadelphia. A darker color on the map is associated with a higher geographic mobility rate, while a lighter color corresponds to a lower rate. The geographic mobility rate of Philadelphia residents appears scattered, with no obvious clustering except for some blocks near the downtown area in the lower middle part of the city. The block in the west where Fairmount Park is located is much darker than the rest of the city because of how small its population is, since the park takes up most of the space in that area. For that reason, the data represented by that block is not as significant for the analysis of geographic mobility in this particular case.
ggplot() +
geom_sf(data = blockgroup21, aes(fill = pctMobility)) +
scale_fill_gradient(low = "white", high = "darkblue", name = "Geographic Mobility Rate") +
labs(title = "Choropleth Map of Geographic Mobility by Block Groups",
caption = "Figure 2.7") +
theme_minimal()
This choropleth map shows the average distance from a house to its nearest school, by block group, in the city of Philadelphia. A darker color is associated with a greater average distance, while a lighter color corresponds to a shorter one. The average distance to school is fairly evenly spread across the city, with a few blocks around the edge of the city's boundaries showing greater distances than those near the center.
mean_school_dis <- Philly_Housing_joined %>%
group_by(GEOID) %>%
summarize(MeanSchoolDis = mean(dist_to_school))
blockgroup21 <- dplyr::left_join(blockgroup21, st_drop_geometry(mean_school_dis), by = "GEOID") # attribute join on GEOID rather than a spatial join
blockgroup21$MeanSchoolDis[is.na(blockgroup21$MeanSchoolDis)] <- 0
ggplot() +
geom_sf(data = blockgroup21, aes(fill = MeanSchoolDis)) +
scale_fill_gradient(low = "white", high = "darkblue", name = "Average Distance") +
labs(title = "Choropleth Map of Distance to School by Block Groups",
caption = "Figure 2.8") +
theme_minimal()
This heat map shows the density of fatal gunshot injuries across the city of Philadelphia. A darker color indicates a higher density of fatal gunshot incidents in an area, while a lighter color means there are few or no incidents. The areas with the highest density are located in the middle and the southwestern parts of the city, indicating that those areas have higher exposure to shooting incidents and can be more dangerous to live in.
ggplot() + geom_sf(data = blockgroup21, fill = "white") +
stat_density2d(data = data.frame(st_coordinates(PhillyShootingFatalInjury.sf)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "lightblue", high = "purple", name = "Density") +
scale_alpha(range = c(0.00, 0.35), guide = "none") +
labs(title = "Density map of fatal gunshot victims, Philadelphia",
caption = "Figure 2.9") +
mapTheme()
The goal of the project is to create a better house price prediction model for Zillow that takes local intelligence into account to generate more accurate predictions. We used OLS regression to estimate the relationship between the dependent variable, house price, and various independent variables: local social factors that might influence housing prices in the city of Philadelphia. Visualization and manipulation of the data are done using the programming language R, and the final deliverable is knitted into an HTML file from an R Markdown file.
After we obtained the data that we need to conduct the study from open sources, we grouped them into three categories by their characteristics for easier reference. We first created a comprehensive table of summary statistics to provide general information and insights into the distribution of variables in our new data set. We then generated a correlation matrix across different variables in the dataset to visualize and identify some of the more related variables. Then through a series of maps, we studied how different independent variables would cause the house price to change geospatially.
We first separated the data into two groups: data with actual sale prices and data for which we need to predict prices. We then split the modelling dataset into a training set and a test set. The training set is used to fit the model with minimum error, and the test set is then used to evaluate how well the model predicts house prices. A linear regression was created using R's lm function, and its performance is analyzed in the results section. To evaluate the model further, we conducted a 100-fold cross-validation to interpret the mean absolute error as well as the correlation between our predicted and observed house prices across the city of Philadelphia. Visualizations including a histogram, a scatterplot, and maps were generated to interpret the accuracy and effectiveness of our model. A Moran's I test was also conducted to measure how similar house prices are to each other based on their spatial proximity and to identify potential cluster patterns, and we included a scatterplot that visualizes the spatial autocorrelation of the errors. Next, we used the regression model to predict prices for the "unpredicted" data we had set aside in the beginning, generating maps for both the modelling and the unpredicted datasets. To test the performance of the model across different sets of data, we ran another regression and visualized the mean absolute percent errors by neighborhood on a map. Another test was conducted using African American resident data from the U.S. Census, categorizing the neighborhoods and comparing the mean absolute percent errors across groups.
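For reference, the global Moran's I statistic mentioned above is computed from spatial weights $w_{ij}$ between observations (here, house prices $y$ with mean $\bar{y}$); values near 0 indicate spatial randomness, while positive values indicate spatial clustering:

```latex
I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot
\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(y_i - \bar{y})(y_j - \bar{y})}
{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```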
The table below is the summary of the OLS regression model, with house price as the dependent variable and each variable listed below as a predictor. The total number of observations in the training dataset is 14,272, and the constant, the value of the dependent variable house price when every independent variable is held at 0, is 306,841.4. Each variable's coefficient, found in its row of the table, represents the amount of change in the dependent variable as that independent variable increases by one unit. A positive number indicates that the two variables are positively related, while a negative number indicates an inverse relationship. Directly below each coefficient, in parentheses, is its standard error; the lower this value, the smaller the standard error of the coefficient, making the coefficient more likely to be precise, and vice versa.
The statistical significance of each coefficient is represented by asterisks, measured by the p-value, which indicates the strength of the evidence against the null hypothesis that there is no relationship between house prices and that independent variable. Three asterisks ("***") indicate that the variable's p-value is less than 0.01, meaning it plays a more important role in predicting the price. Two asterisks ("**") indicate a p-value less than 0.05 but greater than 0.01, and one asterisk ("*") a p-value less than 0.1 but greater than 0.05. When there is no asterisk next to a coefficient, the p-value is greater than 0.1, meaning the variable is less likely to play an important role in predicting house prices. The R-squared (R2) and adjusted R-squared (Adjusted R2) show how well our model fits the data. In our case, R2 equals 0.366, meaning that 36.6% of the total variability in house prices can be explained by our predictors, which shows that our model still has room for improvement. The F-statistic tests the overall significance of the regression and is 824.020 for our model. The three asterisks beside it indicate that its p-value is less than 0.01, meaning the model can be considered statistically significant overall and at least one predictor is significantly related to the dependent variable, house price.
inTrain <- createDataPartition(
y = paste(Philly_price_Model$interior_condition),
p = .60, list = FALSE)
PhillyPrice.training <- Philly_price_Model[inTrain,]
PhillyPrice.test <- Philly_price_Model[-inTrain,]
reg.training <-
lm(sale_price ~ ., data = as.data.frame(PhillyPrice.training) %>%
dplyr::select(sale_price, total_area, interior_condition, houseAge, shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR, MedHHInc, pctForeign, pctMobility))
# Create a table of the model summary using stargazer
stargazer(reg.training,
title = "Table 3.1. Linear Regression Model Summary",
type = "text",
align = TRUE
)
##
## Table 3.1. Linear Regression Model Summary
## ===============================================
## Dependent variable:
## ---------------------------
## sale_price
## -----------------------------------------------
## total_area 13.774***
## (0.456)
##
## interior_condition -66,261.510***
## (1,960.943)
##
## houseAge -158.301***
## (59.572)
##
## shooting.Buffer -4,171.666***
## (362.172)
##
## dist_to_school 1.734
## (2.732)
##
## dist_to_commerce -27.782***
## (2.690)
##
## dist_to_PPR 6.712***
## (1.694)
##
## MedHHInc 2.834***
## (0.053)
##
## pctForeign 52,666.320***
## (13,185.680)
##
## pctMobility 242,317.300***
## (15,582.090)
##
## Constant 306,841.400***
## (10,555.420)
##
## -----------------------------------------------
## Observations 14,272
## R2 0.366
## Adjusted R2 0.366
## Residual Std. Error 189,961.200 (df = 14261)
## F Statistic 824.020*** (df = 10; 14261)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Table 3.2 is the error summary showing the mean absolute error (MAE) and the mean absolute percent error (MAPE), providing an overview of the errors on the test set. The mean absolute error measures the average difference between the predicted and observed values per observation, while the mean absolute percent error puts it on a percentage basis. In our case, the MAE of this model is 102,692.3, which tells us that our predictions are, on average, 102,692.3 dollars higher or lower than the observed values, and the predictions are roughly 40% off the actual prices.
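For predicted prices $\hat{y}_i$ and observed prices $y_i$, the two error measures are defined as follows (note that, in the code used here, the APE denominator is the predicted rather than the observed price, unlike the conventional MAPE):

```latex
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|,
\qquad
\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \frac{\left| \hat{y}_i - y_i \right|}{\hat{y}_i}
```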
PhillyPrice.test <-
PhillyPrice.test %>%
mutate(sale_price.Predict = predict(reg.training, PhillyPrice.test),
sale_price.Error = sale_price.Predict - sale_price,
sale_price.AbsError = abs(sale_price.Predict - sale_price),
sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict) # note: divides by the predicted price; conventional MAPE divides by the observed price
error_summary <- st_drop_geometry(PhillyPrice.test) %>%
summarize(
MAE = mean(sale_price.AbsError),
MAPE = mean(sale_price.APE)
)
kable(error_summary, "html", caption = "Error Summary for Test Set", digits = 2) %>%
kable_styling()
| MAE | MAPE |
|---|---|
| 102692.3 | 0.4 |
PhillyPrice.test %>%
st_drop_geometry() %>%
summarise(MAE = mean(sale_price.AbsError),
MAPE = mean(abs(sale_price.APE)*100)) %>%
kbl(col.name=c('Mean Absolute Error','Mean Absolute Percentage Error')) %>%
kable_classic()
| Mean Absolute Error | Mean Absolute Percentage Error |
|---|---|
| 102692.3 | 58.49086 |
This shows the 100-fold cross-validation results, summarizing the average goodness of fit of our model across 100 subsets of roughly equal size. The root mean square error (RMSE) we obtained is 180,388.8 after taking the square root of the mean squared error. This suggests that we might not have the best model for house price prediction, since the relatively high RMSE value indicates sizeable errors in the predictions. The mean MAE across the 100 folds is 102,792.5, with a standard deviation of 11,210.03. The histogram below visualizes the distribution of the fold-level MAE, showing the frequency of MAE values in 1,000-dollar bins. Even though the frequency tends to be higher for MAE values near 100,000, the spread of the distribution suggests that the model does not generalize especially well to new data.
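The RMSE reported above, again with $\hat{y}_i$ the predicted and $y_i$ the observed price, is defined as:

```latex
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}
```

Because the errors are squared before averaging, RMSE penalizes large mistakes more heavily than MAE, which is why the two values differ here.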
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(sale_price ~ ., data = st_drop_geometry(Philly_price_Model) %>%
dplyr::select(sale_price,
total_area, interior_condition,
houseAge, shooting.Buffer, dist_to_school,
dist_to_commerce, dist_to_PPR, MedHHInc,
pctForeign, pctMobility),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv
## Linear Regression
##
## 23781 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 23542, 23542, 23544, 23544, 23544, 23542, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 180388.8 0.4294934 102792.5
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
cv_results <- data.frame(Resample = reg.cv$resample$Resample,
                         MAE = reg.cv$resample$MAE)
ggplot(cv_results, aes(x = MAE)) +
geom_histogram(binwidth = 1000, fill = "lightblue", color = "black", linewidth = 0.3) + # thin borders
labs(title = "Distribution of Cross-Validation Mean Absolute Error (MAE) for Predictive Model",
x = "Mean Absolute Error (MAE)",
y = "Frequency") +
theme_minimal()
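The resampling loop that `caret::train` runs under the hood can be sketched directly in base R; the sketch below uses simulated data and 10 folds purely for illustration:

```r
# Base-R sketch of k-fold cross-validation for a linear model (simulated data)
set.seed(825)
n <- 1000
d <- data.frame(x = runif(n))
d$y <- 3 * d$x + rnorm(n)
k <- 10
fold <- sample(rep(1:k, length.out = n))          # assign each row to a fold
mae_by_fold <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = d[fold != i, ])        # train on the other k-1 folds
  pred <- predict(fit, newdata = d[fold == i, ])  # predict the held-out fold
  mean(abs(pred - d$y[fold == i]))
})
mean(mae_by_fold)   # average out-of-sample MAE
sd(mae_by_fold)     # how stable the error is across folds
```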
The figure below is a scatterplot of observed sale prices against predicted sale prices with two dashed reference lines: an orange line with slope 1 through the origin, representing perfect prediction (predicted price exactly equal to observed price), and a green line with the same slope shifted upward by the mean sale price. The points fall near the orange line but the two reference lines never overlap, meaning we are making reasonable predictions, but the model does not fit perfectly.
scatter_plot <- ggplot(data = PhillyPrice.test, aes(x = sale_price.Predict, y = sale_price)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "orange", linetype = "dashed", linewidth = 1.2) + # perfect-prediction line
geom_abline(intercept = mean(PhillyPrice.test$sale_price), slope = 1, color = "green", linetype = "dashed", linewidth = 1.2) + # slope-1 line shifted by the mean price
labs(title = "Comparison of Actual Sale Price vs. Predicted Sale Price with Reference Lines",
x = "SalePrice Prediction",
y = "SalePrice") +
theme_minimal()
print(scatter_plot)
The map below highlights the differences between predicted and observed house prices across Philadelphia. The largest residuals, represented by the orange dots, appear to cluster in the downtown area and are more spread out elsewhere. The green dots, representing the smaller residuals, tend to cluster farther from the city center, in the northwest, northeast, and south.
palette5 <- c("#25CB10", "#5AB60C", "#8FA108", "#C48C04", "#FA7800")
ggplot() +
geom_sf(data = blockgroup21, fill = "grey50") +
geom_sf(data = PhillyPrice.test, aes(colour = q5(sale_price.Error)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
name="Quintile\nBreaks") +
labs(title="Spatial Distribution of Test Set Residuals by Quintile in Philadelphia") +
mapTheme()
This figure shows the observed Moran's I value against the histogram of permuted Moran's I values for the test-set residuals. The observed Moran's I is 0.35, marked by the orange vertical line to the right of the histogram. The p-value is less than 0.01, so the null hypothesis of spatial randomness can be rejected, indicating positive spatial autocorrelation in the errors.
coords <- st_coordinates(Philly_price_Model)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
Philly_price_Model$lagPrice <- lag.listw(spatialWeights, Philly_price_Model$sale_price)
coords.test <- st_coordinates(PhillyPrice.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
PhillyPrice.test <- PhillyPrice.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error))
moranTest <- moran.mc(PhillyPrice.test$sale_price.Error,
spatialWeights.test, nsim = 999)
ggplot(data.frame(res = moranTest$res[1:999]), aes(x = res)) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
plotTheme()
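The statistic that `moran.mc` permutes can be computed directly: Moran's I is a normalized cross-product of each value's deviation from the mean with its neighbors' deviations. A hand-computed toy example in base R (illustrative only, not our data):

```r
# Moran's I by hand on a toy 4-location example
x <- c(10, 12, 30, 28)                      # values at four locations
W <- rbind(c(0, 1, 0, 0),                   # binary adjacency: sites 1-2 and
           c(1, 0, 0, 0),                   # 3-4 are neighbor pairs
           c(0, 0, 0, 1),
           c(0, 0, 1, 0))
z <- x - mean(x)                            # deviations from the mean
I <- (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
I                                           # close to +1: neighbors are alike
```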
This scatterplot highlights the relationship between the spatial lag of the price error, the weighted mean of the errors of each sale's five nearest neighbors, and the sale price error itself. The graph shows a positive relationship between the two: as a sale's price error increases, the errors of its neighbors tend to increase as well.
ggplot(PhillyPrice.test, aes(x = lagPriceError, y = sale_price.Error)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Relationship Between Spatial Lag & Sale Price Errors",
x = "Spatial Lag Price Error",
y = "Sale Price Error") +
theme_classic() + # Use a classic theme to remove the grey background
theme(
axis.line = element_line(color = "black"), # Make the axes lines black
axis.ticks = element_line(color = "black"), # Make the axis ticks black
axis.text = element_text(color = "black"), # Make the axis text black
plot.title = element_text(face = "bold", size = 14), # Customize the title
panel.border = element_blank(), # Remove default panel borders
panel.grid = element_blank() # Remove all grid lines
)
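The "spatial lag" on the x-axis is nothing more than a row-standardized weighted average of each observation's neighbors, which is what `lag.listw` computes with `style = "W"`. A toy version in base R:

```r
# Spatial lag by hand: each row of W sums to 1 (row-standardized weights)
x <- c(5, 7, 9, 20)                  # errors at four locations
W <- rbind(c(0, 0.5, 0.5, 0),        # location 1 neighbors locations 2 and 3, etc.
           c(0.5, 0, 0.5, 0),
           c(0.5, 0.5, 0, 0),
           c(0, 0, 1, 0))
lag_x <- as.vector(W %*% x)          # weighted mean of each point's neighbors
lag_x                                # 8 7 6 9
```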
The figure below maps predicted prices for both the modeling and the challenge (unpredicted) datasets based on our regression model. The map suggests obvious clusters of both higher and lower predicted house prices in different parts of Philadelphia. Higher predicted values, shown in orange, cluster in areas far from the center of the city such as the northeast, west, and southeast. Lower predicted values, by contrast, cluster in the center as well as in the southwestern part of the city.
Philly_Housing_joined <-
Philly_Housing_joined %>%
mutate(sale_price.Predict = predict(reg.training, Philly_Housing_joined))
ggplot() +
geom_sf(data = neighborhood, fill = "grey50") +
geom_sf(data = Philly_Housing_joined, aes(colour = q5(sale_price.Predict)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
name="Quintile\nBreaks") +
labs(title="Map of predicted values") +
mapTheme()
This choropleth map shows the distribution of mean absolute percentage error by neighborhood in Philadelphia. Darker colors correspond to higher MAPE values, lighter colors to lower ones. According to the map, most neighborhoods in the neighborhood-effects regression have a relatively low MAPE, except for the central and southwestern parts of the city.
reg.nhood <- lm(sale_price ~ ., data = as.data.frame(PhillyPrice.training) %>%
                  dplyr::select(sale_price, name,   # neighborhood fixed effect
                                total_area, interior_condition, houseAge,
                                shooting.Buffer, dist_to_school, dist_to_commerce,
                                dist_to_PPR, MedHHInc, pctForeign, pctMobility))
PhillyPrice.test.nhood <-
PhillyPrice.test %>%
mutate(Regression = "Neighborhood Effects",
sale_price.Predict = predict(reg.nhood, PhillyPrice.test),
sale_price.Error = sale_price.Predict- sale_price,
sale_price.AbsError = abs(sale_price.Predict- sale_price),
sale_price.APE = (abs(sale_price.Predict- sale_price)) / sale_price)
mape_neighborhood <- PhillyPrice.test.nhood %>%
group_by(name) %>%
summarize(MAPE = mean(sale_price.APE, na.rm = TRUE))
neighborhood <- left_join(neighborhood, st_drop_geometry(mape_neighborhood), by = "name")
ggplot(data = neighborhood) +
geom_sf(aes(fill = MAPE)) +
scale_fill_gradient(low = "white", high = "darkblue", name = "MAPE") +
labs(title = "Mean Absolute Percentage Error by Neighborhood") +
theme_minimal()
This scatterplot highlights the relationship between the mean house price in each neighborhood and its MAPE. The graph shows an inverse relationship between the two variables: as the mean house price goes up, the MAPE decreases and the predictions become more accurate. However, the points are not tightly clustered along the line of best fit, so the fitted line may be influenced by outliers in the dataset.
mean_price_summary <- PhillyPrice.test.nhood %>%
group_by(name) %>%
summarize(meanPrice = mean(sale_price, na.rm = TRUE))
neighborhood <- left_join(neighborhood, st_drop_geometry(mean_price_summary), by = "name")
ggplot(data = neighborhood, aes(x = meanPrice, y = MAPE)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
labs(title = "Impact of Mean Home Price on MAPE Across Neighborhoods in Philadelphia") +
theme_minimal()
To test the generalizability of our model, we used African American population data as a racial context. Figure 3.11 shows the distribution of majority African American and majority non-African American block groups. According to the map, majority non-African American blocks tend to lie in the west, southwest, and lower-middle parts of the city, while majority African American blocks stretch from the upper-middle city to the northwest. In Table 3.12, we calculated the MAPE of the baseline regression and compared it with the neighborhood-effects regression; the difference between the two models is large, while the goodness of fit across racial contexts differs less.
blockgroup21 <- blockgroup21 %>%
mutate(AfricanContext = ifelse(pctForeign > .1, "Majority African-American", "Majority non African-American"))
ggplot() +
  geom_sf(data = na.omit(blockgroup21), aes(fill = AfricanContext)) +
  scale_fill_manual(values = c("lightyellow", "lightblue"), name = "Legend") +
  labs(title = "Spatial Distribution of African-American Population in Philadelphia") +
  mapTheme() +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 12, face = "bold"),
        legend.box = "horizontal")
# B02001_003 African American
PhillyPrice.test <- PhillyPrice.test %>%
mutate(Regression = "Baseline Regression")
bothRegressions <-
rbind(
dplyr::select(PhillyPrice.test, starts_with("sale_price"), Regression, name),
dplyr::select(PhillyPrice.test.nhood, starts_with("sale_price"), Regression, name))
st_join(bothRegressions, blockgroup21) %>%
group_by(Regression, AfricanContext) %>%
summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(AfricanContext, mean.MAPE) %>%
kable("html", caption = "Test set MAPE by neighborhood racial context") %>%
kable_styling(full_width = F)
| Regression | Majority African-American | Majority non African-American |
|---|---|---|
| Baseline Regression | 36% | 44% |
| Neighborhood Effects | 54% | 70% |
In general, it is hard to conclude that our model is very effective at predicting house prices in Philadelphia. The R-squared of the predictive model is 0.37, meaning only 37% of the variation in house prices can be explained by our model. The model also does not generalize well, with an MAE of 102,032.2 and a MAPE of 0.48. This indicates relatively large gaps between predicted and actual house values, driven by social factors that vary by region, and the model will need refinement. Some of the more interesting variables we included, such as a house's proximity to social facilities like schools, parks, and commercial corridors, actually had very little effect on the price of the property. Another interesting pattern is the relationship between house prices and the amount of living space, which we assumed would be strongly correlated with price but turned out not to be. Even though the model suggests a relatively steep best-fit slope, the price variation among small houses was large. This may be driven by downtown Philadelphia, where many small but expensive units can be found. On average, our model can predict house prices in Philadelphia within plus or minus 102,032.2 dollars of the sale price. The most important factor affecting market price among those we studied appears to be the median household income of residents. We observed a relatively strong correlation between the two, which makes sense, since wealthier households have greater purchasing power. Interestingly, we observed much more variation in housing prices as household income rises, reflecting higher-income households' ability to choose among different housing options.
Based on the MAE neighborhood map, it is reasonable to say that our model performs better at predicting house values in the surrounding areas than in the center of the city. This may be because the city center has a larger population and housing prices vary with the differing needs of residents and the scarcity of space. Our model struggled to capture the geographic disparities in housing prices across Philadelphia: the residual maps and spatial autocorrelation tests showed clear clustering of errors, particularly in the downtown area. This suggests that localized factors, such as proximity to transit or affluent neighborhoods, influence prices more than the general variables we included. Moran's I confirmed moderate spatial autocorrelation, indicating the need for spatial regression techniques, such as spatial lag models or geographically weighted regression (GWR), to account for these regional patterns and improve location-specific predictions. Future iterations could also incorporate non-linear relationships and more flexible methods such as decision trees or random forests to capture complex interactions. Adding more granular data, such as school ratings or detailed crime statistics, would enhance accuracy, and additional demographic factors such as employment rates and population density could further refine the predictions and make the model more robust across neighborhoods.
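As a small illustration of why more flexible methods could help, the sketch below (simulated data, not our Philadelphia set) constructs a deliberately non-linear price curve: a straight-line OLS fit leaves a larger error than even a single quadratic term, and tree-based methods such as random forests generalize this idea to many interacting predictors:

```r
# Illustration on simulated data: OLS vs. a simple non-linear fit
set.seed(1)
n <- 500
d <- data.frame(x = runif(n, 0, 10))
d$y <- 2000 * d$x^2 + rnorm(n, sd = 5000)   # non-linear "price" curve
lin  <- lm(y ~ x, data = d)                 # straight-line fit
quad <- lm(y ~ poly(x, 2), data = d)        # adds a squared term
mae_lin  <- mean(abs(residuals(lin)))
mae_quad <- mean(abs(residuals(quad)))
mae_quad < mae_lin                          # TRUE: flexibility cuts the error
```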
In conclusion, we would not recommend our model to Zillow in its current state, due to the magnitude of its prediction errors and its lack of generalizability. However, the model can be a good starting point, or serve as the base of a more fully developed house price prediction model, by introducing more relevant data and reducing the prediction errors. We are also aware of the limitations of OLS regression as a modeling approach and are open to other machine learning methods to increase the accuracy of our predictions. With those changes, we believe our model would become more useful to Zillow for estimating accurate housing prices.