I. Introduction

As the leading online real estate marketplace in the United States, Zillow must ensure the accuracy of the housing market predictions it offers as the “Zestimate,” which underpins its services related to buying, renting, selling, and financing homes. The company informed us that there was demand for improving the existing housing predictions and tasked us with building a better predictive model of home prices at the local scale of the city of Philadelphia. This new model is intended to serve as a useful tool for both homeowners and buyers to obtain accurate house price information, helping them avoid underselling and overpaying and ultimately securing Zillow’s credibility and competitive advantage. We were told that the housing market predictions Zillow currently uses do not factor in enough local intelligence. Therefore, to increase the accuracy of housing market predictions in Philadelphia, the model incorporates local characteristics when predicting house prices in the city.

Ordinary Least Squares (OLS) regression, a statistical approach for estimating the relationship between variables and describing how one variable changes in response to others, is used to structure this new housing prediction model. Using the OLS regression model, we examine how each local characteristic (the independent variables) influences local house prices (the dependent variable), how strong the correlations between the variables are, and how accurate our model is.

However, there are still important considerations and difficulties we must address throughout the creation of the model. These include finding the most relevant data from reliable sources, trimming the different datasets, and creating visualizations that are both accurate and easy to read. The limitations of the OLS model and potential errors in the datasets could also affect the accuracy of our results. Since we are predicting house prices in the real world, where nothing is fully predictable, we must keep in mind that unanticipated and uncontrollable factors can impact our results.

II. Data Wrangling and Clean Up

To evaluate how different variables impact house prices in Philadelphia, we used data from the U.S. Census Bureau and OpenDataPhilly, as well as some data from the original dataset we were provided with, to construct our model. The collected data was then grouped into three categories that people might consider when purchasing a house: the condition of the house, the demographics of the neighborhood, and other environmental factors.

For the “House Condition” category, we included the age of the house, the quality grade of its interior condition, and the size of both the indoor and outdoor area of the property. For the “Demographic Characteristics” category, we included the median household income over the past 12 months, the percentage of foreign-born residents, and the geographic mobility rate in the neighborhood where the house is located. For the “Environmental Impact” category, we included the distance from the house to the nearest recreational area/park, commercial corridor, and school. We also included the number of victims who suffered fatal gunshot wounds within a 1,000-foot buffer of each house.

2.1 Import packages and set options

# Load Libraries
library(geos)
library(rsgeo)
library(tidyverse)
library(tidycensus)
library(sf)
library(kableExtra)
library(knitr)
library(dplyr)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(stargazer)


options(scipen=999)
options(tigris_class = "sf")

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

2.2 Import and clean the general dataset

Since our project is heavily focused on geographic features, we retrieved the block group and neighborhood boundaries from OpenDataPhilly, and we filtered the variables we need from the raw housing price data.
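
The exact trimming code is not reproduced here; as a rough illustration, the step looks something like the sketch below, where Philly_price_raw, year_built, and the retained column list are assumptions rather than the raw file’s actual schema.

# Hypothetical sketch of trimming the raw sales data: Philly_price_raw,
# year_built, and the exact column list are assumptions, not the real schema.
Philly_price_clean <- Philly_price_raw %>%
  dplyr::filter(sale_price > 0) %>%                  # drop records without a usable sale price
  dplyr::mutate(houseAge = 2022 - year_built) %>%    # age of the house at the time of the study
  dplyr::select(sale_price, toPredict, total_area,
                interior_condition, houseAge)        # keep only the modelling variables (sf geometry is retained)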

To better assess the environmental impact on housing prices, we employed two different methods. For the shooting data, we considered the number of victims who suffered fatal gunshot wounds within a 1,000-foot radius of each property. Fatal incidents included gunshots to critical areas such as the head, neck, and chest. Additionally, since the location and timing of crimes are unpredictable, it is essential to use an appropriate buffer to evaluate the perceived safety of a neighborhood. Even a fatal crime occurring at a relatively greater distance can instill fear and influence housing values. Therefore, we used a buffer to aggregate crime incidents rather than relying on the average nearest neighbor distance.
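
A minimal sketch of this buffer count, assuming the property points (Philly_price_clean) and the fatal shooting points (PhillyShootingFatalInjury.sf, used again later in the report) share a projected coordinate system measured in feet:

# Sketch only: count fatal-shooting victims within 1,000 feet of each property,
# assuming both layers are in a projected CRS with units of feet.
shooting_buffers <- st_buffer(Philly_price_clean, dist = 1000)
Philly_price_clean$shooting.Buffer <-
  lengths(st_intersects(shooting_buffers, PhillyShootingFatalInjury.sf))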

The second approach focuses on the proximity to schools, commercial corridors, and public parks and recreation centers. In this case, we utilized Euclidean distance, as the direct distance to these public resources is a key factor in determining the convenience of living in a particular location. This method captures the accessibility of amenities, with the variables for further analysis being the distances from each property to the nearest public facility.

nearest_school <- sf::st_nearest_feature(Philly_price_clean, schools.sf)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(schools.sf)

Philly_price_clean$dist_to_school <- rsgeo::distance_euclidean_pairwise(x, y[nearest_school])

nearest_Commercial <- sf::st_nearest_feature(Philly_price_clean, Commercial_Corridors)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(Commercial_Corridors)
Philly_price_clean$dist_to_commerce <- rsgeo::distance_euclidean_pairwise(x, y[nearest_Commercial])

nearest_PPR <- sf::st_nearest_feature(Philly_price_clean, PPR_Sites)
x <- rsgeo::as_rsgeo(Philly_price_clean)
y <- rsgeo::as_rsgeo(PPR_Sites)
Philly_price_clean$dist_to_PPR <- rsgeo::distance_euclidean_pairwise(x, y[nearest_PPR])

For the census variables, we linked the data from each block group to the properties located within that group. Additionally, we assigned a neighborhood designation to each house for future analysis and categorized the quality condition into four levels: 4 for good condition, 3 for fair condition, 2 for poor condition, and 1 for others.
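
For reference, a single block-group pull with tidycensus might look like the sketch below; B19013_001 is the ACS median household income variable, while the foreign-born and mobility shares come from other ACS tables whose exact codes we omit here.

# Hedged sketch of one ACS block-group pull (median household income only);
# the remaining census variables are retrieved the same way from their own tables.
blockgroup_income <- get_acs(geography = "block group",
                             variables = c(MedHHInc = "B19013_001"),
                             state = "PA", county = "Philadelphia",
                             year = 2021, output = "wide")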

Philly_Housing_joined <-
  st_join(Philly_price_clean, blockgroup21, join = st_within)
Philly_Housing_joined <-
  st_join(Philly_Housing_joined, neighborhood, join = st_within)
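
The condition recode itself is not shown above; a hypothetical version, assuming the raw grade is stored in a character column such as quality_grade, would look like this:

# Hypothetical recode of the quality grade into the four levels described above;
# the quality_grade column name is an assumption for illustration only.
Philly_Housing_joined <- Philly_Housing_joined %>%
  mutate(quality_condition = case_when(
    quality_grade == "good" ~ 4,
    quality_grade == "fair" ~ 3,
    quality_grade == "poor" ~ 2,
    TRUE                    ~ 1
  ))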

A table of summary statistics with variable descriptions

Table 2.1 gives a comprehensive summary of the independent variables that might affect housing prices. The total sample size is 23,781. Here is a description of the variables:

House condition:
- total_area: the total area owned, including the living area and the outdoor area
- houseAge: the age of the house since it was first built
- interior_condition: the quality grade of the interior condition

Demographic characteristics:
- MedHHInc: the median household income in the past 12 months (in 2021 inflation-adjusted dollars)
- pctForeign: the percentage of residents who were born outside of the United States
- pctMobility: the percentage of residents who lived in a different house in the United States one year ago

Environmental impact:
- shooting.Buffer: the number of victims who suffered fatal gunshot wounds within a 1,000-foot buffer of each house
- dist_to_school: the distance from a house to the nearest school
- dist_to_commerce: the distance from a house to the nearest commercial corridor
- dist_to_PPR: the distance from a house to the nearest public park or recreation center

The following summary statistics table describes each independent variable used alongside the dependent variable (house price), with a total of 23,781 samples. The table reports the mean, standard deviation, minimum, and maximum of each selected variable. We observed relatively high standard deviations for some variables, which indicates that some of our results might be affected by potential outliers in the dataset.

# Impute missing values in every numeric column with that column's mean
numeric_columns <- sapply(Philly_Housing_joined, is.numeric)
for (col in names(Philly_Housing_joined)[numeric_columns]) {
  col_mean <- mean(Philly_Housing_joined[[col]], na.rm = TRUE)
  Philly_Housing_joined[[col]][is.na(Philly_Housing_joined[[col]])] <- col_mean
}

house_condition <- st_drop_geometry(Philly_Housing_joined) %>%
  select(total_area, houseAge, interior_condition) %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarize(
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    Min = min(value, na.rm = TRUE),
    Max = max(value, na.rm = TRUE)
  )

demo <- st_drop_geometry(Philly_Housing_joined) %>%
  select(MedHHInc, pctForeign, pctMobility) %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarize(
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    Min = min(value, na.rm = TRUE),
    Max = max(value, na.rm = TRUE)
  )

envi_impact <- st_drop_geometry(Philly_Housing_joined) %>%
  select(shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR) %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarize(
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    Min = min(value, na.rm = TRUE),
    Max = max(value, na.rm = TRUE)
  )

summary_stats <- rbind(house_condition, demo, envi_impact)  
summary_table <- knitr::kable(summary_stats, "html", digits = 1, caption = "Table of summary statistics") %>%
  kable_styling(full_width = F)

grouped_table <- group_rows(summary_table, "House Condition", 1, 3)
grouped_table <- group_rows(grouped_table, "Demographic Characteristics", 4, 6)
grouped_table <- group_rows(grouped_table, "Environmental Impact", 7, 10)

grouped_table
Table of summary statistics

variable               mean       sd        Min      Max
House Condition
  houseAge             85.4       29.7      0.0      273.0
  interior_condition   3.6        0.9       0.0      7.0
  total_area           1828.9     3813.0    0.0      226512.0
Demographic Characteristics
  MedHHInc             65581.2    31777.9   6302.0   250001.0
  pctForeign           0.1        0.1       0.0      0.8
  pctMobility          0.1        0.1       0.0      0.8
Environmental Impact
  dist_to_PPR          1796.5     1032.0    79.9     10317.7
  dist_to_commerce     670.0      638.3     0.0      7385.7
  dist_to_school       1147.7     647.6     32.4     5900.8
  shooting.Buffer      5.2        4.7       1.0      58.0

Correlation Matrix across variables

To better visualize these relationships, we created a correlation matrix that shows the pairwise correlation between the variables on the x-axis and those on the y-axis. The strength of each relationship is represented by a correlation coefficient on a scale of -1.0 to 1.0. A coefficient of 0 means no correlation; the further the coefficient is from 0, the more strongly the two variables are related. A positive value indicates a positive correlation, while a negative value indicates a negative one. Based on this matrix, most of the variables are at least somewhat related to one another, apart from a few blank cells where the coefficient is not significantly different from 0. Positive correlations (shown in blue) are roughly as common as negative correlations (shown in red). The strongest positive correlations appear among the distances to social amenities such as schools, parks, and commercial corridors, and between the age of a house and its interior condition. On the other hand, the interior condition of a house shows notable negative correlations with median household income and geographic mobility, suggesting an inverse relationship.

numericVars <- st_drop_geometry(Philly_Housing_joined)[, c("total_area", "houseAge", "interior_condition", "MedHHInc", "pctForeign", "pctMobility", "shooting.Buffer", "dist_to_school", "dist_to_commerce", "dist_to_PPR")]

ggcorrplot(
  round(cor(numericVars), 1),
  p.mat = cor_pmat(numericVars),
  colors = c("#b2182b", "white", "#2166ac"),  # Darker shades of red and blue
  type = "lower",
  insig = "blank",
  lab = TRUE,  # Adding labels inside the cells for better visibility
  lab_col = "black"  # Color of the labels for clear contrast
) +
  labs(
    title = "Correlation matrix across variables",
    caption = "Figure 2.1",  # Replace `label` with `caption`
    x = NULL,  # Remove x-axis label
    y = NULL   # Remove y-axis label
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Home price correlation scatterplot with median household income

Median Household Income

This scatterplot highlights the relationship between housing prices and median household income in the city of Philadelphia. As shown by the red line of best fit, which indicates the correlation between the two variables, house prices and median household income are positively related, with a modest upward slope. As median household income increases, home prices also rise, reflecting the greater purchasing power of those households. Most of the points are clustered in the lower left of the graph, which suggests that people with lower household incomes are more likely to live in houses that are relatively low in price. On the right side of the plot, the points are more scattered, since wealthier residents have more housing options to choose from. There are also a few outliers with housing prices around 6,000,000 dollars, far above the rest of the values in this dataset.

ggplot(Philly_Housing_joined, aes(x = MedHHInc, y = sale_price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Relationship Between Home Price and Median Household Income in Philladelphia",
    x = "Median Household Income",
    y = "Home Price",
    caption = "Figure 2.2"  # Replacing `label` with `caption`
  ) +
  theme_minimal()

House Age

This scatterplot highlights the relationship between housing prices and house age in the city of Philadelphia. As shown by the red line of best fit, price and age are inversely related, with a slightly decreasing slope. The scatterplot splits into two parts at around 125 years on the x-axis: points are much more densely clustered to the left, while those to the right are far more scattered, because few houses in Philadelphia are older than 125 years and even fewer of them are on the market. This suggests we should reconsider whether house age is a good predictor of price, since the relatively small number of old houses pulls on the slope of the line of best fit (see the short check after the plot code below). Even looking at the left side of the plot alone, however, there is still a clear decline in house prices as age increases, which makes sense: most buyers would not go out of their way to purchase an old house with potential structural and other problems.

ggplot(Philly_Housing_joined, aes(x = houseAge, y = sale_price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Relationship Between House Age and Home Price in Philadelphia",
    x = "House Age",
    y = "Home Price",
    caption = "Figure 2.3"
  ) +
  theme_minimal()
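
As a quick check of this concern (a sketch, not part of the original workflow), we can compare the fitted age slope with and without the sparse tail of houses older than 125 years:

# Sketch: does the thin tail of very old houses pull the age slope?
coef(lm(sale_price ~ houseAge, data = Philly_Housing_joined))["houseAge"]
coef(lm(sale_price ~ houseAge,
        data = dplyr::filter(Philly_Housing_joined, houseAge <= 125)))["houseAge"]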

Total Area

This scatterplot highlights the relationship between a property’s price and its amount of living space in the city of Philadelphia. As shown by the red line of best fit, price and living space are positively related, with an increasing slope. However, most of the values in this dataset lie in a dense cluster in the bottom left corner of the graph, and very few points are found elsewhere. This means the scatterplot may not be the most effective representation of the dataset, since the line of best fit is heavily influenced by outliers. We also observed large variation in the prices of houses with smaller total living space. A reasonable explanation is that these smaller houses are located in different neighborhoods of the city, where prices are determined by other factors even though the houses are similar in size. Still, looking only at the bottom left corner of the graph where the points are clustered, it is reasonable to conclude that, in general, houses with more living space tend to be more expensive, based on the increasing slope of the line of best fit.

ggplot(Philly_Housing_joined, aes(x = total_area, y = sale_price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Relationship Between Total Area and Home Price",
    x = "Total Area",
    y = "Home Price",
    caption = "Figure 2.4"
  ) +
  theme_minimal()

Recreational Sites

This scatterplot highlights the relationship between a house’s price and its distance to the nearest recreation site in the city of Philadelphia. As shown by the red line of best fit, price and distance to a recreation site are positively related, with a very slight upward slope. Most points are distributed in a cluster along the line of best fit, meaning this correlation is less affected by outliers than some of the scatterplots shown above. Because the slope is so small, it is hard to conclude that the further a house is from a recreational site, the higher its price will be. There are also far more houses close to a recreational site than far from one. Since recreational sites such as parks are located inside the city, our assumption is that households living further from the central area are more likely to be wealthier and own cars, meaning they can also afford more expensive housing.

ggplot(Philly_Housing_joined, aes(x = dist_to_PPR, y = sale_price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Price as a Function of Distance to Parks and Recreation",
    x = "Distance to Parks and Recreation",
    y = "Home Price",
    caption = "Figure 2.5"
  ) +
  theme_minimal()

Dependent variable: sale price

This choropleth map shows the distribution of mean house sale prices by block group in the city of Philadelphia. Darker colors correspond to higher sale prices, while lighter colors correspond to lower ones. According to the map, most block groups in Philadelphia have relatively low sale prices except for the northwest and the lower middle parts of the city. The northwestern block groups look much darker than most of the rest of the city, suggesting there may be some luxury houses located in those areas, which can be potential outliers in our dataset. The relatively darker block groups in the lower middle of the map make up the Philadelphia downtown area, where housing is likely to be more expensive given the resources concentrated in the city center.

Philly_price_Model <- Philly_Housing_joined %>%
  filter(toPredict == "MODELLING")

mean_sale_price <- Philly_price_Model %>%
  group_by(GEOID) %>%
  summarize(MeanSalePrice = mean(sale_price))

# Attach the block-group mean sale price to the block-group polygons by GEOID
blockgroup21 <- left_join(blockgroup21, st_drop_geometry(mean_sale_price), by = "GEOID")
blockgroup21$MeanSalePrice[is.na(blockgroup21$MeanSalePrice)] <- 0
ggplot() +
  geom_sf(data = blockgroup21, aes(fill = MeanSalePrice)) +
  scale_fill_gradient(low = "white", high = "darkblue", 
                      name = "Sale Price (U.S. Dollars)") +
  labs(title = "Choropleth Map of Housing Sale Price by neighborhoods",
       label = "Figure 2.6") +
  theme_minimal()

Independent Variable Map: Geographic Mobility

This choropleth map shows the distribution of the geographic mobility rate by block group in the city of Philadelphia. Darker colors correspond to higher geographic mobility rates, while lighter colors correspond to lower ones. The geographic mobility rate of Philadelphia residents appears scattered, with no obvious clustering except for some block groups near the downtown area in the lower middle part of the city. The block group in the west where Fairmount Park is located is much darker than the rest of the city because its population is very small, since the park takes up most of that area. For that reason, the data represented by that block group is not especially meaningful for the analysis of geographic mobility in this case.

ggplot() +
  geom_sf(data = blockgroup21, aes(fill = pctMobility)) +
  scale_fill_gradient(low = "white", high = "darkblue", name = "Geographic Mobility Rate") +
  labs(title = "Choropleth Map of Geographic Mobility by Block Groups",
       label = "Figure 2.7") +
  theme_minimal()

Independent Variable Map: Distance to School

This choropleth map shows the average distance from each house to its nearest school, aggregated by block group, in the city of Philadelphia. Darker colors correspond to greater average distances, while lighter colors correspond to shorter ones. The average distance to school is fairly evenly spread across the city, with a few block groups along the edge of the city’s boundary showing greater distances than the center.

mean_school_dis <- Philly_Housing_joined %>%
  group_by(GEOID) %>%
  summarize(MeanSchoolDis = mean(dist_to_school))

blockgroup21 <- left_join(blockgroup21, st_drop_geometry(mean_school_dis), by = "GEOID")
blockgroup21$MeanSchoolDis[is.na(blockgroup21$MeanSchoolDis)] <- 0
ggplot() +
  geom_sf(data = blockgroup21, aes(fill = MeanSchoolDis)) +
  scale_fill_gradient(low = "white", high = "darkblue", name = "Average Distance") +
  labs(title = "Choropleth Map of Distance to school by Block Groups",
       label = "Figure 2.8") +
  theme_minimal()

Independent Variable Map: Shooting

This heat map shows the density of fatal gunshot injuries across the city of Philadelphia. Darker colors indicate a higher density of fatal gunshot incidents, while lighter colors indicate few to no incidents. The areas with the highest density are located in the middle and the southwestern part of the city, indicating that those areas have higher exposure to shooting incidents and can be more dangerous places to live.

ggplot() + geom_sf(data = blockgroup21, fill = "white") +
  stat_density2d(data = data.frame(st_coordinates(PhillyShootingFatalInjury.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "lightblue", high = "purple", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density map of fatal gunshot victims, Philadelphia",
       label = "Figure 2.9") +
  mapTheme()

Other map of interest: Map of Social Facilities (Parks, Schools, Commercial).

This choropleth map shows the average combined distance from a house to the nearest commercial corridor, school, and park, aggregated by block group, in the city of Philadelphia. Darker colors correspond to greater average distances, while lighter colors correspond to shorter ones. The average distance to these public facilities is relatively short in Center City, meaning residents there have better access to parks, schools, and commercial corridors. The block groups with greater average distances, on the other hand, are located around the edge of the city’s boundary, suggesting that people who live there have less access to public facilities than those who live in the city center.

mean_CC_dis <- Philly_Housing_joined %>%
  group_by(GEOID) %>%
  summarize(MeanCCDis = mean(dist_to_commerce + dist_to_school + dist_to_PPR))

blockgroup21 <- left_join(blockgroup21, st_drop_geometry(mean_CC_dis), by = "GEOID")
blockgroup21$MeanCCDis[is.na(blockgroup21$MeanCCDis)] <- 0
ggplot() +
  geom_sf(data = blockgroup21, aes(fill = MeanCCDis)) +
  scale_fill_gradient(low = "white", high = "darkblue", name = "Average Distance") +
  labs(title = "Average Distance to Commercial Corridors, Schools, and Parks",
       label = "Figure 2.10") +
  mapTheme()

III. Methods

The goal of the project is to create a better house price prediction model for Zillow that takes local intelligence into account to generate more accurate predictions. We used OLS regression to estimate the relationship between the dependent variable, house price, and various independent variables: local social factors that might influence housing prices in the city of Philadelphia. Visualization and manipulation of the data were done in the programming language R, and the final deliverable is knitted into an HTML file from an R Markdown file.

Exploratory Analysis:

After obtaining the data we needed from open sources, we grouped the variables into three categories by their characteristics for easier reference. We first created a comprehensive table of summary statistics to provide general information and insight into the distribution of the variables in our new dataset. We then generated a correlation matrix across the variables to visualize and identify the more strongly related ones. Then, through a series of maps, we examined how the dependent and independent variables vary geospatially across the city.

OLS Regression Analysis

We first separated the data into two groups: observations with actual sale prices and observations for which we need to predict prices. We then split the modelling dataset into a training set and a test set. The training set is fed into the learning algorithm to fit a model with minimum error, and the test set is used to evaluate how well that model predicts house prices. A linear regression was fitted using R’s lm function, and its performance is examined in the results section. To assess the stability of the model, we conducted a 100-fold cross-validation and interpreted the mean absolute error as well as the correlation between predicted and observed house prices across the city of Philadelphia. Visualizations such as a histogram, scatterplots, and maps were generated to interpret the accuracy and effectiveness of the model. A Moran’s I test was also conducted to measure how similar values are among nearby houses and to identify potential spatial clustering in the prediction errors, and we included a scatterplot that visualizes the spatial autocorrelation of the errors. Next, we used the regression model to predict prices for the “unpredicted” challenge data that we had set aside at the beginning and mapped the predicted values for both datasets. To test the performance of the model across different parts of the city, we ran another regression and visualized the mean absolute percent error by neighborhood on a map. A final test used African American population data from the U.S. Census, categorizing block groups by racial context and comparing the mean absolute percent errors across the two groups.

Results

Training set linear model summary results

The table below is the summary of the OLS regression model, with house price as the dependent variable and the predictors listed below. The training dataset contains 14,272 observations, and the constant (the predicted house price when every independent variable is held at 0) is 306,841.4. Each coefficient, shown next to its variable name, represents the change in the dependent variable for a one-unit increase in that independent variable. A positive coefficient indicates that the variable is positively related to house price, while a negative coefficient indicates an inverse relationship. Directly below each coefficient, in parentheses, is its standard error; the lower the standard error, the more precise the coefficient estimate is likely to be, and vice versa.

The statistical significance of each coefficient is indicated by asterisks based on its p-value, the probability of observing such a coefficient if the null hypothesis (no relationship between house price and that independent variable) were true. Three asterisks (***) mean the p-value is below 0.01, so the variable plays a clearly significant role in predicting price. Two asterisks (**) mean the p-value is between 0.01 and 0.05, and one asterisk (*) means it is between 0.05 and 0.1. When there is no asterisk next to a coefficient, the p-value is above 0.1, meaning that variable is unlikely to play an important role in predicting house prices. The R-squared (R2) and adjusted R-squared show how well the model fits the data. In our case, R2 equals 0.366, meaning that 36.6% of the total variability in house prices can be explained by our predictors, which shows that the model still has room for improvement. The F-statistic tests whether the model as a whole explains a significant share of the variance; it is 824.020 for our model, and its three asterisks indicate a p-value below 0.01, meaning the model is statistically significant overall and at least one predictor is significantly related to house price.

inTrain <- createDataPartition(
              y = paste(Philly_price_Model$interior_condition), 
              p = .60, list = FALSE)
PhillyPrice.training <- Philly_price_Model[inTrain,] 
PhillyPrice.test <- Philly_price_Model[-inTrain,]  
 
reg.training <-
  lm(sale_price ~ ., data = as.data.frame(PhillyPrice.training) %>%
                             dplyr::select(sale_price, total_area, interior_condition, houseAge, shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR, MedHHInc, pctForeign, pctMobility))

# Create a table of the model summary using stargazer
stargazer(reg.training,
          title = "Table 3.1. Linear Regression Model Summary",
          type = "text",
          align = TRUE
          )
## 
## Table 3.1. Linear Regression Model Summary
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                             sale_price         
## -----------------------------------------------
## total_area                   13.774***         
##                               (0.456)          
##                                                
## interior_condition        -66,261.510***       
##                             (1,960.943)        
##                                                
## houseAge                    -158.301***        
##                              (59.572)          
##                                                
## shooting.Buffer            -4,171.666***       
##                              (362.172)         
##                                                
## dist_to_school                 1.734           
##                               (2.732)          
##                                                
## dist_to_commerce            -27.782***         
##                               (2.690)          
##                                                
## dist_to_PPR                  6.712***          
##                               (1.694)          
##                                                
## MedHHInc                     2.834***          
##                               (0.053)          
##                                                
## pctForeign                 52,666.320***       
##                            (13,185.680)        
##                                                
## pctMobility               242,317.300***       
##                            (15,582.090)        
##                                                
## Constant                  306,841.400***       
##                            (10,555.420)        
##                                                
## -----------------------------------------------
## Observations                  14,272           
## R2                             0.366           
## Adjusted R2                    0.366           
## Residual Std. Error  189,961.200 (df = 14261)  
## F Statistic         824.020*** (df = 10; 14261)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Table of mean absolute error and MAPE for a single test set

Table 3.2 is the error summary showing the mean absolute error (MAE) and the mean absolute percent error (MAPE), providing an overview of the errors on the test set. The mean absolute error measures the average difference between the predicted and observed values per observation, while the mean absolute percent error expresses that difference on a percentage basis (here computed relative to the predicted price). In our case, the MAE of the model is roughly 102,700 dollars, which tells us that our predictions are, on average, about that far above or below the observed sale prices; the corresponding MAPE values are reported in the tables below.

PhillyPrice.test <-
  PhillyPrice.test %>%
  mutate(sale_price.Predict = predict(reg.training, PhillyPrice.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)

error_summary <- st_drop_geometry(PhillyPrice.test) %>%
  summarize(
    MAE = mean(sale_price.AbsError),
    MAPE = mean(sale_price.APE)
  )

kable(error_summary, "html", caption = "Error Summary for Test Set", digits = 2) %>%
  kable_styling()
Error Summary for Test Set

MAE        MAPE
102692.3   0.4
PhillyPrice.test %>% 
  st_drop_geometry() %>%
  summarise(MAE = mean(sale_price.AbsError),
            MAPE = mean(abs(sale_price.APE)*100)) %>%
  kbl(col.name=c('Mean Absolute Error','Mean Absolute Percentage Error')) %>%
  kable_classic()
Mean Absolute Error   Mean Absolute Percentage Error
102692.3              58.49086

Cross validation tests with 100 folds

This section shows the results of a 100-fold cross-validation, which reports the average goodness of fit of our model across 100 subsets of roughly equal size. The root mean square error (RMSE) is 180,388.8, the square root of the mean squared error. This relatively high RMSE suggests that we do not yet have an especially accurate model for house price prediction. The mean MAE across the 100 folds is about 102,800, with a standard deviation of roughly 11,000. The histogram below visualizes the distribution of the MAE using 1,000-dollar bins. Although the frequencies are somewhat higher between roughly 90,000 and 110,000, the spread of MAE values across folds suggests that the model does not generalize especially well to new data.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(sale_price ~ ., data = st_drop_geometry(Philly_price_Model) %>% 
                                dplyr::select(sale_price, 
                                total_area, interior_condition, 
                                houseAge, shooting.Buffer, dist_to_school, 
                                dist_to_commerce, dist_to_PPR, MedHHInc, 
                                pctForeign, pctMobility), 
     method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv
## Linear Regression 
## 
## 23781 samples
##    10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 23542, 23542, 23544, 23544, 23544, 23542, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   180388.8  0.4294934  102792.5
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
cv_results <- data.frame(Resample = reg.cv$resample$Resample,
                         MAE = reg.cv$resample$MAE)
ggplot(cv_results, aes(x = MAE)) +
  geom_histogram(binwidth = 1000, fill = "lightblue", color = "black", size = 0.3) +  # Adjust the 'size' parameter for thinner borders
  labs(title = "Distribution of Cross-Validation Mean Absolute Error (MAE) for Predictive Model",
       x = "Mean Absolute Error (MAE)",
       y = "Frequency") +
  theme_minimal()

Comparison of Actual Sale Price vs. Predicted Sale Price with Reference Lines

The figure below is a scatterplot of predicted house prices against observed prices, with two dashed reference lines: an orange line with a slope of 1 through the origin, representing the case where predicted sale prices exactly equal observed sale prices, and a green line with the same slope shifted upward by the mean observed sale price. The cloud of points sits reasonably close to the orange perfect-prediction line but does not fall along it, meaning that we are making some reasonable predictions but the model is far from a perfect fit.

scatter_plot <- ggplot(data = PhillyPrice.test, aes(x = sale_price.Predict, y = sale_price)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "orange", linetype = "dashed", size = 1.2) +  # Adjusting dashed line
  geom_abline(intercept = mean(PhillyPrice.test$sale_price), slope = 1, color = "green", linetype = "dashed", size = 1.2) +  # Adjusting dashed line
  labs(title = "Comparison of Actual Sale Price vs. Predicted Sale Price with Reference Lines",
       x = "SalePrice Prediction",
       y = "SalePrice") +
  theme_minimal()  


print(scatter_plot)

A map of residuals for the test set.

The map below highlights the differences between predicted and observed house prices in the city of Philadelphia. According to the map, the largest residuals in our data, represented by the orange dots, appear to cluster in the downtown area and are somewhat more spread out in other parts of the city. The green dots, representing smaller residuals, tend to cluster in places farther from the center of the city, such as the northwest, northeast, and south.

palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")
ggplot() +
  geom_sf(data = blockgroup21, fill = "grey50") +
  geom_sf(data = PhillyPrice.test, aes(colour = q5(sale_price.Error)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   name="Quintile\nBreaks") +
  labs(title="Spatial Distribution of Test Set Residuals by Quintile in Philadelphia") +
  mapTheme()

Moran’s I test results

This figure shows the observed Moran’s I value together with the histogram of Moran’s I values from the random permutations. The observed Moran’s I of about 0.35, marked by the orange vertical line to the right of the histogram, lies well outside the permutation distribution. The associated p-value is below 0.01, so we can reject the null hypothesis of no spatial autocorrelation: the prediction errors exhibit positive spatial autocorrelation.

coords <- st_coordinates(Philly_price_Model) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

Philly_price_Model$lagPrice <- lag.listw(spatialWeights, Philly_price_Model$sale_price)

coords.test <-  st_coordinates(PhillyPrice.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

PhillyPrice.test <- PhillyPrice.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error))

moranTest <- moran.mc(PhillyPrice.test$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

Scatter plot of Relationship Between Spatial Lag & Sale Price Errors

This scatterplot highlights the relationship between the spatial lag of the sale price error, which is the weighted mean of the errors of each house’s five nearest neighbors, and the sale price error itself. The two variables are positively related, meaning that as a house’s price error increases, the errors of its neighboring houses tend to increase as well.

ggplot(PhillyPrice.test, aes(x = lagPriceError, y = sale_price.Error)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Relationship Between Spatial Lag & Sale Price Errors",
       x = "Spatial Lag Price Error",
       y = "Sale Price Error") +
  theme_classic() +  # Use a classic theme to remove the grey background
  theme(
    axis.line = element_line(color = "black"),  # Make the axes lines black
    axis.ticks = element_line(color = "black"),  # Make the axis ticks black
    axis.text = element_text(color = "black"),  # Make the axis text black
    plot.title = element_text(face = "bold", size = 14),  # Customize the title
    panel.border = element_blank(),  # Remove default panel borders
    panel.grid = element_blank()     # Remove all grid lines
  )

A map of predicted values for both the model dataset and the unpredicted dataset.

The figure below maps predicted prices for both the modelling and the challenge (unpredicted) datasets based on our regression model. The map shows clear clusters of both higher and lower predicted house prices in different parts of the city of Philadelphia. Higher predicted values, shown in orange, cluster in areas far from the center of the city, such as the northeast, west, and southeast. Lower predicted values, by contrast, form clusters in the center as well as in the southwestern part of the city.

Philly_Housing_joined <-
  Philly_Housing_joined %>%
  mutate(sale_price.Predict = predict(reg.training, Philly_Housing_joined))

ggplot() +
  geom_sf(data = neighborhood, fill = "grey50") +
  geom_sf(data = Philly_Housing_joined, aes(colour = q5(sale_price.Predict)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   name="Quintile\nBreaks") +
  labs(title="Map of predicted values") +
  mapTheme()

A map of mean absolute percentage error (MAPE) by neighborhood.

This choropleth map shows the distribution of the mean absolute percentage error by neighborhood in the city of Philadelphia. Darker colors correspond to higher MAPE values, while lighter colors correspond to lower ones. According to the map, most neighborhoods in the neighborhood-effects regression have a relatively low MAPE, except for the central and southwestern parts of the city.

reg.nhood <- lm(sale_price ~ ., data = as.data.frame(PhillyPrice.training) %>% 
                                 dplyr::select(sale_price, total_area, interior_condition, houseAge, shooting.Buffer, dist_to_school, dist_to_commerce, dist_to_PPR, MedHHInc, pctForeign, pctMobility))

PhillyPrice.test.nhood <-
  PhillyPrice.test %>%
  mutate(Regression = "Neighborhood Effects",  
         sale_price.Predict = predict(reg.nhood, PhillyPrice.test),
         sale_price.Error = sale_price.Predict- sale_price,
         sale_price.AbsError = abs(sale_price.Predict- sale_price),
         sale_price.APE = (abs(sale_price.Predict- sale_price)) / sale_price)

mape_neighborhood <- PhillyPrice.test.nhood %>%
  group_by(name) %>%
  summarize(MAPE = mean(sale_price.APE, na.rm = TRUE))


neighborhood <- left_join(neighborhood, st_drop_geometry(mape_neighborhood), by = "name")
ggplot(data = neighborhood) +
  geom_sf(aes(fill = MAPE)) +
  scale_fill_gradient(low = "white", high = "darkblue", name = "MAPE") +
  labs(title = "Mean Absolute Percentage Error by Neighborhood") +
  theme_minimal()

A scatter plot of MAPE by neighborhood as a function of mean price by neighborhood.

This scatterplot highlights the relationship between the mean house price in each neighborhood and that neighborhood’s MAPE. The two variables are inversely related: as the mean house price goes up, the MAPE decreases and the predictions become more accurate. However, the points are not tightly clustered along the line of best fit, so the line may be affected by outliers in the dataset.

mean_price_summary <- PhillyPrice.test.nhood %>%
  group_by(name) %>%
  summarize(meanPrice = mean(sale_price, na.rm = TRUE))

neighborhood <- left_join(neighborhood, st_drop_geometry(mean_price_summary), by = "name")
neighborhood <- left_join(neighborhood, st_drop_geometry(mape_neighborhood), by = "name")

ggplot(data = neighborhood, aes(x = meanPrice, y = MAPE.y)) +
  geom_point(alpha = 0.7) +  # one point per neighborhood
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  labs(title = "Impact of Mean Home Price on MAPE Across Neighborhoods in Philadelphia") +
  theme_minimal()

Generalizability across urban contexts

To test the generalizability of our model, we used African American population data to define a racial context. Figure 3.11 shows the distribution of majority African-American block groups and majority non-African-American block groups. According to the map, majority non-African-American block groups tend to lie in the west, southwest, and lower middle parts of the city, while majority African-American block groups are distributed from the upper middle of the city to the northwest. In Table 3.12, we compare the test set MAPE of the baseline regression with that of the neighborhood-effects regression across the two contexts: the difference between the two models is large, with the neighborhood-effects regression producing noticeably higher errors, while both regressions show somewhat higher MAPE in majority non-African-American block groups.

blockgroup21 <- blockgroup21 %>%
  mutate(AfricanContext = ifelse(pctForeign > .1, "Majority African-American", "Majority non African-American"))

grid.arrange(ncol = 2,
  ggplot() + 
    geom_sf(data = na.omit(blockgroup21), aes(fill = AfricanContext)) +
    scale_fill_manual(values = c("lightyellow", "lightblue"), name = "Legend") +  # Change legend text here
    labs(title = "Spatial Distribution of African-American Population in Philladelphia") +
    mapTheme() +
    theme(legend.position = "bottom", 
          legend.text = element_text(size = 10),  # Increase text size for readability
          legend.title = element_text(size = 12, face = "bold"),  # Adjust title size and style
          legend.box = "horizontal")              # Adjust legend box arrangement
)

# B02001_003 African American

PhillyPrice.test <- PhillyPrice.test %>%
  mutate(Regression = "Baseline Regression")
bothRegressions <-
  rbind(
    dplyr::select(PhillyPrice.test, starts_with("sale_price"), Regression, name),
    dplyr::select(PhillyPrice.test.nhood, starts_with("sale_price"), Regression, name))

st_join(bothRegressions, blockgroup21) %>% 
  group_by(Regression, AfricanContext) %>%
  summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(AfricanContext, mean.MAPE) %>%
  kable("html", caption = "Test set MAPE by neighborhood racial context") %>%
  kable_styling(full_width = F)
Test set MAPE by neighborhood racial context

Regression             Majority African-American   Majority non African-American
Baseline Regression    36%                         44%
Neighborhood Effects   54%                         70%

Discussion

Overall, it is hard to conclude that our model is especially effective at predicting house prices in the city of Philadelphia. The R-squared of the model is about 0.37, meaning that only around 37% of the variation in house prices can be explained by our predictors. The model does not generalize very well either, with a test-set MAE of about 102,000 dollars and a MAPE close to 0.5. This suggests relatively high errors between predicted and actual values, driven by social factors that vary by region, and the model will need refinement in the future. Some of the more interesting variables we included, such as a house’s proximity to social facilities like schools, parks, and commercial corridors, actually had very little effect on the price of the property. Another interesting pattern is the relationship between house prices and the amount of living space, which we assumed would be strongly correlated but turned out not to be so clear-cut: even though the fitted line has a relatively steep slope, the price variation among small houses was very large, likely driven by downtown Philadelphia, where many small but expensive units can be found. On average, our model predicts house prices in Philadelphia within roughly plus or minus 102,000 dollars of the sale price. The most important factor affecting housing prices among those we studied appears to be the median household income of residents: the two are relatively strongly correlated, which makes sense since wealthier households have greater purchasing power. Interestingly, we observed much more variation in housing prices as household income rises, reflecting higher-income households’ ability to choose among different housing options.

Based on the MAPE-by-neighborhood map, it is reasonable to say that our model performs better at predicting house values in the surrounding areas than in the center of the city. This might be because the city center has a larger population and housing prices there vary with residents’ different needs as well as the scarcity of space. Our model struggled to capture the geographic disparities in housing prices across Philadelphia: the residual maps and spatial autocorrelation tests showed clear clustering of errors, particularly in the downtown area. This suggests that localized factors, like proximity to transit or affluent neighborhoods, influence prices more than the general variables we included. Moran’s I confirmed moderate spatial autocorrelation, indicating the need for spatial regression techniques, such as spatial lag models, to better address these regional patterns and improve accuracy in location-specific predictions. To improve the model, future iterations could incorporate non-linear relationships and more advanced methods such as decision trees or random forests to capture complex interactions. Adding more granular data, such as school ratings or detailed crime statistics, would also enhance accuracy. The clustering of errors highlights the need for spatial regression or geographically weighted regression (GWR) to account for spatial variation, and expanding the model with additional demographic factors like employment rates and population density could further refine the predictions and make them more robust across neighborhoods.
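
As one concrete illustration of those spatial approaches (a sketch under the assumption that the spatialreg package is available; we did not estimate this model for the report), a spatial lag model could reuse the five-nearest-neighbor weights already built for the Moran's I test:

# Sketch of a spatial lag model reusing the existing k-nearest-neighbor weights
# (spatialWeights was built from Philly_price_Model earlier in the report).
library(spatialreg)

lag_model <- lagsarlm(
  sale_price ~ total_area + interior_condition + houseAge + shooting.Buffer +
    dist_to_school + dist_to_commerce + dist_to_PPR + MedHHInc +
    pctForeign + pctMobility,
  data   = st_drop_geometry(Philly_price_Model),
  listw  = spatialWeights,
  method = "Matrix"   # sparse method; the dense default is slow for ~24,000 sales
)
summary(lag_model)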

Conclusion

In conclusion, we would not recommend our model to Zillow in its current state because of its prediction errors and its limited generalizability. However, the model can be a good starting point, or serve as the base of a better-developed house price prediction model, by introducing more relevant data and reducing prediction errors. We are also aware of the limitations of OLS regression as a modelling approach and are willing to try other machine learning methods to increase the accuracy of our predictions. With those changes, we believe our model would become more useful to Zillow for estimating housing prices more accurately.