Introduction

This project “Geospatial Risk Predictions of Armed Robbery Incidents in Chicago” explores the application of predictive policing through geospatial risk modeling to forecast the likelihood of armed robbery incidents in Chicago. By utilizing historical crime data, socioeconomic indicators, and spatial risk factors such as abandoned cars, street light outages, and sanitation complaints, the model aims to identify high-risk areas for armed robbery incidents. The study employs Poisson regression to predict crime counts and compares the accuracy of spatial models against traditional methods like kernel density estimation. Emphasizing the importance of addressing bias and data selection issues, the project aims to improve the generalizability and fairness of crime prediction models across different neighborhoods, particularly considering racial and socioeconomic disparities. Through spatial cross-validation and the inclusion of spatial process features, this analysis seeks to enhance predictive accuracy and offer insights into better resource allocation for law enforcement.

The objective of this report is to develop a geospatial risk model for predicting armed robbery incidents in Chicago. Robbery data is often influenced by selection bias due to two key factors: not all robbery incidents are reported to law enforcement, and areas with historically high robbery rates, which receive increased police presence, tend to continue reporting higher incidents. The underlying hypothesis is that crime risk is driven by exposure to various geospatial risk and protective factors, such as signs of blight or the presence of recreation centers. As exposure to risk factors rises, so does the likelihood of crime. In addition to identifying these risk factors, the model incorporates spatial variables to improve prediction accuracy. Poisson regression is used in this analysis due to its suitability for modeling count data. 1

Crime Data Collection

All datasets for this study were sourced from Chicago Data Portal. The Chicago Data Portal is an online platform that provides access to a wide range of municipal datasets. The portal includes an extensive record of crime incidents in Chicago, updated annually. For this analysis, crime data for 2021 was downloaded using the RSocrata package, which allows easy access to open data from platforms like Socrata, the software behind the Chicago Data Portal.

This report specifically filter the armed robbery crime incidents in Chicago. The armed robbery include: KNIFE / CUTTING INSTRUMENT, HANDGUN, OTHER FIREARM, or OTHER DANGEROUS WEAPON. After filtering the crime data for armed robbery incidents, we focused on those involving the following weapon types: knife/cutting instrument, handgun, other firearms, or other dangerous weapons. These specific categories provide a clearer understanding of more serious and violent robbery offenses, as they pose a higher threat to victims and are more likely to occur in areas with distinct risk factors. By narrowing our analysis to armed robberies, we aim to explore the spatial distribution of these incidents and assess how environmental and social variables, such as abandoned buildings, street light outages, and sanitation complaints, correlate with higher occurrences of these violent crimes. This targeted approach enables us to develop a more precise predictive model for robbery risk in different areas of Chicago.

robbery21 <- read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2021/dwme-t96c")
saveRDS(robbery21, "robbery21.rds")
# Read from the locally saved file
robbery21 <- readRDS("robbery21.rds") %>% 
  filter(Primary.Type == "ROBBERY", grepl("ARMED", Description)) %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X), Y = as.numeric(Y)) %>% 
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>%
  distinct()

The map displays the spatial distribution of armed robbery incidents in Chicago for 2021, revealing a clear pattern of concentration in specific areas. The highest density of incidents occurs in the central and southern parts of the city, with particular clustering around the city’s south side and certain western neighborhoods. These areas likely experience higher rates of violent crime due to a combination of socio-economic factors and environmental conditions. In contrast, the northern parts of the city, especially near the lakefront, show significantly fewer armed robberies. This spatial pattern suggests that certain regions may face more pronounced crime risks, possibly influenced by urban infrastructure, economic distress, or other factors that contribute to the likelihood of robbery incidents. Understanding these clusters can help inform targeted interventions and crime prevention strategies in high-risk areas.

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey") +
  geom_sf(data = robbery21, colour="darkred", size=0.1, show.legend = "point") +
  labs(title= "Armed Robbery Incidents in Chicago 2021") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8)
        )

Below is the density map of armed robbery incidents in Chicago. The density map of armed robbery incidents in Chicago reveals two prominent hotspots. The first cluster is located in the central-western part of the city, while the second significant concentration is found slightly to the south of downtown. These areas experience the highest intensity of armed robberies, indicating localized zones with increased risk. The surrounding areas show a gradual decrease in density as we move away from these core hotspots, suggesting that armed robberies are not evenly distributed across the city but are instead concentrated in specific regions. The northern and far southern parts of the city exhibit much lower densities, indicating that armed robberies are relatively less common in these areas. This pattern highlights the need for focused crime prevention efforts in the identified high-risk zones.

options(scipen=0)
ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey") +
  stat_density2d(data = data.frame(st_coordinates(robbery21)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 60, geom = 'polygon') +
  scale_fill_viridis(option = "turbo", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Armed Robbery") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

Crime risk tends to fluctuate gradually across geographic space rather than conforming to predefined administrative boundaries. To effectively capture this continuous variation, point-level data can be aggregated into a network of evenly spaced grid cells. The code provided below generates such a fishnet for Chicago, offering a structured way to represent crime risk spatially. This grid-based approach is ideal for preparing the data for regression analysis, allowing for a smoother depiction of crime trends across the urban landscape.

fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

ggplot() +
  geom_sf(data=fishnet, color="black", fill="lightblue") +
  labs(title = "Fishnet of Chicago") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

Once the fishnet was created, use it as the spatial framework to visualize the distribution of robbery incidents. From this, it becomes clear that the highest number of robberies are concentrated in a few grid cells located in northeast Chicago. This pattern highlights the emergence of a clustered spatial trend in robbery occurrences.

robbery21_net <- 
  dplyr::select(robbery21) %>% 
  mutate(countRobbery = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countRobbery = replace_na(countRobbery, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = robbery21_net, aes(fill = countRobbery), color = NA) +
  scale_fill_viridis(option = "turbo", name = "Robbery Counts") +
  labs(title = "Count of Robberies for the Fishnet") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

The bar chart below illustrates the distribution of robbery incidents in Chicago for 2021, revealing a highly skewed pattern. A large number of grid cells have few or no robbery incidents, while only a small number of grids experience significantly higher robbery counts. This non-normal distribution suggests that Poisson regression is a suitable method for analyzing this data, as it can handle the overdispersion and count nature of the variable.

ggplot(robbery21_net, aes(x = countRobbery)) +
  geom_histogram(fill = "steelblue", color = "#2c3e50", bins = 30) + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "plain", size = 7),
        axis.text.y = element_text(size = 7), 
        axis.title = element_text(size = 9), 
        plot.title = element_text(size = 14, face = "bold"), 
        plot.subtitle = element_text(size = 10), 
        panel.grid.major = element_line(color = "grey80"), 
        panel.grid.minor = element_blank(),
        panel.border = element_rect(colour = "grey60", fill = NA, size = 0.8)) + 
  labs(title = "Distribution of Robberies in Chicago 2021",
       subtitle = "Analysis of robbery incident distribution",
       caption = "Data: Chicago Data Portal Crimes 2021") +
  xlab("Robbery Incidents") +
  ylab("Count")

Risk Factors

In this project, eight risk factors are included in the model to predict robberies: reports of abandoned cars, street lights out, graffiti, sanitation complaints, abandoned buildings, locations of liquor stores selling alcohol to-go, ShotSpotter detections, and banks. Neighborhoods with abandoned cars, street lights out, graffiti, and sanitation issues often indicate neglect, economic distress, and reduced community oversight, which can create an environment conducive to criminal activities. Abandoned buildings can serve as hideouts for criminals, further contributing to neighborhood disorder. Liquor stores are associated with alcohol-related vulnerabilities, while ShotSpotter detections point to areas with higher gun violence. Finally, banks, due to the availability of cash, may attract robbers seeking financial gain. These factors collectively highlight areas that may be more susceptible to robberies.

abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2020") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")
  
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")


shotSpotter <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Violence-Reduction-Shotspotter-Alerts/3h7q-7mdb") %>%  
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Shot_Spotter")

Banks <- 
  read.socrata("https://data.cityofchicago.org/Administration-Finance/Lending-Equity-Depository-Locations-2021/jgup-zt7k") %>%  
  dplyr::select(location) %>%
  na.omit()
Banks <- Banks %>%   
  mutate(X = as.numeric(sub("POINT \\((-?\\d+\\.\\d+) \\d+\\.\\d+\\)", "\\1", Banks$location)),
         Y = as.numeric(sub("POINT \\(-?\\d+\\.\\d+ (-?\\d+\\.\\d+)\\)", "\\1", Banks$location))) %>%
  dplyr::select(-location) %>% 
  filter(!is.na(X) & !is.na(Y)) %>% 
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Bank") 
vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,
        liquorRetail, graffiti, sanitation, shotSpotter, Banks) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()

The figure illustrates the distinct spatial patterns of each risk factor across Chicago. Abandoned buildings are predominantly concentrated in the southern part of the city, while banks and liquor retail stores cluster around the central business district (CBD). Abandoned cars, sanitation complaints, and street lights out complaints are more evenly scattered throughout the city. Graffiti removal requests are mainly concentrated near the city center, whereas ShotSpotter detections tend to occur in areas outside the city center, highlighting the spatial variability in factors associated with potential robbery risks.

vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

plot1 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Abandoned_Buildings"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Abandoned Buildings") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


plot2 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Abandoned_Cars"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Abandoned Cars") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot3 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Bank"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Banks") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot4 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Graffiti"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Graffiti") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


plot5 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Liquor_Retail"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Liquor Retail") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot6 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Sanitation"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Sanitation") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot7 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Shot_Spotter"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Shot Spotter") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


plot8 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Street_Lights_Out"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Street Light Out") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

wrap_plots(
  plot1, plot3, plot2, plot4,
  plot5, plot6, plot7, plot8,
  ncol = 4  
)

Feature Engineering

Nearest Distance

The first feature engineering approach we took is to calculate average nearest neighbor distance to hypothesize a smoother exposure relationship across space. Here, the nn_function is used. For each of our risk factors, average nearest distance is calculated from the centroid of each cell to its nearest three, abandoned cars, for example.

st_c <- st_coordinates
st_coid <- st_centroid

vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
      Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
      Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
      Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
      Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
      Shot_Spotter.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(shotSpotter),3))

vars_net <-
  vars_net %>% mutate(Bank.nn = nn_function(st_c(st_coid(vars_net)), st_c(Banks),3))

The neighbor features provide a clearer picture of the spatial relationships between different risk factors.The southern and southeastern areas of Chicago appear to be most distant from many of these risks, particularly abandoned buildings, abandoned cars, and graffiti. Interestingly, the southwestern corner also shows some distance from risk factors like banks and liquor stores. One notable observation is that the northern part of the city is furthest from ShotSpotter locations, while still being relatively close to other risk factors such as sanitation issues and street lights out. The relative absence of ShotSpotters in this area raises questions about their role: are they an indicator of higher crime due to their concentration, or does their presence suggest increased policing, making neighborhoods safer? These observations highlight the need to examine whether ShotSpotters function as a crime detection tool or a proxy for neighborhood safety. These are important considerations for understanding the broader dynamics of crime and safety in Chicago.

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)


p1.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Abandoned_Buildings.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Abandoned Buildings NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


p2.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Abandoned_Cars.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Abandoned Cars NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p3.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Bank.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Banks NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p4.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Graffiti.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Graffiti NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


p5.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Liquor_Retail.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Liquor Retail NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p6.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Sanitation.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Sanitation NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p7.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Shot_Spotter.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Shot Spotter NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


p8.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Street_Lights_Out.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Street Light Out NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

wrap_plots(
  p1.nn, p3.nn, p2.nn, p4.nn,
  p5.nn, p6.nn, p7.nn, p8.nn,
  ncol = 4  
)

Local Spatial Autocorrelation

identified robbery hotspots in Chicago using Local Moran’s I, which examines whether robbery counts at a location are clustered relative to neighboring cells. A spatial weights matrix was calculated using queen contiguity to define neighboring relationships. The results, visualized with Moran’s I values, p-values, and significant hotspots, show high spatial autocorrelation in northeastern Chicago near the CBD. Areas with significant hotspots (p-values ≤ 0.05) align with patterns seen in the kernel density map, indicating clusters of higher robbery counts than expected by chance.

final_net <-
  left_join(robbery21_net, st_drop_geometry(vars_net), by="uniqueID")
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

local_morans <- localmoran(final_net$countRobbery, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(Robbery_Count = countRobbery, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)
moran1 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "Robbery_Count"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Robbery Count") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

moran2 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "Local_Morans_I"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Local Morans I") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

moran3 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "P_Value"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "P Value") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


moran4 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "Significant_Hotspots"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Signigicant Hotspots") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

moran1 + moran2 + moran3 + moran4

Singificant Robbery Clusters

The map illustrates the distance to the nearest robbery hotspots across Chicago, with darker colors indicating shorter distances. The central and southern parts of the city exhibit the shortest distances to robbery hotspots, as shown by the deep blue and black areas. This suggests that these regions are located closer to areas with a high concentration of robbery incidents. In contrast, the outer northern and southern parts of the city, represented by yellow, green, and red hues, are farther away from these hotspots, indicating lower proximity to concentrated robbery activities. This pattern highlights the spatial distribution of robbery risks in Chicago, with more concentrated robbery activity occurring in specific regions, particularly in the central and southern areas. The distance gradient provides insight into how robbery risks decrease as one moves away from these core hotspots.

final_net <-
  final_net %>% 
  mutate(robbery.isSig = ifelse(localmoran(final_net$countRobbery, 
                             final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(robbery.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
                                          st_coordinates(st_centroid(
                                            filter(final_net, robbery.isSig == 1))), 1))
ggplot() +
      geom_sf(data = final_net, aes(fill=robbery.isSig.dist), colour=NA) +
      scale_fill_viridis(option = "turbo", name="NN Distance") +
      labs(title="Robbery NN Distance") +
      theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

All Risk Factors

The figure shows scatter plots of the count of predictor variables and their nearest neighbor distances in relation to robbery incidents. All count-based features, such as abandoned buildings, cars, and sanitation complaints, are positively correlated with the number of robberies, indicating that areas with higher occurrences of these risk factors tend to experience more robberies. The strongest positive relationships are observed between robbery incidents and sanitation complaints, ShotSpotter detections, and liquor retail locations.

Conversely, all nearest neighbor distance features are negatively correlated with robbery incidents, meaning that as the distance to risk factors increases, the number of robberies decreases. The strongest negative correlations are seen with distance to significant robbery hotspots, abandoned buildings, and liquor retail stores. This suggests that proximity to these risk factors plays a significant role in robbery patterns.

When it comes to model building, this information is critical for feature selection. To avoid multicollinearity, either the count or the nearest neighbor distance should be selected for each risk factor. In this case, the model incorporates nearest neighbor distances, as they tend to show stronger and more consistent correlations with robbery incidents.

final_net_long <- final_net%>% 
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countRobbery, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") 

correlation.cor <-
  final_net_long %>%
    group_by(Type) %>%
    summarize(correlation = cor(Number, countRobbery, use = "complete.obs"))

final_net_long %>%
  ggplot(aes(x= Number, y = countRobbery)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1, size=3) +
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Abandoned_Buildings` = "Abandoned Buildings",
    `Abandoned_Buildings.nn` = "Distance to Abandoned Buildings",
    `Abandoned_Cars` = "Abandoned Cars",
    `Abandoned_Cars.nn` = "Distance to Abandoned Cars",
    `Bank` = "Bank",
    `Bank.nn` = "Distance to Bank",
    `Graffiti` = "Graffiti",
    `Graffiti.nn` = "Distance to Graffiti",
    `Liquor_Retail` = "Liquor Retail",
    `Liquor_Retail.nn` = "Distance to Liquor Retail",
    `robbery.isSig` = "Significant Robbery",
    `robbery.isSig.dist` = "Distance to Significant Robbery",
    `Sanitation` = "Sanitation",
    `Sanitation.nn` = "Distance to Sanitation",
    `Shot_Spotter` = "Shot Spotter", 
    `Shot_Spotter.nn` = "Distance to Shot Spotter",
    `Street_Lights_Out` = "Street Light Out",
    `Street_Lights_Out.nn` = "Distance to Broken Street Light")))  +
  labs(title = "Scatter Plots of Predictor Variables",
       caption = "Data: Chicago Data Portal") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

Poisson Regression

Random K-Fold CV

The table below presents goodness of fit metrics for four regression models. The first set of models includes only risk factors ‘reg.vars’, while the second set ‘reg.ss.vars’ incorporates both risk factors and the Local Moran’s I spatial process. To evaluate model performance, two types of cross-validation were conducted. The first method assigns a randomly generated cvID to each grid cell, allowing for random k-fold cross-validation to assess prediction accuracy.

reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
              "Shot_Spotter.nn", "Bank.nn")

reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
                 "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
                 "Shot_Spotter.nn", "Bank.nn", "robbery.isSig", "robbery.isSig.dist")

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRobbery",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countRobbery, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRobbery",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countRobbery, Prediction, geometry)

Spatial LOGO CV

The second one hold out one neighborhood, train the model on the remaining n-1 areas, predict for the hold out, and record the goodness of fit. This is also called the spatial ‘Leave-one-group-out’ cross-validation (LOGO-CV), in which each neighborhood takes a turn as a hold-out. The result of each analysis is a sf layer with observed and predicted burglary counts.

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countRobbery",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countRobbery, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countRobbery",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countRobbery, Prediction, geometry)

Error Analysis

The code block below creates a long form reg.summary, that binds together observed/predicted counts and errors for each grid cell and for each regression. Residuals are calculated for each regression by subtracting the original robbery incidents from predicted incidents.

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countRobbery,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countRobbery,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countRobbery,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countRobbery,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 

Comparing the error across different regressions, we see that adding spatial process features improve the model. In addition, the model appears slightly less robust for the spatial cross-validation, probably because LOGO-CV is such a conservative assumption.

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countRobbery, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>% ungroup()

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable(col.name=c("Regression", 'Mean Absolute Error','Standard Deviation MAE')) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
    row_spec(2, color = "black", background = "#FCFFA4") %>%
    row_spec(4, color = "black", background = "#FCFFA4") 
Regression Mean Absolute Error Standard Deviation MAE
Random k-fold CV: Just Risk Factors 0.34 0.29
Random k-fold CV: Spatial Process 0.32 0.27
Spatial LOGO-CV: Just Risk Factors 0.93 0.96
Spatial LOGO-CV: Spatial Process 0.69 0.77

The small multiple maps below visualizes the distribution of errors by regression. For the random k-fold cv method, the errors are pretty evenly distributed across space. However, for spatial logo-cv method, higher errors occur at spatial hotspots.

error_by_reg_and_fold %>% 
  ggplot() +
  geom_sf(aes(fill=MAE), color="transparent") +
  scale_fill_viridis(option = "turbo", direction = -1) + 
  facet_wrap(~Regression, ncol = 4) +
   labs(title = "MAE by Fold and Regression",
       caption = "Data: Chicago Data Portal") +
  theme(legend.position = "bottom",
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8),
        strip.text = element_text(size = 7)) 

Prediction Accuracy

Generalizability Test

Let’s now test for generalizability across racial-neighborhood context. To test this proposition, tidycensus is used to pull race data by census tract. Percentage minority population for each census tract is calculated by comparing the total number of Latinx, African American, and Asian population to the total population. Census tracts are split into two groups: majority white and minority, depending on whether the proportion of minority population in that census tract exceeds 60%.

census_api_key(dlgInput(
  "Enter a Census API Key", # ask for an api key
  Sys.getenv("CENSUS_API_KEY")
)$res,
overwrite = TRUE)
tracts21 <- 
  get_acs(geography = "tract", 
          variables = c("B02001_001E", # total population
            "B02001_002E", # white population
            "B02001_003E", # black population
            "B02001_005E", # asian population
            "B03002_012E"), 
          year=2021, state=17, county=031, 
          geometry=TRUE, output="wide") %>%
  st_transform('ESRI:102271') %>% 
  rename(TotalPop = B02001_001E, 
         Whites = B02001_002E,
         African_Americans = B02001_003E,
         Asians = B02001_005E,
         Latinx = B03002_012E) %>% 
  mutate(pctMinority = ifelse(TotalPop > 0, (African_Americans + Asians + Latinx ) / TotalPop, 0), 
         majority = ifelse(pctMinority > 0.5, "minority", "majority")) %>%   .[neighborhoods,]

We may see that Chicago remains a very segregated city and there’s clear racial boundary between places where more than 60% of the population are minority and places that’s majority white. In particular, northern and northeastern (CBD) part of Chicago are majority white while the remaining parts are all made up of racial minorities.

ggplot() + geom_sf(data = na.omit(tracts21), aes(fill = majority), color = NA) +
    scale_fill_manual(values = c("#BB3754", "#56106E"), name="Race Context") +
    labs(title = "Race Context ",
         caption = "Data: American Community Survey 2021") +
  theme(
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.4)
        ) 

Error is calculated by subtracting the observed robbery count from the prediction. A positive difference represents an over-prediction. The least ideal result is a model that over-predicts risk in minority areas, and under-predicts in white areas. If reporting selection bias is an issue, such a model may unfairly allocate police resource disproportionately in black and brown communities. In our case, over-prediction of minorities is an issue for random k-fold cv regression, but not for logo-cv regressions. The latter two models under-predict minority neighborhoods and over-predict majority white neighborhood, making it looks like that this algorithms generalizes well with respect to race.

joinrace <- st_centroid(reg.summary) %>% 
  st_intersection(tracts21 %>%dplyr:: select(pctMinority)) %>% 
  st_drop_geometry() %>% 
  group_by(cvID) %>% 
  summarize(meanMinor = mean(pctMinority))


reg.summary <- reg.summary %>% 
  left_join(joinrace, by = "cvID") 


reg.summary %>% 
  mutate(raceContext = ifelse(meanMinor > .6, "Minority", "Majority White")) %>% 
  st_drop_geometry() %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
    row_spec(2, color = "black", background = "#FCFFA4") %>%
    row_spec(4, color = "black", background = "#FCFFA4") 
Regression Majority White Minority
Random k-fold CV: Just Risk Factors 0.2067014 -0.0155049
Random k-fold CV: Spatial Process 0.1954263 -0.0142177
Spatial LOGO-CV: Just Risk Factors 0.2224433 -0.1437801
Spatial LOGO-CV: Spatial Process 0.1220346 -0.0872692

Kernel Desntiy Estimation

The utility of this algorithm is judged relative to an alternative police allocation method: kernel density estimation. Kernel density works by centering a smooth kernel, or curve, atop each crime point such that the curve is at its highest directly over the point and the lowest at the range of a circular search radius. The code block below creates a Kernel density map of robbery with a 1000 foot search radius.

rob_ppp <- as.ppp(st_coordinates(robbery21), W = st_bbox(final_net))
rob_KD.1000 <- spatstat.explore::density.ppp(rob_ppp, 1000)
rob_KD.df <- data.frame(rasterToPoints(mask(raster(rob_KD.1000), as(neighborhoods, 'Spatial'))))


ggplot(data=rob_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_viridis(option = "turbo", name="Density") +
  labs(title = "Kernel Density of Robbery 1000ft Radii") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

Next, a new goodness of fit indicator is created to illustrate whether the 2021 kernel density or risk predictions capture more of the 2022 robberies.

robbery22 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2022/9hwr-2zxp") %>% 
   filter(Primary.Type == "ROBBERY") %>% 
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

The kernel density estimation was classified into 5 risk categories and the count for the 2022 robberies were summarized into the same sf layer. The purpose here is to see if the risk predictions capture more observed burglaries than the kernel density. If so, then the risk prediction model provides a more robust targeting tool for allocating police resources.

rob_KDE_sum <- as.data.frame(rob_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 

kde_breaks <- classIntervals(rob_KDE_sum$value, 
                             n = 5, "fisher")

rob_KDE_sf <- rob_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
    mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label, Risk_Category, robberyCount)

The same process was repeated for the risk prediction. Note that both predictions from the LOGO-CV with and without the spatial features is being used here.

ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction, 
                             n = 5, "fisher")
rob_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions Spatial",
         Risk_Category =classInt::findCols(ml_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
      mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label,Risk_Category, robberyCount)

ml_breaks_simple <- classIntervals(reg.ss.cv$Prediction, 
                             n = 5, "fisher")
rob_risk_sf_simple <-
  reg.ss.cv %>%
  mutate(label = "Risk Predictions Simple",
         Risk_Category =classInt::findCols(ml_breaks_simple),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
      mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label,Risk_Category, robberyCount)

Prediction Comparison

Below a map is generated of the risk categories for all three model types. A strongly fit model should show that the highest risk category is uniquely targeted to places with a high density of burglary points. The map shows that our risk prediction models perform as good as the kernal density approach in capturing the distribution of robbery incidents.

rbind(rob_KDE_sf, rob_risk_sf, rob_risk_sf_simple) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    facet_wrap(~label, ) +
    scale_fill_viridis(option = "turbo", discrete = TRUE, name = "Risk Category") +
      geom_sf(data = sample_n(robbery22, 3000), size = .03, colour = "white") +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2021 robbery risk predictions; 2022 armed robberies") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

An effective way to assess prediction accuracy is by using a histogram. Ideally, a well-fitting model should show that the highest-risk areas capture a larger proportion of 2022 robberies compared to the kernel density estimate. However, in this case, the kernel density model outperforms both risk prediction models in the highest-risk regions. Despite this, the risk prediction models excel in identifying armed robbery incidents in lower-risk areas that the kernel density model overlooked.

rbind(rob_KDE_sf, rob_risk_sf, rob_risk_sf_simple) %>%
  st_drop_geometry() %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countRobbery = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Pcnt_of_test_set_crimes = countRobbery / sum(countRobbery)) %>%
    ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(option = "turbo", discrete = TRUE, name = "Model", direction = -1) +
      labs(title = "Risk Prediction vs. Kernel Density",
           y = "% of Test Set Robberies (per model)",
           x = "Risk Category") +
  theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Conclusions

In conclusion, our model requires significant improvement before it can be implemented effectively. Compared to traditional kernel density estimation, it struggled to capture armed robbery incidents in the highest-risk areas. One key issue is that many of our predictor variables, such as reports of abandoned cars, graffiti, and sanitation complaints, were outdated, last updated in 2019. Using these to predict robberies in 2021 and 2022 likely led to inaccurate results, especially since shifts in these risk factors (like graffiti removal) would alter crime patterns. Up-to-date datasets are crucial for more accurate predictions.

Additionally, the COVID-19 pandemic likely impacted armed robbery patterns, with economic distress, lockdowns, and reallocation of police resources all contributing to shifts in crime locations. Our model needs to account for such factors to ensure predictions remain relevant over time.

On a positive note, the model performed better in lower-risk areas, where the kernel density method fell short. However, relying on a single model may not capture the full complexity of armed robbery patterns. A combined approach using both kernel density estimation and risk prediction could improve accuracy. While we avoided overpredicting in majority non-white neighborhoods, potential selection bias remains a concern. Incorporating spatial features and using k-fold cross-validation helped improve the model, but further refinement is needed to ensure it generalizes well across different contexts and time periods.

---
title: "Risk Predictions of Armed Robbery Incidents in Chicago"
author: "Yutong Jiang"
date: "Oct,18,2024"
output:
  html_document:
    theme: cosmo 
    toc: yes
    toc_float: yes
    code_folding: hide
    code_download: yes
editor_options:
  markdown:
    wrap: sentence
---



```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


```{r load libraries, include=FALSE}

library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat.explore)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(classInt)   # for KDE and ML risk class intervals
library(svDialogs)
library(patchwork)

# functions
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 

```


# Introduction

This project "Geospatial Risk Predictions of Armed Robbery Incidents in Chicago" explores the application of predictive policing through geospatial risk modeling to forecast the likelihood of armed robbery incidents in Chicago. By utilizing historical crime data, socioeconomic indicators, and spatial risk factors such as abandoned cars, street light outages, and sanitation complaints, the model aims to identify high-risk areas for armed robbery incidents. The study employs Poisson regression to predict crime counts and compares the accuracy of spatial models against traditional methods like kernel density estimation. Emphasizing the importance of addressing bias and data selection issues, the project aims to improve the generalizability and fairness of crime prediction models across different neighborhoods, particularly considering racial and socioeconomic disparities. Through spatial cross-validation and the inclusion of spatial process features, this analysis seeks to enhance predictive accuracy and offer insights into better resource allocation for law enforcement.

The objective of this report is to develop a geospatial risk model for predicting armed robbery incidents in Chicago. Robbery data is often influenced by selection bias due to two key factors: not all robbery incidents are reported to law enforcement, and areas with historically high robbery rates, which receive increased police presence, tend to continue reporting higher incidents. The underlying hypothesis is that crime risk is driven by exposure to various geospatial risk and protective factors, such as signs of blight or the presence of recreation centers. As exposure to risk factors rises, so does the likelihood of crime. In addition to identifying these risk factors, the model incorporates spatial variables to improve prediction accuracy. Poisson regression is used in this analysis due to its suitability for modeling count data. 
1



# Crime Data Collection

All datasets for this study were sourced from [Chicago Data Portal](https://data.cityofchicago.org/). The Chicago Data Portal is an online platform that provides access to a wide range of municipal datasets. The portal includes an extensive record of crime incidents in Chicago, updated annually. For this analysis, crime data for 2021 was downloaded using the RSocrata package, which allows easy access to open data from platforms like Socrata, the software behind the Chicago Data Portal. 

This report specifically filter the armed robbery crime incidents in Chicago. The armed robbery include: KNIFE / CUTTING INSTRUMENT, HANDGUN, OTHER FIREARM, or OTHER DANGEROUS WEAPON. After filtering the crime data for armed robbery incidents, we focused on those involving the following weapon types: knife/cutting instrument, handgun, other firearms, or other dangerous weapons. These specific categories provide a clearer understanding of more serious and violent robbery offenses, as they pose a higher threat to victims and are more likely to occur in areas with distinct risk factors. By narrowing our analysis to armed robberies, we aim to explore the spatial distribution of these incidents and assess how environmental and social variables, such as abandoned buildings, street light outages, and sanitation complaints, correlate with higher occurrences of these violent crimes. This targeted approach enables us to develop a more precise predictive model for robbery risk in different areas of Chicago.

```{r load outcome variable, warning=FALSE, message=FALSE, results='hide'}
robbery21 <- read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2021/dwme-t96c")
saveRDS(robbery21, "robbery21.rds")
# Read from the locally saved file
robbery21 <- readRDS("robbery21.rds") %>% 
  filter(Primary.Type == "ROBBERY", grepl("ARMED", Description)) %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X), Y = as.numeric(Y)) %>% 
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>%
  distinct()

```

The map displays the spatial distribution of armed robbery incidents in Chicago for 2021, revealing a clear pattern of concentration in specific areas. The highest density of incidents occurs in the central and southern parts of the city, with particular clustering around the city's south side and certain western neighborhoods. These areas likely experience higher rates of violent crime due to a combination of socio-economic factors and environmental conditions. In contrast, the northern parts of the city, especially near the lakefront, show significantly fewer armed robberies. This spatial pattern suggests that certain regions may face more pronounced crime risks, possibly influenced by urban infrastructure, economic distress, or other factors that contribute to the likelihood of robbery incidents. Understanding these clusters can help inform targeted interventions and crime prevention strategies in high-risk areas.


```{r robbery locations, warning=FALSE, message=FALSE, fig.align = 'center'}

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey") +
  geom_sf(data = robbery21, colour="darkred", size=0.1, show.legend = "point") +
  labs(title= "Armed Robbery Incidents in Chicago 2021") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8)
        )

```

Below is the density map of armed robbery incidents in Chicago. The density map of armed robbery incidents in Chicago reveals two prominent hotspots. The first cluster is located in the central-western part of the city, while the second significant concentration is found slightly to the south of downtown. These areas experience the highest intensity of armed robberies, indicating localized zones with increased risk. The surrounding areas show a gradual decrease in density as we move away from these core hotspots, suggesting that armed robberies are not evenly distributed across the city but are instead concentrated in specific regions. The northern and far southern parts of the city exhibit much lower densities, indicating that armed robberies are relatively less common in these areas. This pattern highlights the need for focused crime prevention efforts in the identified high-risk zones.

```{r robbery density, fig.width=5, message=FALSE, warning=FALSE, fig.align = 'center'}
options(scipen=0)
ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "grey") +
  stat_density2d(data = data.frame(st_coordinates(robbery21)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 60, geom = 'polygon') +
  scale_fill_viridis(option = "turbo", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Armed Robbery") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```


Crime risk tends to fluctuate gradually across geographic space rather than conforming to predefined administrative boundaries. To effectively capture this continuous variation, point-level data can be aggregated into a network of evenly spaced grid cells. The code provided below generates such a fishnet for Chicago, offering a structured way to represent crime risk spatially. This grid-based approach is ideal for preparing the data for regression analysis, allowing for a smoother depiction of crime trends across the urban landscape.


```{r create fishnet, fig.width=5, fig.align = 'center'}
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

ggplot() +
  geom_sf(data=fishnet, color="black", fill="lightblue") +
  labs(title = "Fishnet of Chicago") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```

Once the fishnet was created, use it as the spatial framework to visualize the distribution of robbery incidents. From this, it becomes clear that the highest number of robberies are concentrated in a few grid cells located in northeast Chicago. This pattern highlights the emergence of a clustered spatial trend in robbery occurrences.

```{r join robbery to fishnet, fig.align = 'center'}

robbery21_net <- 
  dplyr::select(robbery21) %>% 
  mutate(countRobbery = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countRobbery = replace_na(countRobbery, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = robbery21_net, aes(fill = countRobbery), color = NA) +
  scale_fill_viridis(option = "turbo", name = "Robbery Counts") +
  labs(title = "Count of Robberies for the Fishnet") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```

The bar chart below illustrates the distribution of robbery incidents in Chicago for 2021, revealing a highly skewed pattern. A large number of grid cells have few or no robbery incidents, while only a small number of grids experience significantly higher robbery counts. This non-normal distribution suggests that Poisson regression is a suitable method for analyzing this data, as it can handle the overdispersion and count nature of the variable.

```{r distribution of robberies, message=FALSE, warning=FALSE, fig.align = 'center'}
ggplot(robbery21_net, aes(x = countRobbery)) +
  geom_histogram(fill = "steelblue", color = "#2c3e50", bins = 30) + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "plain", size = 7),
        axis.text.y = element_text(size = 7), 
        axis.title = element_text(size = 9), 
        plot.title = element_text(size = 14, face = "bold"), 
        plot.subtitle = element_text(size = 10), 
        panel.grid.major = element_line(color = "grey80"), 
        panel.grid.minor = element_blank(),
        panel.border = element_rect(colour = "grey60", fill = NA, size = 0.8)) + 
  labs(title = "Distribution of Robberies in Chicago 2021",
       subtitle = "Analysis of robbery incident distribution",
       caption = "Data: Chicago Data Portal Crimes 2021") +
  xlab("Robbery Incidents") +
  ylab("Count")

```

# Risk Factors

In this project, eight risk factors are included in the model to predict robberies: reports of abandoned cars, street lights out, graffiti, sanitation complaints, abandoned buildings, locations of liquor stores selling alcohol to-go, ShotSpotter detections, and banks. Neighborhoods with abandoned cars, street lights out, graffiti, and sanitation issues often indicate neglect, economic distress, and reduced community oversight, which can create an environment conducive to criminal activities. Abandoned buildings can serve as hideouts for criminals, further contributing to neighborhood disorder. Liquor stores are associated with alcohol-related vulnerabilities, while ShotSpotter detections point to areas with higher gun violence. Finally, banks, due to the availability of cash, may attract robbers seeking financial gain. These factors collectively highlight areas that may be more susceptible to robberies.

```{r load predictor variables, warning=FALSE, message=FALSE}
abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2020") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")
  
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")


shotSpotter <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Violence-Reduction-Shotspotter-Alerts/3h7q-7mdb") %>%  
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Shot_Spotter")

Banks <- 
  read.socrata("https://data.cityofchicago.org/Administration-Finance/Lending-Equity-Depository-Locations-2021/jgup-zt7k") %>%  
  dplyr::select(location) %>%
  na.omit()
Banks <- Banks %>%   
  mutate(X = as.numeric(sub("POINT \\((-?\\d+\\.\\d+) \\d+\\.\\d+\\)", "\\1", Banks$location)),
         Y = as.numeric(sub("POINT \\(-?\\d+\\.\\d+ (-?\\d+\\.\\d+)\\)", "\\1", Banks$location))) %>%
  dplyr::select(-location) %>% 
  filter(!is.na(X) & !is.na(Y)) %>% 
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Bank") 
  

```



```{r join predictor variables to fishnet, warning=FALSE, message=FALSE}

vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,
        liquorRetail, graffiti, sanitation, shotSpotter, Banks) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()

```

The figure illustrates the distinct spatial patterns of each risk factor across Chicago. Abandoned buildings are predominantly concentrated in the southern part of the city, while banks and liquor retail stores cluster around the central business district (CBD). Abandoned cars, sanitation complaints, and street lights out complaints are more evenly scattered throughout the city. Graffiti removal requests are mainly concentrated near the city center, whereas ShotSpotter detections tend to occur in areas outside the city center, highlighting the spatial variability in factors associated with potential robbery risks.

```{r visualize predictor variables, fig.height=8, fig.width=11, fig.align = 'center'}
vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

plot1 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Abandoned_Buildings"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Abandoned Buildings") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


plot2 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Abandoned_Cars"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Abandoned Cars") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot3 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Bank"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Banks") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot4 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Graffiti"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Graffiti") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


plot5 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Liquor_Retail"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Liquor Retail") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot6 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Sanitation"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Sanitation") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot7 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Shot_Spotter"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Shot Spotter") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


plot8 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Street_Lights_Out"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Street Light Out") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

wrap_plots(
  plot1, plot3, plot2, plot4,
  plot5, plot6, plot7, plot8,
  ncol = 4  
)


```

# Feature Engineering


## Nearest Distance

The first feature engineering approach we took is to calculate average nearest neighbor distance to hypothesize a smoother exposure relationship across space. Here, the nn_function is used. For each of our risk factors, average nearest distance is calculated from the centroid of each cell to its nearest three, abandoned cars, for example. 

```{r calculate nearest distance, warning=FALSE, message=FALSE}

st_c <- st_coordinates
st_coid <- st_centroid

vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
      Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
      Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
      Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
      Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
      Shot_Spotter.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(shotSpotter),3))

vars_net <-
  vars_net %>% mutate(Bank.nn = nn_function(st_c(st_coid(vars_net)), st_c(Banks),3))
 

```

The neighbor features provide a clearer picture of the spatial relationships between different risk factors.The southern and southeastern areas of Chicago appear to be most distant from many of these risks, particularly abandoned buildings, abandoned cars, and graffiti. Interestingly, the southwestern corner also shows some distance from risk factors like banks and liquor stores. One notable observation is that the northern part of the city is furthest from ShotSpotter locations, while still being relatively close to other risk factors such as sanitation issues and street lights out. The relative absence of ShotSpotters in this area raises questions about their role: are they an indicator of higher crime due to their concentration, or does their presence suggest increased policing, making neighborhoods safer? These observations highlight the need to examine whether ShotSpotters function as a crime detection tool or a proxy for neighborhood safety. These are important considerations for understanding the broader dynamics of crime and safety in Chicago.

```{r visualize nearest distance, fig.height=8, fig.width=11, fig.align = 'center'}

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)


p1.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Abandoned_Buildings.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Abandoned Buildings NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


p2.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Abandoned_Cars.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Abandoned Cars NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p3.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Bank.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Banks NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p4.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Graffiti.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Graffiti NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


p5.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Liquor_Retail.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Liquor Retail NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p6.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Sanitation.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Sanitation NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p7.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Shot_Spotter.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Shot Spotter NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


p8.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Street_Lights_Out.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Street Light Out NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

wrap_plots(
  p1.nn, p3.nn, p2.nn, p4.nn,
  p5.nn, p6.nn, p7.nn, p8.nn,
  ncol = 4  
)


```


## Local Spatial Autocorrelation

identified robbery hotspots in Chicago using Local Moran's I, which examines whether robbery counts at a location are clustered relative to neighboring cells. A spatial weights matrix was calculated using queen contiguity to define neighboring relationships. The results, visualized with Moran's I values, p-values, and significant hotspots, show high spatial autocorrelation in northeastern Chicago near the CBD. Areas with significant hotspots (p-values ≤ 0.05) align with patterns seen in the kernel density map, indicating clusters of higher robbery counts than expected by chance.




```{r calculate local moran I, message=FALSE, warning=FALSE}

final_net <-
  left_join(robbery21_net, st_drop_geometry(vars_net), by="uniqueID")
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

local_morans <- localmoran(final_net$countRobbery, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(Robbery_Count = countRobbery, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)
```


```{r visualize moran I, fig.height=8, fig.width=11, fig.align = 'center'}

moran1 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "Robbery_Count"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Robbery Count") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

moran2 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "Local_Morans_I"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Local Morans I") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

moran3 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "P_Value"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "P Value") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


moran4 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "Significant_Hotspots"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "turbo") + 
  labs(title = "Signigicant Hotspots") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

moran1 + moran2 + moran3 + moran4
```

## Singificant Robbery Clusters

The map illustrates the distance to the nearest robbery hotspots across Chicago, with darker colors indicating shorter distances. The central and southern parts of the city exhibit the shortest distances to robbery hotspots, as shown by the deep blue and black areas. This suggests that these regions are located closer to areas with a high concentration of robbery incidents. In contrast, the outer northern and southern parts of the city, represented by yellow, green, and red hues, are farther away from these hotspots, indicating lower proximity to concentrated robbery activities. This pattern highlights the spatial distribution of robbery risks in Chicago, with more concentrated robbery activity occurring in specific regions, particularly in the central and southern areas. The distance gradient provides insight into how robbery risks decrease as one moves away from these core hotspots.

```{r calculate significant robbery points, warning=FALSE, message=FALSE}

final_net <-
  final_net %>% 
  mutate(robbery.isSig = ifelse(localmoran(final_net$countRobbery, 
                             final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(robbery.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
                                          st_coordinates(st_centroid(
                                            filter(final_net, robbery.isSig == 1))), 1))
```


```{r visualize robbery NN Distance, fig.align = 'center'}

ggplot() +
      geom_sf(data = final_net, aes(fill=robbery.isSig.dist), colour=NA) +
      scale_fill_viridis(option = "turbo", name="NN Distance") +
      labs(title="Robbery NN Distance") +
      theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


```

## All Risk Factors

The figure shows scatter plots of the count of predictor variables and their nearest neighbor distances in relation to robbery incidents. All count-based features, such as abandoned buildings, cars, and sanitation complaints, are positively correlated with the number of robberies, indicating that areas with higher occurrences of these risk factors tend to experience more robberies. The strongest positive relationships are observed between robbery incidents and sanitation complaints, ShotSpotter detections, and liquor retail locations.

Conversely, all nearest neighbor distance features are negatively correlated with robbery incidents, meaning that as the distance to risk factors increases, the number of robberies decreases. The strongest negative correlations are seen with distance to significant robbery hotspots, abandoned buildings, and liquor retail stores. This suggests that proximity to these risk factors plays a significant role in robbery patterns.

When it comes to model building, this information is critical for feature selection. To avoid multicollinearity, either the count or the nearest neighbor distance should be selected for each risk factor. In this case, the model incorporates nearest neighbor distances, as they tend to show stronger and more consistent correlations with robbery incidents.


```{r correlation, fig.height=15, fig.width=8, fig.align = 'center'}

final_net_long <- final_net%>% 
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countRobbery, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") 

correlation.cor <-
  final_net_long %>%
    group_by(Type) %>%
    summarize(correlation = cor(Number, countRobbery, use = "complete.obs"))

final_net_long %>%
  ggplot(aes(x= Number, y = countRobbery)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1, size=3) +
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Abandoned_Buildings` = "Abandoned Buildings",
    `Abandoned_Buildings.nn` = "Distance to Abandoned Buildings",
    `Abandoned_Cars` = "Abandoned Cars",
    `Abandoned_Cars.nn` = "Distance to Abandoned Cars",
    `Bank` = "Bank",
    `Bank.nn` = "Distance to Bank",
    `Graffiti` = "Graffiti",
    `Graffiti.nn` = "Distance to Graffiti",
    `Liquor_Retail` = "Liquor Retail",
    `Liquor_Retail.nn` = "Distance to Liquor Retail",
    `robbery.isSig` = "Significant Robbery",
    `robbery.isSig.dist` = "Distance to Significant Robbery",
    `Sanitation` = "Sanitation",
    `Sanitation.nn` = "Distance to Sanitation",
    `Shot_Spotter` = "Shot Spotter", 
    `Shot_Spotter.nn` = "Distance to Shot Spotter",
    `Street_Lights_Out` = "Street Light Out",
    `Street_Lights_Out.nn` = "Distance to Broken Street Light")))  +
  labs(title = "Scatter Plots of Predictor Variables",
       caption = "Data: Chicago Data Portal") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```


# Poisson Regression

## Random K-Fold CV

The table below presents goodness of fit metrics for four regression models. The first set of models includes only risk factors 'reg.vars', while the second set 'reg.ss.vars' incorporates both risk factors and the Local Moran's I spatial process. To evaluate model performance, two types of cross-validation were conducted. The first method assigns a randomly generated cvID to each grid cell, allowing for random k-fold cross-validation to assess prediction accuracy.

```{r random k-fold regression, warning=FALSE, results='hide'}

reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
              "Shot_Spotter.nn", "Bank.nn")

reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
                 "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
                 "Shot_Spotter.nn", "Bank.nn", "robbery.isSig", "robbery.isSig.dist")

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRobbery",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countRobbery, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRobbery",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countRobbery, Prediction, geometry)

```

## Spatial LOGO CV

The second one hold out one neighborhood, train the model on the remaining n-1 areas, predict for the hold out, and record the goodness of fit. This is also called the spatial 'Leave-one-group-out' cross-validation (LOGO-CV), in which each neighborhood takes a turn as a hold-out. The result of each analysis is a sf layer with observed and predicted burglary counts.

```{r spatial logo cv regression, warning=FALSE, message=FALSE, results='hide'}

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countRobbery",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countRobbery, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countRobbery",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countRobbery, Prediction, geometry)
```

## Error Analysis

The code block below creates a long form reg.summary, that binds together observed/predicted counts and errors for each grid cell and for each regression. Residuals are calculated for each regression by subtracting the original robbery incidents from predicted incidents. 

```{r summarize regression}

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countRobbery,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countRobbery,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countRobbery,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countRobbery,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 
```

Comparing the error across different regressions, we see that adding spatial process features improve the model. In addition, the model appears slightly less robust for the spatial cross-validation, probably because LOGO-CV is such a conservative assumption. 

```{r summarize errors by fold and regression, message=FALSE, warning=FALSE}

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countRobbery, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>% ungroup()

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable(col.name=c("Regression", 'Mean Absolute Error','Standard Deviation MAE')) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
    row_spec(2, color = "black", background = "#FCFFA4") %>%
    row_spec(4, color = "black", background = "#FCFFA4") 


```

The small multiple maps below visualizes the distribution of errors by regression. For the random k-fold cv method, the errors are pretty evenly distributed across space. However, for spatial logo-cv method, higher errors occur at spatial hotspots. 

```{r visualize error,fig.height=8, fig.width=11, fig.align = 'center'}

error_by_reg_and_fold %>% 
  ggplot() +
  geom_sf(aes(fill=MAE), color="transparent") +
  scale_fill_viridis(option = "turbo", direction = -1) + 
  facet_wrap(~Regression, ncol = 4) +
   labs(title = "MAE by Fold and Regression",
       caption = "Data: Chicago Data Portal") +
  theme(legend.position = "bottom",
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8),
        strip.text = element_text(size = 7)) 

```

# Prediction Accuracy 

## Generalizability Test

Let’s now test for generalizability across racial-neighborhood context. To test this proposition, tidycensus is used to pull race data by census tract. Percentage minority population for each census tract is calculated by comparing the total number of Latinx, African American, and Asian population to the total population. Census tracts are split into two groups: majority white and minority, depending on whether the proportion of minority population in that census tract exceeds 60%. 


```{r download race data, message=FALSE, warning=FALSE, results='hide'}

census_api_key(dlgInput(
  "Enter a Census API Key", # ask for an api key
  Sys.getenv("CENSUS_API_KEY")
)$res,
overwrite = TRUE)
tracts21 <- 
  get_acs(geography = "tract", 
          variables = c("B02001_001E", # total population
            "B02001_002E", # white population
            "B02001_003E", # black population
            "B02001_005E", # asian population
            "B03002_012E"), 
          year=2021, state=17, county=031, 
          geometry=TRUE, output="wide") %>%
  st_transform('ESRI:102271') %>% 
  rename(TotalPop = B02001_001E, 
         Whites = B02001_002E,
         African_Americans = B02001_003E,
         Asians = B02001_005E,
         Latinx = B03002_012E) %>% 
  mutate(pctMinority = ifelse(TotalPop > 0, (African_Americans + Asians + Latinx ) / TotalPop, 0), 
         majority = ifelse(pctMinority > 0.5, "minority", "majority")) %>%   .[neighborhoods,]


```

We may see that Chicago remains a very segregated city and there's clear racial boundary between places where more than 60% of the population are minority and places that's majority white. In particular, northern and northeastern (CBD) part of Chicago are majority white while the remaining parts are all made up of racial minorities. 

```{r visualize race context, fig.align = 'center'}

ggplot() + geom_sf(data = na.omit(tracts21), aes(fill = majority), color = NA) +
    scale_fill_manual(values = c("#BB3754", "#56106E"), name="Race Context") +
    labs(title = "Race Context ",
         caption = "Data: American Community Survey 2021") +
  theme(
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.4)
        ) 
```

Error is calculated by subtracting the observed robbery count from the prediction. A positive difference represents an over-prediction. The least ideal result is a model that over-predicts risk in minority areas, and under-predicts in white areas. If reporting selection bias is an issue, such a model may unfairly allocate police resource disproportionately in black and brown communities. In our case, over-prediction of minorities is an issue for random k-fold cv regression, but not for logo-cv regressions. The latter two models under-predict minority neighborhoods and over-predict majority white neighborhood, making it looks like that this algorithms generalizes well with respect to race. 

```{r generalizability by race, message=FALSE, warning=FALSE}

joinrace <- st_centroid(reg.summary) %>% 
  st_intersection(tracts21 %>%dplyr:: select(pctMinority)) %>% 
  st_drop_geometry() %>% 
  group_by(cvID) %>% 
  summarize(meanMinor = mean(pctMinority))


reg.summary <- reg.summary %>% 
  left_join(joinrace, by = "cvID") 


reg.summary %>% 
  mutate(raceContext = ifelse(meanMinor > .6, "Minority", "Majority White")) %>% 
  st_drop_geometry() %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
    row_spec(2, color = "black", background = "#FCFFA4") %>%
    row_spec(4, color = "black", background = "#FCFFA4") 

```

## Kernel Desntiy Estimation

The utility of this algorithm is judged relative to an alternative police allocation method: kernel density estimation. Kernel density works by centering a smooth kernel, or curve, atop each crime point such that the curve is at its highest directly over the point and the lowest at the range of a circular search radius. The code block below creates a Kernel density map of robbery with a 1000 foot search radius. 
 
```{r calculate 2021 Kernal Density, fig.width=6, message=FALSE, warning=FALSE, fig.align = 'center'}

rob_ppp <- as.ppp(st_coordinates(robbery21), W = st_bbox(final_net))
rob_KD.1000 <- spatstat.explore::density.ppp(rob_ppp, 1000)
rob_KD.df <- data.frame(rasterToPoints(mask(raster(rob_KD.1000), as(neighborhoods, 'Spatial'))))


ggplot(data=rob_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_viridis(option = "turbo", name="Density") +
  labs(title = "Kernel Density of Robbery 1000ft Radii") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )
```

Next, a new goodness of fit indicator is created to illustrate whether the 2021 kernel density or risk predictions capture more of the 2022 robberies.

```{r download 2022 robery data, warning=FALSE, message=FALSE}

robbery22 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2022/9hwr-2zxp") %>% 
   filter(Primary.Type == "ROBBERY") %>% 
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

```

The kernel density estimation was classified into 5 risk categories and the count for the 2022 robberies were summarized into the same sf layer.  The purpose here is to see if the risk predictions capture more observed burglaries than the kernel density. If so, then the risk prediction model provides a more robust targeting tool for allocating police resources. 

```{r categorize KDE}

rob_KDE_sum <- as.data.frame(rob_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 

kde_breaks <- classIntervals(rob_KDE_sum$value, 
                             n = 5, "fisher")

rob_KDE_sf <- rob_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
    mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label, Risk_Category, robberyCount)

```

The same process was repeated for the risk prediction. Note that both predictions from the LOGO-CV with and without the spatial features is being used here.

```{r categorize random k fold}

ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction, 
                             n = 5, "fisher")
rob_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions Spatial",
         Risk_Category =classInt::findCols(ml_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
      mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label,Risk_Category, robberyCount)

ml_breaks_simple <- classIntervals(reg.ss.cv$Prediction, 
                             n = 5, "fisher")
rob_risk_sf_simple <-
  reg.ss.cv %>%
  mutate(label = "Risk Predictions Simple",
         Risk_Category =classInt::findCols(ml_breaks_simple),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
      mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label,Risk_Category, robberyCount)

```

## Prediction Comparison

Below a map is generated of the risk categories for all three model types. A strongly fit model should show that the highest risk category is uniquely targeted to places with a high density of burglary points. The map shows that our risk prediction models perform as good as the kernal density approach in capturing the distribution of robbery incidents. 

```{r prediction comparisons map, fig.align = 'center'}
rbind(rob_KDE_sf, rob_risk_sf, rob_risk_sf_simple) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    facet_wrap(~label, ) +
    scale_fill_viridis(option = "turbo", discrete = TRUE, name = "Risk Category") +
      geom_sf(data = sample_n(robbery22, 3000), size = .03, colour = "white") +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2021 robbery risk predictions; 2022 armed robberies") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )
```

An effective way to assess prediction accuracy is by using a histogram. Ideally, a well-fitting model should show that the highest-risk areas capture a larger proportion of 2022 robberies compared to the kernel density estimate. However, in this case, the kernel density model outperforms both risk prediction models in the highest-risk regions. Despite this, the risk prediction models excel in identifying armed robbery incidents in lower-risk areas that the kernel density model overlooked.





```{r prediction comparisons figure, warning=FALSE, message=FALSE, fig.align = 'center'}
rbind(rob_KDE_sf, rob_risk_sf, rob_risk_sf_simple) %>%
  st_drop_geometry() %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countRobbery = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Pcnt_of_test_set_crimes = countRobbery / sum(countRobbery)) %>%
    ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(option = "turbo", discrete = TRUE, name = "Model", direction = -1) +
      labs(title = "Risk Prediction vs. Kernel Density",
           y = "% of Test Set Robberies (per model)",
           x = "Risk Category") +
  theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
```

# Conclusions

In conclusion, our model requires significant improvement before it can be implemented effectively. Compared to traditional kernel density estimation, it struggled to capture armed robbery incidents in the highest-risk areas. One key issue is that many of our predictor variables, such as reports of abandoned cars, graffiti, and sanitation complaints, were outdated, last updated in 2019. Using these to predict robberies in 2021 and 2022 likely led to inaccurate results, especially since shifts in these risk factors (like graffiti removal) would alter crime patterns. Up-to-date datasets are crucial for more accurate predictions.

Additionally, the COVID-19 pandemic likely impacted armed robbery patterns, with economic distress, lockdowns, and reallocation of police resources all contributing to shifts in crime locations. Our model needs to account for such factors to ensure predictions remain relevant over time.

On a positive note, the model performed better in lower-risk areas, where the kernel density method fell short. However, relying on a single model may not capture the full complexity of armed robbery patterns. A combined approach using both kernel density estimation and risk prediction could improve accuracy. While we avoided overpredicting in majority non-white neighborhoods, potential selection bias remains a concern. Incorporating spatial features and using k-fold cross-validation helped improve the model, but further refinement is needed to ensure it generalizes well across different contexts and time periods.