Introduction

Bikeshare programs have become an essential part of urban transportation, offering a sustainable and flexible commuting option. Citi Bike, launched by Lyft in 2008 in New York City, has expanded to Jersey City to meet the needs of commuters traveling to Manhattan. As Citi Bike’s popularity grows, managing bike availability effectively becomes crucial, especially given the fluctuations in demand throughout the day and across different locations.

Maintaining a balanced distribution of bikes, known as “re-balancing,” is a complex challenge influenced by these spatial and temporal demand patterns. Companies like Uber and Lyft use proprietary algorithms to address these issues, but the details of their models are not publicly available. This report aims to develop a transparent, open-source approach to forecasting bike demand in Jersey City using Ordinary Least Squares (OLS) regression. By analyzing historical data, including weather and bike usage patterns, our goal is to create a simple yet effective model to anticipate demand spikes and guide efficient bike reallocation.

We will begin by loading and processing our datasets, followed by an exploratory analysis of usage trends and weather impacts. Next, we’ll create a space-time panel to examine correlations and finally develop and evaluate four predictive models, testing their performance and reliability.

Set Up

The data used in this analysis comes from NYC OpenData, which provides monthly records of Citi Bike usage. Even though Jersey City is in New Jersey, Citi Bike’s operations are largely centered in New York City, and thus the data is managed through NYC’s system. For this study, we focused on bikeshare trips in Jersey City during August 2024, capturing information about the starting and ending stations for each ride. Additionally, we acquired the census tract geometries for all of Hudson County, which encompasses Jersey City, using the tidycensus package.

nycbike <- read.csv("/Users/bin/Documents/MUSA/MUSA 5080/JC-202308-citibike-tripdata.csv")

njCensus <- get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2021, 
          state = "NJ", 
          geometry = TRUE, 
          county=c("Hudson"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

jcTracts <- njCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf()

The bikeshare dataset we’re working with contains spatial details, but it isn’t inherently formatted as spatial data. To conduct a spatial analysis, we first need to integrate it with the census geometry. Our focus is on the starting locations of rides, as our aim is to predict bike availability at stations when users initiate a trip. To prepare the data, we filter out records lacking latitude or longitude information and address errors where certain coordinates incorrectly place stations in the Hudson River. These inaccurate entries are removed to ensure data quality.

bike_census <- st_join(nycbike %>% 
                         filter(is.na(start_lat) == FALSE &
                                is.na(start_lng) == FALSE &
                                is.na(end_lat) == FALSE &
                                is.na(end_lng) == FALSE) %>%
                         st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
                       jcTracts %>%  st_transform(crs=4326),
                       join=st_intersects, left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lng = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  filter(!(ride_id %in% c("BEC9CC4D1007C753", "86087F431785EDE7", "A1AAB92631439883", "71938DA085242DCC", "4203C77668C47C75", "C92410CA2AEF3947", "6E02CBF36C90F244", "0259885253FD238E", "B39136B37AFB5FE9", "1EC9376D1559C51C", "CDDF16D3A37BE370"))) %>% 
  as.data.frame()

The following map shows the Locartion of ride starting points in Jersey City

ggplot()+
  geom_sf(data = jcTracts %>%
          st_transform(crs=4326), fill="white")+
  geom_point(data = bike_census, aes(x=start_lng, y = start_lat), 
            color = "blue", alpha = 1, size = 0.3) + 
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  labs(title = "Location of Rides Starting Points in Jersey City") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

We carry out essential feature engineering to derive temporal insights from the ride data. Specifically, we’ll create time intervals at both 60-minute and 15-minute increments based on each ride’s start time. Additionally, we’ll categorize these intervals into broader time periods, such as “Overnight,” “AM Rush,” “Mid-Day,” and “PM Rush,” and classify each day as either a weekday or weekend.

bike_census <- bike_census %>%
  mutate(
    interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
    interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
    week = week(interval60),
    dotw = wday(interval60),  # Removed label = TRUE
    time_of_day = case_when(
      hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
      hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
      hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
      hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"
    ),
    hour = hour(started_at),
    weekend = ifelse(dotw %in% c(1, 7), "Weekend", "Weekday")  # Use numeric values for days
  )

To better understand commuting patterns and rider behavior throughout the day, we generated line strings to represent the routes between the start and end points of rides. To simplify the visualization, we limited it to 200 sample rides. The visualization reveals that the majority of trips are confined to Jersey City, though some riders use Citi Bike to travel from Jersey City to Manhattan.

lines <- st_sfc(
  lapply(1:nrow(bike_census), function(i) {
    st_linestring(matrix(c(bike_census[i, "start_lng"], bike_census[i, "start_lat"],
                            bike_census[i, "end_lng"], bike_census[i, "end_lat"]), ncol = 2, byrow = TRUE))
  }))

lines_sf <- st_sf(geometry = lines)
lines_sf <- st_set_crs(lines_sf, 4326)

Route <- lines_sf %>% head(200)
mapview(Route, zcol = NULL, color = "#6E0E0A", linewidth = 0.001, alpha = 0.3) # only show 200

Exploratory Analysis

In this section, we delve into some further exploratory analysis of our data set.

Temporal Analysis

We began our analysis by examining the temporal patterns of bikeshare usage in Jersey City throughout August 2023, focusing on the distribution of trips by the hour. Our visualization reveals distinct fluctuations, with demand rising and falling in a predictable cycle each day. During weekdays, we observe pronounced peaks that align with typical rush hours—mornings and late afternoons—when people are commuting to and from work. These busy periods are followed by noticeable lulls during the night and early morning hours, as well as during weekends, when overall usage tends to decrease. The pattern is characterized by a series of five consecutive high-demand peaks that correspond to weekdays, consistently followed by two lower peaks over the weekend, highlighting the influence of workweek routines on bikeshare activity. This cyclical behavior underscores the strong link between bikeshare demand and daily commuting schedules.

ggplot(bike_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n), color= "blue")+
  labs(title="Bike Share Trips Per Hour in Jersey City, Aug 2023",
       x="Date", 
       y="Number of trips") + 
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

Next, we analyzed how trips were distributed across different stations on an hourly basis throughout August. The resulting visualization revealed a highly skewed distribution: the vast majority of stations recorded either no rides or just a few trips per hour. However, a select few stations stood out, experiencing significantly higher demand. These high-traffic stations sometimes saw as many as 40 to 50 rides within a single hour on certain days. This uneven distribution highlights the varying levels of bikeshare usage across Jersey City, with certain locations acting as major hubs of activity while many others experience minimal ridership.

ggplot(bike_census %>%
         group_by(interval60, start_station_name) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5, fill="lightblue")+
  labs(title="Bike Share Trips Per Hour by Station in Jersey City, Aug 2023",
       x="Trip Counts", 
       y="Number of Stations") + 
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

We further explored the average number of trips per station at different times of the day. The analysis showed that PM rush hours consistently had the highest average trip counts, with several stations averaging more than 10 trips, indicating substantial demand at these locations. Midday came next in terms of activity levels, where some stations recorded averages of over 5 trips. Surprisingly, there was also considerable demand overnight, as a few stations averaged more than 5 rides even during late-night hours. This unexpected overnight activity suggests that bikeshare usage extends beyond traditional commuting patterns, potentially serving other purposes such as late-night leisure or early-morning travel.

bike_census %>%
        group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips, fill=time_of_day), binwidth = 1)+
  scale_fill_manual(values = palette4, name="Time of Day") + 
  labs(title="Mean Number of Hourly Trips Per Station",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

We then examined the total number of trips throughout August, breaking them down by day of the week. Two key patterns emerged from this visualization. First, the number of trips fluctuates significantly over the course of a day, with the lowest activity observed in the early morning hours and the highest in the late afternoon, corresponding to evening commute times. Second, the data shows that trip volumes are generally higher on Tuesdays, Wednesdays, and Thursdays, suggesting that midweek days experience the greatest demand for bikeshare services, likely driven by regular work-related travel.

ggplot(bike_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 3, lwd = 0.8)+
  scale_color_manual(values = c("#124E78", "#819FA1", "#F0F0C9", "#F2BB05", "#E58507", "#D74E09", "#6E0E0A"), name = "Day") + 
  labs(title="Bike Share Trips in Jersey City by Day of the Week, Aug 2023",
       x="Hour", 
       y="Trip Counts") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
## Warning: The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

We created a similar visualization but this time compared trip volumes between weekdays and weekends. The differences are striking: bikeshare usage is noticeably higher on weekdays compared to weekends. Additionally, on weekends, trips start to pick up much later in the day, reflecting the absence of the typical weekday morning rush. This delay suggests that without the pressure of early work commutes, people are more likely to use bikeshare services later in the day for leisure or non-work-related activities, highlighting the influence of daily routines on bikeshare demand.

ggplot(bike_census)+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 3, lwd = 1.2)+
  scale_color_manual(values=palette2) + 
  labs(color = "Time of Week") +
  labs(title="Bike Share Trips in Jersey City - Weekend vs Weekday, Aug 2023",
       x="Hour", 
       y="Trip Counts") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

# Assuming `bike_census` has a column 'Origin.Tract' that maps each bike trip to a census tract
# and `jcTracts` is already loaded and transformed as necessary

# Aggregate bike trip data by census tract
bike_trips_by_tract <- bike_census %>%
  group_by(Origin.Tract) %>%
  summarise(Total_Trips = n(), .groups = 'drop') %>%
  ungroup()

# Join this aggregated data to the jcTracts GeoDataFrame
jcTracts <- st_transform(jcTracts, crs = 4326)  # Make sure CRS matches, if not already
jcTracts_data <- jcTracts %>%
  left_join(bike_trips_by_tract, by = c("GEOID" = "Origin.Tract"))

# Calculate the bounding box for centering
bbox <- st_bbox(jcTracts_data)

# Plot the choropleth map with adjusted limits
ggplot(data = jcTracts_data) +
  geom_sf(aes(fill = Total_Trips), color = "white", size = 0.1) +
  scale_fill_viridis_c(option = "D", name = "Total Bike Trips", na.value = "white") +
  coord_sf(xlim = c(bbox["xmin"], bbox["xmax"]), ylim = c(bbox["ymin"], bbox["ymax"]), expand = FALSE) +  # Set limits
  labs(title = "Bike Share Trips per Census Tract in Jersey City") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(size = 12))

Expanding on our analysis, we grouped the data by station and time of day to explore trip volumes during different periods on weekdays versus weekends. To keep the visualization clear, we focused on the top 50 station-time occurrences. Each bar in the chart represents a specific combination of station, time of day, and whether it’s a weekday or weekend. The results clearly show that PM rush hours generate the highest bikeshare demand, particularly at a few key stations. Notably, weekend rides are scarce in the top 50 list, appearing only three times. A closer look reveals that Grove St PATH station had a total of 1,117 rides during PM rush on a weekday and 801 rides overnight on a weekday in August. Hoboken Terminal - River St & Hudson Pl recorded 864 rides during weekday PM rush. South Waterfront Walkway - Sinatra Dr & 1 St had 319 rides overnight on a weekend and 731 rides during PM rush on a weekday. These details underscore how certain stations experience significant spikes in demand, especially during weekday commute hours.

to_plot <- bike_census%>% # only show 100 stations
  group_by(start_station_id, start_lat, start_lng, weekend, time_of_day, start_station_name) %>%
  tally() %>% 
  arrange(-n) %>% 
  head(50) 
to_plot$ID <- seq_along(to_plot$start_station_id)

to_plot %>% 
  arrange(-n) %>%
  head(50) %>% 
  ggplot(aes(x = reorder(ID, -n), n, fill = time_of_day, color = weekend)) +
  scale_fill_manual(values = palette4, name="Time of Day") + 
  guides(color="none") + 
  geom_bar(stat = "identity", position="stack") +
  scale_color_manual(values = c("transparent", "black")) +
  labs(title="Top 50 Occurences of Rides By Time")+
  ylab("Number of Rides") +
  xlab("") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

Spatial Analysis

Beyond our temporal analysis, we created maps to visualize the locations of stations with the highest bikeshare demand, broken down by time of day and day of the week. These proportional maps reveal that the greatest demand occurs during PM rush hours and overnight on weekdays, while AM rush hours see the lowest activity. The stations with the highest usage are predominantly clustered along the Hudson River. This pattern could be due to two factors: a higher concentration of stations in this area or strategic placement by Citi Bike to accommodate commuters traveling from Jersey City to Manhattan for work or school. The proximity to major transit connections likely drives this increased demand.

ggplot()+
  geom_sf(data = jcTracts %>%
          st_transform(crs=4326), fill = "white")+
  geom_point(data = to_plot,
             aes(x=start_lng, y = start_lat, size=n), color = alpha("#6E0E0A", 0.7)) + 
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  scale_size(range = c(0.5, 8)) + 
  labs(size = "Number of Rides") +
  facet_grid(weekend ~ time_of_day) + 
    labs(title="Locations of Stations with Greatest Bikeshare Demand by Time")+ 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

We analyzed all the starting stations collectively to assess spatial autocorrelation, aiming to determine if stations with high demand tend to be located near other high-demand stations. If such a pattern exists, it would allow us to predict demand at a specific station based on the activity levels at neighboring stations. To investigate this, we selected the top 80 stations with the highest number of rides in August 2023 and calculated Moran’s I using 999 random Monte Carlo simulations. The results indicated a highly significant Moran’s I, confirming that there is indeed spatial autocorrelation among nearby bikeshare stations, meaning demand is not randomly distributed but clustered in specific areas.

moran_data <- bike_census %>%
  group_by(start_station_id, start_lat, start_lng, start_station_name) %>%
  tally() %>% 
  arrange(-n) %>% 
  head(80) %>% 
  st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326)


coords <-  st_coordinates(moran_data) 
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")

moranTest <- moran.mc(moran_data$n, 
                      spatialWeights, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01, fill = "#124E78") +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#6E0E0A" ,lwd=1) +
  scale_x_continuous(limits = c(-0.5, 0.5)) +
  labs(title = "Observed and Permuted Moran's I for Stations with Most Rides",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10))

The map below illustrates the clustering of stations with higher bikeshare demand. Specifically, two main clusters emerge, and within each of these clusters, a few stations experience exceptionally high numbers of rides. These high-demand stations are typically surrounded by others with slightly lower, but still significant, levels of activity, highlighting a clear pattern of concentrated bikeshare usage.

  ggplot() +
  geom_sf(data = jcTracts, fill = "white") +
  geom_sf(data = moran_data,
          aes(size = n, color = n)) + 
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) + 
  scale_size(
    range = c(1, 6),
    guide = guide_legend(
      direction = "horizontal",
      nrow = 1,
      label.position = "bottom")) +
  scale_color_gradientn(colours = hcl.colors(5, "RdBu",rev = TRUE, alpha = 0.9), 
                        guide = guide_legend(direction = "horizontal", nrow = 1)) +
  labs(title = "Stations with Most Rides",
       subtitle =  "Dot size proportional to rides") + 
  theme(legend.position = "bottom") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

This map below combines the spatial and temporal characteristics of bike rides at 15min interval into a choropleth map.

bike_aggregated <- bike_census %>%
  group_by(interval15, Origin.Tract) %>%
  summarize(Trip_Count = n(), .groups = 'drop')

# Joining with spatial data
choropleth_data <- bike_aggregated %>%
  left_join(jcTracts, by = c("Origin.Tract" = "GEOID")) %>%
  st_sf()

# Creating the choropleth map with the same scale
choropleth_bike_map <- ggplot() +
  geom_sf(data = jcTracts, fill = "white", color = "grey", alpha = 0.5) + # Base map
  geom_sf(data = choropleth_data, aes(fill = Trip_Count), color = NA) + # Choropleth layer
  scale_fill_gradientn(
    colours = hcl.colors(5, "YlOrRd", rev = TRUE),
    name = "Number of Trips",
    na.value = "transparent"
  ) +
  # Set the same scale using xlim and ylim
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  ylim(min(bike_census$start_lat), max(bike_census$start_lat)) +
  labs(
    title = "Bikeshare Trip Density by 15-Minute Intervals",
    subtitle = "Number of bike rides aggregated by origin tract and time interval"
  ) +
  theme(
    legend.position = "bottom",
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    plot.subtitle = element_text(size = 9, face = "italic"),
    plot.title = element_text(size = 12, face = "bold"),
    panel.background = element_blank(),
    panel.border = element_rect(colour = "grey", fill = NA, linewidth = 0.8)
  )

# Display the updated map
print(choropleth_bike_map)

Weather Analysis

In addition to space and time, weather plays a crucial role in influencing bikeshare demand. We expect that demand typically drops on rainy, windy, or extremely hot days. To incorporate this factor into our analysis, we collected hourly weather data from Newark Airport using the riem_measures function, covering the period from August 1 to September 1, 2023. Since Newark Airport is close enough to Jersey City, it serves as a reliable source for local weather conditions. We then created a weather panel summarizing key variables such as temperature, precipitation, and wind speed for each hour.

weather.Panel <- 
  riem_measures(station = "EWR", date_start = "2023-08-01", date_end = "2023-09-01") %>%
  dplyr::select(valid, tmpf, p01i, sknt) %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  mutate(
    week = week(interval60),
    dotw = wday(interval60)  # Remove label = TRUE
  ) %>%
  group_by(interval60) %>%
  summarize(
    Temperature = max(tmpf),
    Precipitation = sum(p01i),
    Wind_Speed = max(sknt)
  ) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

Examining the temperature, precipitation, and wind data collected for August, we observe a few notable weather patterns. There were several rainy days scattered throughout the month, while wind speeds remained generally stable, with only a few instances of particularly strong winds. Temperature showed daily fluctuations, with higher readings during the daytime and cooler temperatures at night, reflecting a typical summer cycle.

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line(color="#124E78") + 
    labs(title="Percipitation", x="Hour", y="Perecipitation") + 
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank()), 
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line(color="#124E78") + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank()),  
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line(color="#124E78") + 
    labs(title="Temperature", x="Hour", y="Temperature") +theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank()), 
  top="Weather Data ")

Space-Time Panel

We proceeded by constructing a study panel that captures unique combinations of space and time. This means each row in the panel corresponds to a ride occurring at a specific station during a specific hour, allowing us to analyze bikeshare activity across both temporal and spatial dimensions.

# Check if 'interval60' exists and is not entirely NULL
if("interval60" %in% names(bike_census) && !all(is.na(bike_census$interval60))) {
  print("Column 'interval60' exists and has data.")
  print(head(bike_census$interval60))
} else {
  print("Column 'interval60' does not exist or is all NA.")
}
## [1] "Column 'interval60' exists and has data."
## [1] "2023-08-07 19:00:00 UTC" "2023-08-01 13:00:00 UTC"
## [3] "2023-08-15 17:00:00 UTC" "2023-08-01 12:00:00 UTC"
## [5] "2023-08-08 12:00:00 UTC" "2023-08-02 07:00:00 UTC"
# If 'interval60' needs to be created or recalculated:
if(!"interval60" %in% names(bike_census) || all(is.na(bike_census$interval60))) {
  print("Recalculating 'interval60' from existing datetime column, assuming 'started_at' exists.")
  
  # Assuming 'started_at' is the column from which 'interval60' should be derived:
  if("started_at" %in% names(bike_census)) {
    bike_census$interval60 <- as.POSIXct(bike_census$started_at, format = "%Y-%m-%d %H:%M:%S") # Customize format as necessary
    bike_census$interval60 <- floor_date(bike_census$interval60, "hour") # Using lubridate to round down to the nearest hour
    print("New 'interval60' calculated and added to 'bike_census'.")
  } else {
    print("'started_at' column also missing; please check data source.")
  }
}

# Check and recreate the data frame for joining
if("interval60" %in% names(bike_census) && "start_station_id" %in% names(bike_census)) {
  combined_df <- expand.grid(interval60 = unique(bike_census$interval60),
                            start_station_id = unique(bike_census$start_station_id))
  print("Combined DataFrame created successfully. Displaying structure:")
  print(str(combined_df))
} else {
  print("One or both key columns are still missing or invalid.")
}
## [1] "Combined DataFrame created successfully. Displaying structure:"
## 'data.frame':    78120 obs. of  2 variables:
##  $ interval60      : POSIXct, format: "2023-08-07 19:00:00" "2023-08-01 13:00:00" ...
##  $ start_station_id: Factor w/ 105 levels "HB302","JC059",..: 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "out.attrs")=List of 2
##   ..$ dim     : Named int [1:2] 744 105
##   .. ..- attr(*, "names")= chr [1:2] "interval60" "start_station_id"
##   ..$ dimnames:List of 2
##   .. ..$ interval60      : chr [1:744] "interval60=2023-08-07 19:00:00" "interval60=2023-08-01 13:00:00" "interval60=2023-08-15 17:00:00" "interval60=2023-08-01 12:00:00" ...
##   .. ..$ start_station_id: chr [1:105] "start_station_id=HB302" "start_station_id=JC059" "start_station_id=JC105" "start_station_id=JC063" ...
## NULL
study.panel <- 
  expand.grid(interval60=unique(bike_census$interval60), 
              start_station_id = unique(bike_census$start_station_id)) %>%
  left_join(., bike_census %>%
              select(start_station_id, start_station_name, Origin.Tract, start_lng, start_lat)%>%
              distinct() %>%
              group_by(start_station_id) %>%
              slice(1))

We then enhanced this panel by adding more detailed information. This involved counting the number of rides at each station for each hour, incorporating weather data, merging in census information, and using the lubridate package to derive time-related variables such as the day of the week and hour of the day.

bike.panel <- 
  bike_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, start_lng, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm = TRUE), .groups = 'drop') %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(!is.na(start_station_id)) %>%
  mutate(
    week = week(interval60),
    dotw = wday(interval60)  # Removed label = TRUE
  ) %>%
  filter(!is.na(Origin.Tract)) %>% 
  left_join(
    njCensus %>%
      as.data.frame() %>%
      select(-geometry),
    by = c("Origin.Tract" = "GEOID")
  )

Weather Correlation

With our panel data ready, we conducted further exploratory analysis to prepare for our regression modeling. Our first focus was on understanding the relationship between temperature and the number of bike rides in August. The analysis revealed a strong, positive, and linear correlation, indicating that bikeshare demand tends to increase as temperatures rise throughout the month.

bike.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point(size = .5, color = "black") + geom_smooth(method = "lm", se= FALSE, color="darkred") +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count As a Fuction of Temperature by Week",
         x="Temperature", y="Mean Trip Count") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

Next, we examined the correlation between wind speed and the number of bike rides in August. We observed a generally positive linear relationship, though the strength of this correlation varied across different weeks. This suggests that bikeshare demand tends to increase on windy days, but the impact of wind speed on ride activity is not uniform throughout the month.

bike.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Wind = first(Wind_Speed)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Wind, Trip_Count)) + 
    geom_point(size = .5, color = "black") + geom_smooth(method = "lm", se= FALSE, color="darkred") +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count As a Fuction of Wind by Week",
         x="Wind Speed", y="Mean Trip Count") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)) + 
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

Serial Correlation

If the number of bike trips shows serial (temporal) correlation, incorporating time lag features into our model could significantly enhance its predictive power. The underlying concept is that bikeshare demand at any given hour is likely to be influenced by demand patterns immediately preceding or following that hour. For example, if we know there was a high number of trips during one hour, we can reasonably predict that nearby hours will also have elevated demand. To capture these temporal dependencies, we computed six different lagged variables, each representing the number of trips one, two, three, four, twelve hours earlier, and one day earlier. These lag features allow us to account for short-term fluctuations as well as daily patterns, providing a more comprehensive framework for modeling bikeshare activity. By leveraging these time-based correlations, our model can better anticipate shifts in demand, improving accuracy and reliability.

bike.panel <- 
  bike.panel %>% 
  arrange(start_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
   mutate(day = yday(interval60))

We visualized the relationships between the number of trips and the various lagged hours to better understand these temporal correlations. The analysis revealed some interesting patterns. There is a strong, linear, and negative relationship between the number of trips and the lagged variable representing 12 hours prior. For instance, the number of rides recorded at 5 PM on a given day tends to be negatively correlated with the number of rides at 5 AM on the same day, likely reflecting contrasting peak and off-peak usage times.

Conversely, we observed a strong, positive, and linear correlation between the number of trips at a given hour and both the immediate next hour (laghour) and the same hour on the following day (lag1day). This indicates that bikeshare demand is highly predictable from one hour to the next and tends to follow consistent daily patterns. However, the strength of the correlation decreases for the lag variables representing two and three hours prior. For example, the number of rides at 4 PM is more strongly correlated with the number of rides at 5 PM than with rides at 6 or 7 PM. This gradual weakening of correlation over time highlights how bikeshare demand is most heavily influenced by recent past activity, making these time lag features valuable for modeling purposes.

as.data.frame(bike.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
  ggplot(aes(x = Value , y = Trip_Count)) +
    geom_point(size = .5, color = "black") + geom_smooth(method = "lm", se= FALSE, color="darkred") +
  facet_wrap(~Variable, ncol = 6)+
  labs(x = "Variable", y = "Correlation", title = "Correlation Between Serial Lag Trips and Trip Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

Spatial Correlation

We previously established that spatial autocorrelation is present in our data, indicating that the number of trips at one station during a specific time is related to the number of trips at nearby stations. In other words, knowing the demand at one station can help us estimate demand at neighboring stations. To leverage this spatial relationship, we created a new dataframe called lag_result.

We then implemented a nested loop to systematically calculate spatial lags for each hour in August 2023. For each hour, we identified the five nearest neighboring stations for every station and computed the average number of trips across these neighbors. This provided a measure of spatial influence, capturing how demand at one station is affected by activity at surrounding stations. Finally, we integrated these spatial lag values back into our panel dataset, enhancing our model’s ability to predict bikeshare demand by incorporating both spatial and temporal dependencies.

library(sf)
library(spdep)
library(dplyr)
library(parallel)

# Step 1: Filter data once and convert to sf object
bike.panel_sf <- bike.panel %>%
  st_as_sf(coords = c("start_lng", "start_lat"), crs = 4326)

# Step 2: Define a function to calculate spatial lag for a given day and hour
calculate_spatial_lag <- function(d, h, bike_data) {
  # Load required libraries within the function
  library(dplyr)
  library(sf)
  library(spdep)
  library(lubridate)
  
  # Filter data for the specified day and hour
  spacelag <- bike_data %>%
    mutate(hour = hour(interval60), day = yday(interval60)) %>%
    filter(day == d & hour == h)
  
  if (nrow(spacelag) == 0) return(NULL)  # Skip if no data
  
  # Compute spatial neighbors and weights
  coords <- st_coordinates(spacelag)
  neighborList <- knn2nb(knearneigh(coords, k = 5))
  spatialWeights <- nb2listw(neighborList, style = "W")
  
  # Calculate spatial lag
  spacelag <- spacelag %>%
    mutate(lagTrip = lag.listw(spatialWeights, Trip_Count))
  
  return(spacelag)
}

# Step 3: Use parallel processing to speed up the loop
cl <- makeCluster(detectCores() - 1)  # Create a cluster using available cores
clusterExport(cl, c("bike.panel_sf", "calculate_spatial_lag"))  # Export data and function
clusterEvalQ(cl, {  # Load required libraries in each worker
  library(dplyr)
  library(sf)
  library(spdep)
  library(lubridate)
})
## [[1]]
##  [1] "lubridate" "spdep"     "spData"    "sf"        "dplyr"     "stats"    
##  [7] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "lubridate" "spdep"     "spData"    "sf"        "dplyr"     "stats"    
##  [7] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "lubridate" "spdep"     "spData"    "sf"        "dplyr"     "stats"    
##  [7] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"
# Run the loop in parallel
lag_result_list <- parLapply(cl, 213:243, function(d) {
  lapply(0:23, function(h) calculate_spatial_lag(d, h, bike.panel_sf))
})

stopCluster(cl)  # Stop the cluster

# Combine the results into a single data frame
lag_result <- do.call(rbind, unlist(lag_result_list, recursive = FALSE)) %>%
  mutate(start_lng = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2))) %>%
  st_drop_geometry()

# Final result
bike.panel <- lag_result

The figure below illustrates a consistent, strong, and positive linear relationship between the number of trips at a station and the number of trips at nearby stations throughout all weeks in August. This confirms the presence of significant spatial dependence, suggesting that stations in close proximity tend to experience similar levels of demand. Given this finding, we will incorporate this spatial lag variable into our regression analysis to improve the model’s predictive accuracy by accounting for the influence of nearby station activity.

bike.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Lag_Count = mean(lagTrip)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Lag_Count, Trip_Count)) + 
    geom_point(size = 1, color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
    facet_wrap(~week, ncol=8) + 
    labs(title="Correlation Between Spatial Lag Trips and Trip Count",
         x="Lag Trip Count", y="Mean Trip Count") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

OLS Regression

We divided our five weeks of data into a training set consisting of three weeks and a testing set comprising the remaining two weeks. Using the training data, we estimated four distinct linear regression models, each incorporating different fixed effects:

  • reg1 captures only temporal factors, including hour-of-day fixed effects, day of the week, wind speed, and temperature.
  • reg2 emphasizes spatial components, using station-specific effects and weather conditions.
  • reg3 combines both spatial and temporal elements to account for interactions between space and time.
  • reg4 extends the third model by adding both time and spatial lag features, capturing temporal and spatial dependencies to enhance predictive accuracy.
bike.Train <- filter(bike.panel, week <= 33)
bike.Test <- filter(bike.panel, week > 33)

# Time
reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature + Wind_Speed, data=bike.Train)

# Space
reg2 <- 
  lm(Trip_Count ~  start_station_name + Temperature + Wind_Speed,  data=bike.Train)

# Time and Space
reg3 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Wind_Speed, 
     data=bike.Train)

# Time and Space and TimeLag and SpaceLag
reg4 <- 
  lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Wind_Speed +
                   lagHour + lag2Hours + lag12Hours + lag1day + lagTrip, 
     data=bike.Train)

Making Predictions

We utilized the purrr package to efficiently iterate over a set of nested data frames, allowing us to compare different model specifications. To prepare for testing, we organized our data by week, nesting it into separate tibbles. We then created a simple prediction function that takes a tibble (dat) and a regression model (fit) as inputs and generates predictions, returning them as pred. This approach streamlines the process of applying our models across multiple subsets of data, making it easy to evaluate and compare their performance.

bike.Test.weekNest <- 
  bike.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

The nested data format enables us to iterate through each model for every week and compute summary statistics efficiently. In the code block that follows, we generate week_predictions for each week within the testing set. Using the map function from purrr, we apply the model_pred function to each nested tibble. For instance, in the first line of the mutate operation, a new column called Time_FE is created, which contains predictions from reg1, the model that includes time fixed effects. These predictions are produced by mapping model_pred to each data subset, with fit set to the reg1 model. This method allows us to easily apply multiple models and gather prediction results for comparison.

week_predictions <- 
  bike.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_Lags = map(.x = data, fit = reg4, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

Error Analysis

When comparing the mean absolute errors (MAE) across our four models, we observe that Model 1, which only incorporates time effects, has the highest error rate. Models 2 and 3, which account for spatial effects and the combination of space and time, show improved accuracy. However, the greatest enhancement in predictive performance comes from Model 4, which includes both space and time lag features, significantly reducing the error. To ensure the reliability of Model 4, we also performed a variance inflation factor (VIF) analysis to confirm that multicollinearity was not an issue among the predictors.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette4) +
    labs(title = "Mean Absolute Errors by Model Specification and Week") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

The figure below provides a visual comparison of the accuracy of each model by plotting the predicted versus observed number of trips. Among the four models, Model 4 clearly demonstrates superior predictive performance, closely aligning with the actual bikeshare demand. This highlights the effectiveness of incorporating both spatial and temporal lag features, making Model 4 the most robust and reliable in forecasting trip counts.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      scale_color_manual(values=palette2)+
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed Bike Share Time Series", subtitle = "Jersey City: a test set of 2 weeks",  x = "Hour", y= "Station Trips") + 
   theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

To evaluate whether our predictions generalize well across different locations and times, we mapped the mean absolute error (MAE) of Model 4 across various stations. The results show that the model struggles to accurately predict the number of rides at stations located along the Hudson River, which tend to experience high bikeshare demand on both weekdays and weekends. However, for stations with lower ridership throughout August, the model performs significantly better, demonstrating reasonable predictive accuracy in those areas.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_latitude = map(data, pull, start_lat), 
           start_longitude = map(data, pull, start_lng)) %>%
    select(interval60, start_station_id, start_longitude, start_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_Lags") %>%
  group_by(start_station_id, start_longitude, start_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = jcTracts,  fill = "white")+
  geom_point(aes(x = start_longitude, y = start_latitude, color = MAE), 
             fill = "transparent")+
  scale_color_continuous(low = "#124E78", high = "#6E0E0A", name= "MAE")+
 ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  labs(title="Mean Abs Error, Test Set, Model 4") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

When examining Model 4’s performance across both space and time, the figure demonstrates that the model maintains consistent predictions throughout different periods. Stations identified as having high demand on weekends are similarly predicted to have high demand during weekdays. However, the errors persist over time, indicating that stations with historically high demand also exhibit greater variability in ride counts, making them more challenging to predict with our current set of features. In fact, our model only explains 50.12% of the variability, which is a limitation. The concern is that if these larger errors remain unaddressed for high-demand stations, we risk failing to adequately meet bikeshare users’ needs at these crucial locations.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_latitude = map(data, pull, start_lat), 
           start_longitude = map(data, pull, start_lng),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station_id, start_longitude, 
           start_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_Lags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_longitude, start_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = jcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_longitude, y = start_latitude, color = MAE), 
             fill = "transparent", size = 0.5)+
  scale_color_continuous(low = "#124E78", high = "#6E0E0A", name= "MAE")+
 ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors by Time, Test Set, Model 4") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

Cross-Validation

Lastly, we conducted 100 cross-validation tests using Model 4 on data spanning all five weeks. The mean absolute error (MAE) was 0.44, indicating that while this model is our most effective solution so far, there is still a notable level of prediction error. To improve the model further, it would be valuable to cross-validate against socio-economic indicators. This approach could reveal whether bike demand at stations in specific neighborhoods is being disproportionately under- or over-predicted. However, our current model does not incorporate socioeconomic variables as predictors, and the dynamics in Jersey City—such as the continued development of new stations—may also impact demand patterns.

fitControl <- trainControl(method = "cv", number = 100)

reg.cv <- train(Trip_Count ~ start_station_name +  hour(interval60) + dotw + Temperature + Wind_Speed + lagHour + lag2Hours + lag12Hours + lag1day + lagTrip, data=bike.panel, method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv$resample %>% 
  summarise(MAE = mean(reg.cv$resample[,3]),
            sd(reg.cv$resample[,3])
) %>%
  kbl(col.name=c('Mean Absolute Error','Standard Deviation of MAE')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Mean Absolute Error Standard Deviation of MAE
0.4418984 0.029647

Conclusion

In summary, this report highlights the urgent need for efficient bikeshare management strategies in Jersey City and presents a machine learning approach to address this challenge. Our analysis demonstrates that bikeshare demand is influenced by a combination of spatial, temporal, and weather-related factors. Specifically, the model shows that forecasting demand is more accurate when considering the number of trips in the preceding and following hours, the same hour on the previous day, and nearby station activity. We also discovered that demand peaks during windy and hot days in August, with the highest usage occurring during weekday PM rush hours and, to a lesser extent, overnight. Weekends see a significant spike in demand during the PM rush as well. Notably, Citi Bike is heavily used for commuting to New York City, making stations along the Hudson River, especially near ferry terminals, the most active.

We further explored specific high-demand stations in detail. Grove St PATH station recorded an impressive 1,117 rides during the weekday PM rush and 801 rides overnight in August, reflecting its strategic location in downtown Jersey City. Hoboken Terminal - River St & Hudson Pl had 864 rides during weekday PM rush hours, likely due to its proximity to major transit connections. South Waterfront Walkway - Sinatra Dr & 1 St station saw 319 rides overnight on a weekend and 731 during the weekday PM rush, benefiting from its location near the NJ Transit Terminal and the popular Pier A Park. These findings suggest that stations experience higher demand because they serve key areas such as employment hubs, transit nodes for commuting, and recreational spaces along the waterfront. To address this, local authorities should consider rebalancing bike supplies more strategically at these critical locations.

Although our model successfully captures over half of the demand variations, it still requires further refinement to increase its predictive accuracy. Without these improvements, the model risks failing to meet rider needs and complicating bike redistribution efforts. Nevertheless, the straightforward and open-source nature of our approach makes it a valuable tool for bikeshare operators looking to improve service quality and optimize bike availability. As urban transportation continues to evolve, the use of predictive models like ours will be vital for creating sustainable, well-balanced mobility systems.

---
title: "Predicting Bikeshare Demand in Jersey City"
author: "Yutong Jiang"
date: "Nov,15,2024"
output:
  html_document:
    theme: cosmo 
    toc: yes
    toc_float: yes
    code_folding: hide
    code_download: yes
editor_options:
  markdown:
    wrap: sentence
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Introduction

Bikeshare programs have become an essential part of urban transportation, offering a sustainable and flexible commuting option. Citi Bike, launched by Lyft in 2008 in New York City, has expanded to Jersey City to meet the needs of commuters traveling to Manhattan. As Citi Bike's popularity grows, managing bike availability effectively becomes crucial, especially given the fluctuations in demand throughout the day and across different locations.

Maintaining a balanced distribution of bikes, known as "re-balancing," is a complex challenge influenced by these spatial and temporal demand patterns. Companies like Uber and Lyft use proprietary algorithms to address these issues, but the details of their models are not publicly available. This report aims to develop a transparent, open-source approach to forecasting bike demand in Jersey City using Ordinary Least Squares (OLS) regression. By analyzing historical data, including weather and bike usage patterns, our goal is to create a simple yet effective model to anticipate demand spikes and guide efficient bike reallocation.

We will begin by loading and processing our datasets, followed by an exploratory analysis of usage trends and weather impacts. Next, we’ll create a space-time panel to examine correlations and finally develop and evaluate four predictive models, testing their performance and reliability.


# Set Up

```{r library and data, include=FALSE}
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(gganimate)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(mapview)
library(sfdep)
library(spdep)
library(caret)
library (readr)

options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c("#1F78B4", "#33A02C", "#E31A1C", "#FF7F00", "#6A3D9A")  # Blue, Green, Red, Orange, Purple
palette4 <- c("#1F78B4", "#33A02C", "#E31A1C", "#6A3D9A")            # Blue, Green, Red, Purple
palette2 <- c("#1F78B4", "#E31A1C")                                  # Blue, Red


tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)
```

The data used in this analysis comes from NYC OpenData, which provides monthly records of Citi Bike usage. Even though Jersey City is in New Jersey, Citi Bike's operations are largely centered in New York City, and thus the data is managed through NYC's system. For this study, we focused on bikeshare trips in Jersey City during August 2024, capturing information about the starting and ending stations for each ride. Additionally, we acquired the census tract geometries for all of Hudson County, which encompasses Jersey City, using the tidycensus package.

```{r obtain census geom and bike data, warning=FALSE, message=FALSE, results='hide', cache=TRUE}

nycbike <- read.csv("/Users/bin/Documents/MUSA/MUSA 5080/JC-202308-citibike-tripdata.csv")

njCensus <- get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2021, 
          state = "NJ", 
          geometry = TRUE, 
          county=c("Hudson"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

jcTracts <- njCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf()
```

The bikeshare dataset we’re working with contains spatial details, but it isn't inherently formatted as spatial data. To conduct a spatial analysis, we first need to integrate it with the census geometry. Our focus is on the starting locations of rides, as our aim is to predict bike availability at stations when users initiate a trip. To prepare the data, we filter out records lacking latitude or longitude information and address errors where certain coordinates incorrectly place stations in the Hudson River. These inaccurate entries are removed to ensure data quality.



```{r add census tract}

bike_census <- st_join(nycbike %>% 
                         filter(is.na(start_lat) == FALSE &
                                is.na(start_lng) == FALSE &
                                is.na(end_lat) == FALSE &
                                is.na(end_lng) == FALSE) %>%
                         st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
                       jcTracts %>%  st_transform(crs=4326),
                       join=st_intersects, left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lng = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  filter(!(ride_id %in% c("BEC9CC4D1007C753", "86087F431785EDE7", "A1AAB92631439883", "71938DA085242DCC", "4203C77668C47C75", "C92410CA2AEF3947", "6E02CBF36C90F244", "0259885253FD238E", "B39136B37AFB5FE9", "1EC9376D1559C51C", "CDDF16D3A37BE370"))) %>% 
  as.data.frame()
  
```

The following map shows the Locartion of ride starting points in Jersey City

```{r distribution of stations}

ggplot()+
  geom_sf(data = jcTracts %>%
          st_transform(crs=4326), fill="white")+
  geom_point(data = bike_census, aes(x=start_lng, y = start_lat), 
            color = "blue", alpha = 1, size = 0.3) + 
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  labs(title = "Location of Rides Starting Points in Jersey City") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

```

We carry out essential feature engineering to derive temporal insights from the ride data. Specifically, we'll create time intervals at both 60-minute and 15-minute increments based on each ride's start time. Additionally, we'll categorize these intervals into broader time periods, such as "Overnight," "AM Rush," "Mid-Day," and "PM Rush," and classify each day as either a weekday or weekend.

```{r temporal feature engineering}
bike_census <- bike_census %>%
  mutate(
    interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
    interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
    week = week(interval60),
    dotw = wday(interval60),  # Removed label = TRUE
    time_of_day = case_when(
      hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
      hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
      hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
      hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"
    ),
    hour = hour(started_at),
    weekend = ifelse(dotw %in% c(1, 7), "Weekend", "Weekday")  # Use numeric values for days
  )

```

To better understand commuting patterns and rider behavior throughout the day, we generated line strings to represent the routes between the start and end points of rides. To simplify the visualization, we limited it to 200 sample rides. The visualization reveals that the majority of trips are confined to Jersey City, though some riders use Citi Bike to travel from Jersey City to Manhattan.

```{r origin-desintation matrix, warning=FALSE}

lines <- st_sfc(
  lapply(1:nrow(bike_census), function(i) {
    st_linestring(matrix(c(bike_census[i, "start_lng"], bike_census[i, "start_lat"],
                            bike_census[i, "end_lng"], bike_census[i, "end_lat"]), ncol = 2, byrow = TRUE))
  }))

lines_sf <- st_sf(geometry = lines)
lines_sf <- st_set_crs(lines_sf, 4326)

Route <- lines_sf %>% head(200)
mapview(Route, zcol = NULL, color = "#6E0E0A", linewidth = 0.001, alpha = 0.3) # only show 200
```


# Exploratory Analysis

In this section, we delve into some further exploratory analysis of our data set. 

## Temporal Analysis

We began our analysis by examining the temporal patterns of bikeshare usage in Jersey City throughout August 2023, focusing on the distribution of trips by the hour. Our visualization reveals distinct fluctuations, with demand rising and falling in a predictable cycle each day. During weekdays, we observe pronounced peaks that align with typical rush hours—mornings and late afternoons—when people are commuting to and from work. These busy periods are followed by noticeable lulls during the night and early morning hours, as well as during weekends, when overall usage tends to decrease. The pattern is characterized by a series of five consecutive high-demand peaks that correspond to weekdays, consistently followed by two lower peaks over the weekend, highlighting the influence of workweek routines on bikeshare activity. This cyclical behavior underscores the strong link between bikeshare demand and daily commuting schedules.

```{r bikeshare viz1}
ggplot(bike_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n), color= "blue")+
  labs(title="Bike Share Trips Per Hour in Jersey City, Aug 2023",
       x="Date", 
       y="Number of trips") + 
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
  
```

Next, we analyzed how trips were distributed across different stations on an hourly basis throughout August. The resulting visualization revealed a highly skewed distribution: the vast majority of stations recorded either no rides or just a few trips per hour. However, a select few stations stood out, experiencing significantly higher demand. These high-traffic stations sometimes saw as many as 40 to 50 rides within a single hour on certain days. This uneven distribution highlights the varying levels of bikeshare usage across Jersey City, with certain locations acting as major hubs of activity while many others experience minimal ridership.

```{r bikeshare viz3, warning=FALSE}

ggplot(bike_census %>%
         group_by(interval60, start_station_name) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5, fill="lightblue")+
  labs(title="Bike Share Trips Per Hour by Station in Jersey City, Aug 2023",
       x="Trip Counts", 
       y="Number of Stations") + 
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

```

We further explored the average number of trips per station at different times of the day. The analysis showed that PM rush hours consistently had the highest average trip counts, with several stations averaging more than 10 trips, indicating substantial demand at these locations. Midday came next in terms of activity levels, where some stations recorded averages of over 5 trips. Surprisingly, there was also considerable demand overnight, as a few stations averaged more than 5 rides even during late-night hours. This unexpected overnight activity suggests that bikeshare usage extends beyond traditional commuting patterns, potentially serving other purposes such as late-night leisure or early-morning travel.



```{r bikeshare viz2, warning=FALSE, message=FALSE}

bike_census %>%
        group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips, fill=time_of_day), binwidth = 1)+
  scale_fill_manual(values = palette4, name="Time of Day") + 
  labs(title="Mean Number of Hourly Trips Per Station",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

We then examined the total number of trips throughout August, breaking them down by day of the week. Two key patterns emerged from this visualization. First, the number of trips fluctuates significantly over the course of a day, with the lowest activity observed in the early morning hours and the highest in the late afternoon, corresponding to evening commute times. Second, the data shows that trip volumes are generally higher on Tuesdays, Wednesdays, and Thursdays, suggesting that midweek days experience the greatest demand for bikeshare services, likely driven by regular work-related travel.

```{r bikeshare viz4}

ggplot(bike_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 3, lwd = 0.8)+
  scale_color_manual(values = c("#124E78", "#819FA1", "#F0F0C9", "#F2BB05", "#E58507", "#D74E09", "#6E0E0A"), name = "Day") + 
  labs(title="Bike Share Trips in Jersey City by Day of the Week, Aug 2023",
       x="Hour", 
       y="Trip Counts") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

We created a similar visualization but this time compared trip volumes between weekdays and weekends. The differences are striking: bikeshare usage is noticeably higher on weekdays compared to weekends. Additionally, on weekends, trips start to pick up much later in the day, reflecting the absence of the typical weekday morning rush. This delay suggests that without the pressure of early work commutes, people are more likely to use bikeshare services later in the day for leisure or non-work-related activities, highlighting the influence of daily routines on bikeshare demand.

```{r bikeshare viz 5}

ggplot(bike_census)+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 3, lwd = 1.2)+
  scale_color_manual(values=palette2) + 
  labs(color = "Time of Week") +
  labs(title="Bike Share Trips in Jersey City - Weekend vs Weekday, Aug 2023",
       x="Hour", 
       y="Trip Counts") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

```{r origin_map }
# Assuming `bike_census` has a column 'Origin.Tract' that maps each bike trip to a census tract
# and `jcTracts` is already loaded and transformed as necessary

# Aggregate bike trip data by census tract
bike_trips_by_tract <- bike_census %>%
  group_by(Origin.Tract) %>%
  summarise(Total_Trips = n(), .groups = 'drop') %>%
  ungroup()

# Join this aggregated data to the jcTracts GeoDataFrame
jcTracts <- st_transform(jcTracts, crs = 4326)  # Make sure CRS matches, if not already
jcTracts_data <- jcTracts %>%
  left_join(bike_trips_by_tract, by = c("GEOID" = "Origin.Tract"))

# Calculate the bounding box for centering
bbox <- st_bbox(jcTracts_data)

# Plot the choropleth map with adjusted limits
ggplot(data = jcTracts_data) +
  geom_sf(aes(fill = Total_Trips), color = "white", size = 0.1) +
  scale_fill_viridis_c(option = "D", name = "Total Bike Trips", na.value = "white") +
  coord_sf(xlim = c(bbox["xmin"], bbox["xmax"]), ylim = c(bbox["ymin"], bbox["ymax"]), expand = FALSE) +  # Set limits
  labs(title = "Bike Share Trips per Census Tract in Jersey City") +
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(size = 12))

```

Expanding on our analysis, we grouped the data by station and time of day to explore trip volumes during different periods on weekdays versus weekends. To keep the visualization clear, we focused on the top 50 station-time occurrences. Each bar in the chart represents a specific combination of station, time of day, and whether it's a weekday or weekend. The results clearly show that PM rush hours generate the highest bikeshare demand, particularly at a few key stations. Notably, weekend rides are scarce in the top 50 list, appearing only three times. A closer look reveals that Grove St PATH station had a total of 1,117 rides during PM rush on a weekday and 801 rides overnight on a weekday in August. Hoboken Terminal - River St & Hudson Pl recorded 864 rides during weekday PM rush. South Waterfront Walkway - Sinatra Dr & 1 St had 319 rides overnight on a weekend and 731 rides during PM rush on a weekday. These details underscore how certain stations experience significant spikes in demand, especially during weekday commute hours.

```{r bikeshare viz 7, fig.width=10}

to_plot <- bike_census%>% # only show 100 stations
  group_by(start_station_id, start_lat, start_lng, weekend, time_of_day, start_station_name) %>%
  tally() %>% 
  arrange(-n) %>% 
  head(50) 
to_plot$ID <- seq_along(to_plot$start_station_id)

to_plot %>% 
  arrange(-n) %>%
  head(50) %>% 
  ggplot(aes(x = reorder(ID, -n), n, fill = time_of_day, color = weekend)) +
  scale_fill_manual(values = palette4, name="Time of Day") + 
  guides(color="none") + 
  geom_bar(stat = "identity", position="stack") +
  scale_color_manual(values = c("transparent", "black")) +
  labs(title="Top 50 Occurences of Rides By Time")+
  ylab("Number of Rides") +
  xlab("") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
```


## Spatial Analysis

Beyond our temporal analysis, we created maps to visualize the locations of stations with the highest bikeshare demand, broken down by time of day and day of the week. These proportional maps reveal that the greatest demand occurs during PM rush hours and overnight on weekdays, while AM rush hours see the lowest activity. The stations with the highest usage are predominantly clustered along the Hudson River. This pattern could be due to two factors: a higher concentration of stations in this area or strategic placement by Citi Bike to accommodate commuters traveling from Jersey City to Manhattan for work or school. The proximity to major transit connections likely drives this increased demand.

```{r bikeshare viz 6}

ggplot()+
  geom_sf(data = jcTracts %>%
          st_transform(crs=4326), fill = "white")+
  geom_point(data = to_plot,
             aes(x=start_lng, y = start_lat, size=n), color = alpha("#6E0E0A", 0.7)) + 
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  scale_size(range = c(0.5, 8)) + 
  labs(size = "Number of Rides") +
  facet_grid(weekend ~ time_of_day) + 
    labs(title="Locations of Stations with Greatest Bikeshare Demand by Time")+ 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```

We analyzed all the starting stations collectively to assess spatial autocorrelation, aiming to determine if stations with high demand tend to be located near other high-demand stations. If such a pattern exists, it would allow us to predict demand at a specific station based on the activity levels at neighboring stations. To investigate this, we selected the top 80 stations with the highest number of rides in August 2023 and calculated Moran's I using 999 random Monte Carlo simulations. The results indicated a highly significant Moran's I, confirming that there is indeed spatial autocorrelation among nearby bikeshare stations, meaning demand is not randomly distributed but clustered in specific areas.




```{r spatial auto cor, warning=FALSE}

moran_data <- bike_census %>%
  group_by(start_station_id, start_lat, start_lng, start_station_name) %>%
  tally() %>% 
  arrange(-n) %>% 
  head(80) %>% 
  st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326)


coords <-  st_coordinates(moran_data) 
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")

moranTest <- moran.mc(moran_data$n, 
                      spatialWeights, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01, fill = "#124E78") +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#6E0E0A" ,lwd=1) +
  scale_x_continuous(limits = c(-0.5, 0.5)) +
  labs(title = "Observed and Permuted Moran's I for Stations with Most Rides",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10))
```

The map below illustrates the clustering of stations with higher bikeshare demand. Specifically, two main clusters emerge, and within each of these clusters, a few stations experience exceptionally high numbers of rides. These high-demand stations are typically surrounded by others with slightly lower, but still significant, levels of activity, highlighting a clear pattern of concentrated bikeshare usage.

```{r spatial cor viz, fig.height=8, fig.width=11}

  ggplot() +
  geom_sf(data = jcTracts, fill = "white") +
  geom_sf(data = moran_data,
          aes(size = n, color = n)) + 
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) + 
  scale_size(
    range = c(1, 6),
    guide = guide_legend(
      direction = "horizontal",
      nrow = 1,
      label.position = "bottom")) +
  scale_color_gradientn(colours = hcl.colors(5, "RdBu",rev = TRUE, alpha = 0.9), 
                        guide = guide_legend(direction = "horizontal", nrow = 1)) +
  labs(title = "Stations with Most Rides",
       subtitle =  "Dot size proportional to rides") + 
  theme(legend.position = "bottom") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

```

This map below combines the spatial and temporal characteristics of bike rides at 15min interval into a choropleth map. 

```{r animate map temporal viz, message=FALSE, warning=FALSE, cache=TRUE,  fig.height=8, fig.width=11}
bike_aggregated <- bike_census %>%
  group_by(interval15, Origin.Tract) %>%
  summarize(Trip_Count = n(), .groups = 'drop')

# Joining with spatial data
choropleth_data <- bike_aggregated %>%
  left_join(jcTracts, by = c("Origin.Tract" = "GEOID")) %>%
  st_sf()

# Creating the choropleth map with the same scale
choropleth_bike_map <- ggplot() +
  geom_sf(data = jcTracts, fill = "white", color = "grey", alpha = 0.5) + # Base map
  geom_sf(data = choropleth_data, aes(fill = Trip_Count), color = NA) + # Choropleth layer
  scale_fill_gradientn(
    colours = hcl.colors(5, "YlOrRd", rev = TRUE),
    name = "Number of Trips",
    na.value = "transparent"
  ) +
  # Set the same scale using xlim and ylim
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  ylim(min(bike_census$start_lat), max(bike_census$start_lat)) +
  labs(
    title = "Bikeshare Trip Density by 15-Minute Intervals",
    subtitle = "Number of bike rides aggregated by origin tract and time interval"
  ) +
  theme(
    legend.position = "bottom",
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    plot.subtitle = element_text(size = 9, face = "italic"),
    plot.title = element_text(size = 12, face = "bold"),
    panel.background = element_blank(),
    panel.border = element_rect(colour = "grey", fill = NA, linewidth = 0.8)
  )

# Display the updated map
print(choropleth_bike_map)

```



## Weather Analysis

In addition to space and time, weather plays a crucial role in influencing bikeshare demand. We expect that demand typically drops on rainy, windy, or extremely hot days. To incorporate this factor into our analysis, we collected hourly weather data from Newark Airport using the riem_measures function, covering the period from August 1 to September 1, 2023. Since Newark Airport is close enough to Jersey City, it serves as a reliable source for local weather conditions. We then created a weather panel summarizing key variables such as temperature, precipitation, and wind speed for each hour.

```{r download and wrangle weather data, warning=FALSE}
weather.Panel <- 
  riem_measures(station = "EWR", date_start = "2023-08-01", date_end = "2023-09-01") %>%
  dplyr::select(valid, tmpf, p01i, sknt) %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  mutate(
    week = week(interval60),
    dotw = wday(interval60)  # Remove label = TRUE
  ) %>%
  group_by(interval60) %>%
  summarize(
    Temperature = max(tmpf),
    Precipitation = sum(p01i),
    Wind_Speed = max(sknt)
  ) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

```

Examining the temperature, precipitation, and wind data collected for August, we observe a few notable weather patterns. There were several rainy days scattered throughout the month, while wind speeds remained generally stable, with only a few instances of particularly strong winds. Temperature showed daily fluctuations, with higher readings during the daytime and cooler temperatures at night, reflecting a typical summer cycle.

```{r visualize weather data, fig.height=11, fig.width=8, warning=FALSE}

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line(color="#124E78") + 
    labs(title="Percipitation", x="Hour", y="Perecipitation") + 
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank()), 
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line(color="#124E78") + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank()),  
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line(color="#124E78") + 
    labs(title="Temperature", x="Hour", y="Temperature") +theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank()), 
  top="Weather Data ")

```


# Space-Time Panel

We proceeded by constructing a study panel that captures unique combinations of space and time. This means each row in the panel corresponds to a ride occurring at a specific station during a specific hour, allowing us to analyze bikeshare activity across both temporal and spatial dimensions.

```{r create study panel, warning=FALSE, message=FALSE}
# Check if 'interval60' exists and is not entirely NULL
if("interval60" %in% names(bike_census) && !all(is.na(bike_census$interval60))) {
  print("Column 'interval60' exists and has data.")
  print(head(bike_census$interval60))
} else {
  print("Column 'interval60' does not exist or is all NA.")
}

# If 'interval60' needs to be created or recalculated:
if(!"interval60" %in% names(bike_census) || all(is.na(bike_census$interval60))) {
  print("Recalculating 'interval60' from existing datetime column, assuming 'started_at' exists.")
  
  # Assuming 'started_at' is the column from which 'interval60' should be derived:
  if("started_at" %in% names(bike_census)) {
    bike_census$interval60 <- as.POSIXct(bike_census$started_at, format = "%Y-%m-%d %H:%M:%S") # Customize format as necessary
    bike_census$interval60 <- floor_date(bike_census$interval60, "hour") # Using lubridate to round down to the nearest hour
    print("New 'interval60' calculated and added to 'bike_census'.")
  } else {
    print("'started_at' column also missing; please check data source.")
  }
}

# Check and recreate the data frame for joining
if("interval60" %in% names(bike_census) && "start_station_id" %in% names(bike_census)) {
  combined_df <- expand.grid(interval60 = unique(bike_census$interval60),
                            start_station_id = unique(bike_census$start_station_id))
  print("Combined DataFrame created successfully. Displaying structure:")
  print(str(combined_df))
} else {
  print("One or both key columns are still missing or invalid.")
}
study.panel <- 
  expand.grid(interval60=unique(bike_census$interval60), 
              start_station_id = unique(bike_census$start_station_id)) %>%
  left_join(., bike_census %>%
              select(start_station_id, start_station_name, Origin.Tract, start_lng, start_lat)%>%
              distinct() %>%
              group_by(start_station_id) %>%
              slice(1))
```


We then enhanced this panel by adding more detailed information. This involved counting the number of rides at each station for each hour, incorporating weather data, merging in census information, and using the lubridate package to derive time-related variables such as the day of the week and hour of the day.

```{r add info to study panel, warning=FALSE, message=FALSE}

bike.panel <- 
  bike_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, start_lng, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm = TRUE), .groups = 'drop') %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(!is.na(start_station_id)) %>%
  mutate(
    week = week(interval60),
    dotw = wday(interval60)  # Removed label = TRUE
  ) %>%
  filter(!is.na(Origin.Tract)) %>% 
  left_join(
    njCensus %>%
      as.data.frame() %>%
      select(-geometry),
    by = c("Origin.Tract" = "GEOID")
  )

```

## Weather Correlation

With our panel data ready, we conducted further exploratory analysis to prepare for our regression modeling. Our first focus was on understanding the relationship between temperature and the number of bike rides in August. The analysis revealed a strong, positive, and linear correlation, indicating that bikeshare demand tends to increase as temperatures rise throughout the month.


```{r visualize temp, warning=FALSE, message=FALSE}

bike.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point(size = .5, color = "black") + geom_smooth(method = "lm", se= FALSE, color="darkred") +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count As a Fuction of Temperature by Week",
         x="Temperature", y="Mean Trip Count") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
  

```

Next, we examined the correlation between wind speed and the number of bike rides in August. We observed a generally positive linear relationship, though the strength of this correlation varied across different weeks. This suggests that bikeshare demand tends to increase on windy days, but the impact of wind speed on ride activity is not uniform throughout the month.

```{r visualize wind, warning=FALSE, message=FALSE}
bike.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Wind = first(Wind_Speed)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Wind, Trip_Count)) + 
    geom_point(size = .5, color = "black") + geom_smooth(method = "lm", se= FALSE, color="darkred") +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count As a Fuction of Wind by Week",
         x="Wind Speed", y="Mean Trip Count") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)) + 
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
```


## Serial Correlation

If the number of bike trips shows serial (temporal) correlation, incorporating time lag features into our model could significantly enhance its predictive power. The underlying concept is that bikeshare demand at any given hour is likely to be influenced by demand patterns immediately preceding or following that hour. For example, if we know there was a high number of trips during one hour, we can reasonably predict that nearby hours will also have elevated demand. To capture these temporal dependencies, we computed six different lagged variables, each representing the number of trips one, two, three, four, twelve hours earlier, and one day earlier. These lag features allow us to account for short-term fluctuations as well as daily patterns, providing a more comprehensive framework for modeling bikeshare activity. By leveraging these time-based correlations, our model can better anticipate shifts in demand, improving accuracy and reliability.


```{r calculate serial lag}

bike.panel <- 
  bike.panel %>% 
  arrange(start_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
   mutate(day = yday(interval60))

```

We visualized the relationships between the number of trips and the various lagged hours to better understand these temporal correlations. The analysis revealed some interesting patterns. There is a strong, linear, and negative relationship between the number of trips and the lagged variable representing 12 hours prior. For instance, the number of rides recorded at 5 PM on a given day tends to be negatively correlated with the number of rides at 5 AM on the same day, likely reflecting contrasting peak and off-peak usage times.

Conversely, we observed a strong, positive, and linear correlation between the number of trips at a given hour and both the immediate next hour (laghour) and the same hour on the following day (lag1day). This indicates that bikeshare demand is highly predictable from one hour to the next and tends to follow consistent daily patterns. However, the strength of the correlation decreases for the lag variables representing two and three hours prior. For example, the number of rides at 4 PM is more strongly correlated with the number of rides at 5 PM than with rides at 6 or 7 PM. This gradual weakening of correlation over time highlights how bikeshare demand is most heavily influenced by recent past activity, making these time lag features valuable for modeling purposes.

```{r visualize serial correlations, message=FALSE, warning=FALSE}

as.data.frame(bike.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
  ggplot(aes(x = Value , y = Trip_Count)) +
    geom_point(size = .5, color = "black") + geom_smooth(method = "lm", se= FALSE, color="darkred") +
  facet_wrap(~Variable, ncol = 6)+
  labs(x = "Variable", y = "Correlation", title = "Correlation Between Serial Lag Trips and Trip Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

## Spatial Correlation

We previously established that spatial autocorrelation is present in our data, indicating that the number of trips at one station during a specific time is related to the number of trips at nearby stations. In other words, knowing the demand at one station can help us estimate demand at neighboring stations. To leverage this spatial relationship, we created a new dataframe called lag_result.

We then implemented a nested loop to systematically calculate spatial lags for each hour in August 2023. For each hour, we identified the five nearest neighboring stations for every station and computed the average number of trips across these neighbors. This provided a measure of spatial influence, capturing how demand at one station is affected by activity at surrounding stations. Finally, we integrated these spatial lag values back into our panel dataset, enhancing our model's ability to predict bikeshare demand by incorporating both spatial and temporal dependencies.




```{r calculate spatial lag, message=FALSE, warning=FALSE}
library(sf)
library(spdep)
library(dplyr)
library(parallel)

# Step 1: Filter data once and convert to sf object
bike.panel_sf <- bike.panel %>%
  st_as_sf(coords = c("start_lng", "start_lat"), crs = 4326)

# Step 2: Define a function to calculate spatial lag for a given day and hour
calculate_spatial_lag <- function(d, h, bike_data) {
  # Load required libraries within the function
  library(dplyr)
  library(sf)
  library(spdep)
  library(lubridate)
  
  # Filter data for the specified day and hour
  spacelag <- bike_data %>%
    mutate(hour = hour(interval60), day = yday(interval60)) %>%
    filter(day == d & hour == h)
  
  if (nrow(spacelag) == 0) return(NULL)  # Skip if no data
  
  # Compute spatial neighbors and weights
  coords <- st_coordinates(spacelag)
  neighborList <- knn2nb(knearneigh(coords, k = 5))
  spatialWeights <- nb2listw(neighborList, style = "W")
  
  # Calculate spatial lag
  spacelag <- spacelag %>%
    mutate(lagTrip = lag.listw(spatialWeights, Trip_Count))
  
  return(spacelag)
}

# Step 3: Use parallel processing to speed up the loop
cl <- makeCluster(detectCores() - 1)  # Create a cluster using available cores
clusterExport(cl, c("bike.panel_sf", "calculate_spatial_lag"))  # Export data and function
clusterEvalQ(cl, {  # Load required libraries in each worker
  library(dplyr)
  library(sf)
  library(spdep)
  library(lubridate)
})

# Run the loop in parallel
lag_result_list <- parLapply(cl, 213:243, function(d) {
  lapply(0:23, function(h) calculate_spatial_lag(d, h, bike.panel_sf))
})

stopCluster(cl)  # Stop the cluster

# Combine the results into a single data frame
lag_result <- do.call(rbind, unlist(lag_result_list, recursive = FALSE)) %>%
  mutate(start_lng = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2))) %>%
  st_drop_geometry()

# Final result
bike.panel <- lag_result

  
```

The figure below illustrates a consistent, strong, and positive linear relationship between the number of trips at a station and the number of trips at nearby stations throughout all weeks in August. This confirms the presence of significant spatial dependence, suggesting that stations in close proximity tend to experience similar levels of demand. Given this finding, we will incorporate this spatial lag variable into our regression analysis to improve the model's predictive accuracy by accounting for the influence of nearby station activity.



```{r visualize spatial correlation, message=FALSE, warning=FALSE}

bike.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Lag_Count = mean(lagTrip)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Lag_Count, Trip_Count)) + 
    geom_point(size = 1, color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
    facet_wrap(~week, ncol=8) + 
    labs(title="Correlation Between Spatial Lag Trips and Trip Count",
         x="Lag Trip Count", y="Mean Trip Count") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```


# OLS Regression

We divided our five weeks of data into a training set consisting of three weeks and a testing set comprising the remaining two weeks. Using the training data, we estimated four distinct linear regression models, each incorporating different fixed effects:

- reg1 captures only temporal factors, including hour-of-day fixed effects, day of the week, wind speed, and temperature.
- reg2 emphasizes spatial components, using station-specific effects and weather conditions.
- reg3 combines both spatial and temporal elements to account for interactions between space and time.
- reg4 extends the third model by adding both time and spatial lag features, capturing temporal and spatial dependencies to enhance predictive accuracy.




```{r fitting regression}

bike.Train <- filter(bike.panel, week <= 33)
bike.Test <- filter(bike.panel, week > 33)

# Time
reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature + Wind_Speed, data=bike.Train)

# Space
reg2 <- 
  lm(Trip_Count ~  start_station_name + Temperature + Wind_Speed,  data=bike.Train)

# Time and Space
reg3 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Wind_Speed, 
     data=bike.Train)

# Time and Space and TimeLag and SpaceLag
reg4 <- 
  lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Wind_Speed +
                   lagHour + lag2Hours + lag12Hours + lag1day + lagTrip, 
     data=bike.Train)
```

## Making Predictions

We utilized the purrr package to efficiently iterate over a set of nested data frames, allowing us to compare different model specifications. To prepare for testing, we organized our data by week, nesting it into separate tibbles. We then created a simple prediction function that takes a tibble (dat) and a regression model (fit) as inputs and generates predictions, returning them as pred. This approach streamlines the process of applying our models across multiple subsets of data, making it easy to evaluate and compare their performance.

```{r nest data and predict function, warning=FALSE}

bike.Test.weekNest <- 
  bike.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

```


The nested data format enables us to iterate through each model for every week and compute summary statistics efficiently. In the code block that follows, we generate week_predictions for each week within the testing set. Using the map function from purrr, we apply the model_pred function to each nested tibble. For instance, in the first line of the mutate operation, a new column called Time_FE is created, which contains predictions from reg1, the model that includes time fixed effects. These predictions are produced by mapping model_pred to each data subset, with fit set to the reg1 model. This method allows us to easily apply multiple models and gather prediction results for comparison.

```{r making predictions, warning=FALSE}

week_predictions <- 
  bike.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_Lags = map(.x = data, fit = reg4, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

```

## Error Analysis

When comparing the mean absolute errors (MAE) across our four models, we observe that Model 1, which only incorporates time effects, has the highest error rate. Models 2 and 3, which account for spatial effects and the combination of space and time, show improved accuracy. However, the greatest enhancement in predictive performance comes from Model 4, which includes both space and time lag features, significantly reducing the error. To ensure the reliability of Model 4, we also performed a variance inflation factor (VIF) analysis to confirm that multicollinearity was not an issue among the predictors.

```{r plot errors}

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette4) +
    labs(title = "Mean Absolute Errors by Model Specification and Week") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

The figure below provides a visual comparison of the accuracy of each model by plotting the predicted versus observed number of trips. Among the four models, Model 4 clearly demonstrates superior predictive performance, closely aligning with the actual bikeshare demand. This highlights the effectiveness of incorporating both spatial and temporal lag features, making Model 4 the most robust and reliable in forecasting trip counts.




```{r comparing predicted and observed, fig.height=8, fig.width=11, message=FALSE, warning=FALSE, cache=TRUE}

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      scale_color_manual(values=palette2)+
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed Bike Share Time Series", subtitle = "Jersey City: a test set of 2 weeks",  x = "Hour", y= "Station Trips") + 
   theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

To evaluate whether our predictions generalize well across different locations and times, we mapped the mean absolute error (MAE) of Model 4 across various stations. The results show that the model struggles to accurately predict the number of rides at stations located along the Hudson River, which tend to experience high bikeshare demand on both weekdays and weekends. However, for stations with lower ridership throughout August, the model performs significantly better, demonstrating reasonable predictive accuracy in those areas.





```{r error by station, warning=FALSE, message=FALSE, cache=TRUE}

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_latitude = map(data, pull, start_lat), 
           start_longitude = map(data, pull, start_lng)) %>%
    select(interval60, start_station_id, start_longitude, start_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_Lags") %>%
  group_by(start_station_id, start_longitude, start_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = jcTracts,  fill = "white")+
  geom_point(aes(x = start_longitude, y = start_latitude, color = MAE), 
             fill = "transparent")+
  scale_color_continuous(low = "#124E78", high = "#6E0E0A", name= "MAE")+
 ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  labs(title="Mean Abs Error, Test Set, Model 4") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

```

When examining Model 4's performance across both space and time, the figure demonstrates that the model maintains consistent predictions throughout different periods. Stations identified as having high demand on weekends are similarly predicted to have high demand during weekdays. However, the errors persist over time, indicating that stations with historically high demand also exhibit greater variability in ride counts, making them more challenging to predict with our current set of features. In fact, our model only explains 50.12% of the variability, which is a limitation. The concern is that if these larger errors remain unaddressed for high-demand stations, we risk failing to adequately meet bikeshare users' needs at these crucial locations.





```{r comparing errors across space and time, warning=FALSE, message=FALSE, cache=TRUE}

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_latitude = map(data, pull, start_lat), 
           start_longitude = map(data, pull, start_lng),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station_id, start_longitude, 
           start_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_Lags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_longitude, start_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = jcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_longitude, y = start_latitude, color = MAE), 
             fill = "transparent", size = 0.5)+
  scale_color_continuous(low = "#124E78", high = "#6E0E0A", name= "MAE")+
 ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors by Time, Test Set, Model 4") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```


## Cross-Validation

Lastly, we conducted 100 cross-validation tests using Model 4 on data spanning all five weeks. The mean absolute error (MAE) was 0.44, indicating that while this model is our most effective solution so far, there is still a notable level of prediction error. To improve the model further, it would be valuable to cross-validate against socio-economic indicators. This approach could reveal whether bike demand at stations in specific neighborhoods is being disproportionately under- or over-predicted. However, our current model does not incorporate socioeconomic variables as predictors, and the dynamics in Jersey City—such as the continued development of new stations—may also impact demand patterns.

```{r cross validation}

fitControl <- trainControl(method = "cv", number = 100)

reg.cv <- train(Trip_Count ~ start_station_name +  hour(interval60) + dotw + Temperature + Wind_Speed + lagHour + lag2Hours + lag12Hours + lag1day + lagTrip, data=bike.panel, method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv$resample %>% 
  summarise(MAE = mean(reg.cv$resample[,3]),
            sd(reg.cv$resample[,3])
) %>%
  kbl(col.name=c('Mean Absolute Error','Standard Deviation of MAE')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```


# Conclusion

In summary, this report highlights the urgent need for efficient bikeshare management strategies in Jersey City and presents a machine learning approach to address this challenge. Our analysis demonstrates that bikeshare demand is influenced by a combination of spatial, temporal, and weather-related factors. Specifically, the model shows that forecasting demand is more accurate when considering the number of trips in the preceding and following hours, the same hour on the previous day, and nearby station activity. We also discovered that demand peaks during windy and hot days in August, with the highest usage occurring during weekday PM rush hours and, to a lesser extent, overnight. Weekends see a significant spike in demand during the PM rush as well. Notably, Citi Bike is heavily used for commuting to New York City, making stations along the Hudson River, especially near ferry terminals, the most active.

We further explored specific high-demand stations in detail. Grove St PATH station recorded an impressive 1,117 rides during the weekday PM rush and 801 rides overnight in August, reflecting its strategic location in downtown Jersey City. Hoboken Terminal - River St & Hudson Pl had 864 rides during weekday PM rush hours, likely due to its proximity to major transit connections. South Waterfront Walkway - Sinatra Dr & 1 St station saw 319 rides overnight on a weekend and 731 during the weekday PM rush, benefiting from its location near the NJ Transit Terminal and the popular Pier A Park. These findings suggest that stations experience higher demand because they serve key areas such as employment hubs, transit nodes for commuting, and recreational spaces along the waterfront. To address this, local authorities should consider rebalancing bike supplies more strategically at these critical locations.

Although our model successfully captures over half of the demand variations, it still requires further refinement to increase its predictive accuracy. Without these improvements, the model risks failing to meet rider needs and complicating bike redistribution efforts. Nevertheless, the straightforward and open-source nature of our approach makes it a valuable tool for bikeshare operators looking to improve service quality and optimize bike availability. As urban transportation continues to evolve, the use of predictive models like ours will be vital for creating sustainable, well-balanced mobility systems.
