The rapid growth of urban populations and increased awareness about environmental sustainability have contributed to the popularity of bike sharing programs. These programs have become a common mode of transport in many cities, providing an eco-friendly, healthy, and convenient way to travel. Analyzing bike sharing data can provide insights into usage patterns, user behavior, and the overall performance of the program. In this study, we analyze bike sharing data from 2022 to explore trip durations, user types, daily trips over time, and the most popular routes. We also train a time series forecasting model to predict the number of bike rides on a given day.
I used Citibike data from 15 CSV files in this study, each corresponding to one month in 2022 or 2023. The data includes bike trip start and end times, start and end station IDs, user types (member or casual), and latitude and longitude coordinates of the start and end stations. I preprocessed the data by combining the CSV files into a single DataFrame, parsing the date and time columns, and calculating additional features such as trip duration, day of the week, hour of the day, and whether the trip occurred on a weekend.
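The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration with pandas, not the exact code used for the study: the column names `started_at` and `ended_at`, the derived column names, and the file pattern in the usage comment are all assumptions.

```python
import glob

import pandas as pd


def load_trips(pattern):
    """Read every CSV matching `pattern` and stack them into one DataFrame."""
    paths = sorted(glob.glob(pattern))
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


def add_time_features(trips):
    """Parse timestamps and derive duration, weekday, hour, and weekend flag.

    Assumes the trip start and end columns are named `started_at` and
    `ended_at`; adjust to match the actual files.
    """
    out = trips.copy()
    out["started_at"] = pd.to_datetime(out["started_at"])
    out["ended_at"] = pd.to_datetime(out["ended_at"])
    # Trip duration in minutes
    out["duration_min"] = (out["ended_at"] - out["started_at"]).dt.total_seconds() / 60
    out["day_of_week"] = out["started_at"].dt.day_name()
    out["hour"] = out["started_at"].dt.hour
    # dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday
    out["is_weekend"] = out["started_at"].dt.dayofweek >= 5
    return out


# Example usage (the path pattern is a placeholder):
# trips = add_time_features(load_trips("data/*.csv"))
```

Keeping the loading and the feature engineering in separate functions makes it easy to check the derived columns on a small hand-made DataFrame before running on the full data.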
I utilized several data visualization techniques, such as histograms, bar plots, and line plots, to explore the bike sharing data. Additionally, I plotted the daily number of trips over time and generated maps of the most popular routes using the Folium library. To predict the number of rides on a given day, I employed the Prophet forecasting model from Facebook. This model is designed for time series forecasting with multiple seasonalities and can account for holidays and special events. I preprocessed the data for prediction by aggregating the number of rides per day, and then I trained the Prophet model with daily, weekly, and yearly seasonality components. I also performed cross-validation and calculated performance metrics to evaluate the model. Finally, I generated a forecast for the next year and visualized the results.
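The forecasting workflow described above might look something like the sketch below. It is an illustration under assumptions, not the study's actual code: the `started_at` column name, the use of US holidays, and the 5-day cross-validation horizon are all guesses, and the function names are hypothetical. Prophet expects a DataFrame with a date column `ds` and a value column `y`.

```python
import pandas as pd


def daily_ride_counts(trips):
    """Aggregate one-row-per-trip data into Prophet's ds/y layout."""
    days = pd.to_datetime(trips["started_at"]).dt.floor("D")
    counts = days.value_counts().sort_index()
    return pd.DataFrame({"ds": counts.index, "y": counts.to_numpy()})


def fit_and_forecast(daily, periods=365):
    """Fit Prophet with three seasonalities and forecast `periods` days ahead."""
    from prophet import Prophet  # imported here so the helper above works without it

    model = Prophet(
        daily_seasonality=True,
        weekly_seasonality=True,
        yearly_seasonality=True,
    )
    model.add_country_holidays(country_name="US")  # assumption: US holiday calendar
    model.fit(daily)
    future = model.make_future_dataframe(periods=periods)
    return model, model.predict(future)


def evaluate(model, horizon="5 days"):
    """Cross-validate a fitted model; the horizon value is an assumption."""
    from prophet.diagnostics import cross_validation, performance_metrics

    cv_results = cross_validation(model, horizon=horizon)
    return performance_metrics(cv_results)
```

The returned forecast includes `yhat`, `yhat_lower`, and `yhat_upper` columns, which are what the plotted prediction and its interval are drawn from.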
The Bike Trips Over Time chart also highlights the increased popularity of bike usage during warmer months. We can also see sharp declines when the weather gets colder.
Some notable drops on holidays include:
Valentine's Day: Can't be sweaty before your date.
Mother's Day: Can't be sweaty before seeing your mom.
Labor Day: Can't be sweaty before your BBQ.
Christmas: Can't be sweaty before opening presents.
People prefer to ride during the week, unless it's a Monday. Wednesday is the most popular day and Sunday is the least popular. Riders see Sunday as a day of rest and also hate Mondays.
After 10 A.M. there is a steady increase until 5 P.M., which is the most popular time to ride, followed by 6 P.M. Interestingly, 8 A.M. is a standout and among the most popular times, likely due to people who enjoy riding to work or exercising in the morning. Very few people ride at 12 A.M., and whoever is riding must've had quite the night.
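Counts like the ones behind the day-of-week and hour-of-day charts can be produced with a couple of pandas calls. The sketch below is only an illustration; the function name is hypothetical and the input is assumed to be a list or column of trip start timestamps.

```python
import pandas as pd


def rides_by_weekday_and_hour(start_times):
    """Return ride counts per weekday name and per hour of day (0-23)."""
    starts = pd.to_datetime(pd.Series(start_times))
    by_day = starts.dt.day_name().value_counts()
    by_hour = starts.dt.hour.value_counts().sort_index()
    return by_day, by_hour


# Tiny made-up example, not real trip data
by_day, by_hour = rides_by_weekday_and_hour(
    ["2022-06-01 08:05", "2022-06-01 17:10", "2022-06-05 17:45"]
)
print(by_day.idxmax())   # weekday with the most rides in this toy sample
print(by_hour.idxmax())  # hour with the most rides in this toy sample
```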
The majority of bicycle journeys are relatively brief, as shown in the bike trip duration histogram, with most trips lasting between 1 and 15 minutes. Seldom do people ride longer, let alone for an hour.
Citibike charges $4.49 to unlock a single ride, which includes 30 minutes; however, most riders aren't even close to that. The membership plan is $205/year, or $17.08/month, and allows unlimited 45-minute rides. If you plan on taking more than three single rides a month (more than 90 minutes at 30 minutes each), you're better off getting a membership.
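The break-even point can be checked with a few lines of arithmetic. The prices are the ones quoted above; the variable names are just for illustration, and the comparison assumes every single ride stays within its included 30 minutes.

```python
import math

SINGLE_RIDE_USD = 4.49          # per unlock, 30 minutes included
ANNUAL_MEMBERSHIP_USD = 205.00  # unlimited 45-minute rides
INCLUDED_MINUTES = 30

monthly_membership = ANNUAL_MEMBERSHIP_USD / 12

# Smallest number of single rides per month that costs more than membership
break_even_rides = math.floor(monthly_membership / SINGLE_RIDE_USD) + 1

# Riding time you can buy with single rides before membership becomes cheaper
minutes_before_break_even = (break_even_rides - 1) * INCLUDED_MINUTES

print(f"Membership per month: ${monthly_membership:.2f}")
print(f"Single rides per month at which membership wins: {break_even_rides}")
print(f"Minutes of single rides before that point: {minutes_before_break_even}")
```

Three single rides cost $13.47 and four cost $17.96, so the fourth ride in a month is the one that tips the balance toward the $17.08 membership.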
Casual riders try to get the most out of their $4.49 by riding almost twice as long as members, who instead take advantage of their unlimited 45-minute rides.
Members rode a whopping 24 million times, while casual riders rode only 6 million times. Members are clearly taking advantage of their unlimited 45-minute rides.
Fun Fact: A single ride costs $4.49, so if all 30 million rides were single rides, Citibike would have made $134,700,000 in 2022.
With a total of over 30 million rides in 2022, Citibike is clearly a popular mode of transportation in NYC.
Where you start is where you end.
Even after accounting for non-rides and reracks, the most popular routes start and end at the same station. This is likely due to people who rent a bike, ride it, and then return it to the same location.
Fun Fact: The most popular route is from 11th Ave & W 41st St to 11th Ave & W 41st St, which is a 0.0 mile ride.
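Route rankings like these come down to counting trips per start and end station pair. The sketch below shows one way to do that with pandas, with a switch for dropping trips that start and end at the same station. The column names `start_station_name` and `end_station_name` and the function name are assumptions.

```python
import pandas as pd


def top_routes(trips, n=10, include_round_trips=True):
    """Count trips per (start station, end station) pair, most frequent first."""
    if not include_round_trips:
        # Keep only trips that end at a different station than they started
        trips = trips[trips["start_station_name"] != trips["end_station_name"]]
    counts = (
        trips.groupby(["start_station_name", "end_station_name"])
        .size()
        .sort_values(ascending=False)
        .head(n)
    )
    return counts.reset_index(name="rides")


# Made-up example with placeholder station names
example = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "A", "A", "B"],
    "end_station_name":   ["A", "A", "A", "B", "B", "A"],
})
print(top_routes(example))
print(top_routes(example, include_round_trips=False))
```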
Where you start is where you end, unless you don't.
Riders who don't end up at the same location tend to ride for 3 avenues or 9 city blocks.
Fun Fact: The longest route excluding same start-end location is from Chambers St & Greenway to 10th Ave & W 14 St, which is right along the Hudson River!
Still a lot of same start-end locations, but we can see more routes that are not the same start-end location.
One of the most interesting routes is the one straight through Central Park. Going all the way from start to end is a 2.5 mile ride, which is a great way to see the park.
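Route lengths like the ones quoted in this section can be approximated from the start and end station coordinates. The sketch below uses the haversine formula, which gives the straight-line distance over the Earth's surface rather than the path actually ridden; it is an assumption that distances were computed this way, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# A trip that starts and ends at the same coordinates has zero distance
print(haversine_miles(40.76, -73.99, 40.76, -73.99))
```

Because this measures a straight line, a round trip back to the same station shows up as a 0.0 mile ride no matter how far the bike actually went.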
Time series analysis was chosen because it is well-suited for handling data with temporal dependencies, seasonality, and forecasting requirements. It can provide valuable insights into the factors affecting bike ride demand and inform decision-making for bike-sharing companies. The Prophet model was chosen for Citibike ride forecasting because it can effectively handle seasonality, holidays, special events, and non-linear trends while being easy to implement and scalable. These characteristics make it a good choice for forecasting the number of bike rides per day, given the seasonal patterns and the influence of holidays and special events on the data.
RMSE and MAE dip slightly from the 1-day to the 3-day horizon and then rise, reaching their highest values at 5 days. Larger errors at the longest horizons are expected because predicting further into the future is generally more challenging.
The coverage values are relatively high, at 0.9 for horizons of 1 to 3 days and 0.8 for horizons of 4 to 5 days. This means that the model's predicted confidence intervals capture the actual values 80 to 90 percent of the time, which is a positive sign.
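For reference, the error metrics and coverage discussed here can be computed by hand as in the sketch below. It uses the standard definitions (RMSE is the square root of MSE, and coverage is the share of actual values that fall inside the predicted interval); the function name and the sample numbers are made up for illustration and are not taken from the model's output.

```python
import math


def forecast_metrics(actual, predicted, lower, upper):
    """Compute MSE, RMSE, MAE, MAPE, and interval coverage for a forecast."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the actual value, so it is undefined if any actual is 0
    mape = sum(abs(e / a) for a, e in zip(actual, errors)) / n
    # Fraction of actual values inside [lower, upper]
    coverage = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper)) / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "mape": mape,
        "coverage": coverage,
    }


# Made-up daily ride counts and forecasts
metrics = forecast_metrics(
    actual=[100, 200, 300, 400],
    predicted=[110, 190, 300, 440],
    lower=[90, 195, 250, 300],
    upper=[120, 250, 350, 390],
)
print(metrics)
```

The same relationship holds in the reported results: for the 1-day horizon, the square root of the MSE of 3.987044e+08 is about 19,967.6, which matches the RMSE column.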
Forecasting a year into the future is a challenging task, and the model's performance is not as good as it is for the shorter horizons. The model's performance could be improved by including more data, such as weather data, and by tuning the model's hyperparameters.
However, the model's performance is good enough to provide useful insights into the factors affecting bike ride demand and to inform decision-making for bike-sharing companies.
We can see that the model is able to capture the seasonality and the general trend of the data. The model's predictions are close to the actual values, and the confidence intervals capture the actual values a significant proportion of the time.
The model predicts an increase in bike usage over the next year.
| Horizon | MSE | RMSE | MAE | MAPE | MDAPE | SMAPE | Coverage |
|---|---|---|---|---|---|---|---|
| 1 day | 3.987044e+08 | 19967.582579 | 15556.554539 | 0.467577 | 0.151317 | 0.411260 | 0.9 |
| 2 days | 3.370658e+08 | 18359.352086 | 14571.856889 | 0.231151 | 0.214784 | 0.265903 | 0.9 |
| 3 days | 2.996783e+08 | 17311.218383 | 13771.221513 | 0.192709 | 0.140566 | 0.192407 | 0.9 |
| 4 days | 4.665819e+08 | 21600.507711 | 15539.015992 | 0.187608 | 0.129567 | 0.228013 | 0.8 |
| 5 days | 6.328041e+08 | 25155.597256 | 19667.254203 | 0.242874 | 0.197263 | 0.301846 | 0.8 |