Intro
In our last episode, we looked at Seoul Bike Stations across the city of Seoul.
In this post, we will explore the trips taken in April of 2021.
On top of the GPS/location feature from the station dataset, the trips dataset has time features (time of rent, time of return), which raise exciting questions like:
How do trips differ by county, hour of the day, or day of the week?
Do counties or stations have distinct characteristics with which we can cluster them into a few groups?
Data
EDA
1. Which Day of Week and Hour had the most frequent trips?
Let's start with the simplest one.
We will look at trips taken by Day of Week, Hour, County, respectively.
1.1. # of Trips by Day of Week
There were 27.1% more trips during weekends than weekdays.
During weekdays, people rode more often on Mondays and Tuesdays.
During weekends, people came out to ride bikes more on Saturday than Sunday.
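Here is a minimal sketch of how counts like these could be computed with pandas. The column name `rent_time` and the six sample rows are made up for illustration; the real dataset's schema and numbers will differ.

```python
import pandas as pd

# Hypothetical sample: one row per trip, with the time the bike was rented.
trips = pd.DataFrame({
    "rent_time": pd.to_datetime([
        "2021-04-05 08:10",  # Monday
        "2021-04-05 18:05",  # Monday
        "2021-04-06 08:20",  # Tuesday
        "2021-04-10 15:00",  # Saturday
        "2021-04-10 17:30",  # Saturday
        "2021-04-11 16:45",  # Sunday
    ])
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Count trips per day of week, keeping calendar order and zero-trip days.
trips_by_day = (
    trips["rent_time"].dt.day_name()
    .value_counts()
    .reindex(day_order, fill_value=0)
)

# Share of all trips that started on a Saturday or Sunday.
weekend_share = (trips["rent_time"].dt.dayofweek >= 5).mean()
```

Reindexing against `day_order` keeps the bars in calendar order when the result is plotted, instead of sorting them by count.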
1.2. # of Trips by Hour
We can presume that the trip pattern might look different between weekdays and weekends. Let's compare them and see if the difference exists.
On Weekdays
On Weekends
On Weekdays, relatively more trips are taken at 8 am and 6 pm, which are commuting hours. Yes, people work 9 to 6 in Korea (one more hour than 9-5 in the US).
On Weekends, trips increase through the afternoon and peak around 6 pm. This may vary depending on the season and the sunset time.
Outside commuting hours, Rent > Return in the afternoon and Rent < Return in the evening.
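The hourly comparison can be sketched roughly as follows. The column names `rent_time` and `return_time`, the helper `hourly_counts`, and the sample rows are all assumptions for illustration, not the actual dataset.

```python
import pandas as pd

# Hypothetical sample trips; column names are assumptions.
trips = pd.DataFrame({
    "rent_time": pd.to_datetime([
        "2021-04-05 08:10", "2021-04-05 08:40", "2021-04-05 17:50",
        "2021-04-10 15:20", "2021-04-10 17:45",
    ]),
    "return_time": pd.to_datetime([
        "2021-04-05 08:30", "2021-04-05 09:05", "2021-04-05 18:10",
        "2021-04-10 16:10", "2021-04-10 18:20",
    ]),
})

def hourly_counts(timestamps: pd.Series) -> pd.DataFrame:
    """Count events per hour (0-23), split into weekday and weekend columns."""
    hour = timestamps.dt.hour.rename("hour")
    day_type = (
        timestamps.dt.dayofweek.ge(5)
        .map({True: "weekend", False: "weekday"})
        .rename("day_type")
    )
    counts = pd.crosstab(hour, day_type)
    # Fill in hours and day types that did not occur in the data.
    return counts.reindex(index=range(24), columns=["weekday", "weekend"], fill_value=0)

rents = hourly_counts(trips["rent_time"])
returns = hourly_counts(trips["return_time"])

# Positive values: more bikes rented than returned in that hour.
net_rents = rents - returns
```

Counting rents and returns separately matters because a trip that starts at 5:50 pm and ends at 6:10 pm lands in two different hourly buckets.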
1.3. Does Hourly Trip Pattern Look Different by County?
Rent
Return
All counties have very similar hourly trip patterns.
The dark colors in commute time stick out which makes me wonder:
In which county is the # of Rent > # of Return?
In which county is the # of Rent < # of Return?
1.4. Rent to Return Ratio by County during commute time
Let's extract the trips taken during commuting time (morning commute at 8 am and evening commute at 6 pm). In addition, let's look at the Rent to Return Ratio.
In the morning commute time, Seongbuk had 26% more Rents than Returns.
In the evening commute time, Geumcheon had 10% more Returns than Rents.
In the evening commute time, Seongbuk had 15% more Returns than Rents.
However, the two plots above order the y-axis differently, so it's quite difficult to compare the morning and evening commute times.
Let's visualize just the Rents this time because Returns = Trips - Rents.
The morning commute bars gradually decrease, while the evening commute bars gradually increase.
In other words, counties with a higher Rent rate in the morning have a higher Return rate in the evening (The correlation is 0.8).
We can imagine that people ride bikes to commute from home -> work -> home on weekdays.
Well, this may sound pretty obvious, but now we can back up our guess with this visualization!
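To make the comparison concrete, here is a small sketch under stated assumptions: the counts are invented, the county labels A, B, C are placeholders, and `rent_share` is a name I made up for Rents / (Rents + Returns).

```python
import pandas as pd

# Hypothetical counts of rents and returns per county in each commute hour.
counts = pd.DataFrame({
    "county":  ["A", "A", "B", "B", "C", "C"],
    "period":  ["morning", "evening"] * 3,
    "rents":   [130, 90, 100, 100, 80, 120],
    "returns": [100, 110, 100, 100, 100, 95],
})

# Rent share = Rents / (Rents + Returns); the return share is 1 - rent share.
counts["rent_share"] = counts["rents"] / (counts["rents"] + counts["returns"])
rent_share = counts.pivot(index="county", columns="period", values="rent_share")

# Correlate the morning rent share with the evening return share across counties.
corr = rent_share["morning"].corr(1 - rent_share["evening"])
```

Working with a share instead of raw counts puts busy and quiet counties on the same 0 to 1 scale, so a single sorted bar chart can show both commute periods.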
2. Which County has a High Trip Distance and Trip Hour?
We can assume that a county with a high average trip distance and trip hour has more riders who travel farther or ride longer.
Let's first look at the trip distance.
2.1. # of Trips by Distance
Let's draw the distribution in a boxplot and a distplot.
The distribution is skewed due to outliers. Let's remove the outliers for the sake of better analysis.
The distribution is still skewed a little bit, but the median is 1798 meters.
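The post does not say which outlier rule was used, so the sketch below assumes the common 1.5 x IQR rule (Tukey's fences). The distances are made up.

```python
import pandas as pd

# Hypothetical trip distances in meters, including two extreme values.
distance = pd.Series([400, 900, 1500, 1800, 2100, 3000, 4200, 25000, 60000])

# Assumed rule: keep values within 1.5 * IQR of the first and third quartiles.
q1, q3 = distance.quantile([0.25, 0.75])
iqr = q3 - q1
in_range = distance.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = distance[in_range]

median_before = distance.median()
median_after = trimmed.median()
```

The median moves far less than the mean when the extreme trips are dropped, which is one reason it is a convenient summary for a skewed distribution like this one.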
Let's dig a little deeper by looking at distributions by county.
The distributions are skewed in every county.
Counties with a longer distance tend to have a higher median as well as more outliers.
2.2. # of Trips by Trip Hour
Mode: 5 Minutes
Median: 16 Minutes
Truncated Mean (5~95%): 23 Minutes
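For reference, these three statistics could be computed along the following lines. The durations are invented, and I am assuming that "truncated mean (5~95%)" means averaging only the values between the 5th and 95th percentiles.

```python
import pandas as pd

# Hypothetical trip durations in minutes, with one very long trip.
minutes = pd.Series([5, 5, 5, 8, 12, 16, 16, 20, 30, 45, 240])

mode = minutes.mode().iloc[0]
median = minutes.median()

# Truncated mean: average only the values between the 5th and 95th percentiles.
low, high = minutes.quantile([0.05, 0.95])
truncated_mean = minutes[minutes.between(low, high)].mean()
```

In this toy sample the single 240-minute trip is cut off, so the truncated mean ends up well below the plain mean.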
The distributions of trip distance and trip hour look very similar.
It's interesting that some counties (Yongsan, Gangnam) have more outliers than others (Gangseo).
Putting both distance and trip hour into maps:
Yongsan has the highest distance and trip hours.
Distance and Trip Hour have a very strong correlation (0.95). This is no surprise, since riding farther takes longer.
3. Which County Has a High Outflow/Inflow of riders?
Outflow/Inflow represents the trips from County A to County B. We can get the riders traveling across counties this way.
Since the number of trips differs in scale from county to county, we will look at relative ratios instead of absolute values.
Outflow Ratio:
Return at a different county after Renting at County A / Total number of Rents at County A
Inflow Ratio:
Return at County A after Renting at a different county / Total number of Returns at County A
In other words, we want to know what percentage of Rents (Returns) in County A are Returned (Rented) in another county.
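The two definitions above translate fairly directly into pandas. This is only a sketch: the column names `rent_county` and `return_county` are assumptions, and the eight trips are made up.

```python
import pandas as pd

# Hypothetical trips: the county where each bike was rented and returned.
trips = pd.DataFrame({
    "rent_county":   ["Jung", "Jung", "Jung", "Jung", "Mapo", "Mapo", "Mapo", "Mapo"],
    "return_county": ["Jung", "Mapo", "Mapo", "Mapo", "Mapo", "Mapo", "Mapo", "Jung"],
})

# True when a trip ends in a different county than it started in.
crossed = trips["rent_county"] != trips["return_county"]

# Outflow ratio: share of a county's rents that were returned elsewhere.
outflow_ratio = crossed.groupby(trips["rent_county"]).mean()

# Inflow ratio: share of a county's returns that were rented elsewhere.
inflow_ratio = crossed.groupby(trips["return_county"]).mean()
```

Taking the mean of a boolean column gives the fraction of `True` values, so each ratio comes out of one `groupby` without computing the numerator and denominator separately.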
3.1. Mapping Outflow/Inflow Ratio of Counties
We can first check which counties have a high outflow ratio or inflow ratio.
Jung (located in the center) has the highest outflow and inflow ratio. Counties in the center of Seoul tend to have both high outflow and inflow ratios.
Outflow ratio and inflow ratio have similar distributions. In other words, a county with a high outflow ratio has a high inflow ratio (correlation = 0.97).
3.2. Adding Hour to Outflow/Inflow Ratio of Counties
Let's throw time into the analysis and dig in for more insights.
The outflow/inflow ratio has slightly different patterns! This finding triggers a question:
Are there counties whose outflow/inflow ratios change over time in a similar way?
Since we have to look at the outflow/inflow ratio at the same time period, let's take "Measure = Inflow Ratio - Outflow Ratio".
If Measure > 0, more riders are "coming to" the county in that time period.
If Measure < 0, more riders are "going out" of the county in that time period.
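As a rough sketch of the Measure, assuming hypothetical columns for the rent/return county and hour and a handful of made-up trips between two placeholder counties A and B:

```python
import pandas as pd

# Hypothetical trips with rent/return county and hour; names are assumptions.
trips = pd.DataFrame({
    "rent_county":   ["A", "A", "A", "B", "B", "A", "B", "B"],
    "return_county": ["B", "B", "A", "B", "A", "A", "A", "B"],
    "rent_hour":     [8, 8, 8, 8, 18, 18, 18, 18],
    "return_hour":   [8, 8, 8, 8, 18, 18, 18, 18],
})
crossed = trips["rent_county"] != trips["return_county"]

# Outflow ratio per (county, hour), grouped by where and when bikes were rented.
outflow = (
    crossed.groupby([trips["rent_county"], trips["rent_hour"]]).mean()
    .rename_axis(["county", "hour"])
)

# Inflow ratio per (county, hour), grouped by where and when bikes were returned.
inflow = (
    crossed.groupby([trips["return_county"], trips["return_hour"]]).mean()
    .rename_axis(["county", "hour"])
)

# Measure > 0: riders are coming in; Measure < 0: riders are going out.
# A (county, hour) pair missing from either ratio comes out as NaN.
measure = (inflow - outflow).unstack("hour")
```

In this toy data, county A has a negative Measure at 8 and a positive one at 18, the pattern we would expect from a residential area.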
Let's cluster counties with similar patterns in Measure by time period.
Do you see the 3 clusters!?
I've named the clusters A, B, and C.
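The clustering method is not spelled out here, so the following is just one way it might be done: hierarchical clustering with Ward linkage from SciPy, cut into 3 groups. The Measure values and the county labels `c1` to `c6` are invented.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical Measure (inflow ratio - outflow ratio) per county at three hours.
measure = pd.DataFrame(
    {
        8:  [0.30, 0.25, -0.28, -0.32, 0.02, -0.01],
        13: [0.01, -0.02, 0.02, 0.00, 0.01, -0.01],
        18: [-0.27, -0.31, 0.29, 0.26, -0.02, 0.03],
    },
    index=["c1", "c2", "c3", "c4", "c5", "c6"],
)

# Ward linkage groups counties whose hourly Measure profiles are close.
tree = linkage(measure.to_numpy(), method="ward")

# Cut the tree into 3 clusters; the label numbers themselves are arbitrary.
labels = pd.Series(fcluster(tree, t=3, criterion="maxclust"), index=measure.index)
```

Each county is treated as a vector of hourly Measure values, so counties end up in the same cluster when their inflow and outflow swing in the same direction at the same hours.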
Now, let's plot the Measure by time.
Red = Outflow > Inflow
Green = Outflow < Inflow
The clusters make sense!
Cluster A represents counties with high inflow during morning commute time and high outflow during evening commute time. These counties tend to have relatively more companies than other counties.
Cluster B is the complete opposite of Cluster A. These counties are well known for being residential areas.
Cluster C does not belong to either Cluster A or B. Its patterns are not as clear as those of the other clusters.
Some abnormal patterns:
Jung has a significantly high outflow ratio from 2 to 4 am
Gangbuk has a very high inflow ratio at 3 am
One thing that crosses my mind is that these anomalies may be due to trucks distributing bikes to stations in other counties.
3.3. How do Seoul-ers in Each County Commute?
Let's visualize it in a heatmap.
Counties with a high outflow ratio during the morning commute time have a high inflow ratio in the evening commute time.
Summary
Instead of summarizing everything, here are the counties with distinct characteristics:
Counties with High Frequency of Trips
Gangnam, Gangseo
County with High Distance and Trip Hour
Yongsan
County with High Outflow, Inflow Ratio
Jung
County with Drastic Change in Outflow, Inflow Ratios During Commute Time
Geumcheon
In the next episode, I will build predictive models to predict the number of bikes available at each station at a specific time of the day.