
[Seoul Bike Sharing #2] - Clustering Counties by Inbound-Outbound Trip Patterns

Jiehwan Yang

Updated: Jan 26, 2022

Intro

In our last episode, we looked at Seoul Bike Stations across the city of Seoul.


In this post, we will explore the trips taken in April of 2021.

On top of the GPS/location feature from the station dataset, the trips dataset has a time feature (time of rent, time of return), which raises exciting questions like:

  • How do trips differ by county, hour of the day, or day of the week?

  • Do counties or stations have distinct characteristics with which we can cluster them into a few groups?

 
Data

 
EDA


1. Which Day of Week and Hour had the most frequent trips?


Let's start with the simplest one.

We will look at trips taken by Day of Week, Hour, County, respectively.


1.1. # of Trips by Day of Week



  • There were 27.1% more trips during weekends than weekdays.

  • During weekdays, people rode more often on Mondays and Tuesdays.

  • During weekends, people came out to ride bikes more on Saturday than Sunday.
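As a rough sketch of how numbers like these can be produced, the snippet below counts trips per day of week and compares weekends against weekdays as per-day averages. The timestamps are made-up sample values, and the per-day averaging is my assumption about a fair way to compare 2 weekend days against 5 weekdays, not necessarily how the figure above was computed.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rent timestamps (a tiny stand-in for the April 2021 trips).
rent_times = [
    "2021-04-03 14:05",  # Saturday
    "2021-04-03 18:30",  # Saturday
    "2021-04-04 11:10",  # Sunday
    "2021-04-05 08:02",  # Monday
    "2021-04-06 18:15",  # Tuesday
]
parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in rent_times]

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Number of trips per day of week (weekday() returns 0 for Monday).
trips_by_day = Counter(DAY_NAMES[t.weekday()] for t in parsed)

# Trips per calendar date, so weekends and weekdays can be compared
# as per-day averages rather than raw totals.
trips_per_date = Counter(t.date() for t in parsed)
weekend = [n for d, n in trips_per_date.items() if d.weekday() >= 5]
weekday = [n for d, n in trips_per_date.items() if d.weekday() < 5]

weekend_avg = sum(weekend) / len(weekend)
weekday_avg = sum(weekday) / len(weekday)
pct_more_on_weekends = (weekend_avg / weekday_avg - 1) * 100

print(trips_by_day)
print(f"{pct_more_on_weekends:.1f}% more trips per weekend day")
```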



1.2. # of Trips by Hour


We can presume that the trip pattern might look different between weekdays and weekends. Let's compare them and see if the difference exists.


On Weekdays

On Weekends


  • On Weekdays, relatively more trips are taken at 8 am and 6 pm, which lines up with commuting hours. Yes, people work 9 to 6 in Korea (one more hour than 9-5 in the US).

  • On Weekends, more trips are taken as it gets late in the afternoon and hits the peak around 6 pm. This may vary depending on the season and the sunset time.

  • Outside of commuting time, Rent > Return in the afternoon and Rent < Return in the evening.
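Here is a minimal sketch of the counting behind this kind of comparison: rents are bucketed by the hour of rent and returns by the hour of return, separately for weekdays and weekends. The sample trips and the helper name `day_type` are hypothetical.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Hypothetical trips as (rent_time, return_time) pairs.
trips = [
    ("2021-04-05 07:50", "2021-04-05 08:10"),  # Monday
    ("2021-04-05 08:05", "2021-04-05 08:25"),  # Monday
    ("2021-04-05 17:55", "2021-04-05 18:20"),  # Monday
    ("2021-04-10 15:40", "2021-04-10 16:30"),  # Saturday
]

def day_type(t):
    """Label a timestamp as 'weekend' (Sat/Sun) or 'weekday'."""
    return "weekend" if t.weekday() >= 5 else "weekday"

rents, returns = Counter(), Counter()
for rent_s, return_s in trips:
    rent_t = datetime.strptime(rent_s, FMT)
    return_t = datetime.strptime(return_s, FMT)
    rents[(day_type(rent_t), rent_t.hour)] += 1
    returns[(day_type(return_t), return_t.hour)] += 1

# Net flow per (day type, hour): positive means Rent > Return.
net = {key: rents[key] - returns[key] for key in set(rents) | set(returns)}

for key in sorted(net):
    print(key, "rents:", rents[key], "returns:", returns[key], "net:", net[key])
```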


1.3. Does Hourly Trip Pattern Look Different by County?


Rent

Return

All counties have very similar hourly trip patterns.

The dark colors during commute time stand out, which makes me wonder:

  • In which county is the # of Rent > # of Return?

  • In which county is the # of Rent < # of Return?


1.4. Rent to Return Ratio by County during commute time


Let's extract the trips taken during commuting time (morning commute time 8 am & evening commute time 6 pm). In addition, let's look at the Rent to Return Ratio.


  • In the morning commute time, Seongbuk had 26% more Rents than Returns.

  • In the evening commute time, Geumcheon had 10% more Returns than Rents.

  • In the evening commute time, Seongbuk had 15% more Returns than Rents.
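To make the ratio concrete, here is a small sketch that computes Rents / Returns per county for a given hour. A ratio of 1.26 would read as "26% more Rents than Returns". The tuple layout and the sample trips are assumptions for illustration, not the real data.

```python
from collections import Counter

# Hypothetical weekday trips:
# (rent_county, rent_hour, return_county, return_hour)
trips = [
    ("Seongbuk", 8, "Jongno", 8),
    ("Seongbuk", 8, "Seongbuk", 8),
    ("Seongbuk", 8, "Jung", 8),
    ("Jung", 8, "Jung", 8),
    ("Jung", 18, "Seongbuk", 18),
    ("Jongno", 18, "Seongbuk", 18),
]

def rent_to_return_ratio(trips, hour):
    """Rents / Returns per county, counting only events in the given hour."""
    rents = Counter(rc for rc, rh, _, _ in trips if rh == hour)
    returns = Counter(tc for _, _, tc, th in trips if th == hour)
    # Only counties with at least one return, to avoid dividing by zero.
    return {county: rents[county] / returns[county] for county in returns}

morning = rent_to_return_ratio(trips, hour=8)
evening = rent_to_return_ratio(trips, hour=18)
print(morning)
print(evening)
```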

However, the two plots above have a different order on the y-axis, so it's quite difficult to compare the difference in morning and evening commute time.

Let's visualize just the Rents this time because Returns = Trips - Rents.


The morning commute bars gradually decrease, while the evening commute bars gradually increase.

In other words, counties with a higher Rent rate in the morning have a higher Return rate in the evening (The correlation is 0.8).
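A correlation like this can be checked with a plain Pearson coefficient between the morning Rent rate and the evening Return rate of each county. The sketch below uses made-up counts for three placeholder counties, and defines the Rent rate as Rents / (Rents + Returns), which is my reading of "Returns = Trips - Rents".

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (rents, returns) counts per county during each commute hour.
morning = {"county_1": (56, 44), "county_2": (50, 50), "county_3": (45, 55)}
evening = {"county_1": (46, 54), "county_2": (49, 51), "county_3": (53, 47)}

counties = sorted(morning)
morning_rent_rate = [morning[c][0] / sum(morning[c]) for c in counties]
evening_return_rate = [evening[c][1] / sum(evening[c]) for c in counties]

r = pearson(morning_rent_rate, evening_return_rate)
print(f"correlation = {r:.2f}")
```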


We can imagine that people ride bikes to commute from home -> work -> home on weekdays.


Well, this may sound pretty obvious, but now we can back up our guess with this visualization!

 

2. Which County has a High Trip Distance and Trip Hour?


We can assume that a county with a high average trip distance and trip hour has more riders who travel farther or ride longer.


Let's first look at the trip distance.



2.1. # of Trips by Distance


Let's draw the distribution in a boxplot and a distplot.

The distribution is skewed due to outliers. Let's remove the outliers for the sake of better analysis.


The distribution is still skewed a little bit, but the median is 1798 meters.
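The post does not depend on a particular outlier rule, so as an illustration only, here is a sketch using the common 1.5 x IQR rule on a few made-up distances. The rule and the sample values are assumptions, not necessarily what was used for the plots above.

```python
from statistics import median, quantiles

# Hypothetical trip distances in meters, with one obvious outlier.
distances_m = [500, 900, 1200, 1800, 2100, 2600, 3500, 60000]

# Quartiles: quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, _, q3 = quantiles(distances_m, n=4)
iqr = q3 - q1

# Keep only values within 1.5 * IQR of the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = [d for d in distances_m if lower <= d <= upper]

median_kept = median(kept)
print(f"kept {len(kept)} of {len(distances_m)} trips, median = {median_kept} m")
```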

Let's dig a little deeper by looking at distributions by county.


  • The distributions are skewed in every county.

  • Counties with a longer distance tend to have a higher median as well as more outliers.


2.2. # of Trips by Trip Hour (Duration)

Distribution of Trips by Hour (All Trips except Outliers)

Mode: 5 minutes, Median: 16 minutes, Truncated Mean (5~95%): 23 minutes
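For readers unfamiliar with a truncated mean, the idea is to sort the values, drop the lowest 5% and the highest 5%, and average the rest. A small sketch with made-up durations (the helper name `truncated_mean` is hypothetical):

```python
from statistics import median, mode

# Hypothetical trip durations in minutes (20 values, one very long trip).
durations_min = [5, 5, 5, 7, 9, 10, 12, 14, 15, 16,
                 16, 18, 20, 22, 25, 30, 35, 45, 60, 240]

def truncated_mean(values, percent=5):
    """Mean after dropping the lowest and highest `percent` % of values."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100  # number of values cut from each end
    trimmed = ordered[k:len(ordered) - k]
    return sum(trimmed) / len(trimmed)

print("mode:", mode(durations_min))
print("median:", median(durations_min))
print("truncated mean (5~95%):", round(truncated_mean(durations_min), 1))
print("plain mean:", sum(durations_min) / len(durations_min))
```

The single 240-minute trip pulls the plain mean up noticeably, while the truncated mean stays closer to the bulk of the trips.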


Distributions for distance and hour look very similar.


By county, Trip Hour also looks very similar to Trip Distance.

It's interesting that some counties (Yongsan, Gangnam) have more outliers than others (Gangseo).


Putting both distance and trip hour into maps:

  • Yongsan has the highest distance and trip hours.

  • Distance and Trip Hour have a very strong correlation (0.95). This is not a surprise since going farther takes longer.

 

3. Which County Has a High Outflow/Inflow of riders?


Outflow/Inflow represents the trips from County A to County B. We can get the riders traveling across counties this way.


Since the scale of the number of trips differs by counties, we will look at a relative ratio between counties instead of absolute value.

  • Outflow Ratio:

    • Return at a different county after Renting at County A / Total number of Rents at County A

  • Inflow Ratio:

    • Return at County A after Renting at a different county / Total number of Returns at County A

In other words, we want to know what percentage of Rents (Returns) in County A are Returned (Rented) in another county.
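The two definitions above translate into a few counters. This is a minimal sketch on made-up (rent county, return county) pairs, not the real trips.

```python
from collections import Counter

# Hypothetical trips as (rent_county, return_county) pairs.
trips = [
    ("Jung", "Jung"),
    ("Jung", "Jongno"),
    ("Jung", "Mapo"),
    ("Jongno", "Jung"),
    ("Jongno", "Jongno"),
    ("Mapo", "Mapo"),
]

rents, returns = Counter(), Counter()
outflow, inflow = Counter(), Counter()
for rent_county, return_county in trips:
    rents[rent_county] += 1
    returns[return_county] += 1
    if rent_county != return_county:
        outflow[rent_county] += 1   # rented here, returned elsewhere
        inflow[return_county] += 1  # returned here, rented elsewhere

# Outflow ratio: cross-county rents / all rents in the county.
outflow_ratio = {c: outflow[c] / rents[c] for c in rents}
# Inflow ratio: cross-county returns / all returns in the county.
inflow_ratio = {c: inflow[c] / returns[c] for c in returns}

print(outflow_ratio)
print(inflow_ratio)
```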


3.1. Mapping Outflow/Inflow Ratio of Counties


We can first check which counties have a high outflow ratio or inflow ratio.

  • Jung (located in the center) has the highest outflow and inflow ratio. Counties in the center of Seoul tend to have both high outflow and inflow ratios.

  • Outflow ratio and inflow ratio have similar distributions. In other words, a county with a high outflow ratio has a high inflow ratio (correlation = 0.97).


3.2. Adding Hour to Outflow/Inflow Ratio of Counties


Let's throw time into the analysis and dig in for more insights.

The outflow/inflow ratio has slightly different patterns! This finding triggers a question:

  • Are there counties whose outflow/inflow ratios follow a similar pattern over time?

Since we have to look at the outflow/inflow ratio at the same time period, let's take "Measure = Inflow Ratio - Outflow Ratio".

If Measure > 0, more riders are "coming to" the county in that time period.

If Measure < 0, more riders are "going out" of the county in that time period.
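Here is a rough sketch of the Measure per county and hour. I am assuming that the inflow ratio is computed over Returns in that hour and the outflow ratio over Rents in that hour. The sample trips and the function name `measure` are hypothetical.

```python
from collections import Counter

# Hypothetical trips:
# (rent_county, rent_hour, return_county, return_hour)
trips = [
    ("Seongbuk", 8, "Jung", 8),
    ("Seongbuk", 8, "Seongbuk", 8),
    ("Jung", 8, "Jung", 8),
    ("Jung", 18, "Seongbuk", 18),
    ("Jung", 18, "Jung", 18),
    ("Seongbuk", 18, "Seongbuk", 18),
]

rents, returns = Counter(), Counter()
outflow, inflow = Counter(), Counter()
for rent_county, rent_hour, return_county, return_hour in trips:
    rents[(rent_county, rent_hour)] += 1
    returns[(return_county, return_hour)] += 1
    if rent_county != return_county:
        outflow[(rent_county, rent_hour)] += 1
        inflow[(return_county, return_hour)] += 1

def measure(county, hour):
    """Inflow Ratio - Outflow Ratio for one county in one hour."""
    key = (county, hour)
    inflow_ratio = inflow[key] / returns[key] if returns[key] else 0.0
    outflow_ratio = outflow[key] / rents[key] if rents[key] else 0.0
    return inflow_ratio - outflow_ratio

for county in ("Jung", "Seongbuk"):
    for hour in (8, 18):
        print(county, hour, measure(county, hour))
```

In this toy data, Jung gets a positive Measure in the morning and a negative one in the evening, and Seongbuk is the mirror image.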


Let's cluster counties with similar patterns in Measure by time period.



Do you see the 3 clusters!?

I've named the clusters as A, B, C.
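The clustering method is not spelled out here, so as a stand-in, the sketch below groups counties with a tiny k-means on two features: the Measure at 8 am and at 6 pm. The county names and values are placeholders, and k-means itself is my assumption rather than the method behind the plot above.

```python
import math

# Hypothetical (Measure at 8 am, Measure at 6 pm) per placeholder county.
measure_by_county = {
    "county_1": (0.20, -0.18),
    "county_2": (0.16, -0.15),
    "county_3": (-0.19, 0.17),
    "county_4": (-0.14, 0.16),
    "county_5": (0.01, -0.02),
    "county_6": (-0.02, 0.01),
}

def kmeans(points, k, max_iterations=20):
    """Plain k-means; returns one cluster label per point."""
    # Farthest-point initialisation keeps the sketch deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

counties = list(measure_by_county)
labels = kmeans([measure_by_county[c] for c in counties], k=3)
for county, label in zip(counties, labels):
    print(county, "-> cluster", label)
```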


Now, let's plot the Measure by time.

  • Red = Outflow > Inflow

  • Green = Outflow < Inflow


The clusters make sense!


Cluster A represents counties with high inflow during morning commute time and high outflow during evening commute time. These counties tend to have relatively more companies than other counties.

Cluster B is the complete opposite of Cluster A. These counties are well known for being residential areas.

Cluster C does not belong to either Cluster A or B. Patterns are not as clear as other clusters.


Some abnormal patterns:

  • Jung has a significantly high outflow ratio from 2 to 4 am.

  • Gangbuk has a very high inflow ratio at 3 am.

One thing that crosses my mind is that these anomalies may be due to distribution of bikes to stations in other counties by trucks.



3.3. How do Seoul-ers in Each County Commute?


Let's visualize it in a heatmap.

  • Counties with a high outflow ratio during the morning commute time have a high inflow ratio in the evening commute time.

 
Summary

Instead of summarizing everything, here are the counties with distinct characteristics:


Counties with High Frequency of Trips

  • Gangnam, Gangseo

County with High Distance and Trip Hour

  • Yongsan

County with High Outflow, Inflow Ratio

  • Jung

County with Drastic Change in Outflow, Inflow Ratios During Commute Time

  • Geumcheon


In the next episode, I will build predictive models to predict the number of bikes available at each station at a specific time of the day.

 
File Source
  • Jupyter Notebook can be found here

  • Data can be found here
