At the time of this writing, Citibike has served over sixty million bike rides in New York City. The data recorded on these rides totals over 12 GB and continues to grow monthly (https://s3.amazonaws.com/tripdata/index.html).
I'll be analyzing this large dataset with the goal of characterizing each bike station and the relationships between stations. For the most part I'll use a small subset, the data from August 2018: summer data is representative of currently operational stations, since some stations close during the winter.
import pandas as pd
import numpy as np
import plotly
import plotly.plotly as py
plotly.tools.set_credentials_file(username="srveale", api_key="GajxMbxLyoAc3R0I6K6e")
import plotly.graph_objs as go
from plotly import tools
import matplotlib.pyplot as plt
import networkx as nx
from networkx.algorithms import community
import re
import sklearn as sk
import sklearn.cluster as skc
import scipy.spatial.distance as spd
from sklearn import preprocessing
tripdata = pd.read_csv('201808-citibike-tripdata.csv')
tripdata.head()
tripdata.describe()
## Cleaning / Removing suspicious data
print("Total number of trips before cleaning", len(tripdata))
tripdata.columns = ["tripduration","starttime","stoptime",
"start_station_id","start_station_name",
"start_station_latitude","start_station_longitude",
"end_station_id","end_station_name",
"end_station_latitude","end_station_longitude",
"bikeid","usertype","birth_year","gender"]
tripdata = tripdata[tripdata['tripduration'] > 60]
tripdata = tripdata[tripdata['tripduration'] < 60*60*2]
tripdata['birth_year'] = pd.to_numeric(tripdata['birth_year'], errors='coerce')
tripdata = tripdata.dropna(subset=['birth_year'])
tripdata = tripdata[tripdata['birth_year'] > 1939]
print("After removing short trips, long trips, and octogenarians", len(tripdata))
stationCounts = tripdata.groupby('start_station_name', as_index=False).count()
to_remove = stationCounts[stationCounts['tripduration'] < 150]
tripdata = tripdata[~tripdata['start_station_name'].isin(to_remove['start_station_name'])]
print("After removing stations with low trip counts", len(tripdata))
tripdata = tripdata.dropna()
print("After removing null values", len(tripdata))
I removed trips shorter than 60 seconds, as these are likely false starts, or riders re-docking to make sure the bike is secure after their actual ride ends. Rides over two hours are also suspect: it seems unlikely that riders would pay the extra $4.00 per minute rather than rent a bike from another service for less money.
I also removed riders who reported being over 80 years old. That's not to say seniors don't ride bikes, but I find it more likely that these users are misreporting their age.
Then I removed all stations with a total monthly trip count below 150. Many of these appeared to be testing stations that Citibike had not yet put into full production.
Lastly, I dropped rows with null values. I didn't find any, but this step protects the integrity of the rest of the notebook.
All the cleaning brought the trip count down from 1,977,177 to 1,967,317, a loss of about 0.5%.
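As a quick check of that figure:
before, after = 1977177, 1967317
print('{:.2%} of trips removed'.format((before - after) / before))  ## about 0.50%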
stationMeans = tripdata.groupby('start_station_name', as_index=False).mean()
endStationMeans = tripdata.groupby('end_station_name', as_index=False).mean()
There are several perspectives from which we can distinguish one station from another. Here I'll visualize the average trip duration, average rider age, and gender proportion for each station, grouping trips by the station where they started, to get an idea of how these values are spread across stations.
durationByStation = tripdata.groupby('start_station_name', as_index=False).mean()
site_lat = durationByStation['start_station_latitude']
site_lon = durationByStation['start_station_longitude']
locations_name = durationByStation['start_station_name']
data = [
go.Scattermapbox(
lat=site_lat,
lon=site_lon,
mode='markers',
marker=dict(
size=10,
color=durationByStation['tripduration'] / 60,
opacity=0.7,
showscale=True,
),
text=locations_name,
hoverinfo='text'
)]
layout = go.Layout(
title='Average trip duration by start station (minutes)',
hovermode='closest',
showlegend=False,
mapbox=dict(
bearing=0,
center=dict(
lat=40.742327,
lon=-73.973378
),
pitch=0,
zoom=10,
style='light'
),
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Average trip duration by start station')
trace0 = go.Box(
y=durationByStation['tripduration'] / 60
)
data = [trace0]
layout = go.Layout(
title = "Average trip duration by trip starting station",
yaxis=dict(
title='Trip duration (min)',
zeroline=False
),
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
A few stand-out locations have trips that generally take a very long time. For example, trips starting on the island with the Yankee Ferry Terminal are long, which isn't a surprise. The south-east corner of Central Park also shows long trip times, perhaps because tourists start there and circumnavigate the park.
The center of Manhattan has some very low average trip times, probably because many riders are going only a few blocks within the same region.
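To back this up, here's a quick look at the stations with the longest average trips, using the durationByStation frame computed above:
longest = durationByStation.sort_values('tripduration', ascending=False).head(10)
print(longest[['start_station_name', 'tripduration']])  ## tripduration is in seconds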
ageByStation = tripdata.groupby('start_station_name', as_index=False).mean()
ageByStation['age'] = 2019 - ageByStation['birth_year']
site_lat = ageByStation['start_station_latitude']
site_lon = ageByStation['start_station_longitude']
locations_name = ageByStation['start_station_name']
data = [
go.Scattermapbox(
lat=site_lat,
lon=site_lon,
mode='markers',
marker=dict(
size=10,
color=ageByStation['age'],
opacity=0.7,
showscale=True,
),
text=locations_name,
hoverinfo='text'
)]
layout = go.Layout(
title='Average age by start station',
hovermode='closest',
showlegend=False,
mapbox=dict(
bearing=0,
center=dict(
lat=40.742327,
lon=-73.973378
),
pitch=0,
zoom=10,
style='light'
),
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Average age by start station')
trace0 = go.Box(
y=ageByStation['age']
)
data = [trace0]
layout = go.Layout(
title = "Average rider age by trip starting station",
yaxis=dict(
title='Age (Years)',
zeroline=False
),
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
men = tripdata[tripdata.gender == 1].groupby('start_station_name', as_index = False).count()
women = tripdata[tripdata.gender == 2].groupby('start_station_name', as_index = False).count()
total = men['gender'] + women['gender']
menProportion = men['gender'] / total
site_lat = stationMeans['start_station_latitude']
site_lon = stationMeans['start_station_longitude']
locations_name = stationMeans['start_station_name']
data = [
go.Scattermapbox(
lat=site_lat[:765],
lon=site_lon[:765],
mode='markers',
marker=dict(
size=10,
color=menProportion[:765],
opacity=0.7,
showscale=True,
),
text=locations_name,
hoverinfo='text'
)]
layout = go.Layout(
title='Proportion of men by start station',
hovermode='closest',
showlegend=False,
mapbox=dict(
bearing=0,
center=dict(
lat=40.742327,
lon=-73.973378
),
pitch=0,
zoom=10,
style='light'
),
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Proportion of men by start station')
trace0 = go.Box(
y=menProportion[menProportion.isna() == False]
)
data = [trace0]
layout = go.Layout(
title = "Proportion of male riders by trip start station",
yaxis=dict(
title='Proportion of men out of total',
zeroline=False
)
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
Wow, look at all the stations with over 80% men! What could cause this? Here's something I found when looking up Railway Ave & Kay Ave:
https://nypost.com/2014/03/28/citi-bike-racks-sit-unused-in-brooklyn-boondocks/
It seems some stations are in areas where women feel uncomfortable undocking and riding a bike. This is definitely something Citibike needs to factor into its planning.
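To see which stations these are, here's a quick sketch. (This assumes the men and women groupbys line up row-for-row, which holds only if every station has riders of both genders.)
genderCheck = pd.DataFrame({
    'station': men['start_station_name'],
    'men_proportion': menProportion,
})
print(genderCheck.sort_values('men_proportion', ascending=False).head(10))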
Another good way to see the different behaviour at each station is to watch how many bike trips originate from each station as a typical day progresses. The first graph shows the total count of trips started between 8:00am and 9:00am, summed over the days of August 2018.
## Get the number of minutes since the previous midnight
## time is a string e.g. 2013-08-01 00:00:01
def minutesSinceMidnight(time):
    ## Pull out the hh:mm:ss portion; unparseable rows get a sentinel
    ## value that lands outside any real hour-of-day interval.
    try:
        hrminsec = re.findall(r"..:..:..", time)[0]
    except IndexError:
        hrminsec = '99:99:99'
    return int(hrminsec[0:2])*60 + int(hrminsec[3:5]) + np.floor(int(hrminsec[6:8])/60)
tripdata['sinceMidnight'] = tripdata['starttime'].apply(minutesSinceMidnight)
minute_counts = tripdata.groupby(['start_station_name', 'sinceMidnight'], as_index=False).count()
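As an aside, the same column can likely be computed without the per-row regex; a vectorized sketch, assuming starttime parses cleanly with pd.to_datetime:
ts = pd.to_datetime(tripdata['starttime'])
tripdata['sinceMidnight'] = ts.dt.hour * 60 + ts.dt.minute  ## same value as the function above for valid rows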
## Prep for the trip count plot
stationMeans = tripdata.groupby('start_station_name', as_index=False).mean()
cmin = -1
cmax = 300
opacity = 0.7
increment = 8
interval1 = 60*increment
interval2 = interval1+60
incrementCounts = minute_counts[minute_counts['sinceMidnight'] > interval1]
incrementCounts = incrementCounts[incrementCounts['sinceMidnight'] < interval2]
incrementCountsSummed = incrementCounts.groupby('start_station_name', as_index=False).sum()
site_lat = stationMeans['start_station_latitude']
site_lon = stationMeans['start_station_longitude']
locations_name = stationMeans['start_station_name']
data = [
go.Scattermapbox(
lat=site_lat,
lon=site_lon,
mode='markers',
marker=dict(
size=10,
color=incrementCountsSummed['tripduration'],
opacity=opacity,
cmin=cmin,
cmax=cmax,
showscale=True,
),
text=locations_name,
hoverinfo='text'
)]
layout = go.Layout(
title='Total trip count, {:02d}:00 to {:02d}:00'.format(increment, increment + 1),
hovermode='closest',
showlegend=False,
mapbox=dict(
bearing=0,
center=dict(
lat=40.742327,
lon=-73.973378
),
pitch=0,
zoom=10.5,
style='light'
),
)
fig = dict(data=data, layout=layout)
# py.write_image(fig, 'fig1.png')
py.iplot(fig, filename='Trip count by start station {:02d}'.format(increment))
I animated the average hourly trip counts over the hours of a day. If the embedding doesn't work, the video is available at https://s3.amazonaws.com/personal-sv/Citi+Bike+Average+Hourly+Trip+Count.mov
%%HTML
<video width="320" height="240" controls>
<source src="https://s3.amazonaws.com/personal-sv/Citi+Bike+Average+Hourly+Trip+Count.mov" type="video/mp4">
</video>
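For reference, here's a sketch of the loop that could generate one frame per hour, mirroring the plotting block above. Static image export via plotly.io is an assumption; for this plotly version it requires the separate orca dependency, and the frame filenames are hypothetical.
import plotly.io as pio
for hour in range(24):
    hourCounts = minute_counts[(minute_counts['sinceMidnight'] >= hour * 60) &
                               (minute_counts['sinceMidnight'] < (hour + 1) * 60)]
    hourSummed = hourCounts.groupby('start_station_name', as_index=False).sum()
    frame = dict(data=[go.Scattermapbox(
        lat=site_lat, lon=site_lon, mode='markers',
        marker=dict(size=10, color=hourSummed['tripduration'],
                    cmin=cmin, cmax=cmax, opacity=opacity, showscale=True),
        text=locations_name, hoverinfo='text')], layout=layout)
    pio.write_image(frame, 'frame_{:02d}.png'.format(hour))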
Here I'll attempt to treat Citibike stations as nodes in a network, with the goal of finding the most popular trips. Of course, almost every station has at least one trip to every other station, so I'll have to filter out the edges between nodes with low trip counts; I chose 175 as my threshold.
connections = tripdata.groupby(['start_station_name', 'end_station_name'], as_index=False).count()
connections = connections[connections['start_station_name'] != connections['end_station_name']]
small_edge_param = 175
nxGraph = nx.Graph()
for row in connections.iterrows():
    ## row[1] holds (start_station_name, end_station_name, trip count, ...)
    meansStart = stationMeans[stationMeans['start_station_name'] == row[1][0]]
    meansEnd = endStationMeans[endStationMeans['end_station_name'] == row[1][1]]
    startLat = meansStart['start_station_latitude'].values[0]
    startLong = meansStart['start_station_longitude'].values[0]
    endLat = meansEnd['end_station_latitude'].values[0]
    endLong = meansEnd['end_station_longitude'].values[0]
    nxGraph.add_node(row[1][0], position=(startLong, startLat))
    nxGraph.add_node(row[1][1], position=(endLong, endLat))
    nxGraph.add_edge(row[1][0], row[1][1], tripcount=row[1][2], pos=((startLong, startLat), (endLong, endLat)))
connectionsMean = tripdata.groupby(['start_station_name'], as_index=False).mean()
positions = {}
unavailableNodes = []
allNodes = nxGraph.nodes
for node in allNodes:
    try:
        lat = connectionsMean[connectionsMean['start_station_name'] == node]['start_station_latitude'].values[0]
        long = connectionsMean[connectionsMean['start_station_name'] == node]['start_station_longitude'].values[0]
    except IndexError:
        unavailableNodes.append(node)
        continue
    positions[node] = (long, lat) ## This isn't used for the plotly graph, just the nx draw
for unavailableNode in unavailableNodes:
    nxGraph.remove_node(unavailableNode)
smallEdges = []
for edge in nxGraph.edges.data():
    if edge[2]['tripcount'] < small_edge_param:
        smallEdges.append(edge)
for edge in smallEdges:
    nxGraph.remove_edge(*edge[:2])
nx.draw(nxGraph, pos=positions, node_size=5)
That doesn't look very good; let's see if we can make it better with Plotly.
edge_trace = go.Scatter(
x=[],
y=[],
line=dict(width=1.5,color='#888'),
hoverinfo='none',
mode='lines')
for edge in nxGraph.edges(data=True):
    ## 'pos' stores ((long, lat), (long, lat)), so index 0 of each pair is the longitude
    startLong = edge[2]['pos'][0][0]
    startLat = edge[2]['pos'][0][1]
    endLong = edge[2]['pos'][1][0]
    endLat = edge[2]['pos'][1][1]
    edge_trace['x'] += tuple([startLong, endLong, None])
    edge_trace['y'] += tuple([startLat, endLat, None])
node_trace = go.Scatter(
x=[],
y=[],
text=[],
mode='markers',
hoverinfo='text',
marker=dict(
showscale=False,
colorbar=dict(
thickness=15,
title='Node Connections',
xanchor='left',
titleside='right'
),
line=dict(width=2)))
for node in nxGraph.nodes():
    x, y = nxGraph.node[node]['position']
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])
## TODO: Change # adjacancies to total trip count, resize markers based on trip count
def getnodesize(lenAdjacencies):
    return max(10, lenAdjacencies*2 - 15)
for node, adjacencies in enumerate(nxGraph.adjacency()):
    node_info = '# of connections: '+str(len(adjacencies[1]))
    node_trace['text'] += tuple([node_info])
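As a minimal step toward the TODO above, marker sizes could be scaled by connection count with the getnodesize helper; sizing by total trip count would instead need the tripcount edge attributes summed per node. (Assigning a list to the marker size this way is an assumption about this plotly version.)
node_trace['marker']['size'] = [getnodesize(len(adjacencies))
                                for node, adjacencies in nxGraph.adjacency()]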
fig = go.Figure(data=[edge_trace, node_trace],
layout=go.Layout(
title='Connections between stations with high trip counts',
titlefont=dict(size=16),
showlegend=False,
hovermode='closest',
margin=dict(b=20,l=5,r=5,t=40),
# annotations=[ dict(
# text="Most popular trip starts here",
# arrowsize=1,
# showarrow=True,
# xref="paper", yref="paper",
# x=0.58, y=0.86 ) ],
xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))
py.iplot(fig, filename='station_network_graph')
This graph shows the connections between all the stations that had at least 175 trips from the starting station to the end station, effectively showing the most popular trips. We can see that there are many trips between the stations on the circumference of Central Park, and many originating from the station at 48th St and 5th Ave.
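Before pulling out the single most popular trip, a quick check of which stations act as hubs in this filtered network (degree here counts distinct high-volume connections, not total trips):
hubs = sorted(nxGraph.degree, key=lambda pair: pair[1], reverse=True)
print(hubs[:5])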
edgeTripCounts = [(edge[0], edge[1], edge[2]['tripcount']) for edge in nxGraph.edges.data()]
maxItem = max(edgeTripCounts, key=lambda item: item[2])
print(maxItem)
The most popular trip is made from 5th Ave & 88th St, to Central Park S & 6th Ave.
I also tried to see if I could put the stations into meaningful groups using K-Means clustering. The rule of thumb for the number of clusters gives sqrt(676/2) ≈ 18, where 676 is the number of stations. That seems too high: my intuition says we won't be able to interpret that many clusters. The elbow method doesn't show a distinct elbow either, so I'll try 5, which seems like a reasonable number of different kinds of bike station.
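For reference, the rule-of-thumb arithmetic:
print(np.sqrt(676 / 2))  ## the k ~ sqrt(n/2) heuristic gives about 18.4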
For input into the clustering algorithm, I used (normalized) values of average age, average trip duration, latitude, longitude, gender proportion, and average minute of the day.
clusteringDF = pd.concat([stationMeans['tripduration'],
stationMeans['start_station_latitude'],
stationMeans['start_station_longitude'],
2019 - stationMeans['birth_year'],
stationMeans['gender'],
stationMeans['sinceMidnight']],
axis=1,
keys=['tripduration',
'start_station_latitude',
'start_station_longitude',
'age', 'gender', 'sinceMidnight'])
x = clusteringDF.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
clusteringDF = pd.DataFrame(x_scaled)
clusteringDF.columns = ['tripduration',
'start_station_latitude',
'start_station_longitude',
'age', 'gender', 'sinceMidnight']
X = clusteringDF.iloc[:,:2] # Elbow curve computed on the first two features only
K = range(1,19) # Try k = 1 through 18
kmeans_models = [skc.KMeans(k).fit(X) for k in K]
centroids = [m.cluster_centers_ for m in kmeans_models]
D_k = [spd.cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]
# plot elbow curve
plt.plot(K, avgWithinSS, 'b*-')
plt.xlabel('Number of clusters');
plt.ylabel('Average within-cluster sum of squares');
plt.title('Elbow for K-Means clustering (No obvious result)');
k = 5
kmeans_model = skc.KMeans(n_clusters=k)
kmeans_model.fit(clusteringDF)
centroids = kmeans_model.cluster_centers_
site_lat = stationMeans['start_station_latitude']
site_lon = stationMeans['start_station_longitude']
locations_name = stationMeans['start_station_name']
colour_map = {
0: "#F4D06F",
1: "#FF8811",
2: "#9DD9D2",
3: "#2F274A",
4: "#392F5A",
5: "#2F2E2C"
}
colours = [colour_map[group] for group in kmeans_model.labels_]
data = [
go.Scattermapbox(
lat=site_lat,
lon=site_lon,
mode='markers',
marker=dict(
size=10,
color=colours,
opacity=0.9,
),
text=locations_name,
hoverinfo='text'
)]
layout = go.Layout(
title='Station groups based on K-means clustering (k = {})'.format(k),
hovermode='closest',
showlegend=False,
mapbox=dict(
bearing=0,
center=dict(
lat=40.742327,
lon=-73.973378
),
pitch=0,
zoom=10.5,
style='light'
),
)
fig = dict(data=data, layout=layout)
# py.write_image(fig, 'fig1.png')
py.iplot(fig, filename='Station groups based on K-means clustering (k = {})'.format(k))
These seem like reasonable clusters given the earlier findings about these variables in the station characteristics section. There does seem to be a strong geographic component to how the clusters are put together, so let's remove latitude and longitude and see the groups based on ridership details and demographics alone.
clusteringDF = pd.concat([stationMeans['tripduration'],
2019 - stationMeans['birth_year'],
stationMeans['gender'],
stationMeans['sinceMidnight']],
axis=1,
keys=['tripduration',
'age', 'gender', 'sinceMidnight'])
x = clusteringDF.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
clusteringDF = pd.DataFrame(x_scaled)
clusteringDF.columns = ['tripduration', 'age', 'gender', 'sinceMidnight']
kmeans_model = skc.KMeans(n_clusters=k)
kmeans_model.fit(clusteringDF)
centroids = kmeans_model.cluster_centers_
site_lat = stationMeans['start_station_latitude']
site_lon = stationMeans['start_station_longitude']
locations_name = stationMeans['start_station_name']
colour_map = {
0: "#F4D06F",
1: "#FF8811",
2: "#9DD9D2",
3: "#2F274A",
4: "#392F5A",
5: "#2F2E2C"
}
colours = [colour_map[group] for group in kmeans_model.labels_]
data = [
go.Scattermapbox(
lat=site_lat,
lon=site_lon,
mode='markers',
marker=dict(
size=10,
color=colours,
opacity=0.9,
),
text=locations_name,
hoverinfo='text'
)]
layout = go.Layout(
title='Station groups based on K-means clustering (k = {}, no geography input)'.format(k),
hovermode='closest',
showlegend=False,
mapbox=dict(
bearing=0,
center=dict(
lat=40.742327,
lon=-73.973378
),
pitch=0,
zoom=10.5,
style='light'
),
)
fig = dict(data=data, layout=layout)
# py.write_image(fig, 'fig1.png')
py.iplot(fig, filename='Station groups based on K-means clustering (k = {}, no geography)'.format(k))
These groupings aren't as neat and tidy, but they tell you a lot more about station characteristics than location alone. It looks like there is a "tourist" group along the coast and around Central Park, along the north coast, and on Governors Island. Another looks like commuters between Brooklyn and Manhattan (although we can't be sure of the assumption that many trips happen within-cluster).
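To ground these interpretations, we can inspect the cluster centroids in the normalized feature space (every feature was min-max scaled to [0, 1]):
centroid_df = pd.DataFrame(centroids, columns=clusteringDF.columns)
print(centroid_df.round(2))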
When I first loaded up this data, I was surprised at how many more stations there were in 2018 than in 2013. I wanted to see how the number and location of stations evolved over time.
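Here's a sketch of how the station counts behind that animation could be tallied, assuming the monthly CSVs are downloaded locally. The raw files use 'start station name' as the column header in most years, but the header varies, so treat the column name as an assumption.
import glob
for path in sorted(glob.glob('*citibike-tripdata.csv')):
    month = pd.read_csv(path, usecols=['start station name'])
    print(path, month['start station name'].nunique())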
%%HTML
<video width="320" height="240" controls>
<source src="https://s3.amazonaws.com/personal-sv/Citi+Bike+Monthly+Trip+Count.mov" type="video/mp4">
</video>
Note: the video shows absolute monthly trip counts, not averages.
We can see the expansion over the years, and also the "breathing" as ridership dwindles over winter and balloons in summer. It also looks like there's a lot more room to grow, although I would guess there are diminishing returns for stations very far from Manhattan: as commute distances grow, riders will look for lighter, faster, more comfortable bikes, which are inevitably more expensive and thus more attractive to thieves.
Based on this analysis of individual station characteristics, grouped station characteristics, and the relationships among stations over time, I found that there are many good ways to distinguish one station from another. Is a station used by daily commuters or leisurely tourists? Is it a very popular station or a backwater on the outskirts? These kinds of questions could help Citibike or other bikeshare companies plan further expansions or better manage their current fleet.
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')