Citibike Station Analysis

Scott Veale

DATA 606

At the time of this writing, Citibike has serviced over sixty million bike rides in New York City. Data recorded on each of these rides totals over 12GB and continues to grow monthly.(https://s3.amazonaws.com/tripdata/index.html).

Other analysis of this data set can be seen at these links:

  1. http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/
  2. https://towardsdatascience.com/citi-bike-2017-analysis-efd298e6c22c
  3. https://bridgingbigdata.github.io/assets/attachments/CITI%20Bike%20Data%20Analysis_SunnyBike.pptx

I'll be analyzing this large dataset with the goal of characterizing each bike station and the relationships between them. For the most part, I'll use a small subset consisting of the data from August 2018, as this data is from the summer and is representative of currently operational stations (which may close during the winter).

In [80]:
import pandas as pd
import numpy as np
import  plotly
import  plotly.plotly  as  py
plotly.tools.set_credentials_file ( username = "srveale",  api_key = "GajxMbxLyoAc3R0I6K6e")
import  plotly.graph_objs  as  go
from  plotly  import  tools
import matplotlib.pyplot as plt
import networkx as nx
from networkx.algorithms import community
import re
import sklearn as sk
import sklearn.cluster as skc
import scipy.spatial.distance as spd
from sklearn import preprocessing
In [113]:
tripdata = pd.read_csv('201808-citibike-tripdata.csv')
In [114]:
tripdata.head()
Out[114]:
tripduration starttime stoptime start station id start station name start station latitude start station longitude end station id end station name end station latitude end station longitude bikeid usertype birth year gender
0 681 2018-08-01 00:00:07.3210 2018-08-01 00:11:28.9920 3162.0 W 78 St & Broadway 40.783400 -73.980931 3383.0 Cathedral Pkwy & Broadway 40.804213 -73.966991 27770 Subscriber 1986 1
1 625 2018-08-01 00:00:19.7480 2018-08-01 00:10:45.0290 3260.0 Mercer St & Bleecker St 40.727064 -73.996621 2012.0 E 27 St & 1 Ave 40.739445 -73.976806 25938 Subscriber 1969 1
2 1319 2018-08-01 00:00:21.1750 2018-08-01 00:22:20.6370 403.0 E 2 St & 2 Ave 40.725029 -73.990697 285.0 Broadway & E 14 St 40.734546 -73.990741 28679 Subscriber 1970 1
3 220 2018-08-01 00:00:26.4700 2018-08-01 00:04:06.8190 3637.0 Fulton St & Waverly Ave 40.683239 -73.965996 399.0 Lafayette Ave & St James Pl 40.688515 -73.964763 28075 Subscriber 1982 1
4 398 2018-08-01 00:00:30.2910 2018-08-01 00:07:09.2810 3662.0 31 Ave & Steinway St 40.761294 -73.916917 3517.0 31 St & Hoyt Ave N 40.771153 -73.917007 25002 Subscriber 1987 1
In [115]:
tripdata.describe()
Out[115]:
tripduration start station id start station latitude start station longitude end station id end station latitude end station longitude bikeid birth year gender
count 1.977177e+06 1.975789e+06 1.977177e+06 1.977177e+06 1.975789e+06 1.977177e+06 1.977177e+06 1.977177e+06 1.977177e+06 1.977177e+06
mean 1.012001e+03 1.627431e+03 4.073740e+01 -7.398241e+01 1.622045e+03 4.073704e+01 -7.398255e+01 2.619152e+04 1.979605e+03 1.136611e+00
std 1.131265e+04 1.449333e+03 3.095742e-02 1.975853e-02 1.449197e+03 3.081692e-02 1.986146e-02 6.234873e+03 1.170760e+01 5.707010e-01
min 6.100000e+01 7.200000e+01 4.064654e+01 -7.402535e+01 7.200000e+01 4.064654e+01 -7.406378e+01 1.452900e+04 1.885000e+03 0.000000e+00
25% 3.740000e+02 3.820000e+02 4.071745e+01 -7.399596e+01 3.820000e+02 4.071740e+01 -7.399601e+01 1.995500e+04 1.969000e+03 1.000000e+00
50% 6.400000e+02 5.140000e+02 4.073782e+01 -7.398565e+01 5.130000e+02 4.073726e+01 -7.398602e+01 2.774800e+04 1.982000e+03 1.000000e+00
75% 1.129000e+03 3.258000e+03 4.075898e+01 -7.397152e+01 3.258000e+03 4.075763e+01 -7.397188e+01 3.150900e+04 1.989000e+03 1.000000e+00
max 3.095079e+06 3.705000e+03 4.086900e+01 -7.387800e+01 3.705000e+03 4.086900e+01 -7.387800e+01 3.483900e+04 2.002000e+03 2.000000e+00

0. Cleaning

In [117]:
## Cleaning / Clear suspicious data
print("Total number of trips before cleaning", len(tripdata))
tripdata.columns = ["tripduration","starttime","stoptime",
                    "start_station_id","start_station_name",
                    "start_station_latitude","start_station_longitude",
                    "end_station_id","end_station_name",
                    "end_station_latitude","end_station_longitude",
                    "bikeid","usertype","birth_year","gender"]
tripdata = tripdata[tripdata['tripduration'] > 60]
tripdata = tripdata[tripdata['tripduration'] < 60*60*2]

tripdata['birth_year'] = tripdata['birth_year'].dropna()
tripdata['birth_year'] = tripdata['birth_year'].convert_objects(convert_numeric = True)
tripdata = tripdata[tripdata['birth_year'] > 1939]

print("After removing short trips, long trips, and octegenarians", len(tripdata))

stationCounts = tripdata.groupby('start_station_name', as_index=False).count()

to_remove = stationCounts[stationCounts['tripduration'] < 150]
# tripdata[tripdata['start_station_name'].isin(to_remove['start_station_name'])
for remove_station in to_remove['start_station_name']:
    tripdata = tripdata[tripdata['start_station_name'] != remove_station]
         
# tripdata = tripdata[tripdata[tripdata['start_station_name'].isin(to_remove['start_station_name'])]]
print("After removing low trip counts values", len(tripdata))

# tripdata = tripdata.dropna()
print("After removing null values", len(tripdata))
Total number of trips before cleaning 1968229
/root/environments/my_env/lib/python3.6/site-packages/ipykernel_launcher.py:13: FutureWarning:

convert_objects is deprecated.  To re-infer data dtypes for object columns, use Series.infer_objects()
For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

After removing short trips, long trips, and octegenarians 1968229
After removing low trip counts values 1967317
After removing null values 1967317

I removed trips that were less than 60 seconds, as these are likely to be false starts, or riders who are making sure the bike is secure after their actual ride ends. Also, rides over two hours are suspect, as it seems unlikely that riders would spend the extra $4.00 per minute, rather than renting a bike from another service for less money.

I removed riders who said they were over 80 years old. Not to say that seniors don't ride bikes, but I find it more likely that these users of the service are lying about their age.

Then, I took out all stations that had a total monthly trip count less than 150. Many of these appeared to be testing stations, where Citibike did not have them in full production mode.

Lastly, I removed from the dataframe items with null values. I didn't find any, but this was more for the integrity of the notebook.

All the cleaning brought the trip count down from 1977177 to 1967317, a loss of about 0.5%

In [118]:
stationMeans = tripdata.groupby('start_station_name', as_index=False).mean()
endStationMeans = tripdata.groupby('end_station_name', as_index=False).mean()

1. Individual Station Characteristics

There are several perspectives from which we can distinguish one station from another. Here I'll visualize the average trip duration, average age, and the gender proportion for station, using the station where the trip started, to try and get an idea of the spread of the values across the different stations.

In [119]:
durationByStation = tripdata.groupby('start_station_name', as_index=False).mean()
site_lat = durationByStation['start_station_latitude']
site_lon = durationByStation['start_station_longitude']
locations_name = durationByStation['start_station_name']

data = [
    go.Scattermapbox(
        lat=site_lat,
        lon=site_lon,
        mode='markers',
        marker=dict(
            size=10,
            color=durationByStation['tripduration'] / 60,
            opacity=0.7,
            showscale=True,
        ),
        text=locations_name,
        hoverinfo='text'
    )]
        
layout = go.Layout(
    title='Average trip duration by start station (minutes)',
    hovermode='closest',
    showlegend=False,
    mapbox=dict(
        bearing=0,
        center=dict(
            lat=40.742327,
            lon=-73.973378
        ),
        pitch=0,
        zoom=10,
        style='light'
    ),
)

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Average trip duration by start station')
Out[119]:
In [120]:
trace0 = go.Box(
    y=durationByStation['tripduration'] / 60
)
data = [trace0]

layout = go.Layout(
    title = "Average trip duration by trip starting station",
    yaxis=dict(
        title='Trip duration (min)',
        zeroline=False
    ),
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig)
Out[120]:

We can notice a few stand-out locations where trips generally take a very long time. For example, trips starting on the island with Yankee Ferry Terminal are long, which isn't a surprise. Additionally, it looks like the south-east corner of Central Park has some long trip times as well, perhaps as tourists start here and circumnavigate the park.

The center of Manhattan has some very low average trip times, probably since many riders are going only a few blocks within the same region.

In [121]:
ageByStation = tripdata.groupby('start_station_name', as_index=False).mean()

ageByStation['age'] = 2019 - ageByStation['birth_year']

site_lat = ageByStation['start_station_latitude']
site_lon = ageByStation['start_station_longitude']
locations_name = ageByStation['start_station_name']


data = [
    go.Scattermapbox(
        lat=site_lat,
        lon=site_lon,
        mode='markers',
        marker=dict(
            size=10,
            color=ageByStation['age'],
            opacity=0.7,
            showscale=True,
        ),
        text=locations_name,
        hoverinfo='text'
    )]
        
layout = go.Layout(
    title='Average age by start station',
    hovermode='closest',
    showlegend=False,
    mapbox=dict(
        bearing=0,
        center=dict(
            lat=40.742327,
            lon=-73.973378
        ),
        pitch=0,
        zoom=10,
        style='light'
    ),
)

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Average age duration by start station')
Out[121]:
In [122]:
trace0 = go.Box(
    y=ageByStation['age']
)
data = [trace0]
layout = go.Layout(
    title = "Average rider age by trip starting station",
    yaxis=dict(
        title='Age (Years)',
        zeroline=False
    ),
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
Out[122]:
In [123]:
men = tripdata[tripdata.gender == 1].groupby('start_station_name', as_index = False).count()
women = tripdata[tripdata.gender == 2].groupby('start_station_name', as_index = False).count()

total = men['gender'] + women['gender']
menProportion = men['gender'] / total

site_lat = stationMeans['start_station_latitude']
site_lon = stationMeans['start_station_longitude']
locations_name = stationMeans['start_station_name']


data = [
    go.Scattermapbox(
        lat=site_lat[:765],
        lon=site_lon[:765],
        mode='markers',
        marker=dict(
            size=10,
            color=menProportion[:765],
            opacity=0.7,
            showscale=True,
        ),
        text=locations_name,
        hoverinfo='text'
    )]
        
layout = go.Layout(
    title='Proportion of men by start station',
    hovermode='closest',
    showlegend=False,
    mapbox=dict(
        bearing=0,
        center=dict(
            lat=40.742327,
            lon=-73.973378
        ),
        pitch=0,
        zoom=10,
        style='light'
    ),
)

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Proportion of men by start station')
Out[123]:
In [124]:
trace0 = go.Box(
    y=menProportion[menProportion.isna() == False]
)
data = [trace0]
layout = go.Layout(
    title = "Proportion of male riders by trip start station",
    yaxis=dict(
        title='Proportion of men out of total',
        zeroline=False
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
Out[124]:

Wow, look at all the stations with over 80% men! What could cause this? Here's something I found when looking up Railway Ave & Kay Ave:

image.png

https://nypost.com/2014/03/28/citi-bike-racks-sit-unused-in-brooklyn-boondocks/

Seems like some stations are in areas where women feel uncomfortable undocking and using a bike. This is definitely something Citibike needs to factor into their plans.

2. Tripcounts over the course of a day

Another good way to see the different behaviour at each station is to see how many bike trips originate from each station as a typical day progresses. The first graph I'll show is the average trip count of all bike trips made between 8:00am and 9:00am, averaged over the days of the month of August 2018.

In [125]:
## Get the number of minutes since the previous midnight
## time is a string e.g. 2013-08-01 00:00:01
def minutesSinceMidnight(time):
    try:
        hrminsec = re.findall(r"..:..:..", time)[0]
    except:
        hrminsec = '99:99:99'
    return int(hrminsec[0:2])*60 + int(hrminsec[3:5]) + np.floor(int(hrminsec[6:8])/60)
tripdata['sinceMidnight'] = tripdata['starttime'].apply(minutesSinceMidnight)
In [126]:
minute_counts = tripdata.groupby(['start_station_name', 'sinceMidnight'], as_index=False).count()
In [127]:
## Prep for the trip count plot
stationMeans = tripdata.groupby('start_station_name', as_index=False).mean()
cmin = -1
cmax = 300
opacity = 0.7
In [128]:
increment = 8
interval1 = 60*increment
interval2 = interval1+60
incrementCounts = minute_counts[minute_counts['sinceMidnight'] > interval1]
incrementCounts = incrementCounts[incrementCounts['sinceMidnight'] < interval2]

incrementCountsSummed = incrementCounts.groupby('start_station_name', as_index=False).sum()

site_lat = stationMeans['start_station_latitude']
site_lon = stationMeans['start_station_longitude']
locations_name = stationMeans['start_station_name']

data = [
    go.Scattermapbox(
        lat=site_lat,
        lon=site_lon,
        mode='markers',
        marker=dict(
            size=10,
            color=incrementCountsSummed['tripduration'],
            opacity=opacity,
            cmin=cmin,
            cmax=cmax,
            showscale=True,
        ),
        text=locations_name,
        hoverinfo='text'
    )]
        
layout = go.Layout(
    title='Average trip count, 0{}:00 to 0{}:00'.format(increment, increment+1),
    hovermode='closest',
    showlegend=False,
    mapbox=dict(
        bearing=0,
        center=dict(
            lat=40.742327,
            lon=-73.973378
        ),
        pitch=0,
        zoom=10.5,
        style='light'
    ),
)

fig = dict(data=data, layout=layout)
# py.write_image(fig, 'fig1.png')
py.iplot(fig, filename='Average trip duration by start station0')
Out[128]:

I animated the average hourly trip counts over the hours of a day. If the embedding doesn't work, the link is here

In [43]:
%%HTML
<video width="320" height="240" controls>
  <source src="https://s3.amazonaws.com/personal-sv/Citi+Bike+Average+Hourly+Trip+Count.mov" type="video/mp4">
</video>

3. Network analysis

Here I'll attemp to treat Citibike stations as nodes in a network. The goal will be to see what are the most popular trips. Of course, almost every station will have at least one trip to every other station, so I'll have to filter out the edges between nodes with low trip counts - I chose 175 as my threshold.

In [44]:
connections = tripdata.groupby(['start_station_name', 'end_station_name'], as_index=False).count()
connections = connections[connections['start_station_name'] != connections['end_station_name']]
In [45]:
small_edge_param = 175

nxGraph = nx.Graph()
for row in connections.iterrows():
    meansStart = stationMeans[stationMeans['start_station_name'] == row[1][0]]
    meansEnd = endStationMeans[endStationMeans['end_station_name'] == row[1][1]]
    startLat = meansStart['start_station_latitude'].values[0]
    startLong = meansStart['start_station_longitude'].values[0]
    endLat = meansEnd['end_station_latitude'].values[0]
    endLong = meansEnd['end_station_longitude'].values[0]
    nxGraph.add_node(row[1][0], position=(startLong, startLat)) 
    nxGraph.add_node(row[1][1], position=(endLong, endLat))
    nxGraph.add_edge(row[1][0],row[1][1],tripcount=row[1][2], pos=((startLong, startLat),(endLong, endLat)))

connectionsMean = tripdata.groupby(['start_station_name'], as_index=False).mean()
positions = {}
unavailableNodes = []
allNodes = nxGraph.nodes
for node in allNodes:
    try: 
        lat = connectionsMean[connectionsMean['start_station_name'] == node]['start_station_latitude'].values[0]
        long = connectionsMean[connectionsMean['start_station_name'] == node]['start_station_longitude'].values[0]
    except:
        unavailableNodes.append(node)
    positions[node] = (long, lat) ## This isn't used for the plotly graph, just the nx draw
for unavailableNode in unavailableNodes:
    nxGraph.remove_node(unavailableNode)
    
smallEdges = []
for edge in nxGraph.edges.data():
    if edge[2]['tripcount'] < small_edge_param:
        smallEdges.append(edge)
for edge in smallEdges:
    nxGraph.remove_edge(*edge[:2])
    
nx.draw(nxGraph, pos=positions, node_size=5)

That doesn't look very good, let's see if we can make it better with Plotly.

In [138]:
edge_trace = go.Scatter(
    x=[],
    y=[],
    line=dict(width=1.5,color='#888'),
    hoverinfo='none',
    mode='lines')

for edge in nxGraph.edges(data=True):
    startLat = edge[2]['pos'][0][0]
    startLong = edge[2]['pos'][0][1]
    endLat = edge[2]['pos'][1][0]
    endLong = edge[2]['pos'][1][1]
    edge_trace['x'] += tuple([startLat, endLat, None])
    edge_trace['y'] += tuple([startLong, endLong, None])
In [139]:
node_trace = go.Scatter(
    x=[],
    y=[],
    text=[],
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=False,
        colorbar=dict(
            thickness=15,
            title='Node Connections',
            xanchor='left',
            titleside='right'
        ),
        line=dict(width=2)))
In [140]:
for node in nxGraph.nodes():
    x, y = nxGraph.node[node]['position']
    node_trace['x'] += tuple([x])
    node_trace['y'] += tuple([y])
In [141]:
## TODO: Change # adjacancies to total trip count, resize markers based on trip count

def getnodesize(lenAdjacencies):
    return max(10, lenAdjacencies*2 - 15)

for node, adjacencies in enumerate(nxGraph.adjacency()):
    node_info = '# of connections: '+str(len(adjacencies[1]))
    node_trace['text']+=tuple([node_info])    
In [142]:
fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
                title='<br>Connections between stations with high trip counts.</br>',
                titlefont=dict(size=16),
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
#                 annotations=[ dict(
#                     text="Most popular trip starts here",
#                     arrowsize=1,
#                     showarrow=True,
#                     xref="paper", yref="paper",
#                     x=0.58, y=0.86 ) ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))

py.iplot(fig, filename='station_network_graph')
Out[142]:

This graph shows the connections between all the stations that had at least 175 trips from the starting station to the end station, effectively showing the most popular trips. We can see that there are many trips between the stations on the circumference of Central Park, and many originating from the station at 48th St and 5th Ave.

In [52]:
edgeTripCounts = [(edge[0], edge[1], edge[2]['tripcount']) for edge in nxGraph.edges.data()]
currMax = 0
for item in edgeTripCounts:
    if item[2] > currMax:
        currMax = item[2]
        maxItem = item
        
print(maxItem)
('5 Ave & E 88 St', 'Central Park S & 6 Ave', 788)

The most popular trip is made from 5th Ave & 88th St, to Central Park S & 6th Ave.

4. Clustering

I also tried to see if I could put the stations into meaningful groups using K-Means clustering. The rule of thumb for how many clusters to try gives sqrt(676) ~ 18, where 676 is the number of stations. That seems too high for my intuition telling me that we won't be able to interpret that many clusters. Using the Elbow method to determine the number of clusters doesn't really show a distinct elbow, but we'll try 6 as that seems like a reasonable number of different kinds of bike station.

For input into the clustering algorithm, I used (normalized) values of average age, average trip duration, latitude, longitude, gender proportion, and average minute of the day.

In [53]:
clusteringDF = pd.concat([stationMeans['tripduration'],
                            stationMeans['start_station_latitude'],
                            stationMeans['start_station_longitude'],
                            2019 - stationMeans['birth_year'],
                            stationMeans['gender'],
                            stationMeans['sinceMidnight']], 
                         axis=1, 
                         keys=['tripduration', 
                               'start_station_latitude', 
                               'start_station_longitude', 
                               'age', 'gender', 'sinceMidnight'])
x = clusteringDF.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
clusteringDF = pd.DataFrame(x_scaled)
clusteringDF.columns = ['tripduration', 
                               'start_station_latitude', 
                               'start_station_longitude', 
                               'age', 'gender', 'sinceMidnight']
In [ ]:
 
In [54]:
X = clusteringDF.iloc[:,:2]  # Using the the dataset of Section 1
K = range(1,19)  # Apply kmeans 1 to 10
kmeans_models = [skc.KMeans(k).fit(X) for k in K]
centroids = [m.cluster_centers_ for m in kmeans_models]
D_k = [spd.cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]

# plot elbow curve
plt.plot(K, avgWithinSS, 'b*-')
plt.xlabel('Number of clusters');
plt.ylabel('Average within-cluster sum of squares');
plt.title('Elbow for K-Means clustering (No obvious result)');
In [55]:
k = 5
kmeans_model = skc.KMeans(n_clusters=k)
kmeans_model.fit(clusteringDF)
centroids = kmeans_model.cluster_centers_
In [56]:
site_lat = stationMeans['start_station_latitude']
site_lon = stationMeans['start_station_longitude']
locations_name = stationMeans['start_station_name']

colour_map = {
    0: "#F4D06F",
    1: "#FF8811",
    2: "#9DD9D2",
    3: "#2F274A",
    4: "#392F5A",
    5: "#2F2E2C"
}
colours = [colour_map[group] for group in kmeans_model.labels_]

data = [
    go.Scattermapbox(
        lat=site_lat,
        lon=site_lon,
        mode='markers',
        marker=dict(
            size=10,
            color=colours,
            opacity=0.9,
        ),
        text=locations_name,
        hoverinfo='text'
    )]
        
layout = go.Layout(
    title='Station groups based on K-means clustering (k = {})'.format(k),
    hovermode='closest',
    showlegend=False,
    mapbox=dict(
        bearing=0,
        center=dict(
            lat=40.742327,
            lon=-73.973378
        ),
        pitch=0,
        zoom=10.5,
        style='light'
    ),
)

fig = dict(data=data, layout=layout)
# py.write_image(fig, 'fig1.png')
py.iplot(fig, filename='Station groups based on K-means clustering (k = {})'.format(k))
/root/environments/my_env/lib/python3.6/site-packages/IPython/core/display.py:689: UserWarning:

Consider using IPython.display.IFrame instead

Out[56]:

These seem like reasonable clusters based on previous findings about the variables in the station characteristics section. There does seem to be a highly geographical component to how these clusters are put together, so let's remove latitude and longitude so we can see the groups based on ridership details and demographics alone.

In [57]:
clusteringDF = pd.concat([stationMeans['tripduration'],
                            2019 - stationMeans['birth_year'],
                            stationMeans['gender'],
                            stationMeans['sinceMidnight']], 
                         axis=1, 
                         keys=['tripduration', 
                               'age', 'gender', 'sinceMidnight'])
x = clusteringDF.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
clusteringDF = pd.DataFrame(x_scaled)
clusteringDF.columns = ['tripduration', 'age', 'gender', 'sinceMidnight']
In [58]:
kmeans_model = skc.KMeans(n_clusters=k)
kmeans_model.fit(clusteringDF)
centroids = kmeans_model.cluster_centers_
In [59]:
site_lat = stationMeans['start_station_latitude']
site_lon = stationMeans['start_station_longitude']
locations_name = stationMeans['start_station_name']

colour_map = {
    0: "#F4D06F",
    1: "#FF8811",
    2: "#9DD9D2",
    3: "#2F274A",
    4: "#392F5A",
    5: "#2F2E2C"
}
colours = [colour_map[group] for group in kmeans_model.labels_]

data = [
    go.Scattermapbox(
        lat=site_lat,
        lon=site_lon,
        mode='markers',
        marker=dict(
            size=10,
            color=colours,
            opacity=0.9,
        ),
        text=locations_name,
        hoverinfo='text'
    )]
        
layout = go.Layout(
    title='Station groups based on K-means clustering (k = {}, no geography input)'.format(k),
    hovermode='closest',
    showlegend=False,
    mapbox=dict(
        bearing=0,
        center=dict(
            lat=40.742327,
            lon=-73.973378
        ),
        pitch=0,
        zoom=10.5,
        style='light'
    ),
)

fig = dict(data=data, layout=layout)
# py.write_image(fig, 'fig1.png')
py.iplot(fig, filename='Station groups based on K-means clustering (k = {}, no geography)'.format(k))
/root/environments/my_env/lib/python3.6/site-packages/IPython/core/display.py:689: UserWarning:

Consider using IPython.display.IFrame instead

Out[59]:

These groupings aren't so neat and tidy, but they tell you a lot more about the station characterics than just where they are located. It looks like there is a "tourist" group along the coast and around Central Park, along the north coast, and on Governer's Island. Another looks like commuters between Brooklyn and Manhattan (although we can't be sure about the assumption that there are many trips within-cluster).

5. Historical Station Count

When I first loaded up this data, I was surprised at how many more stations there were in 2018 than in 2013. I wanted to see how the number and location of stations evolved over time.

In [60]:
%%HTML
<video width="320" height="240" controls>
  <source src="https://s3.amazonaws.com/personal-sv/Citi+Bike+Monthly+Trip+Count.mov" type="video/mp4">
</video>

Note - this will be updated to indicate that these are not in fact averages but absolute counts

We can see the expansion over the years, and also the "breathing" as ridership dwindles over winter and balloons in summer. It also looks like there's a lot more room to grow! Although I would guess that there are diminishing returns with stations that are very far from Manhattan. As commute distance gets farther, riders will be looking for lighter, faster, more comfortable bikes which are inevitably more expensive and thus more attractive to thieves.

6. Conclusion

Based on the analysis from the perspective of individual station characteristics, grouped station characteristics, and the relationships among stations and over time, I found that there are many good ways to distinguish one station from another. Is it used by daily commuters or leisurely tourists? Is it a very popular station or a bywater on the outskirts? These kinds of questions could enable Citibike or other bikeshare companies to help plan further expansions or better manage their current fleet.

Tools Used

  • To deal with large amounts of data, I provisioned a Digital Ocean droplet which I was able to resize at will in terms of RAM and computing power. For the network analysis I boosted the RAM to 192 GB, and the vCPUs to 32.
  • I installed Jupyter on this droplet and worked via port-forwarding the notebook server
  • Interactive plotting with Plotly
  • Pandas, Numpy, Matplotlib
  • Networkx for network analysis
  • Ski-kit Learn for clustering
In [61]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
Out[61]:
The raw code for this IPython notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.
In [ ]: