Candidate: Shichao (Richard) Ji
jshichao@vt.edu

Summary

data used in the project, source

For this project, I have written a Python library mainly for data processing & pipeline

  • Most of its functions & methods can apply to other general data analysis projects & problems

package code is here (my github)

For Model building: Build a predictive model for tip as a percentage of the total fare.

  • I have trained a 5-hold stacking H2O model (using 120 CPU cores) in Uconn's HPC
  • The model achieved a RMSE of 0.2 (tip percent, scale of 0-100) on training data and 0.34 on validation data (5-fold average)
  • I have tested on Oct 2015 data, model achieved 0.30 RMSE, which means the target variable tip percent is in 0-100 scale (0% to 100%), the error rate is about 0.30% (0.3 percent).

Phase 1

Data Exploration

Initial plotting analysis

Plot a histogram of the trip distance ("Trip Distance").

further plotting and grouping analysis

Identifying trips that originate or terminate at one of the NYC area airports. Can you provide a count of how many transactions fit this criteria, the average fair, and any other interesting characteristics of these trips.


Open phase 2 reports in new notebooks

Phase 2

Model Building

Build a predictive model for tip as a percentage of the total fare

summary learned


In [1]:
import pandas as pd
from time import time
from __future__ import division, print_function
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math
import gc

import warnings
warnings.filterwarnings("ignore") 

In [2]:
%load_ext autoreload
%autoreload 2

Question 1

In [3]:
# # donwload the data
# !wget https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv
  • load the first 100 records and have a look of the data, find out if we need to parse features in the data loading phase
In [4]:
%%time
df = pd.read_csv('green_tripdata_2015-09.csv', parse_dates=['lpep_pickup_datetime','Lpep_dropoff_datetime'],\
                 chunksize=10000).next()
print (df.shape)
(10000, 21)
CPU times: user 40 ms, sys: 8 ms, total: 48 ms
Wall time: 46.4 ms
In [5]:
pd.options.display.max_columns=100
In [6]:
df.head(3)
Out[6]:
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1
In [7]:
df.head(1).transpose()
Out[7]:
0
VendorID 2
lpep_pickup_datetime 2015-09-01 00:02:34
Lpep_dropoff_datetime 2015-09-01 00:02:38
Store_and_fwd_flag N
RateCodeID 5
Pickup_longitude -73.9795
Pickup_latitude 40.685
Dropoff_longitude -73.9794
Dropoff_latitude 40.685
Passenger_count 1
Trip_distance 0
Fare_amount 7.8
Extra 0
MTA_tax 0
Tip_amount 1.95
Tolls_amount 0
Ehail_fee NaN
improvement_surcharge 0
Total_amount 9.75
Payment_type 1
Trip_type 2
  • load the whole data with dates parsed in the loading phase
In [8]:
%%time
df = pd.read_csv('green_tripdata_2015-09.csv', parse_dates=['lpep_pickup_datetime','Lpep_dropoff_datetime'])
print (df.shape)
(1494926, 21)
CPU times: user 6.01 s, sys: 340 ms, total: 6.35 s
Wall time: 6.37 s

In [9]:
print ('Data has {} rows and {} columns'.format(df.shape[0], df.shape[1]))
Data has 1494926 rows and 21 columns
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1494926 entries, 0 to 1494925
Data columns (total 21 columns):
VendorID                 1494926 non-null int64
lpep_pickup_datetime     1494926 non-null datetime64[ns]
Lpep_dropoff_datetime    1494926 non-null datetime64[ns]
Store_and_fwd_flag       1494926 non-null object
RateCodeID               1494926 non-null int64
Pickup_longitude         1494926 non-null float64
Pickup_latitude          1494926 non-null float64
Dropoff_longitude        1494926 non-null float64
Dropoff_latitude         1494926 non-null float64
Passenger_count          1494926 non-null int64
Trip_distance            1494926 non-null float64
Fare_amount              1494926 non-null float64
Extra                    1494926 non-null float64
MTA_tax                  1494926 non-null float64
Tip_amount               1494926 non-null float64
Tolls_amount             1494926 non-null float64
Ehail_fee                0 non-null float64
improvement_surcharge    1494926 non-null float64
Total_amount             1494926 non-null float64
Payment_type             1494926 non-null int64
Trip_type                1494922 non-null float64
dtypes: datetime64[ns](2), float64(14), int64(4), object(1)
memory usage: 239.5+ MB

Question 2

  • Plot a histogram of the trip distance ("Trip Distance").
  • Report any structure you find and any hypotheses you have about that structure.
  • data exploration, plot a boxplot, we can see the trip distance is highly Right-skewed, we may need to remove outliers to get a better understanding of the data
In [11]:
plt.figure(figsize=[20,2])
sns.boxplot(df['Trip_distance'])
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4365e89790>
  • ### import helper plot Class from the project code
In [12]:
from codes import PlotHist

the helper Class

  • can remove right outliers with specified standard deviation
  • calculate specified percentile
  • Plot histogram of Trip distance with & without outliers (10 std) and with & without log scale of y (count)
  • check the code by executing the cell below
In [13]:
plot_instance = PlotHist(df, 'Trip_distance', upper_std=10)
plot_instance.plot()
290 of 1494926 outliers detected 99.98% remain of feature Trip_distance at 10 std
In [14]:
# 10 std upper limit
plot_instance.upper_limit
Out[14]:
33.734348944732176
In [15]:
plot_instance.update_upper_limit(3)
plot_instance.upper_limit
Out[15]:
12.198003279202943
In [16]:
plot_instance.get_percentile([0.5, 0.90, 0.95, 0.99, 0.999])
Out[16]:
0.500     1.98
0.900     6.55
0.950     8.80
0.990    14.77
0.999    23.87
Name: Trip_distance, dtype: float64

as we can see from above graphs, percentiles and below CDF distribution, 99% of the data is smaller than 14.77 miles. distribution of Trip distance is highly Right-skewed

  • hypothesis:
    • e.g. green taxi trips are mostly within 15 miles
In [17]:
plt.figure(figsize=[14,6])
plot_instance.data.hist( normed = True, bins = 100, cumulative = True, range=(0, 15), alpha=.7)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4352519790>

In [18]:
from codes import TimeAggDis

TimeAggDis

  • #### plot helper class that plot/group a numeric column by a data column with chosen aggregation
In [19]:
TimeAggDis??
In [20]:
agg = TimeAggDis(df=df, date_col='lpep_pickup_datetime', num_col='Trip_distance', groupby='hour')

mean and median trip distance (miles) by hour of the day

In [21]:
agg.getReport()
Out[21]:
Mean_Trip_distance Median_Trip_distance
hour
0 3.115276 2.20
1 3.017347 2.12
2 3.046176 2.14
3 3.212945 2.20
4 3.526555 2.36
5 4.133474 2.90
6 4.055149 2.84
7 3.284394 2.17
8 3.048450 1.98
9 2.999105 1.96
10 2.944482 1.92
11 2.912015 1.88
12 2.903065 1.89
13 2.878294 1.84
14 2.864304 1.83
15 2.857040 1.81
16 2.779852 1.80
17 2.679114 1.78
18 2.653222 1.80
19 2.715597 1.85
20 2.777052 1.90
21 2.999189 2.03
22 3.185394 2.20
23 3.191538 2.22
In [22]:
agg.plotReport(plotkind='bar', figsize=[20,7])
In [23]:
agg.plotReport(plotkind='line', figsize=[20,7])


Q3.2

google search "NYC major airports"

https://www.google.com/search?q=nyc+major+airports

In [24]:
# latitude and longtitude of three airports getting from google (see above)
JFK = [-73.7851, 40.6463]
LGA = [-73.8685, 40.7720]
EWR = [-74.1815, 40.6895]
In [25]:
from codes import plot_coordinates

plot helper function plot_coordinates

  • #### simply scatter plot, black background
  • two color represents pickup and dropoff location dot
  • 3 major airports are marked circle in the plot
  • bottom-right corner, green squre shows the radius I used to category airport trips
In [26]:
# view code
plot_coordinates??
In [27]:
plot_coordinates(df, limit=True, xy_eq_scale=True, frame=True,
                 xlimit=[-74.20,-73.76], ylimit=[40.56,40.90], 
                 figsize=[20,16])
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f435218d350>