New York City Taxi Trip Prediction

The goal of this project is to predict the duration of taxi rides in NYC. The NYC taxi trip duration dataset comes from a Kaggle competition. It is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC).

The dataset contains three files:

  • train.csv - the training set (contains 1458644 trip records)
  • test.csv - the testing set (contains 625134 trip records)
  • sample_submission.csv - a sample submission file in the correct format

Data fields:

  • id - a unique identifier for each trip
  • vendor_id - a code indicating the provider associated with the trip record
  • pickup_datetime - date and time when the meter was engaged
  • dropoff_datetime - date and time when the meter was disengaged
  • passenger_count - the number of passengers in the vehicle (driver entered value)
  • pickup_longitude - the longitude where the meter was engaged
  • pickup_latitude - the latitude where the meter was engaged
  • dropoff_longitude - the longitude where the meter was disengaged
  • dropoff_latitude - the latitude where the meter was disengaged
  • store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle did not have a connection to the server (Y = store and forward; N = not a store and forward trip)
  • trip_duration - duration of the trip in seconds
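
Given these definitions, trip_duration presumably equals the time between pickup_datetime and dropoff_datetime, expressed in seconds. The sketch below illustrates that relationship with the standard library. The timestamp format, the helper name duration_in_seconds, and the sample values are assumptions for illustration and are not read from train.csv.

```python
from datetime import datetime

# Assumed timestamp format; the sample values below are illustrative only.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def duration_in_seconds(pickup, dropoff):
    """Return the seconds elapsed between two timestamp strings (hypothetical helper)."""
    start = datetime.strptime(pickup, TIMESTAMP_FORMAT)
    end = datetime.strptime(dropoff, TIMESTAMP_FORMAT)
    return int((end - start).total_seconds())


# Meter engaged at 17:24:55 and disengaged at 17:32:30 on the same day.
example = duration_in_seconds("2016-03-14 17:24:55", "2016-03-14 17:32:30")
print(example)  # 455
```

A check like this is useful during data cleaning: rows where the recorded trip_duration disagrees with the timestamp difference may deserve a closer look.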

Here is an outline of the steps we'll follow:

  • Download the dataset from Kaggle
  • Data preparation and cleaning
  • Feature Engineering
  • Exploratory Data Analysis
  • Prepare data for machine learning
  • Predicting the trip duration using machine learning
  • Conclusion and future work

Packages and Libraries

In this section, we install and import the libraries used throughout the project.

# Install the libraries used in this project
!pip install pandas-profiling numpy matplotlib seaborn plotly --quiet
!pip install opendatasets scikit-learn --quiet
!pip install geopandas folium jovian --quiet

# Import the libraries
import opendatasets as od
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import jovian
import folium
import os
import calendar
import opendatasets
import plotly.express as px
%matplotlib inline

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Dataset Download

Here we download the Kaggle dataset using the opendatasets library. opendatasets is a Python library for downloading datasets from online sources such as Kaggle and Google Drive with a single Python command.
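
As a minimal sketch of that command, the helper below wraps od.download. The competition URL, the function name download_taxi_data, and the guessed download folder are assumptions for illustration and do not come from the original notebook. When the download runs, opendatasets asks for a Kaggle username and API key unless a kaggle.json file is available.

```python
import os

# Assumed URL of the Kaggle competition page (not stated in the text above).
DATASET_URL = "https://www.kaggle.com/c/nyc-taxi-trip-duration"


def download_taxi_data(url=DATASET_URL, data_dir="."):
    """Download the competition files with opendatasets (hypothetical helper)."""
    # Imported here so the helper can be defined before the package is installed.
    import opendatasets as od

    # Prompts for Kaggle credentials unless a kaggle.json file is present.
    od.download(url, data_dir=data_dir)

    # Assumption: the files land in a folder named after the last URL segment.
    return os.path.join(data_dir, url.rstrip("/").split("/")[-1])
```

Calling download_taxi_data() in a notebook cell should fetch the three CSV files. The returned path is only a guess at the download folder, so it is worth confirming with os.listdir before reading the files.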