Ml Nyc Taxi Trip Duration - Notebook by Darshan Desai (darshandesai)

Updated a year ago

New York City Taxi Trip Prediction

The goal of this project is to predict the duration of taxi rides in NYC. The NYC taxi trip duration dataset comes from a Kaggle competition. It is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC).

The dataset contains three files:

train.csv - the training set (contains 1458644 trip records)
test.csv - the testing set (contains 625134 trip records)
sample_submission.csv - a sample submission file in the correct format

Data fields:

id - a unique identifier for each trip
vendor_id - a code indicating the provider associated with the trip record
pickup_datetime - date and time when the meter was engaged
dropoff_datetime - date and time when the meter was disengaged
passenger_count - the number of passengers in the vehicle (driver-entered value)
pickup_longitude - the longitude where the meter was engaged
pickup_latitude - the latitude where the meter was engaged
dropoff_longitude - the longitude where the meter was disengaged
dropoff_latitude - the latitude where the meter was disengaged
store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server (Y = store and forward; N = not a store and forward trip)
trip_duration - duration of the trip in seconds
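As a quick illustration of how the time fields relate, the sketch below computes a duration in seconds from a pickup and a dropoff timestamp. It assumes that trip_duration equals dropoff_datetime minus pickup_datetime and that timestamps use the `YYYY-MM-DD HH:MM:SS` format; both are assumptions to check against the actual CSV. The helper name `duration_seconds` and the sample timestamps are made up for this example.

```python
from datetime import datetime


def duration_seconds(pickup, dropoff, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the elapsed time between two timestamp strings, in seconds.

    The timestamp format is an assumption; adjust `fmt` if the data differs.
    """
    start = datetime.strptime(pickup, fmt)
    end = datetime.strptime(dropoff, fmt)
    return int((end - start).total_seconds())


# Made-up timestamps for illustration (7 minutes 35 seconds apart).
print(duration_seconds("2016-03-14 17:24:55", "2016-03-14 17:32:30"))  # 455
```

A check like this can be useful during data cleaning to flag records where the stored trip_duration disagrees with the two timestamps.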

Here is an outline of the steps we'll follow:

Download the dataset from Kaggle
Data preparation and cleaning
Feature Engineering
Exploratory Data Analysis
Prepare data for machine learning
Predicting the trip duration using machine learning
Conclusion and future work

Packages and Libraries

In this section, we install and import the libraries used in the project.

# Install the libraries (jovian and folium are imported below, so they are installed here too)
!pip install pandas-profiling numpy matplotlib seaborn plotly --quiet
!pip install opendatasets scikit-learn --quiet
!pip install geopandas jovian folium --quiet

# Import the libraries
import opendatasets as od
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import jovian
import folium
import os
import calendar
import plotly.express as px
%matplotlib inline

# Show all columns and up to 150 rows when displaying data frames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)

# Default plot style: dark grid, larger font, 10x6 figures, transparent background
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Dataset Download

Here we download the Kaggle dataset using the opendatasets library. opendatasets is a Python library for downloading datasets from online sources like Kaggle and Google Drive with a single Python command.
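A minimal sketch of such a download might look like the following. The competition URL, the helper names `download_dataset` and `missing_files`, and the assumption that files land in a folder named after the last URL segment are illustrative, not taken from this notebook. opendatasets asks for a Kaggle username and API key when downloading from Kaggle, and the competition files may arrive as archives that need extracting before the CSV names listed earlier appear.

```python
import os

# Assumed competition URL; replace it if the dataset lives elsewhere.
DATASET_URL = "https://www.kaggle.com/c/nyc-taxi-trip-duration"

# File names taken from the dataset description above.
EXPECTED_FILES = ["train.csv", "test.csv", "sample_submission.csv"]


def download_dataset(url=DATASET_URL, data_dir="."):
    """Download the dataset and return the folder it is expected to land in."""
    # Imported here so the rest of the sketch works without the package installed.
    import opendatasets as od

    od.download(url, data_dir=data_dir)  # prompts for Kaggle credentials
    # Assumption: the target folder is named after the last URL segment.
    return os.path.join(data_dir, url.rstrip("/").split("/")[-1])


def missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(folder, name))]
```

Calling `missing_files(download_dataset())` afterwards gives a quick check that all three files are in place before loading them with pandas.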