ML NYC Taxi Trip Duration
New York City Taxi Trip Prediction
The goal of this project is to predict the duration of taxi rides in NYC. The NYC taxi trip duration dataset comes from a Kaggle competition. It is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC).
The dataset contains three files:
- train.csv - the training set (contains 1458644 trip records)
- test.csv - the testing set (contains 625134 trip records)
- sample_submission.csv - a sample submission file in the correct format
The data fields are:
- id - a unique identifier for each trip
- vendor_id - a code indicating the provider associated with the trip record
- pickup_datetime - date and time when the meter was engaged
- dropoff_datetime - date and time when the meter was disengaged
- passenger_count - the number of passengers in the vehicle (driver-entered value)
- pickup_longitude - the longitude where the meter was engaged
- pickup_latitude - the latitude where the meter was engaged
- dropoff_longitude - the longitude where the meter was disengaged
- dropoff_latitude - the latitude where the meter was disengaged
- store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server (Y = store and forward; N = not a store-and-forward trip)
- trip_duration - duration of the trip in seconds
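To make the relationship between the time fields concrete, here is a minimal sketch that derives a duration in seconds from a pickup and a dropoff timestamp. It assumes that trip_duration corresponds to the difference between dropoff_datetime and pickup_datetime and that timestamps use the `YYYY-MM-DD HH:MM:SS` layout; the function name and the sample values are illustrative only.

```python
from datetime import datetime

# Assumed layout of the pickup_datetime / dropoff_datetime columns.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def trip_duration_seconds(pickup: str, dropoff: str) -> int:
    """Seconds elapsed between meter engagement and disengagement."""
    start = datetime.strptime(pickup, TIMESTAMP_FORMAT)
    end = datetime.strptime(dropoff, TIMESTAMP_FORMAT)
    return int((end - start).total_seconds())


# Illustrative values, not a claim about any particular row in train.csv.
print(trip_duration_seconds("2016-03-14 17:24:55", "2016-03-14 17:32:30"))  # 455
```

A check like this can be useful during data cleaning to spot records where the reported duration and the two timestamps disagree.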
Here is an outline of the steps we'll follow:
- Download the dataset from Kaggle
- Data preparation and cleaning
- Feature engineering
- Exploratory data analysis
- Prepare data for machine learning
- Predicting the trip duration using machine learning
- Conclusion and future work
Packages and Libraries
In this section, we install and import the libraries used in the project.
# install and import the libraries
!pip install pandas-profiling numpy matplotlib seaborn plotly --quiet
!pip install opendatasets scikit-learn --quiet
!pip install geopandas --quiet

import opendatasets as od
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import jovian
import folium
import os
import calendar
import opendatasets
import plotly.express as px
%matplotlib inline

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Here we download the Kaggle dataset using the opendatasets library. opendatasets is a Python library for downloading datasets from online sources such as Kaggle and Google Drive with a single Python command.
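As a rough sketch of what that download step might look like, the snippet below wraps `od.download` in a small helper. The competition URL is an assumption (it is not stated in the text above), the helper names `expected_data_dir` and `download_taxi_data` are hypothetical, and the idea that files land in a folder named after the last URL segment is likewise an assumption about how opendatasets lays out its downloads.

```python
import os

# Assumption: the Kaggle competition page that hosts this dataset.
COMPETITION_URL = "https://www.kaggle.com/c/nyc-taxi-trip-duration"


def expected_data_dir(url: str, data_dir: str = ".") -> str:
    """Folder where the files are expected to land.

    Assumed layout: <data_dir>/<last segment of the URL>.
    """
    return os.path.join(data_dir, url.rstrip("/").split("/")[-1])


def download_taxi_data(url: str = COMPETITION_URL, data_dir: str = ".") -> str:
    """Download the competition files and return the folder expected to hold them."""
    # Imported lazily so the path helper above works without the library installed.
    import opendatasets as od

    # Typically asks for a Kaggle username and API key unless a kaggle.json
    # credentials file is available in the working directory.
    od.download(url, data_dir=data_dir)
    return expected_data_dir(url, data_dir)


# Usage (needs network access and Kaggle credentials):
# folder = download_taxi_data()
# print(os.listdir(folder))
```

Kaggle competitions generally require accepting the competition rules on the website before the files can be downloaded, so the call may fail until that is done.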