Predict the duration of taxi rides in NYC using machine learning. Download the NYC taxi trip dataset, clean and prepare the data, perform feature engineering, and explore the data. The project uses libraries like pandas, numpy, and scikit-learn.
The goal of this project is to predict the duration of taxi rides in NYC. The NYC taxi trip duration dataset comes from a Kaggle competition. It is based on the 2016 NYC Yellow Cab trip record data made available in BigQuery on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC).
The dataset contains three files:
Data fields:
Here is an outline of the steps we'll follow:
In this section, we install and import the libraries used in the project.
# install and import the libraries
!pip install pandas-profiling numpy matplotlib seaborn plotly --quiet
!pip install opendatasets scikit-learn jovian folium --quiet
!pip install geopandas --quiet
import opendatasets as od
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import jovian
import folium
import os
import calendar
import opendatasets
import plotly.express as px
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Here we download the Kaggle dataset using the opendatasets library. opendatasets is a Python library for downloading datasets from online sources such as Kaggle and Google Drive with a single Python command.
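As a minimal sketch of that download step, the snippet below wraps `od.download` in a small helper. The competition URL, the helper names `expected_data_dir` and `download_dataset`, and the assumption that the files land in a folder named after the last segment of the URL are illustrative and not taken from the original notebook. Kaggle will ask for a username and API key (or read them from a `kaggle.json` file in the working directory), and competition data typically requires accepting the competition rules on Kaggle first.

```python
import os
from urllib.parse import urlparse

# Assumed competition URL; the original text does not state it explicitly.
DATASET_URL = "https://www.kaggle.com/c/nyc-taxi-trip-duration"


def expected_data_dir(url: str, data_dir: str = ".") -> str:
    """Return the folder we expect the files to land in.

    Assumption: the folder is named after the last path segment of the URL.
    """
    slug = urlparse(url).path.rstrip("/").split("/")[-1]
    return os.path.join(data_dir, slug)


def download_dataset(url: str = DATASET_URL, data_dir: str = ".") -> str:
    """Download the dataset with opendatasets and return the target folder."""
    # Imported lazily so the path helper above works without the library.
    import opendatasets as od

    # Prompts for Kaggle credentials unless a kaggle.json file is present.
    od.download(url, data_dir=data_dir)
    return expected_data_dir(url, data_dir)
```

In a notebook cell, `data_dir = download_dataset()` followed by `os.listdir(data_dir)` would then list the downloaded files, provided the folder-name assumption holds.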