New York City Taxi Fare Prediction

The goal of this project is to predict New York City taxi fares with reasonable accuracy, using the dataset provided by this Kaggle competition: https://www.kaggle.com/competitions/new-york-city-taxi-fare-prediction/overview

The dataset consists of the following CSV files:

  • train.csv: Input features and target fare_amount values for the training set (about 55M rows)
  • test.csv: Input features for the test set (about 10K rows). Your goal is to predict fare_amount for each row.
  • sample_submission.csv: A sample submission file in the correct format (columns key and fare_amount). This file 'predicts' fare_amount to be $11.35 for all rows, which is the mean fare_amount from the training set.
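
The constant prediction in the sample submission is a useful floor for any later model to beat. A minimal sketch of how such a baseline could be built with pandas is shown here; the helper name constant_baseline, the toy keys and fares, and the output path are all hypothetical and are not part of the competition files.

```python
import pandas as pd


def constant_baseline(test_keys, train_fares):
    """Build a submission frame that predicts the mean training fare for every key."""
    mean_fare = float(pd.Series(train_fares, dtype="float64").mean())
    # The scalar mean is broadcast to every row of the key column.
    return pd.DataFrame({"key": list(test_keys), "fare_amount": mean_fare})


# Toy values for illustration only; they are not taken from the real dataset.
toy_submission = constant_baseline(
    test_keys=["ride-a", "ride-b", "ride-c"],
    train_fares=[4.5, 16.9, 5.7, 7.7],
)

# Writing the file would look like this (hypothetical output path):
# toy_submission.to_csv("submission.csv", index=False)
```

With the real data, test_keys would come from the key column of test.csv and train_fares from the fare_amount column of train.csv.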

The training and test datasets have the following features:

  • pickup_datetime - timestamp value indicating when the taxi ride started.
  • pickup_longitude - float for longitude coordinate of where the taxi ride started.
  • pickup_latitude - float for latitude coordinate of where the taxi ride started.
  • dropoff_longitude - float for longitude coordinate of where the taxi ride ended.
  • dropoff_latitude - float for latitude coordinate of where the taxi ride ended.
  • passenger_count - integer indicating the number of passengers in the taxi ride.

The target column fare_amount is a float giving the cost of the taxi ride in dollars. This value is present only in the training set; it is what we predict for the test set.
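
Because train.csv has about 55M rows, it can help to read only part of it and to declare column types up front. The sketch below shows one way to do that with pandas.read_csv. The helper name load_rides, the choice of float32 and uint8 (which assumes passenger_count has no missing or negative values), and the toy rows are assumptions made for illustration, not something the competition prescribes.

```python
import io

import pandas as pd

# Column types follow the feature descriptions above. The narrow types are an
# assumption made to save memory; widen them if the real data does not fit.
DTYPES = {
    "fare_amount": "float32",
    "pickup_longitude": "float32",
    "pickup_latitude": "float32",
    "dropoff_longitude": "float32",
    "dropoff_latitude": "float32",
    "passenger_count": "uint8",
}


def load_rides(source, nrows=None):
    """Read ride records from a path or file-like object, optionally only the first nrows rows."""
    return pd.read_csv(
        source,
        dtype=DTYPES,
        parse_dates=["pickup_datetime"],
        nrows=nrows,
    )


# Invented rows standing in for train.csv so the sketch runs without the real file.
toy_csv = io.StringIO(
    "key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,"
    "dropoff_longitude,dropoff_latitude,passenger_count\n"
    "ride-a,4.5,2012-01-01 08:30:00,-73.99,40.75,-73.98,40.76,1\n"
    "ride-b,16.9,2012-01-01 09:00:00,-74.00,40.71,-73.95,40.78,2\n"
    "ride-c,5.7,2012-01-01 09:15:00,-73.97,40.76,-73.99,40.74,1\n"
)
toy_rides = load_rides(toy_csv, nrows=2)
```

Against the real file, a call such as load_rides("train.csv", nrows=1_000_000) would read only the first million rows; the path and row count here are placeholders.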

Import Libraries

Let's start by importing the Python libraries and packages required for this project:

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

%matplotlib inline