
Project: Predicting surge pricing type using XGBoost, random forest, and logistic regression

The following topics are covered in this tutorial:

  • Downloading a real-world dataset from a Kaggle competition
  • Performing feature engineering and preparing the dataset for training
  • Training, evaluating and interpreting a decision tree
  • Training, evaluating and interpreting a random forest
  • Tuning hyperparameters to improve the model
  • Training and interpreting a gradient boosting model using XGBoost
  • Configuring the gradient boosting model and tuning hyperparameters. Gradient boosting is probably one of the most powerful classical machine learning algorithms, so it is often an effective choice for problems involving tabular data.

Let's begin by installing the required libraries:

!pip install numpy pandas matplotlib seaborn --quiet
!pip install jovian opendatasets xgboost graphviz lightgbm scikit-learn --upgrade --quiet

We'll learn gradient boosting, random forests, decision trees, and logistic regression by applying them to a real-world dataset from Kaggle.


The data is provided by Sigma Cabs, an Indian cab aggregator service. Customers can download the app on their smartphones and book a cab from anywhere in the cities where the service operates. Sigma Cabs, in turn, searches for cabs from various service providers and offers the customer the best of the available options. The service has been in operation for a little less than a year. During this period, it has captured the surge pricing type from the service providers.

Problem Statement

The main objective is to build a predictive model that helps Sigma Cabs predict the surge pricing type proactively. This would in turn help them match the right cabs with the right customers quickly and efficiently.

View and download the data here: https://www.kaggle.com/arashnic/taxi-pricing-with-mobility-analytics
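As a minimal sketch, the dataset could be downloaded from within the notebook using the opendatasets package installed earlier. The helper name `download_dataset` and the `data_dir` default are illustrative assumptions, not part of the original tutorial. The call needs network access and a Kaggle username and API key, so it is left commented out here.

```python
# Dataset page given in this tutorial
DATASET_URL = "https://www.kaggle.com/arashnic/taxi-pricing-with-mobility-analytics"


def download_dataset(url=DATASET_URL, data_dir="."):
    """Download the Kaggle dataset into data_dir (hypothetical helper)."""
    # Imported inside the function so the cell can be defined even if the
    # package has not been installed yet
    import opendatasets as od

    # Prompts for a Kaggle username and API key unless a kaggle.json
    # credentials file is available
    od.download(url, data_dir=data_dir)


# Uncomment to run; requires network access and Kaggle credentials
# download_dataset()
```

After a successful download, the files should appear in a folder under `data_dir`, typically named after the dataset; check the folder contents before loading anything.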


Trip_ID: ID of the trip

Trip_Distance: The distance for the trip requested by the customer

TypeofCab: Category of the cab requested by the customer

CustomerSinceMonths: Number of months the customer has been using cab services; 0 means the current month

LifeStyleIndex: Proprietary index created by Sigma Cabs showing lifestyle of the customer based on their behaviour

ConfidenceLifeStyle_Index: Category showing confidence in the index mentioned above

Destination_Type: Sigma Cabs assigns each destination to one of 14 categories

Customer_Rating: Average of the customer's lifetime ratings to date

CancellationLast1Month: Number of trips cancelled by the customer in the last month

Var1, Var2 and Var3: Continuous variables masked by the company; they can be used for modelling

Gender: Gender of the customer

SurgePricingType: Target (can be of 3 types)
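The data dictionary above can be turned into a quick schema check before any modelling. The sketch below is illustrative only: the column names are copied as written in this section (the header of the real CSV may differ, for example in underscore placement), the two sample rows are made up rather than taken from the dataset, and dropping Trip_ID from the inputs is an assumption based on it being just an identifier. The helper name `split_features_target` is hypothetical.

```python
import io

import pandas as pd

# Column names as listed in the data dictionary; adjust to the real CSV header
EXPECTED_COLUMNS = [
    "Trip_ID", "Trip_Distance", "TypeofCab", "CustomerSinceMonths",
    "LifeStyleIndex", "ConfidenceLifeStyle_Index", "Destination_Type",
    "Customer_Rating", "CancellationLast1Month", "Var1", "Var2", "Var3",
    "Gender", "SurgePricingType",
]
TARGET = "SurgePricingType"


def split_features_target(df):
    """Check that all expected columns exist, then separate inputs and target."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    # Assumption: Trip_ID is only an identifier, so it is not used as an input
    X = df.drop(columns=["Trip_ID", TARGET])
    y = df[TARGET]
    return X, y


# Two made-up rows for illustration only (not real data)
sample_csv = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "T1,6.77,B,1,2.4,A,A,3.9,0,40,46,60,Female,2\n"
    "T2,29.47,C,10,2.8,B,E,3.4,2,38,56,77,Male,1\n"
)
df = pd.read_csv(sample_csv)
X, y = split_features_target(df)
print(X.shape)     # 2 rows, 12 input columns
print(y.tolist())  # target values of the two sample rows
```

With the real data, `sample_csv` would be replaced by the path to the downloaded training file.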