ML Project 3
Is Your Vehicle a Lemon?
Predict whether a car purchased at an auction is a good or bad buy using ML
One of the biggest challenges for an auto dealership purchasing a used car at an auto auction is the risk that the vehicle might have serious issues that prevent it from being sold to customers. The auto community calls these unfortunate purchases "kicks".
Kicked cars often result when there are tampered odometers, mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. Kicked cars can be very costly to dealers after transportation costs, throw-away repair work, and market losses in reselling the vehicle.
Modelers who can figure out which cars have a higher risk of being kicks can provide real value to dealerships trying to offer the best possible inventory selection to their customers.
The challenge of this competition is to predict if the car purchased at the Auction is a Kick (bad buy) or not.
In this project, we will take a look at real-world data on used vehicles purchased at auction, and use machine learning techniques to build models that could help a dealership decide which vehicles are likely to be bad buys before purchasing them.
All the variables in the data set are defined as follows:
train.csv - Training data.
- RefID: Unique (sequential) number assigned to vehicles
- IsBadBuy: Identifies if the kicked vehicle was an avoidable purchase
- PurchDate: The Date the vehicle was Purchased at Auction
- Auction: Auction provider at which the vehicle was purchased
- VehYear: The manufacturer's year of the vehicle
- VehicleAge: The Years elapsed since the manufacturer's year
- Make: Vehicle Manufacturer
- Model: Vehicle Model
- Trim: Vehicle Trim Level
- SubModel: Vehicle Submodel
- Color: Vehicle Color
- Transmission: The vehicle's transmission type (Automatic, Manual)
- WheelTypeID: The type id of the vehicle wheel
- WheelType: The vehicle wheel type description (Alloy, Covers)
- VehOdo: The vehicle's odometer reading
- Nationality: The Manufacturer's country
- Size: The size category of the vehicle (Compact, SUV, etc.)
- TopThreeAmericanName: Identifies if the manufacturer is one of the top three American manufacturers
- MMRAcquisitionAuctionAveragePrice: Acquisition price for this vehicle in average condition at time of purchase
- MMRAcquisitionAuctionCleanPrice: Acquisition price for this vehicle in above average condition at time of purchase
- MMRAcquisitionRetailAveragePrice: Acquisition price for this vehicle in the retail market in average condition at time of purchase
- MMRAcquisitonRetailCleanPrice: Acquisition price for this vehicle in the retail market in above average condition at time of purchase
- MMRCurrentAuctionAveragePrice: Acquisition price for this vehicle in average condition as of current day
- MMRCurrentAuctionCleanPrice: Acquisition price for this vehicle in above average condition as of current day
- MMRCurrentRetailAveragePrice: Acquisition price for this vehicle in the retail market in average condition as of current day
- MMRCurrentRetailCleanPrice: Acquisition price for this vehicle in the retail market in above average condition as of current day
- PRIMEUNIT: Identifies if the vehicle would have a higher demand than a standard purchase
- AcquisitionType: Identifies how the vehicle was acquired (Auction buy, trade in, etc.)
- AUCGUART: The level of guarantee provided by the auction for the vehicle (Green light - guaranteed/arbitrable, Yellow light - caution/issue, Red light - sold as is)
- KickDate: Date the vehicle was kicked back to the auction
- BYRNO: Unique number assigned to the buyer that purchased the vehicle
- VNZIP: Zipcode where the car was purchased
- VNST: State where the car was purchased
- VehBCost: Acquisition cost paid for the vehicle at time of purchase
- IsOnlineSale: Identifies if the vehicle was originally purchased online
- WarrantyCost: Warranty price (term=36 months and mileage=36K)
test.csv - Test data. Same schema as the train data, minus
The data contains missing values.
The dependent variable (IsBadBuy) is binary (C2).
There are 32 Independent variables (C3-C34).
The data set is split into 60% training and 40% testing.
The classification model is evaluated by its accuracy score.
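The facts above (missing values, a binary target, a 60/40 split, accuracy as the metric) can be sketched as follows. This is a minimal illustration on a tiny synthetic frame, not the actual competition data: the values, the `random_state`, the choice to stratify the split, and the always-predict-0 baseline are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for train.csv (values are made up for illustration).
df = pd.DataFrame({
    "VehOdo": [71000, 93500, np.nan, 65600, 80100, 54000, 88000, np.nan, 61000, 99000],
    "VehBCost": [7100, 7600, 4900, 4100, 4000, 5600, 4200, 4500, 5800, 3900],
    "IsBadBuy": [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
})

# Count missing values per column.
missing_counts = df.isna().sum()

# 60% training / 40% testing. Stratifying on the target (an assumption here)
# keeps the share of kicks similar in both parts.
train_df, test_df = train_test_split(
    df, test_size=0.4, stratify=df["IsBadBuy"], random_state=42
)

# Accuracy is the fraction of correct predictions. As a reference point,
# score a trivial baseline that always predicts "good buy" (0).
baseline_preds = np.zeros(len(test_df), dtype=int)
baseline_accuracy = accuracy_score(test_df["IsBadBuy"], baseline_preds)

print(missing_counts)
print(f"train rows: {len(train_df)}, test rows: {len(test_df)}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

If kicks turn out to be a minority of purchases, a model's accuracy is worth comparing against a baseline like this one, since a high accuracy score alone does not show that the model identifies bad buys.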
Here are the steps we'll cover in this project:
- Install and import libraries
- Download dataset from Kaggle using opendatasets
- Exploratory analysis and visualization
- Training, Validation and Test Sets
- Identify Input and Target columns
- Find numerical and categorical columns
- Impute missing numerical data
- Scaling numeric features
- Encoding Categorical Data
- Saving Processed Data to Disk
- Train a Logistic Regression Model
- Make predictions and evaluate the model
- Train a Random Forest model
- Tune hyperparameters and re-train new models
- Evaluate the models
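As a rough sketch of how the preprocessing and modeling steps listed above might fit together with scikit-learn, the example below builds one pipeline per model on synthetic data. The data values, the mean imputation strategy, the `MinMaxScaler`, the one-hot encoding, and all hyperparameters are assumptions for illustration only; downloading the dataset, a separate validation set, saving to disk, and hyperparameter tuning are left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the real data (made-up values, column names from the data dictionary).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "VehicleAge": rng.integers(1, 10, size=n),
    "VehOdo": rng.integers(30000, 110000, size=n).astype(float),
    "VehBCost": rng.uniform(2000, 12000, size=n),
    "Make": rng.choice(["FORD", "DODGE", "CHEVROLET"], size=n),
    "Transmission": rng.choice(["AUTO", "MANUAL"], size=n),
})
# Made-up target rule so the example has some signal to learn.
df["IsBadBuy"] = (df["VehicleAge"] + rng.normal(0, 2, size=n) > 6).astype(int)
df.loc[df.index[::15], "VehOdo"] = np.nan  # inject some missing values

# Identify input and target columns, then numeric vs categorical inputs.
target_col = "IsBadBuy"
input_cols = [c for c in df.columns if c != target_col]
numeric_cols = df[input_cols].select_dtypes(include=np.number).columns.tolist()
categorical_cols = df[input_cols].select_dtypes(exclude=np.number).columns.tolist()

# 60/40 split, as stated in the project description.
train_inputs, test_inputs, train_targets, test_targets = train_test_split(
    df[input_cols], df[target_col],
    test_size=0.4, stratify=df[target_col], random_state=42,
)

# Impute and scale numeric columns; one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", MinMaxScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
}

# Fit each model behind the same preprocessing and score it on the test set.
scores = {}
for name, model in models.items():
    clf = Pipeline([("preprocess", preprocess), ("model", model)])
    clf.fit(train_inputs, train_targets)
    scores[name] = accuracy_score(test_targets, clf.predict(test_inputs))

for name, score in scores.items():
    print(f"{name}: test accuracy = {score:.3f}")
```

Wrapping the preprocessing and the model in a single `Pipeline` means the imputer, scaler, and encoder are fitted on the training rows only, which avoids leaking information from the test set into the preprocessing statistics.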