Course Project - Machine Learning with Python: Zero to GBMs

Introduction

For this course project, I am to apply the machine learning skills covered in the course by training a machine learning model on a real-world dataset. These are the steps required to complete the project:

Pick a large real-world dataset from Kaggle (see the "Recommended Datasets" section below) and download it using opendatasets. Your training set should contain at least 50,000 rows and 5 columns of data.

Read the dataset description, understand the problem statement and describe the modeling objective clearly. You can also browse through existing notebooks created by others for inspiration.

Perform exploratory data analysis, gather insights about the data, perform feature engineering, create a training-validation split, and prepare the data for modeling.

Train & evaluate different machine learning models, tune hyperparameters and reduce overfitting to improve the model.

Report the final performance of your best model(s), show sample predictions, and save model weights. Summarize your work, share links to references, and suggest ideas for future work.

Publish your Jupyter notebook to Jovian, make a submission below and share your project with the community

— https://jovian.ai/learn/machine-learning-with-python-zero-to-gbms/assignment/course-project-real-world-machine-learning-model

For step one, as I was browsing Kaggle for datasets, I noticed that Kaggle runs a monthly competition geared towards newcomers to machine learning called the Tabular Playground Series. Since a new contest was about to start, I decided to use the August competition as my dataset.

Dataset Description

The dataset description from the competition is:

The dataset is used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with calculating the loss associated with a loan defaults. Although the features are anonymized, they have properties relating to real-world features.

Although the data is synthesized from a real dataset and the features are anonymized, the task is to predict the amount of money a lender might lose if a borrower defaults on a loan.

We are to use the feature columns in the dataset to predict the target loss. The evaluation metric is the Root Mean Squared Error (RMSE) of our predictions on the test data.
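To make the metric concrete, here is a minimal sketch of how RMSE can be computed by hand. The function name `rmse` and the sample values are illustrative assumptions, not part of the competition data or the course material; in practice a library routine would likely be used instead.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up loss values purely for illustration (not from the competition data).
actual = [0.0, 2.0, 5.0, 1.0]
predicted = [1.0, 2.0, 3.0, 1.0]
score = rmse(actual, predicted)
print(f"RMSE: {score:.4f}")
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones, and a lower RMSE means better predictions.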

Because the target is a numeric loss amount and the predictions are scored with RMSE, we know that this is a regression problem.