Course Project Machine Learning Local
Instacart Market Basket Analysis
This two-week project was done as part of Machine Learning with Python: Zero to GBMs, a six-week course taught by Aakash N. S. and hosted on Jovian.ai.
The Zero to GBMs course focuses on supervised machine learning, decision trees, and gradient boosting using Python and its ecosystem of ML libraries: scikit-learn, XGBoost, and LightGBM.
For this course project I selected the Instacart Market Basket Analysis competition on Kaggle.com. The work covers data cleaning and feature engineering, followed by training, comparing, and tuning machine learning models to build a recommendation system that predicts users' future behavior from the provided data.
Instacart provided the dataset, which was downloaded from Kaggle.
1. Business Problem
In this competition, Instacart challenged the Kaggle community to use the provided anonymized data on customer orders over time to predict which previously purchased products will be in a user's next order.
Online businesses now use recommendation systems widely to increase user interaction and sales, drawing on information gathered from customers' purchases and preferences to maximize their return on investment (ROI). The knowledge-based recommender system used by Instacart serves those purposes by presenting more relevant products to each individual user.
1.1 Dataset Content
The dataset contains about 3.4 million grocery orders from 200 thousand Instacart users. The data is distributed among six CSV files plus a sample submission file.
Each order is labeled as prior, train, or test. Prior orders describe the users' past behavior, while train and test orders represent the future behavior that we need to predict. Each user belongs to either the train group or the test group, and every user has information on the products purchased in all of their prior orders.
For the train orders, the reordered products are already labeled. We use them to train a classification model that predicts which products will be reordered in the test orders.
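The prior/train/test framing above can be sketched with pandas. This is a minimal illustration under stated assumptions, not the notebook's actual preparation code: the tiny DataFrames are made-up stand-ins for the Kaggle files, the column names (`order_id`, `user_id`, `eval_set`, `product_id`) are assumed to follow those files, and variable names such as `candidates`, `train_rows`, and `test_rows` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the orders file (values are made up for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "user_id":  [10, 10, 10, 20, 20, 20],
    "eval_set": ["prior", "prior", "train", "prior", "prior", "test"],
})

# Toy stand-ins for the products contained in prior and train orders.
prior_products = pd.DataFrame({
    "order_id":   [1, 1, 2, 4, 5, 5],
    "product_id": [100, 200, 100, 300, 300, 400],
})
train_products = pd.DataFrame({
    "order_id":   [3, 3],
    "product_id": [100, 500],
})

# Candidates: every distinct (user, product) pair seen in that user's prior orders.
prior_orders = orders.loc[orders["eval_set"] == "prior", ["order_id", "user_id"]]
candidates = (
    prior_products.merge(prior_orders, on="order_id")[["user_id", "product_id"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Attach each user's single future order, which is either a train or a test order.
future = orders.loc[orders["eval_set"] != "prior", ["user_id", "order_id", "eval_set"]]
candidates = candidates.merge(future, on="user_id")

# Mark candidates that actually appear in the train order (1 = reordered).
bought = train_products[["order_id", "product_id"]].assign(reordered=1)
candidates = candidates.merge(bought, on=["order_id", "product_id"], how="left")

# Train rows carry a 0/1 target; test rows have no target and must be predicted.
train_rows = candidates[candidates["eval_set"] == "train"].copy()
train_rows["reordered"] = train_rows["reordered"].fillna(0).astype(int)
test_rows = candidates[candidates["eval_set"] == "test"].drop(columns="reordered")

print(train_rows)
print(test_rows)
```

In this toy data, product 500 appears in the train order but was never bought before, so it is not a candidate: only previously purchased products are scored.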
1.2 Solution Strategy
My strategy to solve this challenge was:
Step 01. Obtain and unpack the data from the source.
Step 02. Data Description and Filtering.
Step 03. Exploratory Data Analysis.
Step 04. Feature Engineering.
Step 05. Data Preparation.
Step 06. Feature Selection.
Step 07. Machine Learning Modelling: XGBoost & LightGBM.
Step 08. Parameter Tuning.
Step 09. Evaluate Model Performance.
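To make the evaluation step concrete, here is a small pure-Python sketch of an F1 score computed per order and then averaged across orders. It assumes predictions are compared as sets of product IDs per order; the helper names `basket_f1` and `mean_f1` are hypothetical, and scoring two empty baskets as 1.0 is a convention chosen here for illustration. The notebook itself imports `f1_score` from `sklearn.metrics`, which may be applied differently.

```python
def basket_f1(predicted, actual):
    """F1 between a predicted and an actual set of product IDs for one order."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        # Assumption: an empty prediction for an empty basket counts as perfect.
        return 1.0
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pred_by_order, actual_by_order):
    """Average the per-order F1 over every order with known ground truth."""
    scores = [
        basket_f1(pred_by_order.get(order_id, []), actual)
        for order_id, actual in actual_by_order.items()
    ]
    return sum(scores) / len(scores)


# Made-up example: two orders with predicted and actual reordered products.
predictions = {3: [100, 200], 6: [300]}
truth = {3: [100, 300], 6: [300, 400]}
print(round(mean_f1(predictions, truth), 4))
```

Averaging per order, rather than pooling all (order, product) rows, gives every order equal weight regardless of basket size.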
Importing packages and downloading data
!pip install numpy pandas matplotlib seaborn --quiet
!pip install jovian opendatasets xgboost graphviz lightgbm scikit-learn tabulate --upgrade --quiet

# Jovian commit essentials
import jovian
jovian.set_project('course-project-machine-learning-local')

# Data analysis
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np

# ML models
import xgboost as xgb
from xgboost import plot_tree
from xgboost import plot_importance
import lightgbm as lgb

# Models performance evaluation
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn import metrics

# File saving
import joblib

# Plotting
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
%matplotlib inline
import seaborn as sns
color = sns.color_palette()

# Miscellaneous
import warnings
from tabulate import tabulate
import gc