Instacart Market Basket Analysis

Anderson Alves

August, 2021

This two-week project was completed as part of Machine Learning with Python: Zero to GBMs, a six-week course taught by Aakash N. S. and hosted on Jovian.ai.

The Zero to GBMs course focuses on supervised machine learning, decision trees, and gradient boosting using Python and its ecosystem of ML libraries: scikit-learn, XGBoost, and LightGBM.

For this course project I selected the Instacart Market Basket Analysis competition from Kaggle.com. The work covers data cleaning and feature engineering, followed by training, comparing, and tuning machine learning models to create a recommendation system that predicts the future behavior of users based on the provided data.

Acknowledgements

The dataset was provided by Instacart and downloaded from Kaggle.

1. Business Problem

This dataset comes from a competition hosted by Instacart on Kaggle. Instacart operates a grocery delivery and pick-up service that lets customers order groceries from multiple retailers, with a personal shopper doing the shopping. The service has a recommendation feature that suggests items users may want to buy again when placing a new order.

In this competition, Instacart challenged the Kaggle community to use the provided anonymized data on customer orders over time to predict which previously purchased products will be in a user's next order.

Recommendation systems are now widely used by online businesses to encourage purchases and increase user engagement, which helps them maximize their return on investment (ROI) from the information gathered about customers' purchases and preferences. The knowledge-based recommender system used by Instacart serves those purposes by presenting more relevant products to each individual user.

1.1 Dataset Content

The dataset contains about 3.4 million grocery orders from roughly 200 thousand Instacart users. The data is distributed across six CSV files plus a sample submission file:

aisles
departments
orders
order_products__prior
order_products__train
products
sample_submission
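
As a rough illustration of how the files listed above might be loaded, the sketch below reads each CSV into a pandas DataFrame keyed by its table name. The `load_tables` helper and its `data_dir` argument are hypothetical names, not part of the original notebook. The function assumes the archives have already been extracted so that each table sits in one folder as `<name>.csv`.

```python
from pathlib import Path

import pandas as pd

# Table names as listed above. sample_submission is left out because it is
# only needed when writing predictions.
TABLE_NAMES = [
    "aisles",
    "departments",
    "orders",
    "order_products__prior",
    "order_products__train",
    "products",
]


def load_tables(data_dir, names=TABLE_NAMES):
    """Read every `<name>.csv` found in data_dir into a dict of DataFrames.

    Hypothetical helper: assumes the files are already unpacked in data_dir.
    Files that are missing are skipped rather than raising an error.
    """
    tables = {}
    for name in names:
        path = Path(data_dir) / f"{name}.csv"
        if path.exists():
            tables[name] = pd.read_csv(path)
    return tables
```

Calling `load_tables("path/to/data")` would then give access to, for example, `tables["orders"]`; the folder name here is a placeholder.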

Each order is labeled as prior, train, or test. Prior orders describe the past behavior of the users, while train and test orders represent the future behavior that we need to predict. Each user is assigned to either the train or the test group, and every user has a record of the products purchased in all of their prior orders.

For the train orders, the reordered products are already known. We will use them to train a classification model that predicts which products will be reordered in the test orders.
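
To make the prior/train/test split more concrete, the sketch below builds a tiny labeled training table from made-up data. The column names (`eval_set`, `user_id`, `order_id`, `product_id`, `reordered`) follow the Kaggle files, but every value is invented for illustration. The framing used here (one candidate row per user and previously purchased product, labeled 1 if that product is reordered in the user's train order) is one common way to set up the problem and is an assumption, not necessarily the exact approach used later in this notebook.

```python
import pandas as pd

# Toy stand-ins for orders.csv and the two order_products files.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "user_id":  [10, 10, 10, 20, 20],
    "eval_set": ["prior", "prior", "train", "prior", "test"],
})
prior_products = pd.DataFrame({
    "order_id":   [1, 1, 2, 4],
    "product_id": [100, 200, 100, 300],
})
train_products = pd.DataFrame({
    "order_id":   [3, 3],
    "product_id": [100, 400],
    "reordered":  [1, 0],
})

# Candidates: every (user, product) pair seen in the user's prior orders.
prior = orders.loc[orders["eval_set"] == "prior", ["order_id", "user_id"]]
candidates = (
    prior.merge(prior_products, on="order_id")[["user_id", "product_id"]]
    .drop_duplicates()
)

# Positives: products in the train order that are flagged as reordered.
train = orders.loc[orders["eval_set"] == "train", ["order_id", "user_id"]]
positives = train.merge(train_products, on="order_id")
positives = positives.loc[
    positives["reordered"] == 1, ["user_id", "product_id"]
].assign(label=1)

# Keep only users from the train group and label each candidate 1 or 0.
train_users = train["user_id"].unique()
labeled = (
    candidates[candidates["user_id"].isin(train_users)]
    .merge(positives, on=["user_id", "product_id"], how="left")
    .fillna({"label": 0})
    .astype({"label": int})
    .reset_index(drop=True)
)
print(labeled)
```

User 20 has only a test order, so their candidates carry no label and would be the rows a trained model scores at prediction time.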

1.2 Solution Strategy

My strategy to solve this challenge was:

Step 01. Obtain and unpack the data from the source.

Step 02. Data Description and Filtering.

Step 03. Exploratory Data Analysis.

Step 04. Feature Engineering.

Step 05. Data Preparation.

Step 06. Feature Selection.

Step 07. Machine Learning Modelling: XGBoost & LightGBM.

Step 08. Parameter Tuning.

Step 09. Evaluate Model Performance.
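
The final step needs a score for comparing predicted and actual baskets. The Kaggle competition scored submissions by the mean F1 score across orders, so a minimal sketch of a per-order F1 is shown below. The `basket_f1` and `mean_f1` helpers and the product IDs are hypothetical, and treating two empty baskets as a perfect match is an assumption made here for simplicity.

```python
def basket_f1(predicted, actual):
    """F1 score between a predicted and an actual set of product IDs."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 1.0  # assumption: two empty baskets count as a perfect match
    hits = len(predicted & actual)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, truths):
    """Average basket_f1 over all orders; both arguments map order_id -> products."""
    scores = [basket_f1(predictions[order_id], truths[order_id]) for order_id in truths]
    return sum(scores) / len(scores)
```

For example, predicting products `[1, 2, 3]` when the order actually contains `[2, 3, 4]` gives a precision and recall of 2/3 each, so the F1 for that order is also 2/3.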

Importing packages and downloading data

!pip install numpy pandas matplotlib seaborn --quiet
!pip install jovian opendatasets xgboost graphviz lightgbm scikit-learn tabulate --upgrade --quiet

# Jovian commit essentials
import jovian
jovian.set_project('course-project-machine-learning-local')

# Data analysis
import pandas as pd
pd.options.mode.chained_assignment = None  # silence pandas' SettingWithCopyWarning
import numpy as np

# ML models
import xgboost as xgb
from xgboost import plot_tree, plot_importance
import lightgbm as lgb

# Model performance evaluation
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn import metrics

# File saving
import joblib

# Plotting
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
%matplotlib inline
import seaborn as sns
color = sns.color_palette()

# Miscellaneous
import warnings
from tabulate import tabulate
import gc