Exploratory data analysis of Shrinkme

Customer behaviour data of a multi-category e-commerce store


" The Matrix is everywhere. It is all around us. Even now in this very room. You can see it when you look out of your window, or when you turn on your television. You can feel it when you go to work...." - Morpheus, The Matrix 1999

Fast forward to 2021, we are deeply enmeshed in the virtual world for social lives, shopping online, working or learning remotely.

  • It is critical for businesses of any type and size to have an online presence. Retail e-commerce sales which refers to sales from B2C worldwide was 4.28 trillion US dollars in 2020. Over 2 billion people purchased goods or services online during the same year.ref
  • Analysing customer behaviour helps businesses to increase customer value and therefore spend by content personalisation and optimisation.
  • Based on this analysis, a profile of the customer is created, they are grouped into segments by their digital footprint , interests and behaviour.

Exploratory analysis and predictive analysis techniques help businesses to structure and recalibrate their pricing, marketing, inventory stragies and more in real-time.

B2C e-commerce businesses generate voluminous amounts of data. In this notebook we will perform the exploratory data analysis for a multicategory store using Python, Pandas, Matplotlib and Seaborn.

  • Identify a dataset for analysis
  • Explore techniques to work with large data sets
  • Pre-process or prepare the data for analysis
  • Transform the data into various summary views

Let us call this anonymised datset Shrinkme. We will use our framework to ask and answer interesting questions, create visualisations for the following areas of interest:

  • Pattern of visits by day , day of week or hour of day
  • View and purchase patterns for products and categories
  • Popular brands and their contribution to monthly sales
  • Segment customers using RFM

CAUTION: I recommed opening and running this notebook on Google Colab for two reasons

  1. Large dataset: 5GB, binder runs into issues
  2. Ease of navigation with Colab table of contents: This is a long notebook. I tried to pack in a lot of information as I faced several challenges as a newbie to e-commerce datasets and python EDA on large datasets.

About the data

In this analysis we will use the "eCommerce behavior data from multi category store" data by Michael Kechinov on Kaggle from the REES46 Marketing Platform.

The Kaggle dataset has two .csv files

  • 2019-Nov.csv 8.4 GB and 67.5 million records
  • 2019-Oct.csv 5.3GB and 42.4 million records

We will build the framework with and perform the analysis on the 2019-Oct.csv file.

Each .csv file has the following structure of 9 columns

  1. event_time: Time when the event occured (in UTC) in the format "2019-10-01 00:00:04+00:00"
  2. event_type: A single event either "view", "cart" or "purchase"
  3. product_id: A unique code for each product
  4. category_id: A unique code for each product category
  5. category_code: A '.' separated text with names of categories and subcategories "electronics.smartphone"
  6. brand: Brand name of the product "apple"
  7. price: Price of the product in decimals "1081.98"
  8. user_id: A permanent unique ID per shopper "554748717"
  9. user_session: A temporary ID per session "9333dfbd-b87a-4708-9857-6336556b0fcc"

How to read the data

At time event_time and during session user_session, user user_id purchased (event_type is "purchase") a product product_id of brand brand in category category_code of price price.

Assumptions and notes for this analysis :

  • Each unique session is a visit
  • There are no remove_from_cart events in this dataset
  • A session may have just one purchase event and no related view or cart events.
  • A session can have multiple purchase events
  • category_code is usually present for meaningful categories and skipped for the rest.
  • Price is assumed to be in US Dollars

Happy Coding!!

Use the "Run" button to execute the code.

Install packages and import libraries

!pip install numpy pandas==1.1.5 plotly==4.7.1 opendatasets pyarrow squarify pygal --upgrade --quiet
# Import python data analysis libraries
import pandas as pd
import numpy as np

# Import library to download data from Kaggle
import opendatasets as od

# Import visualisation libraries
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly import graph_objects as go
from wordcloud import WordCloud
import pygal
from pygal.style import Style
import squarify

Let us set a base style for all our visualisations in this notebook. These customisations from Mounir create excellent clarity images to use in presentations. You can override them and customise individual plots as required.

# Set plot parameters for the notebook
%matplotlib inline
matplotlib.rcParams['font.size'] = 18
matplotlib.rcParams['figure.figsize'] = (18, 10)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
import warnings

Setup your google drive to save intermediate results.

# Import Drive API and authenticate.
from google.colab import drive

# Mount my Google Drive to the Colab VM.
Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).
# Execute this to save new versions of the notebook

Working with large datasets

Data download

Let us download the data from Kaggle using the opendatasets library. You will need your Kaggle access token handy.

#Kaggle url for the dataset
data_url = 'https://www.kaggle.com/mkechinov/ecommerce-behavior-data-from-multi-category-store'
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds Your Kaggle username: sillyfalcon Your Kaggle Key: ··········
0%| | 5.00M/4.29G [00:00<02:05, 36.6MB/s]
Downloading ecommerce-behavior-data-from-multi-category-store.zip to ./ecommerce-behavior-data-from-multi-category-store
100%|██████████| 4.29G/4.29G [00:58<00:00, 79.4MB/s]

Let us check if the data has been downloaded.

!ls -lh ecommerce-behavior-data-from-multi-category-store
total 14G -rw-r--r-- 1 root root 8.4G Jun 12 04:17 2019-Nov.csv -rw-r--r-- 1 root root 5.3G Jun 12 04:19 2019-Oct.csv

The data has been downloaded and unzipped to the folder ``./ecommerce-behavior-data-from-multi-category-store` Let us now check the size of the folder. There are two files in this folder.

  1. 2019-Nov.csv 8.4 GB
  2. 2019-Oct.csv 5.3GB

We will select 2019-Oct.csv for our analysis

!wc -l ecommerce-behavior-data-from-multi-category-store/2019-Oct.csv
42448765 ecommerce-behavior-data-from-multi-category-store/2019-Oct.csv

There are 42.44 million records in 2019-Oct csv

oct_2019_csv = './ecommerce-behavior-data-from-multi-category-store/2019-Oct.csv'
#load the file using Pandas. 
#oct_2019_df = pd.read_csv(oct_2019_csv)
CPU times: user 3 µs, sys: 1 µs, total: 4 µs Wall time: 10 µs

Phenomenal!! That's 5.3 GB data with 42 million records loaded into a Pandas dataframe in under 2 mins!! With parse_dates the time taken to read the csv into the datafram is 10min 43s.


There are 42.4 million rows and 9 columns in the dataset. Using Pandas this dataset is now 2.8 GB from 5.3GB as a .csv

Work with a sample - a fraction of the dataset

As this massive dataset let's use 1% of the data to build our EDA framework. This is very important and often overlooked. Working on a smaller sample saves significant time while experimenting with code.

Once we have the full framework ready, we can run the analysis on the complete dataset.

#create a sample of 1% from the dataset
#oct_2019_sample_df = oct_2019_df.sample(frac=0.01)
CPU times: user 4 µs, sys: 0 ns, total: 4 µs Wall time: 9.06 µs

Our sample dataset now contains 424k rows and the same 9 columns. It is 32.4 MB in size.

Save intermediate results

While working with large datasets, you may runtinto runtime issues. Therefore it will be helpful to save the intermediate results to your google drive or a local folder to pick up and continue from the point.

Let us save this sample dataset into my google drive. We will use the data in this file to build our framework.

We will continue to save short snapshots of data offline as we move along the notebook. You can save the data in various formats including binary formats such as .feather

# Write the DataFrame to CSV file.
#with open('/gdrive/My Drive/eda-ecomm-project/sample-oct-2019.csv', 'w') as f:
#    oct_2019_df_sample.to_csv(f)
Explore faster loading and lesser memory

We will use some techniques to load data to Pandas faster and use less memory.

  • drop columns: select a subset of columns relevant for analysis
  • identify categorical columns: change the dtype tp category
  • parse_dates: change columns with date\time to type DateTime
  • set DateTime column as the index
  • use smaller dtypes: we don't see any need as of now
#use only a subset of the columns
selected_columns = ['event_time', 'event_type', 'category_code', 'brand', 'price', 'user_id', 'user_session']
change_dtype = {'event_type':'category'} 

Note: Remove comment in the cell below to run the analysis on the appropriate dataset.

#Sample data
#sample_csv_url = '/gdrive/My Drive/eda-ecomm-project/sample-oct-2019.csv'

#Full data
sample_csv_url = oct_2019_csv

Load the data to a Pandas dataframe.

# load data with read_csv() choosing selected columns, dtypes and parse_dates

ecomm_df = (pd.read_csv(sample_csv_url, 
                       index_col= 'event_time',      #set index
                       usecols= selected_columns,    #use a subset of columns
                       dtype= change_dtype,          #change dtype of selected columns
                       parse_dates= ['event_time'])
                       ) #change dtype tp DateTime
CPU times: user 10min 37s, sys: 18.8 s, total: 10min 56s Wall time: 11min 35s
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 42448764 entries, 2019-10-01 00:00:00+00:00 to 2019-10-31 23:59:59+00:00 Data columns (total 6 columns): # Column Dtype --- ------ ----- 0 event_type category 1 category_code object 2 brand object 3 price float64 4 user_id int64 5 user_session object dtypes: category(1), float64(1), int64(1), object(3) memory usage: 1.9+ GB

Our dataframe is now close to 40% smaller from 32.4MB earlier with the transformations.

#Save work
[jovian] Detected Colab notebook... [jovian] Please enter your API key ( from https://jovian.ai/ ): API KEY: ·········· [jovian] Uploading colab notebook to Jovian... [jovian] Capturing environment.. [jovian] Committed successfully! https://jovian.ai/anushree-k/zerotoanalyst-ecomm-eda

Data pre-processing

Now that we have loaded the data into a Pandas dataframe, let us process the data for the following

  • drop duplicates
  • replace missing values
  • check for outliers

You could analyse the data more to further clean up the data.

Drop duplicates

Print number of unique values for each columns before checking for duplicates. These are fairly time consuming when we run them on the complete data. They are commented out unless required.

#save duplicates for analysis
duplicates_df = ecomm_df[ecomm_df.duplicated()]
#group duplicates by user_id and user_session for further analysis
grp_df = (duplicates_df.groupby( [ "user_id"])[[ "user_id", "user_session"]].size()
                      .sort_values(by= 'group_count',ascending= False))
#Look at the highest number of duplicates.

Sometimes depending on your data and index, pd.duplicated() may recognise unique values as duplicates. It helps to check whether this is happening here. Filtering for duplicates when the dataframe had **event_time as a 'DateTimeand set as the index** returned 14,008,702 rows as duplicates. Withoutevent_time` as index, 30,314 rows were duplicates.

#Check if the data is really duplicated

The timestamp is unique if we consider seconds, but these duplicates need to be dropped to avoid overcounting for number of purchases, price etc. So let us go ahead and drop them.

#drop duplicate rows
ecomm_df.drop_duplicates(keep=False, inplace= True)
#check for missing data
missing_data_pct = ecomm_df.isna().sum().sort_values(ascending=False)/ecomm_df.shape[0]
category_code    34.897250
brand            16.037858
user_session      0.000010
user_id           0.000000
price             0.000000
event_type        0.000000
dtype: float64
#check if we still have any duplicates
Check for outliers
pd.options.display.float_format = "{:.2f}".format

Price is the only column with possible outliers. The max, min and std are seem reasonable.

Find and replace missing values

For category code and brand we take our cue from the data description on kaggle that categories are present for meaningful categories. Replace the missing categories and brans with nocategory and nobrand.

#replace missing data with nocategory
ecomm_df['category_code'].fillna('nocategory', inplace=True)
ecomm_df['brand'].fillna('nobrand', inplace=True)
event_type             3
category_code        127
brand               3415
price              62830
user_id          2798172
user_session     7963579
dtype: int64

We have a surprisingly clean dataset except for the huge number of duplicate rows that we had to drop.

#Save work
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... [jovian] Capturing environment.. [jovian] Committed successfully! https://jovian.ai/anushree-k/zerotoanalyst-ecomm-eda

Exploratory data analysis

We are now ready to dive into exploratory data analysis.

Framework design

In our introduction, we identified some key questions around which we will frame our analysis.

Before we begin to dive deep into programming with Pandas, I highly recommend creating a draft framework. A key lesson learnt and reinforced at Jovian is to begin with the end goal in mind. When you transform data using Pandas with groupby, merge etc if you are you are not very proficient with Pandas, things can get quickly hairy. Early identification of stages of data flow, high level data summaries and columns greatly speeds up the programming to handle large datasets.

  • Raw Data

  • Transformed Data

  • Insights

Here is an illustration:

Pro tip from Aakash N S: Use googlesheets to visualise the intermediate and end results. Here is a sample for the daily_summary_table

This section for summary views is adapted from EDA - Kaggle code by Adil Emre Bilgic

Daily summary tables

Summarise data by day of the month with number of shoppers and number of visits

#Create visits_df for visits with user_id and user_session columns
visits_df = ecomm_df[['user_id','user_session']]
#Create daily_visits_df with number of unique user_ids and user_session values for each day
daily_visits_df = (visits_df.groupby(by= visits_df.index.floor('D')) # Retain date row but floor index to day
                            .agg(number_of_visits= ('user_session',lambda x: x.nunique()), 
                                 number_of_shoppers= ('user_id',lambda x: x.nunique()))
#filter to select only purchase events
is_purchase = ecomm_df['event_type'] == 'purchase'
#Create sales_df
sales_df = ecomm_df[is_purchase][['user_id','event_type','price']]
#Create daily_sales_df using groupby
daily_sales_df = (sales_df.groupby(by= sales_df.index.floor('D')) # Retain date row but floor index to day
                         .agg(number_of_purchases = ('event_type', lambda x: x.count()), 
                              total_sales = ('price', lambda x: x.sum()) ))
#Merge daily_visits_df and daily_sales_df
daily_summary_df = pd.merge(daily_visits_df,daily_sales_df,on='event_time')
##Save as feather
#daily_summary_df.reset_index(inplace = True)
#daily_summary_df.set_index('event_time', inplace=True )
# Write the DataFrame to csv file.
with open('/gdrive/My Drive/eda-ecomm-project/daily_summary_oct.csv', 'w') as f:
Q1. Summary statistics for the month
pd.options.display.float_format = "{:.2f}".format
is_view = ecomm_df['event_type'] == 'view'
number_of_views = ecomm_df[is_view]['user_session'].nunique()
# Set styling for the section
#sns.set_context("poster", font_scale=2) 
Q2. Daily trend of visits, are there any patterns outliers - jumps or dips?
data = pd.Series(daily_summary_df['number_of_visits'].values, index=daily_summary_df.index.day)

matplotlib.rcParams['figure.figsize'] = (15, 5)
clrs = ['sandybrown' if (x < max(data.values)) else 'saddlebrown' for x in data.values ] #CSS colors
fig  = sns.barplot(x = data.index, y = data.values, palette= clrs);
fig.set(xlabel = 'Days', ylabel = 'Number of views', Title='Daily visits during the month');
fig.set_xticklabels([int(x) for x in data.index])
#fig.set(ylim=(0, 18000))
Notebook Image
#day with highest visits
number_of_visits       292196.00
number_of_shoppers     207485.00
number_of_purchases     21771.00
total_sales           6378182.94
Name: 2019-10-20 00:00:00+00:00, dtype: float64

Sample: Most number of views were on 13th October 2019. This is surprising for a Tuesday. A likely explanation could be that there was a mega sale event organised by the store. For example, Amazon had the Great Indian festival on 13 October 2019.

Complete data:

  • Shrinkme witnessed steady shopper visits in Oct’19
  • 257, 263 visits daily on average
  • Most sales were on 20th October which was a Sunday.
Q3. On which day(s) of the week do customers visit more?

fig = sns.boxplot(x=daily_summary_df.index.dayofweek,
            palette= 'YlOrBr', #Color Brewer color schemes
fig.set_xticklabels(labels= ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
fig.set(xlabel = '', ylabel = 'Number of views', Title='Visits by day of the week')
Notebook Image

Sample: Weekends see more number of visits than on the weekdays. The high outliers on Tuesdays and Wednesdays indicate sale days or product offer which could have led to the spike in number of visits.

Complete data:

  • Fridays have the highest number of visits each week
  • 3, October, Thursday had the least number of visits. Either there is missing data or the store had technical issues which led to downtime.
min_visits_day = daily_summary_df[daily_summary_df.number_of_visits == daily_summary_df.number_of_visits.min()]
Q4. Number of shoppers who visit the store daily on an average
fig = sns.violinplot(x=daily_visits_df.index.dayofweek,
            palette= 'YlOrBr', #Color Brewer color schemes
fig.set_xticklabels(labels= ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
fig.set(xlabel = '', ylabel = 'Number of shoppers', Title='Shoppers by day of the week')
Notebook Image

Sample: We see the same pattern as visits here. Weekends see more shoppers than on the weekdays. So let's switch to a violin plot to understand the distribution better.

Shoppers visit the store more often on Fridays than other days of the week as the values are closely clustered around the median.

Complete data

  • Shrinkme shoppers prefer Fridays the most and Mondays the least.
  • Thursdays have the least number of shoppers. This is likely due to the single drop that we saw earlier
Q5. Daily sales and cumulative sales during the momth
data = pd.DataFrame(daily_summary_df['total_sales'])
data['cumulative_sales'] = data['total_sales'].cumsum()
data.index = data.index.day
from matplotlib.ticker import FuncFormatter
def millions(x, pos):
    'The two args are the value and tick position'
    return '{:1.1f}M'.format(x*1e-6)

formatter = FuncFormatter(millions)
ax1 = sns.set_style(style='ticks', rc=None )

fig, ax1 = plt.subplots(figsize=(15,5))
sns.barplot(data = data, x=data.index, y=data.total_sales,color='sandybrown', alpha=0.5, ax=ax1);
ax1.set(xlabel="",ylabel='Daily sales ($)')

ax2 = ax1.twinx()
#resolve twin axis offset issue
data = data.set_index(data.index-1)
sns.lineplot(x= data.index, y = data.cumulative_sales, marker='s', markersize=12, color= 'saddlebrown', sort = False, ax=ax2)
ax2.set(xlabel="",ylabel='Cumulative sales ($)')
#ax2.set_ylim([0, 3000000])
ax2.set(Title='Daily sales and cumulative sales for the month')
sns.despine(right= False, offset= 5, trim= True);
Notebook Image
print("Average daily sales during the month was ", round(daily_summary_df.total_sales.mean(),2))
print("Total sales during the month was ", round(daily_summary_df.total_sales.sum(),2))
Average daily sales during the month was 6355005.92 Total sales during the month was 197005183.64
#day with highest sales
number_of_visits       285193.00
number_of_shoppers     207152.00
number_of_purchases     27304.00
total_sales           8313237.23
Name: 2019-10-16 00:00:00+00:00, dtype: float64
number_of_visits       206070.00
number_of_shoppers     151588.00
number_of_purchases     16744.00
total_sales           5362638.66
Name: 2019-10-03 00:00:00+00:00, dtype: float64

Sample: The store generated \$77 thousand sales on an average daily . Total sales for the month was \$ 2.4 million.

Complete data:

  • The store generated \$6.4 million sales on an average daily.
  • Total sales for the month was \$ 197 million.
  • Shrinkme sold the most on 16 Oct’19 and the least on 3 Oct’19
Q6. Number of products sold during the month
fig = sns.boxplot(x=daily_summary_df.index.dayofweek,
            palette= 'YlOrBr', #Color Brewer color schemes
fig.set_xticklabels(labels= ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'])
fig.set(xlabel = '', ylabel = 'Number of products', Title='Option 2: Products sold by day of the week');
Notebook Image

sample: The store sold 241 items per day on average. The sale was clearly successful. But the day doesn't match as the store sold close to 100 items more than average on a Wednesday. The timezone information is missing in this dataset. So this could be be a spillover from the Tuesday sale.

complete data: The store sold 16,533 products on an average per day.

Q7. Do shoppers prefer a particular window during hour of the day?
#Floor datetime index to the nearest hour and aggregate
hourly_visits_df = (visits_df.groupby(by= visits_df.index.floor('H'))
                            .agg(number_of_visits= ('user_session',lambda x: x.nunique()), 
                                 number_of_shoppers= ('user_id',lambda x: x.nunique()))
#Create a summary for analysis by hour of the weekday
visits_by_hour = (hourly_visits_df.groupby(by= [hourly_visits_df.index.dayofweek, hourly_visits_df.index.hour])
                               .agg(number_of_visits= ('number_of_visits', lambda x: round(x.mean(),0)))
                               .rename_axis(index=['week_day', 'hourofday']) #Rename multi-index
                               .reset_index()) #Reset index level
#Pivot for heatmap
visits_by_hour = visits_by_hour.pivot(index = 'week_day', columns= 'hourofday', values = 'number_of_visits')
weekday_labels = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
hour_labels = ['12:00 AM','01:00 AM', '02:00 AM', '03:00 AM', '04:00 AM', '05:00 AM', 
                '06:00 AM', '07:00 AM', '08:00 AM', '09:00 AM', '10:00 AM', '11:00 AM',
                '12:00 PM','01:00 PM', '02:00 PM', '03:00 PM', '04:00 PM', '05:00 PM',
                '06:00 PM', '07:00 PM', '08:00 PM', '09:00 PM', '10:00 PM', '11:00 PM']
#Customise colors for a more intuitive chart
colors =  ["white", "lightgrey", "saddlebrown"]
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

fig = sns.heatmap(visits_by_hour, 
            cmap = cmap,
            annot = False, 
            robust= True,
            xticklabels= hour_labels,
            yticklabels= weekday_labels
fig.axes.collections[0].colorbar.set_label("Number of visits")
fig.set_xticklabels(fig.get_xticklabels(), rotation=90)
fig.set_title( "Shoppers by hour of the day", pad = 30);        

Notebook Image


  • Data is in UTC , but trends indicate visits from multiple time-zones
  • Most visits occurred in the window of 2:00 P.M to 4:00 P.M UTC
  • This insight can be useful to time product promotions or discount campaigns
#Save work
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... [jovian] Capturing environment.. [jovian] Committed successfully! https://jovian.ai/anushree-k/zerotoanalyst-ecomm-eda
Product summary tables
Brand summary
brand_visits_df = (ecomm_df.groupby(by = 'brand')
                  .agg(number_of_views = ('user_session', lambda x: x.nunique()),
                       number_of_shoppers = ('user_id', lambda x: x.nunique()))
brand_sales_df = (ecomm_df[is_purchase].groupby(by= 'brand')
                                       .agg(total_sales = ('price', lambda x: x.sum()),
                                            avg_sales = ('price', lambda x: x.mean()),
                                            number_of_purchases = ('user_session',lambda x: x.nunique())))
brand_summary_df = pd.merge(brand_visits_df, brand_sales_df, on= 'brand')
brand_summary_df.reset_index(inplace= True)
# Write the DataFrame to csv file.
with open('/gdrive/My Drive/eda-ecomm-project/brand_summary_oct.csv', 'w') as f:
Q8. Most popular brands during the month
top_100_brands_by_views = brand_summary_df[~(brand_summary_df['brand'] == 'nobrand')].sort_values(by='number_of_views', ascending= False).head(100)
brand_wordcloud = top_100_brands_by_views.brand.str.cat(sep=" ")
brand_display_text= brand_wordcloud
matplotlib.rcParams['figure.figsize'] = (15, 8)
# Create the wordcloud object
wordcloud = WordCloud(width=800, height=600, margin=0, background_color='white').generate(brand_display_text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.margins(x=0, y=0)
Notebook Image
top_10_brands_by_views = brand_summary_df[~(brand_summary_df['brand'] == 'nobrand')].sort_values(by='number_of_views', ascending= False).head(10)
matplotlib.rcParams['figure.figsize'] = (15, 5)
fig  = sns.barplot(x = top_10_brands_by_views.number_of_views, y = top_10_brands_by_views.brand);
fig.set( ylabel = 'Brands',
         xlabel = 'Number of views'
fig.axes.xaxis.labelpad = 20
fig.set_title('Most popular brands during the month', pad= 30)

for p in fig.patches:
    width = p.get_width()
    plt.text(5+p.get_width(), p.get_y()+0.55*p.get_height(),
             ha='right', va='center', color= 'white')
Notebook Image


  • Samsung, Apple and Xiaomi, Huawei, Lucente, Bosch were the most popular brands viewed by shoppers
Q9. Most purchased brands during the month
top_10_brands_by_sales = brand_summary_df[~(brand_summary_df['brand'] == 'nobrand')].sort_values(by='total_sales', ascending= False).head(10)
top_10_brands_by_sales['pct_of_monthly_sales'] = top_10_brands_by_sales['total_sales']/daily_summary_df.total_sales.sum()
others_pct = 1- top_10_brands_by_sales.pct_of_monthly_sales.sum()
others_sales = brand_summary_df.total_sales.sum() - top_10_brands_by_sales.total_sales.sum()
pie_data = top_10_brands_by_sales[['brand','total_sales','pct_of_monthly_sales']]
pie_data.loc[len(pie_data.index)]= ["others", others_sales, others_pct]
pie_data.sort_values(by= 'total_sales',ascending= False, inplace= True)
pie_data.reset_index(inplace= True, drop= True)
#legend_labels = pie_data.brand +  pie_data.total_sales.apply(lambda x: ' ${:1.3f}K'.format(x*1e-6))
legend_labels = pie_data.brand +  pie_data.total_sales.apply(lambda x: ' ${:1.1f}M'.format(x*1e-6))
matplotlib.rcParams['figure.figsize'] = (15, 6)
values = pie_data.pct_of_monthly_sales*100
explode = ( 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,0.1)
pie = plt.pie(values, labels= None, textprops={'fontsize': 14}, explode=explode, shadow=True, startangle=30,
autopct='%1.1f%%', pctdistance = 1.3, labeldistance = 1.3)
plt.legend(pie[0], legend_labels, fontsize='16', loc="lower corner",frameon=False, bbox_to_anchor = (1.4,1.025))
plt.subplots_adjust(left=0.1, bottom=0.1, right=0.75)
plt.title('Contribution of brands to monthly sales', pad= 30);
Notebook Image


  • Apple was the highest contributor(47.7%) to total sales in Oct’19 with $93.9 million.
  • Samsung ( 20.1%) followed with $39.5 million of monthly sales
  • Brands not in the top 10 contributed 18.7% with $36.9 million of monthly sales
Category summary
category_visits_df = (ecomm_df.groupby(by= 'category_code')
                              .agg(number_of_shoppers = ('user_id', lambda x: x.nunique()),
                                    number_of_visits = ('user_session', lambda x: x.nunique()))
category_sales_df = (ecomm_df[is_purchase].groupby(by= 'category_code')
                                       .agg(total_sales = ('price', lambda x: x.sum()),
                                            avg_sales = ('price', lambda x: x.mean()),
                                            number_of_purchases = ('user_session',lambda x: x.nunique())))
category_summary_df = pd.merge(category_visits_df,category_sales_df, on= 'category_code' )
category_summary_df.reset_index(inplace= True)
#Split categories
category_summary_df[['main_category', 'sub_category1','sub_category2','sub_category3']] = category_summary_df['category_code'].str.split(".",expand=True)
#category_summary_df.drop(columns= 'category_code', inplace=True)
#category_summary_df.set_index('main_category', inplace= True)
# Write the DataFrame to csv file.
with open('/gdrive/My Drive/eda-ecomm-project/category_summary_oct.csv', 'w') as f:
main_category_summary = (category_summary_df.groupby(by = ['main_category','sub_category1'])
                                          .agg(number_of_visits = ('number_of_visits', lambda x: x.sum()),
                                               total_sales = ('total_sales', lambda x: x.sum()),
                                              num_items_sold = ('number_of_purchases', lambda x: x.sum()))
                                              .sort_values(by= 'number_of_visits', ascending= False))
Q10. Most viewed categories during the month
main_category_summary.reset_index(inplace= True)
views = (pd.pivot_table(main_category_summary, 
                       index= 'main_category', 
                       columns= 'sub_category1', 
                       values= 'number_of_visits', 
                      .sort_values('All', ascending=True)        # sort by row sum
                      .drop('All', axis=1)                        # drop column `All`
                      .sort_values('All', ascending=True, axis=1) # sort by column sum
                      .drop('All'))                               # drop row `All`

Solution to sort values in a pivot table can be found here

matplotlib.rcParams['figure.figsize'] = (15, 5)
fig = views.plot(kind="barh", stacked=True)
fig.set(xlabel = 'number of views', 
        ylabel = 'main categories', 
        Title='Views by product category and subcategories'
Notebook Image


  • Electronics was the most viewed category followed by appliances
Q11. Top 10 categories and sub categories by sales
cat_sales = main_category_summary[['main_category','sub_category1','total_sales']]
cat_sales= cat_sales.head(10).sort_values(by= 'total_sales', ascending= False)
cat_sales['bubble_labels'] = cat_sales['sub_category1'].astype(str) + " " + cat_sales.total_sales.apply(lambda x: '${:1.1f}M'.format(x*1e-6))
matplotlib.rcParams['figure.figsize'] = (15, 5)
cat_size = cat_sales.total_sales*100000/cat_sales.total_sales.sum()
minsize = min(cat_size)*0.8
maxsize = max(cat_size)*0.8

fig = sns.scatterplot(x = 'main_category', # Horizontal axis
                    y = 'sub_category1', # Vertical axis
                    hue= cat_size, 
                    size= cat_size, 
                    sizes=(minsize, maxsize), #Size of bubble based on data size
                    palette= 'bright',
                    data= cat_sales, # Data source
                    legend= False)  
fig.set(xlabel="")  # remove the axis label
fig.axes.get_yaxis().set_visible(False) #turn off y-axis
#fig.axes.set_title('Top 10 categories by sales', pad=30)

#Annotate bubbles with labels
for line in range(0,cat_sales.shape[0]):
     fig.text(cat_sales.main_category[line], cat_sales.sub_category1[line], 
     cat_sales.bubble_labels[line], horizontalalignment='left', 
     size='small', color='black', weight='normal')

Notebook Image


  • Smartphones generated the most sales of $132.8 million, among sub categories
  • Kitchen appliances are the second most sold items by total number of sales with $9.3 million in sales
Q12. Conversion by category

\[conversion\ rate = \frac{number\ of\ purchases}{number\ of\ views}\]

Note: A purchase by the customer is a sale for store.

#Calculate conversion rates
main_category_summary = main_category_summary.groupby('main_category', as_index=False).sum().eval('conv_rate = num_items_sold / number_of_visits')
data = (pd.Series(main_category_summary['conv_rate'].values, 
                .sort_values(axis= 0, ascending= False))
matplotlib.rcParams['figure.figsize'] = (15, 5)
fig  = sns.barplot(x = data.index, y = data.values, palette= 'bright');
fig.set(xlabel = 'Categories', ylabel = 'Converstion rate(%)', Title='Conversion rate by category');
fig.set_xticklabels(fig.get_xticklabels(), rotation=90)
fig.axes.xaxis.labelpad = 30

#add bar labels
for p in fig.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height())
    x = p.get_x() + p.get_width() / 2 - 0.05
    y = p.get_y() + p.get_height()
    fig.axes.annotate(percentage, (x, y), size = 18,
fig.set(ylim=(0, 0.10))
Notebook Image


  • Electronics had the highest conversion rate of 9.1% followed by medicine at 7.7%
  • Apparel had the lowest rate 1.6% among the top 10 categories by conversion rates

Q.Sales by big ticket items

Where is Apple's contribution coming from? Did Apple sell more units or have a high price or was it a combination of both?

For future analysis

#Save work
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... [jovian] Capturing environment.. [jovian] Committed successfully! https://jovian.ai/anushree-k/zerotoanalyst-ecomm-eda
Customer summary tables
grp = ecomm_df.groupby(['user_id', 'event_type'])['user_session'].nunique().reset_index(name='count')

#Create summary of customer events during the month at a customer level
customer_visits_df = pd.DataFrame(grp.pivot('user_id', 'event_type', 'count'))
customer_visits_df.fillna(0, inplace= True)
#Create summary of customer spend during the month
customer_spend_df = (sales_df.groupby(by= 'user_id')
                         .agg( total_spend = ('price', lambda x: x.sum()),
                               avg_spend = ('price', lambda x: x.mean()))
#Convert categorical index to column
customer_visits_df.columns = customer_visits_df.columns.tolist()
customer_visits_df.reset_index(inplace= True)