Clickstream Sales Analysis - Online shopping

Focusing on Sales by Month, Clothing Type, and Display Page

Dataset - ML datasets from UCI (Univ of CA at Irvine)

Dataset for online store sales for pregnant woman clothing is gatehered from with selected filters on clustering and regression. Details of the dataset:

Citation Request: If you use this dataset, please cite: Łapczyński M., Białowąs S. (2013) Discovering Patterns of Users' Behaviour in an E-shop - Comparison of Consumer Buying Behaviours in Poland and Other European Countries, “Studia Ekonomiczne”, nr 151, “La société de l'information : perspective européenne et globale : les usages et les risques d'Internet pour les citoyens et les consommateurs”, p. 144-153

Data Set Information: The dataset contains information on clickstream from online store offering clothing for pregnant women. Data are from five months of 2008 and include, among others, product category, location of the photo on the page, country of origin of the IP address and product price in US dollars.

Attribute Information: The dataset contains 14 variables described in a separate file (See 'Data set description')

Let's Explore

Get to know Metadata

Exploring possibilities

Data Cleanup

Plots to uncover the plots in data


Further Review


Get to know Metadata

Data description “e-shop clothing 2008”


  1. YEAR (2008)

  2. MONTH -> from April (4) to August (8)

  3. DAY -> day number of the month

  4. ORDER -> sequence of clicks during one session

  5. COUNTRY -> variable indicating the country of origin of the IP address with the following categories: 1-Australia 2-Austria 3-Belgium 4-British Virgin Islands 5-Cayman Islands 6-Christmas Island 7-Croatia 8-Cyprus 9-Czech Republic 10-Denmark 11-Estonia 12-unidentified 13-Faroe Islands 14-Finland 15-France 16-Germany 17-Greece 18-Hungary 19-Iceland 20-India 21-Ireland 22-Italy 23-Latvia 24-Lithuania 25-Luxembourg 26-Mexico 27-Netherlands 28-Norway 29-Poland 30-Portugal 31-Romania 32-Russia 33-San Marino 34-Slovakia 35-Slovenia 36-Spain 37-Sweden 38-Switzerland 39-Ukraine 40-United Arab Emirates 41-United Kingdom 42-USA 43-biz (.biz) 44-com (.com) 45-int (.int) 46-net (.net) 47-org (*.org)

  6. SESSION ID -> variable indicating session id (short record)

  7. PAGE 1 (MAIN CATEGORY) -> concerns the main product category: 1-trousers 2-skirts 3-blouses 4-sale

  8. PAGE 2 (CLOTHING MODEL) -> contains information about the code for each product (217 products)

  9. COLOUR -> colour of product 1-beige 2-black 3-blue 4-brown 5-burgundy 6-gray 7-green 8-navy blue 9-of many colors 10-olive 11-pink 12-red 13-violet 14-white

  10. LOCATION -> photo location on the page, the screen has been divided into six parts: 1-top left 2-top in the middle 3-top right 4-bottom left 5-bottom in the middle 6-bottom right

  11. MODEL PHOTOGRAPHY -> variable with two categories: 1-en face 2-profile

  12. PRICE -> price in US dollars

  13. PRICE 2 -> variable informing whether the price of a particular product is higher than the average price for the entire product category 1-yes 2-no

  14. PAGE -> page number within the e-store website (from 1 to 5)

Exploring Possibilities

Now that we know the metadata for the dataset, lets explore some of quesions:

  1. What type of clothing is sold?
  2. Is there data from all the countries?
  3. Is the data complete for the period provided?
  4. Is there a peak for sales from the given months?
  5. what kind of clothing is purchased more?
  6. Is there a effect on sales for high priced and low prices items?
  7. Is there any effect of pricing on which page the product is shown?
  8. what are the sales by page?
# get the data loaded into the notebook
import pandas as pd
import numpy as np
import os
filename = 'e-shop clothing 2008.csv'
df = pd.read_csv(filename, delimiter = ';')
# Load all the viz libraries that we might need and set some default parameters for backgorund
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
lets look at the some basic stats of the dataset
(165474, 14)
details about the data

Looking at the initial stats there are about 165k rows with 14 features in the data.
As one can see that column names are all over the place (small and some extra lengthy names).
Given the metadata lets clean up and fix the names orderly

df1 = df.rename(columns={colname:colname.capitalize() for colname in df.axes[1]})

Data Clean-Up

df1 = df1.rename(columns={'Page 1 (main category)':'Category','Page 2 (clothing model)':'Model', 
                        'Location':'Pic_Location', 'Price 2':'Price_GT_Avg', 'Model photography':'Face_Profile'})

Change some of the numerical fields to thier real values based on metadata, for now lets leave the country field as is

array([4, 5, 6, 7, 8], dtype=int64)
df1.Month = df1.Month.replace({4: 'April', 5: 'May', 6: 'June', 7:'July', 8: 'August'})
array(['April', 'May', 'June', 'July', 'August'], dtype=object)

1. What type of clothing is sold?

array(['Trousers', 'Skirts', 'Blouses', 'Sale'], dtype=object)
df1.Category=df1.Category.replace({1: 'Trousers', 2: 'Skirts', 3: 'Blouses', 4: 'Sale'})
array(['Trousers', 'Skirts', 'Blouses', 'Sale'], dtype=object)

Going by the metadata provided there are 3 types of clothing and ofcourse a sale category for all

df1.Colour = df1.Colour.replace({1:'Beige', 2:'Black', 3:'Blue', 4:'Brown', 5:'Burgundy',
                                 6:'Gray', 7:'Green', 8:'Navy Blue', 9:'Mix', 10:'Olive', 
                                 11:'Pink', 12:'Red', 13:'Violet', 14:'White'})
#1-beige 2-black 3-blue 4-brown 5-burgundy 6-gray 7-green 
#8-navy blue 9-of many colors 10-olive 11-pink 12-red 13-violet 14-white
array(['Beige', 'Olive', 'Gray', 'Brown', 'Burgundy', 'Mix', 'Red',
       'Pink', 'Black', 'Blue', 'Violet', 'White', 'Green', 'Navy Blue'],
df1.Price_GT_Avg = df1.Price_GT_Avg.replace({1:'Yes', 2:'No'})
array(['No', 'Yes'], dtype=object)

2. Is there data from all the countries?


looks like there is data from all over the world - there are atleast 47 countries shopping data

From the given Metadata for the country there are values for portal sites (.biz, .org, .com.. etc) Lets take out those values from the population there are bout 4000 of those from 165K which wont be an issue for exploration.

False    161523
True       3951
Name: Country, dtype: int64
df1 = df1[~df1['Country'].isin([43,44,45,46,47])]
(161523, 14)

Plots to uncover plots in Data

Now that the dataset is in a good state with all data labels defined in the data, lets look at some of the columns that we will use for EDA and take out the rest of the columns from the dataset

Index(['Year', 'Month', 'Day', 'Order', 'Country', 'Session id', 'Category',
       'Model', 'Colour', 'Pic_Location', 'Face_Profile', 'Price',
       'Price_GT_Avg', 'Page'],
eda_df = df1[['Year', 'Month', 'Day', 'Country', 'Category', 'Colour', 'Price', 'Price_GT_Avg', 'Page']]
(161523, 9)
Year            0
Month           0
Day             0
Country         0
Category        0
Colour          0
Price           0
Price_GT_Avg    0
Page            0
dtype: int64

3. Is the data complete for the period provided?
Exploring the totals by Monthly sales
Its observed that the data for Aug is not complete - only partial data is available.
There are more sales in April (pre-summer) than June (post Summer) in most of the Eurpean countries.

Month_Sales = eda_df.Month.value_counts()
fig, ax = plt.subplots(figsize = [12,6])
sns.barplot(x=Month_Sales.index, y=Month_Sales);
ax.set_title('Number of Sales per Month', fontsize = 18)
ax.set_ylabel('Number of Sales',fontsize = 12)
ax.set_xlabel('Month',fontsize = 12)
Notebook Image

Looks like the data is not complete for month of Aug. There seems to be more than normal volume for April

Lets see how the distributions are for each Category

#Split the data into category datasets
T_df = eda_df[eda_df.Category == 'Trousers']
S_df = eda_df[eda_df.Category == 'Skirts']
B_df = eda_df[eda_df.Category == 'Blouses']
plt.hist(T_df.Day, alpha=0.4, bins=np.arange(1,31,1));
plt.title('Distribution of Sales in June')
Notebook Image
#stacked hist plot for Categories by Price and Sales.
plt.title('Distribution of Category Sales for each Day')

plt.hist([T_df.Day, S_df.Day, B_df.Day], 

plt.legend(['Trousers', 'Skirts', 'Blouses']);
Notebook Image

Lets do some analysis on the sales by day of the month

4. Is there a peak for sales from the given months?

### Dataframes for each month
aug = eda_df.loc[eda_df['Month'] == 'August']
jun = eda_df.loc[eda_df['Month'] == 'June']
jul = eda_df.loc[eda_df['Month'] == 'July']
may = eda_df.loc[eda_df['Month'] == 'May']
apr = eda_df.loc[eda_df['Month'] == 'April']

fig, axs = plt.subplots(nrows=3, ncols = 2, figsize=[14,18])

axs[0,0].bar(apr.Day.value_counts().keys(), apr.Day.value_counts().values, color='blue')
axs[0,0].set_title('Sales per Day in April', fontsize=18)

axs[0,1].bar(may.Day.value_counts().keys(), may.Day.value_counts().values, color='green')
axs[0,1].set_title('Sales per Day in May', fontsize=18)

axs[1,0].bar(jun.Day.value_counts().keys(), jun.Day.value_counts().values, color='brown')
axs[1,0].set_title('Sales per Day in June', fontsize=18)

axs[1,1].bar(jul.Day.value_counts().keys(), jul.Day.value_counts().values, color='gray')
axs[1,1].set_title('Sales per Day in July', fontsize=18)

axs[2,0].bar(aug.Day.value_counts().keys(), aug.Day.value_counts().values, color='teal')
axs[2,0].set_title('Sales per Day in Aug', fontsize=18)

axs[2,1].bar(eda_df.Day.value_counts().keys(), eda_df.Day.value_counts().values, color='lightblue')
axs[2,1].set_title('Sales per Day for all Months', fontsize=18)
Notebook Image

Interesting Patterns of Data:
Couple of outliers in the beginning of April - probably because of Easter Holiday delays of finalization of sale, maybe? As Easter was on Mar 23 of 2008. \

Another interesting pattern noted is that the sales are usually up on Thu/Fri/Sat and lowest on Mon-Wed. For example 2008 Apr 1 was a Tue and so one can see the pattern of sales before Sun of each week.\

Finally the outliers of Apr (maybe overflown from Mar) seems to have added to the data in Sales per Day for all Months.

Variation of Dress pricing over Months:
Lets look to see if the price of month is cheaper in summer vacation month or pricier

sns.lineplot(data = eda_df, x='Month', y='Price', hue='Category', 
             style = 'Category',markers=True, dashes=False).set_title('Monthly Price of Dress Category', fontsize = 18)
Notebook Image

Observation: Not only the dress category but also the sale price of the clothing seems to be reducing from Aril to June and increating through August.
Is it possible that there is more inclined to purchase clothes for summer season?

Monthly % of Total Pie of Sales:
a pie plot to show % of sales by Month

pie_rev = eda_df.groupby('Month')['Price'].sum()
#using matplotlib
pie, ax = plt.subplots(figsize=[10,7])
labels = pie_rev.keys()
plt.pie(x=pie_rev, autopct="%.1f%%", explode=[0.02]*5, labels=labels, textprops={'fontsize': 13}, pctdistance=0.75)
plt.title("Overall Total Revenue by Month", fontsize=16);
Notebook Image

Looks like Sales peaked beginning of Apr Thru June and slight tick up in JUl. Lets see how it looks when we take out the Aug which is partial

Monthly % of Total Pie of Sales without August (incomplete data)
a pie plot to show % of sales by Month without August

pie_rev_no_aug =  eda_df[eda_df['Month']!='August'].groupby('Month')['Price'].sum()
#using matplotlib
pie, ax = plt.subplots(figsize=[10,7])
labels = pie_rev_no_aug.keys()
plt.pie(x=pie_rev_no_aug, autopct="%.1f%%", explode=[0.02]*4, labels=labels, 
        textprops={'fontsize': 13}, pctdistance=0.75)
plt.title("Overall Total Revenue by Month without August", fontsize=16);
Notebook Image

Looks like Sales peaked in month of Apr (people getting ready for spring/summer holidays) and remained consistent through rest of the months

5. what kind of clothing is purchased more?

6. Is there a effect on sales for high priced and low prices items?

pie_categ= eda_df.groupby(['Category','Price_GT_Avg'])['Price'].sum().unstack('Price_GT_Avg')[8,6])
plt.title('Type of Dress Sold by Over/Under Avg Price of Dress', fontsize=14)
Notebook Image

Looks like other than Trousers, all the other dress style are more often sold over the average price of the dress*
Probably because Trousers arent a preferable dress during pregnancy.

Lets see some significance of price and page

7. Is there any effect of pricing on which page the product is shown?

8. what are the sales by page?

page_price = eda_df[['Price', 'Page']]

Looks as though there is a reverse correlation between price and page, i.e., price is probably less pricier as the page numbers increaed. Conversely, there may not be as many items in page 5 as in page 1.

page_price_avg = page_price.groupby('Page').mean()
page_price_tot = page_price.groupby('Page').sum()
page_price_count = page_price.Page.value_counts()
page_price_eda = page_price_avg
page_price_eda['Totals_page'] = page_price_tot
page_price_eda['Counts_page'] = page_price_count
page_price_eda = page_price_eda.rename(columns={'Price':'Average'})
page_price_eda, page_price_eda.index
(        Average  Totals_page  Counts_page
 1     44.731479      4062334        90816
 2     45.831134      1844199        40239
 3     36.868446       698952        18958
 4     40.099301       350107         8731
 5     41.475351       115260         2779,
 Int64Index([1, 2, 3, 4, 5], dtype='int64', name='Page'))
fig, axs = plt.subplots(nrows=3, ncols = 1, figsize=[14, 10])

axs[0].bar(page_price_eda.index, page_price_eda.Average)
axs[0].set_title('Average Price per Item on Page', fontsize=12)

axs[1].bar(page_price_eda.index, page_price_eda.Totals_page)
axs[1].set_title('Total Dollars Sold per Page', fontsize=12)
axs[1].ticklabel_format(useOffset=False, style='plain')

axs[2].bar(page_price_eda.index, page_price_eda.Counts_page)
axs[2].set_title('Number of Sales per Page', fontsize=12)
axs[2].set_ylabel('Articles Sold')

for ax in axs:
    ax.spines[['right', 'top']].set_visible(False)
Notebook Image

Page 1 and 2 seems to have higher avg than other pages and Page 3 & 4 has lower and 3 being the lowest. Page 5 seems to have middle of the pack average price. pages 5 is the med

Maybe Page 3 has more sale items?


Sales peak in April, and decline slightly in following months; the data for August is incomplete.\

Overall, Sales in each month are seeming to peak begfore the weekend - likely from Thu to Sat and lowest on Sun and Mon.

Skirts have the highest average price of all goods, Trousers are least preferable. Sales items bring in the least money, but generate more sales than either blouses or skirts. \

Pages one and two have the highest priced items, Page 3 & 4 las least average and page 5 is in the middle pf the pack.

Further Review:

  1. Page wise analysis of sales and which category has more sales?
  2. Any type of correlations to sale price, page, price, category and adding items sold for greater than Avg price?




Madhav Muni6 months ago
Sign In