Focusing on Sales by Month, Clothing Type, and Display Page
The dataset of online store sales of clothing for pregnant women was gathered from https://archive-beta.ics.uci.edu/ml/datasets with filters selected for clustering and regression. Details of the dataset: https://archive.ics.uci.edu/ml/datasets/clickstream+data+for+online+shopping
Citation Request: If you use this dataset, please cite: Łapczyński M., Białowąs S. (2013) Discovering Patterns of Users' Behaviour in an E-shop - Comparison of Consumer Buying Behaviours in Poland and Other European Countries, "Studia Ekonomiczne", nr 151, "La société de l'information : perspective européenne et globale : les usages et les risques d'Internet pour les citoyens et les consommateurs", p. 144-153
Data Set Information: The dataset contains information on clickstream from online store offering clothing for pregnant women. Data are from five months of 2008 and include, among others, product category, location of the photo on the page, country of origin of the IP address and product price in US dollars.
Attribute Information: The dataset contains 14 variables described in a separate file (See 'Data set description')
Data description “e-shop clothing 2008”
Variables:
YEAR (2008)
MONTH -> from April (4) to August (8)
DAY -> day number of the month
ORDER -> sequence of clicks during one session
COUNTRY -> variable indicating the country of origin of the IP address with the following categories: 1-Australia 2-Austria 3-Belgium 4-British Virgin Islands 5-Cayman Islands 6-Christmas Island 7-Croatia 8-Cyprus 9-Czech Republic 10-Denmark 11-Estonia 12-unidentified 13-Faroe Islands 14-Finland 15-France 16-Germany 17-Greece 18-Hungary 19-Iceland 20-India 21-Ireland 22-Italy 23-Latvia 24-Lithuania 25-Luxembourg 26-Mexico 27-Netherlands 28-Norway 29-Poland 30-Portugal 31-Romania 32-Russia 33-San Marino 34-Slovakia 35-Slovenia 36-Spain 37-Sweden 38-Switzerland 39-Ukraine 40-United Arab Emirates 41-United Kingdom 42-USA 43-biz (.biz) 44-com (.com) 45-int (.int) 46-net (.net) 47-org (*.org)
SESSION ID -> variable indicating session id (short record)
PAGE 1 (MAIN CATEGORY) -> concerns the main product category: 1-trousers 2-skirts 3-blouses 4-sale
PAGE 2 (CLOTHING MODEL) -> contains information about the code for each product (217 products)
COLOUR -> colour of product 1-beige 2-black 3-blue 4-brown 5-burgundy 6-gray 7-green 8-navy blue 9-of many colors 10-olive 11-pink 12-red 13-violet 14-white
LOCATION -> photo location on the page, the screen has been divided into six parts: 1-top left 2-top in the middle 3-top right 4-bottom left 5-bottom in the middle 6-bottom right
MODEL PHOTOGRAPHY -> variable with two categories: 1-en face 2-profile
PRICE -> price in US dollars
PRICE 2 -> variable informing whether the price of a particular product is higher than the average price for the entire product category 1-yes 2-no
PAGE -> page number within the e-store website (from 1 to 5)
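The code-to-label mappings above can be collected into plain dictionaries up front, so the decoding applied later in the notebook stays in one place. A minimal sketch on a toy frame (the lowercase column names are assumptions until the real file is loaded):

```python
import pandas as pd

# Lookup tables built from the metadata above
CATEGORY = {1: 'Trousers', 2: 'Skirts', 3: 'Blouses', 4: 'Sale'}
PRICE_GT_AVG = {1: 'Yes', 2: 'No'}
MONTHS = {4: 'April', 5: 'May', 6: 'June', 7: 'July', 8: 'August'}

# Toy frame standing in for the real CSV
toy = pd.DataFrame({'month': [4, 8],
                    'page 1 (main category)': [1, 4],
                    'price 2': [2, 1]})

# A nested dict maps each column to its own code->label table
decoded = toy.replace({'month': MONTHS,
                       'page 1 (main category)': CATEGORY,
                       'price 2': PRICE_GT_AVG})
print(decoded)
```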
Now that we know the metadata for the dataset, let's explore some questions:
# get the data loaded into the notebook
import pandas as pd
import numpy as np
import os
os.chdir(r'C:\Users\madha\Jovian\ZeroToDataAnalyst\zda_projects')
filename = 'e-shop clothing 2008.csv'
df = pd.read_csv(filename, delimiter = ';')
df.head()
# Load all the viz libraries that we might need and set some default parameters for the background
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
df.shape
(165474, 14)
Looking at the initial stats, there are about 165k rows with 14 features in the data.
As one can see, the column names are all over the place (lowercase, and some are extra lengthy).
Given the metadata, let's clean up the names and make them orderly.
#df.columns.str.capitalize()
df1 = df.rename(columns={col: col.capitalize() for col in df.columns})
df1.head(3)
df1 = df1.rename(columns={'Page 1 (main category)':'Category','Page 2 (clothing model)':'Model',
'Location':'Pic_Location', 'Price 2':'Price_GT_Avg', 'Model photography':'Face_Profile'})
df1.head(3)
Change some of the numerical fields to their real values based on the metadata; for now, let's leave the country field as is.
df1['Month'].unique()
array([4, 5, 6, 7, 8], dtype=int64)
df1.Month = df1.Month.replace({4: 'April', 5: 'May', 6: 'June', 7:'July', 8: 'August'})
df1['Month'].unique()
array(['April', 'May', 'June', 'July', 'August'], dtype=object)
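One side effect of replacing month numbers with names is that `value_counts()` orders months by frequency, not chronologically, which scrambles the x-axis of later plots. An ordered `Categorical` keeps things in calendar order; a sketch on toy data:

```python
import pandas as pd

month_order = ['April', 'May', 'June', 'July', 'August']

# Toy month column standing in for eda_df.Month
s = pd.Series(['June', 'April', 'April', 'August'])
s = pd.Series(pd.Categorical(s, categories=month_order, ordered=True))

# With sort=False the counts follow the declared category order,
# and months with zero rows still appear
counts = s.value_counts(sort=False)
print(counts)
```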
1. What type of clothing is sold?
df1.Category.unique()
array([1, 2, 3, 4], dtype=int64)
df1.Category=df1.Category.replace({1: 'Trousers', 2: 'Skirts', 3: 'Blouses', 4: 'Sale'})
df1.Category.unique()
array(['Trousers', 'Skirts', 'Blouses', 'Sale'], dtype=object)
Going by the metadata provided, there are 3 types of clothing and, of course, a sale category for all of them.
df1.Colour = df1.Colour.replace({1:'Beige', 2:'Black', 3:'Blue', 4:'Brown', 5:'Burgundy',
6:'Gray', 7:'Green', 8:'Navy Blue', 9:'Mix', 10:'Olive',
11:'Pink', 12:'Red', 13:'Violet', 14:'White'})
df1.Colour.unique()
#1-beige 2-black 3-blue 4-brown 5-burgundy 6-gray 7-green
#8-navy blue 9-of many colors 10-olive 11-pink 12-red 13-violet 14-white
array(['Beige', 'Olive', 'Gray', 'Brown', 'Burgundy', 'Mix', 'Red',
'Pink', 'Black', 'Blue', 'Violet', 'White', 'Green', 'Navy Blue'],
dtype=object)
df1.Price_GT_Avg = df1.Price_GT_Avg.replace({1:'Yes', 2:'No'})
df1.Price_GT_Avg.unique()
array(['No', 'Yes'], dtype=object)
2. Is there data from all the countries?
df1['Country'].value_counts().count()
47
Looks like there is shopping data from all over the world - at least 47 "countries".
From the metadata, the country field also includes values for portal domains (.biz, .org, .com, etc.). Let's take those rows out of the population; there are about 4,000 of them out of 165k, which won't be an issue for the exploration.
df1['Country'].isin([43,44,45,46,47]).value_counts()
False 161523
True 3951
Name: Country, dtype: int64
df1 = df1[~df1['Country'].isin([43,44,45,46,47])]
df1.shape
(161523, 14)
Now that the dataset is in a good state with all data labels defined, let's pick the columns we will use for EDA and take the rest of the columns out of the dataset.
df1.columns
Index(['Year', 'Month', 'Day', 'Order', 'Country', 'Session id', 'Category',
'Model', 'Colour', 'Pic_Location', 'Face_Profile', 'Price',
'Price_GT_Avg', 'Page'],
dtype='object')
eda_df = df1[['Year', 'Month', 'Day', 'Country', 'Category', 'Colour', 'Price', 'Price_GT_Avg', 'Page']]
eda_df.shape
(161523, 9)
eda_df.isnull().sum()
Year 0
Month 0
Day 0
Country 0
Category 0
Colour 0
Price 0
Price_GT_Avg 0
Page 0
dtype: int64
eda_df.head()
3. Is the data complete for the period provided?
Exploring the totals of monthly sales.
It is observed that the data for August is not complete - only partial data is available.
There are more sales in April (pre-summer) than in June (post-summer) in most European countries.
Month_Sales = eda_df.Month.value_counts()
fig, ax = plt.subplots(figsize=[12, 6])
sns.barplot(x=Month_Sales.index, y=Month_Sales, ax=ax);
ax.set_title('Number of Sales per Month', fontsize = 18)
ax.set_ylabel('Number of Sales',fontsize = 12)
ax.set_xlabel('Month',fontsize = 12)
plt.show()
Looks like the data is not complete for the month of August. There also seems to be higher-than-normal volume in April.
Let's see how the distributions look for each Category.
#Split the data into category datasets
T_df = eda_df[eda_df.Category == 'Trousers']
S_df = eda_df[eda_df.Category == 'Skirts']
B_df = eda_df[eda_df.Category == 'Blouses']
plt.hist(T_df.Day, alpha=0.4, bins=np.arange(1, 32, 1));  # edges 1..31 so day 31 is included
plt.title('Distribution of Trousers Sales by Day of the Month')
plt.show()
# stacked hist plot of sales per day of the month for each Category
plt.title('Distribution of Category Sales for each Day')
plt.hist([T_df.Day, S_df.Day, B_df.Day],
bins=np.arange(1, 32, 1),
stacked=True);
plt.legend(['Trousers', 'Skirts', 'Blouses']);
Let's do some analysis of the sales by day of the month.
4. Is there a peak for sales from the given months?
### Dataframes for each month
aug = eda_df.loc[eda_df['Month'] == 'August']
jun = eda_df.loc[eda_df['Month'] == 'June']
jul = eda_df.loc[eda_df['Month'] == 'July']
may = eda_df.loc[eda_df['Month'] == 'May']
apr = eda_df.loc[eda_df['Month'] == 'April']
fig, axs = plt.subplots(nrows=3, ncols = 2, figsize=[14,18])
axs[0,0].bar(apr.Day.value_counts().keys(), apr.Day.value_counts().values, color='blue')
axs[0,0].set_title('Sales per Day in April', fontsize=18)
axs[0,1].bar(may.Day.value_counts().keys(), may.Day.value_counts().values, color='green')
axs[0,1].set_title('Sales per Day in May', fontsize=18)
axs[1,0].bar(jun.Day.value_counts().keys(), jun.Day.value_counts().values, color='brown')
axs[1,0].set_title('Sales per Day in June', fontsize=18)
axs[1,1].bar(jul.Day.value_counts().keys(), jul.Day.value_counts().values, color='gray')
axs[1,1].set_title('Sales per Day in July', fontsize=18)
axs[2,0].bar(aug.Day.value_counts().keys(), aug.Day.value_counts().values, color='teal')
axs[2,0].set_title('Sales per Day in Aug', fontsize=18)
axs[2,1].bar(eda_df.Day.value_counts().keys(), eda_df.Day.value_counts().values, color='lightblue')
axs[2,1].set_title('Sales per Day for all Months', fontsize=18)
plt.show()
Interesting Patterns of Data:
Couple of outliers at the beginning of April - probably delayed finalization of sales from the Easter holiday, maybe? Easter was on Mar 23 in 2008. \
Another interesting pattern is that sales are usually up on Thu/Fri/Sat and lowest Mon-Wed. For example, Apr 1, 2008 was a Tuesday, so one can see the pattern of sales peaking before the Sunday of each week.\
Finally, the April outliers (maybe overflow from March) seem to have added to the 'Sales per Day for all Months' panel.
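The weekday hypothesis above can be checked directly instead of counting from a calendar: rebuilding a datetime from the Year/Month/Day columns gives a weekday for every row. A sketch on toy rows (the real `eda_df` stores month names after the earlier replace, so they are mapped back to numbers first):

```python
import pandas as pd

month_num = {'April': 4, 'May': 5, 'June': 6, 'July': 7, 'August': 8}

# Toy rows standing in for eda_df[['Year', 'Month', 'Day']]
toy = pd.DataFrame({'Year': [2008, 2008, 2008],
                    'Month': ['April', 'April', 'August'],
                    'Day': [1, 5, 13]})

# pd.to_datetime accepts a dict-like with 'year'/'month'/'day' keys
dates = pd.to_datetime({'year': toy['Year'],
                        'month': toy['Month'].map(month_num),
                        'day': toy['Day']})
print(dates.dt.day_name())
```

On the full dataset, `dates.dt.day_name().value_counts()` would give the per-weekday click totals behind the Thu-Sat claim.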
Variation of Dress pricing over Months:
Let's look to see whether clothing prices are cheaper in the summer vacation months, or pricier.
sns.lineplot(data = eda_df, x='Month', y='Price', hue='Category',
style = 'Category',markers=True, dashes=False).set_title('Monthly Price of Dress Category', fontsize = 18)
plt.show()
Observation:
The average price of every clothing category seems to decrease from April to June and then increase through August.
Could it be that buyers are more inclined to purchase clothes for the summer season?
Monthly % of Total Pie of Sales:
A pie plot to show the % of revenue by Month.
pie_rev = eda_df.groupby('Month')['Price'].sum()
#using matplotlib
pie, ax = plt.subplots(figsize=[10,7])
labels = pie_rev.keys()
plt.pie(x=pie_rev, autopct="%.1f%%", explode=[0.02]*5, labels=labels, textprops={'fontsize': 13}, pctdistance=0.75)
plt.title("Overall Total Revenue by Month", fontsize=16);
Looks like revenue peaked from the beginning of April through June, with a slight uptick in July. Let's see how it looks when we take out August, which is partial.
Monthly % of Total Pie of Sales without August (incomplete data)
A pie plot to show the % of revenue by Month, without August.
pie_rev_no_aug = eda_df[eda_df['Month']!='August'].groupby('Month')['Price'].sum()
#using matplotlib
pie, ax = plt.subplots(figsize=[10,7])
labels = pie_rev_no_aug.keys()
plt.pie(x=pie_rev_no_aug, autopct="%.1f%%", explode=[0.02]*4, labels=labels,
textprops={'fontsize': 13}, pctdistance=0.75)
plt.title("Overall Total Revenue by Month without August", fontsize=16);
Looks like revenue peaked in April (people getting ready for spring/summer holidays) and remained consistent through the rest of the months.
5. What kind of clothing is purchased more?
6. Is there an effect on sales for high-priced vs. low-priced items?
pie_categ= eda_df.groupby(['Category','Price_GT_Avg'])['Price'].sum().unstack('Price_GT_Avg')
pie_categ.plot.bar(figsize=[8,6])
plt.title('Type of Dress Sold by Over/Under Avg Price of Dress', fontsize=14)
plt.xlabel('Category',fontsize=12)
plt.ylabel('Sales',fontsize=12)
plt.show()
Looks like, other than Trousers, all the other dress styles are more often sold above the average price for their category.
Probably because trousers aren't a preferred style of dress during pregnancy.
7. Is there any effect of pricing on which page the product is shown?
8. What are the sales by page?
page_price = eda_df[['Price', 'Page']]
page_price.corr()
Looks as though there is a negative correlation between price and page, i.e., products probably get cheaper as the page number increases. Conversely, there may not be as many items on page 5 as on page 1.
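Since `Page` is an ordinal label rather than a true numeric measurement, a rank-based correlation is arguably a safer check than the default Pearson; pandas exposes this through the `method` argument of `DataFrame.corr`. A sketch on toy rows (the prices are made up for illustration):

```python
import pandas as pd

# Toy price-per-page rows standing in for page_price
toy = pd.DataFrame({'Price': [52, 48, 35, 40, 41],
                    'Page': [1, 2, 3, 4, 5]})

# Pearson measures linear association; Spearman only uses ranks
pearson = toy.corr(method='pearson').loc['Price', 'Page']
spearman = toy.corr(method='spearman').loc['Price', 'Page']
print(pearson, spearman)
```

If both coefficients come out negative on the real data, the "cheaper on later pages" reading is more robust.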
page_price_avg = page_price.groupby('Page').mean()
page_price_tot = page_price.groupby('Page').sum()
page_price_count = page_price.Page.value_counts()
page_price_eda = page_price_avg.copy()  # copy so the averages frame is not mutated by the columns below
page_price_eda['Totals_page'] = page_price_tot['Price']
page_price_eda['Counts_page'] = page_price_count
page_price_eda = page_price_eda.rename(columns={'Price': 'Average'})
page_price_eda, page_price_eda.index
( Average Totals_page Counts_page
Page
1 44.731479 4062334 90816
2 45.831134 1844199 40239
3 36.868446 698952 18958
4 40.099301 350107 8731
5 41.475351 115260 2779,
Int64Index([1, 2, 3, 4, 5], dtype='int64', name='Page'))
fig, axs = plt.subplots(nrows=3, ncols = 1, figsize=[14, 10])
axs[0].bar(page_price_eda.index, page_price_eda.Average)
axs[0].set_title('Average Price per Item on Page', fontsize=12)
axs[0].set_ylabel('USD')
axs[1].bar(page_price_eda.index, page_price_eda.Totals_page)
axs[1].set_title('Total Dollars Sold per Page', fontsize=12)
axs[1].set_ylabel('USD')
axs[1].ticklabel_format(useOffset=False, style='plain')
axs[2].bar(page_price_eda.index, page_price_eda.Counts_page)
axs[2].set_title('Number of Sales per Page', fontsize=12)
axs[2].set_ylabel('Articles Sold')
for ax in axs:
ax.yaxis.grid(False)
ax.spines[['right', 'top']].set_visible(False)
plt.show()
Observations
Pages 1 and 2 have a higher average price than the other pages; Pages 3 and 4 are lower, with Page 3 the lowest; Page 5 sits in the middle of the pack.
Maybe Page 3 has more sale items?
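That guess can be tested by looking at the category mix per page, e.g. the share of 'Sale' rows on each page. A sketch on toy rows (on the real data the same groupby would run on `eda_df`, which keeps both `Page` and `Category`):

```python
import pandas as pd

# Toy rows standing in for eda_df[['Page', 'Category']]
toy = pd.DataFrame({'Page': [1, 1, 3, 3, 3, 5],
                    'Category': ['Trousers', 'Sale', 'Sale', 'Sale', 'Blouses', 'Skirts']})

# Fraction of rows on each page that belong to the Sale category
sale_share = (toy['Category'] == 'Sale').groupby(toy['Page']).mean()
print(sale_share)
```

A noticeably higher share for Page 3 would support the low-average-price observation above.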
Sales peak in April and decline slightly in the following months; the data for August is incomplete.\
Overall, sales in each month seem to peak before the weekend - likely Thu through Sat - and are lowest on Sun and Mon.\
Skirts have the highest average price of all goods; Trousers are the least preferred. Sale items bring in the least money but generate more sales than either blouses or skirts. \
Pages 1 and 2 have the highest-priced items, Pages 3 and 4 the lowest average, and Page 5 is in the middle of the pack.
import jovian
jovian.submit(assignment="zerotodatascience-project2")