Exercise 3 - Multiple Linear Regression

From the previous exercise, we know that customers are happier with chocolate bars that are large and have high amounts of cocoa. Customers may feel differently when they have to pay for these bars, though.

In this exercise, we will try to find the chocolate bar that best suits customers, taking into account the cocoa content, size, and price.

Step 1

First, let's have a look at our data.

The data is from a survey of how happy customers were with chocolate bars they purchased.

Replace `<printDataHere>` with `print(dataset.head())` below, and run the code.

# This sets up the graphing configuration
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as graph
%matplotlib inline
graph.rcParams['figure.figsize'] = (15,5)
graph.rcParams["font.family"] = 'DejaVu Sans'
graph.rcParams["font.size"] = '12'
import pandas as pd
import statsmodels.formula.api as smf

# Imports our new data set!
dataset = pd.read_csv('Data/chocolate data multiple linear regression.txt', index_col=False, sep="\t", header=0)

print(dataset.head())

   weight  cocoa_percent   cost  customer_happiness
0     247           0.11   0.25                  29
1     192           0.82  10.44                  29
2     106           0.01   0.00                   6
3      78           0.04   0.01                   4
4     213           0.39   2.56                  30

Step 2

Previously we found that customers like a high percentage of cocoa and heavier bars of chocolate. Large bars of chocolate cost more money, though, which might make customers less inclined to purchase them.

Let's perform a simple linear regression to see how customer happiness relates to chocolate bar weight in this survey, where customers took the cost of the chocolate into consideration.
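For intuition about what the regression in the next cell estimates: the formula `customer_happiness ~ weight` asks for the straight line `happiness = intercept + slope * weight` that minimises the squared vertical distances to the data points. The sketch below computes that line by hand in plain Python. It uses only the five rows printed in Step 1 so that it runs on its own, and the function name `fit_line` is made up for this illustration. On the full dataset, statsmodels does this (and much more) for you.

```python
# Illustration only: the five rows shown by dataset.head() in Step 1.
weight = [247, 192, 106, 78, 213]
customer_happiness = [29, 29, 6, 4, 30]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # How x and y vary together, and how x varies on its own
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The least-squares line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(weight, customer_happiness)
print("intercept:", round(intercept, 2), "slope:", round(slope, 3))
```

Five rows are far too few to draw conclusions from; the numbers printed here only show the mechanics and will differ from the fit on the whole survey.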

In the cell below, find the text `<addFeatureHere>`, replace it with `weight`, and run the code.

###
# REPLACE <addFeatureHere> BELOW WITH weight
###
formula = 'customer_happiness ~ weight'
###

# This performs linear regression
lm = smf.ols(formula = formula, data = dataset).fit()

# The feature name is the last word of the formula (e.g. 'weight')
feature = formula.split(" ")[-1]

# Get the data for the x parameter (the feature)
x = dataset[feature]

# This makes and shows a graph
# (parameters are looked up by name rather than by position)
intercept = lm.params['Intercept']
slope = lm.params[feature]
line = slope * x + intercept
graph.plot(x, line, '-', c = 'red')
graph.scatter(x, dataset.customer_happiness)
graph.ylabel('Customer Happiness')
graph.xlabel(feature)
graph.show()

Customer happiness still increases with larger bars of chocolate. However, many data points (blue) are a long way from our trendline (red). This means that this line doesn't describe the data very well. It is likely that there are other features of the chocolate that are influencing customer happiness.
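One common way to put a number on how well a line describes the data is R-squared: the share of the variation in the outcome that the line accounts for, from 0 (none) to 1 (all). A fitted statsmodels result exposes it as `lm.rsquared`. The sketch below shows the calculation behind that number in plain Python. The helper name `r_squared` and the toy values are invented for illustration and are not the survey data.

```python
def r_squared(x, y, intercept, slope):
    """Share of the variation in y explained by the line intercept + slope * x."""
    mean_y = sum(y) / len(y)
    # Residuals: vertical distances between each point and the line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation of y around its mean
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Made-up toy values (NOT the survey data)
x = [1, 2, 3, 4]
on_the_line = [2, 4, 6, 8]   # every point sits exactly on y = 2 * x
scattered = [5, 1, 9, 5]     # least-squares line is y = 3 + 0.8 * x

tight_fit = r_squared(x, on_the_line, 0, 2)
loose_fit = r_squared(x, scattered, 3, 0.8)
print("tight fit:", round(tight_fit, 2))
print("loose fit:", round(loose_fit, 2))
```

When points sit far from the trendline, as in the graph above, the residuals are large and R-squared drops towards 0.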

Repeat the above exercise with `cocoa_percent` in place of `weight`, and run the code again. You should see a similar trend.
