Clustering

When a dataset doesn’t have labels, we can use unsupervised learning to find structure in the data, allowing us to discover patterns or groupings.

Cluster analysis is a method of finding groupings, known as clusters, in datasets. Because the datasets are unlabelled, cluster analysis tries to group similar examples using the examples' features.

K-means clustering lives up to its name: it separates examples into k clusters (so if k is 5, it will divide the examples into 5 clusters), and it assigns each example to the cluster whose average (mean) is closest to it.
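
To make the idea concrete, here is a minimal sketch of the k-means loop written with NumPy. It is for illustration only and is not the implementation scikit-learn uses; the function name `kmeans_sketch`, its parameters, and the toy data are all hypothetical. It assumes we start from k randomly chosen examples and repeat the assign-then-average steps a fixed number of times.

```python
import numpy as np

def kmeans_sketch(points, k, n_iterations=10, seed=0):
    """Toy k-means (hypothetical helper): returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: use k distinct examples, picked at random, as the starting means
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iterations):
        # Distance from every example to every centroid, shape (n_examples, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each example to the cluster whose mean is closest
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the examples assigned to it
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data: two tight groups of three examples each
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_centroids, toy_labels = kmeans_sketch(toy_points, k=2)
print(toy_centroids)
print(toy_labels)
```

The first three examples should end up sharing one label and the last three the other. Which group gets label 0 depends on the random starting centroids.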

Step 1

In this exercise we will look at using k-means clustering to categorise a few different datasets.

Let's start by creating three clusters.

Run the code below to set up the graphing features.

# This sets up the graphs
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as graph
%matplotlib inline
graph.rcParams['figure.figsize'] = (15,5)
graph.rcParams["font.family"] = 'DejaVu Sans'
graph.rcParams["font.size"] = '12'
graph.rcParams['image.cmap'] = 'rainbow'

In the cell below replace:

1. `<addClusterData>` with `cluster_data`

2. `<addOutput>` with `output`

and then run the code.

# Let's make some data!
import numpy as np
from sklearn import datasets

###
# REPLACE <addClusterData> WITH cluster_data AND <addOutput> WITH output
###
cluster_data, output = datasets.make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, n_repeated=0,
    n_classes=3, n_clusters_per_class=1, class_sep=1.25, random_state=6)
###

# Let's visualise it: feature 0 on the x-axis, feature 1 on the y-axis
graph.scatter(cluster_data.T[0], cluster_data.T[1])
graph.show()
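
Before clustering, it can help to check what the generator returned. The sketch below repeats the same `make_classification` call and prints the shapes of the two results. `cluster_data` should have one row per example and one column per feature (500 rows and 2 columns here). `output` holds the class label the generator gave each example; the scatter plot above does not use it, and an unsupervised method such as k-means would not see it either.

```python
import numpy as np
from sklearn import datasets

# Same call as in the exercise cell above
cluster_data, output = datasets.make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, n_repeated=0,
    n_classes=3, n_clusters_per_class=1, class_sep=1.25, random_state=6)

print(cluster_data.shape)  # (number of examples, number of features)
print(output.shape)        # one generated class label per example
print(np.unique(output))   # the distinct class labels that appear
```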