Introduction to Machine Learning

This lab introduces some basic concepts of machine learning with Python. In this lab you will use the K-Nearest Neighbor (KNN) algorithm to classify the species of iris flowers, given measurements of flower characteristics. By completing this lab you will have an overview of an end-to-end machine learning modeling process.

By the completion of this lab, you will:

Follow and understand a complete end-to-end machine learning process including data exploration, data preparation, modeling, and model evaluation.
Develop a basic understanding of the principles of machine learning and associated terminology.
Understand the basic process for evaluating machine learning models.

Overview of KNN classification

Before discussing a specific algorithm, it helps to know a bit of machine learning terminology. In supervised machine learning a set of cases are used to train, test and evaluate the model. Each case is comprised of the values of one or more features and a label value. The features are variables used by the model to *predict the value of the label. Minimizing the errors between the true value of the label and the prediction supervises the training of this model. Once the model is trained and tested, it can be evaluated based on the accuracy in predicting the label of a new set of cases.

In this lab you will use randomly selected cases to first train and then evaluate a k-nearest-neighbor (KNN) machine learning model. The goal is to predict the type or class of the label, which makes the machine learning model a classification model.

The k-nearest-neighbor algorithm is conceptually simple. In fact, there is no formal training step. Given a known set of cases, a new case is classified by majority vote of the K (where $k = 1, 2, 3$ , etc.) points nearest to the values of the new case; that is, the nearest neighbors of the new case.

The schematic figure below illustrates the basic concepts of a KNN classifier. In this case there are two features, the values of one shown on the horizontal axis and the values of the other shown on the vertical axis. The cases are shown on the diagram as one of two classes, red triangles and blue circles. To summarize, each case has a value for the two features, and a class. The goal of the KNN algorithm is to classify cases with unknown labels.

Continuing with the example, on the left side of the diagram the $K = 1$ case is illustrated. The nearest neighbor is a red triangle. Therefore, this KNN algorithm will classify the unknown case, '?', as a red triangle. On the right side of the diagram, the $K = 3$ case is illustrated. There are three near neighbors within the circle. The majority of nearest neighbors for $K = 3$ are the blue circles, so the algorithm classifies the unknown case, '?', as a blue circle. Notice that class predicted for the unknown case changes as K changes. This behavior is inherent in the KNN method.

alt

**KNN for k = 1 and k = 3**

There are some additional considerations in creating a robust KNN algorithm. These will be addressed later in this course.

Examine the data set

In this lab you will work with the Iris data set. This data set is famous in the history of statistics. The first publication using these data in statistics by the pioneering statistician Ronald A Fisher was in 1936. Fisher proposed an algorithm to classify the species of iris flowers from physical measurements of their characteristics. The data set has been used as a teaching example ever since.

Now, you will load and examine these data which are in the statsmodels.api package. Execute the code in the cell below and examine the first few rows of the data frame.

import pandas as pd
#from statsmodels.api import datasets
from sklearn import datasets ## Get dataset from sklearn

## Import the dataset from sklearn.datasets
iris = datasets.load_iris()

## Create a data frame from the dictionary
species = [iris.target_names[x] for x in iris.target]
iris = pd.DataFrame(iris['data'], columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris['Species'] = species

There are four features, containing the dimensions of parts of the iris flower structures. The label column is the Species of the flower. The goal is to create and test a KNN algorithm to correctly classify the species.

Next, you will execute the code in the cell below to show the data types of each column.