Data Preparation for Machine Learning

Data preparation is a vital step in the machine learning pipeline. Just as visualization is necessary to understand the relationships in data, proper preparation, also called data munging, is required for machine learning models to perform well.

The process of data preparation is highly interactive and iterative. A typical process includes at least the following steps:

  1. Visualization of the dataset to understand the relationships and identify possible problems with the data.
  2. Data cleaning and transformation to address the problems identified. In many cases, step 1 is then repeated to verify that the cleaning and transformation had the desired effect.
  3. Construction and evaluation of machine learning models. Visualizing the results often reveals that further data preparation is required, which means going back to step 1.

In this lab you will learn how to:

  • Recode character strings to eliminate characters that will not be processed correctly.
  • Find and treat missing values.
  • Set the correct data type for each column.
  • Transform categorical features to create categories with more cases and a coding that is more likely to be useful in predicting the label.
  • Apply transformations to numeric features and the label to improve their distributional properties.
  • Locate and treat duplicate cases.
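
The operations listed above can be sketched on a small invented data frame. This is only a minimal illustration: the column names, the values, the '?' missing-value marker, and the merged category name hardtop_convert are all assumptions made for the example, not the lab's actual data or its exact method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    'body-style': ['sedan', 'sedan', 'wagon', 'convertible', 'hardtop'],
    'num-doors': ['four', 'four', 'four', 'two', 'two'],
    'price': ['13495', '13495', '?', '16500', '17450'],
})

# Recode column names: '-' is replaced so the names are easier to work with.
toy.columns = [name.replace('-', '_') for name in toy.columns]

# Find and treat missing values: '?' is assumed to mark a missing entry.
toy.loc[toy['price'] == '?', 'price'] = np.nan
toy = toy.dropna(subset=['price']).reset_index(drop=True)

# Set the correct data type: price was read as text, so convert it to numbers.
toy['price'] = pd.to_numeric(toy['price'])

# Transform a categorical feature: merge two rare categories into one.
toy['body_style'] = toy['body_style'].replace(
    {'convertible': 'hardtop_convert', 'hardtop': 'hardtop_convert'})

# Transform a numeric column: a log transform reduces right skew.
toy['log_price'] = np.log(toy['price'])

# Locate and treat duplicate cases: keep the first of each identical row.
print('duplicate rows:', toy.duplicated().sum())
toy = toy.drop_duplicates().reset_index(drop=True)

print(toy)
```

Each step works on the result of the previous one, which is why the order matters: the price column cannot be converted to a numeric type until the '?' entries have been dealt with.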

An example

As a first example, you will prepare the automotive dataset. Careful preparation of this dataset, or any dataset, is required before attempting to train any machine learning model. This dataset has a number of problems which must be addressed. Further, some feature engineering will be applied.

Load the dataset

As a first step you must load the dataset.

Execute the code in the cell below to load the packages required to run this notebook.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import numpy.random as nr
import math

%matplotlib inline

Execute the code in the cell below to load the dataset and print the first few rows of the data frame.

auto_prices = pd.read_csv('Automobile price data _Raw_.csv')
auto_prices.head(20)
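
Once a file is loaded, a quick look at the column types and at any placeholder strings can help in spotting problems before cleaning begins. The sketch below uses a small inline stand-in for the CSV file so that it runs on its own. The column names, the values, and the use of '?' as a missing-value marker are assumptions for illustration and may not match the actual file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; contents are invented.
csv_text = """make,num-of-doors,price
audi,four,13950
bmw,two,?
dodge,?,6377
"""
sample = pd.read_csv(io.StringIO(csv_text))

# A numeric column that was read as text suggests it contains non-numeric entries.
print(sample.dtypes)

# Count the assumed '?' placeholder in each column.
placeholder_counts = (sample == '?').sum()
print(placeholder_counts)
```

The same two checks could be applied to auto_prices after loading, with the placeholder string adjusted to whatever marker the file actually uses.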