Risk Models Using Tree-based Models

Welcome to the second assignment of Course 2!

Outline

1. Import Packages
2. Load the Dataset
3. Explore the Dataset
4. Dealing with Missing Data
- Exercise 1
5. Decision Trees
- Exercise 2
6. Random Forests
- Exercise 3
7. Imputation
8. Error Analysis
- Exercise 4
9. Imputation Approaches
- Exercise 5
- Exercise 6
10. Comparison
11. Explanations: SHAP

In this assignment, you'll gain experience with tree-based models by predicting the 10-year risk of death for individuals in the NHANES I epidemiology dataset (for a detailed description of this dataset, see the CDC website). This is a challenging task and a good test bed for the machine learning methods covered this week.

As you go through the assignment, you'll learn about:

Dealing with Missing Data
- Complete Case Analysis
- Imputation
Decision Trees
- Evaluation
- Regularization
Random Forests
- Hyperparameter Tuning

1. Import Packages

We'll first import all the common packages that we need for this assignment.

shap is a library that explains predictions made by machine learning models.
sklearn is one of the most popular machine learning libraries.
itertools allows us to conveniently manipulate iterable objects such as lists.
pydotplus is used together with IPython.display.Image to visualize graph structures such as decision trees.
numpy is a fundamental package for scientific computing in Python.
pandas is what we'll use to manipulate our data.
seaborn is a plotting library which has some convenient functions for visualizing missing data.
matplotlib is a plotting library.
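
As a rough illustration of why itertools is handy here, the sketch below enumerates every combination of a small hyperparameter search space, which is the kind of loop a random forest tuning step typically needs. The parameter names are real RandomForestClassifier arguments, but the candidate values are placeholders chosen for this example and are not taken from the assignment.

```python
import itertools

# Hypothetical search space: the keys are real RandomForestClassifier
# parameters, the values are placeholders for illustration only.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

names = list(param_grid)

# itertools.product yields one tuple per combination, taking one value
# from each parameter's list of candidates.
combinations = [
    dict(zip(names, values))
    for values in itertools.product(*param_grid.values())
]

print(len(combinations))   # 2 * 3 * 2 = 12 candidate settings
print(combinations[0])
```

Each dictionary can then be unpacked into a model constructor, for example `RandomForestClassifier(**combinations[0])`, and scored on a validation set.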

import shap
import sklearn
import itertools
import pydotplus
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from IPython.display import Image 

from sklearn.tree import export_graphviz
# sklearn.externals.six was removed in scikit-learn 0.23; io.StringIO is the standard-library equivalent.
from io import StringIO
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer

# We'll also import some helper functions that will be useful later on.
from util import load_data, cindex
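
The `cindex` helper's source is not shown here, so the following is only a minimal sketch of what a concordance index (c-index) computes for binary outcomes, under the assumption that it scores how well risk predictions rank patients who died above those who survived. The function name `concordance_index_sketch` and its signature are hypothetical and may differ from the implementation in `util`.

```python
import itertools


def concordance_index_sketch(outcomes, scores):
    """Fraction of permissible pairs ranked correctly (ties count as 0.5).

    outcomes: iterable of 0/1 labels (1 = event occurred).
    scores:   iterable of predicted risk scores, same length as outcomes.
    Hypothetical helper for illustration; not the assignment's util.cindex.
    """
    concordant = 0.0
    permissible = 0
    pairs = itertools.combinations(zip(outcomes, scores), 2)
    for (y_a, s_a), (y_b, s_b) in pairs:
        if y_a == y_b:
            # Pairs with the same outcome say nothing about ranking quality.
            continue
        permissible += 1
        # Order the pair so the first score belongs to the patient with the event.
        s_event, s_no_event = (s_a, s_b) if y_a > y_b else (s_b, s_a)
        if s_event > s_no_event:
            concordant += 1.0
        elif s_event == s_no_event:
            concordant += 0.5
    if permissible == 0:
        raise ValueError("Need at least one pair with different outcomes.")
    return concordant / permissible


example = concordance_index_sketch([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.6])
print(example)  # 3.5 concordant out of 4 permissible pairs -> 0.875
```

A value of 1.0 means every patient who died received a higher score than every patient who survived, while 0.5 corresponds to random ranking.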