Decision Trees and Random Forests

The following topics are covered in this tutorial:

Downloading a real-world dataset
Preparing a dataset for training
Training and interpreting decision trees
Training and interpreting random forests
Overfitting & hyperparameter tuning
Making predictions on single inputs

How to run the code

This tutorial is an executable Jupyter notebook hosted on Jovian. You can run this tutorial and experiment with the code examples in a couple of ways: using free online resources (recommended) or on your computer.

Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the Run button at the top of this page and select Run on Colab. You will be prompted to connect your Google Drive account so that this notebook can be placed into your drive for execution.

Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up Python, download the notebook and install the required libraries. We recommend using the Conda distribution of Python. Click the Run button at the top of this page, select the Run Locally option, and follow the instructions.

Problem Statement

This tutorial takes a practical and coding-focused approach. We'll learn how to use decision trees and random forests to solve a real-world problem from Kaggle:

QUESTION: The Rain in Australia dataset contains about 10 years of daily weather observations from numerous Australian weather stations. Here's a small sample from the dataset:

As a data scientist at the Bureau of Meteorology, you are tasked with creating a fully-automated system that can use today's weather data for a given location to predict whether it will rain at the location tomorrow.
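Before touching the real data, it may help to see the shape of the task: each row describes one day at one location, and the target is a yes/no answer about the next day. The sketch below is a minimal illustration in plain Python, not the method used later in the tutorial. The rows are invented, the field names `humidity_3pm` and `rain_tomorrow` are hypothetical, and the helper functions are our own. It learns a single threshold on one feature, which is the simplest possible decision tree (often called a decision stump).

```python
# Toy observations: invented for illustration, NOT rows from the real dataset.
# `humidity_3pm` and `rain_tomorrow` are hypothetical field names.
observations = [
    {"humidity_3pm": 22.0, "rain_tomorrow": False},
    {"humidity_3pm": 35.0, "rain_tomorrow": False},
    {"humidity_3pm": 41.0, "rain_tomorrow": True},
    {"humidity_3pm": 48.0, "rain_tomorrow": False},
    {"humidity_3pm": 55.0, "rain_tomorrow": False},
    {"humidity_3pm": 63.0, "rain_tomorrow": True},
    {"humidity_3pm": 71.0, "rain_tomorrow": True},
    {"humidity_3pm": 80.0, "rain_tomorrow": True},
    {"humidity_3pm": 92.0, "rain_tomorrow": True},
]


def predict_rain_tomorrow(today, threshold):
    """Predict rain tomorrow when today's afternoon humidity exceeds the threshold."""
    return today["humidity_3pm"] > threshold


def accuracy(rows, threshold):
    """Fraction of rows where the threshold rule matches the recorded outcome."""
    hits = sum(
        predict_rain_tomorrow(row, threshold) == row["rain_tomorrow"] for row in rows
    )
    return hits / len(rows)


def fit_threshold(rows):
    """Try the midpoint between adjacent humidity values and keep the most accurate one."""
    values = sorted({row["humidity_3pm"] for row in rows})
    candidates = [(low + high) / 2 for low, high in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(rows, t))


threshold = fit_threshold(observations)
train_accuracy = accuracy(observations, threshold)
print(f"learned threshold: {threshold}, training accuracy: {train_accuracy:.2f}")
```

A real decision tree repeats this kind of split search across many features and many levels, and a random forest combines many such trees. Note that the accuracy printed here is measured on the same rows used to pick the threshold, so it says nothing yet about how well the rule would work on new days.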

Let's install and import some required libraries before we begin.