Logistic Regression with Scikit Learn - Machine Learning with Python

This tutorial is part of the Zero to Data Science Bootcamp by Jovian and Machine Learning with Python: Zero to GBMs.


The following topics are covered in this tutorial:

  • Downloading a real-world dataset from Kaggle
  • Exploratory data analysis and visualization
  • Splitting a dataset into training, validation & test sets
  • Filling/imputing missing values in numeric columns
  • Scaling numeric features to a (0, 1) range
  • Encoding categorical columns as one-hot vectors
  • Training a logistic regression model using Scikit-learn
  • Evaluating a model using a validation set and test set
  • Saving a model to disk and loading it back
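
To give a feel for how these steps fit together, here is a minimal end-to-end sketch. It does not use the Kaggle data: the DataFrame is randomly generated, and the column names (`humidity`, `pressure`, `wind_dir`, `rain_tomorrow`), the 60/20/20 split, the mean imputation strategy and the file name `rain_model.joblib` are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data; the column names are hypothetical, not the Kaggle schema.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "humidity": rng.uniform(20, 100, n),
    "pressure": rng.normal(1010, 8, n),
    "wind_dir": rng.choice(["N", "E", "S", "W"], n),
})
df["rain_tomorrow"] = np.where(df["humidity"] + rng.normal(0, 10, n) > 70, "Yes", "No")
df.loc[rng.choice(n, 20, replace=False), "pressure"] = np.nan  # inject missing values

numeric_cols, categorical_cols = ["humidity", "pressure"], ["wind_dir"]

# Split into 60% training, 20% validation and 20% test rows.
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)

# Fit every preprocessing step on the training set only.
imputer = SimpleImputer(strategy="mean").fit(train_df[numeric_cols])
scaler = MinMaxScaler().fit(imputer.transform(train_df[numeric_cols]))
encoder = OneHotEncoder(handle_unknown="ignore").fit(train_df[categorical_cols])

def prepare(frame):
    """Impute and scale numeric columns, one-hot encode categorical ones."""
    numeric = scaler.transform(imputer.transform(frame[numeric_cols]))
    onehot = encoder.transform(frame[categorical_cols]).toarray()
    return np.hstack([numeric, onehot])

# Train, then evaluate on the validation and test sets.
model = LogisticRegression()
model.fit(prepare(train_df), train_df["rain_tomorrow"])
print("validation accuracy:", model.score(prepare(val_df), val_df["rain_tomorrow"]))
print("test accuracy:", model.score(prepare(test_df), test_df["rain_tomorrow"]))

# Save the model to disk and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rain_model.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)
```

Fitting the imputer, scaler and encoder on the training rows alone keeps information from the validation and test sets out of the model, which is why the split comes before the preprocessing here.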

Problem Statement

This tutorial takes a practical and coding-focused approach. We'll learn how to apply logistic regression to a real-world dataset from Kaggle:

QUESTION: The Rain in Australia dataset contains about 10 years of daily weather observations from numerous Australian weather stations. Here's a small sample from the dataset:


As a data scientist at the Bureau of Meteorology, you are tasked with creating a fully-automated system that can use today's weather data for a given location to predict whether it will rain at the location tomorrow.


Linear Regression vs. Logistic Regression

In the previous tutorial, we attempted to predict a person's annual medical charges using linear regression. In this tutorial, we'll use logistic regression, which is better suited for classification problems like predicting whether it will rain tomorrow. Identifying whether a given problem is a classification or regression problem is an important first step in machine learning.
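
As a rough illustration of the difference, the sketch below fits both models to a tiny made-up dataset with one feature and a 0/1 target (the numbers are invented, not taken from the weather data). Linear regression returns unbounded values, while logistic regression passes a linear combination of the inputs through the sigmoid function, so its outputs can be read as probabilities between 0 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data, made up for illustration: one feature and a binary target.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Inputs well outside and inside the range seen during training.
X_new = np.array([[-5.0], [3.5], [12.0]])

linear_out = linear.predict(X_new)                 # can fall below 0 or above 1
probabilities = logistic.predict_proba(X_new)[:, 1]  # always between 0 and 1
labels = logistic.predict(X_new)                   # discrete classes (0 or 1)

# The probability is the sigmoid of a linear function of the input.
z = X_new @ logistic.coef_.T + logistic.intercept_
manual = (1 / (1 + np.exp(-z))).ravel()

print("linear regression outputs:", linear_out)
print("logistic regression probabilities:", probabilities)
print("logistic regression labels:", labels)
print("matches manual sigmoid:", np.allclose(manual, probabilities))
```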

Classification Problems

Problems where each input must be assigned a discrete category (also called label or class) are known as classification problems.

Here are some examples of classification problems:

  • Rainfall prediction: Predicting whether it will rain tomorrow using today's weather data (classes are "Will Rain" and "Will Not Rain")
  • Breast cancer detection: Predicting whether a tumor is "benign" (noncancerous) or "malignant" (cancerous) using information like its radius, texture, etc.
  • Loan repayment prediction: Predicting whether applicants will repay a home loan based on factors like age, income, loan amount, number of children, etc.
  • Handwritten digit recognition: Identifying which digit from 0 to 9 a picture of handwritten text represents

Can you think of some more classification problems?

EXERCISE: Replicate the steps followed in this tutorial with a dataset for each of the problems listed above.

Classification problems can be binary (yes/no) or multiclass (picking one of many classes).
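
The distinction can be seen with two small datasets bundled with scikit-learn, which correspond loosely to two of the examples above: `load_breast_cancer` (two classes) and `load_digits` (ten classes). This is only a sketch: the choice of datasets, the `StandardScaler` step and `max_iter=1000` are assumptions made here to get a quick fit, and the models are fit on all rows without a validation split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary classification: each tumor is either malignant or benign.
cancer = load_breast_cancer()
binary_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
binary_model.fit(cancer.data, cancer.target)

# Multiclass classification: each image is one of the digits 0 to 9.
digits = load_digits()
multi_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
multi_model.fit(digits.data, digits.target)

# predict_proba returns one column per class; each row sums to 1.
binary_proba = binary_model.predict_proba(cancer.data[:5])
multi_proba = multi_model.predict_proba(digits.data[:5])

print("binary classes:", list(cancer.target_names), "->", binary_proba.shape)
print("multiclass classes:", list(multi_model.classes_), "->", multi_proba.shape)
```

In both cases the predicted class is the one with the highest probability in its row; the only difference is how many columns there are to choose from.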