
# Logistic Regression with Scikit Learn - Machine Learning with Python

This tutorial is part of Zero to Data Science Bootcamp by Jovian and Machine Learning with Python: Zero to GBMs.

The following topics are covered in this tutorial:

• Exploratory data analysis and visualization
• Splitting a dataset into training, validation & test sets
• Filling/imputing missing values in numeric columns
• Scaling numeric features to a $[0, 1]$ range
• Encoding categorical columns as one-hot vectors
• Training a logistic regression model using Scikit-learn
• Evaluating a model using a validation set and test set
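As a rough preview, the steps listed above can be sketched end to end in a few lines of Scikit-learn. This is only a minimal sketch under stated assumptions: the data is synthetic and generated on the spot (it is not the dataset used in this tutorial), and the column names `humidity`, `pressure`, `wind_dir` and `rain_tomorrow`, as well as the helper `prepare`, are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data (NOT the real dataset): two numeric columns,
# one categorical column and a binary target derived from humidity.
rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
    "humidity": rng.uniform(0, 100, n),
    "pressure": rng.normal(1015, 7, n),
    "wind_dir": rng.choice(["N", "E", "S", "W"], n),
})
df["rain_tomorrow"] = np.where(df["humidity"] + rng.normal(0, 15, n) > 60, "Yes", "No")
df.loc[rng.choice(n, 30, replace=False), "pressure"] = np.nan  # inject missing values

# 1. Split into training (60%), validation (20%) and test (20%) sets.
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)

numeric_cols = ["humidity", "pressure"]
categorical_cols = ["wind_dir"]

# 2-4. Fit the imputer, scaler and encoder on the training set only.
imputer = SimpleImputer(strategy="mean").fit(train_df[numeric_cols])
scaler = MinMaxScaler().fit(imputer.transform(train_df[numeric_cols]))
encoder = OneHotEncoder(handle_unknown="ignore").fit(train_df[categorical_cols])

def prepare(frame):
    """Impute and scale numeric columns, one-hot encode categorical ones."""
    numeric = scaler.transform(imputer.transform(frame[numeric_cols]))
    onehot = encoder.transform(frame[categorical_cols]).toarray()
    return np.hstack([numeric, onehot])

# 5. Train a logistic regression model.
model = LogisticRegression(max_iter=1000)
model.fit(prepare(train_df), train_df["rain_tomorrow"])

# 6. Evaluate on the validation and test sets.
val_acc = accuracy_score(val_df["rain_tomorrow"], model.predict(prepare(val_df)))
test_acc = accuracy_score(test_df["rain_tomorrow"], model.predict(prepare(test_df)))
print(f"validation accuracy: {val_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Fitting the imputer, scaler and encoder on the training set alone is a deliberate choice here: it keeps information from the validation and test sets out of the preprocessing step.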

### Problem Statement

This tutorial takes a practical and coding-focused approach. We'll learn how to apply logistic regression to a real-world dataset from Kaggle:

QUESTION: The Rain in Australia dataset contains about 10 years of daily weather observations from numerous Australian weather stations. Here's a small sample from the dataset:

As a data scientist at the Bureau of Meteorology, you are tasked with creating a fully-automated system that can use today's weather data for a given location to predict whether it will rain at the location tomorrow.

### Linear Regression vs. Logistic Regression

In the previous tutorial, we attempted to predict a person's annual medical charges using linear regression. In this tutorial, we'll use logistic regression, which is better suited for classification problems like predicting whether it will rain tomorrow. Identifying whether a given problem is a classification or regression problem is an important first step in machine learning.
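To make the contrast concrete, here is a small sketch in plain Python. A linear model outputs an unbounded number (suitable for something like medical charges), while logistic regression passes the same weighted sum through the sigmoid function to obtain a probability between 0 and 1, which is then thresholded to pick a class. The weights, bias and feature values below are made-up numbers for illustration, not values learned from any dataset, and the function names are hypothetical.

```python
import math

def linear_output(features, weights, bias):
    """Weighted sum of the inputs: the prediction of a linear regression model."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Logistic regression: sigmoid of the linear output, then a threshold."""
    probability = sigmoid(linear_output(features, weights, bias))
    label = "Will Rain" if probability >= threshold else "Will Not Rain"
    return probability, label

# Hypothetical weights and inputs (e.g. humidity and wind speed), for illustration only.
weights, bias = [0.04, -0.02], -1.5
features = [85.0, 12.0]

probability, label = predict_class(features, weights, bias)
print(f"linear output: {linear_output(features, weights, bias):.2f}")
print(f"probability of rain: {probability:.2f} -> {label}")
```

The 0.5 threshold is a common default rather than a requirement; it can be moved when one kind of mistake is costlier than the other.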

#### Classification Problems

Problems where each input must be assigned a discrete category (also called label or class) are known as classification problems.

Here are some examples of classification problems:

• Rainfall prediction: Predicting whether it will rain tomorrow using today's weather data (classes are "Will Rain" and "Will Not Rain")
• Breast cancer detection: Predicting whether a tumor is "benign" (noncancerous) or "malignant" (cancerous) using information like its radius, texture, etc.
• Loan repayment prediction: Predicting whether applicants will repay a home loan based on factors like age, income, loan amount, number of children, etc.
• Handwritten digit recognition: Identifying which digit from 0 to 9 a picture of handwritten text represents.

Can you think of some more classification problems?

EXERCISE: Replicate the steps followed in this tutorial with a dataset for each of the problems listed above.

Classification problems can be binary (yes/no) or multiclass (picking one of many classes).
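The difference between binary and multiclass problems can be seen with two small datasets that ship with Scikit-learn: `load_breast_cancer` (two classes) and `load_digits` (ten classes). Using these bundled copies is an assumption made for convenience; they are not necessarily the datasets intended for the exercise above. The helper `fit_and_score` is a hypothetical name, and the sketch only shows that the same `LogisticRegression` estimator handles both cases.

```python
from sklearn.datasets import load_breast_cancer, load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def fit_and_score(X, y):
    """Train logistic regression on a 75/25 split and return the model and validation accuracy."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )
    scaler = MinMaxScaler().fit(X_train)  # scale features using training data only
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_train), y_train)
    return model, model.score(scaler.transform(X_val), y_val)

# Binary: benign vs. malignant tumors.
binary_model, binary_acc = fit_and_score(*load_breast_cancer(return_X_y=True))

# Multiclass: handwritten digits 0 to 9.
multi_model, multi_acc = fit_and_score(*load_digits(return_X_y=True))

print(f"binary: {len(binary_model.classes_)} classes, validation accuracy {binary_acc:.3f}")
print(f"multiclass: {len(multi_model.classes_)} classes, validation accuracy {multi_acc:.3f}")
```

In the binary case the model learns a single set of weights, while in the multiclass case it learns one set per class, which is visible in the shape of `coef_`.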