Data Preparation for Imbalanced Data: Credit Card Fraud Detection

TL;DR This notebook evaluates the performance of a simple Logistic Regression on the imbalanced Credit Card Fraud dataset from Kaggle. The focus lies on data preparation: an Autoencoder, over-sampling (SMOTE), random under-sampling, and a combined sampling method are used to transform the data before applying the model. Because it is a common mistake to apply such transformations in a way that leads to data leakage, this topic is also covered. In general, classification on the resampled datasets detects more fraudulent transactions at the cost of misclassifying more normal ones, whereas Logistic Regression on the latent representation from the Autoencoder is a more balanced approach and achieves the highest average precision.

Introduction

  • Context: To enhance security, credit card companies want to detect fraudulent transactions, and Machine Learning can help with this.
  • Limitations: It is relatively easy to increase the number of correctly classified fraudulent transactions at the cost of misclassifying normal transactions. Weighing the economic costs of misclassifying normal versus fraudulent transactions is a business decision and therefore beyond the scope of this analysis.
  • The Dataset: The dataset contains transactions made with credit cards by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.
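
As a quick sanity check, the stated imbalance can be verified directly from the CSV (a minimal sketch, assuming the Kaggle file has been downloaded as creditcard.csv):

    import pandas as pd

    # Load the Kaggle credit card fraud dataset (file path is an assumption)
    df = pd.read_csv("creditcard.csv")

    # 'Class' is 1 for fraudulent and 0 for normal transactions
    fraud_rate = df["Class"].mean()
    print(f"{df['Class'].sum()} frauds out of {len(df)} transactions "
          f"({fraud_rate:.3%})")  # expected: 492 frauds, ~0.172%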

Approach and Evaluation Metric

A Logistic Regression is performed after applying different techniques for dealing with imbalanced data to the dataset. Several options were considered for addressing the imbalance problem; the following five are implemented here (a short resampling sketch follows the list):

  1. Logistic Regression: shows the performance on the untouched imbalanced data and serves as a baseline for comparison.
  2. Autoencoder: used to extract a latent representation of the training data; compressing the data into a lower-dimensional space may distill its most informative structure.
  3. Over-sampling: yields balanced classes, but the enlarged training set makes training computationally more expensive.
  4. Under-sampling: yields balanced classes at the cost of discarding information from the majority class.
  5. Combined Sampling: uses SMOTE together with Tomek links to combine over- and under-sampling.
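
The three resampling strategies map directly onto the imbalanced-learn API. The sketch below assumes a train/test split (X_train, y_train) has already been made; resampling is applied to the training split only, which is the leakage-safe setup mentioned in the TL;DR:

    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.combine import SMOTETomek

    # Resample only the training data so that no synthetic or duplicated
    # samples leak into the evaluation set
    X_over, y_over = SMOTE(random_state=42).fit_resample(X_train, y_train)
    X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
    X_comb, y_comb = SMOTETomek(random_state=42).fit_resample(X_train, y_train)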

As proposed by the provider of the dataset, all models are evaluated using the Area Under the Precision-Recall Curve (AUPRC), because plain accuracy from the confusion matrix is not meaningful for highly unbalanced classification. scikit-learn provides this score through the average_precision_score function, also accessible via the 'average_precision' scorer string.
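
In code, the evaluation looks roughly like the following (a sketch; X_train, y_train, X_test, y_test are assumed from an earlier split). Note that average_precision_score expects continuous scores rather than hard class labels:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Use the predicted probability of the positive (fraud) class as the score
    y_scores = model.predict_proba(X_test)[:, 1]
    print(f"AUPRC (average precision): {average_precision_score(y_test, y_scores):.4f}")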

Prerequisites

Import Packages and Set Global Variables
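
A plausible set of imports and globals for this notebook might look as follows (a sketch; the exact packages and the seed and split values are assumptions):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.combine import SMOTETomek

    RANDOM_STATE = 42  # global seed for reproducibility (assumed value)
    TEST_SIZE = 0.2    # hold-out fraction for the test split (assumed value)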