Predicting Financial Distress with Python and Scikit-Learn
Banks make thousands of decisions every year about who gets financing and on what terms. They typically rely on credit scoring algorithms, which take a number of factors about a borrower and attempt to predict whether that person will experience financial distress in the near future.
In this project, we will take a look at real-world data for 150,000 borrowers and use machine learning techniques to build models that could be deployed in a bank to help both the bank and a potential borrower make the best possible financial decision.
Logistic regression is a commonly used technique for binary classification problems. A classification problem is one whose answer is a category: yes or no in the binary case, or one of several choices in the multiclass case. Ours is a classification problem too, even though the model outputs the probability that someone will experience financial distress. That probability becomes a yes/no prediction once it is compared against a threshold.
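To make the link between probabilities and yes/no answers concrete, here is a minimal sketch using Scikit-Learn's `LogisticRegression`. It runs on synthetic data from `make_classification`, which is only a stand-in and not the Kaggle dataset used in this project; the variable names and the roughly 10% positive rate are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for borrower data (NOT the real dataset):
# 1,000 rows, 5 numeric features, roughly 10% positive ("distress") labels.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# predict_proba returns one column per class; column 1 is P(y = 1),
# i.e. the estimated probability of financial distress.
distress_probability = model.predict_proba(X)[:, 1]

# For a binary problem, predict is equivalent to thresholding that
# probability at 0.5.
predicted_label = model.predict(X)
manual_label = (distress_probability > 0.5).astype(int)

print("Share of rows where both labels agree:",
      (predicted_label == manual_label).mean())
```

A bank would not have to use 0.5 as the cut-off. Keeping the probability around makes it possible to pick a threshold that reflects how costly a missed default is compared with a wrongly rejected application.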
Decision Trees and Random Forests
Another machine learning algorithm is the Decision Tree. Decision trees can be used for both classification and regression problems, and many trees are often combined into a Random Forest, a widely used ensemble method that often performs very well on tabular data. Essentially, a random forest is like putting a complex question to thousands of random people and then aggregating their answers. This is called the wisdom of the crowd: the aggregated answer is often more accurate than a single expert's answer. With a random forest, we train many decision trees, each on a random sample of the data, and average their predictions, which usually gives a more accurate result than a single tree.
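The averaging idea can be checked directly. The sketch below trains a single `DecisionTreeClassifier` and a `RandomForestClassifier` on synthetic data (again a stand-in, not the project's dataset) and confirms that the forest's predicted probabilities are the mean of its individual trees' predicted probabilities. The dataset sizes and the choice of 100 trees are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split into training and validation sets.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

single_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# The forest's class probabilities ...
forest_proba = forest.predict_proba(X_val)

# ... are the average of the probabilities predicted by each fitted tree,
# which Scikit-Learn exposes through the estimators_ attribute.
per_tree_proba = np.stack(
    [tree.predict_proba(X_val) for tree in forest.estimators_]
)
averaged_proba = per_tree_proba.mean(axis=0)

print("Forest equals the average of its trees:",
      np.allclose(forest_proba, averaged_proba))
print("Single tree validation accuracy:", single_tree.score(X_val, y_val))
print("Random forest validation accuracy:", forest.score(X_val, y_val))
```

How much the forest improves on the single tree depends on the data, so the two accuracy figures are best read as a comparison for this particular synthetic sample rather than a general guarantee.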
Here are the steps we'll cover in this project:
- Install and import the needed libraries
- Download real-world data from Kaggle using opendatasets
- Exploratory analysis and visualization
- Split the training dataset into training and validation sets
- Fill/impute missing values in numeric columns
- Scale numeric features to a range of 0-1
- Train a logistic regression model using Scikit-Learn
- Train a random forest using Scikit-Learn
- Evaluate the models on the validation and test sets
- Tune hyperparameters and re-train new models using the entire training dataset
- Save the model parameters for deployment or future tweaks
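As a rough preview, the core of the steps above (split, impute, scale, train, evaluate, save) can be sketched in one short script. Everything here is an assumption made for illustration: the data is synthetic, the column names `f0` to `f3` and `target` are hypothetical, ROC AUC is just one reasonable metric, and `joblib` is one common way to persist Scikit-Learn objects. Downloading the data, exploratory analysis, and hyperparameter tuning are left out.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with hypothetical column names.
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
input_cols = ["f0", "f1", "f2", "f3"]
df = pd.DataFrame(X, columns=input_cols)
df["target"] = y

# Blank out about 5% of one column to mimic missing values.
rng = np.random.default_rng(0)
df.loc[rng.random(len(df)) < 0.05, "f0"] = np.nan

# Split into training and validation sets, keeping the class balance.
train_df, val_df = train_test_split(
    df, test_size=0.25, random_state=0, stratify=df["target"]
)

# Fit the imputer and scaler on the training set only, then reuse them.
imputer = SimpleImputer(strategy="mean").fit(train_df[input_cols])
scaler = MinMaxScaler().fit(imputer.transform(train_df[input_cols]))


def preprocess(frame):
    """Impute missing values, then scale using the training-set min and max."""
    return scaler.transform(imputer.transform(frame[input_cols]))


# Training features land in the 0-1 range; validation values can fall
# slightly outside it because the scaler only saw the training data.
X_train, X_val = preprocess(train_df), preprocess(val_df)
y_train, y_val = train_df["target"], val_df["target"]

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
val_auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_auc[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(name, "validation ROC AUC:", round(val_auc[name], 3))

# Save the preprocessing objects together with a model, then reload them.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "distress_model.joblib")
    joblib.dump({"imputer": imputer, "scaler": scaler,
                 "model": models["random_forest"]}, path)
    restored = joblib.load(path)

restored_pred = restored["model"].predict(X_val)
original_pred = models["random_forest"].predict(X_val)
```

Saving the imputer and scaler alongside the model matters because new borrower data has to go through exactly the same preprocessing before the model can score it.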
Install and Import Libraries
!pip install numpy pandas matplotlib plotly seaborn --quiet