Authored by: Philippe Bouaziz
Class imbalance is a common problem in classification. This dataset has two highly imbalanced classes: hired or not hired. Our goal is to predict the minority class, ‘getting hired’.
In the following subsections, I describe a short dataset exploration with the most common pre-processing steps (dealing with missing data, categorical encoding, normalization and scaling, correlation matrix), followed by feature and model selection with hyperparameter tuning, and model evaluation for two scenarios (with and without synthetic data augmentation).
The code files are in Python 3 and can be run from Kaggle, Colab, or any other notebook environment directly from these links:
Part I: Descriptive Statistics
Part II: Exploratory data analysis (EDA) with feature selection and correlation matrix
Part III: Model pipeline
Part IV: Model selection with hyperparameter tuning and model evaluation
without synthetic data augmentation: https://jovian.ai/yeonathan/quant-rf-best-model-without-smote
with synthetic data augmentation: https://jovian.ai/yeonathan/quant-rf-smote-best-model
Exploratory data analysis (EDA)
First, let’s get familiar with our dataset:
There is one binary label, ‘embauche’ (‘hired’ in French), in the training data, which makes this a binary classification problem. The dataset has 11 columns and 20k individual data points: 2 binary categorical features (sexe, disponibilite), 3 categorical features ('cheveux', 'diplome', 'specialite'), 5 numerical features (salary, note, exp, AgeRange, date), and the binary target variable (‘embauche’). After data splitting, the training dataset will contain 14k data points and the test dataset 6k data points, with the same 11 columns. The dataset is very imbalanced: most of the data belongs to class 0 (88.54%), whereas class 1, the hired candidates, accounts for just 11.46%.
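As a rough sketch of the class balance check and the 14k/6k split described above, the snippet below builds a synthetic stand-in for the data. Only the column name embauche, the 20k size, and the 11.46% hiring rate come from the text; the other columns, the random seeds, and the use of a stratified split are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 20,000 rows, about 11.46% positives.
rng = np.random.default_rng(0)
n_rows = 20_000
df = pd.DataFrame({
    "note": rng.normal(75, 10, n_rows),      # hypothetical values
    "exp": rng.integers(0, 20, n_rows),      # hypothetical values
    "embauche": (rng.random(n_rows) < 0.1146).astype(int),
})

# Share of each class in the full dataset.
print(df["embauche"].value_counts(normalize=True))

# 70/30 split; stratify keeps the class ratio the same in both parts.
train_df, test_df = train_test_split(
    df, test_size=0.3, stratify=df["embauche"], random_state=42
)
print(len(train_df), len(test_df))
```

Stratifying on the target matters with an imbalanced label: without it, a random split can leave the test set with noticeably fewer or more hired candidates than the training set.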
Dealing with missing values
The dataset contains missing values that need to be handled before modeling: 999 out of 20,000 data points, that is 999/20000 ≈ 0.05 (5%). Here I replace them with the mode of each column, but many other imputation methods can be used.
For more information, please refer to my article on Towards Data Science: https://towardsdatascience.com/handling-missing-values-the-exclusive-pythonic-guide-9aa883835655
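A minimal sketch of replacement by the mode with pandas is shown here. The toy values are hypothetical, since the article does not list the actual category values; only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing cells.
df = pd.DataFrame({
    "cheveux": ["brun", "blond", np.nan, "brun", "roux"],
    "diplome": ["master", np.nan, "licence", "master", "master"],
    "exp": [3.0, 5.0, np.nan, 5.0, 8.0],
})

# Number of missing values per column.
print(df.isna().sum())

# Replace each missing value with the most frequent value of its column.
# mode() returns tied modes in sorted order, so iloc[0] takes the first one.
modes = df.mode().iloc[0]
df_filled = df.fillna(modes)
print(df_filled)
```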
Dealing with categorical features
Many datasets contain categorical data. The classical approach consists of detecting the columns with categorical values, counting their distinct values, and understanding their types (binary, multi-valued, …). For this dataset, two methods seem most appropriate: one-hot encoding and label encoding.
One-hot encoding converts a categorical variable into dummy variables. Each category becomes its own 0/1 column, in which 1 indicates that the row belongs to that category and 0 that it does not. It works well for binary columns (yes/no, male/female, …), but it can be tricky for variables with many categories, because it adds one column per category. For instance, a column representing x colors would result in x additional columns (colors_green, colors_blue, …).
Label encoding converts each value in a column to a number. With 4 hair types, for example, we obtain a single column with the 4 values 0, 1, 2, 3.
For more information on categorical encoding, please refer to my article on Towards Data Science: https://towardsdatascience.com/5-categorical-encoding-tricks-you-need-to-know-today-as-a-data-scientist-73cf75595298
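The two encodings can be compared side by side in a short sketch. The category values below ("M", "F", "brun", …) are hypothetical placeholders, and the choice of pandas get_dummies and scikit-learn's LabelEncoder is an assumption about tooling, not necessarily what the notebooks use.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values for two of the dataset's columns.
df = pd.DataFrame({
    "sexe": ["M", "F", "F", "M"],
    "cheveux": ["brun", "blond", "roux", "chatain"],
})

# One-hot encoding: one 0/1 column per category of 'sexe'.
one_hot = pd.get_dummies(df["sexe"], prefix="sexe", dtype=int)
print(one_hot)

# Label encoding: a single column with one integer per category.
# LabelEncoder assigns integers following the sorted order of the categories.
encoder = LabelEncoder()
df["cheveux_encoded"] = encoder.fit_transform(df["cheveux"])
print(df)
```

Note that label encoding introduces an arbitrary order between categories (here alphabetical). Tree-based models usually tolerate this, while linear models may not. scikit-learn documents LabelEncoder for target labels and offers OrdinalEncoder for feature columns, so the latter may be the more idiomatic choice inside a pipeline.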
Normalization and scaling
The hair color, sexe, and disponibilite features are already normalized, but the age, experience, and salary features need proper scaling; in our case, min-max scaling seems the most appropriate.
The relationship between age and hiring rate is the most complicated one: age needs to be binned into ranges and then label-encoded as a categorical feature, as shown in the age range normalization code snippet.
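Min-max scaling maps each column to the [0, 1] interval with (x - min) / (max - min). A small sketch with scikit-learn's MinMaxScaler follows; the numbers are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for two numerical columns.
df = pd.DataFrame({
    "exp": [0.0, 5.0, 10.0, 20.0],
    "salary": [20000.0, 30000.0, 35000.0, 60000.0],
})

# Each column is rescaled independently: (x - min) / (max - min).
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df), columns=df.columns, index=df.index
)
print(scaled)
```

In a train/test setting, the scaler is normally fitted on the training data only and then applied to the test data with transform, so that no information from the test set leaks into the preprocessing.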
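One possible way to turn age into an encoded range is sketched below with pandas. The bin edges and labels are hypothetical, since the ranges actually used are not stated here; only the AgeRange column name comes from the dataset description.

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 27, 34, 45, 58, 63]})

# Hypothetical bin edges and labels (assumptions for illustration).
bins = [0, 25, 35, 50, 120]
labels = ["<=25", "26-35", "36-50", ">50"]

# pd.cut assigns each age to a right-closed interval, e.g. (25, 35].
df["AgeRange"] = pd.cut(df["age"], bins=bins, labels=labels)

# The ordered categorical codes (0, 1, 2, 3) act as the label-encoded feature.
df["AgeRange_code"] = df["AgeRange"].cat.codes
print(df)
```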
Dealing with date-time
The time dimension does not impact the hiring rate; for this reason, I deleted this feature from the dataset.
Feature selection is the natural second step after exploratory data analysis in most data science projects. It consists of choosing the features that yield the best predictions. Easy-to-use feature selection methods include SelectFromModel, feature ranking with recursive feature elimination, filter-based univariate selection, feature importance, and voting selectors. In this project, we use the feature importance method to analyze our dataset with ‘embauche’ as the target. This method scores features using a tree-based classifier: the higher the score, the more important the feature is for predicting the target. In our case, the most important features are 'note', 'salary', 'exp', 'specialite', and 'AgeRange'. Moreover, the correlation matrix supports this observation: 'note', 'salary', and 'exp' are positively correlated with the target.
For more details on other feature selection methods, please refer to my article on Towards Data Science: https://towardsdatascience.com/best-bulletproof-python-feature-selection-methods-every-data-scientist-should-know-7c1027a833c6
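The idea of scoring features with a tree-based classifier and cross-checking with correlations can be sketched as follows. The data is synthetic, the target rule is invented so that 'note' and 'exp' matter while 'noise' does not, and the use of ExtraTreesClassifier is an assumption: any tree ensemble exposing feature_importances_ would work the same way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic features; 'noise' is unrelated to the target by construction.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "note": rng.normal(0, 1, n),
    "exp": rng.normal(0, 1, n),
    "noise": rng.normal(0, 1, n),
})
# Invented target rule driven by 'note' and 'exp' only.
y = ((X["note"] + 0.5 * X["exp"] + rng.normal(0, 0.5, n)) > 1.5).astype(int)

# Fit a tree ensemble and read the impurity-based importance of each feature.
model = ExtraTreesClassifier(n_estimators=200, random_state=0)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Pearson correlation of each feature with the target, as a cross-check.
correlations = X.corrwith(y)
print(correlations)
```

The importances sum to 1, so they are best read as relative scores rather than absolute effect sizes.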
After a preliminary look at our PyCaret model pipeline (refer to the PyCaret machine learning pipeline), I decided to use the random forest (RF) algorithm, since it outperforms the other algorithms such as XGBoost, logistic regression, and support vector machine. RF is also almost ten times faster than the CatBoost classifier. RF is a bagging-type ensemble classifier, which I preferred over LightGBM for 3 reasons:
- Robust to overfitting.
- Easy Parameterization.
- Often used for imbalanced datasets.
To handle the data imbalance issue, I used the following 3 strategies:
Use ensemble cross-validation (CV): cross-validation is used to reflect the model's robustness. The entire dataset was divided into five subsets. In each CV round, 4 subsets are used for training and one to validate the model. In each round, the model also predicts probabilities on the test data. At the end of the cross-validation, the model was evaluated using the F1-score and the recall, since accuracy doesn’t reflect performance on the minority class of our target variable, ‘getting hired’.
Use Synthetic Minority Oversampling (SMOTE): oversample the minority class by synthesizing new examples, interpolated between existing minority examples and their nearest minority neighbors, rather than simply duplicating examples, which would add no new information to the model.
Use class weights/importance: this method imposes a heavier cost penalty (class weight) on errors made on the minority class, thus preventing RF from being biased toward the majority class. I determined the class weight from the ratio between the number of data points in class 1 and the number in class 0, which is approximately 1/9.
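Of the three strategies above, SMOTE is the least intuitive, so here is a bare-bones NumPy sketch of its core interpolation step. It is not the implementation used in the notebooks (in practice a library such as imbalanced-learn is the usual choice); the function name smote_like_oversample, the value of k, and the toy data are all hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and one of their k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random minority point and one of its k neighbors for each new point.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    # Random position along the segment between the point and its neighbor.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class samples with two numerical features.
rng = np.random.default_rng(1)
X_minority = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
X_synthetic = smote_like_oversample(X_minority, n_new=100)
print(X_synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class. Oversampling should be applied to the training folds only, otherwise synthetic neighbors of validation points leak into training and inflate the scores.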
Finally, to find the best parameters, I performed a grid search over specified parameter values using scikit-learn's GridSearchCV. More details can be found on GitHub.
The following results show how the above three techniques helped to improve the model performance. The training performance of the model was steady, with an almost constant recall and F1-score across CV folds, with or without SMOTE. The SMOTE model performs better than the non-SMOTE model, raising the recall for hiring prediction from 61% to 93%. Class weighting balances the classes and improves the recall of the non-SMOTE model from 23% to 61%. With the SMOTE strategy, class weighting doesn’t improve the recall of the model.
There is still scope for improvement and further work. For example, deleting unnecessary features did not help: in our case, using only the 5 most important features reduces our model performance from 93% to 91%, so we might try other feature selection methods. We can also detect and eliminate outliers or create new features to improve the model. Finally, thank you very much for reading.
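A hedged sketch of such a grid search, combining 5-fold cross-validation, recall and F1 scoring, and class weights, could look like this. The data is synthetic, and the parameter grid is hypothetical because the values actually searched are not listed here; the {0: 1, 1: 9} weights are my reading of the 1/9 ratio mentioned earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the real dataset (~11% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.885, 0.115], random_state=0,
)

# Hypothetical grid (assumption for illustration).
param_grid = {
    "n_estimators": [50],
    "max_depth": [4, 8],
    "class_weight": [None, "balanced", {0: 1, 1: 9}],
}

# Five stratified folds: 4 for training and 1 for validation in each round.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring={"recall": "recall", "f1": "f1"},
    refit="recall",  # select the final model by minority-class recall
    cv=cv,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

Refitting on recall reflects the goal of catching as many hired candidates as possible; refitting on "f1" instead would trade some recall for precision.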
If you have any questions, feel free to ask. You can reach out to me: