How to Increase Recall When Given an Imbalanced Dataset for a Machine Learning Model
SMOTE-Tomek Links is often well suited for synthetic data generation when working with an imbalanced dataset.
Tomek Links is an under-sampling technique developed by Tomek in 1976. It is a modification of the Condensed Nearest Neighbors (CNN) rule.
The Condensed Nearest Neighbor rule is based on the nearest-neighbor (NN) rule: examples are selected randomly, especially initially, which results in the retention of unnecessary samples.
Tomek Links, in contrast, requires both samples to be each other's nearest neighbors. In simpler words, Tomek Links uses a more restrictive condition, resulting in fewer samples being removed.
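To make that condition concrete, here is a minimal sketch (not from the original article; it assumes plain NumPy and Euclidean distance) of checking whether a pair of samples forms a Tomek link:

import numpy as np

def is_tomek_link(X, y, i, j):
    # A Tomek link requires different class labels...
    if y[i] == y[j]:
        return False
    # ...and each sample must be the other's nearest neighbor.
    dists_i = np.linalg.norm(X - X[i], axis=1)
    dists_i[i] = np.inf  # exclude self
    dists_j = np.linalg.norm(X - X[j], axis=1)
    dists_j[j] = np.inf  # exclude self
    return dists_i.argmin() == j and dists_j.argmin() == i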
The problem with imbalanced datasets is that they are predominantly composed of the majority class (normal examples), with only a small percentage of the minority class (abnormal or interesting examples). This makes it hard for the model to learn the minority class well. Yet the minority class often carries vital information, as in disease detection, churn, and fraud detection datasets.
There are two approaches to address this problem.
1. Under-sampling of majority class.
It is conducted by removing some random examples from the majority class, at the cost of also discarding some information in the original data.
2. Over-sampling of minority class.
The idea is to duplicate some random examples from the minority class; this technique therefore does not add any new information to the data. Both approaches are sketched in the code after this list.
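As a quick, hedged sketch of both approaches (not part of the original article; it uses imblearn's RandomUnderSampler and RandomOverSampler on an illustrative dummy dataset):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# Dummy imbalanced dataset: roughly 99% majority, 1% minority
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
print(Counter(y))  # e.g., Counter({0: 9900, 1: 100})

# 1. Under-sampling: randomly drop majority class examples
X_u, y_u = RandomUnderSampler(random_state=1).fit_resample(X, y)
print(Counter(y_u))  # Counter({0: 100, 1: 100})

# 2. Over-sampling: randomly duplicate minority class examples
X_o, y_o = RandomOverSampler(random_state=1).fit_resample(X, y)
print(Counter(y_o))  # Counter({0: 9900, 1: 9900})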
SMOTE
Synthetic Minority Oversampling Technique (SMOTE) is one of the most popular oversampling techniques, developed by Chawla et al. (2002). Unlike random oversampling, which only duplicates some random examples from the minority class, SMOTE generates new examples based on the distance between each minority sample and its minority class nearest neighbors (usually measured with Euclidean distance), so the generated examples differ from the original minority class samples. This method is effective because the synthetic data are relatively close to the existing minority class samples in feature space, thereby adding new “information” to the data, unlike random oversampling.
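As a brief illustration (a sketch, not from the original article), SMOTE as implemented in imblearn can be applied in a couple of lines; k_neighbors controls how many minority class neighbors are considered when interpolating new synthetic points:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# SMOTE interpolates between a minority point and one of its
# k nearest minority neighbors to create new synthetic points
X_res, y_res = SMOTE(k_neighbors=5, random_state=1).fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))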
Tomek Links
As discussed earlier, Tomek Links is an under-sampling method. It finds the samples from the majority class that have the lowest Euclidean distance to the minority class data (i.e., the majority class points closest to the minority class, which makes them ambiguous to classify), and then removes them.
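A minimal sketch of this step on its own (an illustration, not from the original article), using imblearn's TomekLinks; sampling_strategy='majority' removes only the majority class member of each link:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)

# Remove majority class samples that participate in Tomek links;
# since links are rare, relatively few samples are dropped
X_res, y_res = TomekLinks(sampling_strategy='majority').fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))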
SMOTE-Tomek Links
A combination of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance than under-sampling the majority class alone. This combination was first introduced by Batista et al. (2003).
This method combines SMOTE's ability to generate synthetic data for the minority class with Tomek Links' ability to remove the data identified as Tomek links from the majority class (that is, the majority class samples that are closest to the minority class data).
We can say that over-sampling is done using SMOTE while cleaning is done using Tomek links.
To understand this method better, let’s take a look at the code.
Code
Import libraries
import pandas as pd
import numpy as np
from imblearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import TomekLinks
Now, generate synthetic data
# Dummy dataset study case
X, Y = make_classification(n_samples=10000, n_features=4, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
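Before resampling, it is worth confirming the imbalance; here is a small check (an addition, not part of the original code) using Counter:

from collections import Counter

# With weights=[0.99] and flip_y=0, expect roughly 9900 vs. 100
print(Counter(Y))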
Normally, a model will fail to learn the minority class well, but by using the SMOTE-Tomek Links method we can improve the model's ability to handle the imbalanced data.
## With SMOTE-Tomek Links method
# Define model
model = RandomForestClassifier(criterion='entropy')
# Define SMOTE-Tomek Links
resample = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'))
# Define pipeline
pipeline = Pipeline(steps=[('r', resample), ('m', model)])
# Define evaluation procedure (here we use Repeated Stratified K-Fold CV)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# Evaluate model
scoring = ['accuracy', 'precision_macro', 'recall_macro']
scores = cross_validate(pipeline, X, Y, scoring=scoring, cv=cv, n_jobs=-1)
# Summarize performance
print('Mean Accuracy: %.4f' % np.mean(scores['test_accuracy']))
print('Mean Precision: %.4f' % np.mean(scores['test_precision_macro']))
print('Mean Recall: %.4f' % np.mean(scores['test_recall_macro']))
The result is as follows:
Mean Accuracy: 0.9805
Mean Precision: 0.6499
Mean Recall: 0.8433
The accuracy and precision metrics might decrease, but the recall metric is higher, which means that the model performs better at correctly predicting the minority class label when SMOTE-Tomek Links is used to handle the imbalanced data.
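For comparison, a baseline without resampling can be evaluated the same way; this sketch (an assumption, not from the original article) reuses the cv, scoring, X, and Y objects defined above, and on such a skewed dataset its macro recall would typically be noticeably lower:

# Baseline: same classifier and CV protocol, but no resampling step
baseline = RandomForestClassifier(criterion='entropy')
base_scores = cross_validate(baseline, X, Y, scoring=scoring, cv=cv, n_jobs=-1)
print('Baseline Mean Recall: %.4f' % np.mean(base_scores['test_recall_macro']))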
References
This article was originally published at https://www.vevesta.com/blog/30-SMOTE-Tomek-Links