Geometric SMOTE for Imbalanced Datasets with Nominal and Continuous Features

Published in UNDER SUBMISSION, 2023

Recommended citation: Fonseca, J., & Bacao, F. (2023). Geometric SMOTE for Imbalanced Datasets with Nominal and Continuous Features. Under Submission.

There are different approaches to address imbalanced learning. Among them, artificial data generation is more general than algorithmic modifications or cost-sensitive solutions. Since the proposal of the Synthetic Minority Oversampling TEchnique (SMOTE), various SMOTE variants and neural network-based oversampling methods have been developed. However, the options to oversample datasets with both nominal and continuous features are limited. In this paper, we propose Geometric SMOTE for Nominal and Continuous features (G-SMOTENC), based on a combination of G-SMOTE and SMOTENC. Our method uses SMOTENC’s encoding and generation mechanism for nominal features, while using G-SMOTE’s data selection mechanism to determine the center observation and its k-nearest neighbors, and G-SMOTE’s generation mechanism for continuous features. G-SMOTENC’s performance is compared against SMOTENC’s along with two other baseline methods, a state-of-the-art oversampling method and no oversampling. The experiment was performed over 20 datasets with varying imbalance ratios, numbers of metric and non-metric features, and numbers of target classes. We found a significant improvement in the quality of the generated data when using G-SMOTENC as the oversampling method. An open-source implementation of G-SMOTENC is made available in the Python programming language.
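The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it simplifies both parent methods. It assumes the center observation is picked uniformly at random from the minority class, that continuous values are drawn uniformly from a hypersphere around the center (leaving out G-SMOTE's selection strategy, truncation, and deformation options), and that nominal values follow the SMOTENC convention of taking the most frequent category among the k nearest neighbors, with the median standard deviation of the continuous features used as the mismatch penalty in the distance. The names `mixed_distance` and `generate_sample` and the toy data are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import median, pstdev


def mixed_distance(a, b, cont_idx, nom_idx, penalty):
    """Euclidean distance in which each nominal mismatch contributes `penalty`."""
    sq = sum((a[i] - b[i]) ** 2 for i in cont_idx)
    sq += sum(penalty ** 2 for i in nom_idx if a[i] != b[i])
    return math.sqrt(sq)


def generate_sample(minority, cont_idx, nom_idx, k=3, rng=None):
    """Generate one synthetic minority observation (simplified sketch)."""
    rng = rng or random.Random()

    # SMOTENC-style penalty: median of the continuous features' standard deviations.
    penalty = median([pstdev([row[i] for row in minority]) for i in cont_idx])

    # Assumption: the center is chosen uniformly at random from the minority class.
    c = rng.randrange(len(minority))
    center = minority[c]
    others = [row for j, row in enumerate(minority) if j != c]
    neighbors = sorted(
        others,
        key=lambda r: mixed_distance(center, r, cont_idx, nom_idx, penalty),
    )[:k]

    # Radius: continuous-space distance from the center to one selected neighbor.
    target = rng.choice(neighbors)
    radius = math.sqrt(sum((center[i] - target[i]) ** 2 for i in cont_idx))

    # Uniform point in a hypersphere: random direction scaled by radius * u^(1/d).
    d = len(cont_idx)
    direction = [rng.gauss(0, 1) for _ in cont_idx]
    norm = math.sqrt(sum(x * x for x in direction)) or 1.0
    scale = radius * rng.random() ** (1 / d) / norm

    new = list(center)  # copy, so the original observation is left untouched
    for j, i in enumerate(cont_idx):
        new[i] = center[i] + direction[j] * scale

    # Nominal features: most frequent category among the k nearest neighbors.
    for i in nom_idx:
        new[i] = Counter(row[i] for row in neighbors).most_common(1)[0][0]
    return new


# Toy minority class: two continuous features and one nominal feature.
minority = [
    [1.0, 2.0, "red"],
    [1.5, 1.8, "red"],
    [0.8, 2.4, "blue"],
    [1.2, 2.1, "red"],
    [1.1, 2.6, "blue"],
]
sample = generate_sample(minority, cont_idx=[0, 1], nom_idx=[2], k=3,
                         rng=random.Random(0))
print(sample)
```

In this sketch the synthetic point may fall anywhere inside the hypersphere around the center, rather than only on the line segment between the center and a neighbor as in SMOTE, which is the geometric idea that G-SMOTE contributes.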