Often, the modeled phenomenon is rare; for example, on average 0.8% of all payments in France are considered fraudulent. In such cases, the problem is said to have a strong class imbalance.
Consequently, when the hold-out or cross-validation method is used to evaluate how well the model generalizes, random partitions into training and test sets can end up with different proportions of data points per class.
To reduce the influence of sampling variation and class imbalance, various approaches exist. Below, we mention three: stratified sampling, undersampling, and oversampling.
|Stratified sampling||One approach is to partition the data for hold-out and cross-validation so that each class is represented in stable proportions in every training/test split. This is called stratified sampling. Its benefit is that the variation due to differing class proportions across partitions is minimized.|
|Undersampling||Another approach is to reduce the total size of the learning problem by undersampling the most prevalent class. However, this changes the class priors, i.e., the probability of each class. Therefore, to obtain predicted values on the same scale as the analyzed population, the final estimates of the model trained on the subsample must be adjusted.|
|Oversampling||Finally, the counterpart of undersampling is oversampling. Here, data points are drawn such that the same point can be picked more than once; this is called sampling with replacement. One instance is bootstrapping, whose main use is to estimate the sampling distribution of almost any (performance) statistic, which is needed to estimate a confidence interval.|
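
The three approaches in the table can be illustrated with a minimal Python sketch that uses only the standard library. All function names and parameters here (`stratified_split`, `undersample_majority`, `correct_prior`, `bootstrap_ci`, `keep_rate`) are hypothetical and chosen for illustration; they are not taken from any particular library. The prior correction assumes that only the majority (negative) class was thinned, each point being kept with a known probability `keep_rate`, and the confidence interval uses the simple percentile bootstrap, which is one of several possible variants.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, rng):
    """Split indices so that each class keeps (roughly) the same share in both parts."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def undersample_majority(labels, keep_rate, rng, majority=0):
    """Keep every minority point; keep each majority point with probability keep_rate."""
    return [i for i, y in enumerate(labels)
            if y != majority or rng.random() < keep_rate]


def correct_prior(p_sampled, keep_rate):
    """Map a probability estimated on undersampled data back to the original scale.

    Thinning the negatives by keep_rate divides the number of negatives by
    1 / keep_rate, so the original odds are keep_rate times the sampled odds.
    """
    return keep_rate * p_sampled / (keep_rate * p_sampled + 1.0 - p_sampled)


def bootstrap_ci(values, statistic, n_resamples, alpha, rng):
    """Percentile bootstrap interval: resample with replacement, then take quantiles."""
    n = len(values)
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[max(0, round(alpha / 2 * n_resamples))]
    upper = estimates[min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)]
    return lower, upper


def mean(xs):
    return sum(xs) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [1] * 8 + [0] * 992  # 0.8% positives, echoing the example in the text

    train_idx, test_idx = stratified_split(labels, 0.25, rng)
    # Both parts keep the 0.8% rate: 6 of 750 and 2 of 250 points are positive.
    print(sum(labels[i] for i in train_idx), sum(labels[i] for i in test_idx))

    kept = undersample_majority(labels, keep_rate=0.05, rng=rng)
    print(len(kept), "points kept after undersampling")

    # A score of 0.5 on the undersampled data is far smaller on the original scale.
    print(correct_prior(0.5, keep_rate=0.05))

    # Uncertainty of the positive rate, estimated by resampling with replacement.
    print(bootstrap_ci(labels, mean, n_resamples=1000, alpha=0.05, rng=rng))
```

Passing the random generator explicitly keeps each step reproducible. In practice, established libraries provide tested versions of these procedures, and the sketch is only meant to make the mechanics concrete.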