Hanjo
14 November
Having to deal with real life…
When modeling discrete classes, the relative frequencies of the classes can have a significant impact on the effectiveness of the model.
Imbalance can be present in any data set or application, so the practitioner should be aware of the implications of modeling this type of data and of possible remedies to counter it.
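The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. For example, a model that labels every observation as the majority class in a 99:1 data set reaches 99% accuracy yet never detects a single minority case.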
It may be better to avoid the accuracy metric in favor of other metrics such as Sensitivity and Specificity or Kappa.
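To make the paradox concrete, here is a minimal sketch in base R. It assumes the class counts reported in the class table later in this post (681 negative, 10 positive) and a hypothetical "model" that always predicts the majority class. I also assume that "Kappa" means Cohen's kappa. None of this is a fitted model; it only illustrates how the metrics diverge.

```r
# Class labels rebuilt from the counts in this post (681 negative, 10 positive)
actual <- factor(c(rep("negative", 681), rep("positive", 10)),
                 levels = c("negative", "positive"))

# Hypothetical classifier that always predicts the majority class
predicted <- factor(rep("negative", length(actual)), levels = levels(actual))

# Confusion matrix: rows are predictions, columns are the true classes
cm <- table(predicted = predicted, actual = actual)

tp <- cm["positive", "positive"]
tn <- cm["negative", "negative"]
fp <- cm["positive", "negative"]
fn <- cm["negative", "positive"]
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of true positives that are found
specificity <- tn / (tn + fp)   # share of true negatives that are found

# Cohen's kappa: observed agreement corrected for chance agreement
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = cohen_kappa), 4))
```

Accuracy comes out at roughly 0.986, while sensitivity is 0 and kappa is effectively 0: the model looks excellent on accuracy alone but has no ability to find the positive class.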
Here are a few practical settings where class imbalance often occurs:
I use a dataset obtained from: http://sci2s.ugr.es/keel/imbalanced.php#sub60
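library(dplyr)  # provides %>%, count() and mutate()
vino <- openxlsx::readWorkbook("Data/vino.xlsx")  # keep the data so later code can use `vino`
vino %>% str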
'data.frame': 691 obs. of 12 variables:
$ FixedAcidity : num 7.4 7.8 7.8 7.4 7.4 7.9 7.5 6.7 7.5 5.6 ...
$ VolatileAcidity : num 0.7 0.88 0.76 0.7 0.66 0.6 0.5 0.58 0.5 0.615 ...
$ CitricAcid : num 0 0 0.04 0 0 0.06 0.36 0.08 0.36 0 ...
$ ResidualSugar : num 1.9 2.6 2.3 1.9 1.8 1.6 6.1 1.8 6.1 1.6 ...
$ Chlorides : num 0.076 0.098 0.092 0.076 0.075 0.069 0.071 0.097 0.071 0.089 ...
$ FreeSulfurDioxide : num 11 25 15 11 13 15 17 15 17 16 ...
$ TotalSulfurDioxide: num 34 67 54 34 40 59 102 65 102 59 ...
$ Density : num 0.998 0.997 0.997 0.998 0.998 ...
$ PH : num 3.51 3.2 3.26 3.51 3.51 3.3 3.35 3.28 3.35 3.58 ...
$ Sulphates : num 0.56 0.68 0.65 0.56 0.56 0.46 0.8 0.54 0.8 0.52 ...
$ Alcohol : num 9.4 9.8 9.8 9.4 9.4 9.4 10.5 9.2 10.5 9.9 ...
$ Class : chr "negative" "negative" "negative" "negative" ...
How severe is the imbalance?
vino %>%
count(Class) %>%
mutate(perc = n/sum(n))
# A tibble: 2 x 3
  Class        n       perc
  <chr>    <int>      <dbl>
1 negative   681 0.98552822
2 positive    10 0.01447178
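One common way to summarise these counts is the imbalance ratio, the majority count divided by the minority count. The sketch below is base R only and hard-codes the two counts from the output above rather than reading the file, so the variable names are my own illustration.

```r
# Class counts copied from the output above
counts <- c(negative = 681, positive = 10)

# Share of each class, matching the `perc` column
props <- counts / sum(counts)

# Imbalance ratio: majority class count divided by minority class count
imbalance_ratio <- max(counts) / min(counts)

print(round(props, 4))
print(imbalance_ratio)
```

With about 68 negative wines for every positive one, a model can ignore the positive class entirely and still look accurate.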