In a class of its own: dealing with class imbalances using the caret package

Hanjo
14 November

Knowing what class imbalances are

Having to deal with real life…

When modeling discrete classes, the relative frequencies of the classes can have a significant impact on the effectiveness of the model.

Imbalance can be present in any data set or application, so the practitioner should be aware of the implications of modeling this type of data and of the possible remedies to counter it.

Beware the accuracy paradox!

The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy.

It may be better to avoid the accuracy metric in favor of other metrics such as Sensitivity and Specificity or Kappa.
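
To make the paradox concrete, here is a minimal base-R sketch. The confusion-matrix counts (980 negatives, 20 positives) and the helper name `metrics` are made up for illustration and do not come from any dataset in this post. It compares a model that always predicts the majority class with one that actually finds some positives.

```r
# Accuracy, sensitivity, specificity and Cohen's kappa computed from the
# four cells of a 2x2 confusion matrix.
# `metrics` is a hypothetical helper; all counts below are illustrative.
metrics <- function(tp, fn, fp, tn) {
  n  <- tp + fn + fp + tn
  po <- (tp + tn) / n                                            # observed agreement
  pe <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2    # chance agreement
  c(accuracy    = po,
    sensitivity = tp / (tp + fn),   # true positive rate
    specificity = tn / (tn + fp),   # true negative rate
    kappa       = (po - pe) / (1 - pe))
}

# Model A: always predicts "negative"
always_negative <- metrics(tp = 0, fn = 20, fp = 0, tn = 980)

# Model B: finds 15 of the 20 positives at the cost of 60 false alarms
some_skill <- metrics(tp = 15, fn = 5, fp = 60, tn = 920)

print(round(rbind(always_negative, some_skill), 3))
```

Model A has the higher accuracy, yet its sensitivity and kappa are both zero. Model B has lower accuracy but is the only one of the two that detects any positive case.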

Examples of where class imbalance is prevalent

Here are a few practical settings where class imbalance often occurs:

  • Online advertising: The click-through rate, the number of times an ad was clicked divided by the total number of impressions, tends to be very low (2.4%)
  • Medical research: Analysis of benign vs malignant samples. Important to focus on Sensitivity (true positive rate)
  • Insurance claims: Fraud detection within a large claim dataset

Combating class imbalance

  • Can you collect more data?
  • Try changing your performance metric - evaluate performance based on sensitivity, specificity or kappa
  • Incorporate different algorithms
  • Penalized models can be used to apply a different cost function
  • Use resampling to adjust your dataset
  • Generate synthetic samples
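
The resampling idea from the list above can be sketched in a few lines of base R. This is only an illustration on a made-up data frame (`toy`, with 95 negatives and 5 positives), not the approach used on the real data. The caret package offers `downSample()` and `upSample()` helpers for the same purpose.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical imbalanced data: 95 negatives, 5 positives
toy <- data.frame(
  x     = rnorm(100),
  Class = factor(rep(c("negative", "positive"), times = c(95, 5)))
)

# Row indices of each class, and the smallest / largest class size
idx   <- split(seq_len(nrow(toy)), toy$Class)
n_min <- min(lengths(idx))
n_max <- max(lengths(idx))

# Down-sampling: keep n_min randomly chosen rows from every class
down_idx <- unlist(lapply(idx, function(i) {
  i[sample.int(length(i), n_min)]
}), use.names = FALSE)
down <- toy[down_idx, ]

# Up-sampling: keep every row and top up smaller classes by
# drawing extra rows with replacement until they reach n_max
up_idx <- unlist(lapply(idx, function(i) {
  c(i, i[sample.int(length(i), n_max - length(i), replace = TRUE)])
}), use.names = FALSE)
up <- toy[up_idx, ]

print(table(down$Class))
print(table(up$Class))
```

Down-sampling throws away most of the majority class, while up-sampling repeats minority rows. In either case the resampling should be applied to the training data only, so that the test set keeps the original class distribution.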

Getting our hands dirty

I use a dataset obtained from: http://sci2s.ugr.es/keel/imbalanced.php#sub60

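library(dplyr)  # provides %>%, count() and mutate()

# Read the wine data, keep it as `vino` for later steps, then inspect its structure
vino <- openxlsx::readWorkbook("Data/vino.xlsx")
vino %>% str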
'data.frame':   691 obs. of  12 variables:
 $ FixedAcidity      : num  7.4 7.8 7.8 7.4 7.4 7.9 7.5 6.7 7.5 5.6 ...
 $ VolatileAcidity   : num  0.7 0.88 0.76 0.7 0.66 0.6 0.5 0.58 0.5 0.615 ...
 $ CitricAcid        : num  0 0 0.04 0 0 0.06 0.36 0.08 0.36 0 ...
 $ ResidualSugar     : num  1.9 2.6 2.3 1.9 1.8 1.6 6.1 1.8 6.1 1.6 ...
 $ Chlorides         : num  0.076 0.098 0.092 0.076 0.075 0.069 0.071 0.097 0.071 0.089 ...
 $ FreeSulfurDioxide : num  11 25 15 11 13 15 17 15 17 16 ...
 $ TotalSulfurDioxide: num  34 67 54 34 40 59 102 65 102 59 ...
 $ Density           : num  0.998 0.997 0.997 0.998 0.998 ...
 $ PH                : num  3.51 3.2 3.26 3.51 3.51 3.3 3.35 3.28 3.35 3.58 ...
 $ Sulphates         : num  0.56 0.68 0.65 0.56 0.56 0.46 0.8 0.54 0.8 0.52 ...
 $ Alcohol           : num  9.4 9.8 9.8 9.4 9.4 9.4 10.5 9.2 10.5 9.9 ...
 $ Class             : chr  "negative" "negative" "negative" "negative" ...

What is the severity of the problem?

vino %>% 
  count(Class) %>% 
  mutate(perc = n/sum(n))
# A tibble: 2 x 3
     Class     n       perc
     <chr> <int>      <dbl>
1 negative   681 0.98552822
2 positive    10 0.01447178
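
As a rough way to put these counts in perspective, the short sketch below takes the two class counts from the output above and derives the accuracy of a model that always predicts "negative", plus the number of negatives per positive. The variable names are illustrative.

```r
# Class counts copied from the count(Class) output
class_counts <- c(negative = 681, positive = 10)

# Accuracy of a model that always predicts the majority class
baseline_accuracy <- max(class_counts) / sum(class_counts)

# Imbalance ratio: majority observations per minority observation
imbalance_ratio <- max(class_counts) / min(class_counts)

print(round(c(baseline_accuracy = baseline_accuracy,
              imbalance_ratio   = imbalance_ratio), 4))
```

Any model fitted to this data has to beat that baseline on metrics other than accuracy before it can be considered useful.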
