In a class of its own: dealing with class imbalances using caret package

Hanjo
14 November

Knowing what class imbalances are

Having to deal with real life…

When modeling discrete classes, the relative frequencies of the classes can have a significant impact on the effectiveness of the model.

Imbalance can be present in any data set or application, and hence, the practitioner should be aware of the implications of modeling this type of data and possible remedies to counter

plot of chunk unnamed-chunk-2

Beware the accuracy paradox!

The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy.

It may be better to avoid the accuracy metric in favor of other metrics such as Sensitivity and Specificity or Kappa.

plot of chunk unnamed-chunk-3

Examples of where class imbalance is prevalent

Here are a few practical settings where class imbalance often occurs:

Online advertising: The click through rate is the number of times an ad was clicked on divided by the total number of impressions and tends to be very low (2.4%)
Medical research: Analysis on benign vs malignant samples. Important to focus on Specificity (true positive rate)
Insurance claims: Fraud detection within a large claim dataset

Combating class imbalance

Can you collect more data?
Try changing your performance metric - classify performance based on specificity or kappa
Incorporate different algorithms
Penalized models can be used in applying a different cost-function

Combating class imbalance

Can you collect more data?
Try Changing Your Performance Metric - classify performance based on specificity or kappa
Incorporate different algorithms
Penalized models can be used in applying a different cost-function

Use resampling to adjust your dataset
Generate synthetic samples

Getting our hands dirty

I use a dataset obtained from: http://sci2s.ugr.es/keel/imbalanced.php#sub60

openxlsx::readWorkbook("Data/vino.xlsx") %>% str

'data.frame':   691 obs. of  12 variables:
 $ FixedAcidity      : num  7.4 7.8 7.8 7.4 7.4 7.9 7.5 6.7 7.5 5.6 ...
 $ VolatileAcidity   : num  0.7 0.88 0.76 0.7 0.66 0.6 0.5 0.58 0.5 0.615 ...
 $ CitricAcid        : num  0 0 0.04 0 0 0.06 0.36 0.08 0.36 0 ...
 $ ResidualSugar     : num  1.9 2.6 2.3 1.9 1.8 1.6 6.1 1.8 6.1 1.6 ...
 $ Chlorides         : num  0.076 0.098 0.092 0.076 0.075 0.069 0.071 0.097 0.071 0.089 ...
 $ FreeSulfurDioxide : num  11 25 15 11 13 15 17 15 17 16 ...
 $ TotalSulfurDioxide: num  34 67 54 34 40 59 102 65 102 59 ...
 $ Density           : num  0.998 0.997 0.997 0.998 0.998 ...
 $ PH                : num  3.51 3.2 3.26 3.51 3.51 3.3 3.35 3.28 3.35 3.58 ...
 $ Sulphates         : num  0.56 0.68 0.65 0.56 0.56 0.46 0.8 0.54 0.8 0.52 ...
 $ Alcohol           : num  9.4 9.8 9.8 9.4 9.4 9.4 10.5 9.2 10.5 9.9 ...
 $ Class             : chr  "negative" "negative" "negative" "negative" ...

What is the severity of the problem?

vino %>% 
  count(Class) %>% 
  mutate(perc = n/sum(n))

# A tibble: 2 x 3
     Class     n       perc
     <chr> <int>      <dbl>
1 negative   681 0.98552822
2 positive    10 0.01447178

Getting our hands dirty

I use a dataset obtained from: http://sci2s.ugr.es/keel/imbalanced.php#sub60

openxlsx::readWorkbook("Data/vino.xlsx") %>% str

'data.frame':   691 obs. of  12 variables:
 $ FixedAcidity      : num  7.4 7.8 7.8 7.4 7.4 7.9 7.5 6.7 7.5 5.6 ...
 $ VolatileAcidity   : num  0.7 0.88 0.76 0.7 0.66 0.6 0.5 0.58 0.5 0.615 ...
 $ CitricAcid        : num  0 0 0.04 0 0 0.06 0.36 0.08 0.36 0 ...
 $ ResidualSugar     : num  1.9 2.6 2.3 1.9 1.8 1.6 6.1 1.8 6.1 1.6 ...
 $ Chlorides         : num  0.076 0.098 0.092 0.076 0.075 0.069 0.071 0.097 0.071 0.089 ...
 $ FreeSulfurDioxide : num  11 25 15 11 13 15 17 15 17 16 ...
 $ TotalSulfurDioxide: num  34 67 54 34 40 59 102 65 102 59 ...
 $ Density           : num  0.998 0.997 0.997 0.998 0.998 ...
 $ PH                : num  3.51 3.2 3.26 3.51 3.51 3.3 3.35 3.28 3.35 3.58 ...
 $ Sulphates         : num  0.56 0.68 0.65 0.56 0.56 0.46 0.8 0.54 0.8 0.52 ...
 $ Alcohol           : num  9.4 9.8 9.8 9.4 9.4 9.4 10.5 9.2 10.5 9.9 ...
 $ Class             : chr  "negative" "negative" "negative" "negative" ...

What is the severity of the problem?

vino %>% 
  count(Class) %>% 
  mutate(perc = n/sum(n))

# A tibble: 2 x 3
     Class     n       perc
     <chr> <int>      <dbl>
1 negative   681 0.98552822
2 positive    10 0.01447178

Benchmark model

I pull out all the stops in my model:

Repeated CV
Scale and center variables

set.seed(42)
index <- createDataPartition(vino$Class, p = 0.7, list = FALSE)
train_data <- vino[index, ]
test_data  <- vino[-index, ]

ctrl <- trainControl(method = "repeatedcv", 
                     number = 5, 
                     repeats = 10, 
                     verboseIter = FALSE, 
                     summaryFunction = twoClassSummary, 
                     classProbs = TRUE)

model_rf <- train(Class ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"), 
                         metric = "Spec",
                         trControl = ctrl)

pred_rf <- predict(model_rf, test_data)

caret::confusionMatrix(pred_rf, test_data$Class)

Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative      204        3
  positive        0        0

               Accuracy : 0.9855         
                 95% CI : (0.9582, 0.997)
    No Information Rate : 0.9855         
    P-Value [Acc > NIR] : 0.6472         

                  Kappa : 0              
 Mcnemar's Test P-Value : 0.2482         

            Sensitivity : 1.0000         
            Specificity : 0.0000         
         Pos Pred Value : 0.9855         
         Neg Pred Value :    NaN         
             Prevalence : 0.9855         
         Detection Rate : 0.9855         
   Detection Prevalence : 1.0000         
      Balanced Accuracy : 0.5000         

       'Positive' Class : negative

Under-sampling model

We are very lucky that in R, we have a well engineered package such as caret.

By simply adding a sampling parameter to the trainControl, it handles the sample for you in the background

ctrl <- trainControl(method = "repeatedcv", 
                     number = 5, 
                     repeats = 10, 
                     sampling = "down",
                     verboseIter = FALSE, 
                     savePredictions = T, 
                     summaryFunction = twoClassSummary, 
                     classProbs = TRUE)

model_rf_under <- train(Class ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         metric = "Spec",
                         trControl = ctrl)

pred_rf <- predict(model_rf_under, test_data)

caret::confusionMatrix(pred_rf, test_data$Class)

Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative      136        3
  positive       68        0

               Accuracy : 0.657          
                 95% CI : (0.588, 0.7214)
    No Information Rate : 0.9855         
    P-Value [Acc > NIR] : 1              

                  Kappa : -0.0286        
 Mcnemar's Test P-Value : 3.068e-14      

            Sensitivity : 0.6667         
            Specificity : 0.0000         
         Pos Pred Value : 0.9784         
         Neg Pred Value : 0.0000         
             Prevalence : 0.9855         
         Detection Rate : 0.6570         
   Detection Prevalence : 0.6715         
      Balanced Accuracy : 0.3333         

       'Positive' Class : negative

Over-sampling model

ctrl <- trainControl(method = "repeatedcv", 
                     number = 5, 
                     repeats = 10, 
                     sampling = "up",
                     verboseIter = FALSE,
                     savePredictions = T, 
                     summaryFunction = twoClassSummary, 
                     classProbs = TRUE)

model_rf_up <- train(Class ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         metric = "Spec",
                         trControl = ctrl)

pred_rf <- predict(model_rf_up, test_data)

caret::confusionMatrix(pred_rf, test_data$Class)

Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative      204        3
  positive        0        0

               Accuracy : 0.9855         
                 95% CI : (0.9582, 0.997)
    No Information Rate : 0.9855         
    P-Value [Acc > NIR] : 0.6472         

                  Kappa : 0              
 Mcnemar's Test P-Value : 0.2482         

            Sensitivity : 1.0000         
            Specificity : 0.0000         
         Pos Pred Value : 0.9855         
         Neg Pred Value :    NaN         
             Prevalence : 0.9855         
         Detection Rate : 0.9855         
   Detection Prevalence : 1.0000         
      Balanced Accuracy : 0.5000         

       'Positive' Class : negative

Hybrid-sampling model (ROSE)

ctrl <- trainControl(method = "repeatedcv", 
                     number = 5, 
                     repeats = 10, 
                     sampling = "rose",
                     verboseIter = FALSE,
                     savePredictions = T, 
                     summaryFunction = twoClassSummary, 
                     classProbs = TRUE)

model_rf_rose <- train(Class ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         metric = "Spec",
                         trControl = ctrl)

pred_rf <- predict(model_rf_rose, test_data)

caret::confusionMatrix(pred_rf, test_data$Class)

Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative      176        1
  positive       28        2

               Accuracy : 0.8599         
                 95% CI : (0.805, 0.9041)
    No Information Rate : 0.9855         
    P-Value [Acc > NIR] : 1              

                  Kappa : 0.0974         
 Mcnemar's Test P-Value : 1.379e-06      

            Sensitivity : 0.86275        
            Specificity : 0.66667        
         Pos Pred Value : 0.99435        
         Neg Pred Value : 0.06667        
             Prevalence : 0.98551        
         Detection Rate : 0.85024        
   Detection Prevalence : 0.85507        
      Balanced Accuracy : 0.76471        

       'Positive' Class : negative

Hybrid-sampling model (SMOTE)

ctrl <- trainControl(method = "repeatedcv", 
                     number = 5, 
                     repeats = 10, 
                     sampling = "smote",
                     verboseIter = FALSE,
                     savePredictions = T, 
                     summaryFunction = twoClassSummary, 
                     classProbs = TRUE)

model_rf_smote <- train(Class ~ .,
                         data = train_data,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         metric = "Spec",
                         trControl = ctrl)

pred_rf <- predict(model_rf_smote, test_data)

caret::confusionMatrix(pred_rf, test_data$Class)

Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative      181        1
  positive       23        2

               Accuracy : 0.8841          
                 95% CI : (0.8324, 0.9243)
    No Information Rate : 0.9855          
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.1201          
 Mcnemar's Test P-Value : 1.814e-05       

            Sensitivity : 0.8873          
            Specificity : 0.6667          
         Pos Pred Value : 0.9945          
         Neg Pred Value : 0.0800          
             Prevalence : 0.9855          
         Detection Rate : 0.8744          
   Detection Prevalence : 0.8792          
      Balanced Accuracy : 0.7770          

       'Positive' Class : negative

Sampling model (samples)

So what happenend in the background?

** Up/Down-sampling**

vino_down <- downSample(train_data, factor(train_data$Class))

table(vino_down$Class)


negative positive 
       7        7

vino_up <- upSample(train_data, factor(train_data$Class))

table(vino_up$Class)


negative positive 
     477      477

** Hybrid **

train_data$Class <- as.factor(train_data$Class)
vino_rose <- ROSE(Class ~., train_data)$data
table(vino_rose$Class)


negative positive 
     239      245

train_data$Class <- as.factor(train_data$Class)
vino_smote <- SMOTE(Class ~., train_data, perc.over = 200, perc.under=100)
table(vino_smote$Class)


negative positive 
      14       21

Checking result

models <- list(original = model_rf,
                       under = model_rf_under,
                       over = model_rf_up,
                       smote = model_rf_rose,
                       rose = model_rf_smote)

resampling <- resamples(models)
bw_plot <- bwplot(resampling)

bw_plot

plot of chunk unnamed-chunk-23

Conclusion

Some of you might be wondering by now:

Conclusion

and the answer is no… I have just spent a lot of time reading the caret vignette.

I have seen REAL improvement in some of the algorithms we run by just being aware of this problem
The implementation is not difficult in R (and I guess the same in python)
Easy way to improve the outcome of your models, without too much of a hassle

Things to explore

I have only touched on different methods of sampling and one should obviously explore Pros and Cons.

One thing I am quite interested in, is parameter optimisation simultaneously for the ML algo and sampling technique
- Smote has 2 parameters you can adjust
- So this should be easy to do in a heatmap kind of way