Hanjo
14 November
Having to deal with real life…
When modeling discrete classes, the relative frequencies of the classes can have a significant impact on the effectiveness of the model.
Imbalance can be present in any data set or application, and hence, the practitioner should be aware of the implications of modeling this type of data and possible remedies to counter
The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy.
It may be better to avoid the accuracy metric in favor of other metrics such as Sensitivity and Specificity or Kappa.
Here are a few practical settings where class imbalance often occurs:
I use a dataset obtained from: http://sci2s.ugr.es/keel/imbalanced.php#sub60
openxlsx::readWorkbook("Data/vino.xlsx") %>% str
'data.frame': 691 obs. of 12 variables:
$ FixedAcidity : num 7.4 7.8 7.8 7.4 7.4 7.9 7.5 6.7 7.5 5.6 ...
$ VolatileAcidity : num 0.7 0.88 0.76 0.7 0.66 0.6 0.5 0.58 0.5 0.615 ...
$ CitricAcid : num 0 0 0.04 0 0 0.06 0.36 0.08 0.36 0 ...
$ ResidualSugar : num 1.9 2.6 2.3 1.9 1.8 1.6 6.1 1.8 6.1 1.6 ...
$ Chlorides : num 0.076 0.098 0.092 0.076 0.075 0.069 0.071 0.097 0.071 0.089 ...
$ FreeSulfurDioxide : num 11 25 15 11 13 15 17 15 17 16 ...
$ TotalSulfurDioxide: num 34 67 54 34 40 59 102 65 102 59 ...
$ Density : num 0.998 0.997 0.997 0.998 0.998 ...
$ PH : num 3.51 3.2 3.26 3.51 3.51 3.3 3.35 3.28 3.35 3.58 ...
$ Sulphates : num 0.56 0.68 0.65 0.56 0.56 0.46 0.8 0.54 0.8 0.52 ...
$ Alcohol : num 9.4 9.8 9.8 9.4 9.4 9.4 10.5 9.2 10.5 9.9 ...
$ Class : chr "negative" "negative" "negative" "negative" ...
What is the severity of the problem?
vino %>%
count(Class) %>%
mutate(perc = n/sum(n))
# A tibble: 2 x 3
Class n perc
<chr> <int> <dbl>
1 negative 681 0.98552822
2 positive 10 0.01447178
I use a dataset obtained from: http://sci2s.ugr.es/keel/imbalanced.php#sub60
openxlsx::readWorkbook("Data/vino.xlsx") %>% str
'data.frame': 691 obs. of 12 variables:
$ FixedAcidity : num 7.4 7.8 7.8 7.4 7.4 7.9 7.5 6.7 7.5 5.6 ...
$ VolatileAcidity : num 0.7 0.88 0.76 0.7 0.66 0.6 0.5 0.58 0.5 0.615 ...
$ CitricAcid : num 0 0 0.04 0 0 0.06 0.36 0.08 0.36 0 ...
$ ResidualSugar : num 1.9 2.6 2.3 1.9 1.8 1.6 6.1 1.8 6.1 1.6 ...
$ Chlorides : num 0.076 0.098 0.092 0.076 0.075 0.069 0.071 0.097 0.071 0.089 ...
$ FreeSulfurDioxide : num 11 25 15 11 13 15 17 15 17 16 ...
$ TotalSulfurDioxide: num 34 67 54 34 40 59 102 65 102 59 ...
$ Density : num 0.998 0.997 0.997 0.998 0.998 ...
$ PH : num 3.51 3.2 3.26 3.51 3.51 3.3 3.35 3.28 3.35 3.58 ...
$ Sulphates : num 0.56 0.68 0.65 0.56 0.56 0.46 0.8 0.54 0.8 0.52 ...
$ Alcohol : num 9.4 9.8 9.8 9.4 9.4 9.4 10.5 9.2 10.5 9.9 ...
$ Class : chr "negative" "negative" "negative" "negative" ...
What is the severity of the problem?
vino %>%
count(Class) %>%
mutate(perc = n/sum(n))
# A tibble: 2 x 3
Class n perc
<chr> <int> <dbl>
1 negative 681 0.98552822
2 positive 10 0.01447178
I pull out all the stops in my model:
set.seed(42)
index <- createDataPartition(vino$Class, p = 0.7, list = FALSE)
train_data <- vino[index, ]
test_data <- vino[-index, ]
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 10,
verboseIter = FALSE,
summaryFunction = twoClassSummary,
classProbs = TRUE)
model_rf <- train(Class ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
metric = "Spec",
trControl = ctrl)
pred_rf <- predict(model_rf, test_data)
caret::confusionMatrix(pred_rf, test_data$Class)
Confusion Matrix and Statistics
Reference
Prediction negative positive
negative 204 3
positive 0 0
Accuracy : 0.9855
95% CI : (0.9582, 0.997)
No Information Rate : 0.9855
P-Value [Acc > NIR] : 0.6472
Kappa : 0
Mcnemar's Test P-Value : 0.2482
Sensitivity : 1.0000
Specificity : 0.0000
Pos Pred Value : 0.9855
Neg Pred Value : NaN
Prevalence : 0.9855
Detection Rate : 0.9855
Detection Prevalence : 1.0000
Balanced Accuracy : 0.5000
'Positive' Class : negative
We are very lucky that in R
, we have a well engineered package such as caret
.
By simply adding a sampling parameter to the trainControl
, it handles the sample for you in the background
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 10,
sampling = "down",
verboseIter = FALSE,
savePredictions = T,
summaryFunction = twoClassSummary,
classProbs = TRUE)
model_rf_under <- train(Class ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
metric = "Spec",
trControl = ctrl)
pred_rf <- predict(model_rf_under, test_data)
caret::confusionMatrix(pred_rf, test_data$Class)
Confusion Matrix and Statistics
Reference
Prediction negative positive
negative 136 3
positive 68 0
Accuracy : 0.657
95% CI : (0.588, 0.7214)
No Information Rate : 0.9855
P-Value [Acc > NIR] : 1
Kappa : -0.0286
Mcnemar's Test P-Value : 3.068e-14
Sensitivity : 0.6667
Specificity : 0.0000
Pos Pred Value : 0.9784
Neg Pred Value : 0.0000
Prevalence : 0.9855
Detection Rate : 0.6570
Detection Prevalence : 0.6715
Balanced Accuracy : 0.3333
'Positive' Class : negative
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 10,
sampling = "up",
verboseIter = FALSE,
savePredictions = T,
summaryFunction = twoClassSummary,
classProbs = TRUE)
model_rf_up <- train(Class ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
metric = "Spec",
trControl = ctrl)
pred_rf <- predict(model_rf_up, test_data)
caret::confusionMatrix(pred_rf, test_data$Class)
Confusion Matrix and Statistics
Reference
Prediction negative positive
negative 204 3
positive 0 0
Accuracy : 0.9855
95% CI : (0.9582, 0.997)
No Information Rate : 0.9855
P-Value [Acc > NIR] : 0.6472
Kappa : 0
Mcnemar's Test P-Value : 0.2482
Sensitivity : 1.0000
Specificity : 0.0000
Pos Pred Value : 0.9855
Neg Pred Value : NaN
Prevalence : 0.9855
Detection Rate : 0.9855
Detection Prevalence : 1.0000
Balanced Accuracy : 0.5000
'Positive' Class : negative
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 10,
sampling = "rose",
verboseIter = FALSE,
savePredictions = T,
summaryFunction = twoClassSummary,
classProbs = TRUE)
model_rf_rose <- train(Class ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
metric = "Spec",
trControl = ctrl)
pred_rf <- predict(model_rf_rose, test_data)
caret::confusionMatrix(pred_rf, test_data$Class)
Confusion Matrix and Statistics
Reference
Prediction negative positive
negative 176 1
positive 28 2
Accuracy : 0.8599
95% CI : (0.805, 0.9041)
No Information Rate : 0.9855
P-Value [Acc > NIR] : 1
Kappa : 0.0974
Mcnemar's Test P-Value : 1.379e-06
Sensitivity : 0.86275
Specificity : 0.66667
Pos Pred Value : 0.99435
Neg Pred Value : 0.06667
Prevalence : 0.98551
Detection Rate : 0.85024
Detection Prevalence : 0.85507
Balanced Accuracy : 0.76471
'Positive' Class : negative
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 10,
sampling = "smote",
verboseIter = FALSE,
savePredictions = T,
summaryFunction = twoClassSummary,
classProbs = TRUE)
model_rf_smote <- train(Class ~ .,
data = train_data,
method = "rf",
preProcess = c("scale", "center"),
metric = "Spec",
trControl = ctrl)
pred_rf <- predict(model_rf_smote, test_data)
caret::confusionMatrix(pred_rf, test_data$Class)
Confusion Matrix and Statistics
Reference
Prediction negative positive
negative 181 1
positive 23 2
Accuracy : 0.8841
95% CI : (0.8324, 0.9243)
No Information Rate : 0.9855
P-Value [Acc > NIR] : 1
Kappa : 0.1201
Mcnemar's Test P-Value : 1.814e-05
Sensitivity : 0.8873
Specificity : 0.6667
Pos Pred Value : 0.9945
Neg Pred Value : 0.0800
Prevalence : 0.9855
Detection Rate : 0.8744
Detection Prevalence : 0.8792
Balanced Accuracy : 0.7770
'Positive' Class : negative
So what happenend in the background?
** Up/Down-sampling**
vino_down <- downSample(train_data, factor(train_data$Class))
table(vino_down$Class)
negative positive
7 7
vino_up <- upSample(train_data, factor(train_data$Class))
table(vino_up$Class)
negative positive
477 477
** Hybrid **
train_data$Class <- as.factor(train_data$Class)
vino_rose <- ROSE(Class ~., train_data)$data
table(vino_rose$Class)
negative positive
239 245
train_data$Class <- as.factor(train_data$Class)
vino_smote <- SMOTE(Class ~., train_data, perc.over = 200, perc.under=100)
table(vino_smote$Class)
negative positive
14 21
models <- list(original = model_rf,
under = model_rf_under,
over = model_rf_up,
smote = model_rf_rose,
rose = model_rf_smote)
resampling <- resamples(models)
bw_plot <- bwplot(resampling)
bw_plot
Some of you might be wondering by now:
and the answer is no… I have just spent a lot of time reading the caret
vignette.
R
(and I guess the same in python
)I have only touched on different methods of sampling and one should obviously explore Pros and Cons.