`vignettes/multilabel_classification.Rmd`

In this vignette we walk through multilabel classification with `DALEX`. The purpose of this example is to show how to handle the `DALEX` functionalities for which binary classification is the default, and which therefore need some custom code to work here. All examples use the `HR` dataset that ships with `DALEX`; its target column, `status`, is a three-level factor. In all cases our model will be `ranger`.

```
#>   gender      age    hours evaluation salary   status
#> 1   male 32.58267 41.88626          3      1    fired
#> 2 female 41.21104 36.34339          2      5    fired
#> 3   male 37.70516 36.81718          3      0    fired
#> 4 female 30.06051 38.96032          3      2    fired
#> 5   male 21.10283 62.15464          5      3 promoted
#> 6   male 40.11812 69.53973          2      0    fired
```

Ok, now it is time to create a model.

library("ranger")
model_HR_ranger <- ranger(status ~ ., data = HR, probability = TRUE, num.trees = 50)
model_HR_ranger

```
#> Ranger result
#>
#> Call:
#> ranger(status ~ ., data = HR, probability = TRUE, num.trees = 50)
#>
#> Type: Probability estimation
#> Number of trees: 50
#> Sample size: 7847
#> Number of independent variables: 5
#> Mtry: 2
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.2181152
```

library("DALEX")
explain_HR_ranger <- explain(model_HR_ranger,
                             data = HR[, -6],
                             y = HR$status,
                             label = "Ranger Multilabel Classification",
                             colorize = FALSE)

```
#> Preparation of a new explainer is initiated
#> -> model label : Ranger Multilabel Classification
#> -> data : 7847 rows 5 cols
#> -> target variable : 7847 values
#> -> target variable : Please note that 'y' is a factor. ( WARNING )
#> -> target variable : Consider changing the 'y' to a logical or numerical vector.
#> -> target variable : Otherwise I will not be able to calculate residuals or loss function.
#> -> predict function : yhat.ranger will be used ( default )
#> -> predicted values : predict function returns multiple columns: 3 ( WARNING ) some of functionalities may not work
#> -> model_info : package ranger , ver. 0.12.1 , task multiclass ( default )
#> -> residual function : difference between 1 and probability of true class ( default )
#> -> residuals : numerical, min = 0 , mean = 0.2788 , max = 0.8829384
#> A new explainer has been created!
```

Of course, the sixth column, which we omitted when creating the explainer, is the target column (`status`), and it is good practice not to include it in `data`. Keep in mind that the default `yhat` function for `ranger`, and for any other package supported by `DALEX`, enforces probability output. Therefore residuals cannot be the standard \(y - \hat{y}\). Since `DALEX 1.2.2`, the default residual function for multiclass classification is one minus the predicted probability of the true class.
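To make the residual definition concrete, here is a minimal base R sketch of "one minus the probability of the true class". The probability matrix, the class labels, and the object names are made up for illustration; this is not the code `DALEX` runs internally.

```r
# Hypothetical predicted probabilities for three observations (rows sum to 1).
# Column names play the role of the class labels of the target.
probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6,
    0.2, 0.5, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("fired", "ok", "promoted"))
)

# Observed classes, with factor levels in the same order as the columns of `probs`.
y <- factor(c("fired", "ok", "promoted"), levels = colnames(probs))

# For each row, pick the probability assigned to the observed class
# by indexing the matrix with (row, column) pairs.
p_true <- probs[cbind(seq_along(y), as.integer(y))]

# Residual: one minus the probability of the true class.
residuals_multiclass <- 1 - p_true
residuals_multiclass
```

A residual of 0 means the model put all probability mass on the observed class, while values close to 1 mean the observed class was considered very unlikely.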

In order to use the `model_parts()` (formerly `variable_importance()`) function, it is necessary to switch the default `loss_function` argument to one that handles multiple classes. `DALEX` has one such function implemented, called `loss_cross_entropy()`. To use it, the `y` parameter passed to the `explain` function should have exactly the same format as the target vector used for training (i.e. the same number of levels and the same level names).

We also need probability outputs, so there is no need to change the default `predict_function` parameter.

library("DALEX")
explain_HR_ranger_new_y <- explain(model_HR_ranger,
                                   data = HR[, -6],
                                   y = HR$status,
                                   label = "Ranger Multilabel Classification",
                                   colorize = FALSE)

```
#> Preparation of a new explainer is initiated
#> -> model label : Ranger Multilabel Classification
#> -> data : 7847 rows 5 cols
#> -> target variable : 7847 values
#> -> target variable : Please note that 'y' is a factor. ( WARNING )
#> -> target variable : Consider changing the 'y' to a logical or numerical vector.
#> -> target variable : Otherwise I will not be able to calculate residuals or loss function.
#> -> predict function : yhat.ranger will be used ( default )
#> -> predicted values : predict function returns multiple columns: 3 ( WARNING ) some of functionalities may not work
#> -> model_info : package ranger , ver. 0.12.1 , task multiclass ( default )
#> -> residual function : difference between 1 and probability of true class ( default )
#> -> residuals : numerical, min = 0 , mean = 0.2788 , max = 0.8829384
#> A new explainer has been created!
```

And now we can use `model_parts()`:

mp <- model_parts(explain_HR_ranger_new_y, loss_function = loss_cross_entropy)
plot(mp)

The result is a perfectly fine variable importance plot.
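The idea behind a multiclass cross-entropy loss, and the reason the levels of `y` have to match the model's classes, can be sketched in base R as follows. The function `cross_entropy_sketch()` and its `eps` argument are hypothetical names used for illustration; `loss_cross_entropy()` in `DALEX` may differ in details such as how it clips probabilities or whether it sums or averages over observations.

```r
# Multiclass cross-entropy: average of -log(probability of the observed class).
# `eps` guards against log(0); its value here is an arbitrary assumption.
cross_entropy_sketch <- function(observed, predicted, eps = 1e-10) {
  # The factor levels of `observed` must match the columns of `predicted`,
  # otherwise probabilities would be looked up in the wrong column.
  stopifnot(identical(levels(observed), colnames(predicted)))
  p_true <- predicted[cbind(seq_along(observed), as.integer(observed))]
  -mean(log(pmax(p_true, eps)))
}

# Toy probabilities and observed classes (made up for illustration).
probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6,
    0.2, 0.5, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("fired", "ok", "promoted"))
)
y <- factor(c("fired", "ok", "promoted"), levels = colnames(probs))

loss_value <- cross_entropy_sketch(y, probs)
loss_value
```

If `y` were recoded with different level names or a different level order, the check at the top of the sketch would fail, which mirrors the requirement that `y` keeps the format of the training target.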

No tricks are needed to use `model_profile()` (formerly `variable_effect()`). Our target will be one-hot encoded, and all explanations will be computed for each class separately.

mp_p <- model_profile(explain_HR_ranger, variables = "salary", type = "partial")
mp_p$color <- "_label_"
plot(mp_p)

mp_a <- model_profile(explain_HR_ranger, variables = "salary", type = "accumulated")
mp_a$color <- "_label_"
plot(mp_a)
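As a reminder of what one-hot encoding of a factor target looks like, here is a small base R sketch. The toy factor and the object names are assumptions for illustration and say nothing about how `DALEX` performs the encoding internally.

```r
# Toy target with three levels (made up for illustration).
y <- factor(c("fired", "ok", "promoted", "fired"))

# One 0/1 indicator column per factor level; sapply() names the columns
# after the levels it iterates over.
one_hot <- sapply(levels(y), function(lvl) as.integer(y == lvl))
one_hot
```

Each row contains exactly one 1, in the column of the observed class, so every column can be treated as a separate binary target when explanations are computed per class.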

As above, `predict_parts()` (formerly `variable_attribution()`) works perfectly fine with multilabel classification and the default explainer. Just like before, our target will be split into variables, one for each factor level, and the computations are then performed on them.

bd <- predict_parts(explain_HR_ranger, HR[1, ], type = "break_down")
plot(bd)

shap <- predict_parts(explain_HR_ranger, HR[1, ], type = "shap")
plot(shap)

The two functions `model_performance()` and `predict_diagnostics()` are covered together because they require the same steps to work with multilabel classification. The most important thing to realise is that both functions are based on residuals. Since `DALEX 1.2.2`, the `explain` function recognizes when the model is a multiclass classification task and uses a dedicated residual function by default.

(mp <- model_performance(explain_HR_ranger))

```
#> Measures for: multiclass
#> micro_F1 : 0.8682299
#> macro_F1 : 0.8661725
#> w_macro_F1 : 0.8670093
#> accuracy : 0.8682299
#> w_macro_auc: 0.9770129
#>
#> Residuals:
#> 0% 10% 20% 30% 40% 50% 60%
#> 0.00000000 0.02876519 0.06736897 0.11661823 0.17875934 0.24240620 0.31529124
#> 70% 80% 90% 100%
#> 0.39451101 0.48422885 0.59305883 0.88293838
```

plot(mp)
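In the performance summary above, `micro_F1` and `accuracy` have the same value. The sketch below, on made-up labels, illustrates why that is expected when every observation has exactly one class, using the standard textbook definitions of micro- and macro-averaged F1. It is not taken from the `DALEX` source, and the object names are hypothetical.

```r
# Toy observed and predicted classes (made up for illustration).
observed  <- factor(c("a", "b", "c", "a", "b", "c"))
predicted <- factor(c("a", "b", "a", "a", "c", "c"), levels = levels(observed))

# Confusion matrix: rows are observed classes, columns are predicted classes.
cm <- table(observed, predicted)

tp <- diag(cm)            # correct predictions per class
fp <- colSums(cm) - tp    # predicted as the class, but observed as another
fn <- rowSums(cm) - tp    # observed as the class, but predicted as another

# Micro-averaging pools the counts over classes before computing the score.
micro_precision <- sum(tp) / (sum(tp) + sum(fp))
micro_recall    <- sum(tp) / (sum(tp) + sum(fn))
micro_f1 <- 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

# Macro-averaging computes F1 per class and then takes the plain mean.
f1_per_class <- 2 * tp / (2 * tp + fp + fn)
macro_f1 <- mean(f1_per_class)

accuracy <- mean(observed == predicted)

c(micro_f1 = micro_f1, macro_f1 = macro_f1, accuracy = accuracy)
```

Every misclassified observation counts once as a false positive and once as a false negative, so the pooled precision and recall both equal the share of correct predictions.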

pd_all <- predict_diagnostics(explain_HR_ranger, HR[1, ])
plot(pd_all)

pd_salary <- predict_diagnostics(explain_HR_ranger, HR[1, ], variables = "salary")
plot(pd_salary)

```
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.4
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ranger_0.12.1 DALEX_1.3.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 gower_0.2.1 compiler_4.0.0 pillar_1.4.4
#> [5] ingredients_1.2.0 tools_4.0.0 digest_0.6.25 lattice_0.20-41
#> [9] evaluate_0.14 memoise_1.1.0 lifecycle_0.2.0 tibble_3.0.1
#> [13] gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.6 Matrix_1.2-18
#> [17] yaml_2.2.1 pkgdown_1.5.1.9000 xfun_0.14 stringr_1.4.0
#> [21] dplyr_0.8.5 knitr_1.28 desc_1.2.0 fs_1.4.1
#> [25] vctrs_0.3.0 rprojroot_1.3-2 grid_4.0.0 tidyselect_1.1.0
#> [29] glue_1.4.1 R6_2.4.1 iBreakDown_1.2.0 rmarkdown_2.1
#> [33] farver_2.0.3 ggplot2_3.3.0 purrr_0.3.4 magrittr_1.5
#> [37] backports_1.1.7 scales_1.1.1 htmltools_0.4.0 ellipsis_0.3.1
#> [41] MASS_7.3-51.5 assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
#> [45] stringi_1.4.6 munsell_0.5.0 crayon_1.3.4
```