`vignettes/vignette_describe.Rmd`

We address the problem of insufficient interpretability of explanations for domain experts. We solve this issue by introducing the `describe()` function, which automatically generates natural language descriptions of explanations generated with the `ingredients` package.

The `ingredients` package allows for generating prediction validation and prediction perturbation explanations. They allow for both global and local model explanation.

The generic function `describe()` generates a natural language description for explanations generated with the `feature_importance()` and `ceteris_paribus()` functions.

To show how automatic descriptions are generated, we first load the data set and build a random forest model that classifies which of the passengers survived the sinking of the Titanic. Then, using the `DALEX` package, we generate an explainer of the model. Lastly, we select a random passenger whose prediction should be explained.

```
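library("DALEX")
library("ingredients")
library("randomForest")

# Drop incomplete rows from the titanic data shipped with DALEX
titanic <- na.omit(titanic)

# Fit a random forest on a logical target (TRUE = survived)
model_titanic_rf <- randomForest(survived == "yes" ~ .,
                                 data = titanic)

# Wrap the model in an explainer; column 9 is the target and is excluded
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic[,-9],
                              y = titanic$survived == "yes",
                              label = "Random Forest")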
```

```
#> Preparation of a new explainer is initiated
#> -> model label : Random Forest
#> -> data : 2099 rows 8 cols
#> -> target variable : 2099 values
#> -> predict function : yhat.randomForest will be used ( default )
#> -> predicted values : numerical, min = 0.008528888 , mean = 0.3232699 , max = 0.9931723
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -0.8275507 , mean = 0.001170303 , max = 0.8897291
#> -> model_info : package randomForest , ver. 4.6.14 , task regression ( default )
#> A new explainer has been created!
```

```
passanger <- titanic[sample(nrow(titanic), 1) ,-9]
passanger
```

```
#>     gender age class    embarked country fare sibsp parch
#> 105   male   1   2nd Southampton   India   39     2     1
```

Now we are ready to generate various explanations and then describe them with the `describe()` function.

A feature importance explanation shows the importance of all the model’s variables. As it is a global explanation technique, no passenger needs to be specified.

```
importance_rf <- feature_importance(explain_titanic_rf)
plot(importance_rf)
```

The `describe()` function reports which variables are the most important. The `nonsignificance_treshold` argument sets the level above which variables become significant. With a higher threshold, fewer variables will be described as significant.

`describe(importance_rf)`

```
#> The number of important variables for Random Forest's prediction is 5 out of 8.
#> Variables gender, class, age have the highest importantance.
```
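
To make the idea behind the feature importance explanation more concrete, the sketch below computes a permutation-based importance in base R only. It is an illustration under stated assumptions, not the code used by `feature_importance()`: the helper `permutation_importance()` is hypothetical, a linear model on the built-in `mtcars` data stands in for the Titanic random forest, and importance is taken to be the average increase in RMSE after shuffling one column.

```r
# Minimal base-R sketch of permutation-based variable importance.
# Assumption: importance = average increase in RMSE after shuffling one column.
set.seed(42)

# Built-in data and a simple model stand in for the Titanic random forest
cars <- mtcars[, c("mpg", "wt", "hp", "qsec")]
model <- lm(mpg ~ wt + hp + qsec, data = cars)

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Hypothetical helper, not part of ingredients
permutation_importance <- function(model, data, target, n_permutations = 20) {
  baseline <- rmse(data[[target]], predict(model, data))
  features <- setdiff(names(data), target)
  sapply(features, function(feature) {
    losses <- replicate(n_permutations, {
      shuffled <- data
      # Break the link between this feature and the target
      shuffled[[feature]] <- sample(shuffled[[feature]])
      rmse(data[[target]], predict(model, shuffled))
    })
    mean(losses) - baseline
  })
}

importance <- permutation_importance(model, cars, "mpg")
sort(importance, decreasing = TRUE)
```

A variable whose shuffling barely changes the loss contributes little to the model, which is the kind of variable a description would leave out as non-significant.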

Ceteris Paribus profiles show how the model’s prediction changes with the change of a specified variable.

```
perturbed_variable <- "class"
cp_rf <- ceteris_paribus(explain_titanic_rf,
                         passanger,
                         variables = perturbed_variable)
plot(cp_rf, variable_type = "categorical")
```

For a user with no experience, interpreting the above plot may not be straightforward. Thus we generate a natural language description to make it easier.

`describe(cp_rf)`

```
#> For the selected instance, prediction estimated by Random Forest is equal to 0.844.
#>
#> Model's prediction would decrease substantially if the value of class variable would change to "restaurant staff", "3rd", "engineering crew", "victualling crew", "deck crew", "1st".
#> The largest change would be marked if class variable would change to "restaurant staff".
#>
#> All the variables were displayed.
```
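
The Ceteris Paribus idea described above can be sketched in a few lines of base R: take one observation, vary a single variable over a grid, keep everything else fixed, and record the predictions. The helper `cp_profile()` is hypothetical and the linear model on `mtcars` is only a stand-in; this is not how `ceteris_paribus()` is implemented.

```r
# Base-R sketch of a Ceteris Paribus profile: vary one variable for a single
# observation and keep all the others fixed.
model <- lm(mpg ~ wt + hp, data = mtcars)
observation <- mtcars[1, c("wt", "hp")]

# Hypothetical helper, not part of ingredients
cp_profile <- function(model, observation, variable, grid) {
  # Repeat the observation once per grid value, then overwrite one column
  perturbed <- observation[rep(1, length(grid)), , drop = FALSE]
  perturbed[[variable]] <- grid
  data.frame(value = grid, prediction = as.numeric(predict(model, perturbed)))
}

wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 5)
profile <- cp_profile(model, observation, "wt", wt_grid)
profile
```

Each row of `profile` answers a "what if" question about the selected observation, which is the information a description of the profile puts into words.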

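Natural language descriptions should be flexible in order to provide the desired level of complexity and specificity. Thus various parameters can modify the description being generated. For example, setting `display_numbers = TRUE` and passing a custom `label` yields the description below.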

```
#> Random Forest predicts that for the selected instance, the probability that the passanger will survive is equal to 0.844
#>
#> The most important change in Random Forest's prediction would occur for class = "restaurant staff". It decreases the prediction by 0.316.
#> The second most important change in the prediction would occur for class = "3rd". It decreases the prediction by 0.315.
#> The third most important change in the prediction would occur for class = "engineering crew". It decreases the prediction by 0.311.
#>
#> Other variable values are with less importance. They do not change the the probability that the passanger will survive by more than 0.3.
```

Please note that `describe()` can handle only one variable at a time, so it is recommended to specify which variable should be described.

```
describe(cp_rf,
         display_numbers = TRUE,
         label = "the probability that the passanger will survive",
         variables = perturbed_variable)
```

```
#> Random Forest predicts that for the selected instance, the probability that the passanger will survive is equal to 0.844
#>
#> The most important change in Random Forest's prediction would occur for class = "restaurant staff". It decreases the prediction by 0.316.
#> The second most important change in the prediction would occur for class = "3rd". It decreases the prediction by 0.315.
#> The third most important change in the prediction would occur for class = "engineering crew". It decreases the prediction by 0.311.
#>
#> Other variable values are with less importance. They do not change the the probability that the passanger will survive by more than 0.3.
```

Continuous variables are described as well.

```
perturbed_variable_continuous <- "age"
cp_rf <- ceteris_paribus(explain_titanic_rf,
                         passanger)
plot(cp_rf, variables = perturbed_variable_continuous)
```

`describe(cp_rf, variables = perturbed_variable_continuous)`

```
#> Random Forest predicts that for the selected instance, prediction is equal to 0.844
#>
#> The highest prediction occurs for (age = 2), while the lowest for (age = 74).
#> Breakpoint is identified at (age = 9).
#>
#> Average model responses are *lower* for variable values *higher* than breakpoint (= 9).
```
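
As a rough illustration of how a sentence like the one about the highest and lowest prediction can be produced from a profile, the sketch below uses a hypothetical helper, `describe_extremes()`. It is not the implementation used by `describe()`, and the profile values are made up for the example.

```r
# Hypothetical helper that mimics one sentence of a profile description.
# It is not the implementation used by describe().
describe_extremes <- function(values, predictions, variable) {
  highest <- values[which.max(predictions)]
  lowest <- values[which.min(predictions)]
  sprintf("The highest prediction occurs for (%s = %s), while the lowest for (%s = %s).",
          variable, format(highest), variable, format(lowest))
}

# Toy profile: illustrative, made-up predictions for a few ages
ages <- c(2, 9, 30, 50, 74)
predictions <- c(0.85, 0.60, 0.30, 0.25, 0.20)
describe_extremes(ages, predictions, "age")
```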

Ceteris Paribus profiles are described only for a single observation. If we want to assess the influence of a variable across more than one observation, we need to describe dependency profiles.

```
pdp <- aggregate_profiles(cp_rf, type = "partial")
plot(pdp, variables = "fare")
```

`describe(pdp, variables = "fare")`

```
#> Random Forest's mean prediction is equal to 0.844.
#>
#> The highest prediction occurs for (fare = 35.1), while the lowest for (fare = 0).
#> Breakpoint is identified at (fare = 30.1).
#>
#> Average model responses are *higher* for variable values *higher* than breakpoint (= 30.1).
```
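
A partial dependence profile can be read as the pointwise average of Ceteris Paribus profiles over many observations. The base-R sketch below shows that averaging step, assuming a stand-in linear model on `mtcars`; it is not the code behind `aggregate_profiles()`. Note that when only one observation is aggregated, the average is simply that observation's own profile.

```r
# Base-R sketch: a partial dependence profile as the pointwise average of
# Ceteris Paribus profiles computed for many observations.
model <- lm(mpg ~ wt + hp, data = mtcars)
observations <- mtcars[, c("wt", "hp")]
hp_grid <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 5)

# One column per observation, one row per grid value
cp_matrix <- sapply(seq_len(nrow(observations)), function(i) {
  perturbed <- observations[rep(i, length(hp_grid)), , drop = FALSE]
  perturbed$hp <- hp_grid
  as.numeric(predict(model, perturbed))
})

# Average across observations for every grid value
partial_dependence <- data.frame(hp = hp_grid, prediction = rowMeans(cp_matrix))
partial_dependence
```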

```
pdp <- aggregate_profiles(cp_rf, type = "partial", variable_type = "categorical")
plot(pdp, variables = perturbed_variable)
```

`describe(pdp, variables = perturbed_variable)`

```
#> Random Forest's mean prediction is equal to 0.844.
#>
#> Model's prediction would increase substantially if the value of class variable would change to "restaurant staff".
#> The largest change would be marked if class variable would change to "2nd".
#>
#> Other variables are with less importance and they do not change prediction by more than 0.05%.
```