Data and logistic regression model for Titanic survival

This vignette presents the aspect_importance() function on two datasets: titanic_imputed (available in the DALEX package) and BostonHousing2 (from the mlbench package).
We start by loading the titanic_imputed dataset and building a logistic regression model.
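The chunk that loads and previews the data is not shown above; a minimal sketch, assuming titanic_imputed ships with the DALEX package as the text states, would be:

```r
library("DALEX")   # provides the titanic_imputed dataset
head(titanic_imputed)
```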

#>   gender age class    embarked  fare sibsp parch survived
#> 1   male  42   3rd Southampton  7.11     0     0       no
#> 2   male  13   3rd Southampton 20.05     0     2       no
#> 3   male  16   3rd Southampton 20.05     1     1       no
#> 4 female  39   3rd Southampton 20.05     1     1      yes
#> 5 female  16   3rd Southampton  7.13     0     0      yes
#> 6   male  25   3rd Southampton  7.13     0     0      yes
model_titanic_glm <-
  glm(survived == "yes" ~ class + gender + age + sibsp + parch + fare + embarked,
      titanic_imputed,
      family = "binomial")

Preparing additional parameters

Before using aspect_importance() we need to:

  • group features of the dataset into aspects,
  • define the size of the sample that will be used to calculate aspect importance,
  • choose an observation for which we want to explain the aspects’ importance.
aspects <-
  list(
    wealth = c("class", "fare"),
    family = c("sibsp", "parch"),
    personal = c("age", "gender"),
    embarked = "embarked"
  )
N <- 1000
passenger <- data.frame(
  class = factor(
    "3rd",
    levels = c("1st", "2nd", "3rd", "deck crew", "engineering crew", "restaurant staff", "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8,
  sibsp = 0,
  parch = 0,
  fare = 18,
  embarked = factor(
    "Southampton",
    levels = c("Belfast", "Cherbourg", "Queenstown", "Southampton")
  )
)

passenger
#>   class gender age sibsp parch fare    embarked
#> 1   3rd   male   8     0     0   18 Southampton
predict(model_titanic_glm, passenger, type = "response")
#>         1 
#> 0.1803217

Calculating aspect importance (logistic regression)

Now we can call the aspect_importance() function. We see that the features included in wealth (that is, class and fare) have the largest contribution to the survival prediction for this passenger, and that contribution is negative. The remaining aspects have a significantly smaller influence; however, for family and personal the influence is positive.

library("ggplot2")
library("ingredients")

titanic_glm_ai <- aspect_importance(model_titanic_glm, titanic_imputed, predict, passenger, aspects, N)

titanic_glm_ai
#>    aspects importance     features
#> 2   wealth   -0.81013  class, fare
#> 4 personal    0.20001  age, gender
#> 3   family    0.18500 sibsp, parch
#> 5 embarked   -0.05799     embarked
plot(titanic_glm_ai) + ggtitle("Aspect importance for the selected passenger (logistic reg.)")

Calculating aspect importance with explainer

aspect_importance() can also be called on a DALEX explainer, as we show below.
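The chunk that produced the output below is not shown in this vignette; a minimal sketch, assuming the explain() API of DALEX 0.4.x and that aspect_importance() dispatches on an explainer in place of the (model, data, predict) triple, could look like:

```r
library("DALEX")

# wrap the model and its data in a DALEX explainer
explainer_titanic <- explain(model_titanic_glm, data = titanic_imputed)

# the explainer carries the model, data and predict function,
# so only the observation, aspects and sample size are needed
titanic_glm_ai <- aspect_importance(explainer_titanic, passenger, aspects, N)
titanic_glm_ai
```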

#>    aspects importance     features
#> 2   wealth   -0.81013  class, fare
#> 4 personal    0.20001  age, gender
#> 3   family    0.18500 sibsp, parch
#> 5 embarked   -0.05799     embarked

Random forest model for Titanic survival

Now we fit a random forest model to the Titanic data.

library("randomForest")
model_titanic_rf <-
  randomForest(factor(survived) == "yes" ~ gender + age + class + embarked + fare + sibsp + parch,
               data = titanic_imputed)
predict(model_titanic_rf, passenger)
#>         1 
#> 0.5162556

Calculating aspect importance (random forest)

After calling aspect_importance(), we can see why the survival prediction for this passenger was much higher in the random forest model (0.52) than in the logistic regression case (0.18).

In this example the personal features (age and gender) have the largest, positive influence. The wealth (class, fare) and embarked aspects both have a much smaller contribution, and those contributions are negative. The family aspect has very little influence on the prediction.

titanic_rf_ai <- aspect_importance(model_titanic_rf, titanic_imputed, predict, passenger, aspects, N)

titanic_rf_ai
#>    aspects importance     features
#> 4 personal   0.223615  age, gender
#> 2   wealth  -0.063091  class, fare
#> 3   family   0.009410 sibsp, parch
#> 5 embarked  -0.003338     embarked
plot(titanic_rf_ai) + ggtitle("Aspect importance for the selected passenger (random for.)")

Automated grouping features into aspects

In the examples above, we grouped features into aspects manually. On the BostonHousing2 dataset, we will test a function that groups features automatically, based on their correlations. The function works on numeric variables only.

We import BostonHousing2 from the mlbench package and keep only the columns with numeric features. Then we fit a linear model to the data and choose an observation to be explained. The target variable is cmedv.

library(mlbench)
data("BostonHousing2")
data <- BostonHousing2[,-c(1:5, 10)] # excluding non-numeric features
head(data)
#>   cmedv    crim zn indus   nox    rm  age    dis rad tax ptratio      b
#> 1  24.0 0.00632 18  2.31 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
#> 2  21.6 0.02731  0  7.07 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
#> 3  34.7 0.02729  0  7.07 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
#> 4  33.4 0.03237  0  2.18 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
#> 5  36.2 0.06905  0  2.18 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
#> 6  28.7 0.02985  0  2.18 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
#>   lstat
#> 1  4.98
#> 2  9.14
#> 3  4.03
#> 4  2.94
#> 5  5.33
#> 6  5.21
x <- BostonHousing2[,-c(1:6, 10)] # excluding non-numeric features and the target variable
new_observation <- data[10,]
model <- lm(cmedv ~., data = data)
predict(model, new_observation)
#>       10 
#> 19.01352

We run the group_variables() function with the cut-off level set at 0.6. As a result, we get a list of variable groups (aspects) in which the absolute value of the features’ pairwise correlation is at least 0.6.

Afterwards, we call the aspect_importance() function with the parameter show_cor = TRUE to show how the features are grouped into aspects, report the minimal value of pairwise correlation in each group, and check whether any pair of features in a group is negatively correlated (neg) or not (pos).
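The chunks producing the grouping and the table below are not shown; a sketch, assuming group_variables() takes the feature data and a correlation cut-off (the argument name p is an assumption), could be:

```r
# group numeric features whose absolute pairwise correlation is at least 0.6
aspects_boston <- group_variables(x, p = 0.6)

BostonHousing2_ai <- aspect_importance(model, data, predict_function = predict,
                                       new_observation, aspects_boston,
                                       N = 200, show_cor = TRUE)
BostonHousing2_ai
```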

#>         aspects importance                   features   min_cor sign
#> 4 aspect.group3    -4.1808                  rm, lstat 0.6408316  neg
#> 6 aspect.group5     3.9080                    ptratio        NA     
#> 2 aspect.group1    -3.2599 crim, indus, nox, age, dis 0.6794867  neg
#> 7 aspect.group6     1.5356                          b        NA     
#> 3 aspect.group2     0.8680                         zn        NA     
#> 5 aspect.group4     0.8183                   rad, tax 0.7048757  pos

Using lasso in aspect_importance() function

The aspect_importance() function can calculate coefficients (that is, aspects’ importance values) using either linear regression or lasso regression. With lasso, we can control how many nonzero coefficients (nonzero aspect importance values) are present in the final explanation.

To use aspect_importance() with lasso, we have to provide the n_var parameter, which declares how many nonzero aspect importance values we would like to obtain in the results.

For this example, we use the BostonHousing2 dataset again. This time we group the variables into aspects with the correlation cut-off level set at 0.7, and then run aspect_importance(), at first without lasso.
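The corresponding chunk is not shown; a sketch (again assuming the p argument of group_variables()) that builds the 0.7 grouping reused by the lasso call and runs aspect_importance() without lasso could be:

```r
# regroup with a stricter cut-off; the resulting list replaces `aspects`
aspects <- group_variables(x, p = 0.7)

BostonHousing2_ai <- aspect_importance(model, data, predict_function = predict,
                                       new_observation, aspects, N = 200)
BostonHousing2_ai
```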

#>         aspects importance      features
#> 7 aspect.group6     3.3096       ptratio
#> 4 aspect.group3    -3.0750 nox, age, dis
#> 5 aspect.group4    -1.7896            rm
#> 9 aspect.group8    -1.1738         lstat
#> 3 aspect.group2    -0.4833            zn
#> 8 aspect.group7     0.4075             b
#> 2 aspect.group1     0.2596   crim, indus
#> 6 aspect.group5    -0.1356      rad, tax

With the help of the lasso technique, we can check the importance of the aspects while forcing two of them to be equal to 0. Therefore, we call aspect_importance() with the n_var parameter set to 6.

BostonHousing2_ai_lasso <-
  aspect_importance(model, data, predict_function = predict,
                    new_observation, aspects, N = 200,
                    n_var = 6, show_cor = TRUE)
BostonHousing2_ai_lasso
#>         aspects importance      features   min_cor sign
#> 7 aspect.group6     3.2300       ptratio        NA     
#> 4 aspect.group3    -2.5694 nox, age, dis 0.7951529  neg
#> 5 aspect.group4    -1.2941            rm        NA     
#> 9 aspect.group8    -0.6969         lstat        NA     
#> 8 aspect.group7     0.4650             b        NA     
#> 2 aspect.group1     0.2146   crim, indus 0.7355237  pos
#> 3 aspect.group2     0.0000            zn        NA     
#> 6 aspect.group5     0.0000      rad, tax 0.7048757  pos

Session info

#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS  10.14.4
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] mlbench_2.1-1       randomForest_4.6-14 ingredients_0.3.10 
#> [4] ggplot2_3.2.1       DALEX_0.4.8        
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.2       compiler_3.5.0   pillar_1.4.2     iterators_1.0.12
#>  [5] tools_3.5.0      digest_0.6.21    lattice_0.20-38  evaluate_0.14   
#>  [9] memoise_1.1.0    tibble_2.1.3     gtable_0.3.0     pkgconfig_2.0.3 
#> [13] rlang_0.4.0      foreach_1.4.7    Matrix_1.2-17    rstudioapi_0.10 
#> [17] yaml_2.2.0       pkgdown_1.4.1    xfun_0.9         ggdendro_0.1-20 
#> [21] gridExtra_2.3    withr_2.1.2      stringr_1.4.0    dplyr_0.8.3     
#> [25] knitr_1.24       desc_1.2.0       fs_1.3.1         glmnet_2.0-18   
#> [29] rprojroot_1.3-2  grid_3.5.0       tidyselect_0.2.5 glue_1.3.1      
#> [33] R6_2.4.0         rmarkdown_1.15   purrr_0.3.2      magrittr_1.5    
#> [37] codetools_0.2-16 backports_1.1.4  scales_1.0.0     htmltools_0.3.6 
#> [41] MASS_7.3-51.4    assertthat_0.2.1 colorspace_1.4-1 labeling_0.3    
#> [45] stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0    crayon_1.3.4