DALEX is designed to work with various black-box models such as tree ensembles, linear models, and neural networks. Unfortunately, R packages that create such models are very inconsistent: different tools use different interfaces to train, validate, and use models. One of the tools we would like to make more accessible is H2O.

explain_h2o(
  model,
  data = NULL,
  y = NULL,
  weights = NULL,
  predict_function = NULL,
  predict_function_target_column = NULL,
  residual_function = NULL,
  ...,
  label = NULL,
  verbose = TRUE,
  precalculate = TRUE,
  colorize = !isTRUE(getOption("knitr.in.progress")),
  model_info = NULL,
  type = NULL
)

Arguments

model

object - a model to be explained

data

data.frame or matrix - data which will be used to calculate the explanations. If not provided, it will be extracted from the model. Data should be passed without the target column (the target should be provided as the y argument). NOTE: If the target variable is present in the data, some functionality may not work properly.
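A minimal sketch of separating the target from the predictors before building an explainer. The data frame `passengers` and its column names are invented for illustration; only the pattern of dropping the target from `data` and passing it as `y` comes from the description above.

```r
# Hypothetical data frame; column names and values are illustrative only
passengers <- data.frame(
  age      = c(22, 38, 26, 35),
  fare     = c(7.25, 71.28, 7.92, 53.10),
  survived = c(0, 1, 1, 1)
)

target <- "survived"

# Predictors only: this is what would be passed as `data`
x <- passengers[, setdiff(names(passengers), target), drop = FALSE]

# Target as a numeric vector: this is what would be passed as `y`
y <- passengers[[target]]

# `y` needs one value per row of `data`, and `data` must not contain the target
stopifnot(length(y) == nrow(x), !(target %in% names(x)))
```

Selecting columns by name rather than by position keeps the split correct if the column order changes.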

y

numeric vector with outputs/scores. If provided, it should have one element per row of data

weights

numeric vector with sampling weights. By default it's NULL. If provided, it should have one element per row of data

predict_function

function that takes two arguments: model and new data and returns a numeric vector with predictions. By default it is yhat.
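The contract described above (model first, new data second, numeric vector out) can be sketched as follows. To keep the sketch runnable without an H2O cluster, it uses a base R `glm` as a stand-in model; the name `custom_predict` is hypothetical, and this is not the default `yhat` implementation.

```r
# Stand-in model fitted with base R so the sketch runs without an H2O cluster
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial())

# A predict function in the shape expected by `predict_function`:
# the first argument is the model, the second is the new data,
# and the result is a plain numeric vector of predictions
custom_predict <- function(model, newdata) {
  as.numeric(predict(model, newdata = newdata, type = "response"))
}

preds <- custom_predict(fit, mtcars[, c("wt", "hp")])

# One prediction per row of the new data
stopifnot(is.numeric(preds), length(preds) == nrow(mtcars))
```

A function of this shape could be supplied through the `predict_function` argument when the default does not return the quantity of interest.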

predict_function_target_column

Character or numeric value giving either the column name or the column number, in the model prediction object, of the class that should be considered positive (i.e. the class that is associated with probability 1). If NULL, the second column of the output will be taken for binary classification. In a multiclass classification setting, this parameter causes a switch to binary classification mode with one-vs-others probabilities.
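The idea of selecting a positive class by column name or column number can be illustrated on a mock probability table. This is a conceptual sketch only: the table, its class names, and the helper `pick_target_column` are invented here and do not reproduce DALEX internals.

```r
# Mock prediction table shaped like a class-probability output;
# the column names and values are invented for illustration
probs <- data.frame(
  setosa     = c(0.7, 0.1, 0.2),
  versicolor = c(0.2, 0.6, 0.3),
  virginica  = c(0.1, 0.3, 0.5)
)

# Hypothetical helper: pick one column either by name or by position
pick_target_column <- function(probs, target_column) {
  probs[[target_column]]
}

p_by_name  <- pick_target_column(probs, "virginica")  # selected by name
p_by_index <- pick_target_column(probs, 3)            # selected by position

# One-vs-others view: probability of all remaining classes combined
p_others <- 1 - p_by_name

# Both ways of addressing the column give the same probabilities
stopifnot(identical(p_by_name, p_by_index))
```

Once a single column is selected, the multiclass output behaves like a binary one: the chosen class against everything else.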

residual_function

function that takes four arguments: model, data, target vector y and predict function (optionally). It should return a numeric vector with model residuals for given data. If not provided, response residuals (\(y-\hat{y}\)) are calculated. By default it is residual_function_default.
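A custom residual function with the four-argument shape described above might look like the sketch below. It uses a base R `lm` as a stand-in model so it runs without H2O; the name `abs_residuals` and the choice of absolute residuals are assumptions made for illustration.

```r
# Stand-in regression model fitted with base R
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Custom residual function: model, data, target vector y, and an optional
# predict function; returns absolute rather than signed residuals
abs_residuals <- function(model, data, y, predict_function = predict) {
  as.numeric(abs(y - predict_function(model, data)))
}

res <- abs_residuals(fit, mtcars[, c("wt", "hp")], mtcars$mpg)

# One non-negative residual per observation
stopifnot(length(res) == nrow(mtcars), all(res >= 0))
```

Keeping the predict function as an argument lets the same residual definition be reused with different prediction functions.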

...

other parameters

label

character - the name of the model. By default it's extracted from the 'class' attribute of the model

verbose

logical. If TRUE (default) then diagnostic messages will be printed

precalculate

logical. If TRUE (default) then predicted values and residuals are calculated when the explainer is created. This also happens if verbose is TRUE. Set both verbose and precalculate to FALSE to omit the calculations.

colorize

logical. If TRUE (default) then WARNINGS, ERRORS and NOTES are colorized. Works only in the R console. By default it is FALSE while knitting and TRUE otherwise.

model_info

a named list (package, version, type) containing information about the model. If NULL, DALEX will look for this information on its own.
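A hand-written `model_info` list with the three names mentioned above could look like this. The values are taken from the example output on this page (H2O 3.36.0.4, classification task) and would need to be adjusted to the model actually in use.

```r
# Model metadata written by hand; values mirror the example output on this
# page and are illustrative, not defaults
info <- list(
  package = "h2o",
  version = "3.36.0.4",
  type    = "classification"
)

# The list carries exactly the three named fields described above
stopifnot(identical(names(info), c("package", "version", "type")))
```

Such a list would be passed as `model_info = info`; supplying it skips the automatic lookup.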

type

type of a model, either classification or regression. If not specified then type will be extracted from model_info.

Value

explainer object (explain) ready to work with DALEX

Examples

# \donttest{


# load packages and data
library(h2o)
#> 
#> ----------------------------------------------------------------------
#> 
#> Your next step is to start H2O:
#>     > h2o.init()
#> 
#> For H2O package documentation, ask for help:
#>     > ??h2o
#> 
#> After starting H2O, you can use the Web UI at http://localhost:54321
#> For more information visit https://docs.h2o.ai
#> 
#> ----------------------------------------------------------------------
#> 
#> Attaching package: ‘h2o’
#> The following objects are masked from ‘package:stats’:
#> 
#>     cor, sd, var
#> The following objects are masked from ‘package:base’:
#> 
#>     %*%, %in%, &&, apply, as.factor, as.numeric, colnames, colnames<-,
#>     ifelse, is.character, is.factor, is.numeric, log, log10, log1p,
#>     log2, round, signif, trunc, ||
library(DALEXtra)

# data <- DALEX::titanic_imputed

# init h2o
 cluster <- try(h2o::h2o.init())
#> 
#> H2O is not running yet, starting it now...
#> 
#> Note:  In case of errors look at the following log files:
#>     /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//RtmpsbQZiq/file719d3118727e/h2o_runner_started_from_r.out
#>     /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T//RtmpsbQZiq/file719d3c04c760/h2o_runner_started_from_r.err
#> 
#> 
#> Starting H2O JVM and connecting: ...... Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         2 seconds 577 milliseconds 
#>     H2O cluster timezone:       UTC 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.36.0.4 
#>     H2O cluster version age:    23 days  
#>     H2O cluster name:           H2O_started_from_R_runner_gce895 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   3.11 GB 
#>     H2O cluster total cores:    3 
#>     H2O cluster allowed cores:  3 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     R Version:                  R version 4.2.0 (2022-04-22) 
#> 
if (!inherits(cluster, "try-error")) {
# stop h2o progress printing
 h2o.no_progress()

# split the data
# h2o_split <- h2o.splitFrame(as.h2o(data))
# train <- h2o_split[[1]]
# test <- as.data.frame(h2o_split[[2]])
# h2o automl takes target as factor
# train$survived <- as.factor(train$survived)

# fit a model
# automl <- h2o.automl(y = "survived",
#                   training_frame = train,
#                    max_runtime_secs = 30)


# create an explainer for the model
# explainer <- explain_h2o(automl,
#                        data = test,
#                         y = test$survived,
#                          label = "h2o")


  # read the train/test data shipped with DALEXtra
  titanic_test <- read.csv(system.file("extdata", "titanic_test.csv", package = "DALEXtra"))
  titanic_train <- read.csv(system.file("extdata", "titanic_train.csv", package = "DALEXtra"))

  # convert to H2O frames; the target is converted to a factor for classification
  titanic_h2o <- h2o::as.h2o(titanic_train)
  titanic_h2o["survived"] <- h2o::as.factor(titanic_h2o["survived"])
  titanic_test_h2o <- h2o::as.h2o(titanic_test)

  # fit a gradient boosting model
  model <- h2o::h2o.gbm(
    training_frame = titanic_h2o,
    y = "survived",
    distribution = "bernoulli",
    ntrees = 500,
    max_depth = 4,
    min_rows = 12,
    learn_rate = 0.001
  )

  # create an explainer: the first 17 columns are passed as data, column 18 as y
  explain_h2o(model, data = titanic_test[, 1:17], y = titanic_test[, 18])

h2o.shutdown(prompt = FALSE)
 }
#> Warning: data.table cannot be used without R package bit64 version 0.9.7 or higher.  Please upgrade to take advangage of data.table speedups.
#> Preparation of a new explainer is initiated
#>   -> model label       :  H2OBinomialModel  (  default  )
#>   -> data              :  524  rows  17  cols 
#>   -> target variable   :  524  values 
#>   -> predict function  :  yhat.H2OBinomialModel  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package h2o , ver. 3.36.0.4 , task classification (  default  ) 
#>   -> predicted values  :  numerical, min =  0.1994385 , mean =  0.3299534 , max =  0.5876636  
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  -0.5847857 , mean =  -0.01888466 , max =  0.8005615  
#>   A new explainer has been created!  
# }