Black-box models can have very different structures. This function creates a unified representation of a model that the explanation functions can then process.
explain.default(
  model,
  data = NULL,
  y = NULL,
  predict_function = NULL,
  predict_function_target_column = NULL,
  residual_function = NULL,
  weights = NULL,
  ...,
  label = NULL,
  verbose = TRUE,
  precalculate = TRUE,
  colorize = !isTRUE(getOption("knitr.in.progress")),
  model_info = NULL,
  type = NULL
)
explain(
  model,
  data = NULL,
  y = NULL,
  predict_function = NULL,
  predict_function_target_column = NULL,
  residual_function = NULL,
  weights = NULL,
  ...,
  label = NULL,
  verbose = TRUE,
  precalculate = TRUE,
  colorize = !isTRUE(getOption("knitr.in.progress")),
  model_info = NULL,
  type = NULL
)
model
object - a model to be explained.
data
data.frame or matrix - data which will be used to calculate the explanations. If not provided, it will be extracted from the model. Data should be passed without a target column (the target shall be provided as the y argument). NOTE: If the target variable is present in the data, some of the functionalities may not work properly.
y
numeric vector with outputs/scores. If provided, it must contain one value per row of data.
predict_function
function that takes two arguments, model and new data, and returns a numeric vector with predictions. By default it is yhat.
predict_function_target_column
Character or numeric containing either the column name or the column number in the model prediction object of the class that should be considered as positive (i.e. the class that is associated with probability 1). If NULL, the second column of the output will be taken for binary classification. In a multiclass classification setting, this parameter causes a switch to binary classification mode with one-vs-others probabilities.
residual_function
function that takes four arguments: model, data, target vector y and (optionally) predict function. It should return a numeric vector with model residuals for the given data. If not provided, response residuals (\(y-\hat{y}\)) are calculated. By default it is residual_function_default.
weights
numeric vector with sampling weights. By default it is NULL. If provided, it must contain one value per row of data.
...
other parameters.
label
character - the name of the model. By default it is extracted from the 'class' attribute of the model.
verbose
logical. If TRUE (default), then diagnostic messages will be printed.
precalculate
logical. If TRUE (default), then predicted_values and residual are calculated when the explainer is created. This will also happen if verbose is TRUE. Set both verbose and precalculate to FALSE to omit the calculations.
colorize
logical. If TRUE, then WARNINGS, ERRORS and NOTES are colorized. This works only in the R console. By default it is FALSE while knitting and TRUE otherwise.
model_info
a named list (package, version, type) containing information about the model. If NULL, DALEX will look for this information on its own.
type
type of a model, either classification or regression. If not specified, then type will be extracted from model_info.
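The function contracts described for predict_function and residual_function can be sketched in base R without calling DALEX at all. This is only an illustration under stated assumptions: the built-in mtcars dataset stands in for real data, and the names custom_predict and custom_residual are hypothetical user-defined functions, not part of the package.

```r
# Fit a small linear model on a built-in dataset (illustrative only).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Predictors only: the target column (mpg) is deliberately left out,
# and the target is kept in a separate numeric vector.
X <- mtcars[, c("wt", "hp")]
y <- mtcars$mpg

# A predict function in the documented shape:
# (model, new data) -> numeric vector with one prediction per row.
custom_predict <- function(model, newdata) {
  as.numeric(predict(model, newdata = newdata))
}

# A residual function in the documented shape:
# (model, data, y, predict function) -> numeric vector of residuals.
# Here it returns absolute residuals instead of the default y - y_hat.
custom_residual <- function(model, data, y, predict_function = custom_predict) {
  abs(y - predict_function(model, data))
}

p <- custom_predict(fit, X)
r <- custom_residual(fit, X, y, custom_predict)
```

Functions written in this shape could then be passed as the predict_function and residual_function arguments.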
An object of the class explainer.
It's a list with the following fields:
model
the explained model.
data
the dataset used for training.
y
response for observations from data.
weights
sample weights for data. NULL if weights are not specified.
y_hat
calculated predictions.
residuals
calculated residuals.
predict_function
function that may be used for model predictions, shall return a single numerical value for each observation.
residual_function
function that returns residuals, shall return a single numerical value for each observation.
class
class/classes of a model.
label
label of explainer.
model_info
named list containing basic information about the model, such as the package, the package version and the model type.
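The requirement that predict_function returns a single numerical value for each observation can be checked before an explainer is built. The helper below, check_predict_output, is a hypothetical name introduced for this sketch and is not part of DALEX; the mtcars data and both example predict functions are likewise assumptions made for illustration.

```r
# Hypothetical helper (not part of DALEX): TRUE when the predict function
# returns a plain numeric vector with exactly one value per row of `data`.
check_predict_output <- function(model, data, predict_function) {
  out <- predict_function(model, data)
  is.numeric(out) && is.null(dim(out)) && length(out) == nrow(data)
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- mtcars[, c("wt", "hp")]

# Returns a plain numeric vector, one value per observation.
ok_predict <- function(model, newdata) {
  as.numeric(predict(model, newdata = newdata))
}

# Returns a two-column matrix, so it does not meet the requirement.
bad_predict <- function(model, newdata) {
  p <- predict(model, newdata = newdata)
  cbind(p, -p)
}

ok_result  <- check_predict_output(fit, X, ok_predict)
bad_result <- check_predict_output(fit, X, bad_predict)
```

A check of this kind is cheap and can catch a predict function that returns a matrix or data frame before it reaches the explanation functions.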
Please NOTE that the model is the only required argument, but some explanations may expect other arguments to be provided too.
Explanatory Model Analysis. Explore, Explain and Examine Predictive Models. https://ema.drwhy.ai/
library("DALEX")
# simple explainer for regression problem
aps_lm_model4 <- lm(m2.price ~ ., data = apartments)
aps_lm_explainer4 <- explain(aps_lm_model4, data = apartments, label = "model_4v")
#> Preparation of a new explainer is initiated
#> -> model label : model_4v
#> -> data : 1000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : yhat.lm will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.3 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1781.848 , mean = 3487.019 , max = 6176.032
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
aps_lm_explainer4
#> Model label: model_4v
#> Model class: lm
#> Data head :
#> m2.price construction.year surface floor no.rooms district
#> 1 5897 1953 25 3 1 Srodmiescie
#> 2 1818 1992 143 9 5 Bielany
# various parameters for the explain function
# all defaults
aps_lm <- explain(aps_lm_model4)
#> Preparation of a new explainer is initiated
#> -> model label : lm ( default )
#> -> data : 1000 rows 6 cols extracted from the model
#> -> target variable : not specified! ( WARNING )
#> -> predict function : yhat.lm will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.3 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1781.848 , mean = 3487.019 , max = 6176.032
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
# silent execution
aps_lm <- explain(aps_lm_model4, verbose = FALSE)
# set target variable
aps_lm <- explain(aps_lm_model4, data = apartments, label = "model_4v", y = apartments$m2.price)
#> Preparation of a new explainer is initiated
#> -> model label : model_4v
#> -> data : 1000 rows 6 cols
#> -> target variable : 1000 values
#> -> predict function : yhat.lm will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.3 , task regression ( default )
#> -> predicted values : numerical, min = 1781.848 , mean = 3487.019 , max = 6176.032
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -247.4728 , mean = 2.093656e-14 , max = 469.0023
#> A new explainer has been created!
aps_lm <- explain(aps_lm_model4, data = apartments, label = "model_4v", y = apartments$m2.price,
                  predict_function = predict)
#> Preparation of a new explainer is initiated
#> -> model label : model_4v
#> -> data : 1000 rows 6 cols
#> -> target variable : 1000 values
#> -> predict function : predict
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.3 , task regression ( default )
#> -> predicted values : numerical, min = 1781.848 , mean = 3487.019 , max = 6176.032
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -247.4728 , mean = 2.093656e-14 , max = 469.0023
#> A new explainer has been created!
# \donttest{
# user provided predict_function
aps_ranger <- ranger::ranger(m2.price ~ ., data = apartments, num.trees = 50)
custom_predict <- function(X.model, newdata) {
  predict(X.model, newdata)$predictions
}
aps_ranger_exp <- explain(aps_ranger, data = apartments, y = apartments$m2.price,
                          predict_function = custom_predict)
#> Preparation of a new explainer is initiated
#> -> model label : ranger ( default )
#> -> data : 1000 rows 6 cols
#> -> target variable : 1000 values
#> -> predict function : custom_predict
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task regression ( default )
#> -> predicted values : numerical, min = 1853.746 , mean = 3491.451 , max = 6247.183
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -529.95 , mean = -4.432274 , max = 626.1063
#> A new explainer has been created!
# user provided residual_function
aps_ranger <- ranger::ranger(m2.price ~ ., data = apartments, num.trees = 50)
custom_residual <- function(X.model, newdata, y, predict_function) {
  abs(y - predict_function(X.model, newdata))
}
aps_ranger_exp <- explain(aps_ranger, data = apartments,
                          y = apartments$m2.price,
                          residual_function = custom_residual)
#> Preparation of a new explainer is initiated
#> -> model label : ranger ( default )
#> -> data : 1000 rows 6 cols
#> -> target variable : 1000 values
#> -> predict function : yhat.ranger will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task regression ( default )
#> -> predicted values : numerical, min = 1902.531 , mean = 3488.193 , max = 6041.708
#> -> residual function : custom_residual
#> -> residuals : numerical, min = 0.081 , mean = 116.7558 , max = 633.3
#> A new explainer has been created!
# binary classification
titanic_ranger <- ranger::ranger(as.factor(survived) ~ ., data = titanic_imputed, num.trees = 50,
                                 probability = TRUE)
# keep in mind that for binary classification y parameter has to be numeric with 0 and 1 values
titanic_ranger_exp <- explain(titanic_ranger, data = titanic_imputed, y = titanic_imputed$survived)
#> Preparation of a new explainer is initiated
#> -> model label : ranger ( default )
#> -> data : 2207 rows 8 cols
#> -> target variable : 2207 values
#> -> predict function : yhat.ranger will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task classification ( default )
#> -> predicted values : numerical, min = 0.02353463 , mean = 0.3220703 , max = 0.9937418
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -0.7602215 , mean = 8.647797e-05 , max = 0.8902087
#> A new explainer has been created!
# multiclass task
hr_ranger <- ranger::ranger(status~., data = HR, num.trees = 50, probability = TRUE)
# keep in mind that for multiclass y parameter has to be a factor,
# with same levels as in training data
hr_ranger_exp <- explain(hr_ranger, data = HR, y = HR$status)
#> Preparation of a new explainer is initiated
#> -> model label : ranger ( default )
#> -> data : 7847 rows 6 cols
#> -> target variable : 7847 values
#> -> predict function : yhat.ranger will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task multiclass ( default )
#> -> predicted values : predict function returns multiple columns: 3 ( default )
#> -> residual function : difference between 1 and probability of true class ( default )
#> -> residuals : numerical, min = 0 , mean = 0.2811637 , max = 0.907689
#> A new explainer has been created!
# set model_info
model_info <- list(package = "stats", ver = "3.6.2", type = "regression")
aps_lm_model4 <- lm(m2.price ~., data = apartments)
aps_lm_explainer4 <- explain(aps_lm_model4, data = apartments, label = "model_4v",
                             model_info = model_info)
#> Preparation of a new explainer is initiated
#> -> model label : model_4v
#> -> data : 1000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : yhat.lm will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 3.6.2 , task regression
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1781.848 , mean = 3487.019 , max = 6176.032
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
# simple function
aps_fun <- function(x) 58*x$surface
aps_fun_explainer <- explain(aps_fun, data = apartments, y = apartments$m2.price, label="sfun")
#> Preparation of a new explainer is initiated
#> -> model label : sfun
#> -> data : 1000 rows 6 cols
#> -> target variable : 1000 values
#> -> predict function : yhat.function will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package Model of class: function package unrecognized , ver. Unknown , task regression ( default )
#> -> predicted values : numerical, min = 1160 , mean = 4964.22 , max = 8700
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -7035 , mean = -1477.201 , max = 4855
#> A new explainer has been created!
model_performance(aps_fun_explainer)
#> Measures for: regression
#> mse : 9836605
#> rmse : 3136.336
#> r2 : -10.97734
#> mad : 2336.5
#>
#> Residuals:
#> 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
#> -7035.0 -5258.6 -4232.4 -3362.9 -2474.8 -1479.5 -507.6 294.2 1283.2 2156.5
#> 100%
#> 4855.0
# set sampling weights
aps_lm_explainer4 <- explain(aps_lm_model4, data = apartments, label = "model_4v",
                             weights = as.numeric(apartments$construction.year > 2000))
#> Preparation of a new explainer is initiated
#> -> model label : model_4v
#> -> data : 1000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> sampling weights : 1000 values ( note that not all explanations handle weights )
#> -> predict function : yhat.lm will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.3 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1781.848 , mean = 3487.019 , max = 6176.032
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
# more complex model
library("ranger")
aps_ranger_model4 <- ranger(m2.price ~., data = apartments, num.trees = 50)
aps_ranger_explainer4 <- explain(aps_ranger_model4, data = apartments, label = "model_ranger")
#> Preparation of a new explainer is initiated
#> -> model label : model_ranger
#> -> data : 1000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : yhat.ranger will be used ( default )
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1880.288 , mean = 3487.452 , max = 6149.699
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
aps_ranger_explainer4
#> Model label: model_ranger
#> Model class: ranger
#> Data head :
#> m2.price construction.year surface floor no.rooms district
#> 1 5897 1953 25 3 1 Srodmiescie
#> 2 1818 1992 143 9 5 Bielany
# }
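The one-vs-others behaviour described for predict_function_target_column can be illustrated on a toy probability matrix. The sketch below uses base R only and does not reproduce DALEX internals: the matrix values are made up, the class names are illustrative, and select_target_column is a hypothetical helper introduced here.

```r
# Toy class-probability matrix standing in for the output of a multiclass
# predict function: one row per observation, one named column per class.
probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.6, 0.3,
    0.2, 0.2, 0.6),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("fired", "ok", "promoted"))
)

# Hypothetical helper (not part of DALEX): pick one column, by name or by
# number, and treat it as the probability of the positive class.
select_target_column <- function(prob_matrix, target_column) {
  as.numeric(prob_matrix[, target_column])
}

p_by_name   <- select_target_column(probs, "promoted")
p_by_number <- select_target_column(probs, 3)

# One-vs-others view: the probability of all remaining classes combined.
p_others <- rowSums(probs[, colnames(probs) != "promoted", drop = FALSE])
```

Selecting by name is usually the more robust choice, since the column order of a prediction matrix depends on the factor levels seen during training.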