scikit-learn models can be loaded into an R environment like any other Python object. This function helps to inspect the performance of a Python model and compare it with other models using R tools such as DALEX. It creates an explainer object that serves as an easily accessible R version of a scikit-learn model exported from Python via a pickle file.
explain_scikitlearn(
path,
yml = NULL,
condaenv = NULL,
env = NULL,
data = NULL,
y = NULL,
weights = NULL,
predict_function = NULL,
predict_function_target_column = NULL,
residual_function = NULL,
...,
label = NULL,
verbose = TRUE,
precalculate = TRUE,
colorize = !isTRUE(getOption("knitr.in.progress")),
model_info = NULL,
type = NULL
)
path: Path to the pickle file. Can be used without other arguments if you are sure that the active Python version matches the version used to create the pickle.
yml: Path to the yml file. The conda virtual env will be recreated from this file. On Windows, conda has to be added to the PATH first.
condaenv: If yml is provided, a path to the main conda folder. If yml is NULL, the name of an existing conda environment.
env: Path to a Python virtual environment.
data: data.frame or matrix with the data used to calculate the explanations. If not provided, it will be extracted from the model. Data should be passed without the target column (which should be provided as the y argument). NOTE: if the target variable is present in data, some functionalities may not work properly.
y: Numeric vector with outputs/scores. If provided, it should have the same size as data.
weights: Numeric vector with sampling weights. NULL by default. If provided, it should have the same length as data.
predict_function: Function that takes two arguments, the model and new data, and returns a numeric vector with predictions. By default it is yhat.
predict_function_target_column: Character or numeric giving either the column name or the column number, in the model prediction object, of the class that should be considered positive (i.e. the class associated with probability 1). If NULL, the second column of the output is used for binary classification. In a multiclass classification setting, this parameter switches the explainer to binary classification mode with one-vs-others probabilities.
residual_function: Function that takes four arguments: model, data, target vector y and, optionally, the predict function. It should return a numeric vector with model residuals for the given data. If not provided, response residuals (\(y-\hat{y}\)) are calculated. By default it is residual_function_default.
...: Other parameters.
label: Character, the name of the model. By default it is extracted from the 'class' attribute of the model.
verbose: Logical. If TRUE (default), diagnostic messages are printed.
precalculate: Logical. If TRUE (default), predicted_values and residual are calculated when the explainer is created. This also happens if verbose is TRUE. Set both verbose and precalculate to FALSE to skip the calculations.
colorize: Logical. If TRUE, WARNINGS, ERRORS and NOTES are colorized. Works only in the R console. By default it is FALSE while knitting and TRUE otherwise.
model_info: A named list (package, version, type) containing information about the model. If NULL, DALEX will look for this information on its own.
type: Type of the model, either classification or regression. If not specified, type is extracted from model_info.
An object of the class 'explainer'. It has an additional field, param_set, where the user can check the parameters of the scikit-learn model.
Example of Python code
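import pickle

import pandas as pd
import sklearn.ensemble

# titanic_train_X and titanic_train_Y (training features and target)
# are assumed to be loaded beforehand, e.g. with pandas.
model = sklearn.ensemble.GradientBoostingClassifier()
model = model.fit(titanic_train_X, titanic_train_Y)

# Protocol 2 is the highest pickle protocol that Python 2 can also read.
with open("gbm.pkl", "wb") as f:
    pickle.dump(model, f, protocol=2)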
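Since a pickle can only be loaded reliably when the Python and library versions match, it may help to record those versions at export time. The following is a minimal sketch using only the standard library. `describe_pickle_environment` is a hypothetical helper introduced here for illustration, not part of DALEXtra or scikit-learn, and the package list is an assumption to adjust to the model's actual dependencies.

```python
import platform
from importlib import metadata


def describe_pickle_environment(packages=("scikit-learn", "numpy"), protocol=2):
    """Collect interpreter and package versions that a pickle depends on.

    Hypothetical helper: the default package names are an assumption.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current interpreter.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "pickle_protocol": protocol,
        "packages": versions,
    }


info = describe_pickle_environment()
print(info)
```

Comparing this dictionary between the machine that created the pickle and the one that loads it is one way to spot a version mismatch before calling explain_scikitlearn.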
To export the environment to a .yml file, activate the virtual env with activate name_of_the_env
and then run the following shell command: conda env export > environment.yml
Errors use case
Here is a shortened version of the solutions for specific errors.
An environment with the name specified in the given .yml file already exists
If you provide a .yml file whose header contains a name identical to the name of an environment that already exists, the existing environment will be set active without being changed.
There are two ways of solving this issue, both using the Anaconda Prompt. The first is to remove the conda env with the command: conda env remove --name myenv
and then execute the function once again. The second is to update the env via: conda env create -f environment.yml
Conda cannot find the specified packages in the channels you have provided
This error may be caused by many things. One of them is that the specified version is too old to be available from the official conda repository.
Edit your .yml file and add a link to the proper repository in the channels section.
The issue may also be connected with the platform. If the model was created on a platform with a different OS, you may need to remove the build-specific part of the version from the .yml file.
- numpy=1.16.4=py36h19fb1c0_0
- numpy-base=1.16.4=py36hc3f5095_0
In the example above you have to remove =py36h19fb1c0_0
and =py36hc3f5095_0
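The removal described above can be sketched in Python. This is an illustrative sketch, not a DALEXtra or conda utility: `strip_build_strings` is a hypothetical helper, it assumes conda pins have the form name=version=build, and the pip entry in the sample text is made up for the demonstration.

```python
def strip_build_strings(yml_text):
    """Drop the trailing '=build' part from conda dependency lines.

    Hypothetical helper: assumes conda pins look like
    '- name=version=build'. pip pins ('name==version') are left alone.
    """
    cleaned_lines = []
    for line in yml_text.splitlines():
        entry = line.strip()
        is_conda_pin = (
            entry.startswith("- ")
            and "==" not in entry
            and entry.count("=") == 2
        )
        if is_conda_pin:
            # Keep everything before the last '=' (name=version).
            line = line.rsplit("=", 1)[0]
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


# Sample text; 'some-package==1.0' is a made-up pip entry.
example = """dependencies:
  - numpy=1.16.4=py36h19fb1c0_0
  - numpy-base=1.16.4=py36hc3f5095_0
  - pip:
    - some-package==1.0"""

cleaned = strip_build_strings(example)
print(cleaned)
```

Only lines with exactly two single equals signs are modified, so plain name=version pins and pip entries pass through unchanged.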
If some packages are not available from Anaconda at all, use a pip statement.
If the .yml file does not seem to work, the virtual env can be created manually using the Anaconda Prompt:
conda create -n name_of_env python=3.4
conda install -n name_of_env name_of_package=0.20
if (FALSE) {
  library(DALEXtra)
  if (Sys.info()["sysname"] != "Darwin") {
    # Build the explainer (keep in mind that the 18th column is the target)
    titanic_test <- read.csv(system.file("extdata", "titanic_test.csv", package = "DALEXtra"))
    # Keep in mind that when the pickle is built and loaded,
    # not only the Python version but also the library versions have to match
    explainer <- explain_scikitlearn(system.file("extdata", "scikitlearn.pkl", package = "DALEXtra"),
      yml = system.file("extdata", "testing_environment.yml", package = "DALEXtra"),
      data = titanic_test[, 1:17], y = titanic_test$survived)
    plot(model_performance(explainer))
    # Predictions with new data
    predict(explainer, titanic_test[1:10, 1:17])
  }
}