automation.Rmd
This document presents how you can customize xspliner usage to you make it more easier and automated.
In previous sections we learned how to specify parameters for the effect and transition of single variable. Let’s name them local parameters.
A quick example you can see below:
library(xspliner)
library(randomForest)
rf_iris <- randomForest(Petal.Width ~ Sepal.Length + Petal.Length + Species, data = iris)
model_xs <- xspline(Petal.Width ~
Sepal.Length +
xs(Petal.Length, effect = list(grid.resolution = 100), transition = list(bs = "cr")) +
xf(Species, transition = list(stat = "loglikelihood", value = 4)),
model = rf_iris)
summary(model_xs)
##
## Call:
## stats::glm(formula = Petal.Width ~ Sepal.Length + xs(Petal.Length) +
## xf(Species), family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.66853 -0.07237 -0.02553 0.10260 0.45607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.41081 0.22655 -6.227 4.85e-09 ***
## Sepal.Length 0.03169 0.03532 0.897 0.371
## xs(Petal.Length) 1.94243 0.36640 5.301 4.20e-07 ***
## xf(Species)versicolor 0.05708 0.17316 0.330 0.742
## xf(Species)virginica 0.36911 0.22999 1.605 0.111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03084038)
##
## Null deviance: 86.5699 on 149 degrees of freedom
## Residual deviance: 4.4719 on 145 degrees of freedom
## AIC: -89.243
##
## Number of Fisher Scoring iterations: 2
When the black box model is based on higher amount of variables it can be problematic to specify local parameters for each predictor. Also formula becomes large and hard to read.
Its more common that we use similar configuration for each variable, as it’s simple and allows us to build and experiment faster. To do this you can specify xs_opts
and xf_opts
of xspline
function. Let’s name them global parameters.
Each of global parameters can specify effect and transition, that should be used for xs and xf transformations respectively. Then you just need to use base xs and xf symbols without parameters:
model_xs <- xspline(Petal.Width ~
Sepal.Length +
xs(Petal.Length) +
xf(Species),
model = rf_iris,
xs_opts = list(effect = list(grid.resolution = 100), transition = list(bs = "cr")),
xf_opts = list(transition = list(stat = "loglikelihood", value = 4))
)
summary(model_xs)
##
## Call:
## stats::glm(formula = Petal.Width ~ Sepal.Length + xs(Petal.Length) +
## xf(Species), family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.66853 -0.07237 -0.02553 0.10260 0.45607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.41081 0.22655 -6.227 4.85e-09 ***
## Sepal.Length 0.03169 0.03532 0.897 0.371
## xs(Petal.Length) 1.94243 0.36640 5.301 4.20e-07 ***
## xf(Species)versicolor 0.05708 0.17316 0.330 0.742
## xf(Species)virginica 0.36911 0.22999 1.605 0.111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03084038)
##
## Null deviance: 86.5699 on 149 degrees of freedom
## Residual deviance: 4.4719 on 145 degrees of freedom
## AIC: -89.243
##
## Number of Fisher Scoring iterations: 2
But still you can specify local parameters that override the global ones.
model_xs <- xspline(Petal.Width ~
xs(Sepal.Length, transition = list(k = 10)) +
xs(Petal.Length) +
xf(Species),
model = rf_iris,
xs_opts = list(effect = list(grid.resolution = 100), transition = list(bs = "cr")),
xf_opts = list(transition = list(stat = "loglikelihood", value = 4))
)
summary(model_xs)
##
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) +
## xf(Species), family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.67242 -0.07811 -0.02950 0.09776 0.45778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.53622 0.26701 -5.753 5.01e-08 ***
## xs(Sepal.Length) 0.29442 0.31859 0.924 0.357
## xs(Petal.Length) 1.89794 0.39297 4.830 3.44e-06 ***
## xf(Species)versicolor 0.06961 0.17863 0.390 0.697
## xf(Species)virginica 0.38893 0.23856 1.630 0.105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03083001)
##
## Null deviance: 86.5699 on 149 degrees of freedom
## Residual deviance: 4.4704 on 145 degrees of freedom
## AIC: -89.294
##
## Number of Fisher Scoring iterations: 2
In this case last_evaluation
variable will be transformed with thin plate regression spline (bs = "tp"
is default for mgcv::s
) with basis dimension equal to 10
. At the same time average_monthly_hours
will be transformed with cubic splines.
What if you forget specifying local and global parameters?
As you can see in project objects reference, you may find there xs_opts_edfault
and xf_opts_default
objects. These objects specify default parameters for xs
and xf
transformations.
## $effect
## $effect$type
## [1] "pdp"
##
##
## $transition
## $transition$alter
## [1] "always"
##
## $transition$monotonic
## [1] "not"
## $effect
## $effect$type
## [1] "ice"
##
##
## $transition
## $transition$alter
## [1] "always"
##
## $transition$stat
## [1] "GIC"
##
## $transition$value
## [1] 3
Default parameters are overwritten by global and local ones. So the hierarchy of parameters importance is as follows:
LOCAL > GLOBAL > DEFAULT
But having model based on the huge amount of variables still requires a lot of our work to build the final model. Especially we need to use xs
and xf
symbols for each variable. How can we make it easier?
If you want to transform each predictor of specified formula and not using xs and xf variables you can omit it using consider
parameter for xspline
function.
Possible values are "specials"
the default one and "all"
. For automatic transformation of each variable without specifying xs
and xf
symbols just set consider = "all"
and pass standard formula into xspline
function:
model_xs <- xspline(Petal.Width ~ Sepal.Length + Petal.Length + Species,
model = rf_iris,
consider = "all"
)
summary(model_xs)
##
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) +
## xf(Species), family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.67513 -0.07984 -0.03156 0.09795 0.46538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.53108 0.27006 -5.669 7.49e-08 ***
## xs(Sepal.Length) 0.31926 0.32245 0.990 0.3238
## xs(Petal.Length) 1.85402 0.38730 4.787 4.13e-06 ***
## xf(Species)versicolor 0.09094 0.17533 0.519 0.6048
## xf(Species)virginica 0.41733 0.23452 1.779 0.0773 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03090657)
##
## Null deviance: 86.5699 on 149 degrees of freedom
## Residual deviance: 4.4815 on 145 degrees of freedom
## AIC: -88.922
##
## Number of Fisher Scoring iterations: 2
Then each predictor is transformed with xs and xf symbols and use of default parameters or global ones when specified.
xs
is used for integer and numeric variables - xf
for factors.
By default xspline function tries to extract the data from model (rf_model
) call and xspline’s parent.frame()
environment then uses it to determine predictor types. So to be sure that variable types are sourced correctly a good practice here is to add data parameter, storing black box’s training data.
model_xs <- xspline(Petal.Width ~ Sepal.Length + Petal.Length + Species,
model = rf_iris,
data = iris,
consider = "all"
)
summary(model_xs)
##
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) +
## xf(Species), family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.67513 -0.07984 -0.03156 0.09795 0.46538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.53108 0.27006 -5.669 7.49e-08 ***
## xs(Sepal.Length) 0.31926 0.32245 0.990 0.3238
## xs(Petal.Length) 1.85402 0.38730 4.787 4.13e-06 ***
## xf(Species)versicolor 0.09094 0.17533 0.519 0.6048
## xf(Species)virginica 0.41733 0.23452 1.779 0.0773 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03090657)
##
## Null deviance: 86.5699 on 149 degrees of freedom
## Residual deviance: 4.4815 on 145 degrees of freedom
## AIC: -88.922
##
## Number of Fisher Scoring iterations: 2
In some cases you may want to transform only quantitative or qualitative predictors. Looking into default parameters xs_opts_default
and xf_opts_default
we may find alter
parameter for transition.
The parameter is used to specify if predictor for which xs
or xf
was specified needs to be transformed or used as a bare variable in formula. You can specify it in the local or global transition parameter. In this case using the global one is more reasonable.
So, in order to transform only continuous variables just set alter = "always"
(what is default) for xs_opts
and alter = "never"
for xf_opts
:
model_xs <- xspline(Petal.Width ~ Sepal.Length + Petal.Length + Species,
model = rf_iris,
data = iris,
consider = "all",
xf_opts = list(transition = list(alter = "never"))
)
summary(model_xs)
##
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) +
## Species, family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.67513 -0.07984 -0.03156 0.09795 0.46538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.53108 0.27006 -5.669 7.49e-08 ***
## xs(Sepal.Length) 0.31926 0.32245 0.990 0.3238
## xs(Petal.Length) 1.85402 0.38730 4.787 4.13e-06 ***
## Speciesversicolor 0.09094 0.17533 0.519 0.6048
## Speciesvirginica 0.41733 0.23452 1.779 0.0773 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03090657)
##
## Null deviance: 86.5699 on 149 degrees of freedom
## Residual deviance: 4.4815 on 145 degrees of freedom
## AIC: -88.922
##
## Number of Fisher Scoring iterations: 2
For transformation of factors only:
model_xs <- xspline(Petal.Width ~ Sepal.Length + Petal.Length + Species,
model = rf_iris,
data = iris,
consider = "all",
xs_opts = list(transition = list(alter = "never"))
)
summary(model_xs)
##
## Call:
## stats::glm(formula = Petal.Width ~ Sepal.Length + Petal.Length +
## xf(Species), family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.63796 -0.07848 -0.01257 0.09863 0.47749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.084592 0.172136 -0.491 0.62387
## Sepal.Length -0.001693 0.044135 -0.038 0.96945
## Petal.Length 0.231921 0.052796 4.393 2.15e-05 ***
## xf(Species)versicolor 0.432659 0.125048 3.460 0.00071 ***
## xf(Species)virginica 0.834121 0.173212 4.816 3.65e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03249376)
##
## Null deviance: 86.5699 on 149 degrees of freedom
## Residual deviance: 4.7116 on 145 degrees of freedom
## AIC: -81.41
##
## Number of Fisher Scoring iterations: 2
Even having learned the above listed options, we still can have problems with handling model with a huge amount of variables.
For many existing models in R we usually can specify “dot formulas”, when used predictors are sourced from provided data. xspliner can also handle the form. Let’s return here for iris random forest model.
##
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) +
## xf(Species), family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.67513 -0.07984 -0.03156 0.09795 0.46538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.53108 0.27006 -5.669 7.49e-08 ***
## xs(Sepal.Length) 0.31926 0.32245 0.990 0.3238
## xs(Petal.Length) 1.85402 0.38730 4.787 4.13e-06 ***
## xf(Species)versicolor 0.09094 0.17533 0.519 0.6048
## xf(Species)virginica 0.41733 0.23452 1.779 0.0773 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03090657)
##
## Null deviance: 86.5699 on 149 degrees of freedom
## Residual deviance: 4.4815 on 145 degrees of freedom
## AIC: -88.922
##
## Number of Fisher Scoring iterations: 2
Good practice here is to provide data
parameter as well to detect predictors classes, and model type (classification or regression).
How predictors are sourced?
xspline
function tries to establish data in the following order (if any way is not possible, it tries the next one):
To assure correct predictors usage, you may also specify predictor names vector in predictors
parameter, and data (optional) to assure source of variable classes:
model_xs <- xspline(Petal.Width ~ .,
model = rf_iris,
predictors = colnames(iris)[-c(2, 4)],
data = iris)
summary(model_xs)
##
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) +
## xf(Species), family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.67513 -0.07984 -0.03156 0.09795 0.46538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.53108 0.27006 -5.669 7.49e-08 ***
## xs(Sepal.Length) 0.31926 0.32245 0.990 0.3238
## xs(Petal.Length) 1.85402 0.38730 4.787 4.13e-06 ***
## xf(Species)versicolor 0.09094 0.17533 0.519 0.6048
## xf(Species)virginica 0.41733 0.23452 1.779 0.0773 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03090657)
##
## Null deviance: 86.5699 on 149 degrees of freedom
## Residual deviance: 4.4815 on 145 degrees of freedom
## AIC: -88.922
##
## Number of Fisher Scoring iterations: 2
In above examples each predictor is transformed by default. You can exclude needed, by specifying global alter = "never"
parameters, or bare
.
This way we are ready to work with many predictors. Can the approach be simpler?
As we could see in previous section, using dot formula the predictors are sourced from provided black box. Why cannot we fully extract formula from black box? We can.
Let’s use previously built rf_iris
model:
##
## Call:
## stats::glm(formula = Petal.Width ~ xs(Sepal.Length) + xs(Petal.Length) +
## xf(Species), family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.67513 -0.07984 -0.03156 0.09795 0.46538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.53108 0.27006 -5.669 7.49e-08 ***
## xs(Sepal.Length) 0.31926 0.32245 0.990 0.3238
## xs(Petal.Length) 1.85402 0.38730 4.787 4.13e-06 ***
## xf(Species)versicolor 0.09094 0.17533 0.519 0.6048
## xf(Species)virginica 0.41733 0.23452 1.779 0.0773 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03090657)
##
## Null deviance: 86.5699 on 149 degrees of freedom
## Residual deviance: 4.4815 on 145 degrees of freedom
## AIC: -88.922
##
## Number of Fisher Scoring iterations: 2
Works! Can it be simpler? Actually not because of black box based transformation and theory, but we can provide some model based parameters upfront using DALEX’s explainer
object (see next section).
Unfortunately this simplicity works thanks to a few approaches that randomForest follows.
They are:
Most of R based ML models satisfy the above condition. One of exceptions is XGBoost which in actual xspliner version needs to be handled in the more complex xspline call. You can see it in use cases
For this example consider again Boston Housing Data from pdp package, and build simple svm
model for predicting chas
variable:
## Warning: package 'e1071' was built under R version 3.5.2
data(boston)
svm_boston <- svm(chas ~ cmedv + rad + lstat, data = boston, probability = TRUE)
str(boston[, "rad"])
## int [1:506] 1 2 2 3 3 3 5 5 5 5 ...
## [1] 1 2 3 5 4 8 6 7 24
As we can see rad
variable is integer and has only 9 unique values. As a result spline approximation may be misleading, and not possible to perform. We decide here to omit rad
variable when performing transformation, nevertheless remaining predictors should be transformed.
At first setup model based transformation:
## Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): A term has fewer unique covariate combinations than specified maximum degrees of freedom
As we can see, the error was returned due to the form of rad
variable.
How to exclude rad
from transformation? We can use xspline’s bare
parameter, responsible for specifying predictor which shouldn’t be transformed.
##
## Call:
## stats::glm(formula = chas ~ xs(cmedv) + rad + xs(lstat), family = family,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8194 -0.3415 -0.3391 -0.3367 2.4276
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.047e+01 7.766e+01 -0.907 0.36417
## xs(cmedv) -6.115e+01 2.365e+01 -2.586 0.00972 **
## rad -8.446e-04 2.149e-02 -0.039 0.96864
## xs(lstat) 9.013e+00 7.390e+01 0.122 0.90294
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 254.5 on 505 degrees of freedom
## Residual deviance: 244.4 on 502 degrees of freedom
## AIC: 252.4
##
## Number of Fisher Scoring iterations: 5
As we can see in the summary, rad variable was omitted during transformation.
Note
The option is available only for xspline build on top of model or dot formula.
As mentioned in the previous section, xspliner is integrated with DALEX package. The main function from the package explain
returns useful black box data (such as bare black model or training data) that can be used by xspline function.
Just check below example
library(DALEX)
rf_boston <- randomForest(lstat ~ cmedv + crim + chas, data = boston)
explainer <- explain(rf_boston, label = "boston")
## Preparation of a new explainer is initiated
## -> model label : boston
## -> data : 506 rows 4 cols ([33mextracted from the model[39m)
## -> target variable : not specified! ( [31m WARNING [39m )
## -> predict function : yhat.randomForest will be used ( [33m default [39m )
## -> predicted values : numerical, min = 6.008104 , mean = 12.64965 , max = 24.55393
## -> residual function : difference between y and yhat ( [33m default [39m )
## [32m A new explainer has been created! [39m
##
## Call:
## stats::glm(formula = lstat ~ xs(cmedv) + xs(crim) + xf(chas),
## family = family, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -11.5431 -2.3511 -0.6675 1.7577 21.2074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.2225 1.3781 -11.771 < 2e-16 ***
## xs(cmedv) 1.7004 0.0725 23.452 < 2e-16 ***
## xs(crim) 0.5662 0.1377 4.112 4.59e-05 ***
## xf(chas)1 1.4065 0.6951 2.023 0.0436 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 15.44275)
##
## Null deviance: 25752.4 on 505 degrees of freedom
## Residual deviance: 7752.3 on 502 degrees of freedom
## AIC: 2826.9
##
## Number of Fisher Scoring iterations: 2
You can provide your own xspline’s parameters that overwrite that sourced from explainer.