Performing Feature Selection on the Dataset with Transformed Variables

The safely_select_variables() function selects variables from dataset returned by safely_transform_data() function. For each original variable exactly one variable is chosen

either original one or transformed one. The choice is based on the AIC value for linear model (regression) or logistic regression (classification).

safely_select_variables(
  safe_extractor,
  data,
  y = NULL,
  which_y = NULL,
  class_pred = NULL,
  verbose = TRUE
)

Arguments

safe_extractor: object containing information about variables transformations created with safe_extraction() function
data: data, original dataset or the one returned by safely_transform_data() function. If data do not contain transformed variables then transformation is done inside this function using 'safe_extractor' argument. Data may contain response variable or not - if it does then 'which_y' argument must be given, otherwise 'y' argument should be provided.
y: vector of responses, must be given if data does not contain it
which_y: numeric or character (optional), must be given if data contains response values
class_pred: numeric or character, used only in multi-classification problems. If response vector has more than two levels, then 'class_pred' should indicate the class of interest which will denote failure - all other classes will stand for success.
verbose: logical, if progress bar is to be printed

Value

vector of variables names, selected based on AIC values

Examples


library(DALEX)
library(randomForest)
library(rSAFE)

data <- apartments[1:500,]
set.seed(111)
model_rf <- randomForest(m2.price ~ construction.year + surface + floor +
                           no.rooms + district, data = data)
explainer_rf <- explain(model_rf, data = data[,2:6], y = data[,1])
#> Preparation of a new explainer is initiated
#>   -> model label       :  randomForest  (  default  )
#>   -> data              :  500  rows  5  cols 
#>   -> target variable   :  500  values 
#>   -> predict function  :  yhat.randomForest  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package randomForest , ver. 4.7.1.1 , task regression (  default  ) 
#>   -> predicted values  :  numerical, min =  2010.939 , mean =  3502.345 , max =  5764.513  
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  -387.9388 , mean =  -0.6372461 , max =  749.0998  
#>   A new explainer has been created!  
safe_extractor <- safe_extraction(explainer_rf, verbose = FALSE)
safely_select_variables(safe_extractor, data, which_y = "m2.price", verbose = FALSE)
#> [1] "surface"               "floor"                 "no.rooms"             
#> [4] "construction.year_new" "district_new"

Performing Feature Selection on the Dataset with Transformed Variables

Arguments

Value

See also

Examples