Prepares the provided dataset for the training process. It converts the data into a form suitable for the training functions, splits it into train, test and validation sets, and provides the other data objects needed for training.
preprocess( data, target_name, sensitive_name, privileged, discriminated, drop_also = NULL, sample = 1, train_size = 0.7, test_size = 0.3, validation_size = 0, seed = NULL )
data | list representing whole table of data (categorical variables must be factors). |
---|---|
target_name | character, column name of the target variable. Selected column must be interpretable as categorical. |
sensitive_name | character, column name of the sensitive variable. Selected column must be interpretable as categorical. |
privileged | character, name of the privileged group. |
discriminated | character, name of the discriminated group. |
drop_also | character vector, column names of other columns to drop (like other sensitive variables). |
sample | double in [0,1], fraction of the original data set to sample. Default: 1 |
train_size | double in [0,1], fraction of the data used for the train set. Note that train_size + test_size + validation_size must equal 1. Default: 0.7 |
test_size | double in [0,1], fraction of the data used for the test set. Note that train_size + test_size + validation_size must equal 1. Default: 0.3 |
validation_size | double in [0,1], fraction of the data used for the validation set. Note that train_size + test_size + validation_size must equal 1. Default: 0 |
seed | seed set for the sampling, so that results can be reproduced. Default: NULL |
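
The interaction of sample, the three size arguments and seed can be illustrated with a minimal base-R sketch. This is not the package's implementation: `split_rows` is a hypothetical helper, and the rounding of group sizes is an assumption made only for this illustration.

```r
# Illustrative only: `split_rows` is a hypothetical helper, not part of the
# documented package. It returns row indices for the three subsets.
split_rows <- function(n, sample = 1, train_size = 0.7, test_size = 0.3,
                       validation_size = 0, seed = NULL) {
  # The three fractions must sum to 1 (allowing for floating-point error).
  stopifnot(abs(train_size + test_size + validation_size - 1) < 1e-8)
  if (!is.null(seed)) set.seed(seed)

  # Draw the subsample in random order, so consecutive slices are random subsets.
  n_used <- floor(n * sample)
  idx <- sample.int(n, n_used)

  # Cumulative cut points (assumed rounding rule) keep the slices disjoint.
  cut_train <- round(n_used * train_size)
  cut_test <- round(n_used * (train_size + test_size))

  list(
    train = idx[seq_len(cut_train)],
    test = idx[cut_train + seq_len(cut_test - cut_train)],
    valid = idx[cut_test + seq_len(n_used - cut_test)]
  )
}

# Use 80% of 100 rows, then split that subsample 60/40 into train and test.
parts <- split_rows(n = 100, sample = 0.8, train_size = 0.6, test_size = 0.4,
                    seed = 7)
lengths(parts)
```

Calling the helper twice with the same seed yields identical indices, which is the reproducibility the seed argument is meant to provide.
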
A list of prepared data with the following elements:

train_x | numeric scaled matrix for classifier training |
---|---|
train_y | numeric scaled vector for classifier training |
sensitive_train | numeric scaled vector for adversaries training |
test_x | numeric scaled matrix for classifier testing |
test_y | numeric scaled vector for classifier testing |
sensitive_test | numeric scaled vector for adversaries testing |
valid_x | numeric scaled matrix for classifier validation |
valid_y | numeric scaled vector for classifier validation |
sensitive_valid | numeric scaled vector for adversaries validation |
data_scaled_test | numeric scaled data set for testing |
data_scaled_valid | numeric scaled data set for validation |
data_test | whole dataset for testing, unchanged |
protected_test | character vector of protected values for explainers test |
data_valid | whole dataset for validation, unchanged |
protected_valid | character vector of protected values for explainers valid |

WARNING! So far, the code in the other functions is not fully prepared for a validation dataset and is designed to use the test set for both testing and validation. Users who understand the implications can, however, use the validation set in place of the test set where they are sure it makes sense.
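
One way to use the validation set in place of the test set is to copy each validation element over its test counterpart before passing the list on. The sketch below assumes only the element names listed in the return value. `use_validation_as_test` is a hypothetical helper, not part of the package, and the toy list merely stands in for real preprocess() output.

```r
# Hypothetical helper (not part of the package): overwrite every test element
# with its validation counterpart, leaving all other elements untouched.
use_validation_as_test <- function(prepared) {
  pairs <- c(
    test_x = "valid_x",
    test_y = "valid_y",
    sensitive_test = "sensitive_valid",
    data_scaled_test = "data_scaled_valid",
    data_test = "data_valid",
    protected_test = "protected_valid"
  )
  absent <- setdiff(pairs, names(prepared))
  if (length(absent) > 0) {
    stop("missing validation elements: ", paste(absent, collapse = ", "))
  }
  # Assumption: an empty validation set (validation_size = 0) has length 0.
  if (length(prepared$valid_y) == 0) {
    stop("validation set is empty; was validation_size set to 0?")
  }
  for (test_name in names(pairs)) {
    prepared[[test_name]] <- prepared[[pairs[[test_name]]]]
  }
  prepared
}

# Toy stand-in for the list returned by preprocess(); values are made up.
prepared <- list(
  test_x = matrix(0, 2, 2), test_y = c(0, 0), sensitive_test = c(0, 0),
  valid_x = matrix(1, 3, 2), valid_y = c(1, 1, 1), sensitive_valid = c(1, 1, 1),
  data_scaled_test = data.frame(a = c(0, 0)),
  data_scaled_valid = data.frame(a = c(1, 1, 1)),
  data_test = data.frame(a = c("x", "y")),
  data_valid = data.frame(a = c("x", "y", "z")),
  protected_test = c("A", "B"),
  protected_valid = c("A", "B", "A")
)
swapped <- use_validation_as_test(prepared)
```

Because the helper returns a modified copy, the original list still holds the real test data for a final evaluation.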