Preprocesses data for training — preprocess • fairpan

Prepares the provided dataset for the training process. It converts the data into a form suitable for the training functions, splits it into train, test and validation sets, and returns the other data objects needed for training.

preprocess(
  data,
  target_name,
  sensitive_name,
  privileged,
  discriminated,
  drop_also = NULL,
  sample = 1,
  train_size = 0.7,
  test_size = 0.3,
  validation_size = 0,
  seed = NULL
)

Arguments

data	list representing the whole data table (categorical variables must be factors).
target_name	character, column name of the target variable. Selected column must be interpretable as categorical.
sensitive_name	character, column name of the sensitive variable. Selected column must be interpretable as categorical.
privileged	character, name of the privileged group (a value of the sensitive variable).
discriminated	character, name of the discriminated group (a value of the sensitive variable).
drop_also	character vector, names of other columns to drop (for example other sensitive variables). Default: NULL
sample	double from [0,1], fraction of the original data set to sample. Default: 1
train_size	double from [0,1], fraction of the sampled data used for the train set. Note that train_size + test_size + validation_size must equal 1. Default: 0.7
test_size	double from [0,1], fraction of the sampled data used for the test set. Note that train_size + test_size + validation_size must equal 1. Default: 0.3
validation_size	double from [0,1], fraction of the sampled data used for the validation set. Note that train_size + test_size + validation_size must equal 1. Default: 0
seed	seed for the sampling, set for reproducibility. Default: NULL
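
The relationship between sample, the three size arguments and seed can be illustrated with a small base R sketch. The helper split_indices below is hypothetical and is not part of fairpan; it only shows one plausible way such arguments could turn into row indices, assuming the sizes are fractions of the sampled rows.

```r
# Hypothetical helper, NOT fairpan's implementation: it only illustrates
# how `sample`, the three size fractions and `seed` relate to each other.
split_indices <- function(n, sample = 1, train_size = 0.7, test_size = 0.3,
                          validation_size = 0, seed = NULL) {
  # The three fractions must add up to 1 (compared with a numeric tolerance).
  stopifnot(isTRUE(all.equal(train_size + test_size + validation_size, 1)))
  if (!is.null(seed)) set.seed(seed)

  # Keep a random fraction `sample` of the n rows, in random order.
  kept <- sample.int(n, size = round(n * sample))
  n_kept <- length(kept)

  # Sizes of the train and test parts; validation takes the remainder.
  n_train <- round(n_kept * train_size)
  n_test <- min(round(n_kept * test_size), n_kept - n_train)
  n_valid <- n_kept - n_train - n_test

  list(
    train = kept[seq_len(n_train)],
    test  = kept[n_train + seq_len(n_test)],
    valid = kept[n_train + n_test + seq_len(n_valid)]
  )
}

idx <- split_indices(1000, sample = 0.5, train_size = 0.6,
                     test_size = 0.3, validation_size = 0.1, seed = 7)
lengths(idx)
```

With the default validation_size = 0 the valid element is simply empty, which matches the default train/test-only setup described above.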

Value

list of prepared data (
train_x - numeric scaled matrix for classifier training,
train_y - numeric scaled vector for classifier training,
sensitive_train - numeric scaled vector for adversaries training,
test_x - numeric scaled matrix for classifier testing,
test_y - numeric scaled vector for classifier testing,
sensitive_test - numeric scaled vector for adversaries testing,
valid_x - numeric scaled matrix for classifier validation,
valid_y - numeric scaled vector for classifier validation,
sensitive_valid - numeric scaled vector for adversaries validation,
data_scaled_test - numeric scaled data set for testing,
data_scaled_valid - numeric scaled data set for validation,
data_test - whole dataset for testing, unchanged,
protected_test - character vector of protected values for explainers (test set),
data_valid - whole dataset for validation, unchanged,
protected_valid - character vector of protected values for explainers (validation set)
)

Details

WARNING! The other functions in the package are not yet fully prepared for a validation dataset; they are designed to use the test set for both testing and validation. Users who understand the implications can, however, pass the validation set in place of the test set where that makes sense.

Examples

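# Example data set shipped with the fairmodels package
adult <- fairmodels::adult

# Predict "salary", treat "sex" as the sensitive variable and drop "race";
# use 5% of the rows, split 65/35 into train and test, no validation set
processed <-
  preprocess(
    data = adult,
    target_name = "salary",
    sensitive_name = "sex",
    privileged = "Male",
    discriminated = "Female",
    drop_also = c("race"),
    sample = 0.05,
    train_size = 0.65,
    test_size = 0.35,
    validation_size = 0,
    seed = 7
  )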
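
When starting from your own table rather than fairmodels::adult, the requirement that categorical variables be factors can be met with base R before calling preprocess. The following is a minimal sketch; the toy data frame and its column values are made up for illustration only.

```r
# Toy data, invented for illustration; not taken from any real data set.
toy <- data.frame(
  salary = c("<=50K", ">50K", "<=50K", ">50K"),
  sex    = c("Male", "Female", "Female", "Male"),
  age    = c(39, 50, 38, 53),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are left alone.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# The privileged and discriminated names should appear among the levels
# of the sensitive column.
all(c("Male", "Female") %in% levels(toy$sex))

str(toy)
```

Checking the levels up front is a cheap way to catch a misspelled privileged or discriminated value before it reaches preprocess.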