As mentioned in the introduction, calculating comparable correlation coefficients between variables of different types is a non-trivial problem. The CorrGrapheR package lets the user experiment with different measures of correlation through the calculate_cors function.
The num_num_f, num_cat_f and cat_cat_f arguments should be functions used to calculate correlation coefficients between the respective pairs of variable types. Each of them should take 2 arguments (vectors of equal length; for num_num_f both numeric, for cat_cat_f both factor, for num_cat_f the first numeric and the second factor) and return a single number. The result should be interpretable in the following way: the bigger its absolute value, the stronger the correlation between the variables. If it is negative, the variables are negatively correlated.
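As an illustration of this contract, a candidate for cat_cat_f could be Cramér's V, computed with base R's chisq.test. This is only a minimal sketch, not a function shipped with the package: the name cramers_v is hypothetical, and it assumes both inputs are factors with at least two observed levels each. Cramér's V is never negative, so it conveys the strength of association but not its direction.

```r
# Hypothetical helper: Cramér's V for two factors of equal length.
# Assumes each factor has at least two observed levels.
cramers_v <- function(x, y) {
  # Drop unused levels so that empty rows/columns do not produce NaN
  tab <- table(droplevels(x), droplevels(y))
  # Chi-squared statistic without continuity correction;
  # warnings about small expected counts are silenced
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  # Return a single unnamed number in [0, 1]
  as.numeric(sqrt(chi2 / (n * k)))
}

a <- factor(c("x", "y", "x", "y"))
b <- factor(c("u", "u", "v", "v"))
cramers_v(a, a)  # perfectly associated
cramers_v(a, b)  # independent in this sample
```

Because the value lies between 0 and 1, such a function would naturally be paired with max_cor = 1.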
When supplying custom functions, the max_cor argument is required. It should be a single number indicating the strongest possible correlation (like 1 for the Pearson coefficient returned by cor.test). It is used when calculating the correlation of a variable with itself, to trim accidental results above it, and to scale the final results into the \([-1,1]\) range.
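The trimming and scaling described here can be pictured with a few lines of base R. The sketch below is an assumption about the behaviour, not the package's actual code: the helper name scale_cor is hypothetical, and it assumes that values are trimmed symmetrically to the interval from -max_cor to max_cor before being divided by max_cor.

```r
# Hypothetical illustration of trimming and scaling with max_cor.
# Assumption: trimming is symmetric around zero.
scale_cor <- function(value, max_cor) {
  # Clip anything beyond +/- max_cor
  trimmed <- pmax(pmin(value, max_cor), -max_cor)
  # Rescale so that max_cor maps to 1
  trimmed / max_cor
}

raw <- c(-150, 4, 100, 250)
scale_cor(raw, max_cor = 100)
#> [1] -1.00  0.04  1.00  1.00
```

Under this reading, a raw result far above max_cor ends up as exactly 1, the same value as the correlation of a variable with itself.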
The calculate_cors function returns a correlation matrix, similarly to cor.
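For reference, the shape of such a matrix can be inspected on the output of base R's cor: a square, symmetric matrix with variable names on both dimensions and ones on the diagonal. The sketch below uses only base R and the Seatbelts data; it does not call calculate_cors, and the variable names seatbelts and m are chosen here for illustration.

```r
# Base R reference: Pearson correlation matrix of five numeric columns
seatbelts <- as.data.frame(datasets::Seatbelts)[, 1:5]
m <- cor(seatbelts)

dim(m)                 # one row and one column per variable
isSymmetric(m)         # cor(a, b) equals cor(b, a)
unname(diag(m))        # each variable is perfectly correlated with itself
```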
library(corrgrapher)
df <- as.data.frame(datasets::Seatbelts)[,1:5]
f1 <- function(x, y) cor.test(x, y, method = 'pearson')$estimate
f2 <- function(x, y) cor.test(x, y, method = 'kendall')$estimate
f3 <- function(x, y) cor.test(x, y, method = 'spearman', exact = FALSE)$estimate
calculate_cors(df, num_num_f = f1, max_cor = 1)
#>               DriversKilled    drivers      front      rear        kms
#> DriversKilled     1.0000000  0.8888264  0.7067596 0.3533510 -0.3211016
#> drivers           0.8888264  1.0000000  0.8084114 0.3436685 -0.4447631
#> front             0.7067596  0.8084114  1.0000000 0.6202248 -0.3573823
#> rear              0.3533510  0.3436685  0.6202248 1.0000000  0.3330069
#> kms              -0.3211016 -0.4447631 -0.3573823 0.3330069  1.0000000
calculate_cors(df, num_num_f = f2, max_cor = 1)
#>               DriversKilled    drivers      front      rear        kms
#> DriversKilled     1.0000000  0.6785708  0.5085310 0.2602347 -0.1870987
#> drivers           0.6785708  1.0000000  0.6262218 0.2426445 -0.2895965
#> front             0.5085310  0.6262218  1.0000000 0.4518703 -0.2270979
#> rear              0.2602347  0.2426445  0.4518703 1.0000000  0.2135214
#> kms              -0.1870987 -0.2895965 -0.2270979 0.2135214  1.0000000
calculate_cors(df, num_num_f = f3, max_cor = 1)
#>               DriversKilled    drivers      front      rear        kms
#> DriversKilled     1.0000000  0.8579854  0.6936970 0.3835558 -0.2686338
#> drivers           0.8579854  1.0000000  0.8225164 0.3641043 -0.4234322
#> front             0.6936970  0.8225164  1.0000000 0.6086036 -0.3511428
#> rear              0.3835558  0.3641043  0.6086036 1.0000000  0.3140946
#> kms              -0.2686338 -0.4234322 -0.3511428 0.3140946  1.0000000
As we can see, the user has to supply only the functions necessary for the given x argument. For a data.frame with only numeric data, only num_num_f is required.
Since correlation measures for different types of variables are not comparable in most cases, the user is required to always supply all the necessary functions. Naturally, the user might supply a function identical to the default:
data(dragons, package = 'DALEX')
f1 <- function(x, y) -log10(cor.test(x, y, method = 'spearman', exact = FALSE)$p.value)
f2 <- function(x, y) -log10(kruskal.test(x, y)$p.value)
calculate_cors(dragons, num_num_f = f1, num_cat_f = f2, max_cor = 100)
#>                      year_of_birth       height       weight       scars
#> year_of_birth         1.0000000000 0.0040428398 1.084646e-02 0.005151935
#> height                0.0040428398 1.0000000000 1.000000e+00 0.001298727
#> weight                0.0108464600 1.0000000000 1.000000e+00 0.006402918
#> scars                 0.0051519347 0.0012987272 6.402918e-03 1.000000000
#> colour                0.0068233038 0.0004765201 4.282685e-05 0.003525012
#> year_of_discovery     0.0098803614 0.0030892743 2.337062e-03 0.040962737
#> number_of_lost_teeth  0.0007526137 0.0101897329 7.369186e-03 0.004277834
#> life_length           0.0104123400 0.0071877845 1.186213e-02 1.000000000
#>                            colour year_of_discovery number_of_lost_teeth
#> year_of_birth        6.823304e-03      0.0098803614         0.0007526137
#> height               4.765201e-04      0.0030892743         0.0101897329
#> weight               4.282685e-05      0.0023370625         0.0073691861
#> scars                3.525012e-03      0.0409627374         0.0042778339
#> colour               1.000000e+00      0.0060860485         0.0081442241
#> year_of_discovery    6.086049e-03      1.0000000000         0.0001690691
#> number_of_lost_teeth 8.144224e-03      0.0001690691         1.0000000000
#> life_length          3.046990e-03      0.0184173931         1.0000000000
#>                      life_length
#> year_of_birth        0.010412340
#> height               0.007187785
#> weight               0.011862132
#> scars                1.000000000
#> colour               0.003046990
#> year_of_discovery    0.018417393
#> number_of_lost_teeth 1.000000000
#> life_length          1.000000000
calculate_cors is called inside the corrgrapher function. The user may pass custom functions to it via the cor_functions argument. It should be a named list with num_num_f, num_cat_f, cat_cat_f and max_cor elements.
corrgrapher(df, cor_functions = list(num_num_f = f1, num_cat_f = f2, max_cor = 100))
When the corrgrapher function is called on an explainer object, it internally calculates the importance of features (variables) using the ingredients::feature_importance function. The user may (and sometimes should) supply the feature_importance argument as either the result of ingredients::feature_importance called on the same explainer, or a list of arguments to be passed to ingredients::feature_importance called inside the corrgrapher function. Remember, do not change the variables and variable_groups arguments.
library(ranger)
library(DALEX)
#> Welcome to DALEX (version: 2.0).
#> Find examples and detailed introduction at: https://pbiecek.github.io/ema/
#> Additional features will be available after installation of: ggpubr.
#> Use 'install_dependencies()' to get all suggested dependencies
data("titanic_imputed", package='DALEX')
tit_model <- ranger(survived ~ ., data = titanic_imputed, num.trees = 100)
tit_model_exp <- explain(tit_model, data = titanic_imputed[,-8], y = titanic_imputed[, 8], verbose = FALSE)
tit_model_fi <- ingredients::feature_importance(tit_model_exp, B = 5, loss_function = loss_accuracy)
tit_cgr_1 <- corrgrapher(tit_model_exp, feature_importance = tit_model_fi)
tit_cgr_2 <- corrgrapher(tit_model_exp, feature_importance = list(B = 20, loss_function = loss_one_minus_auc))
tit_cgr_1
tit_cgr_2
Similarly to feature_importance, the user may (and sometimes should) supply the partial_dependency argument. Do not change the variable_type, variables or variable_splits arguments.
tit_model_pds <- ingredients::partial_dependence(tit_model_exp, grid_points = 50, N = 100)
tit_cgr_3 <- corrgrapher(tit_model_exp, feature_importance = tit_model_fi, partial_dependency = tit_model_pds)
tit_cgr_4 <- corrgrapher(tit_model_exp, feature_importance = tit_model_fi, partial_dependency = list(grid_points = 101, N = 200))
tit_cgr_3
tit_cgr_4
See the Introduction vignette for an overview of the package.