R/check_data.R
check_data.Rd
Run the data check pipeline to look for potential problems in the data.
check_data(data, y, verbose = TRUE)
data: A data source in one of the major R formats: data.table, data.frame, matrix, and so on.
y: A string giving the name of the target column.
verbose: A logical value. If TRUE, all information about the process is printed; if FALSE, nothing is printed.
A list with two vectors: lines of the report (str) and the outliers (outliers).
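The returned list can be used programmatically, for instance to drop the flagged rows before modelling. The sketch below is not part of the package: it builds a stand-in list by hand with the same two elements, and it assumes that `outliers` holds row positions. `drop_outliers` is a hypothetical helper introduced only for illustration.

```r
# Stand-in for the value returned by check_data(); built by hand so the
# sketch runs without the package. The outlier value 16 is illustrative.
report <- list(
  str      = c("**CHECK DATA REPORT**", " 16 ;"),
  outliers = c(16)
)

# Hypothetical helper: remove flagged rows, assuming `outliers` holds row
# positions. Negative indexing (df[-idx, ]) returns zero rows when `idx`
# is empty, so the rows to keep are computed explicitly instead.
drop_outliers <- function(df, outliers) {
  keep <- setdiff(seq_len(nrow(df)), outliers)
  df[keep, , drop = FALSE]
}

cleaned   <- drop_outliers(iris, report$outliers)
untouched <- drop_outliers(iris, numeric(0))

nrow(cleaned)    # 149
nrow(untouched)  # 150
```

Computing the rows to keep, rather than negating the outlier vector, keeps the helper safe for the `numeric(0)` case that the function returns when nothing is flagged.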
check_data(iris[1:100, ], 'Species')
#> -------------------- CHECK DATA REPORT --------------------
#>
#> The dataset has 100 observations and 5 columns, which names are:
#> Sepal.Length; Sepal.Width; Petal.Length; Petal.Width; Species;
#>
#> With the target value described by a column Species.
#>
#> ✔ No static columns.
#>
#> ✔ No duplicate columns.
#>
#> ✔ No target values are missing.
#>
#> ✔ No predictor values are missing.
#>
#> ✔ No issues with dimensionality.
#>
#> ✖ Strongly correlated, by Spearman rank, pairs of numerical values are:
#>
#> Sepal.Length - Petal.Length: 0.81;
#> Sepal.Length - Petal.Width: 0.79;
#> Petal.Length - Petal.Width: 0.98;
#>
#> ✔ No outliers in the dataset.
#>
#> ✔ Dataset is balanced.
#>
#> ✔ Columns names suggest that none of them are IDs.
#>
#> ✔ Columns data suggest that none of them are IDs.
#>
#> -------------------- CHECK DATA REPORT END --------------------
#>
#> $str
#> [1] " -------------------- **CHECK DATA REPORT** -------------------- "
#> [2] " "
#> [3] "**The dataset has 100 observations and 5 columns which names are: **"
#> [4] ""
#> [5] "Sepal.Length; Sepal.Width; Petal.Length; Petal.Width; Species; "
#> [6] ""
#> [7] "**With the target value described by a column:** Species."
#> [8] " "
#> [9] "**No static columns. **"
#> [10] ""
#> [11] ""
#> [12] "**No duplicate columns.**"
#> [13] ""
#> [14] "**No target values are missing. **"
#> [15] ""
#> [16] "**No predictor values are missing. **"
#> [17] ""
#> [18] "**No issues with dimensionality. **"
#> [19] ""
#> [20] "**Strongly correlated, by Spearman rank, pairs of numerical values are: **"
#> [21] ""
#> [22] " Sepal.Length - Petal.Length: 0.81;"
#> [23] " Sepal.Length - Petal.Width: 0.79;"
#> [24] " Petal.Length - Petal.Width: 0.98;"
#> [25] ""
#> [26] "**No outliers in the dataset. **"
#> [27] ""
#> [28] "**Dataset is balanced. **"
#> [29] ""
#> [30] "**Columns names suggest that none of them are IDs. **"
#> [31] ""
#> [32] "**Columns data suggest that none of them are IDs. **"
#> [33] ""
#> [34] ""
#> [35] " -------------------- **CHECK DATA REPORT END** -------------------- "
#> [36] " "
#>
#> $outliers
#> numeric(0)
#>
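The correlation section of the report lists pairs of numeric columns. A rough way to build a similar list with base R is sketched below. The 0.7 cut-off and the helper name `strong_pairs` are assumptions, not the package's implementation, so the values need not match the report exactly.

```r
# Hypothetical helper: list pairs of numeric columns whose absolute Spearman
# rank correlation exceeds `threshold` (0.7 is an assumed cut-off).
strong_pairs <- function(df, threshold = 0.7) {
  num <- df[vapply(df, is.numeric, logical(1))]
  rho <- cor(num, method = "spearman")
  # Look at the upper triangle only, so each unordered pair appears once.
  idx <- which(upper.tri(rho) & abs(rho) > threshold, arr.ind = TRUE)
  data.frame(
    first  = rownames(rho)[idx[, "row"]],
    second = colnames(rho)[idx[, "col"]],
    rho    = round(rho[idx], 2)
  )
}

flagged <- strong_pairs(iris[1:100, ])
print(flagged)
```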
check_data(lisbon, 'Price')
#> -------------------- CHECK DATA REPORT --------------------
#>
#> The dataset has 246 observations and 17 columns, which names are:
#> Id; Condition; PropertyType; PropertySubType; Bedrooms; Bathrooms; AreaNet; AreaGross; Parking; Latitude; Longitude; Country; District; Municipality; Parish; Price.M2; Price;
#>
#> With the target value described by a column Price.
#>
#> ✖ Static columns are:
#> Country; District; Municipality;
#>
#> ✖ With dominating values:
#> Portugal; Lisboa; Lisboa;
#>
#> ✖ These column pairs are duplicate:
#> District - Municipality;
#>
#> ✔ No target values are missing.
#>
#> ✔ No predictor values are missing.
#>
#> ✔ No issues with dimensionality.
#>
#> ✖ Strongly correlated, by Spearman rank, pairs of numerical values are:
#>
#> Bedrooms - AreaNet: 0.77;
#> Bedrooms - AreaGross: 0.77;
#> Bathrooms - AreaNet: 0.78;
#> Bathrooms - AreaGross: 0.78;
#> AreaNet - AreaGross: 1;
#>
#> ✖ Strongly correlated, by Crammer's V rank, pairs of categorical values are:
#> PropertyType - PropertySubType: 1;
#>
#> ✖ These obserwation migth be outliers due to their numerical columns values:
#> 145 146 196 44 5 51 57 58 59 60 61 62 63 64 69 75 76 77 78 ;
#>
#> ✖ Target data is not evenly distributed with quantile bins: 0.25 0.35 0.14 0.26
#>
#> ✖ Columns names suggest that some of them are IDs, removing them can improve the model.
#> Suspicious columns are: Id .
#>
#> ✖ Columns data suggest that some of them are IDs, removing them can improve the model.
#> Suspicious columns are: Id .
#>
#> -------------------- CHECK DATA REPORT END --------------------
#>
#> $str
#> [1] " -------------------- **CHECK DATA REPORT** -------------------- "
#> [2] " "
#> [3] "**The dataset has 246 observations and 17 columns which names are: **"
#> [4] ""
#> [5] "Id; Condition; PropertyType; PropertySubType; Bedrooms; Bathrooms; AreaNet; AreaGross; Parking; Latitude; Longitude; Country; District; Municipality; Parish; Price.M2; Price; "
#> [6] ""
#> [7] "**With the target value described by a column:** Price."
#> [8] " "
#> [9] "** Static columns are: **Country; District; Municipality; "
#> [10] ""
#> [11] "**With dominating values: **Portugal; Lisboa; Lisboa; "
#> [12] ""
#> [13] ""
#> [14] "**These column pairs are duplicate: **"
#> [15] "District - Municipality; "
#> [16] ""
#> [17] ""
#> [18] "**No target values are missing. **"
#> [19] ""
#> [20] "**No predictor values are missing. **"
#> [21] ""
#> [22] "**No issues with dimensionality. **"
#> [23] ""
#> [24] "**Strongly correlated, by Spearman rank, pairs of numerical values are: **"
#> [25] ""
#> [26] " Bedrooms - AreaNet: 0.77;"
#> [27] " Bedrooms - AreaGross: 0.77;"
#> [28] " Bathrooms - AreaNet: 0.78;"
#> [29] " Bathrooms - AreaGross: 0.78;"
#> [30] " AreaNet - AreaGross: 1;"
#> [31] ""
#> [32] " ** Strongly correlated, by Crammer's V rank, pairs of categorical values are: **"
#> [33] ""
#> [34] " PropertyType - PropertySubType: 1;"
#> [35] ""
#> [36] "**These obserwation migth be outliers due to their numerical columns values: **"
#> [37] ""
#> [38] " 145 146 196 44 5 51 57 58 59 60 61 62 63 64 69 75 76 77 78 ;"
#> [39] ""
#> [40] "**Target data is not evenly distributed with quantile bins:** 0.25 0.35 0.14 0.26 "
#> [41] ""
#> [42] "**Columns names suggest that some of them are IDs, removing them can improve the model. Suspicious columns are: **"
#> [43] ""
#> [44] " Id "
#> [45] ""
#> [46] "**Columns data suggest that some of them are IDs, removing them can improve the model. Suspicious columns are: **"
#> [47] ""
#> [48] " Id "
#> [49] ""
#> [50] ""
#> [51] " -------------------- **CHECK DATA REPORT END** -------------------- "
#> [52] " "
#>
#> $outliers
#> [1] 145 146 196 44 5 51 57 58 59 60 61 62 63 64 69 75 76 77 78
#>
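The report above flags static columns together with their dominating values. One plausible way to detect such columns is sketched below on a made-up data frame. The 0.99 share and the helper name `static_columns` are assumptions and may differ from what the package does.

```r
# Toy data, made up for illustration; it is not the lisbon dataset.
toy <- data.frame(
  Country  = rep("Portugal", 6),
  District = rep("Lisboa", 6),
  Bedrooms = c(1, 2, 2, 3, 4, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: a column counts as static when its most frequent
# value covers at least `share` of the rows (0.99 is an assumed cut-off).
# Returns the dominating value of every static column, named by column.
static_columns <- function(df, share = 0.99) {
  top <- vapply(df, function(col) {
    counts <- table(col, useNA = "ifany")
    names(counts)[which.max(counts)]
  }, character(1))
  freq <- vapply(df, function(col) {
    max(table(col, useNA = "ifany")) / length(col)
  }, numeric(1))
  top[freq >= share]
}

static <- static_columns(toy)
print(static)
```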
check_data(compas, 'Two_yr_Recidivism')
#> -------------------- CHECK DATA REPORT --------------------
#>
#> The dataset has 6172 observations and 7 columns, which names are:
#> Two_yr_Recidivism; Number_of_Priors; Age_Above_FourtyFive; Age_Below_TwentyFive; Misdemeanor; Ethnicity; Sex;
#>
#> With the target value described by a column Two_yr_Recidivism.
#>
#> ✔ No static columns.
#>
#> ✔ No duplicate columns.
#>
#> ✔ No target values are missing.
#>
#> ✔ No predictor values are missing.
#>
#> ✔ No issues with dimensionality.
#>
#> ✔ No strongly correlated, by Spearman rank, pairs of numerical values.
#>
#> ✔ No strongly correlated, by Crammer's V rank, pairs of categorical values.
#>
#> ✖ There are more than 50 possible outliers in the data set, so we are not printing them. They are returned in the output as a vector.
#>
#> ✔ Dataset is balanced.
#>
#> ✔ Columns names suggest that none of them are IDs.
#>
#> ✔ Columns data suggest that none of them are IDs.
#>
#> -------------------- CHECK DATA REPORT END --------------------
#>
#> $str
#> [1] " -------------------- **CHECK DATA REPORT** -------------------- "
#> [2] " "
#> [3] "**The dataset has 6172 observations and 7 columns which names are: **"
#> [4] ""
#> [5] "Two_yr_Recidivism; Number_of_Priors; Age_Above_FourtyFive; Age_Below_TwentyFive; Misdemeanor; Ethnicity; Sex; "
#> [6] ""
#> [7] "**With the target value described by a column:** Two_yr_Recidivism."
#> [8] " "
#> [9] "**No static columns. **"
#> [10] ""
#> [11] ""
#> [12] "**No duplicate columns.**"
#> [13] ""
#> [14] "**No target values are missing. **"
#> [15] ""
#> [16] "**No predictor values are missing. **"
#> [17] ""
#> [18] "**No issues with dimensionality. **"
#> [19] ""
#> [20] "**No strongly correlated, by Spearman rank, pairs of numerical values. **"
#> [21] ""
#> [22] "**No strongly correlated, by Crammer's V rank, pairs of categorical values. **"
#> [23] ""
#> [24] "**There are more than 50 possible outliers in the data set, so we are not printing them. They are returned in the output as a vector. **"
#> [25] ""
#> [26] "**Dataset is balanced. **"
#> [27] ""
#> [28] "**Columns names suggest that none of them are IDs. **"
#> [29] ""
#> [30] "**Columns data suggest that none of them are IDs. **"
#> [31] ""
#> [32] ""
#> [33] " -------------------- **CHECK DATA REPORT END** -------------------- "
#> [34] " "
#>
#> $outliers
#> [1] 102 108 1181 1209 1321 1401 1403 1406 1408 1417 1422 1443 1468 1526 1532
#> [16] 1561 157 1596 1630 1681 173 1814 1820 1830 1865 1920 1924 1950 2080 2099
#> [31] 210 2105 2168 2264 2301 2336 2348 2410 2417 2423 2444 2453 2503 2504 2526
#> [46] 2544 2611 2648 2680 273 2744 2792 2829 2858 2871 2872 2873 2888 2979 3043
#> [61] 3050 3104 3107 3138 3204 3207 322 3229 3250 326 3280 3314 3333 3360 3394
#> [76] 34 3534 356 3594 3620 3714 3762 3803 3830 3872 3873 3923 393 4076 4083
#> [91] 4085 4086 4091 4104 4111 4172 424 425 4274 4303 4378 4394 4484 4492 455
#> [106] 4709 4720 4740 4882 4931 4962 4973 5067 5133 5164 5193 5231 5254 5261 5283
#> [121] 5286 5332 5350 5351 539 5411 5497 5512 555 556 5705 5830 588 5934 5959
#> [136] 5979 6003 6005 6023 603 6080 6161 630 642 674 707 709 739 787 846
#> [151] 897 904 932
#>
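When more than 50 rows are flagged, the report does not print them and they are available only through `$outliers`. The vector printed above appears to be in lexicographic rather than numeric order (for example 1561, 157, 1596), so sorting may help when inspecting it. A small sketch using six values copied from that output:

```r
# Six consecutive values copied from the $outliers output above;
# the full vector has 153 entries.
outliers_part <- c(1526, 1532, 1561, 157, 1596, 1630)

# sort() on a numeric vector orders by value, not by printed digits.
sorted <- sort(outliers_part)
print(sorted)
```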
check_data(iris, 'Species')
#> -------------------- CHECK DATA REPORT --------------------
#>
#> The dataset has 150 observations and 5 columns, which names are:
#> Sepal.Length; Sepal.Width; Petal.Length; Petal.Width; Species;
#>
#> With the target value described by a column Species.
#>
#> ✔ No static columns.
#>
#> ✔ No duplicate columns.
#>
#> ✔ No target values are missing.
#>
#> ✔ No predictor values are missing.
#>
#> ✔ No issues with dimensionality.
#>
#> ✖ Strongly correlated, by Spearman rank, pairs of numerical values are:
#>
#> Sepal.Length - Petal.Length: 0.87;
#> Sepal.Length - Petal.Width: 0.82;
#> Petal.Length - Petal.Width: 0.96;
#>
#> ✖ These obserwation migth be outliers due to their numerical columns values:
#> 16 ;
#>
#> ✖ Multilabel classification is not supported yet.
#>
#> ✔ Columns names suggest that none of them are IDs.
#>
#> ✔ Columns data suggest that none of them are IDs.
#>
#> -------------------- CHECK DATA REPORT END --------------------
#>
#> $str
#> [1] " -------------------- **CHECK DATA REPORT** -------------------- "
#> [2] " "
#> [3] "**The dataset has 150 observations and 5 columns which names are: **"
#> [4] ""
#> [5] "Sepal.Length; Sepal.Width; Petal.Length; Petal.Width; Species; "
#> [6] ""
#> [7] "**With the target value described by a column:** Species."
#> [8] " "
#> [9] "**No static columns. **"
#> [10] ""
#> [11] ""
#> [12] "**No duplicate columns.**"
#> [13] ""
#> [14] "**No target values are missing. **"
#> [15] ""
#> [16] "**No predictor values are missing. **"
#> [17] ""
#> [18] "**No issues with dimensionality. **"
#> [19] ""
#> [20] "**Strongly correlated, by Spearman rank, pairs of numerical values are: **"
#> [21] ""
#> [22] " Sepal.Length - Petal.Length: 0.87;"
#> [23] " Sepal.Length - Petal.Width: 0.82;"
#> [24] " Petal.Length - Petal.Width: 0.96;"
#> [25] ""
#> [26] "**These obserwation migth be outliers due to their numerical columns values: **"
#> [27] ""
#> [28] " 16 ;"
#> [29] ""
#> [30] "**Multilabel classification is not supported yet. **"
#> [31] ""
#> [32] "**Columns names suggest that none of them are IDs. **"
#> [33] ""
#> [34] "**Columns data suggest that none of them are IDs. **"
#> [35] ""
#> [36] ""
#> [37] " -------------------- **CHECK DATA REPORT END** -------------------- "
#> [38] " "
#>
#> $outliers
#> [1] 16
#>
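The first example passes only the first 100 rows of iris, which contain two of the three species, and its report includes a balance check. With all 150 rows the target has three classes and the report prints the "Multilabel classification is not supported yet" message instead. The sketch below counts the classes actually present in a target column. `n_classes` is a hypothetical helper, and whether the package uses the same rule is an assumption.

```r
# Count the classes actually present in the target column. Converting to
# character ignores unused factor levels, which matters after subsetting.
n_classes <- function(df, y) {
  length(unique(as.character(df[[y]])))
}

n_classes(iris[1:100, ], "Species")  # 2
n_classes(iris, "Species")           # 3
```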
check_data(lymph, 'class')
#> -------------------- CHECK DATA REPORT --------------------
#>
#> The dataset has 148 observations and 19 columns, which names are:
#> lymphatics; block_of_affere; bl_of_lymph_c; bl_of_lymph_s; by_pass; extravasates; regeneration_of; early_uptake_in; lym_nodes_dimin; lym_nodes_enlar; changes_in_lym; defect_in_node; changes_in_node; changes_in_stru; special_forms; dislocation_of; exclusion_of_no; no_of_nodes_in; class;
#>
#> With the target value described by a column class.
#>
#> ✔ No static columns.
#>
#> ✔ No duplicate columns.
#>
#> ✔ No target values are missing.
#>
#> ✔ No predictor values are missing.
#>
#> ✔ No issues with dimensionality.
#>
#> ✔ No strongly correlated, by Spearman rank, pairs of numerical values.
#>
#> ✔ No strongly correlated, by Crammer's V rank, pairs of categorical values.
#>
#> ✖ These obserwation migth be outliers due to their numerical columns values:
#> 3 54 74 ;
#>
#> ✖ Multilabel classification is not supported yet.
#>
#> ✔ Columns names suggest that none of them are IDs.
#>
#> ✔ Columns data suggest that none of them are IDs.
#>
#> -------------------- CHECK DATA REPORT END --------------------
#>
#> $str
#> [1] " -------------------- **CHECK DATA REPORT** -------------------- "
#> [2] " "
#> [3] "**The dataset has 148 observations and 19 columns which names are: **"
#> [4] ""
#> [5] "lymphatics; block_of_affere; bl_of_lymph_c; bl_of_lymph_s; by_pass; extravasates; regeneration_of; early_uptake_in; lym_nodes_dimin; lym_nodes_enlar; changes_in_lym; defect_in_node; changes_in_node; changes_in_stru; special_forms; dislocation_of; exclusion_of_no; no_of_nodes_in; class; "
#> [6] ""
#> [7] "**With the target value described by a column:** class."
#> [8] " "
#> [9] "**No static columns. **"
#> [10] ""
#> [11] ""
#> [12] "**No duplicate columns.**"
#> [13] ""
#> [14] "**No target values are missing. **"
#> [15] ""
#> [16] "**No predictor values are missing. **"
#> [17] ""
#> [18] "**No issues with dimensionality. **"
#> [19] ""
#> [20] "**No strongly correlated, by Spearman rank, pairs of numerical values. **"
#> [21] ""
#> [22] "**No strongly correlated, by Crammer's V rank, pairs of categorical values. **"
#> [23] ""
#> [24] "**These obserwation migth be outliers due to their numerical columns values: **"
#> [25] ""
#> [26] " 3 54 74 ;"
#> [27] ""
#> [28] "**Multilabel classification is not supported yet. **"
#> [29] ""
#> [30] "**Columns names suggest that none of them are IDs. **"
#> [31] ""
#> [32] "**Columns data suggest that none of them are IDs. **"
#> [33] ""
#> [34] ""
#> [35] " -------------------- **CHECK DATA REPORT END** -------------------- "
#> [36] " "
#>
#> $outliers
#> [1] 3 54 74
#>