Run the data check pipeline to look for potential problems in the data

check_data(data, y, verbose = TRUE)

Arguments

data

A data source in one of the major R formats: data.table, data.frame, matrix, and so on.

y

A string giving the name of the target column.

verbose

A logical value. If TRUE, all information about the process is printed; if FALSE, none is.

Value

A list with two vectors: the lines of the report (str) and the observations flagged as possible outliers (outliers).
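
The returned list can be used after the check, for example to drop the flagged observations or to save the report. The sketch below is only illustrative: `report` is a hand-built stand-in shaped like the documented return value, `drop_outliers` is a hypothetical helper that is not part of the package, and it assumes the values in `outliers` are row positions of the input data. It computes the rows to keep rather than using negative indexing, because `data[-numeric(0), ]` selects zero rows when no outliers were found.

```r
# Hypothetical stand-in for the list returned by check_data();
# the values are illustrative, not real output.
report <- list(
  str = c("**No static columns. **", "", "**Dataset is balanced. **"),
  outliers = c(16)
)

# Hypothetical helper (not part of the package).
# Assumption: `outliers` holds row positions of `data`.
drop_outliers <- function(data, outliers) {
  # Keep every row whose position is not flagged; this is also safe
  # when `outliers` is empty, unlike data[-outliers, ].
  keep <- setdiff(seq_len(nrow(data)), outliers)
  data[keep, , drop = FALSE]
}

cleaned <- drop_outliers(iris, report$outliers)
nrow(cleaned)    # 149

untouched <- drop_outliers(iris, numeric(0))
nrow(untouched)  # 150

# Save the report lines to a text file.
report_file <- tempfile(fileext = ".md")
writeLines(report$str, report_file)
```

Whether flagged observations should actually be removed depends on the data; the report only says they might be outliers.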

Examples

check_data(iris[1:100, ], 'Species')
#>  -------------------- CHECK DATA REPORT -------------------- 
#>  
#> The dataset has 100 observations and 5 columns, which names are: 
#> Sepal.Length; Sepal.Width; Petal.Length; Petal.Width; Species; 
#> 
#> With the target value described by a column Species.
#> 
#>  No static columns. 
#> 
#>  No duplicate columns.
#> 
#>  No target values are missing. 
#> 
#>  No predictor values are missing. 
#> 
#>  No issues with dimensionality. 
#> 
#>  Strongly correlated, by Spearman rank, pairs of numerical values are: 
#>  
#>  Sepal.Length - Petal.Length: 0.81;
#>  Sepal.Length - Petal.Width: 0.79;
#>  Petal.Length - Petal.Width: 0.98;
#> 
#>  No outliers in the dataset. 
#> 
#>  Dataset is balanced. 
#> 
#>  Columns names suggest that none of them are IDs. 
#> 
#>  Columns data suggest that none of them are IDs. 
#> 
#>  -------------------- CHECK DATA REPORT END -------------------- 
#>  
#> $str
#>  [1] " -------------------- **CHECK DATA REPORT** -------------------- "         
#>  [2] " "                                                                         
#>  [3] "**The dataset has 100 observations and 5 columns which names are: **"      
#>  [4] ""                                                                          
#>  [5] "Sepal.Length; Sepal.Width; Petal.Length; Petal.Width; Species; "           
#>  [6] ""                                                                          
#>  [7] "**With the target value described by a column:** Species."                 
#>  [8] " "                                                                         
#>  [9] "**No static columns. **"                                                   
#> [10] ""                                                                          
#> [11] ""                                                                          
#> [12] "**No duplicate columns.**"                                                 
#> [13] ""                                                                          
#> [14] "**No target values are missing. **"                                        
#> [15] ""                                                                          
#> [16] "**No predictor values are missing. **"                                     
#> [17] ""                                                                          
#> [18] "**No issues with dimensionality. **"                                       
#> [19] ""                                                                          
#> [20] "**Strongly correlated, by Spearman rank, pairs of numerical values are: **"
#> [21] ""                                                                          
#> [22] " Sepal.Length - Petal.Length: 0.81;"                                       
#> [23] " Sepal.Length - Petal.Width: 0.79;"                                        
#> [24] " Petal.Length - Petal.Width: 0.98;"                                        
#> [25] ""                                                                          
#> [26] "**No outliers in the dataset. **"                                          
#> [27] ""                                                                          
#> [28] "**Dataset is balanced. **"                                                 
#> [29] ""                                                                          
#> [30] "**Columns names suggest that none of them are IDs. **"                     
#> [31] ""                                                                          
#> [32] "**Columns data suggest that none of them are IDs. **"                      
#> [33] ""                                                                          
#> [34] ""                                                                          
#> [35] " -------------------- **CHECK DATA REPORT END** -------------------- "     
#> [36] " "                                                                         
#> 
#> $outliers
#> numeric(0)
#> 
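
The correlation section of the report above can be explored by hand with base R. This is a rough sketch, not the package's implementation: the 0.7 cut-off and the use of the absolute value are assumptions made for illustration, and `threshold`, `num`, `rho`, and `pairs` are names introduced here.

```r
# Assumption: 0.7 is an illustrative cut-off, not necessarily the one
# used by check_data().
threshold <- 0.7

# Numeric columns of the same subset used in the example above.
num <- iris[1:100, sapply(iris, is.numeric)]

# Spearman rank correlation matrix (base R).
rho <- cor(num, method = "spearman")

# Take each pair once (upper triangle) and keep those above the cut-off.
idx <- which(abs(rho) > threshold & upper.tri(rho), arr.ind = TRUE)
pairs <- data.frame(
  first  = rownames(rho)[idx[, "row"]],
  second = colnames(rho)[idx[, "col"]],
  rho    = round(rho[idx], 2)
)
print(pairs)
```

The pairs listed this way need not match the report exactly, since the package may use a different threshold or treat negative correlations differently.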
check_data(lisbon, 'Price')
#>  -------------------- CHECK DATA REPORT -------------------- 
#>  
#> The dataset has 246 observations and 17 columns, which names are: 
#> Id; Condition; PropertyType; PropertySubType; Bedrooms; Bathrooms; AreaNet; AreaGross; Parking; Latitude; Longitude; Country; District; Municipality; Parish; Price.M2; Price; 
#> 
#> With the target value described by a column Price.
#> 
#>  Static columns are: 
#>  Country; District; Municipality; 
#> 
#>  With dominating values: 
#>  Portugal; Lisboa; Lisboa; 
#>  
#>  These column pairs are duplicate:
#>  District - Municipality; 
#> 
#>  No target values are missing. 
#> 
#>  No predictor values are missing. 
#> 
#>  No issues with dimensionality. 
#> 
#>  Strongly correlated, by Spearman rank, pairs of numerical values are: 
#>  
#>  Bedrooms - AreaNet: 0.77;
#>  Bedrooms - AreaGross: 0.77;
#>  Bathrooms - AreaNet: 0.78;
#>  Bathrooms - AreaGross: 0.78;
#>  AreaNet - AreaGross: 1;
#> 
#>  Strongly correlated, by Crammer's V rank, pairs of categorical values are: 
#>  PropertyType - PropertySubType: 1;
#> 
#>  These obserwation migth be outliers due to their numerical columns values: 
#>  145 146 196 44 5 51 57 58 59 60 61 62 63 64 69 75 76 77 78 ;
#> 
#>  Target data is not evenly distributed with quantile bins: 0.25 0.35 0.14 0.26 
#> 
#>  Columns names suggest that some of them are IDs, removing them can improve the model.
#>  Suspicious columns are: Id .
#> 
#>  Columns data suggest that some of them are IDs, removing them can improve the model.
#>  Suspicious columns are: Id .
#> 
#>  -------------------- CHECK DATA REPORT END -------------------- 
#>  
#> $str
#>  [1] " -------------------- **CHECK DATA REPORT** -------------------- "                                                                                                              
#>  [2] " "                                                                                                                                                                              
#>  [3] "**The dataset has 246 observations and 17 columns which names are: **"                                                                                                          
#>  [4] ""                                                                                                                                                                               
#>  [5] "Id; Condition; PropertyType; PropertySubType; Bedrooms; Bathrooms; AreaNet; AreaGross; Parking; Latitude; Longitude; Country; District; Municipality; Parish; Price.M2; Price; "
#>  [6] ""                                                                                                                                                                               
#>  [7] "**With the target value described by a column:** Price."                                                                                                                        
#>  [8] " "                                                                                                                                                                              
#>  [9] "** Static columns are: **Country; District; Municipality; "                                                                                                                     
#> [10] ""                                                                                                                                                                               
#> [11] "**With dominating values: **Portugal; Lisboa; Lisboa; "                                                                                                                         
#> [12] ""                                                                                                                                                                               
#> [13] ""                                                                                                                                                                               
#> [14] "**These column pairs are duplicate: **"                                                                                                                                         
#> [15] "District - Municipality; "                                                                                                                                                      
#> [16] ""                                                                                                                                                                               
#> [17] ""                                                                                                                                                                               
#> [18] "**No target values are missing. **"                                                                                                                                             
#> [19] ""                                                                                                                                                                               
#> [20] "**No predictor values are missing. **"                                                                                                                                          
#> [21] ""                                                                                                                                                                               
#> [22] "**No issues with dimensionality. **"                                                                                                                                            
#> [23] ""                                                                                                                                                                               
#> [24] "**Strongly correlated, by Spearman rank, pairs of numerical values are: **"                                                                                                     
#> [25] ""                                                                                                                                                                               
#> [26] " Bedrooms - AreaNet: 0.77;"                                                                                                                                                     
#> [27] " Bedrooms - AreaGross: 0.77;"                                                                                                                                                   
#> [28] " Bathrooms - AreaNet: 0.78;"                                                                                                                                                    
#> [29] " Bathrooms - AreaGross: 0.78;"                                                                                                                                                  
#> [30] " AreaNet - AreaGross: 1;"                                                                                                                                                       
#> [31] ""                                                                                                                                                                               
#> [32] " ** Strongly correlated, by Crammer's V rank, pairs of categorical values are: **"                                                                                              
#> [33] ""                                                                                                                                                                               
#> [34] " PropertyType - PropertySubType: 1;"                                                                                                                                            
#> [35] ""                                                                                                                                                                               
#> [36] "**These obserwation migth be outliers due to their numerical columns values: **"                                                                                                
#> [37] ""                                                                                                                                                                               
#> [38] " 145 146 196 44 5 51 57 58 59 60 61 62 63 64 69 75 76 77 78 ;"                                                                                                                  
#> [39] ""                                                                                                                                                                               
#> [40] "**Target data is not evenly distributed with quantile bins:** 0.25 0.35 0.14 0.26 "                                                                                             
#> [41] ""                                                                                                                                                                               
#> [42] "**Columns names suggest that some of them are IDs, removing them can improve the model. Suspicious columns are: **"                                                             
#> [43] ""                                                                                                                                                                               
#> [44] " Id "                                                                                                                                                                           
#> [45] ""                                                                                                                                                                               
#> [46] "**Columns data suggest that some of them are IDs, removing them can improve the model. Suspicious columns are: **"                                                              
#> [47] ""                                                                                                                                                                               
#> [48] " Id "                                                                                                                                                                           
#> [49] ""                                                                                                                                                                               
#> [50] ""                                                                                                                                                                               
#> [51] " -------------------- **CHECK DATA REPORT END** -------------------- "                                                                                                          
#> [52] " "                                                                                                                                                                              
#> 
#> $outliers
#>  [1] 145 146 196  44   5  51  57  58  59  60  61  62  63  64  69  75  76  77  78
#> 
check_data(compas, 'Two_yr_Recidivism')
#>  -------------------- CHECK DATA REPORT -------------------- 
#>  
#> The dataset has 6172 observations and 7 columns, which names are: 
#> Two_yr_Recidivism; Number_of_Priors; Age_Above_FourtyFive; Age_Below_TwentyFive; Misdemeanor; Ethnicity; Sex; 
#> 
#> With the target value described by a column Two_yr_Recidivism.
#> 
#>  No static columns. 
#> 
#>  No duplicate columns.
#> 
#>  No target values are missing. 
#> 
#>  No predictor values are missing. 
#> 
#>  No issues with dimensionality. 
#> 
#>  No strongly correlated, by Spearman rank, pairs of numerical values. 
#> 
#>  No strongly correlated, by Crammer's V rank, pairs of categorical values. 
#> 
#>  There are more than 50 possible outliers in the data set, so we are not printing them. They are returned in the output as a vector. 
#> 
#>  Dataset is balanced. 
#> 
#>  Columns names suggest that none of them are IDs. 
#> 
#>  Columns data suggest that none of them are IDs. 
#> 
#>  -------------------- CHECK DATA REPORT END -------------------- 
#>  
#> $str
#>  [1] " -------------------- **CHECK DATA REPORT** -------------------- "                                                                       
#>  [2] " "                                                                                                                                       
#>  [3] "**The dataset has 6172 observations and 7 columns which names are: **"                                                                   
#>  [4] ""                                                                                                                                        
#>  [5] "Two_yr_Recidivism; Number_of_Priors; Age_Above_FourtyFive; Age_Below_TwentyFive; Misdemeanor; Ethnicity; Sex; "                          
#>  [6] ""                                                                                                                                        
#>  [7] "**With the target value described by a column:** Two_yr_Recidivism."                                                                     
#>  [8] " "                                                                                                                                       
#>  [9] "**No static columns. **"                                                                                                                 
#> [10] ""                                                                                                                                        
#> [11] ""                                                                                                                                        
#> [12] "**No duplicate columns.**"                                                                                                               
#> [13] ""                                                                                                                                        
#> [14] "**No target values are missing. **"                                                                                                      
#> [15] ""                                                                                                                                        
#> [16] "**No predictor values are missing. **"                                                                                                   
#> [17] ""                                                                                                                                        
#> [18] "**No issues with dimensionality. **"                                                                                                     
#> [19] ""                                                                                                                                        
#> [20] "**No strongly correlated, by Spearman rank, pairs of numerical values. **"                                                               
#> [21] ""                                                                                                                                        
#> [22] "**No strongly correlated, by Crammer's V rank, pairs of categorical values. **"                                                          
#> [23] ""                                                                                                                                        
#> [24] "**There are more than 50 possible outliers in the data set, so we are not printing them. They are returned in the output as a vector. **"
#> [25] ""                                                                                                                                        
#> [26] "**Dataset is balanced. **"                                                                                                               
#> [27] ""                                                                                                                                        
#> [28] "**Columns names suggest that none of them are IDs. **"                                                                                   
#> [29] ""                                                                                                                                        
#> [30] "**Columns data suggest that none of them are IDs. **"                                                                                    
#> [31] ""                                                                                                                                        
#> [32] ""                                                                                                                                        
#> [33] " -------------------- **CHECK DATA REPORT END** -------------------- "                                                                   
#> [34] " "                                                                                                                                       
#> 
#> $outliers
#>   [1]  102  108 1181 1209 1321 1401 1403 1406 1408 1417 1422 1443 1468 1526 1532
#>  [16] 1561  157 1596 1630 1681  173 1814 1820 1830 1865 1920 1924 1950 2080 2099
#>  [31]  210 2105 2168 2264 2301 2336 2348 2410 2417 2423 2444 2453 2503 2504 2526
#>  [46] 2544 2611 2648 2680  273 2744 2792 2829 2858 2871 2872 2873 2888 2979 3043
#>  [61] 3050 3104 3107 3138 3204 3207  322 3229 3250  326 3280 3314 3333 3360 3394
#>  [76]   34 3534  356 3594 3620 3714 3762 3803 3830 3872 3873 3923  393 4076 4083
#>  [91] 4085 4086 4091 4104 4111 4172  424  425 4274 4303 4378 4394 4484 4492  455
#> [106] 4709 4720 4740 4882 4931 4962 4973 5067 5133 5164 5193 5231 5254 5261 5283
#> [121] 5286 5332 5350 5351  539 5411 5497 5512  555  556 5705 5830  588 5934 5959
#> [136] 5979 6003 6005 6023  603 6080 6161  630  642  674  707  709  739  787  846
#> [151]  897  904  932
#> 
check_data(iris, 'Species')
#>  -------------------- CHECK DATA REPORT -------------------- 
#>  
#> The dataset has 150 observations and 5 columns, which names are: 
#> Sepal.Length; Sepal.Width; Petal.Length; Petal.Width; Species; 
#> 
#> With the target value described by a column Species.
#> 
#>  No static columns. 
#> 
#>  No duplicate columns.
#> 
#>  No target values are missing. 
#> 
#>  No predictor values are missing. 
#> 
#>  No issues with dimensionality. 
#> 
#>  Strongly correlated, by Spearman rank, pairs of numerical values are: 
#>  
#>  Sepal.Length - Petal.Length: 0.87;
#>  Sepal.Length - Petal.Width: 0.82;
#>  Petal.Length - Petal.Width: 0.96;
#> 
#>  These obserwation migth be outliers due to their numerical columns values: 
#>  16 ;
#> 
#>  Multilabel classification is not supported yet. 
#> 
#>  Columns names suggest that none of them are IDs. 
#> 
#>  Columns data suggest that none of them are IDs. 
#> 
#>  -------------------- CHECK DATA REPORT END -------------------- 
#>  
#> $str
#>  [1] " -------------------- **CHECK DATA REPORT** -------------------- "              
#>  [2] " "                                                                              
#>  [3] "**The dataset has 150 observations and 5 columns which names are: **"           
#>  [4] ""                                                                               
#>  [5] "Sepal.Length; Sepal.Width; Petal.Length; Petal.Width; Species; "                
#>  [6] ""                                                                               
#>  [7] "**With the target value described by a column:** Species."                      
#>  [8] " "                                                                              
#>  [9] "**No static columns. **"                                                        
#> [10] ""                                                                               
#> [11] ""                                                                               
#> [12] "**No duplicate columns.**"                                                      
#> [13] ""                                                                               
#> [14] "**No target values are missing. **"                                             
#> [15] ""                                                                               
#> [16] "**No predictor values are missing. **"                                          
#> [17] ""                                                                               
#> [18] "**No issues with dimensionality. **"                                            
#> [19] ""                                                                               
#> [20] "**Strongly correlated, by Spearman rank, pairs of numerical values are: **"     
#> [21] ""                                                                               
#> [22] " Sepal.Length - Petal.Length: 0.87;"                                            
#> [23] " Sepal.Length - Petal.Width: 0.82;"                                             
#> [24] " Petal.Length - Petal.Width: 0.96;"                                             
#> [25] ""                                                                               
#> [26] "**These obserwation migth be outliers due to their numerical columns values: **"
#> [27] ""                                                                               
#> [28] " 16 ;"                                                                          
#> [29] ""                                                                               
#> [30] "**Multilabel classification is not supported yet. **"                           
#> [31] ""                                                                               
#> [32] "**Columns names suggest that none of them are IDs. **"                          
#> [33] ""                                                                               
#> [34] "**Columns data suggest that none of them are IDs. **"                           
#> [35] ""                                                                               
#> [36] ""                                                                               
#> [37] " -------------------- **CHECK DATA REPORT END** -------------------- "          
#> [38] " "                                                                              
#> 
#> $outliers
#> [1] 16
#> 
check_data(lymph, 'class')
#>  -------------------- CHECK DATA REPORT -------------------- 
#>  
#> The dataset has 148 observations and 19 columns, which names are: 
#> lymphatics; block_of_affere; bl_of_lymph_c; bl_of_lymph_s; by_pass; extravasates; regeneration_of; early_uptake_in; lym_nodes_dimin; lym_nodes_enlar; changes_in_lym; defect_in_node; changes_in_node; changes_in_stru; special_forms; dislocation_of; exclusion_of_no; no_of_nodes_in; class; 
#> 
#> With the target value described by a column class.
#> 
#>  No static columns. 
#> 
#>  No duplicate columns.
#> 
#>  No target values are missing. 
#> 
#>  No predictor values are missing. 
#> 
#>  No issues with dimensionality. 
#> 
#>  No strongly correlated, by Spearman rank, pairs of numerical values. 
#> 
#>  No strongly correlated, by Crammer's V rank, pairs of categorical values. 
#> 
#>  These obserwation migth be outliers due to their numerical columns values: 
#>  3 54 74 ;
#> 
#>  Multilabel classification is not supported yet. 
#> 
#>  Columns names suggest that none of them are IDs. 
#> 
#>  Columns data suggest that none of them are IDs. 
#> 
#>  -------------------- CHECK DATA REPORT END -------------------- 
#>  
#> $str
#>  [1] " -------------------- **CHECK DATA REPORT** -------------------- "                                                                                                                                                                                                                              
#>  [2] " "                                                                                                                                                                                                                                                                                              
#>  [3] "**The dataset has 148 observations and 19 columns which names are: **"                                                                                                                                                                                                                          
#>  [4] ""                                                                                                                                                                                                                                                                                               
#>  [5] "lymphatics; block_of_affere; bl_of_lymph_c; bl_of_lymph_s; by_pass; extravasates; regeneration_of; early_uptake_in; lym_nodes_dimin; lym_nodes_enlar; changes_in_lym; defect_in_node; changes_in_node; changes_in_stru; special_forms; dislocation_of; exclusion_of_no; no_of_nodes_in; class; "
#>  [6] ""
#>  [7] "**With the target value described by a column:** class."
#>  [8] " "
#>  [9] "**No static columns. **"
#> [10] ""
#> [11] ""
#> [12] "**No duplicate columns.**"
#> [13] ""
#> [14] "**No target values are missing. **"
#> [15] ""
#> [16] "**No predictor values are missing. **"
#> [17] ""
#> [18] "**No issues with dimensionality. **"
#> [19] ""
#> [20] "**No strongly correlated, by Spearman rank, pairs of numerical values. **"
#> [21] ""
#> [22] "**No strongly correlated, by Crammer's V rank, pairs of categorical values. **"
#> [23] ""
#> [24] "**These obserwation migth be outliers due to their numerical columns values: **"
#> [25] ""
#> [26] " 3 54 74 ;"
#> [27] ""
#> [28] "**Multilabel classification is not supported yet. **"
#> [29] ""
#> [30] "**Columns names suggest that none of them are IDs. **"
#> [31] ""
#> [32] "**Columns data suggest that none of them are IDs. **"
#> [33] ""
#> [34] ""
#> [35] " -------------------- **CHECK DATA REPORT END** -------------------- "
#> [36] " "
#> 
#> $outliers
#> [1]  3 54 74
#>