C5.0 fails with commas in input variables #12

Open
CarolynOlsen opened this issue Apr 20, 2018 · 2 comments
@CarolynOlsen

C5.0() now fails on factor variables that include commas, where it did not before.

I recently updated my version of C50, and tried to train a model on a data set I've trained C5.0 models on before. I now receive the error "c50 code called exit with value 1". I narrowed it down to one factor variable that had commas in the values. After removing the commas, the model trained fine. Below is a small example I created to replicate the problem.

Thank you very much!

> ## PURPOSE: Replicate an error in C5.0 model training with commas
> 
> # define 2 different data frames, one with commas
> 
> # df no commas
> v1 = c(2, 3, 5, 7, 2, 4, 5, 2) 
> v2 = c("aa", "bb", "cc", "dd", "aa", "bb", "aa", "bb") 
> v3 = factor(c(1, 0, 0, 0, 1, 0, 1, 1) )
> dfNoCommas = data.frame(v1, v2, v3)
> 
> # df with commas
> v1 = c(2, 3, 5, 7, 2, 4, 5, 2) 
> v2 = c("a,a", "b,b", "c,c", "d,d", "a,a", "b,b", "a,a", "b,b") 
> v3 = factor(c(1, 0, 0, 0, 1, 0, 1, 1) )
> dfCommas = data.frame(v1, v2, v3)
> 
> # load C5 library
> library(C50)
> 
> # train a model with the no commas df
> trainNoCommas <- C5.0(formula = v3 ~ .
+      , data = dfNoCommas[,!colnames(dfNoCommas) %in% c("v3")]
+      , trials = 1
+      , rules = TRUE
+      , control = C5.0Control()
+ )
> 
> # train a model with the commas df
> trainCommas <- C5.0(formula = v3 ~ .
+                     , data = dfCommas[,!colnames(dfCommas) %in% c("v3")]
+                     , trials = 1
+                     , rules = TRUE
+                     , control = C5.0Control()
+ )
c50 code called exit with value 1
> 
> # see package versions
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] grid      parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RODBC_1.3-15      C50_0.1.1         AUC_0.3.0         adabag_4.2        pROC_1.11.0       smbinning_0.6     Formula_1.2-2     partykit_1.2-0   
 [9] rpart_4.1-11      mvtnorm_1.0-7     libcoin_1.0-1     sqldf_0.4-11      RSQLite_2.1.0     gsubfn_0.7        proto_1.0.0       stringr_1.3.0    
[17] caret_6.0-79      ggplot2_2.2.1     lattice_0.20-35   doParallel_1.0.11 iterators_1.0.9   foreach_1.4.4    

loaded via a namespace (and not attached):
[1] nlme_3.1-131        lubridate_1.7.2     bit64_0.9-7         dimRed_0.1.0        tools_3.4.3         R6_2.2.2            DBI_0.8            
 [8] lazyeval_0.2.1      colorspace_1.3-2    nnet_7.3-12         withr_2.1.1         tidyselect_0.2.4    mnormt_1.5-5        bit_1.1-12         
[15] compiler_3.4.3      chron_2.3-52        Cubist_0.2.1        scales_0.5.0        sfsmisc_1.1-2       DEoptimR_1.0-8      psych_1.7.8        
[22] robustbase_0.92-8   digest_0.6.15       foreign_0.8-69      pkgconfig_2.0.1     rlang_0.2.0         ddalpha_1.3.2       bindr_0.1          
[29] dplyr_0.7.4         ModelMetrics_1.1.0  magrittr_1.5        Matrix_1.2-12       Rcpp_0.12.15        munsell_0.4.3       abind_1.4-5        
[36] stringi_1.1.6       inum_1.0-0          MASS_7.3-47         plyr_1.8.4          recipes_0.1.2       blob_1.1.1          splines_3.4.3      
[43] pillar_1.2.1        tcltk_3.4.3         xgboost_0.6.4.1     reshape2_1.4.3      codetools_0.2-15    stats4_3.4.3        CVST_0.2-1         
[50] magic_1.5-8         glue_1.2.0          data.table_1.10.4-3 gtable_0.2.0        purrr_0.2.4         tidyr_0.8.0         kernlab_0.9-25     
[57] assertthat_0.2.0    DRR_0.0.3           gower_0.1.2         prodlim_1.6.1       broom_0.4.3         class_7.3-14        survival_2.41-3    
[64] geometry_0.3-6      timeDate_3043.102   RcppRoll_0.2.2      tibble_1.4.2        memoise_1.1.0       bindrcpp_0.2        lava_1.6.1         
[71] ipred_0.9-6     
@topepo
Owner

topepo commented May 21, 2018

This looks like a limitation in the C5.0 C code. Other special characters can be escaped, but I've been testing a bit and it does not accept this for commas inside the data values.

You might dummy up some application files to verify. If it doesn't work, I'd email RuleQuest and see if Quinlan can make a change.
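
One way to mock up such application files is sketched below in base R. It assumes the See5/C5.0 file layout (a `.names` file that lists the target and the attributes, plus a comma-separated `.data` file) and the backslash escape for special characters; the helper `escape_commas()` and the file stem `commas` are hypothetical names chosen for illustration and are not part of C50.

```r
# Sketch only: writes a tiny .names/.data pair with escaped commas so the
# files can be fed to a standalone C5.0 build. The layout follows the
# See5/C5.0 file conventions as I understand them (an assumption).

# Hypothetical helper: prefix every comma with a backslash.
escape_commas <- function(x) gsub(",", "\\\\,", x)

v1 <- c(2, 3, 5, 7, 2, 4, 5, 2)
v2 <- c("a,a", "b,b", "c,c", "d,d", "a,a", "b,b", "a,a", "b,b")
v3 <- c(1, 0, 0, 0, 1, 0, 1, 1)

# .names file: the target first, then one line per attribute.
names_lines <- c(
  "v3.",
  "",
  "v1: continuous.",
  paste0("v2: ", paste(escape_commas(unique(v2)), collapse = ", "), "."),
  "v3: 0, 1."
)

# .data file: one case per line, values separated by unescaped commas.
data_lines <- paste(v1, escape_commas(v2), v3, sep = ",")

stem <- file.path(tempdir(), "commas")  # hypothetical file stem
writeLines(names_lines, paste0(stem, ".names"))
writeLines(data_lines, paste0(stem, ".data"))
```

If a standalone C5.0 binary is available, pointing it at this file stem (I believe with something like `c5.0 -f commas -r`, but check the program's own usage text) should show whether the escaped commas are rejected by the C code itself rather than by the R wrapper.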

@jjalcolea

jjalcolea commented Aug 13, 2019

Same problem here: it worked before, but after upgrading, commas in variables break the training process :-(
I'll check whether escaping the commas helps and report back...
(EDITED):
Sorry, I don't have time... I've downgraded with install_version("C50", version = "0.1.0-24", repos = "http://cran.us.r-project.org") to get the old comma-tolerant behavior.
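
For anyone who would rather stay on the current version, the original report notes that the model trained once the commas were removed. A minimal sketch of that workaround follows; `strip_commas()` and its `replacement` argument are hypothetical names, not part of C50, and the underscore replacement is an arbitrary choice.

```r
# Sketch only: replace commas in factor levels and character columns before
# training. strip_commas() is a hypothetical helper, not a C50 function.
strip_commas <- function(df, replacement = "_") {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.factor(x)) {
      # Rewriting the levels keeps the factor coding intact. Levels that
      # become identical after the replacement are merged by R.
      levels(x) <- gsub(",", replacement, levels(x), fixed = TRUE)
      df[[col]] <- x
    } else if (is.character(x)) {
      df[[col]] <- gsub(",", replacement, x, fixed = TRUE)
    }
  }
  df
}

# Same toy data as in the original report.
dfCommas <- data.frame(
  v1 = c(2, 3, 5, 7, 2, 4, 5, 2),
  v2 = c("a,a", "b,b", "c,c", "d,d", "a,a", "b,b", "a,a", "b,b"),
  v3 = factor(c(1, 0, 0, 0, 1, 0, 1, 1)),
  stringsAsFactors = TRUE
)

dfClean <- strip_commas(dfCommas)

# Train only if C50 is installed, so the sketch also runs without it.
if (requireNamespace("C50", quietly = TRUE)) {
  fit <- C50::C5.0(v3 ~ ., data = dfClean, trials = 1, rules = TRUE)
}
```

New data passed to predict() would need the same replacement applied, otherwise its factor levels will not match those seen in training.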
