Missing character predictors cause model build to fail #39

Open
jlries61 opened this issue Jul 31, 2021 · 0 comments

Consider the following command sequence:

require("C50")
require("dplyr")

indataName <- "bankmkt_part1m10.csv"
target <- "y"
NPART <- 20
keep <- c()
exclude <- c()

# The keep list is generated automatically
mkkeep <- function(dataset, keep, exclude) {
  varnames <- names(dataset)
  orgnames <- list()
  for (varname in varnames) orgnames[toupper(varname)] <- varname
  if (is.null(keep)) keep <- orgnames
  KEEP <- toupper(keep)
  EXCLUDE <- toupper(exclude)
  KEEP <- setdiff(KEEP, EXCLUDE)
  KEEP <- intersect(KEEP, names(orgnames))
  return(unlist(orgnames[KEEP]))
}

indata <- read.csv(indataName)
indata[, target] <- factor(indata[, target])

for (part in 1:NPART) exclude <- c(exclude, paste0("SAMPLE", part))
exclude <- c(exclude, target)
keep <- mkkeep(indata, keep, exclude)
form <- as.formula(paste(target, "~", paste(keep, collapse="+")))
model <- C5.0(formula = form, data = indata, trials = 1)
summary(model)

The model build fails with the following message:
c50 code called exit with value 1

summary(model) produces the following:


Call:
C5.0.formula(formula = form, data = indata, trials = 1, control
 = C5.0Control(subset = FALSE, winnow = TRUE, noGlobalPruning = FALSE))


C5.0 [Release 2.07 GPL Edition]  	Sat Jul 31 11:52:01 2021
-------------------------------

*** line 7 of `undefined.names': missing name or value before `,'

Error limit exceeded

The value of model$names is:

[1] "| Generated using R version 4.0.5 (2021-03-31)\n| on Sat Jul 31 11:59:19 2021\noutcome.\n\noutcome: 0,1.\nage: continuous.\njob: management,technician,entrepreneur,blue-collar,unknown,retired,admin.,services,,self-employed,unemployed,housemaid,student.\nmarital: ,single,married,divorced.\neducation: tertiary,secondary,unknown,primary,.\ndefault: continuous.\nbalance: continuous.\nhousing: yes,,no.\nloan: continuous.\ncontact: unknown,cellular,telephone.\nday: continuous.\nmonth: may,,jun,jul,aug,oct,nov,dec,jan,feb,mar,apr,sep.\nduration: continuous.\ncampaign: continuous.\npdays: continuous.\nprevious: continuous.\npoutcome: unknown,failure,other,success.\n"
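The blank entries in the names file (for example `services,,self-employed` under `job`, or the leading comma under `marital`) suggest that empty strings in the character predictors are being written out as a level with no name. A minimal sketch of how one might check for this, using a made-up data frame (`toy` and its values are hypothetical, not the attached data):

```r
# Hypothetical stand-in for the attached data: two character predictors
# in which missing values are stored as empty strings.
toy <- data.frame(
  job     = c("management", "", "student"),
  housing = c("yes", "no", ""),
  age     = c(30, 41, 25),
  stringsAsFactors = FALSE
)

# Count the empty strings in each character column.
is_chr <- vapply(toy, is.character, logical(1))
empty_counts <- vapply(toy[is_chr],
                       function(x) sum(x == "", na.rm = TRUE),
                       integer(1))
print(empty_counts)

# Converting such a column to a factor keeps "" as a level of its own.
job_levels <- levels(factor(toy$job))
print(job_levels)
```

Running the same per-column count on `indata` should show which predictors contain empty strings.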

Change the value of indataName to "bankmkt_part1.csv" (which has no missing values) and the model builds normally. A zip file containing the R script and the two datasets is attached here.
c50bug.zip
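A possible workaround, not verified against the attached data, is to have `read.csv` treat empty fields as `NA` via its `na.strings` argument, so that no empty-string level is created. The inline CSV below is hypothetical and only stands in for `bankmkt_part1m10.csv`:

```r
# Small inline CSV standing in for bankmkt_part1m10.csv (hypothetical data).
csv_text <- "y,job,housing\n0,management,yes\n1,,no\n0,student,\n"

# na.strings = c("", "NA") makes read.csv treat empty fields as NA.
toy <- read.csv(text = csv_text,
                na.strings = c("", "NA"),
                stringsAsFactors = TRUE)

print(levels(toy$job))       # no "" level
print(sum(is.na(toy$job)))   # number of missing job values
```

Whether C5.0 then builds the model on the real dataset is an assumption that would need to be confirmed with the script in the zip file.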
