Tidy rules #16

Closed
talegari opened this issue Jul 26, 2018 · 7 comments

@talegari

Hi Max,

I end up using C5 often for its speed and rules. Thanks for the package!

Although the summary function prints the rules in a readable way, it is sometimes preferable to have them in a tidy data frame, one row per rule.

Rules displayed on calling summary function:

Rules:

Rule 1: (50, lift 2.9)
	Petal.Length <= 1.9
	->  class setosa  [0.981]

Rule 2: (48/1, lift 2.9)
	Petal.Length > 1.9
	Petal.Length <= 4.9
	Petal.Width <= 1.7
	->  class versicolor  [0.960]

Rule 3: (46/1, lift 2.9)
	Petal.Width > 1.7
	->  class virginica  [0.958]

Rule 4: (46/2, lift 2.8)
	Petal.Length > 4.9
	->  class virginica  [0.938]

Default class: setosa

Output of the tidying function:

| support | confidence | lift | LHS | RHS | n_conditions |
|---|---|---|---|---|---|
| 50 | 1.0000000 | 2.94231 | Petal.Length < 1.9 | setosa | 1 |
| 48 | 0.9791667 | 2.88000 | Petal.Length > 1.9 & Petal.Length < 4.9000001 & Petal.Width < 1.7 | versicolor | 3 |
| 46 | 0.9782609 | 2.87500 | Petal.Width > 1.7 | virginica | 1 |
| 46 | 0.9565217 | 2.81250 | Petal.Length > 4.9000001 | virginica | 1 |

Note that each LHS is a string that parses as an R expression, so it can be passed directly to dplyr::filter.
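As an illustration of that point, here is a minimal sketch that applies one LHS string from the table to iris. It uses base R so that it runs without extra packages; the commented dplyr line is an assumed equivalent that relies on rlang::parse_expr, which the original post does not mention.

```r
# One LHS string copied from the tidy output above.
lhs <- "Petal.Length > 1.9 & Petal.Length < 4.9000001 & Petal.Width < 1.7"

# Parse the string into an expression and evaluate it against the columns of iris.
keep <- eval(parse(text = lhs), envir = iris)
matched <- iris[keep, ]

# Class distribution of the rows that satisfy the rule.
table(matched$Species)

# Assumed dplyr equivalent (needs dplyr and rlang):
# dplyr::filter(iris, !!rlang::parse_expr(lhs))
```

Evaluating parsed text is only reasonable here because the strings come from the model itself, not from untrusted input.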

Here is the code to tidy the rules and the code snippet to run an example.

source("https://gist.github.com/talegari/dde1bc3aaed88533bcf7ee137296830a/raw/9bfc1fe894428b21ad2c94dd2d83b6277470dd60/tidy_rules_C5")

library(magrittr)                                          # provides %>%

dplyr::glimpse(iris)
model <- C50::C5.0(Species ~ ., data = iris, rules = TRUE) # build a C5 model
summary(model)                                             # print rules

tidy_rules(model) %>% knitr::kable()

Please let me know whether this would fit better in the broom package than here. Otherwise, let me know if you are open to a PR.

Suggestions are welcome!

Regards,
Srikanth KS

@topepo
Owner

topepo commented Jul 27, 2018

This looks great! A PR would be very welcome.

One thing: cubist has the same rule format (but a different model structure in the terminal nodes/rules). It would make sense to have this work for both. Can you adapt it to work with that package? I've been putting any joint infrastructure functions in Cubist (e.g. makeDataFile). If this were put in C50, it would create a circular dependency.

@talegari
Author

Thanks Max.

Cubist output can be processed similarly. I will write a function to create a tidy dataframe for that and submit a PR there.

About the design:

  1. Should we have an S3 generic tidy_rules with tidy_rules.C50 and tidy_rules.Cubist methods?
  2. Or two different functions named tidy_rules going into both packages separately?

Please suggest.

@topepo
Owner

topepo commented Jul 27, 2018

For other functions that the two packages share, I add them to Cubist and then import them from there into C50.

So try to write the functions so that they share as much common code as possible, then add those common functions and the tidy_rules generic to Cubist. C50 can then import the generic and have its own tidy_rules method.
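A minimal sketch of that layout, collapsed into one script for illustration. The method bodies are stubs that return hard-coded rows, `new_rule_table` is a hypothetical shared helper, and the class names "C5.0" and "cubist" are assumptions about what the fitted model objects carry; none of this is the code from the eventual PR.

```r
# Generic; per the thread it would live in Cubist and be imported by C50.
tidy_rules <- function(object, ...) {
  UseMethod("tidy_rules")
}

# Hypothetical shared helper: both methods return the same column layout.
new_rule_table <- function(lhs, rhs) {
  data.frame(LHS = lhs, RHS = rhs, stringsAsFactors = FALSE)
}

# Method stubs. The class names "C5.0" and "cubist" are assumptions.
tidy_rules.C5.0 <- function(object, ...) {
  new_rule_table(lhs = "Petal.Length <= 1.9", rhs = "setosa")
}

tidy_rules.cubist <- function(object, ...) {
  new_rule_table(lhs = "crim > 9.2323", rhs = "13.25 - 0.11 crim + 0.009 zn")
}

# Stand-in objects so the dispatch can be tried without fitting models.
fake_c5     <- structure(list(), class = "C5.0")
fake_cubist <- structure(list(), class = "cubist")

tidy_rules(fake_c5)
tidy_rules(fake_cubist)
```

Keeping the generic and the shared helper in one package means the other package only needs to register its own method, which avoids the circular dependency mentioned earlier.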

@talegari
Author

talegari commented Jul 31, 2018

@topepo Please review this draft before PR submission.

  1. Please suggest changes in column names and order if necessary
  2. I have not been able to produce a case where no rules are generated. If you have an example of that, I will cover that edge case.

@topepo
Owner

topepo commented Sep 1, 2018

It looks really good.

Some recommendations/comments:

  • Have the output contain a column for committee (singular) and rule to better match the output.
  • It looks like non-standard names need to be escaped somehow. A variable named Gr Liv Area gets translated to Gr * Liv * Area (code below). Odd factor levels seem fine though.
  • Don't forget to update the NEWS file with the change.
  • Did you consider using the model object? That's built to be parsed. Here is an example using the vignette data set:
> cubist(x = train_pred[, 1:2], y = train_resp)$model %>% cat()
id="Cubist 2.07 GPL Edition 2018-09-01"
prec="1" globalmean="22.41485" extrap="1" insts="0" ceiling="95" floor="0"
att="outcome" mean="22.41" sd="9.284727" min="5" max="50"
att="crim" mean="3.789463" sd="8.553482" min="0.00906" max="88.9762"
att="zn" mean="11.38" sd="23.47519" min="0" max="100"
entries="1"
rules="2"
conds="1" cover="54" mean="12.27" loval="5" hival="27.9" esterr="3.96"
type="2" att="crim" cut="9.2322998" result=">"
coeff="13.25" att="crim" coeff="-0.11" att="zn" coeff="0.009"
conds="1" cover="350" mean="23.98" loval="8.1" hival="50" esterr="5.59"
type="2" att="crim" cut="9.2322998" result="<="
coeff="21.79" att="crim" coeff="-0.62" att="zn" coeff="0.105"
> cubist(x = train_pred[, 1:2], y = train_resp) %>% summary()

Call:
cubist.default(x = train_pred[, 1:2], y = train_resp)


Cubist [Release 2.07 GPL Edition]  Sat Sep  1 17:43:44 2018
---------------------------------

    Target attribute `outcome'

Read 404 cases (3 attributes) from undefined.data

Model:

  Rule 1: [54 cases, mean 12.27, range 5 to 27.9, est err 3.96]

    if
	crim > 9.2323
    then
	outcome = 13.25 - 0.11 crim + 0.009 zn

  Rule 2: [350 cases, mean 23.98, range 8.1 to 50, est err 5.59]

    if
	crim <= 9.2323
    then
	outcome = 21.79 - 0.62 crim + 0.105 zn


Evaluation on training data (404 cases):

    Average  |error|               5.65
    Relative |error|               0.85
    Correlation coefficient        0.50


	Attribute usage:
	  Conds  Model

	  100%   100%    crim
	         100%    zn


Time: 0.0 secs
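
To illustrate why the model string is convenient to parse, here is a minimal sketch that splits one of its lines into key/value pairs with base R regular expressions. The helper `parse_kv` is hypothetical and is not part of Cubist.

```r
# One line copied from the `$model` string shown above.
line <- 'conds="1" cover="54" mean="12.27" loval="5" hival="27.9" esterr="3.96"'

# Hypothetical helper: split a line of key="value" tokens into a named
# character vector.
parse_kv <- function(line) {
  # Find every key="value" token on the line.
  tokens <- regmatches(line, gregexpr('[A-Za-z]+="[^"]*"', line))[[1]]
  # The key is everything before the first "=".
  keys <- sub('=.*$', "", tokens)
  # The value is what remains after stripping the key, "=" and the quotes.
  values <- gsub('^[^=]+="|"$', "", tokens)
  stats::setNames(values, keys)
}

rule_header <- parse_kv(line)
as.numeric(rule_header[["cover"]])
```

Lines that repeat a key, such as the `coeff`/`att` lines, yield duplicate names, so they would need to be handled pairwise rather than looked up by name.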

Some testing code:

library(Cubist)
library(AmesHousing)
library(tidymodels)

ames <- make_ames()

ames2 <- 
  ames %>%
  dplyr::rename(`Gr Liv Area` = Gr_Liv_Area) %>%
  mutate(
    Overall_Qual = gsub("_", " ", as.character(Overall_Qual)),
    MS_SubClass = gsub("_", " ", as.character(MS_SubClass))
    )

cb_mod <- 
  cubist(
    x = ames2 %>% dplyr::select(-Sale_Price),
    y = log10(ames2$Sale_Price),
    committees = 3
    ) 

tr <- tidy_rules(cb_mod)

@talegari
Author

talegari commented Sep 5, 2018

Thanks Max,

  • Changed to 'committee'.
  • Names with spaces have been handled. See the attached doc.
  • Updated the news file.
  • I had not noticed the model object. Parsing the model object and parsing the summary output seem almost equivalent. I would stick with parsing the summary output, unless you see a compelling reason to change.
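
For readers without the attached document, one possible way to handle names with spaces is to wrap non-syntactic names in backticks so that the condition string still parses. This is only a sketch under that assumption, not necessarily what the PR does; `quote_name` is a hypothetical helper and the three-row data frame is made up for the example.

```r
# Hypothetical helper: backtick-quote a column name unless it is already a
# syntactically valid R name (make.names() leaves valid names unchanged).
quote_name <- function(name) {
  ifelse(make.names(name) == name, name, paste0("`", name, "`"))
}

# Toy data with a non-syntactic column name (values are made up).
df <- data.frame(`Gr Liv Area` = c(900, 1500, 2100), check.names = FALSE)

# Build a condition string and evaluate it against the data frame.
lhs <- paste(quote_name("Gr Liv Area"), "> 1000")
kept <- df[eval(parse(text = lhs), envir = df), , drop = FALSE]
nrow(kept)
```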

I will submit a new PR shortly.
tidy_rules_spaces_handled.pdf

edit: PR is here

@topepo
Owner

topepo commented May 7, 2021

I think that you solved this with tidyrules.

@topepo topepo closed this as completed May 7, 2021