Tidy rules #16

talegari · 2018-07-26T08:49:21Z

Hi Max,

I end up using C5 often for its speed and rules. Thanks for the package!

Although the summary function prints the rules in a handy way, it is sometimes preferable to have them in a tidy format.

Rules displayed on calling summary function:

Rules:

Rule 1: (50, lift 2.9)
	Petal.Length <= 1.9
	->  class setosa  [0.981]

Rule 2: (48/1, lift 2.9)
	Petal.Length > 1.9
	Petal.Length <= 4.9
	Petal.Width <= 1.7
	->  class versicolor  [0.960]

Rule 3: (46/1, lift 2.9)
	Petal.Width > 1.7
	->  class virginica  [0.958]

Rule 4: (46/2, lift 2.8)
	Petal.Length > 4.9
	->  class virginica  [0.938]

Default class: setosa

Output of the tidying function:

support	confidence	lift	LHS	RHS	n_conditions
50	1.0000000	2.94231	Petal.Length < 1.9	setosa	1
48	0.9791667	2.88000	Petal.Length > 1.9 & Petal.Length < 4.9000001 & Petal.Width < 1.7	versicolor	3
46	0.9782609	2.87500	Petal.Width > 1.7	virginica	1
46	0.9565217	2.81250	Petal.Length > 4.9000001	virginica	1

Note that each LHS is a string that can be parsed as an R expression, so it can be pasted directly into dplyr::filter.
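As a minimal sketch of that idea, assuming the dplyr and rlang packages are installed, an LHS string can be parsed and injected into dplyr::filter without retyping it. The string is copied from the third row of the table above; the names `lhs` and `covered` are illustrative only.

```r
library(dplyr)

# LHS string copied from the tidy output above (rule 3)
lhs <- "Petal.Width > 1.7"

# parse_expr() turns the string into an unevaluated R expression;
# !! injects that expression into filter() as if it had been typed in
covered <- dplyr::filter(iris, !!rlang::parse_expr(lhs))

# class distribution among the rows the rule covers
table(covered$Species)
```

The same pattern should work for the multi-condition rows, since `&` is ordinary R syntax inside the parsed expression.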

Here is the code to tidy the rules and the code snippet to run an example.

source("https://gist.github.com/talegari/dde1bc3aaed88533bcf7ee137296830a/raw/9bfc1fe894428b21ad2c94dd2d83b6277470dd60/tidy_rules_C5")

dplyr::glimpse(iris)
model <- C50::C5.0(Species ~ ., data = iris, rules = TRUE) # build a C5 model
summary(model)                                             # print rules

tidy_rules(model) %>% knitr::kable()

Please let me know whether this would fit better in the broom package than here. Otherwise, let me know if you are open to a PR.

Suggestions are welcome!

Regards,
Srikanth KS


topepo · 2018-07-27T02:21:19Z

This looks great! A PR would be very welcome.

One thing: Cubist has the same rule format (but a different model structure in the terminal nodes/rules). It would make sense to have this work for both. Can you adapt it to work with that package? I've been putting any joint infrastructure functions in Cubist (e.g. makeDataFile). If this were put in C50, it would create a circular dependency.

talegari · 2018-07-27T06:47:21Z

Thanks Max.

Cubist output can be processed similarly. I will write a function to create a tidy dataframe for that and submit a PR there.

About the design:

Should we have an S3 generic tidy_rules with tidy_rules.C50 and tidy_rules.Cubist methods?
Or two different functions named tidy_rules, one in each package?

Please suggest.

topepo · 2018-07-27T15:43:57Z

With other functions that they share, I add them to Cubist and then import them from there into C50.

So try to write the function so that they share as much common code as possible then add those common functions and the tidy_rules generic to Cubist. Then C50 can import the class and have its own tidy_rules method.
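A minimal sketch of that layout, using toy objects rather than real fitted models. The helper `collapse_conditions` and the `conditions` field are hypothetical stand-ins for the shared parsing code. The method suffixes assume that fitted objects carry the S3 classes "C5.0" and "cubist", which differ from the method names floated earlier in the thread.

```r
# Generic: in the layout described above this would live in Cubist and be exported
tidy_rules <- function(object, ...) {
  UseMethod("tidy_rules")
}

# Hypothetical shared helper (also in Cubist): join conditions into one LHS string
collapse_conditions <- function(conditions) {
  paste(conditions, collapse = " & ")
}

# Method that would live in Cubist
tidy_rules.cubist <- function(object, ...) {
  data.frame(LHS = collapse_conditions(object$conditions),
             source = "cubist", stringsAsFactors = FALSE)
}

# Method that would live in C50, which imports the generic and helper from Cubist
tidy_rules.C5.0 <- function(object, ...) {
  data.frame(LHS = collapse_conditions(object$conditions),
             source = "C5.0", stringsAsFactors = FALSE)
}

# Toy objects; real model objects do not have a `conditions` field
toy_c5 <- structure(list(conditions = c("Petal.Length > 1.9", "Petal.Width <= 1.7")),
                    class = "C5.0")
toy_cb <- structure(list(conditions = "crim > 9.2323"), class = "cubist")

res_c5 <- tidy_rules(toy_c5)
res_cb <- tidy_rules(toy_cb)
```

Keeping the generic in one package means C50 only needs to register a method, so the dependency runs in a single direction.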

talegari · 2018-07-31T07:45:16Z

@topepo Please review this draft before PR submission.

Please suggest changes to the column names and order if necessary.
I have not been able to produce a case where no rules are generated. If you have an example of that, I will cover that edge case.

topepo · 2018-09-01T21:45:55Z

It looks really good.

Some recommendations/comments:

Have the output contain a column for committee (singular) and rule to better match the output.
It looks like non-standard names need to be escaped somehow. A variable named Gr Liv Area gets translated to Gr * Liv * Area (code below). Odd factor levels seem fine though.
Don't forget to update the NEWS file with the change.
Did you consider using the model object? That's built to be parsed. Here is an example using the vignette data set:

> cubist(x = train_pred[, 1:2], y = train_resp)$model %>% cat()
id="Cubist 2.07 GPL Edition 2018-09-01"
prec="1" globalmean="22.41485" extrap="1" insts="0" ceiling="95" floor="0"
att="outcome" mean="22.41" sd="9.284727" min="5" max="50"
att="crim" mean="3.789463" sd="8.553482" min="0.00906" max="88.9762"
att="zn" mean="11.38" sd="23.47519" min="0" max="100"
entries="1"
rules="2"
conds="1" cover="54" mean="12.27" loval="5" hival="27.9" esterr="3.96"
type="2" att="crim" cut="9.2322998" result=">"
coeff="13.25" att="crim" coeff="-0.11" att="zn" coeff="0.009"
conds="1" cover="350" mean="23.98" loval="8.1" hival="50" esterr="5.59"
type="2" att="crim" cut="9.2322998" result="<="
coeff="21.79" att="crim" coeff="-0.62" att="zn" coeff="0.105"
> cubist(x = train_pred[, 1:2], y = train_resp) %>% summary()

Call:
cubist.default(x = train_pred[, 1:2], y = train_resp)


Cubist [Release 2.07 GPL Edition]  Sat Sep  1 17:43:44 2018
---------------------------------

    Target attribute `outcome'

Read 404 cases (3 attributes) from undefined.data

Model:

  Rule 1: [54 cases, mean 12.27, range 5 to 27.9, est err 3.96]

    if
	crim > 9.2323
    then
	outcome = 13.25 - 0.11 crim + 0.009 zn

  Rule 2: [350 cases, mean 23.98, range 8.1 to 50, est err 5.59]

    if
	crim <= 9.2323
    then
	outcome = 21.79 - 0.62 crim + 0.105 zn


Evaluation on training data (404 cases):

    Average  |error|               5.65
    Relative |error|               0.85
    Correlation coefficient        0.50


	Attribute usage:
	  Conds  Model

	  100%   100%    crim
	         100%    zn


Time: 0.0 secs
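To illustrate why the model string shown above is convenient to parse, here is a rough base R sketch that splits one of its lines into key/value pairs. The function name `parse_kv` is hypothetical and is not part of Cubist; the two input lines are copied from the printed model string.

```r
# Split a line of key="value" pairs into a two-column data frame.
# A data frame (not a named list) is used because keys such as
# coeff and att repeat within a single line.
parse_kv <- function(line) {
  pairs <- regmatches(line, gregexpr('[A-Za-z]+="[^"]*"', line))[[1]]
  data.frame(
    key   = sub('=.*$', '', pairs),
    value = sub('^[^=]+="([^"]*)"$', '\\1', pairs),
    stringsAsFactors = FALSE
  )
}

rule_line  <- 'conds="1" cover="54" mean="12.27" loval="5" hival="27.9" esterr="3.96"'
coeff_line <- 'coeff="13.25" att="crim" coeff="-0.11" att="zn" coeff="0.009"'

rule_kv  <- parse_kv(rule_line)
coeff_kv <- parse_kv(coeff_line)
```

Values come back as character strings, so numeric fields such as `cover` would still need `as.numeric()`.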

Some testing code:

library(Cubist)
library(AmesHousing)
library(tidymodels)

ames <- make_ames()

ames2 <- 
  ames %>%
  dplyr::rename(`Gr Liv Area` = Gr_Liv_Area) %>%
  mutate(
    Overall_Qual = gsub("_", " ", as.character(Overall_Qual)),
    MS_SubClass = gsub("_", " ", as.character(MS_SubClass))
    )

cb_mod <- 
  cubist(
    x = ames2 %>% dplyr::select(-Sale_Price),
    y = log10(ames2$Sale_Price),
    committees = 3
    ) 

tr <- tidy_rules(cb_mod)
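One possible way to keep an LHS parseable when a column name is non-standard is to wrap such names in backticks. This is only a sketch of the idea in base R, not necessarily how tidy_rules handles it; `quote_name` and `ames_toy` are hypothetical names.

```r
# Backtick-quote a name only when it is not already a syntactic R name
quote_name <- function(x) {
  ifelse(make.names(x) == x, x, paste0("`", x, "`"))
}

# Toy data with a non-standard column name (check.names = FALSE keeps the spaces)
ames_toy <- data.frame(`Gr Liv Area` = c(1200, 1800), check.names = FALSE)

bad_lhs  <- "Gr Liv Area > 1500"
good_lhs <- paste(quote_name("Gr Liv Area"), "> 1500")

# The unquoted string is not valid R syntax, so parse() signals an error
bad_parses <- tryCatch({
  parse(text = bad_lhs)
  TRUE
}, error = function(e) FALSE)

# The quoted string parses and can be evaluated against the data
keep <- eval(parse(text = good_lhs)[[1]], envir = ames_toy)
ames_toy[keep, , drop = FALSE]
```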

talegari · 2018-09-05T07:20:41Z

Thanks Max,

Changed to 'committee'.
Names with spaces have been handled. See the attached doc.
Updated the news file.
I had not noticed the model object. Parsing the model object and parsing the summary output seem almost equivalent. I would stick with parsing the summary output, unless you see a compelling reason to change.

I will submit a new PR shortly.
tidy_rules_spaces_handled.pdf

edit: PR is here

topepo · 2021-05-07T00:19:21Z

I think that you solved this with tidyrules.

talegari mentioned this issue Aug 3, 2018

tidy_rules topepo/Cubist#21

Closed

talegari added a commit to talegari/Cubist that referenced this issue Sep 5, 2018

handled comments from max in topepo/C5.0#16 (comment)

3980c89

talegari mentioned this issue Sep 5, 2018

tidy_rules (attempt 2) topepo/Cubist#22

Closed

topepo closed this as completed May 7, 2021
