- Work on tests
  - tests for `left` and `close_end` arguments
  - tests for `brk_default`
  - `brk_width()` needs tests which match the guarantees in the documentation
  - ditto for `brk_evenly()`, which now uses its own implementation to guarantee exactly `intervals` intervals
  - systematic tests for `brk_*` functions
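
  A minimal sketch of what such tests might look like, using testthat. The argument names (`left`, `close_end`, `intervals`) come from the notes above; the expectations are illustrative assumptions about behaviour, not a record of what the package currently does:

  ```r
  library(testthat)
  library(santoku)

  test_that("left and close_end control which endpoints are included", {
    # with left-closed intervals and a closed last endpoint, every break
    # value should land in some interval (illustrative expectation)
    res <- chop(1:3, breaks = 1:3, left = TRUE, close_end = TRUE)
    expect_false(anyNA(res))
  })

  test_that("chop_evenly() returns exactly `intervals` intervals", {
    res <- chop_evenly(1:100, intervals = 5)
    expect_equal(nlevels(res), 5)
  })
  ```
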
- Implement a simple `Infinity` class that automatically casts to any other class and is always greater or less than any other element? Then replace the `class_bounds()` complexity?
  - The problem at the moment is that `vec_cast()` is highly unreliable and you never know if a particular class will accept `Inf`.
  - An infinity class would be fine, but how does that go into the existing `breaks` object, which has its own underlying class?
  - Might be more reasonable just not to add `Inf` or `-Inf` elements. Instead, record whether the breaks have left and right "infinity" set. Then just add numeric infinity to the breaks before you call `categorize_impl` (or the R version). In particular, e.g. `integer64` doesn't like `Inf` or `-Inf`, but it does have very large numbers in `bit64::lim.integer64` which look ugly and which only exist to be lower/higher than everything else anyway...
  - But NB this requires a new way to create the labels, and that kinda sucks....
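
  A rough sketch of the "record flags, add numeric infinities late" idea. The names `new_breaks`, `extend_left`, `extend_right` and `categorize` are stand-ins for whatever the internals actually use, and `findInterval()` stands in for `categorize_impl`:

  ```r
  # Hypothetical internals: keep the breaks in their own class, never storing
  # Inf/-Inf in them; just record whether either end is meant to be unbounded.
  new_breaks <- function(breaks, extend_left = FALSE, extend_right = FALSE) {
    attr(breaks, "extend_left")  <- extend_left
    attr(breaks, "extend_right") <- extend_right
    breaks
  }

  # Only at categorization time, on the numeric representation, do we add the
  # numeric infinities -- so classes like integer64 never have to hold Inf.
  categorize <- function(x, brk) {
    num_breaks <- as.numeric(brk)
    if (isTRUE(attr(brk, "extend_left")))  num_breaks <- c(-Inf, num_breaks)
    if (isTRUE(attr(brk, "extend_right"))) num_breaks <- c(num_breaks, Inf)
    findInterval(as.numeric(x), num_breaks)  # stand-in for categorize_impl()
  }
  ```
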
- On the whole, we don't want to error out if `x` is weird. `x` is data. But if e.g. `breaks` are weird, we can error out.
  - Exception: `x` is the wrong class or type.
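
  To make the principle concrete: the unsorted-breaks error is mentioned further down in these notes; the NA pass-through shown here is an assumption about behaviour, not documentation:

  ```r
  library(santoku)

  # Oddities in the data should come back chopped or as NA, not as an error
  # (assumed behaviour: NA values of x stay NA in the result)
  chop(c(1.5, NA), breaks = 1:2)

  # Oddities in the breaks are the caller's mistake, so failing loudly is fine
  # (unsorted breaks currently throw an error, per the note below)
  try(chop(1:10, breaks = c(3, 1)))
  ```
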
- In some cases we want to guarantee the set of breaks.
  - e.g. `brk_manual()` with `extend` set.
- In other cases, e.g. `brk_evenly()`, we don't need to make such a guarantee.
- Is it really OK to have `left = FALSE` as the default in `chop_quantiles()`, `chop_evenly()` and friends?
  - the alternative is to do it only when `x` is non-numeric.
  - that makes the surprise rarer, but rare surprises can be worse... and it adds complexity, since the functions have to be generic.
  - another alternative: `chop` sets `left = FALSE` for non-numeric `x`. Probably better.
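
  A sketch of that last alternative, assuming a plain `is.numeric()` test is enough to tell the two cases apart; `default_left` is a hypothetical helper, not the package's code:

  ```r
  # Hypothetical: chop() picks the default for `left` from the data, so numeric
  # x keeps left-closed intervals and non-numeric x gets right-closed ones.
  default_left <- function(x, left = NULL) {
    if (is.null(left)) left <- is.numeric(x)
    left
  }

  default_left(rnorm(5))               # TRUE:  [a, b) intervals
  default_left(as.Date("2024-01-01"))  # FALSE: (a, b] intervals
  ```
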
- Do we need `drop`?
  - should `drop` have a default of `!isTRUE(extend)`, i.e. be `FALSE` when `extend = TRUE`?
- Should we have a flag to return characters?
  - No. We do have `labels = NULL`, though that gives integer codes only.
- Should we put a `percent` argument into `brk_quantiles()` so it can store scaled endpoints as proportions rather than percentages (the current default)?
  - My sense is, not unless someone asks.
  - Oh, someone just did ask; more generally though.
- Should the `close_end = TRUE` argument come before `...` in `chop_` variants?
  - No. We don't want people to set it by position, so distinguish it from the initial arguments.
- What to do about `tidyr::chop()`?
  - Current answer: fuck 'em. (NB: just kidding. I am a huge tidyverse fan.)
  - We provide `kiru()`. So on the REPL, people can just use `kiru()` if they load santoku first. If they load santoku second, they'll have to use `tidyr::chop()`, but reading the documentation, I suspect this will be rare.
  - For programming, people should probably use the fully qualified name anyway.
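
  For reference, what the workaround looks like in practice (`kiru()` is the alias for santoku's `chop()` mentioned above, so it never clashes; the calls are just illustrative):

  ```r
  library(tidyr)
  library(santoku)   # whichever package is attached last wins the plain chop() name

  kiru(rnorm(10), -2:2)           # santoku's chop() under a clash-free name
  santoku::chop(rnorm(10), -2:2)  # fully qualified: the safe choice when programming
  ```
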
- When to extend?
  - I think the default should be "if necessary" (`extend = NULL`); should always extend to `Inf`, `-Inf` so that these break labels are not data-dependent.
  - Tension between wanting something predictable in your new data, vs. something readable in `tab_*`. E.g. `tab_size(1:9, 3, lbl_seq())` should surely return labels a, b, c. But this means we aren't always extending.
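
  A small illustration of the tension, assuming `extend = NULL` means "extend only when some value of `x` falls outside the breaks"; `drop = FALSE` is used in the last call so that empty extended intervals stay visible:

  ```r
  library(santoku)

  x_inside  <- c(1.2, 1.8)
  x_outside <- c(0.2, 1.8)

  # With extend = NULL, whether the -Inf/Inf intervals appear depends on the data:
  levels(chop(x_inside,  breaks = 1:2, extend = NULL))
  levels(chop(x_outside, breaks = 1:2, extend = NULL))

  # Always extending makes the level set predictable across datasets:
  levels(chop(x_inside, breaks = 1:2, extend = TRUE, drop = FALSE))
  ```
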
- Should we allow vector `labels` to be longer than necessary?
  - lets people do e.g. `chop(rnorm(100), -2:2, LETTERS)`
  - but might hide errors
  - overall I'm against
- Is the label interface right? Problem exposed by `brk_mean_sd`: if we aren't sure whether data gets extended, then how do we know what the labels should be?
  - maybe label functions should have access to `x`?
  - or should they be informed if breaks got extended?
  - or could the breaks object know how to extend its labels?
  - current solution: labels get `extend`
  - I think better: `breaks` objects include suggested labels which the user can override. That way they always have the info necessary.
  - We could also divide labelling into two parts:
    - choosing the break numbers (these may not be the actual values, e.g. they could be quantiles or std errs from 0)
    - formatting these numbers, with dashes, set notation etc.
  - So maybe `brk_*` functions always return break numbers; then labels decide how to format them?
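
  A sketch of that two-part split: the `brk_*` side returns plain break numbers plus suggested labels, and formatting is a separate, swappable step. All names here (`make_breaks`, `suggested_labels`, `lbl_dashes`) are hypothetical:

  ```r
  # Hypothetical: a breaks object is just the numbers plus suggested labels
  # which the user can override.
  make_breaks <- function(numbers, suggested_labels = NULL) {
    list(numbers = numbers, suggested_labels = suggested_labels)
  }

  # Formatting the numbers (dashes, set notation, ...) lives in its own function.
  lbl_dashes <- function(brk) {
    n <- brk$numbers
    paste0(format(n[-length(n)]), " - ", format(n[-1]))
  }

  b <- make_breaks(c(0, 1.5, 3), suggested_labels = c("low", "high"))
  lbl_dashes(b)        # "0.0 - 1.5" "1.5 - 3.0"
  b$suggested_labels   # always available as a fallback or an override
  ```
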
- Should we automatically sort breaks, or throw an error if they're unsorted?
  - or a warning?
  - currently an error
- What if `breaks = c(1, 2, 2, 2, 3)`?
  - throw an error
- For some cases, e.g. `brk_quantiles`, `brk_width`, the data may not work well, e.g. if it is all NA. What is an empty set of breaks?
- `hist_xxx` functions for histograms/barplots? (how to treat singletons?)
- `grp_xxx` for `group_by`? Hmmm...
- New label interface to replace `lbl_sequence`: `lbl_style("1.")`, `lbl_style("(i)")`, `lbl_style("A")` etc.?
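
  A rough sketch of how such a style-string interface might work; the `lbl_style()` name and its behaviour are entirely speculative:

  ```r
  # Hypothetical lbl_style(): infer the sequence type from a template string
  # and return a function that produces n labels in that style.
  lbl_style <- function(template = "1.") {
    function(n) {
      idx <- seq_len(n)
      core <- if (grepl("i", template, fixed = TRUE)) {
        tolower(as.character(utils::as.roman(idx)))  # i, ii, iii, ...
      } else if (grepl("A", template, fixed = TRUE)) {
        LETTERS[idx]                                 # A, B, C, ...
      } else {
        as.character(idx)                            # 1, 2, 3, ...
      }
      # keep whatever wraps the placeholder, e.g. "(i)" -> "(ii)", "(iii)", ...
      vapply(core, function(s) sub("1|i|A", s, template), character(1),
             USE.NAMES = FALSE)
    }
  }

  lbl_style("(i)")(3)  # "(i)"  "(ii)" "(iii)"
  lbl_style("1.")(3)   # "1."   "2."   "3."
  ```
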
- Still wonder: could we drop `extend`, which adds complexity, and just have `only()` or `extend()` as new breaks functions?
- Speedup categorize by only checking left intervals, adding 1 if it's past each interval? [NO: actually no fewer checks in the end...]
- Speedup by using pointers? hmm, magic...