
Ensure returned values stay within bounds #11

Merged

merged 1 commit into caio:master on Oct 11, 2017

Conversation

christineyen
Contributor

While it's understandable for t-digests to be inaccurate on small sample sizes, the values returned should still be within reason and not impossible. This change clamps returned values to the range of keys seen.
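The change itself is small; here's a minimal sketch of the idea (illustrative names, not the PR's actual diff):

```go
// A minimal sketch of the clamping idea: the estimate is forced into
// the closed range spanned by the smallest and largest centroid means
// seen. Names here are illustrative, not the library's actual code.
package main

import "fmt"

func clamp(estimate, min, max float64) float64 {
	if estimate < min {
		return min
	}
	if estimate > max {
		return max
	}
	return estimate
}

func main() {
	// With all inputs in [1, 100], an interpolated estimate that
	// drifted out of range gets pulled back to the nearest key seen.
	fmt.Println(clamp(-22.75, 1, 100)) // 1
}
```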

@caio
Owner

caio commented Oct 11, 2017

Excellent, thanks a lot!

I wonder what kind of weird frowns these crazy numbers caused if you saw them in production 😅

I'll release v1.1.2 today with your patch

@caio caio merged commit a72570c into caio:master Oct 11, 2017
@caio
Owner

caio commented Oct 11, 2017

Released @ https://github.com/caio/go-tdigest/releases/tag/v1.1.2

Thanks again!

@christineyen
Contributor Author

Thanks for the speedy merge!

In retrospect, I went with this approach (less accurate, because who knows how close the centroid keys are to the actual min/max) in order to avoid changing the serialization logic - but it feels like the more correct thing to do would be to track min/max explicitly, as in the sketch below. Thoughts? (... and for whatever reason, my machine is struggling to build the tdunning/t-digest repo.)
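For illustration, a hedged sketch of what explicit tracking could look like as a wrapper around the digest (the embedded `Add` signature is an assumption for the example, not the library's documented API); wrapping rather than changing the struct would sidestep the serialization change entirely:

```go
// Package bounded sketches explicit min/max tracking alongside a
// t-digest, leaving the digest's own wire format untouched.
package bounded

// digest is an assumed interface standing in for the real type.
type digest interface {
	Add(value float64, count uint32) error
}

type boundedDigest struct {
	inner    digest
	min, max float64
	seen     bool
}

// Add records the exact extremes before delegating to the digest;
// quantile estimates can later be clamped to [min, max].
func (b *boundedDigest) Add(value float64) error {
	if !b.seen || value < b.min {
		b.min = value
	}
	if !b.seen || value > b.max {
		b.max = value
	}
	b.seen = true
	return b.inner.Add(value, 1)
}
```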

@caio
Owner

caio commented Oct 11, 2017

I'm not sure I would go further than you already did with your initial patch - the problem only occurs when we have too little data, which is not what the data structure was designed for. Besides, keeping track of min, max (and mean, if one cares) is a lot simpler, unless I'm missing something. So, all in all, I don't think it's worth the effort.

That said, if you wanna go that route we'd need to decide whether the interop with (one of) the Java serialization format(s) is worth it, since it would be impossible to have this min/max feature working correctly unless the Java one also knew about it (and sent it through the wire). I wouldn't object to breaking it as I'm not aware of anybody using this, but as you can see I keep piling disadvantages on the feature haha.

(RE building the Java code: it builds fine here with mvn compile (or with IDEA's built-in Maven) - I would expect the only problem to be with downloading dependencies if you're behind a proxy. Feel free to dump the error you are getting here, I'm happy to (try to) help!)

@christineyen
Contributor Author

Yup, fair. Our use of this library is actually wrapped within some logic that switches from exact quantile calculation to approximation at a certain number of samples, but there are some degenerate cases in which that logic Does The Right Thing (TM) and there still aren't enough distinct values fed into the t-digest.

Anyway. The main impetus here wrt tracking actual min/max is how far the first/last centroids' keys will shift from the input values -- but that's something we can (read: should) track in our wrapper around this data structure instead, and likely not worth breaking Java compatibility. Thanks! 👋

@caio
Owner

caio commented Oct 11, 2017

Thanks a lot! Knowing how this is being used makes me quite happy, even more so now that I clicked around: I'm a fan (and sadly not a user) of what you all are making! 👋

@christineyen
Contributor Author

Hah! Went digging -- I'm not convinced that this solution is right, actually (bounding to the first/last centroids' means instead of actual min/maxes), and found this: tdunning/t-digest@89bb394

Sounds like maybe an argument for making this compatible with that format ;)

In the meantime - this -0.5 (https://github.com/caio/go-tdigest/blob/master/tdigest.go#L75) is likely the source of the negative quantile values. Any idea what the origin of that was? Tried tracking it down in the t-digest paper and the tdunning implementation of AvlTreeDigest and it doesn't look like it was something that was intentionally ported over... 🤔
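For concreteness, here's a hedged, simplified reproduction of that interpolation step (invented centroid layout, not the library's actual code) showing how the -0.5 can push an estimate below the true minimum:

```go
package main

import "fmt"

// Simplified form of the interior-centroid interpolation: the -0.5
// recenters the within-centroid fraction around the centroid mean, so
// the result spans mean ± delta/2 regardless of the true data range.
func main() {
	// Invented centroids: means 1 (count 1), 2 (count 2), 100 (count 1).
	mean := 2.0                      // middle centroid's mean
	count := 2.0                     // middle centroid's weight
	prevMean, nextMean := 1.0, 100.0 // neighboring centroid means
	delta := (nextMean - prevMean) / 2 // 49.5

	// The query lands at the very start of the middle centroid,
	// i.e. no weight consumed inside it yet.
	consumed := 0.0
	estimate := mean + (consumed/count-0.5)*delta
	fmt.Println(estimate) // -22.75: below the true minimum of 1
}
```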

@caio
Owner

caio commented Oct 14, 2017

You might be on to something indeed! But I'm not sure it would be in the quantile computation - the formula comes from Algorithm 4 (page 17) and looks correct to me (not trying to discard the possibility of a wrong port in there; math is not my forte - I'm already confused when reasoning starts involving curves). However, this:

https://github.com/caio/go-tdigest/blob/master/tdigest.go#L67

is supposed to handle the case that your patch also handles, so likely something shady is going on lol

I'll try to take a closer look at this (and the reference implementation's evolution, which I admittedly didn't follow closely) this coming Monday and will keep you posted!

@caio
Owner

caio commented Oct 16, 2017

So, I have news, and it isn't quite good heh.

As you can see, I've opened #12, which signals a serious bug. I've also opened #13, which might uncover some more, because of tdunning/t-digest#89 (issue 78 sheds some more light I guess, but it's over my head)

It's not clear to me yet what exactly the problem is, but this made me notice that Ted Dunning is working on a new version of the paper, deprecating some implementations and such (branch issue-84).

In other news, much work went into serializing t-digests upstream, so it certainly warrants a revisit (discussions start on pull request 56 there).

@caio
Owner

caio commented Oct 16, 2017

From this, I'm guessing the smartest course of action would be to act on #13 and then either decide whether it's worth chasing #12 or just go ahead and start the somewhat larger task of using the MergingDigest variant as the base of the implementation instead of the AVL one.

I've only used this repo's implementation in production for building stats over prices strictly above 0 (and fewer than 1M samples) and didn't see problems, but that's very different from your scenario, which might involve many repeated items, massive variance, etc.

Given that you likely have customers relying on these estimates, here's some full disclosure to help you prioritize your next actions: while I'm happy to chase all this, some months ago I quit my job feeling burned out and haven't really managed to get back on track wrt proper programming, which means my progress can and likely will be a lot slower than what is acceptable (thanks, btw - this interest in my code has helped with the motivation a bit :))

@caio
Owner

caio commented Oct 16, 2017

I managed to sit down with this for a while today; progress is on the avl-wip branch. The significant part is 42a5999 - no great news yet, even though it fixes the boundary problem without having to clamp manually.

@caio
Owner

caio commented Oct 24, 2017

In case you're still watching: I've been updating the status on #12 - all green now. I'll lock this PR since I think the issue is a better place to discuss the changes. 👋

Repository owner locked and limited conversation to collaborators Oct 24, 2017