
Ensure returned values stay within bounds #11

Merged

merged 1 commit into caio:master on Oct 11, 2017

Conversation

christineyen
Contributor

While it's understandable for t-digests to be inaccurate on small sample sizes, the values returned should still be within reason and not impossible. This change clamps returned values to the range of keys seen.
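The change itself is small; here's a minimal sketch of the idea (illustrative names, not the PR's actual diff):

```go
// A minimal sketch of the clamping idea: the estimate is forced into
// the closed range spanned by the smallest and largest centroid means
// seen. Names here are illustrative, not the library's actual code.
package main

import "fmt"

func clamp(estimate, min, max float64) float64 {
	if estimate < min {
		return min
	}
	if estimate > max {
		return max
	}
	return estimate
}

func main() {
	// With all inputs in [1, 100], an interpolated estimate that
	// drifted out of range gets pulled back to the nearest key seen.
	fmt.Println(clamp(-22.75, 1, 100)) // 1
}
```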

@caio
Owner

caio commented Oct 11, 2017

Excellent, thanks a lot!

I wonder what kind of weird frowns these crazy numbers caused if you saw them in production 😅

I'll release v1.1.2 today with your patch

@caio caio merged commit a72570c into caio:master Oct 11, 2017
@caio
Owner

caio commented Oct 11, 2017

Released @ https://github.com/caio/go-tdigest/releases/tag/v1.1.2

Thanks again!

@christineyen
Contributor Author

Thanks for the speedy merge!

In retrospect, I went with this approach (less accurate, because who knows how close the centroid keys are to the actual min/max) in order to avoid changing the serialization logic - but it feels like the more correct thing to do would be to track min/max explicitly, as in the sketch below. Thoughts? (... and for whatever reason, my machine is struggling to build the tdunning/t-digest repo.)
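For illustration, a hedged sketch of what explicit tracking could look like as a wrapper around the digest (the embedded `Add` signature is an assumption for the example, not the library's documented API); wrapping rather than changing the struct would sidestep the serialization change entirely:

```go
// Package bounded sketches explicit min/max tracking alongside a
// t-digest, leaving the digest's own wire format untouched.
package bounded

// digest is an assumed interface standing in for the real type.
type digest interface {
	Add(value float64, count uint32) error
}

type boundedDigest struct {
	inner    digest
	min, max float64
	seen     bool
}

// Add records the exact extremes before delegating to the digest;
// quantile estimates can later be clamped to [min, max].
func (b *boundedDigest) Add(value float64) error {
	if !b.seen || value < b.min {
		b.min = value
	}
	if !b.seen || value > b.max {
		b.max = value
	}
	b.seen = true
	return b.inner.Add(value, 1)
}
```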

@caio
Owner

caio commented Oct 11, 2017

I'm not sure I would go further than you already did with your initial patch - the problem only occurs when we have too little data, which is not what the data structure was designed for. Besides, keeping track of min, max (and mean, if one cares) is a lot simpler, unless I'm missing something. So, all in all, I don't think it's worth the effort.

That said, if you wanna go that route we'd need to decide whether the interop with (one of) the Java serialization format(s) is worth it, since it would be impossible to have this min/max feature working correctly unless the Java one also knew about it (and sent it through the wire). I wouldn't object to breaking it as I'm not aware of anybody using this, but as you can see I keep piling disadvantages on the feature haha.

(RE building the Java code: it builds fine here with mvn compile (or with IDEA's built-in Maven) - I would expect the only problem to be with downloading dependencies if you're behind a proxy. Feel free to dump the error you are getting here, I'm happy to (try to) help!)

@christineyen
Contributor Author

Yup, fair. Our use of this library is actually wrapped within some logic that switches from exact quantile calculation to approximation at a certain number of samples, but there are some degenerate cases in which that logic Does The Right Thing (TM) and there still aren't enough distinct values fed into the t-digest.

Anyway. The main impetus here wrt tracking actual min/max is how far the first/last centroids' keys will shift from the input values -- but that's something we can (read: should) track in our wrapper around this data structure instead, and likely not worth breaking Java compatibility. Thanks! 👋

@caio
Owner

caio commented Oct 11, 2017

Thanks a lot! Knowing how this is being used makes me quite happy, even more so now that I clicked around: I'm a fan (and sadly not a user) of what you all are making! 👋

@christineyen
Contributor Author

Hah! Went digging -- I'm not convinced that this solution is right, actually (bounding to the first/last centroids' means instead of actual min/maxes), and found this: tdunning/t-digest@89bb394

Sounds like maybe an argument for making this compatible with that format ;)

In the meantime - this -0.5 (https://github.com/caio/go-tdigest/blob/master/tdigest.go#L75) is likely the source of the negative quantile values. Any idea what the origin of that was? Tried tracking it down in the t-digest paper and the tdunning implementation of AvlTreeDigest and it doesn't look like it was something that was intentionally ported over... 🤔
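For concreteness, here's a hedged, simplified reproduction of that interpolation step (invented centroid layout, not the library's actual code) showing how the -0.5 can push an estimate below the true minimum:

```go
package main

import "fmt"

// Simplified form of the interior-centroid interpolation: the -0.5
// recenters the within-centroid fraction around the centroid mean, so
// the result spans mean ± delta/2 regardless of the true data range.
func main() {
	// Invented centroids: means 1 (count 1), 2 (count 2), 100 (count 1).
	mean := 2.0                      // middle centroid's mean
	count := 2.0                     // middle centroid's weight
	prevMean, nextMean := 1.0, 100.0 // neighboring centroid means
	delta := (nextMean - prevMean) / 2 // 49.5

	// The query lands at the very start of the middle centroid,
	// i.e. no weight consumed inside it yet.
	consumed := 0.0
	estimate := mean + (consumed/count-0.5)*delta
	fmt.Println(estimate) // -22.75: below the true minimum of 1
}
```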

@caio
Owner

caio commented Oct 14, 2017

You might be on to something indeed! But I'm not sure it would be in the quantile computation - the formula comes from Algorithm 4 (page 17) and looks correct to me (not trying to discard the possibility of a wrong port in there; math is not my forte - I'm already confused when reasoning starts involving curves). However, this:

https://github.com/caio/go-tdigest/blob/master/tdigest.go#L67

is supposed to handle the case that your patch also handles, so likely something shady is going on lol

I'll try to take a closer look at this (and the reference implementation's evolution, which I admittedly didn't follow closely) this coming Monday and will keep you posted!

@caio
Owner

caio commented Oct 16, 2017

So, I have news, and it isn't quite good heh.

As you can see, I've opened #12, which signals a serious bug. I've also opened #13, which might uncover some more, because of tdunning/t-digest#89 (issue 78 sheds some more light I guess, but it's over my head)

It's not clear to me yet what exactly the problem is, but this made me notice that Ted Dunning is working on a new version of the paper, deprecating some implementations and such (branch issue-84).

In other news, much work went into serializing t-digests upstream, so it certainly warrants a revisit (discussions start on pull request 56 there).

@caio
Owner

caio commented Oct 16, 2017

From this, I'm guessing the smartest course of action would be to act on #13 and then either decide whether it's worth chasing #12 or just go ahead and start the somewhat larger task of using the MergingDigest variant as the base of the implementation instead of the AVL one.

I've only used this repo's implementation in production for building stats over prices strictly above 0 (and fewer than 1M samples) and didn't see problems, but that's very different from your scenario, which might involve many repeated items, massive variance, etc.

Given that you likely have customers relying on these estimates, here's some full disclosure to help you prioritize your next actions: while I'm happy to chase all this, some months ago I quit my job feeling burned out and haven't really managed to get back on track wrt proper programming, which means my progress can and likely will be a lot slower than what is acceptable (thanks, btw - this interest in my code has helped with the motivation a bit :))

@caio
Owner

caio commented Oct 16, 2017

I managed to sit down with this for a while today; progress is on the avl-wip branch. The significant part is 42a5999 - no great news yet, even though it fixes the boundary problem without having to clamp manually.

@caio
Owner

caio commented Oct 24, 2017

In case you're still watching: I've been updating the status on #12 - all green now. I'll lock this PR since I think the issue is a better place to discuss the changes. 👋

Repository owner locked and limited conversation to collaborators Oct 24, 2017