Ensure returned values stay within bounds #11
Conversation
Excellent, thanks a lot! I wonder what kind of weird frowns this caused if you saw these crazy numbers in production 😅 I'll release v1.1.2 today with your patch
Released @ https://github.com/caio/go-tdigest/releases/tag/v1.1.2 Thanks again!
Thanks for the speedy merge! In retrospect, I went with this (less accurate, because who knows how close the centroid keys are to the actual min/max) approach in order to avoid changing the serialization logic - but it feels like the more correct thing to do would be to track min/max explicitly. Thoughts? (... and for whatever reason, my machine is struggling to build the tdunning/t-digest repo.)
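The explicit-tracking idea above could look roughly like the sketch below. This is a hypothetical wrapper, not code from this repository: `BoundedDigest`, its fields, and `Clamp` are illustrative names, and the comment marks where the real `tdigest.TDigest.Add` call would go.

```go
package main

import "fmt"

// BoundedDigest is a hypothetical wrapper that records the exact
// min/max of inserted values, so quantile estimates can be clamped
// to the true observed bounds rather than to the first/last
// centroid means (which may have drifted from the raw inputs).
type BoundedDigest struct {
	min, max float64
	seen     bool
}

// Add records v's contribution to the observed bounds.
func (b *BoundedDigest) Add(v float64) {
	if !b.seen || v < b.min {
		b.min = v
	}
	if !b.seen || v > b.max {
		b.max = v
	}
	b.seen = true
	// ... the real wrapper would also feed v into the underlying t-digest here.
}

// Clamp bounds a quantile estimate to the true observed range.
func (b *BoundedDigest) Clamp(estimate float64) float64 {
	if estimate < b.min {
		return b.min
	}
	if estimate > b.max {
		return b.max
	}
	return estimate
}

func main() {
	var d BoundedDigest
	for _, v := range []float64{3, 9, 1, 7} {
		d.Add(v)
	}
	fmt.Println(d.Clamp(12.4)) // → 9
	fmt.Println(d.Clamp(0.5))  // → 1
}
```

The trade-off discussed in this thread applies: the extra min/max fields would have to be serialized too, which is what breaks compatibility with the existing Java serialization formats.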
I'm not sure I would go further than you already went with your initial patch - the problem only occurs when we have too little data, which is not what the data structure was designed for. Besides, keeping track of min, max (and mean if one cares) is a lot simpler, unless I'm missing something. So, all in all, I don't think it's worth the effort. That said, if you wanna go that route we'd need to decide whether the interop with (one of) the Java serialization format(s) is worth it, for it would be impossible to have this (RE building the java code: it builds fine here with
Yup, fair. Our use of this library is actually wrapped within some logic that switches from exact quantile calculation to approximation at a certain # of samples, but there're some degenerate cases in which that logic Does The Right Thing (TM) and there still aren't enough distinct values fed into the tdigest. Anyway. The main impetus here wrt tracking actual min/max is how far the first/last centroids' keys will shift from the input values -- but that's something we can (read: should) track in our wrapper around this data structure instead, and likely not worth breaking Java compatibility. Thanks! 👋 |
Thanks a lot! Knowing how this is being used makes me quite happy, even more so now that I clicked around: I'm a fan (and sadly not a user) of what you all are making! 👋 |
Hah! Went digging -- I'm not convinced that this solution is right, actually (bounding to the first/last centroids' means instead of actual min/maxes), and found this: tdunning/t-digest@89bb394 Sounds like maybe an argument for making this compatible with that format ;) In the meantime - this
You might be on to something indeed! But I'm not sure it would be in the quantile computation - the formula comes from https://github.com/caio/go-tdigest/blob/master/tdigest.go#L67 and is supposed to handle the case that your patch also handles, so likely something shady is going on lol I'll try to take a closer look at this (and the reference implementation's evolution, which I admittedly didn't follow closely) coming Monday and will keep you posted!
So, I got news and they aren't quite good heh. As you can see I've opened #12, which signals a serious bug. I've also opened #13, which might uncover some more because of: tdunning/t-digest#89 (issue 78 sheds some more light I guess, but it's over my head) It's not clear to me yet what exactly the problem is, but it made me notice that Ted Dunning is working on a new version of the paper, deprecating some implementations and such (branch In other news, much work went into serializing tdigests, so it certainly warrants a revisit (discussions start on pull request 56 there).
From this, I'm guessing that the smartest course of action would be to act on #13 and then either decide whether it's worth chasing #12 or just go ahead and start the somewhat larger task of using the MergingDigest variant as the base of the implementation instead of the AVL one. I've only used this repo's implementation in production for building stats over prices strictly above 0 (and fewer than 1M samples) and didn't see problems, but that's very different from your scenario, which might involve many repeated items, massive variance, etc. Given that you likely have customers relying on these estimates, here's some full disclosure to help you prioritize your next actions: whilst I'm happy to chase all this, some months ago I quit my job after feeling burned out and haven't really managed to get back on track wrt proper programming, which means my progress can and likely will be a lot slower than what is acceptable (thanks btw, this interest in my code has helped with the motivation a bit :))
I managed to sit down on this for a while today, progress at the
In case you're still watching: I've been updating the status on #12 - all green now. I'll lock this PR since I think the issue is a better place to discuss the changes. 👋 |
While it's understandable for t-digests to be inaccurate on small sample sizes, the values returned should still be within reason and not impossible. This change clamps returned values to the range of keys seen.
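A minimal sketch of the clamping idea described above, under the assumption that the bounds are the means of the first and last centroids; `clamp` is an illustrative helper, not this library's actual code:

```go
package main

import "fmt"

// clamp restricts v to the closed interval [lo, hi]. In the context
// of this change, lo/hi would be the means of the smallest and
// largest centroids, so a quantile estimate can never fall outside
// the range of keys the digest has seen.
func clamp(v, lo, hi float64) float64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

func main() {
	// With very few samples, interpolation between centroids can
	// overshoot; clamping keeps the estimate within observed bounds.
	fmt.Println(clamp(120.5, 1.0, 100.0)) // → 100
	fmt.Println(clamp(-3.2, 1.0, 100.0))  // → 1
	fmt.Println(clamp(42.0, 1.0, 100.0))  // → 42
}
```

As the discussion notes, this bounds to centroid means rather than the true min/max of the inputs, so it prevents impossible values without guaranteeing exact extremes.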