-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is the uniform histogram type wrong? #81
Comments
Oh, looks like this is how Vitter's Algorithm R works. So it is more a question of specification than one of implementation mistake. |
Thanks for checking @jlouis. @russelldb might have some thoughts on how this would effect slide_uniform. |
I need to look into it too. It seems @Vagabond made this change 2be6249#diff-b7a6cde361f08ae87401c3a98eb116c7 that changes the behaviour of the slide_uniform sample. |
Ah, it looks like @Vagabond actually fixes So is the plan to change the sample algorithm? |
The code as it works now does the right thing. A uniform histogram samples its reservoir over the entire set of inputs. So if you have a million inputs already, the chance of replacing into the sample reservoir is 1/1000. Which is what you want for the complete overview. Most of the time however, you want some kind of histogram window in which case the slide_uniform or exdec solutions are more appropriate. So it is really a questions of "what do you want". Not a question of what is correct or wrong. The reason I opened this was because I was getting unexpected data. But it turns out the data are correct and my expectations of what uniform does was wrong :) You can close this one. |
@russelldb I don't think we should change anything. But we should definitely consider explaining what the histogram types does in the README :) |
Folsom has moved, please resubmit your issue at https://github.com/folsom-project Thanks! |
Please take a look at the code at
https://github.com/boundary/folsom/blob/master/src/folsom_sample_uniform.erl#L50
which updates a sample uniformly in the histogram reservoir. The L46 clause is hit whenever we have fewer than 1028 samples and we insert a new sample in the table. Once we have 1028 samples, we look at N. Suppose N is 2056 since we have taken that many samples. We take a random value, which could be 1768 and then maybe update the reservoir. In half the cases, we won't be bumping the reservoir here, depending ont he random outcome.
I have reservoir's with N > 1000_1000_1000. They will almost never update the reservoir. Is this intended behavior of the uniform sample type? I am afraid some of the logic is wrong and we never ever replace entries in the reservoir for large N.
I could change to slide_uniform to fix this, but I want to make sure I understand how this is supposed to work.
The text was updated successfully, but these errors were encountered: