Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is the uniform histogram type wrong? #81

Open
jlouis opened this issue Jul 7, 2014 · 7 comments
Open

Is the uniform histogram type wrong? #81

jlouis opened this issue Jul 7, 2014 · 7 comments

Comments

@jlouis
Copy link
Contributor

jlouis commented Jul 7, 2014

Please take a look at the code at

https://github.com/boundary/folsom/blob/master/src/folsom_sample_uniform.erl#L50

which updates a sample uniformly in the histogram reservoir. The L46 clause is hit whenever we have fewer than 1028 samples and we insert a new sample in the table. Once we have 1028 samples, we look at N. Suppose N is 2056 since we have taken that many samples. We take a random value, which could be 1768 and then maybe update the reservoir. In half the cases, we won't be bumping the reservoir here, depending ont he random outcome.

I have reservoir's with N > 1000_1000_1000. They will almost never update the reservoir. Is this intended behavior of the uniform sample type? I am afraid some of the logic is wrong and we never ever replace entries in the reservoir for large N.

I could change to slide_uniform to fix this, but I want to make sure I understand how this is supposed to work.

@jlouis
Copy link
Contributor Author

jlouis commented Jul 7, 2014

Oh, looks like this is how Vitter's Algorithm R works. So it is more a question of specification than one of implementation mistake.

@joewilliams
Copy link
Contributor

Thanks for checking @jlouis. @russelldb might have some thoughts on how this would effect slide_uniform.

@russelldb
Copy link
Contributor

I need to look into it too. It seems @Vagabond made this change 2be6249#diff-b7a6cde361f08ae87401c3a98eb116c7 that changes the behaviour of the slide_uniform sample.

@russelldb
Copy link
Contributor

Ah, it looks like @Vagabond actually fixes slide_uniform as it was using Size not MCnt as the upper bound for the random number generation (i.e. a fixed, rather than growing value.)

So is the plan to change the sample algorithm?

@jlouis
Copy link
Contributor Author

jlouis commented Jul 9, 2014

The code as it works now does the right thing. A uniform histogram samples its reservoir over the entire set of inputs. So if you have a million inputs already, the chance of replacing into the sample reservoir is 1/1000. Which is what you want for the complete overview.

Most of the time however, you want some kind of histogram window in which case the slide_uniform or exdec solutions are more appropriate. So it is really a questions of "what do you want". Not a question of what is correct or wrong.

The reason I opened this was because I was getting unexpected data. But it turns out the data are correct and my expectations of what uniform does was wrong :) You can close this one.

@jlouis
Copy link
Contributor Author

jlouis commented Jul 9, 2014

@russelldb I don't think we should change anything. But we should definitely consider explaining what the histogram types does in the README :)

@joewilliams
Copy link
Contributor

Folsom has moved, please resubmit your issue at https://github.com/folsom-project Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants