Improve sample population selection for deterministic sampling #699

idreeskhan · 2024-02-01T15:24:41Z

Currently, BigSampler tends to undersample relative to the requested input ratio when performing a deterministic sample.

One avenue to explore is:

Once hashes are created, they are normalized to the [0.0, 1.0] range by boundLong. This function could potentially be updated or modified. One option is to use the upper and lower bounds of the input results instead, though this may be difficult to implement in practice. It could also be removed and replaced; I no longer remember the specifics of this implementation.
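For discussion, here is a minimal Python sketch of the normalize-then-compare step and a way to measure how far the observed sample fraction drifts from the requested ratio. It is not the BigSampler code: `bound_long` here is a hypothetical reimplementation (a plain linear map of the signed 64-bit range onto [0.0, 1.0]), `hash64` is a BLAKE2b stand-in rather than farmhash, and `keep` and `observed_ratio` are illustrative names.

```python
import hashlib

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def hash64(key: str) -> int:
    """Stand-in signed 64-bit hash (BLAKE2b truncated to 8 bytes), NOT farmhash."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def bound_long(h: int) -> float:
    """Hypothetical normalization: map the signed 64-bit range onto [0.0, 1.0]."""
    return (h - INT64_MIN) / (INT64_MAX - INT64_MIN)


def keep(key: str, ratio: float) -> bool:
    """Deterministic decision: the same key always gives the same answer."""
    return bound_long(hash64(key)) < ratio


def observed_ratio(keys, ratio: float) -> float:
    """Fraction of keys actually kept, to compare against the requested ratio."""
    keys = list(keys)
    kept = sum(keep(k, ratio) for k in keys)
    return kept / len(keys)
```

Running `observed_ratio` over a representative set of real keys with the production hash and normalization would show whether the shortfall comes from the normalization step or from the hash itself.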

Another path, instead of or in addition to this, is:

We primarily use farmhash, which is not a cryptographic hash function. Is its output sufficiently uniform in distribution? If not, now that additional hash functions are available within BigQuery, is there another function with a more appropriate output distribution?
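One way to approach the uniformity question is a simple bucketed chi-square check on the normalized hash values. The Python sketch below is only an illustration: `normalized_hash` uses BLAKE2b as a stand-in (swap in the hash under test), and `chi_square_uniform` and the bucket count of 10 are assumptions made here, not part of the existing code.

```python
import hashlib


def normalized_hash(key: str) -> float:
    """Stand-in hash mapped onto [0.0, 1.0]; BLAKE2b here, NOT farmhash."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def chi_square_uniform(values, buckets: int = 10) -> float:
    """Pearson chi-square statistic of values in [0, 1] against a uniform distribution."""
    counts = [0] * buckets
    n = 0
    for v in values:
        # Clamp so that a value of exactly 1.0 falls into the last bucket.
        counts[min(int(v * buckets), buckets - 1)] += 1
        n += 1
    expected = n / buckets
    return sum((c - expected) ** 2 / expected for c in counts)


stat = chi_square_uniform(normalized_hash(f"row-{i}") for i in range(50_000))
print(f"chi-square statistic over 10 buckets: {stat:.2f}")
```

With 10 buckets there are 9 degrees of freedom, so a statistic well above roughly 16.9 (the 5% critical value) on realistic keys would suggest the hash output is not uniform enough for ratio-based sampling. Synthetic keys such as `row-{i}` are only a smoke test; real key columns are the meaningful input.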

idreeskhan added the bug and enhancement labels Feb 1, 2024

idreeskhan changed the title ~~Improve sample population selection for deterministic hashing~~ Improve sample population selection for deterministic sampling Feb 1, 2024

idreeskhan mentioned this issue Feb 9, 2024

Add docs for reproducing sample from BigQuery #700

Merged
