Currently, BigSampler tends to undersample relative to the requested input ratio when performing a deterministic sample.
One avenue to explore is:
Once hashes are created, they are normalized to the range [0.0, 1.0] by boundLong. This function could potentially be updated or modified. One option is to use the upper and lower bounds of the input results instead, though this may be difficult to implement in practice. It could also be removed and replaced; the specifics of this implementation are lost to time and I no longer remember them.
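For discussion, here is a minimal Python sketch of the kind of normalization described above. It is not the actual boundLong code: `bound_long`, `stand_in_hash`, `keep`, and `realized_ratio` are hypothetical names, the linear min/max mapping is an assumption about what boundLong does, and the first 8 bytes of SHA-256 stand in for farmhash. Comparing the realized ratio against the requested one is one way to quantify the undersampling.

```python
import hashlib

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def bound_long(value: int) -> float:
    """Map a signed 64-bit integer linearly onto [0.0, 1.0].

    Assumption: this mirrors the intent of boundLong, not its actual code.
    """
    return (value - INT64_MIN) / (INT64_MAX - INT64_MIN)


def stand_in_hash(key: str) -> int:
    """Hypothetical stand-in for farmhash: first 8 bytes of SHA-256 as int64."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def keep(key: str, ratio: float) -> bool:
    """Deterministic decision: the same key and ratio always give the same answer."""
    return bound_long(stand_in_hash(key)) < ratio


def realized_ratio(keys, ratio: float) -> float:
    """Fraction of keys actually kept, to compare against the requested ratio."""
    keys = list(keys)
    kept = sum(keep(k, ratio) for k in keys)
    return kept / len(keys)


if __name__ == "__main__":
    sample_keys = [f"user-{i}" for i in range(20000)]
    for requested in (0.01, 0.1, 0.5):
        print(requested, realized_ratio(sample_keys, requested))
```

If the realized ratio sits consistently below the requested one with the real hash and the real boundLong, that would point at the normalization or the hash distribution rather than at random variation.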
Another path, instead of or in addition to this, is:
We primarily use farmhash, which is not a cryptographic hash function. Is its output sufficiently uniform in distribution? If not, now that additional hashes are available within BigQuery, is there another function with a more appropriate output distribution?
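One possible way to answer the uniformity question is to bucket the normalized hash values and compute a chi-square statistic against a flat distribution. The sketch below is only an illustration: `normalized_hash` and `chi_square_uniformity` are hypothetical names, and SHA-256 is used as a stand-in because farmhash is not in the Python standard library. To test farmhash itself, the real hash output would need to be fed in instead.

```python
import hashlib


def normalized_hash(key: str) -> float:
    """Hypothetical stand-in: unsigned 64-bit prefix of SHA-256 scaled to [0, 1]."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big") / 2**64


def chi_square_uniformity(values, buckets: int = 20):
    """Return (chi-square statistic, per-bucket counts) for values in [0, 1].

    For a uniform source the statistic should be close to buckets - 1;
    a much larger value suggests a skewed distribution.
    """
    counts = [0] * buckets
    total = 0
    for v in values:
        # Clamp so that a value of exactly 1.0 lands in the last bucket.
        idx = min(int(v * buckets), buckets - 1)
        counts[idx] += 1
        total += 1
    expected = total / buckets
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, counts


if __name__ == "__main__":
    keys = (f"record-{i}" for i in range(50000))
    stat, counts = chi_square_uniformity(normalized_hash(k) for k in keys)
    print("chi-square:", stat)
    print("counts:", counts)
```

Running the same check on real keys (rather than synthetic ones) matters, since a hash can look uniform on sequential test strings and still behave differently on production key shapes.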
idreeskhan changed the title from "Improve sample population selection for deterministic hashing" to "Improve sample population selection for deterministic sampling" on Feb 1, 2024