-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs #13807
Fix Byte-Pair-Encoding usage of cuco static-map for storing merge-pairs #13807
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code should build properly with the proposed changes. The problem with double_hashing
is that we expect users to provide a hasher that can be constructed from an integer (or a seed
). In that way, we can minimize the chance of secondary collisions during probing but now I realize it's a too-restrictive requirement. By all means, for a single map with 50% occupancy, linear_probing
would probably outperform double_hashing
all the time.
I'm looking into other issues (e.g. (should be solved in NVIDIA/cuCollections#345) |
This PR fixes issues and adds new features requested by rapidsai/cudf#13807. It: - removes the requirement of the second hasher from double hashing must be constructible from an integer - fixes an issue in map iterator `!=` operator - overloads map iterator access operator - allows zero capacity container - adds `capacity_test` back since several corner cases need to be exercised
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake changes to be reverted once rapidsai/rapids-cmake#452 is merged.
This PR fetches the latest cuco to unblock rapidsai/cudf#13807. Authors: - Yunsong Wang (https://github.com/PointKernel) - Robert Maynard (https://github.com/robertmaynard) Approvers: - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) - Robert Maynard (https://github.com/robertmaynard) URL: #452
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! 🔥
Heterogeneous lookup in libcudf 🔥
…fix-bpe-static-map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice. Heterogeneous lookup is cool. Can you comment on the performance impact of this change?
/merge |
…rs (rapidsai#13807) Switching to use `cuco::experimental::static_map` for storing the unique merge-pair strings that can be looked up by `string_view`. This takes advantage of a feature of the `static_map` that allows storing with one key (index to a string entry) and lookup with a different type (string). The map uses a hash on the string for storing the index but allows lookup by string since the hash of string can resolve the entry and duplicates can be resolved by comparing the string with row entries. Authors: - David Wendt (https://github.com/davidwendt) - Yunsong Wang (https://github.com/PointKernel) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: rapidsai#13807
Description
Switching to use
cuco::experimental::static_map
for storing the unique merge-pair strings that can be looked up bystring_view
. This takes advantage of a feature of thestatic_map
that allows storing with one key (index to a string entry) and lookup with a different type (string). The map uses a hash on the string for storing the index but allows lookup by string since the hash of string can resolve the entry and duplicates can be resolved by comparing the string with row entries.Checklist