-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dataset parser for vector search #424
Add dataset parser for vector search #424
Conversation
b839733
to
4753b8f
Compare
VectorSearch uses file formats like hdf5 and bigann. We need library to parse corresponding formats. Signed-off-by: Vijayan Balasubramanian <[email protected]>
4753b8f
to
c0cbbf2
Compare
I tested this with my next change where i actually perform vector search using new param source that accepts hdf5 and u8bin dataset. |
4e18633
to
86a4617
Compare
osbenchmark/utils/dataset.py
Outdated
if end_offset > self.size(): | ||
end_offset = self.size() | ||
|
||
v = cast(np.ndarray, self.data[self.current:end_offset]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: what does v
represent? Is there a more descriptive term that you can be used instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: v
represents vector
. Can we update the variable to be vector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
raise ConfigurationError("Value must be a string for param {}".format(key)) | ||
|
||
|
||
def parse_int_parameter(key: str, params: dict, default: int = None) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some redundancy in these three functions and I think we can simplify it. Assuming OSB will need to check the type before calling one of the three functions, maybe we can condense these three functions into a single generic function where default
parameter is dynamically typed and treated as Any
, such as the following:
def parse_parameter(key: str, params:dict, default=None):
valid_parameter_types = [str, int, float]
if key not in params:
if default:
return default
else:
raise ConfigurationError(
f"Value cannot be None for param {key}"
)
for parameter_type in valid_parameter_types:
if isinstance(params[key], parameter_type):
return params[key]
formatted_parameter_types = ", ".join([ parameter_type.__name__ for parameter_type in valid_parameter_types])
raise ConfigurationError(f"Value must be one of the followings for param {formatted_parameter_types}")
I've tested this against your test cases and they all pass as well. Let me know your thoughts @VijayanB ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we pass integer value to string parameter? The above looks complicated for simple function. I don't see any advantage. However, if you have strong opinion, i can refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we pass integer value to string parameter?
Are you referring to passing an integer value to the string parameter key
? The code above is based off of the three functions (parse_str_parameter()
, parse_float_parameter()
, and parse_int_parameter()
) and how they were used in the unittests in parse_test.py
. In parse_test.py
, it doesn't seem like there's a scenario where an integer value is passed to key
since only values such as "int-value"
, "string-value"
, "default-value"
, "float-value"
are provided. Based off these unittests, I was under the assumption that OSB must first identify the parameter's data-type before calling any of these functions. Let me know if I'm misunderstanding this.
That being said, I don't have a strong preference and if it's more simple to use these three functions, we can keep them as is 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments
86a4617
to
46d2297
Compare
46d2297
to
f7ae374
Compare
Added Hdf5, Bigann dataset parser. Added test cases for dataset and parser. Signed-off-by: Vijayan Balasubramanian <[email protected]>
f7ae374
to
5113c39
Compare
@@ -110,3 +110,17 @@ class WorkloadConfigError(BenchmarkError): | |||
|
|||
class NotFound(BenchmarkError): | |||
pass | |||
|
|||
|
|||
class InvalidExtensionException(BenchmarkError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Description
Vector Search Workloads usually uses dataset of type hdf5 and bigann. Hence, added support
to parse, partition those file formats. This will be later used by VectorSearch Param source in next pr.
Issues Resolved
part of #103
Testing
[Describe how this change was tested]
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.