Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dataset parser for vector search #424

Merged
merged 2 commits into from
Dec 21, 2023

Conversation

VijayanB
Copy link
Member

Description

Vector Search Workloads usually uses dataset of type hdf5 and bigann. Hence, added support
to parse, partition those file formats. This will be later used by VectorSearch Param source in next pr.

Issues Resolved

part of #103

Testing

  • New functionality includes testing

[Describe how this change was tested]

make test
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_float_parameter_default PASSED [ 55%]
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_float_parameter_from_params PASSED [ 55%]
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_int_parameter_default PASSED [ 56%]
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_int_parameter_from_params PASSED [ 56%]
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_string_parameter_default PASSED [ 56%]
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_string_parameter_from_params PASSED [ 56%]

tests/utils/dataset_test.py::DataSetTestCase::testBigANNAsAcceptableDataSetFormatWithFloatExtension PASSED [ 47%]
tests/utils/dataset_test.py::DataSetTestCase::testHDF5AsAcceptableDataSetFormat PASSED [ 47%]
tests/utils/dataset_test.py::DataSetTestCase::testUnSupportedDataSetFormat PASSED [ 47%]

================= 1200 passed, 5 skipped, 3 warnings in 15.22s =================

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@VijayanB VijayanB force-pushed the add-dataset-parser branch 3 times, most recently from b839733 to 4753b8f Compare December 18, 2023 22:35
@VijayanB VijayanB changed the title Add dataset parser Add dataset parser for vector search Dec 18, 2023
VectorSearch uses file formats like hdf5 and bigann.
We need library to parse corresponding formats.

Signed-off-by: Vijayan Balasubramanian <[email protected]>
@VijayanB
Copy link
Member Author

I tested this with my next change where i actually perform vector search using new param source that accepts hdf5 and u8bin dataset.

osbenchmark/utils/dataset.py Outdated Show resolved Hide resolved
osbenchmark/utils/dataset.py Outdated Show resolved Hide resolved
osbenchmark/utils/dataset.py Show resolved Hide resolved
@VijayanB VijayanB force-pushed the add-dataset-parser branch 2 times, most recently from 4e18633 to 86a4617 Compare December 19, 2023 04:33
@VijayanB VijayanB requested a review from navneet1v December 19, 2023 04:34
@VijayanB VijayanB mentioned this pull request Dec 20, 2023
1 task
if end_offset > self.size():
end_offset = self.size()

v = cast(np.ndarray, self.data[self.current:end_offset])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: what does v represent? Is there a more descriptive term that you can be used instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: v represents vector. Can we update the variable to be vector?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

raise ConfigurationError("Value must be a string for param {}".format(key))


def parse_int_parameter(key: str, params: dict, default: int = None) -> int:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some redundancy in these three functions and I think we can simplify it. Assuming OSB will need to check the type before calling one of the three functions, maybe we can condense these three functions into a single generic function where default parameter is dynamically typed and treated as Any, such as the following:

def parse_parameter(key: str, params:dict, default=None):
    valid_parameter_types = [str, int, float]

    if key not in params:
        if default:
            return default
        else:
            raise ConfigurationError(
                f"Value cannot be None for param {key}"
            )

    for parameter_type in valid_parameter_types:
        if isinstance(params[key], parameter_type):
            return params[key]

    formatted_parameter_types = ", ".join([ parameter_type.__name__ for parameter_type in valid_parameter_types])

    raise ConfigurationError(f"Value must be one of the followings for param {formatted_parameter_types}")

I've tested this against your test cases and they all pass as well. Let me know your thoughts @VijayanB ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we pass integer value to string parameter? The above looks complicated for simple function. I don't see any advantage. However, if you have strong opinion, i can refactor.

Copy link
Collaborator

@IanHoang IanHoang Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we pass integer value to string parameter?

Are you referring to passing an integer value to the string parameter key? The code above is based off of the three functions (parse_str_parameter(), parse_float_parameter(), and parse_int_parameter()) and how they were used in the unittests in parse_test.py. In parse_test.py, it doesn't seem like there's a scenario where an integer value is passed to key since only values such as "int-value", "string-value", "default-value", "float-value" are provided. Based off these unittests, I was under the assumption that OSB must first identify the parameter's data-type before calling any of these functions. Let me know if I'm misunderstanding this.

That being said, I don't have a strong preference and if it's more simple to use these three functions, we can keep them as is 👍🏻

Copy link
Collaborator

@IanHoang IanHoang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments

Added Hdf5, Bigann dataset parser.
Added test cases for dataset and parser.

Signed-off-by: Vijayan Balasubramanian <[email protected]>
@@ -110,3 +110,17 @@ class WorkloadConfigError(BenchmarkError):

class NotFound(BenchmarkError):
pass


class InvalidExtensionException(BenchmarkError):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!

Copy link
Collaborator

@IanHoang IanHoang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@IanHoang IanHoang merged commit a9f7c60 into opensearch-project:main Dec 21, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants