Add dataset parser for vector search #424

VijayanB · 2023-12-18T21:54:02Z

Description

Vector Search Workloads usually uses dataset of type hdf5 and bigann. Hence, added support
to parse, partition those file formats. This will be later used by VectorSearch Param source in next pr.

Issues Resolved

part of #103

Testing

New functionality includes testing

[Describe how this change was tested]

make test
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_float_parameter_default PASSED [ 55%]
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_float_parameter_from_params PASSED [ 55%]
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_int_parameter_default PASSED [ 56%]
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_int_parameter_from_params PASSED [ 56%]
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_string_parameter_default PASSED [ 56%]
tests/utils/parse_test.py::ParseParamsFunctionalTests::test_parse_string_parameter_from_params PASSED [ 56%]

tests/utils/dataset_test.py::DataSetTestCase::testBigANNAsAcceptableDataSetFormatWithFloatExtension PASSED [ 47%]
tests/utils/dataset_test.py::DataSetTestCase::testHDF5AsAcceptableDataSetFormat PASSED [ 47%]
tests/utils/dataset_test.py::DataSetTestCase::testUnSupportedDataSetFormat PASSED [ 47%]

================= 1200 passed, 5 skipped, 3 warnings in 15.22s =================

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

VectorSearch uses file formats like hdf5 and bigann. We need library to parse corresponding formats. Signed-off-by: Vijayan Balasubramanian <[email protected]>

VijayanB · 2023-12-19T01:32:22Z

I tested this with my next change where i actually perform vector search using new param source that accepts hdf5 and u8bin dataset.

osbenchmark/utils/dataset.py

IanHoang · 2023-12-20T17:09:04Z

osbenchmark/utils/dataset.py

+        if end_offset > self.size():
+            end_offset = self.size()
+
+        v = cast(np.ndarray, self.data[self.current:end_offset])


Nit: what does v represent? Is there a more descriptive term that you can be used instead?

Update: v represents vector. Can we update the variable to be vector?

osbenchmark/utils/dataset.py

osbenchmark/utils/parse.py

IanHoang · 2023-12-20T18:58:02Z

osbenchmark/utils/parse.py

+    raise ConfigurationError("Value must be a string for param {}".format(key))
+
+
+def parse_int_parameter(key: str, params: dict, default: int = None) -> int:


There's some redundancy in these three functions and I think we can simplify it. Assuming OSB will need to check the type before calling one of the three functions, maybe we can condense these three functions into a single generic function where default parameter is dynamically typed and treated as Any, such as the following:

def parse_parameter(key: str, params:dict, default=None): valid_parameter_types = [str, int, float] if key not in params: if default: return default else: raise ConfigurationError( f"Value cannot be None for param {key}" ) for parameter_type in valid_parameter_types: if isinstance(params[key], parameter_type): return params[key] formatted_parameter_types = ", ".join([ parameter_type.__name__ for parameter_type in valid_parameter_types]) raise ConfigurationError(f"Value must be one of the followings for param {formatted_parameter_types}")

I've tested this against your test cases and they all pass as well. Let me know your thoughts @VijayanB ?

What happens if we pass integer value to string parameter? The above looks complicated for simple function. I don't see any advantage. However, if you have strong opinion, i can refactor.

What happens if we pass integer value to string parameter?

Are you referring to passing an integer value to the string parameter key? The code above is based off of the three functions (parse_str_parameter(), parse_float_parameter(), and parse_int_parameter()) and how they were used in the unittests in parse_test.py. In parse_test.py, it doesn't seem like there's a scenario where an integer value is passed to key since only values such as "int-value", "string-value", "default-value", "float-value" are provided. Based off these unittests, I was under the assumption that OSB must first identify the parameter's data-type before calling any of these functions. Let me know if I'm misunderstanding this.

That being said, I don't have a strong preference and if it's more simple to use these three functions, we can keep them as is 👍🏻

IanHoang

Left some comments

Added Hdf5, Bigann dataset parser. Added test cases for dataset and parser. Signed-off-by: Vijayan Balasubramanian <[email protected]>

IanHoang · 2023-12-21T15:55:34Z

osbenchmark/exceptions.py

@@ -110,3 +110,17 @@ class WorkloadConfigError(BenchmarkError):

 class NotFound(BenchmarkError):
    pass
+
+
+class InvalidExtensionException(BenchmarkError):


Thanks for doing this!

IanHoang

LGTM!

VijayanB requested review from IanHoang and gkamat as code owners December 18, 2023 21:54

VijayanB force-pushed the add-dataset-parser branch 3 times, most recently from b839733 to 4753b8f Compare December 18, 2023 22:35

VijayanB changed the title ~~Add dataset parser~~ Add dataset parser for vector search Dec 18, 2023

Add dependency to setup file

f5752b3

VectorSearch uses file formats like hdf5 and bigann. We need library to parse corresponding formats. Signed-off-by: Vijayan Balasubramanian <[email protected]>

VijayanB force-pushed the add-dataset-parser branch from 4753b8f to c0cbbf2 Compare December 19, 2023 00:06

jmazanec15 approved these changes Dec 19, 2023

View reviewed changes

navneet1v reviewed Dec 19, 2023

View reviewed changes

osbenchmark/utils/dataset.py Outdated Show resolved Hide resolved

osbenchmark/utils/dataset.py Outdated Show resolved Hide resolved

osbenchmark/utils/dataset.py Show resolved Hide resolved

VijayanB force-pushed the add-dataset-parser branch 2 times, most recently from 4e18633 to 86a4617 Compare December 19, 2023 04:33

VijayanB requested a review from navneet1v December 19, 2023 04:34

navneet1v approved these changes Dec 19, 2023

View reviewed changes

VijayanB mentioned this pull request Dec 20, 2023

Add vector search param source #425

Merged

1 task