[FEATURE] Softer validation of corpora workload parameters for vectorsearch benchmark #572

finnroblin · 2024-06-25T20:53:07Z

Is your feature request related to a problem? Please describe

Currently, each workload corpus requires a target-index parameter when there are multiple indices. However, the vector search (VS) bulk ingest operation (bulk-vector-data-set) does not use this target index when ingesting data. Instead, users specify the target index as a parameter of the custom bulk ingest operation in their test procedure.

I'm opening this issue because the target-index parameter is required at workload validation time even though vector search workloads never use it. As a result, the VS workload.json must contain unused parameters.

Describe the solution you'd like

One solution is to make the target-index corpora parameter optional at validation time. Another possibility is to enforce that either the parameter is specified in the corpora in workload.json or bulk-vector-data-set is used in a test procedure.

Describe alternatives you've considered

I don't have all the context for why there are two ways of specifying the target index for ingesting data, but I believe it's because vector datasets are in hdf5 format while the normal bulk operation requires JSON documents.

Additional context

There is another issue stemming from VS ingestion differing from normal ingestion: Issue 317 in the workloads repo requests a VS feature that's already available in non-VS workloads. I think the lack of feature parity is due to the vector-bulk operation being different from the normal bulk operation.


IanHoang · 2024-06-27T18:02:40Z

Thanks for bringing this up @finnroblin. I agree, it'd be convenient not to specify unused parameters. Of the two approaches suggested, I prefer the second because target-index is still a requirement for workloads other than Vector Search.

One way we can work around this pain point is to create the test procedures before the corpora in loader.py:

# Switch these in loader.py so that test_procedures is created first. 
  corpora = self._create_corpora(self._r(workload_specification, "corpora", mandatory=False, default_value=[]),
                                 indices, data_streams)
  test_procedures = self._create_test_procedures(workload_specification)

We can then pass the test_procedures object to self._create_corpora() so that it has access to the operations / schedule and can check whether a bulk-vector-data-set operation exists in the list. If so, target-index is not necessary.
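
To make the proposed check concrete, here is a minimal, self-contained sketch of the validation logic. It is not the actual loader.py code: it works on plain dicts rather than OSB's workload objects, assumes operations are inlined in the schedule, and the function names uses_vector_bulk and validate_corpora are hypothetical.

```python
VECTOR_BULK_OP = "bulk-vector-data-set"


def uses_vector_bulk(test_procedures):
    """Return True if any scheduled task is a vector bulk ingest operation.

    Assumption: each test procedure is a dict with a "schedule" list whose
    tasks carry an inlined "operation" dict with an "operation-type" key.
    """
    for procedure in test_procedures:
        for task in procedure.get("schedule", []):
            operation = task.get("operation")
            if isinstance(operation, dict) and operation.get("operation-type") == VECTOR_BULK_OP:
                return True
    return False


def validate_corpora(corpora, indices, test_procedures):
    """Return a list of error messages; an empty list means the corpora pass."""
    # target-index only disambiguates between multiple indices, and only the
    # regular bulk operation reads it, so skip the check in the other cases.
    if len(indices) <= 1 or uses_vector_bulk(test_procedures):
        return []
    errors = []
    for corpus in corpora:
        for doc in corpus.get("documents", []):
            if "target-index" not in doc:
                errors.append(
                    f"corpus '{corpus.get('name')}' must define target-index "
                    f"because the workload has {len(indices)} indices"
                )
    return errors
```

With this shape, a workload whose schedule contains a bulk-vector-data-set task passes validation without target-index, while any other multi-index workload is still rejected, which matches the second approach discussed above.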

If you (or anyone else) think of any cleaner approaches, feel free to propose and implement them.

finnroblin added the enhancement and untriaged labels Jun 25, 2024

finnroblin mentioned this issue Jun 25, 2024

Add vectorsearch training workload opensearch-project/opensearch-benchmark-workloads#333

Merged

IanHoang removed the untriaged label Jun 27, 2024
