Add reindex ITs #446
Conversation
Signed-off-by: zane-neo <[email protected]>
Codecov Report
@@            Coverage Diff            @@
##               main     #446   +/-   ##
=========================================
  Coverage     84.37%   84.37%
  Complexity      498      498
=========================================
  Files            40       40
  Lines          1491     1491
  Branches        228      228
=========================================
  Hits           1258     1258
  Misses          133      133
  Partials        100      100
Signed-off-by: zane-neo <[email protected]>
@zane-neo can we add reindex ITs for the other processors too?
@zane-neo any update on this PR?
The issue in ml-commons is not specific to an individual processor; it's more of a generic issue (user info not restored after the listener ran), so I thought covering it with one or two processors is enough.
I don't agree with this. It would be good to have reindex ITs for at least all the processors that work with local models, like splade and text embedding. The text + image processor works on a remote model, so we cannot integrate it here.
This PR already includes the splade and text embedding processors.
Using the text_image_embedding processor with only a text field should work with a local model. The processor uses mlClient to call the model, and the model type is abstracted by the client.
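For illustration, a minimal sketch (not code from this PR) of registering a text-only text_image_embedding pipeline via the low-level REST client. It assumes the call is made from an IT class extending OpenSearchRestTestCase (so client() and org.opensearch.client.Request are available) and that the processor accepts its documented model_id, embedding, and field_map parameters; the pipeline name, model id, and field name are placeholders.

```java
// Hypothetical helper inside an IT class extending OpenSearchRestTestCase.
// Only the "text" key of field_map is set, so the processor generates
// embeddings from text alone via the locally deployed model.
private void createTextOnlyTextImagePipeline(String modelId) throws Exception {
    Request pipeline = new Request("PUT", "/_ingest/pipeline/text-image-pipeline");
    pipeline.setJsonEntity("{\n"
        + "  \"processors\": [{\n"
        + "    \"text_image_embedding\": {\n"
        + "      \"model_id\": \"" + modelId + "\",\n"
        + "      \"embedding\": \"vector_embedding\",\n"
        + "      \"field_map\": { \"text\": \"passage_text\" }\n"
        + "    }\n"
        + "  }]\n"
        + "}");
    client().performRequest(pipeline);
}
```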
It's a text embedding case, right?
I think it's not. The embeddings are text-only, that's correct, but the processor is text_image. That processor can work with text, image, and text + image. If we want to test that reindexing works with all processors, it would not be correct to test with text_embedding and assume that the text_image_embedding processor is also covered.
The current tests cover local models, which proves the correctness of the ml-commons code; no need to add other cases, IMO.
This is not the sole purpose of these tests. If the purpose were only the correctness of the ML code, the tests should live in MLCommons. We want to cover as many use cases as possible in which an ingest processor can be used, and reindexing is one of those use cases. I know we started this PR to make sure the bug we got in #386 can be caught in the future, but that is not the sole purpose of the tests.
There are quite a lot of different flows/features in OpenSearch, e.g. shrink index, split index, restore index, etc. Do we have a way to cover all of these when implementing a new feature, to ensure they're not breaking? Is covering only reindex enough?
Given the issue at hand, we should start with reindex and then add the cases you mentioned after this. We can cut a GitHub issue to make sure we cover those cases too.
Is there already a generic approach to address this? Something that can be used across all plugins, so it doesn't need to be implemented in each plugin separately?
As per my understanding, there is no such thing. The main point I was trying to make is that you are adding ITs only for the sparse and text embedding processors. There are other processors in the plugin too, so it would be better if you can add reindex ITs for those processors as well.
I will do more testing on the open-source side to reproduce this issue, and will reopen this or create a new PR. Closing this for now.
Description
Add reindex ITs in neural-search. A hedged sketch of what such a test could look like is shown below.
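The following is a minimal sketch, assuming a test class based on OpenSearchRestTestCase and the text_embedding processor's documented model_id/field_map parameters; the pipeline name, index names, and model id are illustrative placeholders, not the fixtures actually added in this PR.

```java
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.test.rest.OpenSearchRestTestCase;

public class ReindexWithTextEmbeddingIT extends OpenSearchRestTestCase {

    public void testReindexAppliesIngestPipeline() throws Exception {
        // Create an ingest pipeline running the text_embedding processor
        // against a locally deployed model (model id is a placeholder).
        Request pipeline = new Request("PUT", "/_ingest/pipeline/text-embedding-pipeline");
        pipeline.setJsonEntity("{\n"
            + "  \"processors\": [{\n"
            + "    \"text_embedding\": {\n"
            + "      \"model_id\": \"<local-model-id>\",\n"
            + "      \"field_map\": { \"passage_text\": \"passage_embedding\" }\n"
            + "    }\n"
            + "  }]\n"
            + "}");
        client().performRequest(pipeline);

        // Target index uses the pipeline as its default ingest pipeline.
        Request target = new Request("PUT", "/target-index");
        target.setJsonEntity("{ \"settings\": { \"index.default_pipeline\": \"text-embedding-pipeline\" } }");
        client().performRequest(target);

        // Seed the source index with a plain document (no embedding yet).
        Request doc = new Request("PUT", "/source-index/_doc/1?refresh=true");
        doc.setJsonEntity("{ \"passage_text\": \"hello world\" }");
        client().performRequest(doc);

        // Reindex source into target; the pipeline should generate embeddings.
        Request reindex = new Request("POST", "/_reindex?refresh=true");
        reindex.setJsonEntity("{ \"source\": { \"index\": \"source-index\" }, \"dest\": { \"index\": \"target-index\" } }");
        Response response = client().performRequest(reindex);
        assertEquals(200, response.getStatusLine().getStatusCode());

        // A complete test would fetch target-index/_doc/1 and assert that
        // passage_embedding is present and non-empty after the reindex.
    }
}
```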
Issues Resolved
Issue: #386
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.