
[FEATURE] Add z-score for the normalization processor #376 #468

Closed
wants to merge 5 commits into from

Conversation

sam-herman

Description

This change implements #376

  • Add z-score for the hybrid query normalization processor
  • Add an IT that tests normalization end to end

Issues Resolved

Resolving #376

Check List

  • [x] New functionality includes testing.
    • [x] All tests pass
  • [x] New functionality has been documented.
    • [x] New functionality has javadoc added
  • [x] Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Samuel Herman <[email protected]>
@navneet1v
Collaborator

@samuel-oci thanks for creating the PR. But given that this is a new feature, the recommendation is to put this code in a feature branch, to ensure that the main branch is not blocked.

I will go ahead and create a feature branch for this change.

As you have already written the full code, it's a good time to start doing the performance testing and search relevancy testing for this feature.

@navneet1v
Collaborator

Created a feature branch from main, for this feature: feature/z-score-normalization

Please raise the PR against that branch. Also, can you add an entry in the CHANGELOG.md file for this change?

@@ -52,6 +52,7 @@ public void execute(
final ScoreNormalizationTechnique normalizationTechnique,
final ScoreCombinationTechnique combinationTechnique
) {
log.info("Entering normalization processor workflow");
Collaborator

we can remove this.

@@ -26,8 +26,6 @@
@Log4j2
public class ScoreCombiner {

private static final Float ZERO_SCORE = 0.0f;
Collaborator

any reason why we are removing this?

Author

It's not in use anywhere in the code.

*/
/*
TODO: Some todo items that apply here but also on the original normalization techniques on which it is modeled {@link L2ScoreNormalizationTechnique} and {@link MinMaxScoreNormalizationTechnique}
1. Random access to abstract list object is a bad practice both stylistically and from performance perspective and should be removed
Collaborator

Can you please provide an alternative what should be used?

As per my understanding, random access on a List is bad if the List's concrete implementation is LinkedList. But what I have generally seen is that we use ArrayList, which is backed by an array, hence random access is done in constant time.

Member

It should be fine if we know the exact implementation of the List, as Navneet mentioned. But with a List we can use a functional style more easily, without an expensive array -> stream conversion; that was the reason why we switched to a List.

Author

Usually it is highly discouraged to call List.get() on an abstract List reference, because it could be an implementation that doesn't support random access efficiently (e.g. LinkedList). The suggested alternative is to explicitly declare it as ArrayList throughout the hot paths that require random access.

Member

I'm ok to switch from the general List to ArrayList; that still works with the stream API and keeps our requirements on caller code cleaner. I expect that change will affect a lot of classes, thus I prefer to see it as a separate refactoring PR.

Author

@martin-gaievski same here, I added the comment with the intention of proposing it as a separate refactoring PR.
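To make the trade-off in this thread concrete, here is a minimal standalone sketch (class and method names are hypothetical, not part of the plugin): `List.get(i)` is constant-time only for implementations that mark themselves with `java.util.RandomAccess`, such as ArrayList, and linear for LinkedList, so one defensive option on a hot path is to copy once when needed.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.RandomAccess;

public class RandomAccessSketch {
    // List.get(i) is O(1) only for RandomAccess implementations (e.g. ArrayList);
    // for LinkedList it is O(n). Copy once when constant-time access is not guaranteed.
    static <T> List<T> ensureRandomAccess(final List<T> list) {
        return list instanceof RandomAccess ? list : new ArrayList<>(list);
    }

    public static void main(String[] args) {
        List<Integer> linked = new LinkedList<>(List.of(1, 2, 3));
        List<Integer> fast = ensureRandomAccess(linked);
        System.out.println(fast.get(2)); // prints 3, in constant time after the copy
    }
}
```

Declaring ArrayList explicitly, as suggested above, avoids even the one-time copy; the `RandomAccess` check is the variant that keeps the general List signature.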

/*
TODO: Some todo items that apply here but also on the original normalization techniques on which it is modeled {@link L2ScoreNormalizationTechnique} and {@link MinMaxScoreNormalizationTechnique}
1. Random access to abstract list object is a bad practice both stylistically and from performance perspective and should be removed
2. Identical sub queries and their distribution between shards is currently completely implicit based on ordering and should be explicit based on identifier
Collaborator

@navneet1v Oct 19, 2023

This is really a good thought, but the problem is that none of the query clauses in OpenSearch supports identifiers. This was discussed during the implementation. The problem is the way results are returned after the QueryPhase: they come back in a ScoreDocs array, which doesn't support identifiers.

We could work around that, but it would require changes to the interface of OpenSearch core. Hence we decided against it, to make sure that we stay compatible with OpenSearch core.

If there is an alternative supported in OpenSearch, please let us know; maybe we are missing something.

Author

sounds good @navneet1v, I will give it some thought and come up with a suggestion. In any case I'm not planning to do it as part of this change. We can keep the TODO for now, and I can suggest a refactor later, or just remove it if it's not achievable.

TODO: Some todo items that apply here but also on the original normalization techniques on which it is modeled {@link L2ScoreNormalizationTechnique} and {@link MinMaxScoreNormalizationTechnique}
1. Random access to abstract list object is a bad practice both stylistically and from performance perspective and should be removed
2. Identical sub queries and their distribution between shards is currently completely implicit based on ordering and should be explicit based on identifier
3. Weird calculation of numOfSubQueries instead of having a more explicit indicator
Collaborator

same as above.

Comment on lines 36 to 37
// why are we doing that? is List<CompoundTopDocs> the list of subqueries for a single shard? or a global list of all subqueries across shards?
// If a subquery comes from each shard then when is it combined? that seems weird that combination will do combination of normalized results that each is normalized just based on shard level result
Collaborator

Lets talk about these on the github issue and not on the PR.

Author

ack, I think this comment is no longer relevant. I put it there and forgot to remove it, so feel free to ignore this one.

Comment on lines 123 to 125
//TODO: make this better, currently
// this is a horrible implementation in particular when it comes to the topDocsPerSubQuery.get(j)
// which does a random search on an abstract list type.
Collaborator

Please provide reason why this is bad and how it can be improved.

Collaborator

Also, let's avoid using subjective words like 'horrible'.

Author

@heemin32 my apologies, will avoid it in the future.

public static final String TECHNIQUE_NAME = "z_score";
private static final float SINGLE_RESULT_SCORE = 1.0f;
@Override
public void normalize(List<CompoundTopDocs> queryTopDocs) {
Member

please make all args of all public methods final

.findAny()
.get()
.getTopDocs()
.size();
Member

Can you please add more checks for nulls and empty objects? I think we're assuming a lot, e.g. that topDocs.getTopDocs() is not null, that .findAny() will always return something, etc.

Author

yeah agreed, I will add checks; currently it's modeled on the existing normalization techniques (MinMax/L2), which use similar code.
Also, I'm not sure if there could be a potential issue with the implicit assumption we make when discovering the number of sub-queries. Currently it assumes that every non-empty shard result contains an entry for every sub-query, even those returned with 0 hits. I didn't confirm that this assumption always holds, but it would be nice if we had some way of passing the count as metadata from upstream instead of making implicit assumptions. So just placing it here as food for thought, in case there is interest in exploring that.

* SPDX-License-Identifier: Apache-2.0
*/

package org.opensearch.neuralsearch.query;
Member

I think this integ test belongs in the processor package, as we're mainly testing a normalization technique that is part of the processor.

Author

sounds good, will move it to the processor package


private Optional<Float> getMaxScore(Map<String, Object> searchResponseAsMap) {
Map<String, Object> hitsMap = (Map<String, Object>) searchResponseAsMap.get("hits");
return hitsMap.get("max_score") == null ? Optional.empty() : Optional.of(((Double) hitsMap.get("max_score")).floatValue());
Member

these 3 methods are copied from https://github.com/opensearch-project/neural-search/blob/main/src/test/java/org/opensearch/neuralsearch/query/HybridQueryIT.java#L271-L283; can we refactor the code and pull them into a base class or a utility class for tests?
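One possible shape for the shared helper (the class name is hypothetical; it mirrors the getMaxScore snippet shown in this diff, with an extra null guard on the hits map):

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical shared test utility so HybridQueryIT and the new z-score IT
// don't each carry their own copy of the response-parsing helpers.
public final class TestResponseUtils {
    private TestResponseUtils() {}

    @SuppressWarnings("unchecked")
    public static Optional<Float> getMaxScore(final Map<String, Object> searchResponseAsMap) {
        Map<String, Object> hitsMap = (Map<String, Object>) searchResponseAsMap.get("hits");
        Object maxScore = hitsMap == null ? null : hitsMap.get("max_score");
        return maxScore == null ? Optional.empty() : Optional.of(((Double) maxScore).floatValue());
    }
}
```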

new TopDocs(new TotalHits(0, TotalHits.Relation.EQUAL_TO), new ScoreDoc[0]),
new TopDocs(
new TotalHits(3, TotalHits.Relation.EQUAL_TO),
new ScoreDoc[] { new ScoreDoc(3, 0.98058068f), new ScoreDoc(4, 0.39223227f), new ScoreDoc(2, -1.37281295f) }
Member

can you please add a function or description with the formula of how those expected scores are calculated?

Also, why do we have a negative score value for one of the ScoreDocs?

Author

@martin-gaievski will add documentation. Regarding negatives, z-scores can also be negative:
https://www.z-table.com/
Btw, I should have mentioned in the comments that the combiner is not good at dealing with negative values, but addressing that would add scope.

}
}

static private float[] findScoreSumPerSubQuery(final List<CompoundTopDocs> queryTopDocs, final int numOfScores) {
Collaborator

nit: private would be better unless you have a specific reason for this to be static. A better way would be to move all these methods to another class, to make it easier to write unit tests.

Author

The convention I was following is that if a method does not depend on any instance state, it should be static.
Regarding refactoring the methods out to a utility class: are there any other classes that could use them now or in the future? Ideally I would like to avoid creating unnecessary abstraction.

@sam-herman
Author

> Created a feature branch from main, for this feature: feature/z-score-normalization
>
> Please raise the PR against that branch. Also, can you add an entry in the CHANGELOG.md file for this change?

Thank you @navneet1v @heemin32 @martin-gaievski for reviewing, I will create a new PR against the feature branch with your comments addressed.

Signed-off-by: Samuel Herman <[email protected]>
@sam-herman
Author

> As you have already written the full code, it's a good time to start doing the performance testing and search relevancy testing for this feature.

@navneet1v can you point me to how to run the performance and search relevancy tests? Is there a ready workflow with existing benchmarks?
