Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding 'dense_vector' field type #3659

Closed

Conversation

martin-gaievski
Copy link
Member

Signed-off-by: Martin Gaievski [email protected]

Description

Adding a new field type dense_vector. It's a first step in enabling KNN search in core OpenSearch. This implementation is based on Lucene HNSW, more details in corresponding RFC #3545

We'll be adding query type support in next PR, limiting this change to field type only to keep PR/change size smaller. With this change user will be able to :

  • define fields of type dense_vector with an option to index for the KNN search
  • add documents with dense_vector fields to the index

Issues Resolved

opensearch-project/k-NN#39

Check List

  • New functionality includes testing.
    • All tests pass
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success e6f79103191b342d6dd26eca23fb31f23609aaf7
Log 6242

Reports 6242

@martin-gaievski martin-gaievski marked this pull request as ready for review June 24, 2022 22:25
@martin-gaievski martin-gaievski requested review from a team and reta as code owners June 24, 2022 22:25
@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success b46bdb08b912c7cb65813b54bda729d766843966
Log 6313

Reports 6313

@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success 521644c537396b2ccdc00fb057761a3270afdb2b
Log 6314

Reports 6314

* compatible open source license.
*/

/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is new code, isn't it? Doesn't need an ES license grant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm adding several new classes. I actually wasn't sure on which one to use, checked few recent check-ins and classes in the same package, they used ES style license.
Let me update PR and exclude ES part

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only SPDX header is required for new code AFAIK.

Copy link
Member Author

@martin-gaievski martin-gaievski Jun 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind changing, it's the same we use in kNN repo. I just checked few random PRs for this repo before making this change, found the header I used here, e.g. #3004 or #690. At the same time I haven't found any example with the compact header.

Do you mean it's some recent requirement for the SPDX header and there might be no examples in the repo yet?

Copy link
Collaborator

@nknize nknize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @martin-gaievski this is coming together! I left some initial feedback. Nothing too major. Mostly around test scaffolding and some extensibility protections.

import static org.opensearch.index.mapper.KnnAlgorithmContextFactory.HNSW_PARAMETER_BEAM_WIDTH;
import static org.opensearch.index.mapper.KnnAlgorithmContextFactory.HNSW_PARAMETER_MAX_CONNECTIONS;

public class DenseVectorFieldTypeTests extends OpenSearchSingleNodeTestCase {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these are FieldMapperTests (parsing the mapping) not FieldTypeTests so lets refactor them over to DenseVectorFieldMapperTests.

Also, OpenSearchSingleNodeTestCase is slow and big so we try to avoid it. Let's refactor this test class as follows:

Suggested change
public class DenseVectorFieldTypeTests extends OpenSearchSingleNodeTestCase {
public class DenseVectorFieldTypeTests extends FieldTypeTestCase {

For guidance on this test see the CollationFieldTypeTests as an example.

import static org.opensearch.index.mapper.KnnAlgorithmContextFactory.HNSW_PARAMETER_BEAM_WIDTH;
import static org.opensearch.index.mapper.KnnAlgorithmContextFactory.HNSW_PARAMETER_MAX_CONNECTIONS;

public class DenseVectorMapperTests extends MapperServiceTestCase {
Copy link
Collaborator

@nknize nknize Jun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public class DenseVectorMapperTests extends MapperServiceTestCase {
public class DenseVectorFieldMapperTests extends FieldMapperTestCase2<DenseVectorFieldMapper.Builder> {

We should use the FieldMapperTestCase2 helper class here to ensure we covering mapper merge conflict and serialization tests that are important for the REST client interfaces.

For guidance on this test see the LegacyGeoShapeFieldMapperTests example

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't use FieldMapperTestCase2 directly as DenseVectorFieldMapper.Builder extends ParametrizedFieldMapper.Builder that is not supported by FieldMapperTestCase2, seems that majority of other mapper builders extend ParametrizedFieldMapper.Builder as well.

@nknize what would be your suggestion, is it better to keep new mapper for sense_vector as ParametrizedFieldMapper.Builder and write custom test cases for merge conflicts, or change it to FieldMapper.Builder? I do not see any strong argument for keeping it as ParametrizedFieldMapper.Builder except that it will require additional efforts to re-write current implementation, while with FieldMapper.Builder we obviously will have merge and serialization tests for free?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done refactoring, but to b honest I'm not sure if that is better than a previous version. While test coverage is better, there is a lot of additional code added that has been taking care of by the ParametrizedFieldMapper. Please check and let me know what do you think. Thx!

}

public DenseVectorFieldType(String name, Map<String, String> meta, int dimension, KnnContext knnContext, boolean hasDocValues) {
super(name, false, false, false, TextSearchInfo.NONE, meta);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
super(name, false, false, false, TextSearchInfo.NONE, meta);
super(name, false, false, hasDocValues, TextSearchInfo.NONE, meta);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack


private final int dimension;
private final KnnContext knnContext;
private final boolean hasDocValues;
Copy link
Collaborator

@nknize nknize Jun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MappedFieldType already takes care of hasDocValues for you. Let's remove this variable and use the built in mechanism.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me check MappedFieldType logic, initially I had to add custom code because unit tests were failing. Probably some piece was missing.

import java.util.Map;

/**
* Field Mapper for Dense vector type. Extends ParametrizedFieldMapper in order to easily configure mapping parameters.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want external plugins (e.g., KNN) to support extending this class to add plugin specific logic? If we don't have anything as of yet, then can we make this final? Also let's annotate w/ @opensearch.internal.

Suggested change
* Field Mapper for Dense vector type. Extends ParametrizedFieldMapper in order to easily configure mapping parameters.
* Field Mapper for Dense vector type. Extends ParametrizedFieldMapper in order to easily configure mapping parameters.
*
* @opensearch.internal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

/**
* Abstracts KNN segment of dense_vector field type
*/
public class KnnContext implements ToXContentFragment, Writeable {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mark this class as final to prevent any subclassing fragmentation?

Suggested change
public class KnnContext implements ToXContentFragment, Writeable {
public final class KnnContext implements ToXContentFragment, Writeable {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

import java.util.Objects;

/**
* Abstracts KNN segment of dense_vector field type
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Abstracts KNN segment of dense_vector field type
* Abstracts KNN segment of dense_vector field type
*
* @opensearch.internal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

/**
* Abstracts supported metrics for dense_vector and KNN search
*/
public enum Metric {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be package private?

Suggested change
public enum Metric {
final enum Metric {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think package private should be possible, let me check. I'm not sure if we can/need do enum final

import java.util.Optional;

/**
* Abstracts supported metrics for dense_vector and KNN search
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Abstracts supported metrics for dense_vector and KNN search
* Abstracts supported metrics for dense_vector and KNN search
*
* @opensearch.internal

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

int value = XContentMapValues.nodeIntegerValue(o);
if (value > MAX_DIMENSION) {
throw new IllegalArgumentException(
String.format(Locale.ROOT, "[dimension] value cannot be greater than %d for vector [%s]", MAX_DIMENSION, name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String.format(Locale.ROOT, "[dimension] value cannot be greater than %d for vector [%s]", MAX_DIMENSION, name)
String.format(Locale.ROOT, "[dimension] value %d cannot be greater than %d for vector [%s]", value, MAX_DIMENSION, name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

}
if (value <= 0) {
throw new IllegalArgumentException(
String.format(Locale.ROOT, "[dimension] value must be greater than 0 for vector [%s]", name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String.format(Locale.ROOT, "[dimension] value must be greater than 0 for vector [%s]", name)
String.format(Locale.ROOT, "[dimension] value %d must be greater than 0 for vector [%s]", value, name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@github-actions
Copy link
Contributor

github-actions bot commented Jul 1, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Jul 2, 2022

Gradle Check (Jenkins) Run Completed with:

}

private boolean isDenseVectorFieldType(final MappedFieldType mappedFieldType) {
if (mappedFieldType != null && mappedFieldType instanceof DenseVectorFieldMapper.DenseVectorFieldType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need explicit null check before instanceof ? i think instanceof takes care of that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, I learned a new thing today thanks to you - https://docs.oracle.com/javase/specs/jls/se11/html/jls-15.html#jls-15.20.2. I'll update the code

Signed-off-by: Martin Gaievski <[email protected]>
private final Method method;
private final Map<String, Object> parameters;

private static final int MAX_NUMBER_OF_ALGORITHM_PARAMETERS = 50;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we move all constants to beginning of class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

private static final int MAX_NUMBER_OF_ALGORITHM_PARAMETERS = 50;

public KnnAlgorithmContext(Method method, Map<String, Object> parameters) {
this.method = method;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requireNonNull check for params?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2022

Gradle Check (Jenkins) Run Completed with:

Object v = e.getValue();
if (v instanceof Map) {
throw new MapperParsingException(
String.format(Locale.ROOT, "Unable to parse parameter [%s] for [algorithm]", e.getValue())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String.format(Locale.ROOT, "Unable to parse parameter [%s] for [algorithm]", e.getValue())
String.format(Locale.ROOT, "Unable to parse parameter [%s] for [algorithm]", v)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

Comment on lines 107 to 121
if (parameters == null) {
builder.field(PARAMETERS, (String) null);
} else {
builder.startObject(PARAMETERS);
parameters.forEach((key, value) -> {
try {
builder.field(key, value);
} catch (IOException ioe) {
throw new RuntimeException("Unable to generate xcontent for method component");
}

});
builder.endObject();
}
return builder;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (parameters == null) {
builder.field(PARAMETERS, (String) null);
} else {
builder.startObject(PARAMETERS);
parameters.forEach((key, value) -> {
try {
builder.field(key, value);
} catch (IOException ioe) {
throw new RuntimeException("Unable to generate xcontent for method component");
}
});
builder.endObject();
}
return builder;
if (parameters == null) {
return builder.field(PARAMETERS, (String) null);
}
builder.startObject(PARAMETERS);
parameters.forEach((key, value) -> {
try {
builder.field(key, value);
} catch (IOException ioe) {
throw new RuntimeException("Unable to generate xcontent for method component");
}
});
return builder.endObject();

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2022

Gradle Check (Jenkins) Run Completed with:

*
* @opensearch.internal
*/
public class KnnVectorFormatFactory {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT?

Suggested change
public class KnnVectorFormatFactory {
public class KNNVectorFormatFactory {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to be honest - we're wrapping Lucene classes and they have notation as "Knn" (e.g. https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/KnnVectorField.java), so I've follow them

Comment on lines 57 to 60
if (mappedFieldType instanceof DenseVectorFieldMapper.DenseVectorFieldType) {
return true;
}
return false;
Copy link
Member

@VijayanB VijayanB Jul 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (mappedFieldType instanceof DenseVectorFieldMapper.DenseVectorFieldType) {
return true;
}
return false;
return (mappedFieldType instanceof DenseVectorFieldMapper.DenseVectorFieldType);

or you don't need method if it is used once?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2022

Gradle Check (Jenkins) Run Completed with:

Comment on lines 34 to 36
Objects.requireNonNull(method, "[method] for knn algorithm context cannot be null");
Objects.requireNonNull(parameters, "[parameters] for knn algorithm context cannot be null");

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
Objects.requireNonNull(method, "[method] for knn algorithm context cannot be null");
Objects.requireNonNull(parameters, "[parameters] for knn algorithm context cannot be null");
this.method = Objects.requireNonNull(method, "[method] for knn algorithm context cannot be null");
this.parameters = Objects.requireNonNull(parameters, "[parameters] for knn algorithm context cannot be null");

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2022

Gradle Check (Jenkins) Run Completed with:

@dblock
Copy link
Member

dblock commented Jul 7, 2022

@nknize Take a look at the updates?

@martin-gaievski
Copy link
Member Author

Closing this PR as team decided to go with implementation in k-NN plugin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants