PoC DO NOT MERGE - Store semantic_text mapping info #9

carlosdelest
Owner

semantic_text mapping information is added to the MappingLookup structure so that the field inference service can retrieve it.

Both the semantic_text field type and the field inference service were fixed so that they work with multiple inference fields in the same document.
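
To illustrate the kind of lookup this enables, here is a minimal sketch, not the actual MappingLookup code. The class name SemanticTextLookupSketch and its methods are hypothetical; the only thing taken from this PR is the idea that each semantic_text field is associated with a model_id that the inference service needs to look up.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the semantic_text info kept next to the mapping lookup.
// This is NOT the real MappingLookup API.
final class SemanticTextLookupSketch {
    // field full name -> model_id
    private final Map<String, String> modelByField = new LinkedHashMap<>();
    // model_id -> full names of the semantic_text fields that use it
    private final Map<String, Set<String>> fieldsByModel = new LinkedHashMap<>();

    void register(String fieldName, String modelId) {
        String previous = modelByField.put(fieldName, modelId);
        if (previous != null && !previous.equals(modelId)) {
            // The field switched models: drop it from the old model's set.
            fieldsByModel.get(previous).remove(fieldName);
        }
        fieldsByModel.computeIfAbsent(modelId, k -> new LinkedHashSet<>()).add(fieldName);
    }

    // Returns null for fields that are not semantic_text.
    String modelIdFor(String fieldName) {
        return modelByField.get(fieldName);
    }

    // Returns an empty set for unknown models.
    Set<String> fieldsFor(String modelId) {
        return fieldsByModel.getOrDefault(modelId, Set.of());
    }
}
```

Grouping fields by model is one possible design choice: it would let a service issue one inference call per model rather than one per field, but nothing in this PR requires it.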

Code for testing:

Download and deploy the ELSERv2 model:

PUT _ml/trained_models/.elser_model_2
{
  "input": {
    "field_names": ["text_field"]
  }
}

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}

Create an index whose mapping references the model ID of the inference endpoint created above:

PUT test-semantic
{
    "mappings": {
        "properties": {
            "infer_field": {
                "type": "semantic_text",
                "model_id": "my-elser-model"
            },
            "non_infer_field": {
                "type": "text"
            },
            "another_infer_field": {
                "type": "semantic_text",
                "model_id": "my-elser-model"
            }
        }
    }
}

Ingest a document:

PUT test-semantic/_doc/doc1
{
    "infer_field": "these are not the droids you're looking for",
    "non_infer_field": "hello",
    "another_infer_field": "carry on"
}

The inference process uses the model_id specified in the mapping and produces the following document:

GET test-semantic/_doc/doc1
{
    "_index": "test-semantic",
    "_id": "doc1",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "non_infer_field": "hello",
        "infer_field": {
            "inference": {
                "lucas": 0.05212344,
                "ty": 0.041213956,
                "dragon": 0.50991,
                "type": 0.23241979,
                "dr": 1.9312073,
                "##o": 0.2797593,
                "these": 1.1422911
            },
            "text": "these are not the droids you're looking for"
        },
        "another_infer_field": {
            "inference": {
                "gift": 0.30502087,
                "ryan": 0.6608564,
                "possession": 0.16975912,
                "expedition": 0.35585117,
                "bring": 0.062369782,
                "bag": 0.75154513,
                "aviation": 0.011173652,
                "luggage": 0.36109453,
                "continue": 0.019830657,
                "jet": 0.5011043,
                "military": 0.25796777,
                "cargo": 0.41693583
            },
            "text": "carry on"
        }
    }
}
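
The transformation visible in the response above can be sketched roughly as follows. This is an illustration only, not the code in this PR: FieldInferenceSketch, applyInference, and the infer callback are hypothetical names, and the callback stands in for the real inference call. The sketch only assumes what the response shows, namely that each semantic_text field value is replaced by an object holding "inference" and "text" while other fields are left untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch of per-field inference over a document source.
final class FieldInferenceSketch {
    /**
     * Returns a copy of source in which every string field listed in modelByField
     * is replaced by {"inference": <model output>, "text": <original text>}.
     * The infer callback takes (modelId, text) and returns token weights.
     */
    static Map<String, Object> applyInference(
        Map<String, Object> source,
        Map<String, String> modelByField,
        BiFunction<String, String, Map<String, Double>> infer
    ) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            String modelId = modelByField.get(entry.getKey());
            Object value = entry.getValue();
            if (modelId != null && value instanceof String) {
                String text = (String) value;
                Map<String, Object> nested = new LinkedHashMap<>();
                nested.put("inference", infer.apply(modelId, text));
                nested.put("text", text);
                result.put(entry.getKey(), nested);
            } else {
                // Not a semantic_text field: copy through unchanged.
                result.put(entry.getKey(), value);
            }
        }
        return result;
    }
}
```

Because the model is looked up per field, several semantic_text fields in the same document are handled independently, which is the case exercised by infer_field and another_infer_field above.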

@carlosdelest changed the title from "Store semantic_text mapping info" to "PoC DO NOT MERGE - Store semantic_text mapping info" on Oct 27, 2023
Comment on lines +64 to +76
public SemanticTextFieldMapper build(MapperBuilderContext context) {
    String fullName = context.buildFullName(name);
    String subfieldName = fullName + "." + SPARSE_VECTOR_SUBFIELD_NAME;
    SparseVectorFieldMapper sparseVectorFieldMapper = new SparseVectorFieldMapper.Builder(subfieldName).build(context);
    return new SemanticTextFieldMapper(
        name(),
        new SemanticTextFieldType(name(), modelId.getValue(), meta.getValue()),
        modelId.getValue(),
        sparseVectorFieldMapper,
        copyTo,
        this
    );
}

I am not 100% sure about this. Maybe there should be a top-level field like _inference_results?

It gets really tricky to dynamically default to not including fields in the results.

How we store these things will likely be dictated by how we figure out how to default to not including them in _source in search requests, while still allowing users to specifically request them (and allowing them to be indexed via reindex).

Take a look at MetadataFieldMapper for some inspiration.

Owner Author

Got it - so the idea would be to use a top-level field that nests all inference results, instead of nesting the subfields inside the actual semantic_text field. The _inference_results field would be populated by the ingestion process.

So you're suggesting we create a new MetadataFieldMapper (or similar) that handles all the information passed in _source under _inference_results and creates the appropriate Lucene fields for storing it.

I'll give it a go as a separate PoC to check, thanks for the pointers!
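
For reference, a rough sketch of the document shape being discussed. The thread does not define the layout of _inference_results, so everything here is an assumption: the sketch keys results by field name, and InferenceResultsLayoutSketch, toTopLevel, and filterSource are hypothetical names. It shows moving the nested results into a top-level field and dropping that field from the returned source unless it is explicitly requested.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the top-level _inference_results layout (assumed shape).
final class InferenceResultsLayoutSketch {
    static final String INFERENCE_RESULTS = "_inference_results";

    /** Moves each nested {"inference": ..., "text": ...} value into a top-level object. */
    static Map<String, Object> toTopLevel(Map<String, Object> source) {
        Map<String, Object> result = new LinkedHashMap<>();
        Map<String, Object> inferenceResults = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map && ((Map<?, ?>) value).containsKey("inference")) {
                Map<?, ?> nested = (Map<?, ?>) value;
                // The field keeps only its original text; results move to the top level.
                result.put(entry.getKey(), nested.get("text"));
                inferenceResults.put(entry.getKey(), nested.get("inference"));
            } else {
                result.put(entry.getKey(), value);
            }
        }
        if (!inferenceResults.isEmpty()) {
            result.put(INFERENCE_RESULTS, inferenceResults);
        }
        return result;
    }

    /** Drops _inference_results unless the caller explicitly asks for it. */
    static Map<String, Object> filterSource(Map<String, Object> source, boolean includeInferenceResults) {
        Map<String, Object> filtered = new LinkedHashMap<>(source);
        if (!includeInferenceResults) {
            filtered.remove(INFERENCE_RESULTS);
        }
        return filtered;
    }
}
```

With a single top-level field, excluding inference results by default becomes one key removal instead of a walk over every semantic_text field, which is the main appeal of this layout in the sketch.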
