PoC DO NOT MERGE - Store semantic_text mapping info #9

carlosdelest · 2023-10-27T19:22:15Z

This PoC adds semantic_text mapping information to the MappingLookup structure so that the field inference service can retrieve it.
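To make the idea concrete, here is a minimal sketch of the kind of lookup this enables: a field-to-model map that an inference step could consult. It is not the real MappingLookup API. The class `SemanticTextLookupSketch` and its methods are hypothetical, and `modelForField` only borrows its name from one of the commits in this PR; its actual signature may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for the semantic_text information kept next to the mappings.
// Names and shapes are assumptions, not the actual MappingLookup code.
final class SemanticTextLookupSketch {
    // semantic_text field name -> model id, in registration order
    private final Map<String, String> modelByField = new LinkedHashMap<>();

    void register(String fieldName, String modelId) {
        modelByField.put(fieldName, modelId);
    }

    // Returns the model id for a field, or null when the field was never registered
    // (for example, a plain "text" field).
    String modelForField(String fieldName) {
        return modelByField.get(fieldName);
    }

    // Groups field names by model id so a caller can see which fields share a model.
    Map<String, Set<String>> fieldsForModels() {
        Map<String, Set<String>> result = new TreeMap<>();
        modelByField.forEach((field, model) -> result.computeIfAbsent(model, m -> new TreeSet<>()).add(field));
        return result;
    }
}
```

With the test mapping from this description, both `infer_field` and `another_infer_field` would resolve to `my-elser-model`, while `non_infer_field` would resolve to null.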

Both the semantic_text field type and the field inference service were fixed so that they work with multiple inference fields in the same document.

Code for testing:

Deploy ELSERv2 model:

PUT _ml/trained_models/.elser_model_2
{
  "input": {
    "field_names": ["text_field"]
  }
}

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}

Create an index whose mapping references the model id created above:

PUT test-semantic
{
    "mappings": {
        "properties": {
            "infer_field": {
                "type": "semantic_text",
                "model_id": "my-elser-model"
            },
            "non_infer_field": {
                "type": "text"
            },
            "another_infer_field": {
                "type": "semantic_text",
                "model_id": "my-elser-model"
            }
        }
    }
}

Ingest a document:

PUT test-semantic/_doc/doc1
{
    "infer_field": "these are not the droids you're looking for",
    "non_infer_field": "hello",
    "another_infer_field": "carry on"
}

The inference process uses the model_id specified in the mapping and produces the following document:

GET test-semantic/_doc/doc1
{
    "_index": "test-semantic",
    "_id": "doc1",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "non_infer_field": "hello",
        "infer_field": {
            "inference": {
                "lucas": 0.05212344,
                "ty": 0.041213956,
                "dragon": 0.50991,
                "type": 0.23241979,
                "dr": 1.9312073,
                "##o": 0.2797593,
                "these": 1.1422911
            },
            "text": "these are not the droids you're looking for"
        },
        "another_infer_field": {
            "inference": {
                "gift": 0.30502087,
                "ryan": 0.6608564,
                "possession": 0.16975912,
                "expedition": 0.35585117,
                "bring": 0.062369782,
                "bag": 0.75154513,
                "aviation": 0.011173652,
                "luggage": 0.36109453,
                "continue": 0.019830657,
                "jet": 0.5011043,
                "military": 0.25796777,
                "cargo": 0.41693583
            },
            "text": "carry on"
        }
    }
}
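The transformation visible in the response above can be sketched roughly as follows: each semantic_text value is replaced by an object holding the inference output and the original text, and fields are handled one after another. This only illustrates the resulting shape, not the PoC's actual code path. `InferenceSourceSketch`, `applyInference`, and the `Function`-based model stand-in are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch only: mimics the shape of the resulting _source, not the real ingestion code.
final class InferenceSourceSketch {
    // modelByField:     semantic_text field name -> model id (as declared in the mapping)
    // inferenceByModel: model id -> hypothetical inference function (text -> token weights)
    static Map<String, Object> applyInference(
        Map<String, Object> source,
        Map<String, String> modelByField,
        Map<String, Function<String, Map<String, Double>>> inferenceByModel
    ) {
        Map<String, Object> result = new LinkedHashMap<>();
        // Fields are processed sequentially, in source order.
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            String modelId = modelByField.get(entry.getKey());
            if (modelId == null || !(entry.getValue() instanceof String)) {
                // Not an inference field (or not plain text): copy through unchanged.
                result.put(entry.getKey(), entry.getValue());
                continue;
            }
            Function<String, Map<String, Double>> model = inferenceByModel.get(modelId);
            if (model == null) {
                throw new IllegalArgumentException("Unknown model id: " + modelId);
            }
            String text = (String) entry.getValue();
            Map<String, Object> wrapped = new LinkedHashMap<>();
            wrapped.put("inference", model.apply(text));
            wrapped.put("text", text);
            result.put(entry.getKey(), wrapped);
        }
        return result;
    }
}
```

Passing a stub function for `my-elser-model` is enough to exercise the shape without a deployed model.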


benwtrent · 2023-10-31T16:22:46Z

server/src/main/java/org/elasticsearch/index/mapper/SemanticTextFieldMapper.java

+        public SemanticTextFieldMapper build(MapperBuilderContext context) {
+            String fullName = context.buildFullName(name);
+            String subfieldName = fullName + "." + SPARSE_VECTOR_SUBFIELD_NAME;
+            SparseVectorFieldMapper sparseVectorFieldMapper = new SparseVectorFieldMapper.Builder(subfieldName).build(context);
+            return new SemanticTextFieldMapper(
+                name(),
+                new SemanticTextFieldType(name(), modelId.getValue(), meta.getValue()),
+                modelId.getValue(),
+                sparseVectorFieldMapper,
+                copyTo,
+                this
+            );
+        }


I am not 100% sure about this. Maybe there should be a top-level field like _inference_results instead?

Dynamically defaulting to excluding fields from the results gets really tricky.

How we store these things will likely be dictated by how we default to excluding them from _source in search requests, while still allowing users to request them explicitly (and allowing them to be indexed via reindex).

Take a look at MetadataFieldMapper for some inspiration.

carlosdelest · 2023-10-31

Got it - so the idea would be to use a top-level field that nests all inference results, instead of nesting the subfields inside the actual semantic_text field. The _inference_results field would be populated by the ingestion process.

So you're suggesting we create a new MetadataFieldMapper (or similar) that handles all the information passed in _source under _inference_results and creates the appropriate Lucene fields for storing it.

I'll give it a go as a separate PoC to check, thanks for the pointers!
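For reference, a rough sketch of the layout discussed in this thread: inference output is hoisted under a single top-level `_inference_results` key, which is then left out of the returned source unless explicitly requested. Only the `_inference_results` name comes from the comments above. `InferenceResultsLayoutSketch`, `hoist`, and `forResponse` are hypothetical, and none of this reflects how a MetadataFieldMapper would actually implement it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the alternative document layout; everything except the field name is an assumption.
final class InferenceResultsLayoutSketch {
    static final String INFERENCE_RESULTS_FIELD = "_inference_results";

    // Moves each field's "inference" object under one top-level key and
    // restores the original text as the field's value.
    static Map<String, Object> hoist(Map<String, Object> nestedSource) {
        Map<String, Object> result = new LinkedHashMap<>();
        Map<String, Object> inferenceResults = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : nestedSource.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map<?, ?>
                && ((Map<?, ?>) value).containsKey("inference")
                && ((Map<?, ?>) value).containsKey("text")) {
                Map<?, ?> wrapped = (Map<?, ?>) value;
                result.put(entry.getKey(), wrapped.get("text"));
                inferenceResults.put(entry.getKey(), wrapped.get("inference"));
            } else {
                result.put(entry.getKey(), value);
            }
        }
        if (!inferenceResults.isEmpty()) {
            result.put(INFERENCE_RESULTS_FIELD, inferenceResults);
        }
        return result;
    }

    // Returns a copy of the source that omits the top-level key unless the caller asks for it.
    static Map<String, Object> forResponse(Map<String, Object> source, boolean includeInferenceResults) {
        Map<String, Object> copy = new LinkedHashMap<>(source);
        if (!includeInferenceResults) {
            copy.remove(INFERENCE_RESULTS_FIELD);
        }
        return copy;
    }
}
```

One consequence of this shape is that excluding inference output becomes a single-key decision rather than a per-field one, which seems to be the main appeal of the top-level approach.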

carlosdelest added 6 commits October 27, 2023 10:13

Merge remote-tracking branch 'origin/main' into carlosdelest/store-se…

8b2ca59

…mantic-text-mapping-info # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java # server/src/main/java/org/elasticsearch/node/Node.java

Fix merge from main

3276167

Add inference models to field types

7a38395

Fix issues in field inference when multiple fields use inference

7e52a49

Fix node construction

f6088b0

Added back modelForField

99def84

carlosdelest changed the title ~~Store semantic_text mapping info~~ PoC DO NOT MERGE - Store semantic_text mapping info Oct 27, 2023

carlosdelest added 2 commits October 30, 2023 18:55

Fix or operation

ac89ac5

Made inference sequential so it works better with multiple fields

338ecd7

benwtrent reviewed Oct 31, 2023


carlosdelest mentioned this pull request Nov 1, 2023

PoC DO NOT MERGE - Use root object for storing field inference #10

Closed

carlosdelest closed this Aug 1, 2024
