Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update UpdateForV9 in AttachmentProcessor #118186

Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
91 changes: 68 additions & 23 deletions docs/reference/ingest/processors/attachment.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -19,15 +19,15 @@ representation. The processor will skip the base64 decoding then.
.Attachment options
[options="header"]
|======
| Name | Required | Default | Description
| `field` | yes | - | The field to get the base64 encoded field from
| `target_field` | no | attachment | The field that will hold the attachment information
| `indexed_chars` | no | 100000 | The number of chars being used for extraction to prevent huge fields. Use `-1` for no limit.
| `indexed_chars_field` | no | `null` | Field name from which you can overwrite the number of chars being used for extraction. See `indexed_chars`.
| `properties` | no | all properties | Array of properties to select to be stored. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document
| `remove_binary` | no | `false` | If `true`, the binary `field` will be removed from the document
| `resource_name` | no | | Field containing the name of the resource to decode. If specified, the processor passes this resource name to the underlying Tika library to enable https://tika.apache.org/1.24.1/detection.html#Resource_Name_Based_Detection[Resource Name Based Detection].
| Name | Required | Default | Description
| `field` | yes | - | The field to get the base64 encoded field from
| `target_field` | no | attachment | The field that will hold the attachment information
| `indexed_chars` | no | 100000 | The number of chars being used for extraction to prevent huge fields. Use `-1` for no limit.
| `indexed_chars_field` | no | `null` | Field name from which you can overwrite the number of chars being used for extraction. See `indexed_chars`.
| `properties` | no | all properties | Array of properties to select to be stored. Can be `content`, `title`, `name`, `author`, `keywords`, `date`, `content_type`, `content_length`, `language`
| `ignore_missing` | no | `false` | If `true` and `field` does not exist, the processor quietly exits without modifying the document
| `remove_binary` | encouraged | `false` | If `true`, the binary `field` will be removed from the document. This option is not required, but setting it explicitly is encouraged, and omitting it will result in a warning.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure of the best way to convey this state of affairs... Alternative suggestions welcome!

| `resource_name` | no | | Field containing the name of the resource to decode. If specified, the processor passes this resource name to the underlying Tika library to enable https://tika.apache.org/1.24.1/detection.html#Resource_Name_Based_Detection[Resource Name Based Detection].
|======

[discrete]
Expand Down Expand Up @@ -58,7 +58,7 @@ PUT _ingest/pipeline/attachment
{
"attachment" : {
"field" : "data",
"remove_binary": false
"remove_binary": true
}
}
]
Expand All @@ -82,7 +82,6 @@ The document's `attachment` object contains extracted properties for the file:
"_seq_no": 22,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
Expand All @@ -94,9 +93,6 @@ The document's `attachment` object contains extracted properties for the file:
----
// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

NOTE: Keeping the binary as a field within the document might consume a lot of resources. It is highly recommended
to remove that field from the document. Set `remove_binary` to `true` to automatically remove the field.

[[attachment-fields]]
==== Exported fields

Expand Down Expand Up @@ -143,7 +139,7 @@ PUT _ingest/pipeline/attachment
"attachment" : {
"field" : "data",
"properties": [ "content", "title" ],
"remove_binary": false
"remove_binary": true
}
}
]
Expand All @@ -154,6 +150,59 @@ NOTE: Extracting contents from binary data is a resource intensive operation and
consumes a lot of resources. It is highly recommended to run pipelines
using this processor in a dedicated ingest node.

[[attachment-keep-binary]]
==== Keeping the attachment binary

Keeping the binary as a field within the document might consume a lot of resources. It is highly recommended to remove
that field from the document, by setting `remove_binary` to `true` to automatically remove the field, as in the other
examples shown on this page. If you _do_ want to keep the binary field, explicitly set `remove_binary` to `false` to
avoid the warning you get from omitting it:

[source,console]
----
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information including original binary",
"processors" : [
{
"attachment" : {
"field" : "data",
"remove_binary": false
}
}
]
}
PUT my-index-000001/_doc/my_id?pipeline=attachment
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my-index-000001/_doc/my_id
----

The document's `_source` object includes the original binary field:

[source,console-result]
----
{
"found": true,
"_index": "my-index-000001",
"_id": "my_id",
"_version": 1,
"_seq_no": 22,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
"content": "Lorem ipsum dolor sit amet",
"content_length": 28
}
}
}
----
// TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term" : 1/"_primary_term" : $body._primary_term/]

[[attachment-cbor]]
==== Use the attachment processor with CBOR

Expand All @@ -170,7 +219,7 @@ PUT _ingest/pipeline/cbor-attachment
{
"attachment" : {
"field" : "data",
"remove_binary": false
"remove_binary": true
}
}
]
Expand Down Expand Up @@ -226,7 +275,7 @@ PUT _ingest/pipeline/attachment
"field" : "data",
"indexed_chars" : 11,
"indexed_chars_field" : "max_size",
"remove_binary": false
"remove_binary": true
}
}
]
Expand All @@ -250,7 +299,6 @@ Returns this:
"_seq_no": 35,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "is",
Expand All @@ -274,7 +322,7 @@ PUT _ingest/pipeline/attachment
"field" : "data",
"indexed_chars" : 11,
"indexed_chars_field" : "max_size",
"remove_binary": false
"remove_binary": true
}
}
]
Expand All @@ -299,7 +347,6 @@ Returns this:
"_seq_no": 40,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"max_size": 5,
"attachment": {
"content_type": "application/rtf",
Expand Down Expand Up @@ -358,7 +405,7 @@ PUT _ingest/pipeline/attachment
"attachment": {
"target_field": "_ingest._value.attachment",
"field": "_ingest._value.data",
"remove_binary": false
"remove_binary": true
}
}
}
Expand Down Expand Up @@ -396,7 +443,6 @@ Returns this:
"attachments" : [
{
"filename" : "ipsum.txt",
"data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
"attachment" : {
"content_type" : "text/plain; charset=ISO-8859-1",
"language" : "en",
Expand All @@ -406,7 +452,6 @@ Returns this:
},
{
"filename" : "test.txt",
"data" : "VGhpcyBpcyBhIHRlc3QK",
"attachment" : {
"content_type" : "text/plain; charset=ISO-8859-1",
"language" : "en",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.logging.DeprecationCategory;
import org.elasticsearch.common.logging.DeprecationLogger;
import org.elasticsearch.core.UpdateForV9;
import org.elasticsearch.core.UpdateForV10;
import org.elasticsearch.ingest.AbstractProcessor;
import org.elasticsearch.ingest.IngestDocument;
import org.elasticsearch.ingest.Processor;
Expand Down Expand Up @@ -196,7 +196,7 @@ public IngestDocument execute(IngestDocument ingestDocument) {
* @param property property to add
* @param value value to add
*/
private <T> void addAdditionalField(Map<String, Object> additionalFields, Property property, String value) {
private void addAdditionalField(Map<String, Object> additionalFields, Property property, String value) {
if (properties.contains(property) && Strings.hasLength(value)) {
additionalFields.put(property.toLowerCase(), value);
}
Expand Down Expand Up @@ -233,16 +233,16 @@ public AttachmentProcessor create(
String processorTag,
String description,
Map<String, Object> config
) throws Exception {
) {
String field = readStringProperty(TYPE, processorTag, config, "field");
String resourceName = readOptionalStringProperty(TYPE, processorTag, config, "resource_name");
String targetField = readStringProperty(TYPE, processorTag, config, "target_field", "attachment");
List<String> propertyNames = readOptionalList(TYPE, processorTag, config, "properties");
int indexedChars = readIntProperty(TYPE, processorTag, config, "indexed_chars", NUMBER_OF_CHARS_INDEXED);
boolean ignoreMissing = readBooleanProperty(TYPE, processorTag, config, "ignore_missing", false);
String indexedCharsField = readOptionalStringProperty(TYPE, processorTag, config, "indexed_chars_field");
@UpdateForV9(owner = UpdateForV9.Owner.DATA_MANAGEMENT)
// update the [remove_binary] default to be 'true' assuming enough time has passed. Deprecated in September 2022.
@UpdateForV10(owner = UpdateForV10.Owner.DATA_MANAGEMENT)
// Revisit whether we want to update the [remove_binary] default to be 'true' - would need to find a way to do this safely
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B. I'm not currently changing the deprecation message (which says that the property "will be set to 'true' in a future release", when actually we're not committed to that) because we don't know for sure what we're going to do in V10 and it seems better to (a) stay with the status quo, and (b) err on the side of making an overly-strong warning. I'm open to suggestions for rewording, though.

Boolean removeBinary = readOptionalBooleanProperty(TYPE, processorTag, config, "remove_binary");
if (removeBinary == null) {
DEPRECATION_LOGGER.warn(
Expand Down