
Document Processors

Patrick (Gus) Heck edited this page Mar 24, 2023 · 16 revisions

Document processors are the heart of what JesterJ is about. Everything else in the system is meant to procure data and feed it to the document processors in an orderly fashion. JesterJ aims to provide useful document processors that, when used in combination, can serve most common use cases. When unique use cases arise, it should also be easy to create custom processors.

Writing Custom Processors

This should be easy. If it's not, please file an issue here on GitHub! All you need to do is implement org.jesterj.ingest.model.DocumentProcessor and pass an instance of your processor to StepImpl.Builder.withProcessor(). That is the only hard requirement right now, but here are some suggestions for smoother operation and adaptability to future releases:

Guidelines

  • If a document errors out, the processor should call doc.setStatus(Status.ERROR) and return the document from processDocument(). The framework handles logging the status change and withholding the document from the next step.
  • If a document should be dropped, likewise call doc.setStatus(Status.DROP) and return the document from processDocument().
  • Your processor should either be stateless or immutable if possible. There will be lots of threads out there once we implement thread pool scaling, and it's far easier to be immutable than thread safe.
  • For future compatibility you will be best served by creating a builder pattern similar to the existing DocumentProcessor implementations, StepImpl, PlanImpl etc. In future releases we may serialize the builders and send them to other nodes so that those nodes can do work too.
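
The guidelines above can be sketched roughly as follows. To keep the example self-contained, `Document`, `Status`, and `DocumentProcessor` are simplified, hypothetical stand-ins declared inline. They are not the real `org.jesterj.ingest.model` types, whose method signatures may differ. `RequireFieldProcessor` and its builder methods are likewise invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins so the sketch compiles on its own; the real
// JesterJ types live in org.jesterj.ingest.model and may differ.
enum Status { PROCESSING, ERROR, DROP }

class Document {
  private final Map<String, List<String>> fields = new HashMap<>();
  private Status status = Status.PROCESSING;

  void put(String field, String value) {
    fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
  }
  List<String> get(String field) { return fields.getOrDefault(field, List.of()); }
  void setStatus(Status status) { this.status = status; }
  Status getStatus() { return status; }
}

interface DocumentProcessor {
  Document[] processDocument(Document document);
}

// Immutable processor: all configuration is final and set via the builder.
final class RequireFieldProcessor implements DocumentProcessor {
  private final String requiredField;
  private final String dropMarkerField;

  private RequireFieldProcessor(Builder b) {
    this.requiredField = b.requiredField;
    this.dropMarkerField = b.dropMarkerField;
  }

  @Override
  public Document[] processDocument(Document doc) {
    if (!doc.get(dropMarkerField).isEmpty()) {
      doc.setStatus(Status.DROP);   // mark as dropped, then hand it back
    } else if (doc.get(requiredField).isEmpty()) {
      doc.setStatus(Status.ERROR);  // mark as errored, then hand it back
    }
    return new Document[] {doc};    // the document is returned in every case
  }

  static final class Builder {
    private String requiredField;
    private String dropMarkerField;

    Builder requiring(String field) { this.requiredField = field; return this; }
    Builder droppingWhenPresent(String field) { this.dropMarkerField = field; return this; }
    RequireFieldProcessor build() { return new RequireFieldProcessor(this); }
  }
}
```

Because the processor holds only final fields, one instance can be shared across threads without synchronization, which is the point of the immutability guideline.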

IntelliJ Templates

A template for creating a document processor implementation is available here:

https://github.com/nsoft/jesterj/tree/master/code/ingest/ide-helpers/intellij/create-file-templates

Existing Processors

Field Manipulations

These processors massage data within a field, or move data among fields.

CopyField

Copies data from an existing field to another field, creating the target field if necessary and either adding to or overwriting its values depending on configuration.

DropFieldProcessor

Removes a field and all its values from the document.

FieldTemplateProcessor

Interprets the value of a field as an Apache Velocity template, using the document as context. If the field has multiple values, each value is interpreted and replaced. Remember that the fields referenced from a template can themselves hold multiple values, so one usually wants to write $foobar[0], not $foobar. The latter is replaced with the string value of the list holding the values: [foo] if the field holds one value, or [foo, bar, baz] if it holds three.
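
The difference between the two references can be illustrated without Velocity at all. The sketch below only mimics what happens when a list-valued reference is turned into a string versus when its first element is selected; `TemplateValueDemo` and its methods are hypothetical names, and the assumption is that the template engine renders a list through its ordinary string form.

```java
import java.util.List;

class TemplateValueDemo {
  // Mimics a reference to the whole field: the list's own string form.
  static String wholeField(List<String> values) {
    return String.valueOf(values);
  }

  // Mimics an indexed reference such as $foobar[0]: the first value only.
  static String firstValue(List<String> values) {
    return values.get(0);
  }
}
```

With the values foo, bar, and baz, `wholeField` yields `[foo, bar, baz]` (Java's list formatting puts a space after each comma), while `firstValue` yields `foo`.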

WARNING: this uses the Velocity templating engine, which is a powerful but potentially dangerous technique! You want to ensure that the template is NOT derived from and does NOT CONTAIN any text that is provided by users or other untrustworthy sources before it is interpreted by this processor. If you allow user data to be interpreted as a template, you have given the user the ability to run ARBITRARY code on the ingestion infrastructure. Recommended usages include specifying the template field as a statically defined field, or drawing it from a known, controlled, and curated database containing templates. Users are also strongly cautioned against chaining multiple instances of this step, since it becomes exponentially more difficult to ensure user-controlled data is not added to the template and then subsequently interpreted. With great power comes great responsibility. Don't run with scissors... you have been warned!

RegexValueReplace

Reads a field value and replaces all matches of the supplied regular expression with the configured replacement.
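
As a rough sketch of the underlying operation, using only `java.util.regex` (the class and method names below are hypothetical and are not the processor's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexReplaceSketch {
  // Applies the same regex replacement to every value of a field.
  static List<String> replaceAll(List<String> values, String regex, String replacement) {
    Pattern pattern = Pattern.compile(regex); // compile once, reuse for each value
    List<String> result = new ArrayList<>();
    for (String value : values) {
      result.add(pattern.matcher(value).replaceAll(replacement));
    }
    return result;
  }
}
```

For example, replacing `\d+` with `#` turns the value `a1b22` into `a#b#`.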

SetReadableFileSize

Sets readable file size field values, as follows:

  • Reads the value of the specified input field, interprets it as a number, determines its magnitude and expresses it as bytes, KB, MB, GB or TB;
  • Provides options to write a combined field ("200 KB"), a units field ("KB"), and/or a numeric field ("200"). If the size is over 1 GB, the size is returned as the number of whole GB, i.e. the size is rounded down to the nearest GB boundary. Similarly for the 1 MB and 1 KB boundaries.
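
The rounding behavior described above can be sketched like this. The sketch assumes 1 KB = 1024 bytes, which the text does not state, and `ReadableSizeSketch` is a hypothetical name rather than the processor's own code.

```java
class ReadableSizeSketch {
  private static final String[] UNITS = {"bytes", "KB", "MB", "GB", "TB"};

  // Returns {combined, units, numeric}, e.g. {"200 KB", "KB", "200"}.
  static String[] readable(long bytes) {
    long value = bytes;
    int unit = 0;
    // Assumption: binary multiples (1024). Integer division rounds down,
    // so 1.9 GB is reported as 1 GB.
    while (value >= 1024 && unit < UNITS.length - 1) {
      value /= 1024;
      unit++;
    }
    return new String[] {value + " " + UNITS[unit], UNITS[unit], Long.toString(value)};
  }
}
```

Under that assumption, 204800 bytes is reported as 200 KB, and a value just under 2 GB is reported as 1 GB.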

SimpleDateTimeReformatter

Takes an input field name and an output field name, parses the input with a date format string (as per Java's DateTimeFormatter class), and then formats the result using an output format (DateTimeFormatter.ISO_INSTANT by default).
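
A minimal sketch of the parse-then-format step with `java.time` follows. It assumes the input carries a date and time but no zone, so the caller supplies one; how the actual processor handles zones is not covered here, and `DateReformatSketch` is a hypothetical name.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

class DateReformatSketch {
  // Parses a zone-less date-time and renders it in ISO_INSTANT form.
  static String reformat(String input, String inputPattern, ZoneId zone) {
    DateTimeFormatter inputFormat = DateTimeFormatter.ofPattern(inputPattern);
    LocalDateTime local = LocalDateTime.parse(input, inputFormat);
    // ISO_INSTANT needs an instant, so the local time is anchored to a zone first.
    return DateTimeFormatter.ISO_INSTANT.format(local.atZone(zone).toInstant());
  }
}
```

For instance, `reformat("03/24/2023 10:15:30", "MM/dd/yyyy HH:mm:ss", ZoneOffset.UTC)` gives `2023-03-24T10:15:30Z`.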

SplitFieldProcessor

Converts the current value or values to multiple values by splitting them on a delimiter.
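
In plain Java the operation looks roughly like the sketch below. It treats the delimiter as literal text, which is an assumption; whether the processor itself accepts a regular expression is not stated here. `SplitSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitSketch {
  // Splits every value on the delimiter and flattens the pieces into one list.
  static List<String> split(List<String> values, String delimiter) {
    String literal = Pattern.quote(delimiter); // assumption: delimiter is literal text
    List<String> out = new ArrayList<>();
    for (String value : values) {
      for (String part : value.split(literal)) {
        out.add(part);
      }
    }
    return out;
  }
}
```

A field holding `a|b` and `c`, split on `|`, ends up holding `a`, `b`, and `c`.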

UrlEncodeFieldProcessor

Applies URLEncoder.encode(value, enc) to the values of the field and replaces the values with the encoded result. The encoding may be specified when this processor is configured.
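
For reference, this is what the underlying JDK call does to a value. The wrapper class `UrlEncodeSketch` is hypothetical and exists only to turn the checked exception into an unchecked one.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncodeSketch {
  // Encodes one value with the named character encoding, e.g. "UTF-8".
  static String encode(String value, String enc) {
    try {
      return URLEncoder.encode(value, enc);
    } catch (UnsupportedEncodingException e) {
      throw new IllegalArgumentException("Unknown encoding: " + enc, e);
    }
  }
}
```

Note that `URLEncoder` produces form-style encoding, so `a b&c` becomes `a+b%26c`, with the space encoded as `+` rather than `%20`.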

SetStaticValue

Data Enrichment

These processors acquire additional data to be added to the document.

FetchUrl

Interprets a specified field as a URL and issues a GET request to fetch the document at that URL. The result is set as a string value in a field in the document. Can be configured to throttle outgoing requests to avoid causing denial of service on the destination site. Uses a simple URL connection and has no facility for adding headers etc. For more complicated scenarios such as authentication, a custom processor implemented with Apache HTTP Client is recommended.

Data Extraction

These processors analyze the raw data read by a scanner and use it to populate many document fields. They also may manipulate the cached copy of the data originally read by the processor.

StaxExtractingProcessor

A memory-efficient processor for extracting information from XML documents. The raw data for the document is parsed with a StaxProcessor. Data is extracted from the XML and mapped to fields in the document. Simple cases can be specified in a fashion similar to XPath but with much less complicated syntax. More complicated cases involving attributes can be achieved with an ElementSpec instance, and full control below the match point can be exercised by supplying a custom StaxExtractingProcessor.LimitedStaxHandlerFactory implementation to the ElementSpec. See the unit test for examples.
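
To show why StAX parsing is memory efficient, here is a minimal sketch using only the JDK's `javax.xml.stream` API. It does not use StaxExtractingProcessor or ElementSpec; `StaxTextSketch` is a hypothetical name, and the sketch handles only elements that contain text and no child elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxTextSketch {
  // Collects the text of every element with the given local name.
  static List<String> textOf(String xml, String elementName) {
    List<String> found = new ArrayList<>();
    try {
      XMLStreamReader reader =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      // The reader pulls one event at a time, so no document tree is built.
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT
            && reader.getLocalName().equals(elementName)) {
          found.add(reader.getElementText()); // valid for text-only elements
        }
      }
      reader.close();
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
    return found;
  }
}
```

Each extracted value could then be mapped to a document field, which is the part the processor's configuration takes care of.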

TikaProcessor

Extracts text and metadata from the document's data via Apache Tika. Can be provided with full Tika configuration using Tika's XML configuration format. Extracted metadata are added to the document as fields with a configurable suffix, and the extracted text replaces the raw data in the Document object.

Data Flow

These processors relate to the flow of data through the ingestion DAG.

LogAndDrop

This processor is for any case where you want to explicitly drop a document based on the results of the logic in a router. The most likely use case for this is as one of the destinations for a RouteByStepName router if the value of a field can be used to determine that the document is not useful for your search index.

LogAndFail

This processor is similar to LogAndDrop except the result will be an error. JesterJ will attempt to reprocess the document, so this is usually only useful for testing error scenarios.

Solr Specific

These processors prepare or send data to Apache Solr.

PreAnalyzeFields

This processor allows you to move the vast majority of the work that Solr does when indexing OUT of Solr. If you have a very heavy analysis phase, using this processor could keep your index responsive to queries even under heavy indexing load, because the heavy load will slow down JesterJ, not Solr!

The configuration for this processor allows you to provide your schema.xml (or current managed_schema) to JesterJ. JesterJ then uses Solr's classes to read the schema and pre-populate the fields with Solr's preanalyzed JSON data. See https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type for further information.

Note that if you are using a different version of Solr than JesterJ, you may want to copy the source for this class into your project, and package your version of Solr into an UnoJar to ensure no compatibility issues.

SendToSolrCloudProcessor

This is how you get documents into Solr! This is typically the final step in the primary path through your ingestion plan. This processor batches documents to ensure Solr is used efficiently. When a batch fails, it automatically falls back to re-sending the documents individually, so that as many documents as possible make it into your index and it's easy to determine which documents actually have problems.
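
The batch-then-individual fallback can be sketched generically as below. This is not the processor's implementation: `BatchFallbackSketch` is a hypothetical name, the `sender` callback stands in for whatever actually talks to Solr, and failure is assumed to surface as a runtime exception.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

class BatchFallbackSketch {
  // Tries the whole batch first. If that fails, re-sends each document on its
  // own and returns only the documents that failed individually.
  static <T> List<T> send(List<T> batch, Consumer<List<T>> sender) {
    List<T> failed = new ArrayList<>();
    try {
      sender.accept(batch);
    } catch (RuntimeException batchFailure) {
      for (T doc : batch) {
        try {
          sender.accept(Collections.singletonList(doc));
        } catch (RuntimeException docFailure) {
          failed.add(doc); // this specific document is the problem
        }
      }
    }
    return failed;
  }
}
```

The common case costs one request per batch, while a bad document only costs extra requests for the batch it happens to be in.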

Abstract Base Classes

These classes may be useful in writing your own processors.

BatchProcessor

Extend this class if you want to write your own processor that sends documents somewhere else (like Elastic, Algolia, Coveo, or any other system that benefits from receiving data in batches).
