Add a more efficient PathExtractor Implementation #28

rmarrowstone · 2024-09-11T00:30:44Z

The existing implementation loops through every potential search
path at each step of matching. It therefore exhibits exponential time
complexity as the number of extracted fields grows. Some of the
Amazon usage involves extractions of thousands of attributes and
users needed to craft a workaround for the performance hit.

This change adds a more efficient PathExtractor implementation that
supports a narrower set of SearchPaths and combinations thereof, but
avoids the exponential time cost. This is reflected in the benchmark
results and testing on my own toy dataset. The results show that
the relative performance of the new implementation improves as the
number of extracted attributes increases.

It is built with a key assumption that does not hold for the existing
implementation: that users may not register paths that would result
in a single element matching more than one terminal matcher.

I searched for Amazon internal usages and believe that the supported
usage patterns cover any and all usage I could find. I found no usage
of Annotations not on the Top-Level-Values themselves.

Note: I changed the "dom" benchmarks to iterate over Top-Level-Values.
This makes them faster and a fairer comparison target.

Control (Iterate over IonValues)
PathExtractorBenchmark.domBinary           thrpt    5   5.060 ±  0.075  ops/s
PathExtractorBenchmark.domText             thrpt    5   1.172 ±  0.040  ops/s

Benchmark                                   Mode  Cnt   Score   Error  Units
PathExtractorBenchmark.fullBinary          thrpt    5   3.846 ± 0.033  ops/s
PathExtractorBenchmark.fullText            thrpt    5   1.094 ± 0.008  ops/s
PathExtractorBenchmark.partialBinary       thrpt    5  43.020 ± 2.681  ops/s
PathExtractorBenchmark.partialBinaryNoDom  thrpt    5  42.628 ± 2.549  ops/s
PathExtractorBenchmark.partialText         thrpt    5   2.461 ± 0.121  ops/s
PathExtractorBenchmark.partialTextNoDom    thrpt    5   2.368 ± 0.121  ops/s

Benchmark                                   Mode  Cnt   Score    Error  Units
PathExtractorBenchmark.fullBinary          thrpt    5   6.011 ±  0.107  ops/s
PathExtractorBenchmark.fullText            thrpt    5   1.214 ±  0.025  ops/s
PathExtractorBenchmark.partialBinary       thrpt    5  57.329 ± 13.585  ops/s
PathExtractorBenchmark.partialBinaryNoDom  thrpt    5  56.598 ±  2.424  ops/s
PathExtractorBenchmark.partialText         thrpt    5   2.430 ±  0.073  ops/s
PathExtractorBenchmark.partialTextNoDom    thrpt    5   2.416 ±  0.175  ops/s

I will update the README to reflect the new perf and usage guidance once I'm
sure there won't be major changes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

The existing implementation loops through every potential search path at each step of matching. And therefore has exponential time complexity as the number of extracted fields grows. This change adds a more efficient PathExtractor implementation that supports a narrower set of SearchPaths and combinations thereof. But avoids the exponential time cost. In the benchmark testing on my machine, I found it to be significantly faster for the "full" extraction and slightly faster for the "partial" extraction. With the throughput score from the benchmarking about 6.5:3.8 and 48:42. I also found it to be significantly faster (roughly 3:5) for extracting most of the fields out of binary Ion dataset of about 20k records, each with about 15 distinct elements, some nested. It is built with a key assumption that does not hold for the existing implementation: that users may not register paths that would result in a single element matching more than one terminal matcher. I searched for Amazon internal usages and believe that the supported usage patterns cover any and all usage I could find. I found no usage of Annotations not on the Top-Level-Values themselves.

Applied some minor tweaks from profiling: * Use Lazy String.format in invariant checks in hot path * Use primitives where possible to avoid auto-boxing Made tests work for both Fsm and Legacy Impls. Added some test cases around case-(in)sensitivity. Minor error message formatting tweaks.

Removed incomplete and unused toStrings() from PathComponents. Made hasAnnotations method name consistent. Added trailing newlines to files that lacked them.

Micro-optimization with statistically significant performance improvement.

rmarrowstone · 2024-09-12T21:53:36Z

src/test/kotlin/com/amazon/ionpathextraction/PathExtractorTest.kt

@@ -241,30 +255,12 @@ class PathExtractorTest {
        assertEquals(ION.singleValue("[1]"), out)
    }

-    @ParameterizedTest
-    @EnumSource(API::class)
-    fun caseInsensitive(api: API) {


This moved into the test-cases.ion with some other case insensitive tests.

tgregg · 2024-09-12T22:39:00Z

src/main/java/com/amazon/ionpathextraction/internal/Preconditions.java

@@ -28,27 +28,57 @@ public class Preconditions {

    /**
     * Validates argument, fails if condition is not met.
+     * Prefer variable arity overload instead of concatenating Strings at call-site!!!


I didn't understand what this meant at first, but I get it now. I think it would be more clear if you just {@link ...} the method that should be preferred.

tgregg · 2024-09-12T22:41:17Z

src/main/java/com/amazon/ionpathextraction/UnsupportedPathExpression.java

+
+package com.amazon.ionpathextraction;
+
+public class UnsupportedPathExpression extends RuntimeException {


A brief comment about when this should be thrown would be appreciated.

src/main/java/com/amazon/ionpathextraction/PathExtractorBuilder.java

src/main/java/com/amazon/ionpathextraction/FsmMatcherBuilder.java

tgregg · 2024-09-13T22:00:13Z

src/main/java/com/amazon/ionpathextraction/SearchPath.java

+     * The "normalized" path treats this as an explicit Wildcard step and adds it to the head
+     * of the PathComponents.
+     */
+    List<PathComponent> getNormalizedPath() {


What is the practical reason why this is needed by the FsmPathExtractor but not the PathExtractorImpl?

The PathExtractorImpl uses the SearchPath to hold and manage the Annotations matching for TopLevelValues. The FsmPathExtractor treats matching Top-Level-Values, annotations or not, as just another PathComponent; it is a Wildcard or Annotations match, the same as any other nested step. I think it's a simplifier, but it does take a small shift in thinking.

But I realize that's not a reason why it's needed. It just makes the code cleaner. Looking at it, I can just manage this in the FsmMatcherBuilder and avoid exposing more to this internal but public API... and avoid exposing the Wildcard constructor too.

Well I did try to avoid this, but I practically needed to expose the Annotions and PathComponents from the SearchPath anyway. And given my initial approach was cleaner it seemed preferable.

Which brings me to the implicit insight in my initial reply: in the PathExtractorImpl the SearchPaths do the filtering. In this implementation, there are a DTO to represent the result of parsing the users' expressions. I needed to expose some more data to do that and this just seemed like a pragmatic choice. 🤷

Co-authored-by: Tyler Gregg <[email protected]>

…r.java Co-authored-by: Tyler Gregg <[email protected]>

rmarrowstone added 6 commits September 10, 2024 17:27

Fix Checkstyle Errors

c157cfb

Some Trivial Updates

1ffe71c

Removed incomplete and unused toStrings() from PathComponents. Made hasAnnotations method name consistent. Added trailing newlines to files that lacked them.

Remove one more incomplete toString()

1ab2032

Break on stepOut instead of check in while condition

6bc6a8c

Micro-optimization with statistically significant performance improvement.

rmarrowstone marked this pull request as ready for review September 12, 2024 21:01

rmarrowstone requested a review from tgregg September 12, 2024 21:01

rmarrowstone commented Sep 12, 2024

View reviewed changes

tgregg reviewed Sep 13, 2024

View reviewed changes

rmarrowstone and others added 2 commits September 13, 2024 15:32

Update src/main/java/com/amazon/ionpathextraction/FsmMatcherBuilder.java

03c9f0f

Co-authored-by: Tyler Gregg <[email protected]>

Update src/main/java/com/amazon/ionpathextraction/FsmMatcherBuilder.java

107502c

Co-authored-by: Tyler Gregg <[email protected]>

tgregg approved these changes Sep 13, 2024

View reviewed changes

rmarrowstone and others added 5 commits September 17, 2024 15:57

Update src/main/java/com/amazon/ionpathextraction/PathExtractorBuilde…

00cd91d

…r.java Co-authored-by: Tyler Gregg <[email protected]>

Minor Javadoc Fixups

8dc0efd

Minor javadoc fixup

8ba4616

Update README.md to reflect strict mode

2e180e2

Enhance Updated README.md

8ef2f2a

rmarrowstone merged commit 84ef782 into master Sep 18, 2024
4 checks passed

tgregg deleted the fsm-matching branch September 23, 2024 19:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a more efficient PathExtractor Implementation #28

Add a more efficient PathExtractor Implementation #28

rmarrowstone commented Sep 11, 2024 •

edited

Loading

rmarrowstone Sep 12, 2024

tgregg Sep 12, 2024

tgregg Sep 12, 2024

tgregg Sep 13, 2024

rmarrowstone Sep 13, 2024

rmarrowstone Sep 17, 2024 •

edited

Loading

rmarrowstone Sep 17, 2024


		package com.amazon.ionpathextraction;

		public class UnsupportedPathExpression extends RuntimeException {

Add a more efficient PathExtractor Implementation #28

Add a more efficient PathExtractor Implementation #28

Conversation

rmarrowstone commented Sep 11, 2024 • edited Loading

rmarrowstone Sep 12, 2024

Choose a reason for hiding this comment

tgregg Sep 12, 2024

Choose a reason for hiding this comment

tgregg Sep 12, 2024

Choose a reason for hiding this comment

tgregg Sep 13, 2024

Choose a reason for hiding this comment

rmarrowstone Sep 13, 2024

Choose a reason for hiding this comment

rmarrowstone Sep 17, 2024 • edited Loading

Choose a reason for hiding this comment

rmarrowstone Sep 17, 2024

Choose a reason for hiding this comment

rmarrowstone commented Sep 11, 2024 •

edited

Loading

rmarrowstone Sep 17, 2024 •

edited

Loading