Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a more efficient PathExtractor Implementation #28

Merged
merged 13 commits into from
Sep 18, 2024
Merged

Conversation

rmarrowstone
Copy link
Contributor

@rmarrowstone rmarrowstone commented Sep 11, 2024

The existing implementation loops through every potential search
path at each step of matching. It therefore exhibits exponential time
complexity as the number of extracted fields grows. Some of the
Amazon usage involves extractions of thousands of attributes and
users needed to craft a workaround for the performance hit.

This change adds a more efficient PathExtractor implementation that
supports a narrower set of SearchPaths and combinations thereof, but
avoids the exponential time cost. This is reflected in the benchmark
results and testing on my own toy dataset. The results show that
the relative performance of the new implementation improves as the
number of extracted attributes increases.

It is built with a key assumption that does not hold for the existing
implementation: that users may not register paths that would result
in a single element matching more than one terminal matcher.

I searched for Amazon internal usages and believe that the supported
usage patterns cover any and all usage I could find. I found no usage
of Annotations not on the Top-Level-Values themselves.

Note: I changed the "dom" benchmarks to iterate over Top-Level-Values.
This makes them faster and a fairer comparison target.

Control (Iterate over IonValues)
PathExtractorBenchmark.domBinary           thrpt    5   5.060 ±  0.075  ops/s
PathExtractorBenchmark.domText             thrpt    5   1.172 ±  0.040  ops/s

Benchmark                                   Mode  Cnt   Score   Error  Units
PathExtractorBenchmark.fullBinary          thrpt    5   3.846 ± 0.033  ops/s
PathExtractorBenchmark.fullText            thrpt    5   1.094 ± 0.008  ops/s
PathExtractorBenchmark.partialBinary       thrpt    5  43.020 ± 2.681  ops/s
PathExtractorBenchmark.partialBinaryNoDom  thrpt    5  42.628 ± 2.549  ops/s
PathExtractorBenchmark.partialText         thrpt    5   2.461 ± 0.121  ops/s
PathExtractorBenchmark.partialTextNoDom    thrpt    5   2.368 ± 0.121  ops/s

Benchmark                                   Mode  Cnt   Score    Error  Units
PathExtractorBenchmark.fullBinary          thrpt    5   6.011 ±  0.107  ops/s
PathExtractorBenchmark.fullText            thrpt    5   1.214 ±  0.025  ops/s
PathExtractorBenchmark.partialBinary       thrpt    5  57.329 ± 13.585  ops/s
PathExtractorBenchmark.partialBinaryNoDom  thrpt    5  56.598 ±  2.424  ops/s
PathExtractorBenchmark.partialText         thrpt    5   2.430 ±  0.073  ops/s
PathExtractorBenchmark.partialTextNoDom    thrpt    5   2.416 ±  0.175  ops/s

I will update the README to reflect the new perf and usage guidance once I'm
sure there won't be major changes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

The existing implementation loops through every potential search
path at each step of matching. And therefore has exponential time
complexity as the number of extracted fields grows.

This change adds a more efficient PathExtractor implementation that
supports a narrower set of SearchPaths and combinations thereof. But
avoids the exponential time cost.

In the benchmark testing on my machine, I found it to be significantly
faster for the "full" extraction and slightly faster for the "partial"
extraction. With the throughput score from the benchmarking about
6.5:3.8 and 48:42. I also found it to be significantly faster (roughly 3:5)
for extracting most of the fields out of binary Ion dataset of about
20k records, each with about 15 distinct elements, some nested.

It is built with a key assumption that does not hold for the existing
implementation: that users may not register paths that would result
in a single element matching more than one terminal matcher.

I searched for Amazon internal usages and believe that the supported
usage patterns cover any and all usage I could find. I found no usage
of Annotations not on the Top-Level-Values themselves.
Applied some minor tweaks from profiling:
* Use Lazy String.format in invariant checks in hot path
* Use primitives where possible to avoid auto-boxing
Made tests work for both Fsm and Legacy Impls.
Added some test cases around case-(in)sensitivity.
Minor error message formatting tweaks.
Removed incomplete and unused toStrings() from PathComponents.
Made hasAnnotations method name consistent.
Added trailing newlines to files that lacked them.
Micro-optimization with statistically significant performance improvement.
@rmarrowstone rmarrowstone marked this pull request as ready for review September 12, 2024 21:01
@rmarrowstone rmarrowstone requested a review from tgregg September 12, 2024 21:01
@@ -241,30 +255,12 @@ class PathExtractorTest {
assertEquals(ION.singleValue("[1]"), out)
}

@ParameterizedTest
@EnumSource(API::class)
fun caseInsensitive(api: API) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This moved into the test-cases.ion with some other case insensitive tests.

@@ -28,27 +28,57 @@ public class Preconditions {

/**
* Validates argument, fails if condition is not met.
* Prefer variable arity overload instead of concatenating Strings at call-site!!!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand what this meant at first, but I get it now. I think it would be more clear if you just {@link ...} the method that should be preferred.


package com.amazon.ionpathextraction;

public class UnsupportedPathExpression extends RuntimeException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A brief comment about when this should be thrown would be appreciated.

* The "normalized" path treats this as an explicit Wildcard step and adds it to the head
* of the PathComponents.
*/
List<PathComponent> getNormalizedPath() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the practical reason why this is needed by the FsmPathExtractor but not the PathExtractorImpl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PathExtractorImpl uses the SearchPath to hold and manage the Annotations matching for TopLevelValues. The FsmPathExtractor treats matching Top-Level-Values, annotations or not, as just another PathComponent; it is a Wildcard or Annotations match, the same as any other nested step. I think it's a simplifier, but it does take a small shift in thinking.

Copy link
Contributor Author

@rmarrowstone rmarrowstone Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I realize that's not a reason why it's needed. It just makes the code cleaner. Looking at it, I can just manage this in the FsmMatcherBuilder and avoid exposing more to this internal but public API... and avoid exposing the Wildcard constructor too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I did try to avoid this, but I practically needed to expose the Annotions and PathComponents from the SearchPath anyway. And given my initial approach was cleaner it seemed preferable.

Which brings me to the implicit insight in my initial reply: in the PathExtractorImpl the SearchPaths do the filtering. In this implementation, there are a DTO to represent the result of parsing the users' expressions. I needed to expose some more data to do that and this just seemed like a pragmatic choice. 🤷

@rmarrowstone rmarrowstone merged commit 84ef782 into master Sep 18, 2024
4 checks passed
@tgregg tgregg deleted the fsm-matching branch September 23, 2024 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants