-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a more efficient PathExtractor Implementation #28
Conversation
The existing implementation loops through every potential search path at each step of matching. And therefore has exponential time complexity as the number of extracted fields grows. This change adds a more efficient PathExtractor implementation that supports a narrower set of SearchPaths and combinations thereof. But avoids the exponential time cost. In the benchmark testing on my machine, I found it to be significantly faster for the "full" extraction and slightly faster for the "partial" extraction. With the throughput score from the benchmarking about 6.5:3.8 and 48:42. I also found it to be significantly faster (roughly 3:5) for extracting most of the fields out of binary Ion dataset of about 20k records, each with about 15 distinct elements, some nested. It is built with a key assumption that does not hold for the existing implementation: that users may not register paths that would result in a single element matching more than one terminal matcher. I searched for Amazon internal usages and believe that the supported usage patterns cover any and all usage I could find. I found no usage of Annotations not on the Top-Level-Values themselves.
Applied some minor tweaks from profiling: * Use Lazy String.format in invariant checks in hot path * Use primitives where possible to avoid auto-boxing Made tests work for both Fsm and Legacy Impls. Added some test cases around case-(in)sensitivity. Minor error message formatting tweaks.
Removed incomplete and unused toStrings() from PathComponents. Made hasAnnotations method name consistent. Added trailing newlines to files that lacked them.
Micro-optimization with statistically significant performance improvement.
@@ -241,30 +255,12 @@ class PathExtractorTest { | |||
assertEquals(ION.singleValue("[1]"), out) | |||
} | |||
|
|||
@ParameterizedTest | |||
@EnumSource(API::class) | |||
fun caseInsensitive(api: API) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This moved into the test-cases.ion
with some other case insensitive tests.
@@ -28,27 +28,57 @@ public class Preconditions { | |||
|
|||
/** | |||
* Validates argument, fails if condition is not met. | |||
* Prefer variable arity overload instead of concatenating Strings at call-site!!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand what this meant at first, but I get it now. I think it would be more clear if you just {@link ...}
the method that should be preferred.
|
||
package com.amazon.ionpathextraction; | ||
|
||
public class UnsupportedPathExpression extends RuntimeException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A brief comment about when this should be thrown would be appreciated.
src/main/java/com/amazon/ionpathextraction/PathExtractorBuilder.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/ionpathextraction/FsmMatcherBuilder.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/ionpathextraction/FsmMatcherBuilder.java
Outdated
Show resolved
Hide resolved
* The "normalized" path treats this as an explicit Wildcard step and adds it to the head | ||
* of the PathComponents. | ||
*/ | ||
List<PathComponent> getNormalizedPath() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the practical reason why this is needed by the FsmPathExtractor but not the PathExtractorImpl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PathExtractorImpl
uses the SearchPath
to hold and manage the Annotations matching for TopLevelValues. The FsmPathExtractor
treats matching Top-Level-Values, annotations or not, as just another PathComponent
; it is a Wildcard
or Annotations
match, the same as any other nested step. I think it's a simplifier, but it does take a small shift in thinking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I realize that's not a reason why it's needed. It just makes the code cleaner. Looking at it, I can just manage this in the FsmMatcherBuilder
and avoid exposing more to this internal but public API... and avoid exposing the Wildcard
constructor too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well I did try to avoid this, but I practically needed to expose the Annotions
and PathComponents
from the SearchPath
anyway. And given my initial approach was cleaner it seemed preferable.
Which brings me to the implicit insight in my initial reply: in the PathExtractorImpl
the SearchPath
s do the filtering. In this implementation, there are a DTO to represent the result of parsing the users' expressions. I needed to expose some more data to do that and this just seemed like a pragmatic choice. 🤷
Co-authored-by: Tyler Gregg <[email protected]>
Co-authored-by: Tyler Gregg <[email protected]>
The existing implementation loops through every potential search
path at each step of matching. It therefore exhibits exponential time
complexity as the number of extracted fields grows. Some of the
Amazon usage involves extractions of thousands of attributes and
users needed to craft a workaround for the performance hit.
This change adds a more efficient PathExtractor implementation that
supports a narrower set of SearchPaths and combinations thereof, but
avoids the exponential time cost. This is reflected in the benchmark
results and testing on my own toy dataset. The results show that
the relative performance of the new implementation improves as the
number of extracted attributes increases.
It is built with a key assumption that does not hold for the existing
implementation: that users may not register paths that would result
in a single element matching more than one terminal matcher.
I searched for Amazon internal usages and believe that the supported
usage patterns cover any and all usage I could find. I found no usage
of Annotations not on the Top-Level-Values themselves.
Note: I changed the "dom" benchmarks to iterate over Top-Level-Values.
This makes them faster and a fairer comparison target.
I will update the README to reflect the new perf and usage guidance once I'm
sure there won't be major changes.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.