-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overhaul the text parsers, port from nom
to winnow
#892
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #892 +/- ##
==========================================
- Coverage 77.63% 77.53% -0.10%
==========================================
Files 136 136
Lines 35094 34298 -796
Branches 35094 34298 -796
==========================================
- Hits 27244 26592 -652
+ Misses 5793 5728 -65
+ Partials 2057 1978 -79 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ PR Tour 🧭
@@ -57,7 +57,7 @@ compact_str = "0.8.0" | |||
chrono = { version = "0.4", default-features = false, features = ["clock", "std", "wasmbind"] } | |||
delegate = "0.12.0" | |||
thiserror = "1.0" | |||
nom = "7.1.1" | |||
winnow = { version = "0.6", features = ["simd"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🪧 The simd
feature enables the memchr
operation when scanning for an expected token.
@@ -47,9 +47,8 @@ fn maximally_compact_1_1_data(num_values: usize) -> TestData_1_1 { | |||
|
|||
let text_1_1_data = r#"(:event 1670446800245 418 "6" "1" "abc123" (:: "region 4" "2022-12-07T20:59:59.744000Z"))"#.repeat(num_values); | |||
|
|||
let mut binary_1_1_data = vec![0xE0u8, 0x01, 0x01, 0xEA]; // IVM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🪧 This benchmark is really showing its age. When it was written, there was no support for reading encoding directives, so the tests/benchmarks manually compiled and registered their own templates. Now that the readers manage their encoding context as expected, reading a leading IVM clears the manually registered templates.
When our managed writer API is fleshed out, we'll have a way to hand a macro to the writer so it gets serialized in the data stream. For now, we simply skip the IVM in binary 1.1.
let mut reader = LazyRawBinaryReader_1_1::new(binary_1_1_data); | ||
let mut reader = LazyRawBinaryReader_1_1::new(context_ref, binary_1_1_data); | ||
let mut num_top_level_values: usize = 0; | ||
// Skip past the IVM | ||
reader.next(context_ref).unwrap().expect_ivm().unwrap(); | ||
reader.next().unwrap().expect_ivm().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🪧 The raw readers now take a reference to the encoding context at construction time instead of having it be passed into each call to next()
.
Taking them as an argument to next
was intended to allow a raw reader to exist as long as needed, with the context being provided any time they read. In practice, however, the raw readers only exist long enough to read a single top-level value from the stream, and requiring a context ref at every turn gets pretty tedious.
@@ -45,28 +42,24 @@ use crate::lazy::raw_stream_item::LazyRawStreamItem; | |||
use crate::lazy::raw_value_ref::RawValueRef; | |||
use crate::lazy::span::Span; | |||
use crate::lazy::streaming_raw_reader::RawReaderState; | |||
use crate::lazy::text::raw::r#struct::{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🪧 Many of the raw-level, container-related types are now generic over their TextEncoding
, allowing them to work with both 1.0 and 1.1. The changes in this file reflect that update.
} | ||
|
||
fn resume_at_offset(data: &'data [u8], offset: usize, mut encoding_hint: IonEncoding) -> Self { | ||
fn resume(context: EncodingContextRef<'data>, mut saved_state: RawReaderState<'data>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🪧 The RawReaderState
type already existed but wasn't used in this method (which predated it). Replacing the individual arguments with one type made it easy to add a field to the state in a centralized place.
pub fn match_argument_for( | ||
self, | ||
&mut self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🪧 The parsing methods now use &mut self
, allowing them to avoid defining new variables at each step of parsing (unless that's what you want).
|
||
/// Matches a parser that must be followed by input that matches `terminator`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🪧 More incompleteness detection special casing.
|
||
// === nom trait implementations === |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🪧 Here we're switching over nom
trait implementations to winnow
trait implementations.
], | ||
expect_incomplete: [ | ||
"0x", // Base 16 prefix w/no number | ||
"0b", // Base 2 prefix w/no number | ||
] | ||
], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🪧 Because these unit tests are all reading from fixed slices, the parser will never return Incomplete
. Those inputs have been moved to the expect_mismatch
sections. We have a separate test suite just for incompleteness detection anyway.
impl<'data> LazyRawSequence<'data, TextEncoding_1_0> for LazyRawTextSExp_1_0<'data> { | ||
type Iterator = RawTextSExpIterator_1_0<'data>; | ||
impl<'data, E: TextEncoding<'data>> LazyRawSequence<'data, E> for RawTextSExp<'data, E> { | ||
type Iterator = RawTextSequenceCacheIterator<'data, E>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🪧 All of the text containers now cache their child expressions and iterate over the cache as needed.
// These tests are all failing because multipart long strings are not handled correctly when the | ||
// "part" boundary happens to also fall on a point where the reader needs to refill the input buffer. | ||
const INCOMPLETE_LONG_STRING_SKIP_LIST: SkipList = &[ | ||
"ion-tests/iontestdata/good/equivs/localSymbolTableAppend.ion", | ||
"ion-tests/iontestdata/good/equivs/localSymbolTableNullSlots.ion", | ||
"ion-tests/iontestdata/good/equivs/longStringsWithComments.ion", | ||
"ion-tests/iontestdata/good/equivs/strings.ion", | ||
"ion-tests/iontestdata/good/lists.ion", | ||
"ion-tests/iontestdata/good/strings.ion", | ||
"ion-tests/iontestdata/good/stringsWithWhitespace.ion", | ||
"ion-tests/iontestdata/good/strings_cr_nl.ion", | ||
"ion-tests/iontestdata/good/strings2.ion", | ||
"ion-tests/iontestdata/good/structs.ion", | ||
"ion-tests/iontestdata/good/strings_nl.ion", | ||
]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
nom
to winnow
This PR migrates the text parser from
nom
to an actively developed fork calledwinnow
. You can read about the relationship between the two here.The migration offers a number of benefits:
winnow
includes adebug
feature that lets you see the reader's path through the branches of the parser, making it MUCH easier to find the root of parser misbehavior.nom
offers two flavors of each parser,complete
andstreaming
,winnow
allows an input source to report whether it is partial or complete. This means that there is a single flavor of each parser that will do the correct thing when it runs out of data. This allowed me to remove a LOT of special cases.Parser
trait takes&mut self
and modifies it in-place rather than takingself
and returning an updated copy along with the expected output value. (Discussion here.) Reducing the return value of every method from(TextBuffer<'_>, T)
to justT
increases the odds that the value will be returned in a register. Happily, becauseTextBuffer
isCopy
, we can still make as many intermediate state copies as we'd like when it's called for.alt((...))
combinator tries each of the provided parsers in turn and takes the first match. This is often quite slow--winnow
provides adispatch!
macro that allows you to prune the tree of options up-front bymatch
ing on the head of the stream.tag("foo")
/literal("foo")
can now be expressed as just"foo"
. Similarly, tuples of parsers are now themselves parsers, so you no longer need to writetuple(("/*", multiline_body_comment, "*/"))
. You can just write("/*", multiline_body_comment, "*/")
.I also made several improvements that did not require
winnow
per se:E: TextEncoding
, eliminating a large amount of mostly duplicated code.Hopefully this makes it much easier to both read and maintain.
Performance improvements
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Testing improvements
Because nearly all of the tests read some amount of Ion text, this patch lead to a huge drop in the time needed to run the test harness.
Command
This command runs everything but the doc tests:
Before
After