Overhaul the text parsers, port from nom to winnow #892

Open · wants to merge 28 commits into main

Conversation

@zslayton (Contributor) commented Jan 3, 2025

This PR migrates the text parser from nom to an actively developed fork called winnow. You can read about the relationship between the two here.

The migration offers a number of benefits:

  • winnow includes a debug feature that lets you see the reader's path through the branches of the parser, making it MUCH easier to find the root of parser misbehavior.
  • Where nom offers two flavors of each parser, complete and streaming, winnow allows an input source to report whether it is partial or complete. This means that there is a single flavor of each parser that will do the correct thing when it runs out of data. This allowed me to remove a LOT of special cases.
  • The Parser trait takes &mut self and modifies it in-place rather than taking self and returning an updated copy along with the expected output value. (Discussion here.) Reducing the return value of every method from (TextBuffer<'_>, T) to just T increases the odds that the value will be returned in a register. Happily, because TextBuffer is Copy, we can still make as many intermediate state copies as we'd like when it's called for.
  • The alt((...)) combinator tries each of the provided parsers in turn and takes the first match, which is often quite slow. winnow provides a dispatch! macro that lets you prune the tree of options up front by matching on the head of the stream (see the sketch after this list).
  • Some simple types are now considered parsers themselves, which makes the parser methods that use them easier to read. For example, tag("foo")/literal("foo") can now be expressed as just "foo". Similarly, tuples of parsers are now themselves parsers, so you no longer need to write tuple(("/*", multiline_body_comment, "*/")). You can just write ("/*", multiline_body_comment, "*/").
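
To make the last two bullets concrete, here is a minimal, self-contained sketch in the style winnow 0.6 encourages. The parsers (`block_comment`, `line_comment`, `comment`) are hypothetical examples written for this description, not ion-rust's actual code:

```rust
use winnow::combinator::{dispatch, fail, peek};
use winnow::prelude::*;
use winnow::token::{any, take_until};

// String literals and tuples of parsers are parsers themselves: no tag(...)
// or tuple(...) wrappers are needed.
fn block_comment<'s>(input: &mut &'s str) -> PResult<&'s str> {
    ("/*", take_until(0.., "*/"), "*/")
        .map(|(_, body, _)| body)
        .parse_next(input)
}

fn line_comment<'s>(input: &mut &'s str) -> PResult<&'s str> {
    ("//", take_until(0.., "\n"))
        .map(|(_, body)| body)
        .parse_next(input)
}

// `dispatch!` peeks at the head of the stream and jumps straight to the
// matching branch instead of trying every alternative like `alt((...))`.
fn comment<'s>(input: &mut &'s str) -> PResult<&'s str> {
    dispatch! {peek((any, any));
        ('/', '*') => block_comment,
        ('/', '/') => line_comment,
        _ => fail::<_, &str, _>,
    }
    .parse_next(input)
}

fn main() {
    let mut input = "/* hello */ 5";
    assert_eq!(comment(&mut input).unwrap(), " hello ");
    assert_eq!(input, " 5");
}
```

With `alt`, both comment flavors would be attempted in order; `dispatch!` selects the right branch after a single peek.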

I also made several improvements that did not require winnow per se:

  • Made several encoding-version-specific methods and types generic over E: TextEncoding, eliminating a large amount of mostly duplicated code.
  • Made a few parsers perform a cheap, easily inlined up-front check before calling the real (larger, less inlinable) implementation; see the sketch after this list.
  • Modified the 1.0 container parsers to cache the sub-expressions they encounter during lexing using the bump allocator, offering a big speedup. (This optimization had already been applied to the 1.1 container parsers.)
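
As a rough illustration of the second bullet (the function names below are hypothetical, not the crate's real parsers), the cheap check lives in a tiny `#[inline]` wrapper while the heavyweight parsing stays out of line:

```rust
use winnow::combinator::fail;
use winnow::prelude::*;
use winnow::token::take_until;

/// Cheap, inlineable rejection of the common "this is not a comment" case.
#[inline]
fn match_block_comment<'s>(input: &mut &'s str) -> PResult<&'s str> {
    if !input.starts_with("/*") {
        // Bail out immediately without touching the heavier machinery below.
        return fail.parse_next(input);
    }
    match_block_comment_full(input)
}

/// The full parser; larger, so we keep it out of the caller's instruction stream.
#[inline(never)]
fn match_block_comment_full<'s>(input: &mut &'s str) -> PResult<&'s str> {
    ("/*", take_until(0.., "*/"), "*/")
        .map(|(_, body, _)| body)
        .parse_next(input)
}

fn main() {
    let mut not_a_comment = "1234";
    assert!(match_block_comment(&mut not_a_comment).is_err());

    let mut comment = "/* fast path declined, full parser ran */";
    assert!(match_block_comment(&mut comment).is_ok());
}
```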

Hopefully this makes the parser much easier to both read and maintain.


Performance improvements

[screenshot: benchmark comparison]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


Testing improvements

Because nearly all of the tests read some amount of Ion text, this patch led to a huge drop in the time needed to run the test harness.

Command

This command runs everything but the doc tests:

time cargo test --all-features --lib --tests

Before

[screenshot: test run time before]

After

[screenshot: test run time after]

codecov bot commented Jan 3, 2025

Codecov Report

Attention: Patch coverage is 80.72838% with 127 lines in your changes missing coverage. Please review.

Project coverage is 77.53%. Comparing base (46cc6b2) to head (a9583c5).

Files with missing lines Patch % Lines
src/lazy/any_encoding.rs 51.04% 22 Missing and 25 partials ⚠️
src/lazy/binary/raw/v1_1/reader.rs 48.83% 10 Missing and 34 partials ⚠️
src/lazy/text/raw/reader.rs 87.71% 3 Missing and 4 partials ⚠️
src/lazy/binary/raw/reader.rs 86.66% 6 Missing ⚠️
src/lazy/encoder/write_as_ion.rs 0.00% 4 Missing ⚠️
src/lazy/text/raw/sequence.rs 92.85% 3 Missing and 1 partial ⚠️
src/lazy/text/raw/v1_1/reader.rs 92.30% 1 Missing and 3 partials ⚠️
src/lazy/text/parse_result.rs 85.00% 3 Missing ⚠️
src/lazy/encoder/text/v1_1/writer.rs 50.00% 0 Missing and 2 partials ⚠️
src/lazy/text/raw/struct.rs 92.00% 0 Missing and 2 partials ⚠️
... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #892      +/-   ##
==========================================
- Coverage   77.63%   77.53%   -0.10%     
==========================================
  Files         136      136              
  Lines       35094    34298     -796     
  Branches    35094    34298     -796     
==========================================
- Hits        27244    26592     -652     
+ Misses       5793     5728      -65     
+ Partials     2057     1978      -79     


@zslayton left a comment

🗺️ PR Tour 🧭

@@ -57,7 +57,7 @@ compact_str = "0.8.0"
 chrono = { version = "0.4", default-features = false, features = ["clock", "std", "wasmbind"] }
 delegate = "0.12.0"
 thiserror = "1.0"
-nom = "7.1.1"
+winnow = { version = "0.6", features = ["simd"] }

🪧 The simd feature enables the memchr operation when scanning for an expected token.

@@ -47,9 +47,8 @@ fn maximally_compact_1_1_data(num_values: usize) -> TestData_1_1 {

let text_1_1_data = r#"(:event 1670446800245 418 "6" "1" "abc123" (:: "region 4" "2022-12-07T20:59:59.744000Z"))"#.repeat(num_values);

let mut binary_1_1_data = vec![0xE0u8, 0x01, 0x01, 0xEA]; // IVM

🪧 This benchmark is really showing its age. When it was written, there was no support for reading encoding directives, so the tests/benchmarks manually compiled and registered their own templates. Now that the readers manage their encoding context as expected, reading a leading IVM clears the manually registered templates.

When our managed writer API is fleshed out, we'll have a way to hand a macro to the writer so it gets serialized in the data stream. For now, we simply skip the IVM in binary 1.1.

Comment on lines -447 to +444
-let mut reader = LazyRawBinaryReader_1_1::new(binary_1_1_data);
+let mut reader = LazyRawBinaryReader_1_1::new(context_ref, binary_1_1_data);
 let mut num_top_level_values: usize = 0;
 // Skip past the IVM
-reader.next(context_ref).unwrap().expect_ivm().unwrap();
+reader.next().unwrap().expect_ivm().unwrap();

🪧 The raw readers now take a reference to the encoding context at construction time instead of having it be passed into each call to next().

Taking them as an argument to next was intended to allow a raw reader to exist as long as needed, with the context being provided any time they read. In practice, however, the raw readers only exist long enough to read a single top-level value from the stream, and requiring a context ref at every turn gets pretty tedious.

@@ -45,28 +42,24 @@ use crate::lazy::raw_stream_item::LazyRawStreamItem;
use crate::lazy::raw_value_ref::RawValueRef;
use crate::lazy::span::Span;
use crate::lazy::streaming_raw_reader::RawReaderState;
use crate::lazy::text::raw::r#struct::{

🪧 Many of the raw-level, container-related types are now generic over their TextEncoding, allowing them to work with both 1.0 and 1.1. The changes in this file reflect that update.

}

-fn resume_at_offset(data: &'data [u8], offset: usize, mut encoding_hint: IonEncoding) -> Self {
+fn resume(context: EncodingContextRef<'data>, mut saved_state: RawReaderState<'data>) -> Self {

🪧 The RawReaderState type already existed but wasn't used in this method (which predated it). Replacing the individual arguments with one type made it easy to add a field to the state in a centralized place.
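
As a loose sketch of the idea only (the field and method names below are illustrative and may not match the crate's actual `RawReaderState`), bundling the resume arguments into one type means a future field only has to be added in one place:

```rust
/// Stand-in for the crate's `IonEncoding` enum; illustrative only.
#[allow(non_camel_case_types, dead_code)]
#[derive(Clone, Copy, Debug)]
enum IonEncoding {
    Text_1_0,
    Text_1_1,
}

/// Hypothetical shape of a bundled reader state. Adding a field later (say,
/// an `is_final_data` flag) touches this struct and its constructor rather
/// than every `resume`-style signature in the crate.
#[allow(dead_code)]
struct RawReaderState<'data> {
    data: &'data [u8],
    offset: usize,
    encoding: IonEncoding,
}

impl<'data> RawReaderState<'data> {
    fn new(data: &'data [u8], offset: usize, encoding: IonEncoding) -> Self {
        Self { data, offset, encoding }
    }
}

fn main() {
    // One value travels through the API instead of three loose arguments.
    let _state = RawReaderState::new(b"1 2 3", 0, IonEncoding::Text_1_0);
}
```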

Comment on lines 780 to +781
 pub fn match_argument_for(
-    self,
+    &mut self,

🪧 The parsing methods now use &mut self, allowing them to avoid defining new variables at each step of parsing (unless that's what you want).
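
A contrived illustration of the calling convention (the `Buffer` type below is a hypothetical stand-in, not the crate's `TextBuffer`): each method advances the buffer in place and returns only the matched value, and because the type is `Copy`, a caller can still snapshot it when speculative matching calls for it:

```rust
#[derive(Clone, Copy)]
struct Buffer<'a> {
    remaining: &'a str,
}

impl<'a> Buffer<'a> {
    /// Consumes leading whitespace in place.
    fn match_whitespace(&mut self) {
        self.remaining = self.remaining.trim_start();
    }

    /// Consumes a run of ASCII digits in place and returns just the match.
    fn match_digits(&mut self) -> &'a str {
        let end = self
            .remaining
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(self.remaining.len());
        let (digits, rest) = self.remaining.split_at(end);
        self.remaining = rest;
        digits
    }
}

fn main() {
    let mut buffer = Buffer { remaining: "   1234 rest" };
    let checkpoint = buffer;            // cheap copy, usable for backtracking
    buffer.match_whitespace();          // advances in place; no new bindings
    let digits = buffer.match_digits(); // returns just the value
    assert_eq!(digits, "1234");
    assert_eq!(checkpoint.remaining, "   1234 rest");
}
```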


/// Matches a parser that must be followed by input that matches `terminator`.

🪧 More incompleteness detection special casing.


// === nom trait implementations ===

🪧 Here we're switching over nom trait implementations to winnow trait implementations.
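
The PR implements these traits for the crate's own `TextBuffer`; as a rough illustration of what the partial/complete machinery buys (shown here with winnow's built-in `Partial` wrapper rather than ion-rust's custom input type), the same parser reports `Incomplete` on a partial stream instead of returning a hard mismatch:

```rust
use winnow::error::ErrMode;
use winnow::prelude::*;
use winnow::stream::Partial;
use winnow::token::take_until;

// Hypothetical parser, not ion-rust's: matches a block comment body.
fn comment_body<'s>(input: &mut Partial<&'s str>) -> PResult<&'s str> {
    ("/*", take_until(0.., "*/"), "*/")
        .map(|(_, body, _)| body)
        .parse_next(input)
}

fn main() {
    // The input says "more data may arrive", so running out of it yields
    // `ErrMode::Incomplete` rather than a mismatch error.
    let mut partial = Partial::new("/* not finished yet");
    assert!(matches!(comment_body(&mut partial), Err(ErrMode::Incomplete(_))));

    // The same parser succeeds once the terminator is present.
    let mut complete = Partial::new("/* done */");
    assert_eq!(comment_body(&mut complete).unwrap(), " done ");
}
```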

Comment on lines -3063 to +2389
],
expect_incomplete: [
"0x", // Base 16 prefix w/no number
"0b", // Base 2 prefix w/no number
]
],

🪧 Because these unit tests are all reading from fixed slices, the parser will never return Incomplete. Those inputs have been moved to the expect_mismatch sections. We have a separate test suite just for incompleteness detection anyway.

-impl<'data> LazyRawSequence<'data, TextEncoding_1_0> for LazyRawTextSExp_1_0<'data> {
-    type Iterator = RawTextSExpIterator_1_0<'data>;
+impl<'data, E: TextEncoding<'data>> LazyRawSequence<'data, E> for RawTextSExp<'data, E> {
+    type Iterator = RawTextSequenceCacheIterator<'data, E>;

🪧 All of the text containers now cache their child expressions and iterate over the cache as needed.
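
As a loose sketch of the caching strategy (hypothetical types; assumes the `bumpalo` crate with its `collections` feature as the bump allocator), child spans are recorded once during lexing and every subsequent traversal just walks the cache:

```rust
use bumpalo::collections::Vec as BumpVec;
use bumpalo::Bump;

/// Hypothetical cached container: child expression spans are recorded once
/// while lexing, then handed out as cheap iterators over the cache.
struct CachedSequence<'top> {
    child_spans: BumpVec<'top, (usize, usize)>, // (start, end) byte offsets
}

impl<'top> CachedSequence<'top> {
    /// Stand-in "lexer": treats whitespace-separated tokens as children.
    fn lex(allocator: &'top Bump, source: &str) -> Self {
        let mut child_spans = BumpVec::new_in(allocator);
        let mut offset = 0;
        for token in source.split_whitespace() {
            let start = offset + source[offset..].find(token).unwrap();
            child_spans.push((start, start + token.len()));
            offset = start + token.len();
        }
        Self { child_spans }
    }

    /// Iterating re-reads the cache; nothing is re-parsed.
    fn iter(&self) -> impl Iterator<Item = (usize, usize)> + '_ {
        self.child_spans.iter().copied()
    }
}

fn main() {
    let allocator = Bump::new();
    let sequence = CachedSequence::lex(&allocator, "1 2.5 foo");
    let spans: Vec<_> = sequence.iter().collect();
    assert_eq!(spans, vec![(0, 1), (2, 5), (6, 9)]);
}
```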

Comment on lines -16 to -30
// These tests are all failing because multipart long strings are not handled correctly when the
// "part" boundary happens to also fall on a point where the reader needs to refill the input buffer.
const INCOMPLETE_LONG_STRING_SKIP_LIST: SkipList = &[
"ion-tests/iontestdata/good/equivs/localSymbolTableAppend.ion",
"ion-tests/iontestdata/good/equivs/localSymbolTableNullSlots.ion",
"ion-tests/iontestdata/good/equivs/longStringsWithComments.ion",
"ion-tests/iontestdata/good/equivs/strings.ion",
"ion-tests/iontestdata/good/lists.ion",
"ion-tests/iontestdata/good/strings.ion",
"ion-tests/iontestdata/good/stringsWithWhitespace.ion",
"ion-tests/iontestdata/good/strings_cr_nl.ion",
"ion-tests/iontestdata/good/strings2.ion",
"ion-tests/iontestdata/good/structs.ion",
"ion-tests/iontestdata/good/strings_nl.ion",
];

🎉

@zslayton marked this pull request as ready for review January 7, 2025 21:10
@zslayton changed the title from "Winnow experiment" to "Overhaul the text parsers, port from nom to winnow" on Jan 7, 2025