Overhaul the text parsers, port from nom to winnow #892

Open · wants to merge 28 commits into main

Conversation

@zslayton (Contributor) commented Jan 3, 2025

This PR migrates the text parser from nom to an actively developed fork called winnow. You can read about the relationship between the two here.

The migration offers a number of benefits:

  • winnow includes a debug feature that lets you see the reader's path through the branches of the parser, making it MUCH easier to find the root of parser misbehavior.
  • Where nom offers two flavors of each parser, complete and streaming, winnow allows an input source to report whether it is partial or complete. This means that there is a single flavor of each parser that will do the correct thing when it runs out of data. This allowed me to remove a LOT of special cases.
  • The Parser trait takes &mut self and modifies it in-place rather than taking self and returning an updated copy along with the expected output value. (Discussion here.) Reducing the return value of every method from (TextBuffer<'_>, T) to just T increases the odds that the value will be returned in a register. Happily, because TextBuffer is Copy, we can still make as many intermediate state copies as we'd like when it's called for.
  • The alt((...)) combinator tries each of the provided parsers in turn and takes the first match, which is often quite slow. winnow provides a dispatch! macro that lets you prune the tree of options up front by matching on the head of the stream (see the sketch after this list).
  • Some simple types are now considered parsers themselves, which makes the parser methods that use them easier to read. For example, tag("foo")/literal("foo") can now be expressed as just "foo". Similarly, tuples of parsers are now themselves parsers, so you no longer need to write tuple(("/*", multiline_body_comment, "*/")). You can just write ("/*", multiline_body_comment, "*/").
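
To make the last two bullets concrete, here is a minimal, self-contained sketch in the style winnow 0.6 encourages. The parsers (`block_comment`, `line_comment`, `comment`) are hypothetical examples written for this description, not ion-rust's actual code:

```rust
use winnow::combinator::{dispatch, fail, peek};
use winnow::prelude::*;
use winnow::token::{any, take_until};

// String literals and tuples of parsers are parsers themselves: no tag(...)
// or tuple(...) wrappers are needed.
fn block_comment<'s>(input: &mut &'s str) -> PResult<&'s str> {
    ("/*", take_until(0.., "*/"), "*/")
        .map(|(_, body, _)| body)
        .parse_next(input)
}

fn line_comment<'s>(input: &mut &'s str) -> PResult<&'s str> {
    ("//", take_until(0.., "\n"))
        .map(|(_, body)| body)
        .parse_next(input)
}

// `dispatch!` peeks at the head of the stream and jumps straight to the
// matching branch instead of trying every alternative like `alt((...))`.
fn comment<'s>(input: &mut &'s str) -> PResult<&'s str> {
    dispatch! {peek((any, any));
        ('/', '*') => block_comment,
        ('/', '/') => line_comment,
        _ => fail::<_, &str, _>,
    }
    .parse_next(input)
}

fn main() {
    let mut input = "/* hello */ 5";
    assert_eq!(comment(&mut input).unwrap(), " hello ");
    assert_eq!(input, " 5");
}
```

With `alt`, both comment flavors would be attempted in order; `dispatch!` selects the right branch after a single peek.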

I also made several improvements that did not require winnow per se:

  • Made several encoding-version-specific methods and types generic over E: TextEncoding, eliminating a large amount of mostly duplicated code.
  • Made a few parsers perform a cheap, easily inlined up-front check before calling the real (larger, less inlinable) implementation; see the sketch after this list.
  • Modified the 1.0 container parsers to cache the sub-expressions they encounter during lexing using the bump allocator, offering a big speedup. (This optimization had already been applied to the 1.1 container parsers.)
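
As a rough illustration of the second bullet (the function names below are hypothetical, not the crate's real parsers), the cheap check lives in a tiny `#[inline]` wrapper while the heavyweight parsing stays out of line:

```rust
use winnow::combinator::fail;
use winnow::prelude::*;
use winnow::token::take_until;

/// Cheap, inlineable rejection of the common "this is not a comment" case.
#[inline]
fn match_block_comment<'s>(input: &mut &'s str) -> PResult<&'s str> {
    if !input.starts_with("/*") {
        // Bail out immediately without touching the heavier machinery below.
        return fail.parse_next(input);
    }
    match_block_comment_full(input)
}

/// The full parser; larger, so we keep it out of the caller's instruction stream.
#[inline(never)]
fn match_block_comment_full<'s>(input: &mut &'s str) -> PResult<&'s str> {
    ("/*", take_until(0.., "*/"), "*/")
        .map(|(_, body, _)| body)
        .parse_next(input)
}

fn main() {
    let mut not_a_comment = "1234";
    assert!(match_block_comment(&mut not_a_comment).is_err());

    let mut comment = "/* fast path declined, full parser ran */";
    assert!(match_block_comment(&mut comment).is_ok());
}
```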

Hopefully this makes the parser much easier to both read and maintain.


Performance improvements

[screenshot: benchmark comparison]

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


Testing improvements

Because nearly all of the tests read some amount of Ion text, this patch led to a huge drop in the time needed to run the test harness.

Command

This command runs everything but the doc tests:

time cargo test --all-features --lib --tests

Before

[screenshot: test run time before]

After

[screenshot: test run time after]

codecov bot commented Jan 3, 2025

Codecov Report

Attention: Patch coverage is 80.72838% with 127 lines in your changes missing coverage. Please review.

Project coverage is 77.53%. Comparing base (46cc6b2) to head (a9583c5).

Files with missing lines Patch % Lines
src/lazy/any_encoding.rs 51.04% 22 Missing and 25 partials ⚠️
src/lazy/binary/raw/v1_1/reader.rs 48.83% 10 Missing and 34 partials ⚠️
src/lazy/text/raw/reader.rs 87.71% 3 Missing and 4 partials ⚠️
src/lazy/binary/raw/reader.rs 86.66% 6 Missing ⚠️
src/lazy/encoder/write_as_ion.rs 0.00% 4 Missing ⚠️
src/lazy/text/raw/sequence.rs 92.85% 3 Missing and 1 partial ⚠️
src/lazy/text/raw/v1_1/reader.rs 92.30% 1 Missing and 3 partials ⚠️
src/lazy/text/parse_result.rs 85.00% 3 Missing ⚠️
src/lazy/encoder/text/v1_1/writer.rs 50.00% 0 Missing and 2 partials ⚠️
src/lazy/text/raw/struct.rs 92.00% 0 Missing and 2 partials ⚠️
... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #892      +/-   ##
==========================================
- Coverage   77.63%   77.53%   -0.10%     
==========================================
  Files         136      136              
  Lines       35094    34298     -796     
  Branches    35094    34298     -796     
==========================================
- Hits        27244    26592     -652     
+ Misses       5793     5728      -65     
+ Partials     2057     1978      -79     


@zslayton left a comment

🗺️ PR Tour 🧭

@@ -57,7 +57,7 @@ compact_str = "0.8.0"
 chrono = { version = "0.4", default-features = false, features = ["clock", "std", "wasmbind"] }
 delegate = "0.12.0"
 thiserror = "1.0"
-nom = "7.1.1"
+winnow = { version = "0.6", features = ["simd"] }

🪧 The simd feature enables the memchr operation when scanning for an expected token.

@@ -47,9 +47,8 @@ fn maximally_compact_1_1_data(num_values: usize) -> TestData_1_1 {

let text_1_1_data = r#"(:event 1670446800245 418 "6" "1" "abc123" (:: "region 4" "2022-12-07T20:59:59.744000Z"))"#.repeat(num_values);

let mut binary_1_1_data = vec![0xE0u8, 0x01, 0x01, 0xEA]; // IVM

🪧 This benchmark is really showing its age. When it was written, there was no support for reading encoding directives, so the tests/benchmarks manually compiled and registered their own templates. Now that the readers manage their encoding context as expected, reading a leading IVM clears the manually registered templates.

When our managed writer API is fleshed out, we'll have a way to hand a macro to the writer so it gets serialized in the data stream. For now, we simply skip the IVM in binary 1.1.

Comment on lines -447 to +444
-let mut reader = LazyRawBinaryReader_1_1::new(binary_1_1_data);
+let mut reader = LazyRawBinaryReader_1_1::new(context_ref, binary_1_1_data);
 let mut num_top_level_values: usize = 0;
 // Skip past the IVM
-reader.next(context_ref).unwrap().expect_ivm().unwrap();
+reader.next().unwrap().expect_ivm().unwrap();

🪧 The raw readers now take a reference to the encoding context at construction time instead of having it be passed into each call to next().

Taking them as an argument to next was intended to allow a raw reader to exist as long as needed, with the context being provided any time they read. In practice, however, the raw readers only exist long enough to read a single top-level value from the stream, and requiring a context ref at every turn gets pretty tedious.

@@ -45,28 +42,24 @@ use crate::lazy::raw_stream_item::LazyRawStreamItem;
use crate::lazy::raw_value_ref::RawValueRef;
use crate::lazy::span::Span;
use crate::lazy::streaming_raw_reader::RawReaderState;
use crate::lazy::text::raw::r#struct::{

🪧 Many of the raw-level, container-related types are now generic over their TextEncoding, allowing them to work with both 1.0 and 1.1. The changes in this file reflect that update.

}

-fn resume_at_offset(data: &'data [u8], offset: usize, mut encoding_hint: IonEncoding) -> Self {
+fn resume(context: EncodingContextRef<'data>, mut saved_state: RawReaderState<'data>) -> Self {

🪧 The RawReaderState type already existed but wasn't used in this method (which predated it). Replacing the individual arguments with one type made it easy to add a field to the state in a centralized place.
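
As a loose sketch of the idea only (the field and method names below are illustrative and may not match the crate's actual `RawReaderState`), bundling the resume arguments into one type means a future field only has to be added in one place:

```rust
/// Stand-in for the crate's `IonEncoding` enum; illustrative only.
#[allow(non_camel_case_types, dead_code)]
#[derive(Clone, Copy, Debug)]
enum IonEncoding {
    Text_1_0,
    Text_1_1,
}

/// Hypothetical shape of a bundled reader state. Adding a field later (say,
/// an `is_final_data` flag) touches this struct and its constructor rather
/// than every `resume`-style signature in the crate.
#[allow(dead_code)]
struct RawReaderState<'data> {
    data: &'data [u8],
    offset: usize,
    encoding: IonEncoding,
}

impl<'data> RawReaderState<'data> {
    fn new(data: &'data [u8], offset: usize, encoding: IonEncoding) -> Self {
        Self { data, offset, encoding }
    }
}

fn main() {
    // One value travels through the API instead of three loose arguments.
    let _state = RawReaderState::new(b"1 2 3", 0, IonEncoding::Text_1_0);
}
```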

Comment on lines 780 to +781
 pub fn match_argument_for(
-    self,
+    &mut self,

🪧 The parsing methods now use &mut self, allowing them to avoid defining new variables at each step of parsing (unless that's what you want).
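
A contrived illustration of the calling convention (the `Buffer` type below is a hypothetical stand-in, not the crate's `TextBuffer`): each method advances the buffer in place and returns only the matched value, and because the type is `Copy`, a caller can still snapshot it when speculative matching calls for it:

```rust
#[derive(Clone, Copy)]
struct Buffer<'a> {
    remaining: &'a str,
}

impl<'a> Buffer<'a> {
    /// Consumes leading whitespace in place.
    fn match_whitespace(&mut self) {
        self.remaining = self.remaining.trim_start();
    }

    /// Consumes a run of ASCII digits in place and returns just the match.
    fn match_digits(&mut self) -> &'a str {
        let end = self
            .remaining
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(self.remaining.len());
        let (digits, rest) = self.remaining.split_at(end);
        self.remaining = rest;
        digits
    }
}

fn main() {
    let mut buffer = Buffer { remaining: "   1234 rest" };
    let checkpoint = buffer;            // cheap copy, usable for backtracking
    buffer.match_whitespace();          // advances in place; no new bindings
    let digits = buffer.match_digits(); // returns just the value
    assert_eq!(digits, "1234");
    assert_eq!(checkpoint.remaining, "   1234 rest");
}
```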


/// Matches a parser that must be followed by input that matches `terminator`.

🪧 More incompleteness detection special casing.


// === nom trait implementations ===

🪧 Here we're switching over nom trait implementations to winnow trait implementations.
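
The PR implements these traits for the crate's own `TextBuffer`; as a rough illustration of what the partial/complete machinery buys (shown here with winnow's built-in `Partial` wrapper rather than ion-rust's custom input type), the same parser reports `Incomplete` on a partial stream instead of returning a hard mismatch:

```rust
use winnow::error::ErrMode;
use winnow::prelude::*;
use winnow::stream::Partial;
use winnow::token::take_until;

// Hypothetical parser, not ion-rust's: matches a block comment body.
fn comment_body<'s>(input: &mut Partial<&'s str>) -> PResult<&'s str> {
    ("/*", take_until(0.., "*/"), "*/")
        .map(|(_, body, _)| body)
        .parse_next(input)
}

fn main() {
    // The input says "more data may arrive", so running out of it yields
    // `ErrMode::Incomplete` rather than a mismatch error.
    let mut partial = Partial::new("/* not finished yet");
    assert!(matches!(comment_body(&mut partial), Err(ErrMode::Incomplete(_))));

    // The same parser succeeds once the terminator is present.
    let mut complete = Partial::new("/* done */");
    assert_eq!(comment_body(&mut complete).unwrap(), " done ");
}
```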

Comment on lines -3063 to +2389
],
expect_incomplete: [
"0x", // Base 16 prefix w/no number
"0b", // Base 2 prefix w/no number
]
],

🪧 Because these unit tests are all reading from fixed slices, the parser will never return Incomplete. Those inputs have been moved to the expect_mismatch sections. We have a separate test suite just for incompleteness detection anyway.

-impl<'data> LazyRawSequence<'data, TextEncoding_1_0> for LazyRawTextSExp_1_0<'data> {
-    type Iterator = RawTextSExpIterator_1_0<'data>;
+impl<'data, E: TextEncoding<'data>> LazyRawSequence<'data, E> for RawTextSExp<'data, E> {
+    type Iterator = RawTextSequenceCacheIterator<'data, E>;

🪧 All of the text containers now cache their child expressions and iterate over the cache as needed.
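
As a loose sketch of the caching strategy (hypothetical types; assumes the `bumpalo` crate with its `collections` feature as the bump allocator), child spans are recorded once during lexing and every subsequent traversal just walks the cache:

```rust
use bumpalo::collections::Vec as BumpVec;
use bumpalo::Bump;

/// Hypothetical cached container: child expression spans are recorded once
/// while lexing, then handed out as cheap iterators over the cache.
struct CachedSequence<'top> {
    child_spans: BumpVec<'top, (usize, usize)>, // (start, end) byte offsets
}

impl<'top> CachedSequence<'top> {
    /// Stand-in "lexer": treats whitespace-separated tokens as children.
    fn lex(allocator: &'top Bump, source: &str) -> Self {
        let mut child_spans = BumpVec::new_in(allocator);
        let mut offset = 0;
        for token in source.split_whitespace() {
            let start = offset + source[offset..].find(token).unwrap();
            child_spans.push((start, start + token.len()));
            offset = start + token.len();
        }
        Self { child_spans }
    }

    /// Iterating re-reads the cache; nothing is re-parsed.
    fn iter(&self) -> impl Iterator<Item = (usize, usize)> + '_ {
        self.child_spans.iter().copied()
    }
}

fn main() {
    let allocator = Bump::new();
    let sequence = CachedSequence::lex(&allocator, "1 2.5 foo");
    let spans: Vec<_> = sequence.iter().collect();
    assert_eq!(spans, vec![(0, 1), (2, 5), (6, 9)]);
}
```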

Comment on lines -16 to -30
// These tests are all failing because multipart long strings are not handled correctly when the
// "part" boundary happens to also fall on a point where the reader needs to refill the input buffer.
const INCOMPLETE_LONG_STRING_SKIP_LIST: SkipList = &[
"ion-tests/iontestdata/good/equivs/localSymbolTableAppend.ion",
"ion-tests/iontestdata/good/equivs/localSymbolTableNullSlots.ion",
"ion-tests/iontestdata/good/equivs/longStringsWithComments.ion",
"ion-tests/iontestdata/good/equivs/strings.ion",
"ion-tests/iontestdata/good/lists.ion",
"ion-tests/iontestdata/good/strings.ion",
"ion-tests/iontestdata/good/stringsWithWhitespace.ion",
"ion-tests/iontestdata/good/strings_cr_nl.ion",
"ion-tests/iontestdata/good/strings2.ion",
"ion-tests/iontestdata/good/structs.ion",
"ion-tests/iontestdata/good/strings_nl.ion",
];

🎉

@zslayton marked this pull request as ready for review January 7, 2025 21:10
@zslayton changed the title from "Winnow experiment" to "Overhaul the text parsers, port from nom to winnow" on Jan 7, 2025