Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul the text parsers, port from nom to winnow #892

Open
wants to merge 29 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 27 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
fd4e853
Makes Writer symbol table experimentally pub
zslayton Dec 26, 2024
158c3ae
Makes AnnotatableWriter experimentally pub
zslayton Dec 26, 2024
0e53e48
Clippy suggestions RE: variant name prefixes
zslayton Dec 26, 2024
4bd0ae4
Fix build when `experimental-ion-hash` is only feature enabled
zslayton Dec 26, 2024
7ea8b12
Migrate from nom to winnow 0.3
zslayton Dec 28, 2024
092032e
wip; migrated to v0.38.0
zslayton Dec 29, 2024
32d215f
Migrated to [email protected]
zslayton Jan 3, 2025
4a945c8
Remove cruft that winnow does not require
zslayton Jan 3, 2025
b2d65fd
update read_many_structs
zslayton Jan 3, 2025
33a7ef9
inlines some hot parsers
zslayton Jan 3, 2025
4dab692
Merge branch 'main' into winnow-experiment
zslayton Jan 3, 2025
58ce9cd
Removes special cases for incompleteness detection
zslayton Jan 5, 2025
3205514
`match_value` now uses `dispatch!` instead of `alt`
zslayton Jan 5, 2025
84a0a26
Added macro to define mappings from MatchedValue to LazyRawTextValue
zslayton Jan 6, 2025
41dabd0
`match_value_1_1` now uses `dispatch` instead of `alt`
zslayton Jan 6, 2025
1d1def2
clippy suggestions
zslayton Jan 6, 2025
7cd83c8
Adds text version-agnostic container parsers
zslayton Jan 6, 2025
8201e04
Removes lots of version-specific container parsing code
zslayton Jan 6, 2025
5b1b84e
fix for read_many_structs benchmark
zslayton Jan 6, 2025
aa27f61
Makes raw text lists generic over Ion version
zslayton Jan 7, 2025
6016bf3
Maker raw sexps generic over Ion version
zslayton Jan 7, 2025
46842da
Maker raw structs generic over Ion version
zslayton Jan 7, 2025
0b9aa80
cleanup
zslayton Jan 7, 2025
fba64e9
remove doc links to private types
zslayton Jan 7, 2025
ee4c5a7
More comments to `container_matcher`
zslayton Jan 7, 2025
c8a414d
Remove old println
zslayton Jan 7, 2025
9a6473d
Doc comment
zslayton Jan 7, 2025
a9583c5
Removes skip list for incompletness checking.
zslayton Jan 7, 2025
2ee8877
Removes unused import from `detect_incomplete_text` test
zslayton Jan 8, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@ compact_str = "0.8.0"
chrono = { version = "0.4", default-features = false, features = ["clock", "std", "wasmbind"] }
delegate = "0.12.0"
thiserror = "1.0"
nom = "7.1.1"
winnow = { version = "0.6", features = ["simd"] }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🪧 The simd feature enables the memchr operation when scanning for an expected token.

num-integer = "0.1.44"
num-traits = "0.2"
arrayvec = "0.7"
Expand Down
18 changes: 6 additions & 12 deletions benches/read_many_structs.rs
Original file line number Diff line number Diff line change
Expand Up @@ -47,9 +47,8 @@ fn maximally_compact_1_1_data(num_values: usize) -> TestData_1_1 {

let text_1_1_data = r#"(:event 1670446800245 418 "6" "1" "abc123" (:: "region 4" "2022-12-07T20:59:59.744000Z"))"#.repeat(num_values);

let mut binary_1_1_data = vec![0xE0u8, 0x01, 0x01, 0xEA]; // IVM
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🪧 This benchmark is really showing its age. When it was written, there was no support for reading encoding directives, so the tests/benchmarks manually compiled and registered their own templates. Now that the readers manage their encoding context as expected, reading a leading IVM clears the manually registered templates.

When our managed writer API is fleshed out, we'll have a way to hand a macro to the writer so it gets serialized in the data stream. For now, we simply skip the IVM in binary 1.1.

#[rustfmt::skip]
let mut binary_1_1_data_body: Vec<u8> = [MacroTable::FIRST_USER_MACRO_ID as u8, // Macro ID
let binary_1_1_data: Vec<u8> = [MacroTable::FIRST_USER_MACRO_ID as u8, // Macro ID
0b10, // [NOTE: `0b`] `parameters*` arg is an arg group
0x66, // 6-byte integer (`timestamp` param)
0x75, 0x5D, 0x63, 0xEE, 0x84, 0x01,
Expand All @@ -73,7 +72,6 @@ fn maximally_compact_1_1_data(num_values: usize) -> TestData_1_1 {
0x39, 0x3A, 0x35, 0x39,
0x2E, 0x37, 0x34, 0x34,
0x30, 0x30, 0x30, 0x5A].repeat(num_values);
binary_1_1_data.append(&mut binary_1_1_data_body);
TestData_1_1 {
name: "maximally compact".to_owned(),
template_definition_text,
Expand Down Expand Up @@ -107,9 +105,8 @@ fn moderately_compact_1_1_data(num_values: usize) -> TestData_1_1 {
"#;

let text_1_1_data = r#"(:event 1670446800245 418 "scheduler-thread-6" "example-client-1" "aws-us-east-5f-abc123" (:: "region 4" "2022-12-07T20:59:59.744000Z"))"#.repeat(num_values);
let mut binary_1_1_data = vec![0xE0u8, 0x01, 0x01, 0xEA]; // IVM
#[rustfmt::skip]
let mut binary_1_1_data_body: Vec<u8> = [MacroTable::FIRST_USER_MACRO_ID as u8, // Macro ID
let binary_1_1_data: Vec<u8> = [MacroTable::FIRST_USER_MACRO_ID as u8, // Macro ID
0b10, // [NOTE: `0b` prefix] `parameters*` arg is an arg group
0x66, // 6-byte integer (`timestamp` param)
0x75, 0x5D, 0x63, 0xEE, 0x84, 0x01,
Expand Down Expand Up @@ -142,7 +139,6 @@ fn moderately_compact_1_1_data(num_values: usize) -> TestData_1_1 {
0x2E, 0x37, 0x34, 0x34,
0x30, 0x30, 0x30, 0x5A].repeat(num_values);

binary_1_1_data.append(&mut binary_1_1_data_body);
TestData_1_1 {
name: "moderately compact".to_owned(),
template_definition_text: template_definition_text.to_owned(),
Expand Down Expand Up @@ -176,9 +172,8 @@ fn length_prefixed_moderately_compact_1_1_data(num_values: usize) -> TestData_1_
"#;

let text_1_1_data = r#"(:event 1670446800245 418 "scheduler-thread-6" "example-client-1" "aws-us-east-5f-abc123" (:: "region 4" "2022-12-07T20:59:59.744000Z"))"#.repeat(num_values);
let mut binary_1_1_data = vec![0xE0u8, 0x01, 0x01, 0xEA]; // IVM
#[rustfmt::skip]
let mut binary_1_1_data_body: Vec<u8> = [0xF5, // LP invocation
let binary_1_1_data: Vec<u8> = [0xF5, // LP invocation
((MacroTable::FIRST_USER_MACRO_ID * 2) + 1) as u8, // Macro ID
0xDF, // Length prefix: FlexUInt 111
0b10, // [NOTE: `0b` prefix] `parameters*` arg is an arg group
Expand Down Expand Up @@ -213,7 +208,6 @@ fn length_prefixed_moderately_compact_1_1_data(num_values: usize) -> TestData_1_
0x2E, 0x37, 0x34, 0x34,
0x30, 0x30, 0x30, 0x5A].repeat(num_values);

binary_1_1_data.append(&mut binary_1_1_data_body);
TestData_1_1 {
name: "moderately compact w/length-prefixed top level".to_owned(),
template_definition_text: template_definition_text.to_owned(),
Expand Down Expand Up @@ -444,12 +438,12 @@ mod benchmark {
b.iter(|| {
// We don't have an API for doing this with the application-level reader yet, so
// for now we use a manually configured context and a raw reader.
let mut reader = LazyRawBinaryReader_1_1::new(binary_1_1_data);
let mut reader = LazyRawBinaryReader_1_1::new(context_ref, binary_1_1_data);
let mut num_top_level_values: usize = 0;
// Skip past the IVM
reader.next(context_ref).unwrap().expect_ivm().unwrap();
reader.next().unwrap().expect_ivm().unwrap();
Comment on lines -447 to +444
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🪧 The raw readers now take a reference to the encoding context at construction time instead of having it be passed into each call to next().

Taking them as an argument to next was intended to allow a raw reader to exist as long as needed, with the context being provided any time they read. In practice, however, the raw readers only exist long enough to read a single top-level value from the stream, and requiring a context ref at every turn gets pretty tedious.

// Expect every top-level item to be an e-expression.
while let RawStreamItem::EExp(raw_eexp) = reader.next(context_ref).unwrap() {
while let RawStreamItem::EExp(raw_eexp) = reader.next().unwrap() {
num_top_level_values += 1;
// Look up the e-expression's invoked macro ID in the encoding context.
let eexp = raw_eexp.resolve(context_ref).unwrap();
Expand Down
Loading
Loading