Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds additional tooling APIs to enable Ion 1.1 support in the CLI's inspect command #802

Merged
merged 15 commits into from
Aug 13, 2024

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented Aug 9, 2024

This PR:

  • Adds a new trait, EncodedBinaryValue, with methods for exposing Spans representing the various components of a value's encoding. This trait is implemented by the 1.0 and 1.1 binary value types. These methods allow ion inspect to handle both versions' values uniformly.
  • Adds peek methods to raw container iterators and MacroExpansion, allowing tools to see the next value without advancing state.
  • Adds an ExpandedStreamItem enum that surfaces input EExps in addition to their expansion. This allows tools (like inspect to investigate the EExp and then investigate its output values.
  • Renames ExpandedValueSource::EExp to SingletonEExp. If a LazyValue is backed by an EExp, it's because it was found to always expand to a single value at compile time--that makes it possible to hand out a LazyValue knowing that when it is eventually read there will be exactly one value. Other values can come from an EExp, but are not backed by it. The renaming hopes to make that less fuzzy.
  • Adds convenience methods for inspecting where a value/expr came from, like is_ephemeral, is_parameter, etc.

It also fixes three bugs for detecting incomplete input:

  • Text EExp IDs were being parsed using the identifier symbol parser. Unlike a symbol, an EExp's ID cannot be the last thing in a stream.
  • The float parser correctly rejected input like 123e (no digits after e), but it did it by declaring it invalid rather than incomplete. If the buffer received more data, it's possible retrying would succeed.
  • Binary 1.1 values were not confirming that the entire value was in the buffer after reading its length.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@zslayton zslayton requested a review from popematt August 9, 2024 18:55
Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR Tour 🧭

Comment on lines -604 to +609
let (remaining_data, stream_offset, mut encoding) = match &self.encoding_reader {
Text_1_0(r) => r.stream_data(),
Binary_1_0(r) => r.stream_data(),
Text_1_1(r) => r.stream_data(),
Binary_1_1(r) => r.stream_data(),
let reader_state = match &self.encoding_reader {
Text_1_0(r) => r.save_state(),
Binary_1_0(r) => r.save_state(),
Text_1_1(r) => r.save_state(),
Binary_1_1(r) => r.save_state(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I renamed stream_data to save_state to eliminate the noun/verb ambiguity Matt noticed in the feedback for #796.

Copy link
Contributor

@jobarr-amzn jobarr-amzn Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to find the comment: #796 (comment)

Doc comment would be helpful because it's not immediately obvious whether "stream" is a verb or a noun here.

It's a little ironic because "save state" is nearly as ambiguous as "stream data". Both "save" and "stream" may be adjectives or verbs. "save state" is clearer though, and has the correct implication interpreted either way. I like the change :)

Copy link
Contributor

@popematt popematt Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a noun, consider data_stream, saved_state. If it's a verb, consider save_reader_state maybe?

Or even get_reader_state()? (Or, if it accepts owned self, then maybe into_reader_state().)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the name stream_data was noun/verb ambiguous, I think the actual problem being solved was that the name didn't explain what it did. I think save_state addresses that, even if it remains noun/verb ambiguous.

@@ -1149,10 +1153,12 @@ impl<'top> LazyContainerPrivate<'top, AnyEncoding> for LazyRawAnyList<'top> {
}
}

#[derive(Debug, Copy, Clone)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Lots of structs now implement Copy et al not only because they can/should, but also because it makes implementing peek() possible for raw containers/readers.

Comment on lines +483 to +654
if total_length > input.len() {
return IonResult::incomplete(
"the stream ended unexpectedly in the middle of a value",
header_offset,
);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Somehow the test for incomplete binary values never made the leap from 1.0 to 1.1, so here it is.

@@ -3,9 +3,12 @@
use std::fmt::Debug;
use std::ops::Range;

use num_traits::PrimInt;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ cargo fmt moved this from below.

@@ -823,3 +825,35 @@ impl<'top> LazyRawBinaryValue_1_1<'top> {
Ok(LazyRawBinaryStruct_1_1::from_value(self))
}
}

impl<'top> EncodedBinaryValue<'top, BinaryEncoding_1_1> for &'top LazyRawBinaryValue_1_1<'top> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The definition of this trait is in the next file down in the PR. For now, just notice that these methods are exposing chunks of the input backing various encoding primitives in the value's bytes.

Comment on lines +48 to +52
pub struct RawReaderState<'a> {
data: &'a [u8],
offset: usize,
encoding: IonEncoding,
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I created a type for this instead of passing around a 3-tuple.

// Unlike a symbol value with identifier syntax, an e-expression identifier cannot be
// the last thing in the stream.
return Err(nom::Err::Incomplete(Needed::Unknown));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Fixing an incompleteness detection bug that occurs when the last thing in the buffer is the e-exp ID.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember you!

))(self)
}

/// Like `match_digits_before_dot`, but allows leading zeros.
fn match_digits_after_dot(self) -> IonMatchResult<'top> {
fn match_one_or_more_digits_after_dot(self) -> IonMatchResult<'top> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Fixing an incompleteness detection bug that occurs when the e in a float is the last thing in the buffer.

Comment on lines +2919 to +2928
expect_incomplete: [
"5d",
"-5d",
"5.d",
"-5.d",
"5.D",
"-5.D",
"5.1d", "-5.1d",
"5.1D", "-5.1D",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ All of these could be valid decimals if the buffer ends up getting at least one more digit.

let buffer_tail = &input.bytes()[input.len() - tail_bytes_to_take..];
let tail = format!("{:X?}", buffer_tail);
(head, tail)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ These changes make the error message that surfaces for illegal input much more friendly.

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, looks good!

@zslayton zslayton marked this pull request as ready for review August 9, 2024 21:24
@zslayton zslayton requested a review from nirosys August 9, 2024 21:24
Base automatically changed from read-flex_uint-annotate to main August 11, 2024 19:10
Comment on lines 279 to 298
// Nonsense values for now
ion_type_code: OpcodeType::Nop,
low_nibble: 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now? Until when?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent (not communicated in this comment) was to modify the EncodedValue type such that it wouldn't require a Header anymore. Matt also suggested this independently; I'm going to do it as part of #805.

/// Returns `true` if this expression was produced by evaluating a macro. Otherwise, returns `false`.
pub fn is_ephemeral(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚲 🏠 is_ephemeral doesn't say "is the result of evaluating a macro" to me, but I am ignorant of the standard terminology here. Is this phrasing used elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been using it in ion-rust and ion inspect to describe values that don't appear anywhere in the input stream. They exist only during macro evaluation. I'm open to other ideas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on "don't appear anywhere in the input stream"? An example, perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Here's an example in which the stream contains:

  1. An encoding directive ($ion_encoding::(...)) that defines a macro called triple
  2. An e-expression with a string argument ((:triple "hello"))

image

When a Reader traverses the stream it will hand out three LazyValues, one for each "hello" in the stream's expansion. However, if you were to ask any of those LazyValues for its Span (backing bytes), you would get None--the values aren't encoded as values the stream itself, they're the ephemeral result of evaluating the e-expression.

Does that help?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the example :) Yes that does help. If the triple macro produced only one value then you could say that that is encoded where the macro is invoked (as you describe in the comment below about "singleton macros".

Hmm, I could be inclined to think of those "hello" values as occupying the same 0-byte span between bytes... 73 and 74? But then I'd have to wonder- how do I distinguish the position of the second from the first? If I'm able to distinguish the difference between multiple values in the same place then I might as well say all three values are "in" the macro invocation. Is the answer then that we need another dimension to locate values, that all three "hello" are located between bytes 67-74 at different ordinals? E.g. ([67-74], 0), ([67-74], 1), ([67-74], 2)?

Are these values ephemeral? Virtual? Implicit? Derived? Computed?

Lots of questions. No suggestions :) Works for me.

Comment on lines 377 to 420
_ => unreachable!("e-expression-backed lazy values must yield a single value literal"),
_ => ice!(IonResult::decoding_error(format!(
"expansion of {self:?} was required to produce exactly one value",
))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's happening here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path is an optimization available to macros that are statically determined to expand to a single non-system value at template-compile time. Consider this macro definition:

(macro Point2D ($x $y)
  {
    x: $x,
    y: $y,
  }
)

When the TemplateCompiler processes this, it notices that the body of the macro is an unannotated struct. This means that it will always evaluate to exactly one value and that value cannot be a symbol table or encoding directive (because those require specific annotations).

The codebase refers to this category of macros as "singleton macros." They offer a few benefits:

  • Because it will definitely be exactly one value, the reader can hand out a LazyValue holding the e-expression as its backing data. Other macros cannot do this because if you're holding a LazyValue and the macro evaluates to nothing or 100 things, there's not a way for the application to handle those outcomes.
  • At the top level, singleton macros can be skipped over without evaluation because we know they won't produce a system value that could modify the encoding context.
  • Expanding a singleton macro doesn't require an evaluator with a stack because as soon as you've gotten a value, you're done--no need to pop() and preserve state.

Now that we're looking at this more closely though, the decoding error should actually be a panic--if it ever happens, it's a bug in the library and not a problem with the data. While I'm in there I'll also add some of the description above to the ExpansionSingleton type.

Comment on lines 1090 to 1103
new_macro_table.macro_with_id(3),
new_macro_table.macro_with_id(4),
new_macro_table.macro_with_name("seventeen")
);
assert_eq!(
new_macro_table.macro_with_id(4),
new_macro_table.macro_with_id(5),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did these change because of the addition of annotate? I see the assertion on new line 1097 doesn't change. The comment on new line 1095 is now out of date though, yes?

Copy link
Contributor Author

@zslayton zslayton Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did these change because of the addition of annotate?

Yep. Matt rightly pointed out that I should use an offset from NUM_SYSTEM_MACROS to prevent future breakage. I'm going to do that for #807.

The comment on new line 1095 is now out of date though, yes?

Yeah, good catch.

// Unlike a symbol value with identifier syntax, an e-expression identifier cannot be
// the last thing in the stream.
return Err(nom::Err::Incomplete(Needed::Unknown));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember you!

Comment on lines +1575 to +1580
// Note: ^-- We use this `pair(satisfy(...), complete_digit0)` to guarantee a subtle
// behavior. At the end of the buffer, an empty input to this parser must be
// considered 'incomplete' instead of 'invalid'. In contrast, an input of a single
// digit would be considered complete even though the buffer could get more data later.
// (If the buffer gets more data, it's the StreamingRawReader's responsibility to
// discard the `1.1` and try again.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much appreciate comments like these, thanks!

let buffer_tail = &input.bytes()[input.len() - tail_bytes_to_take..];
let tail = format!("{:X?}", buffer_tail);
(head, tail)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, looks good!

@zslayton zslayton requested a review from jobarr-amzn August 13, 2024 15:43
@zslayton zslayton merged commit d4ad0c5 into main Aug 13, 2024
28 of 31 checks passed
@zslayton zslayton deleted the inspect-1_1 branch August 13, 2024 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants