Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds NativeElementWriter, string/clob validation, big DecodedInts #384

Merged
merged 5 commits into from
May 19, 2022

Conversation

zslayton
Copy link
Contributor

This PR makes the following changes:

  • A native Rust ElementWriter implementation (NativeElementWriter)
    has been added with basic testing; it is not yet being used in
    ion-tests.
  • ion-tests integration has been refactored to facilitate running
    roundtrip tests with either the IonCSliceElementWriter or the
    NativeElementWriter as desired.
  • Fixed a bug in the ion-tests integration that caused the ion-c reader
    to be used in some tests even when the native reader had been requested.
    This change surfaced some additional failing ion-tests that have been
    addressed in the changes below.
  • The 64-bit size limit on the binary encoding primitive Int has been
    lifted. This allows Timestamps and Decimals of arbitrary precision to be
    read; write support for big Ints has not been added yet.
  • Clobs now disallow unescaped non-ASCII characters and unescaped ASCII
    control characters.
  • Strings now disallow unescaped ASCII control characters.
  • Strings and clobs now correctly decode utf-16 surrogate pairs.
  • Long-form clobs and strings now normalize unescaped newlines.
  • Fixed a bug in the TextBuffer that could cause a panic if restacking
    the buffer's contents caused truncate to evaluate garbage data beyond
    the truncation point.
  • Fixed a bug in the way that Timestamp fractional seconds were being
    converted to a Decimal.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

zslayton added 2 commits May 18, 2022 17:37
This PR makes the following changes:

* A native Rust `ElementWriter` implementation (`NativeElementWriter`)
has been added with basic testing; it is not yet being used in
`ion-tests`.
* ion-tests integration has been refactored to facilitate running
roundtrip tests with either the `IonCSliceElementWriter` or the
`NativeElementWriter` as desired.
* Fixed a bug in the ion-tests integration that caused the ion-c reader
to be used in some tests even when the native reader had been requested.
This change surfaced some additional failing ion-tests that have been
addressed in the changes below.
* The 64-bit size limit on the binary encoding primitive `Int` has been
lifted. This allows Timestamps and Decimals of arbitrary precision to be
read; write support for big `Int`s has not been added yet.
* Clobs now disallow unescaped non-ASCII characters and unescaped ASCII
control characters.
* Strings now disallow unescaped ASCII control characters.
* Strings and clobs now correctly decode utf-16 surrogate pairs.
* Long-form clobs and strings now normalize unescaped newlines.
* Fixed a bug in the `TextBuffer` that could cause a panic if restacking
the buffer's contents caused `truncate` to evaluate garbage data beyond
the truncation point.
* Fixed a bug in the way that Timestamp fractional seconds were being
converted to a Decimal.
@codecov
Copy link

codecov bot commented May 18, 2022

Codecov Report

Merging #384 (1501d56) into main (0dc698a) will increase coverage by 0.00%.
The diff coverage is 91.97%.

@@           Coverage Diff            @@
##             main     #384    +/-   ##
========================================
  Coverage   91.39%   91.39%            
========================================
  Files          77       78     +1     
  Lines       11288    11486   +198     
========================================
+ Hits        10317    10498   +181     
- Misses        971      988    +17     
Impacted Files Coverage Δ
src/text/parsers/decimal.rs 95.14% <ø> (ø)
src/text/parsers/mod.rs 89.74% <ø> (ø)
src/value/mod.rs 88.92% <ø> (ø)
src/binary/raw_binary_writer.rs 94.91% <60.00%> (+0.03%) ⬆️
src/value/native_writer.rs 82.75% <82.75%> (ø)
src/text/parsers/text_support.rs 92.10% <92.10%> (-2.70%) ⬇️
src/text/parsers/clob.rs 97.95% <92.30%> (-2.05%) ⬇️
src/binary/int.rs 95.40% <94.52%> (-1.03%) ⬇️
src/types/timestamp.rs 92.67% <97.05%> (+0.50%) ⬆️
src/binary/decimal.rs 67.74% <100.00%> (ø)
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0dc698a...1501d56. Read the comment docs.

Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR tour

@@ -55,7 +57,7 @@ where
bytes_written += VarInt::write_i64(self, decimal.exponent)?;

if decimal.coefficient.is_negative_zero() {
bytes_written += Int::write_negative_zero(self)?;
bytes_written += DecodedInt::write_negative_zero(self)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Sometime last year, the UInt primitive was split into DecodedUInt and EncodedUInt. This change starts the same process for Int by renaming it to DecodedInt. A future PR focused on writing will create its EncodedInt counterpart.

Comment on lines +15 to +18
// This limit is used for stack-allocating buffer space to encode/decode Ints.
const INT_STACK_BUFFER_SIZE: usize = 16;
// This number was chosen somewhat arbitrarily and could be lifted if a use case demands it.
const MAX_INT_SIZE_IN_BYTES: usize = 2048;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ An earlier PR lifted the size restriction on reading UInts, making it possible to read integers that are up to the semi-arbitrary limit of 2048 bytes. This PR does the same for Int, making it possible to read Decimal and Timestamp values with an absurd level of precision.

size_in_bytes: usize,
value: IntStorage,
// [IntStorage] is not capable of natively representing negative zero. We track the sign
value: Integer,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Instead of an IntStorage (a type alias for i64), the DecodedInt's value is now the enum Integer, allowing it to be a BigInt when necessary.

}

fn write_int_test(value: i64, expected_bytes: &[u8]) -> IonResult<()> {
let mut buffer: Vec<u8> = vec![];
Int::write_i64(&mut buffer, value)?;
DecodedInt::write_i64(&mut buffer, value)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Writing Ints of arbitrary size is still not possible, but will be added in a future PR.

@@ -533,7 +533,7 @@ impl<R: IonDataSource> StreamReader for RawBinaryReader<R> {
return Ok(Decimal::negative_zero_with_exponent(exponent));
}

Ok(Decimal::new(coefficient.value(), exponent))
Ok(Decimal::new(coefficient, exponent))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Decimal::new accepts any type that implements From<T> for Coefficient; we added such an implementation for DecodedInt earlier in the diff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

file_name: &str,
value_assert: F1,
group_assert: F2,
) -> IonResult<()>
where
F1: Fn(&OwnedElement, &OwnedElement),
// group index, value 1 index, value 1, value 2 index, value 2
F1: Fn(usize, usize, &OwnedElement, usize, &OwnedElement),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This closure now takes indexes for the group being evaluated and for each of the elements being compared. This made debugging failures much easier.

format: Format,
) -> IonResult<Vec<OwnedElement>> {
let new_elements =
Self::RoundTripper::roundtrip(source_elements, format, Self::make_reader())?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Blink and you'll miss it! I know because I missed it. 😞

The Self::make_reader() at the end of this line used to be a call to element_reader(), which meant that for lots of tests the native Rust reader wasn't being used; element_reader() always returns an IonCSliceElementWriter.

This PR fixed that, bringing a handful of other test failures to light. This PR also fixed most of them.

|group_index, this_index, this, that_index, that| {
assert!(
this.ion_eq(that),
"in group {}, index {} ({:?}) was not ion_eq to index {} ({:?})",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Here and below: lots of additional context around test failures.

"ion-tests/iontestdata/good/non-equivs/timestamps.ion",
"ion-tests/iontestdata/good/equivs/timestamps.ion",
"ion-tests/iontestdata/good/equivs/timestampFractions.ion",
// -----
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I plan to leave the ion-c-writer-related failures on the skip list for now and move on to wiring up the NativeElementWriter. I assume it also has bugs, but that's a higher priority for investing time in fixes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I like this plan.

}

#[test_resources("ion-tests/iontestdata/good/**/*.ion")]
#[test_resources("ion-tests/iontestdata/good/**/*.10n")]
fn native_good_roundtrip_text_binary(file_name: &str) {
good_roundtrip_text_binary(NativeElementApi {}, file_name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Originally, NativeElementApi was defined like this:

struct NativeElementApi {}

This PR changed it to:

struct NativeElementApi;

which means that all usages can now drop their {}s.

@zslayton zslayton requested review from jobarr-amzn and desaikd May 19, 2022 00:48
Comment on lines +95 to +99
// We're going to treat the buffer's contents like the big-endian bytes of an
// unsigned integer. Now that we've made a note of the sign, set the sign bit
// in the buffer to zero.
buffer[0] &= 0b0111_1111;
let value = BigInt::from_bytes_be(sign, buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my unseasoned eye it looks like we could take this approach to i64 values too. What's the difference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could! The i64 implementation was one of the first bits of ion-rust I wrote, and I wrote it before from_bytes_be was a feature--it wasn't stabilized until 2019.

@@ -533,7 +533,7 @@ impl<R: IonDataSource> StreamReader for RawBinaryReader<R> {
return Ok(Decimal::negative_zero_with_exponent(exponent));
}

Ok(Decimal::new(coefficient.value(), exponent))
Ok(Decimal::new(coefficient, exponent))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +175 to +181
fn char_is_legal_clob_ascii(c: char) -> bool {
// Depending on where you look in the spec and/or `ion-tests`, you'll find conflicting
// information about which ASCII characters can appear unescaped in a clob. Some say
// "characters >= 0x20", but that excludes lots of whitespace characters that are < 0x20.
// Some say "displayable ASCII", but DEL (0x7F) is shown to be legal in one of the ion-tests.
// The definition used here has largely been inferred from the contents of `ion-tests`.
c.is_ascii() && (u32::from(c) >= 0x20 || WHITESPACE_CHARACTERS.contains(&c))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this worth opening an issue in ion spec linked to ion-tests, to resolve?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +34 to +36
'\x09', // Horizontal tab
'\x0B', // Vertical tab
'\x0C', // Form feed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, much better :D

@@ -70,67 +74,33 @@ fn long_string_fragment(input: &str) -> IonParseResult<StringFragment> {
pub(in crate::text::parsers) fn long_string_fragment_without_escaped_text(
input: &str,
) -> IonParseResult<StringFragment> {
// In contrast with the `short_string_fragment_without_escaped_text` function, this parser is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is an improvement. Thanks!

/// * <https://www.unicode.org/glossary/#surrogate_code_point>
fn code_point_is_a_high_surrogate(value: u32) -> bool {
(0xD800..=0xDFFF).contains(&value)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙇

#[test]
fn ion_eq_fraction_seconds_mixed_mantissa() {
let t1 = Timestamp {
date_time: NaiveDateTime::from_str("1857-05-29T19:25:59.100").unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this date? Arbitrary? Closest thing I can find in history is the founding of the precursor bank to BBVA, on May 28, 1857.

Copy link
Contributor Author

@zslayton zslayton May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled it from a timestamp ion-test that was failing for closer inspection.

"ion-tests/iontestdata/good/non-equivs/timestamps.ion",
"ion-tests/iontestdata/good/equivs/timestamps.ion",
"ion-tests/iontestdata/good/equivs/timestampFractions.ion",
// -----
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I like this plan.

@jobarr-amzn
Copy link
Contributor

I still find the conversation/threading model in GitHub PRs to be just bizarre.

@zslayton zslayton merged commit 791efb4 into main May 19, 2022
@zslayton zslayton deleted the strings-and-clobs branch May 19, 2022 10:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants