-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds NativeElementWriter
, string/clob validation, big DecodedInt
s
#384
Conversation
This PR makes the following changes: * A native Rust `ElementWriter` implementation (`NativeElementWriter`) has been added with basic testing; it is not yet being used in `ion-tests`. * ion-tests integration has been refactored to facilitate running roundtrip tests with either the `IonCSliceElementWriter` or the `NativeElementWriter` as desired. * Fixed a bug in the ion-tests integration that caused the ion-c reader to be used in some tests even when the native reader had been requested. This change surfaced some additional failing ion-tests that have been addressed in the changes below. * The 64-bit size limit on the binary encoding primitive `Int` has been lifted. This allows Timestamps and Decimals of arbitrary precision to be read; write support for big `Int`s has not been added yet. * Clobs now disallow unescaped non-ASCII characters and unescaped ASCII control characters. * Strings now disallow unescaped ASCII control characters. * Strings and clobs now correctly decode utf-16 surrogate pairs. * Long-form clobs and strings now normalize unescaped newlines. * Fixed a bug in the `TextBuffer` that could cause a panic if restacking the buffer's contents caused `truncate` to evaluate garbage data beyond the truncation point. * Fixed a bug in the way that Timestamp fractional seconds were being converted to a Decimal.
Codecov Report
@@ Coverage Diff @@
## main #384 +/- ##
========================================
Coverage 91.39% 91.39%
========================================
Files 77 78 +1
Lines 11288 11486 +198
========================================
+ Hits 10317 10498 +181
- Misses 971 988 +17
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ PR tour
@@ -55,7 +57,7 @@ where | |||
bytes_written += VarInt::write_i64(self, decimal.exponent)?; | |||
|
|||
if decimal.coefficient.is_negative_zero() { | |||
bytes_written += Int::write_negative_zero(self)?; | |||
bytes_written += DecodedInt::write_negative_zero(self)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Sometime last year, the UInt
primitive was split into DecodedUInt
and EncodedUInt
. This change starts the same process for Int
by renaming it to DecodedInt
. A future PR focused on writing will create its EncodedInt
counterpart.
// This limit is used for stack-allocating buffer space to encode/decode Ints. | ||
const INT_STACK_BUFFER_SIZE: usize = 16; | ||
// This number was chosen somewhat arbitrarily and could be lifted if a use case demands it. | ||
const MAX_INT_SIZE_IN_BYTES: usize = 2048; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ An earlier PR lifted the size restriction on reading UInt
s, making it possible to read integers that are up to the semi-arbitrary limit of 2048 bytes. This PR does the same for Int
, making it possible to read Decimal
and Timestamp
values with an absurd level of precision.
size_in_bytes: usize, | ||
value: IntStorage, | ||
// [IntStorage] is not capable of natively representing negative zero. We track the sign | ||
value: Integer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Instead of an IntStorage
(a type alias for i64
), the DecodedInt
's value is now the enum Integer
, allowing it to be a BigInt
when necessary.
} | ||
|
||
fn write_int_test(value: i64, expected_bytes: &[u8]) -> IonResult<()> { | ||
let mut buffer: Vec<u8> = vec![]; | ||
Int::write_i64(&mut buffer, value)?; | ||
DecodedInt::write_i64(&mut buffer, value)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Writing Int
s of arbitrary size is still not possible, but will be added in a future PR.
@@ -533,7 +533,7 @@ impl<R: IonDataSource> StreamReader for RawBinaryReader<R> { | |||
return Ok(Decimal::negative_zero_with_exponent(exponent)); | |||
} | |||
|
|||
Ok(Decimal::new(coefficient.value(), exponent)) | |||
Ok(Decimal::new(coefficient, exponent)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Decimal::new
accepts any type that implements From<T> for Coefficient
; we added such an implementation for DecodedInt
earlier in the diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
file_name: &str, | ||
value_assert: F1, | ||
group_assert: F2, | ||
) -> IonResult<()> | ||
where | ||
F1: Fn(&OwnedElement, &OwnedElement), | ||
// group index, value 1 index, value 1, value 2 index, value 2 | ||
F1: Fn(usize, usize, &OwnedElement, usize, &OwnedElement), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ This closure now takes indexes for the group being evaluated and for each of the elements being compared. This made debugging failures much easier.
format: Format, | ||
) -> IonResult<Vec<OwnedElement>> { | ||
let new_elements = | ||
Self::RoundTripper::roundtrip(source_elements, format, Self::make_reader())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Blink and you'll miss it! I know because I missed it. 😞
The Self::make_reader()
at the end of this line used to be a call to element_reader()
, which meant that for lots of tests the native Rust reader wasn't being used; element_reader()
always returns an IonCSliceElementWriter
.
This PR fixed that, bringing a handful of other test failures to light. This PR also fixed most of them.
|group_index, this_index, this, that_index, that| { | ||
assert!( | ||
this.ion_eq(that), | ||
"in group {}, index {} ({:?}) was not ion_eq to index {} ({:?})", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Here and below: lots of additional context around test failures.
"ion-tests/iontestdata/good/non-equivs/timestamps.ion", | ||
"ion-tests/iontestdata/good/equivs/timestamps.ion", | ||
"ion-tests/iontestdata/good/equivs/timestampFractions.ion", | ||
// ----- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ I plan to leave the ion-c
-writer-related failures on the skip list for now and move on to wiring up the NativeElementWriter
. I assume it also has bugs, but that's a higher priority for investing time in fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I like this plan.
} | ||
|
||
#[test_resources("ion-tests/iontestdata/good/**/*.ion")] | ||
#[test_resources("ion-tests/iontestdata/good/**/*.10n")] | ||
fn native_good_roundtrip_text_binary(file_name: &str) { | ||
good_roundtrip_text_binary(NativeElementApi {}, file_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Originally, NativeElementApi
was defined like this:
struct NativeElementApi {}
This PR changed it to:
struct NativeElementApi;
which means that all usages can now drop their {}
s.
// We're going to treat the buffer's contents like the big-endian bytes of an | ||
// unsigned integer. Now that we've made a note of the sign, set the sign bit | ||
// in the buffer to zero. | ||
buffer[0] &= 0b0111_1111; | ||
let value = BigInt::from_bytes_be(sign, buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my unseasoned eye it looks like we could take this approach to i64 values too. What's the difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could! The i64
implementation was one of the first bits of ion-rust
I wrote, and I wrote it before from_bytes_be
was a feature--it wasn't stabilized until 2019.
@@ -533,7 +533,7 @@ impl<R: IonDataSource> StreamReader for RawBinaryReader<R> { | |||
return Ok(Decimal::negative_zero_with_exponent(exponent)); | |||
} | |||
|
|||
Ok(Decimal::new(coefficient.value(), exponent)) | |||
Ok(Decimal::new(coefficient, exponent)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
fn char_is_legal_clob_ascii(c: char) -> bool { | ||
// Depending on where you look in the spec and/or `ion-tests`, you'll find conflicting | ||
// information about which ASCII characters can appear unescaped in a clob. Some say | ||
// "characters >= 0x20", but that excludes lots of whitespace characters that are < 0x20. | ||
// Some say "displayable ASCII", but DEL (0x7F) is shown to be legal in one of the ion-tests. | ||
// The definition used here has largely been inferred from the contents of `ion-tests`. | ||
c.is_ascii() && (u32::from(c) >= 0x20 || WHITESPACE_CHARACTERS.contains(&c)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this worth opening an issue in ion spec linked to ion-tests
, to resolve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'\x09', // Horizontal tab | ||
'\x0B', // Vertical tab | ||
'\x0C', // Form feed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, much better :D
@@ -70,67 +74,33 @@ fn long_string_fragment(input: &str) -> IonParseResult<StringFragment> { | |||
pub(in crate::text::parsers) fn long_string_fragment_without_escaped_text( | |||
input: &str, | |||
) -> IonParseResult<StringFragment> { | |||
// In contrast with the `short_string_fragment_without_escaped_text` function, this parser is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that is an improvement. Thanks!
/// * <https://www.unicode.org/glossary/#surrogate_code_point> | ||
fn code_point_is_a_high_surrogate(value: u32) -> bool { | ||
(0xD800..=0xDFFF).contains(&value) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙇
#[test] | ||
fn ion_eq_fraction_seconds_mixed_mantissa() { | ||
let t1 = Timestamp { | ||
date_time: NaiveDateTime::from_str("1857-05-29T19:25:59.100").unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this date? Arbitrary? Closest thing I can find in history is the founding of the precursor bank to BBVA, on May 28, 1857.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pulled it from a timestamp
ion-test
that was failing for closer inspection.
"ion-tests/iontestdata/good/non-equivs/timestamps.ion", | ||
"ion-tests/iontestdata/good/equivs/timestamps.ion", | ||
"ion-tests/iontestdata/good/equivs/timestampFractions.ion", | ||
// ----- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I like this plan.
I still find the conversation/threading model in GitHub PRs to be just bizarre. |
This PR makes the following changes:
ElementWriter
implementation (NativeElementWriter
)has been added with basic testing; it is not yet being used in
ion-tests
.roundtrip tests with either the
IonCSliceElementWriter
or theNativeElementWriter
as desired.to be used in some tests even when the native reader had been requested.
This change surfaced some additional failing ion-tests that have been
addressed in the changes below.
Int
has beenlifted. This allows Timestamps and Decimals of arbitrary precision to be
read; write support for big
Int
s has not been added yet.control characters.
TextBuffer
that could cause a panic if restackingthe buffer's contents caused
truncate
to evaluate garbage data beyondthe truncation point.
converted to a Decimal.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.