Correctly detect encoding even without BOM #465
Conversation
@@ -173,25 +180,26 @@ pub(crate) fn remove_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> &'
///
/// | Bytes       |Detected encoding
/// |-------------|------------------------------------------
/// |`FE FF ## ##`|UTF-16, big-endian
/// | **BOM**
/// |`FE_FF_##_##`|UTF-16, big-endian
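The table documents BOM-based detection. A minimal sketch of that idea (an illustration, not quick-xml's actual implementation) could look like this: inspect the leading bytes of the input and map a known BOM to an encoding name.

```rust
// Sketch of BOM-based encoding detection. The function name and the
// returned labels are illustrative assumptions, not quick-xml API.
fn detect_bom(bytes: &[u8]) -> Option<&'static str> {
    match bytes {
        // Check longer BOMs before any that share a prefix with them.
        [0xEF, 0xBB, 0xBF, ..] => Some("UTF-8"),
        [0xFE, 0xFF, ..] => Some("UTF-16, big-endian"),
        [0xFF, 0xFE, ..] => Some("UTF-16, little-endian"),
        _ => None,
    }
}

fn main() {
    // `FE FF ## ##` from the table above: UTF-16, big-endian.
    assert_eq!(
        detect_bom(&[0xFE, 0xFF, 0x00, 0x3C]),
        Some("UTF-16, big-endian")
    );
    // No BOM: detection must fall through to other heuristics.
    assert_eq!(detect_bom(b"<?xml"), None);
    println!("ok");
}
```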
Underscores were added because otherwise long text in the lower cells shrinks the first column and causes text wrapping, which does not look good.
What about BOM stripping? Should we strip the BOM when we parse from
I don't think so, personally - #461 should handle it so that the input itself never needs to be transformed. Unless you think it would be useful in the interim before that issue is dealt with.
Force-pushed from 0fb349e to 7520339
OK, then for now I'll leave it as implemented. The PR is ready for review.
Force-pushed from 7520339 to 55f0155
Thanks. I've been waiting for this to avoid overlapping work, as I needed to make a few similar changes - I will try to proceed with the decoding work this weekend.
## - [UTF-16LE]
## - [ISO-2022-JP]
##
## You should stop to process document when one of that encoding will be detected,
The suggestion makes sense in isolation, but I don't know that we want to contribute to even more churn than necessary? It's likely to either break or be unnecessary in the very next release.
Yes, I just want to explicitly warn users that UTF-16 and ISO-2022-JP are not supported for now. For example, parsing documents with Chinese characters represented as [ASCII byte, some byte] in UTF-16BE or [some byte, ASCII byte] in UTF-16LE can confuse the parser. That can be avoided if you stop processing such documents at the very beginning.
I want to make this change because I want to cut a release this weekend, and it is still unclear when the correct solution will be ready, so it is best to avoid using problematic encodings for now.
I hope that this note can be removed in the next release. It should not break anything -- once correct support is implemented, users will update and remove their guard code.
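The "guard code" users might write in the interim could be sketched like this (a hypothetical user-side check, not quick-xml API): reject input whose leading bytes indicate one of the encodings the parser cannot decode yet, before feeding the bytes to the parser.

```rust
// Hypothetical pre-parse guard; the function name is an assumption.
fn looks_unsupported(bytes: &[u8]) -> bool {
    matches!(
        bytes,
        // UTF-16 BOMs
        [0xFE, 0xFF, ..] | [0xFF, 0xFE, ..]
        // BOM-less UTF-16: the leading '<' (0x3C) is paired with a zero byte
        | [0x3C, 0x00, ..] | [0x00, 0x3C, ..]
        // ISO-2022-JP charset switches are ESC (0x1B) sequences; a document
        // that starts with one can be rejected (a heuristic, not a full check)
        | [0x1B, ..]
    )
}

fn main() {
    assert!(looks_unsupported(&[0x00, 0x3C, 0x00, 0x3F])); // "<?" in UTF-16BE
    assert!(!looks_unsupported(b"<?xml version=\"1.0\"?>"));
    println!("ok");
}
```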
I added a remark that the restriction is temporary and will be eliminated once #158 is fixed.
Thanks. I do think finishing #158 this weekend is plausible but I won't promise it.
Well, an intermediate "good enough to release" stage of it, in any case.
Or it would be, except for async. I keep forgetting about that...
Cargo.toml
Outdated
@@ -54,9 +54,54 @@ async-tokio = ["tokio"]
## UTF-16 will not work (therefore, `quick-xml` is not [standard compliant]).
##
## List of supported encodings includes all encodings supported by [`encoding_rs`]
## crate, that satisfied the restriction above.
## crate, that satisfied the restriction above. So, the following encodings are
## **not supported**:
There's a lot of redundancy between this, the lines above, and the lines below that can probably be cleaned up.
I tried to remove the redundancy; if you have something else in mind, please make a suggestion.
Changing it to approved so you can merge once the minor comments are addressed.
…oding` feature for them in the doc. This also allows showing docs for both variants of the methods.
After removing the StartText event in tafia#459, text events can be generated at the beginning of the stream.
`test-gen` is a project used to generate test files in the tests/documents/encoding directory. Failures (2): detect::utf16be, detect::utf16le.
Fixes 2 tests.
Force-pushed from 55f0155 to 6f303c6
The CDATA section was formed incorrectly and was instead recognized as a Start tag. The file introduced in PR tafia#465 was made manually rather than with the generator, because WHATWG does not define this encoding as a separate entry in index.json. Actually, this encoding is the same as ISO-8859-8, but influences layout direction when rendering text. Wikipedia: "The WHATWG Encoding Standard used by HTML5 treats ISO-8859-8 and ISO-8859-8-I as distinct encodings with the same mapping due to influence on the layout direction". So the generator was fixed and the file regenerated.
This PR correctly processes input and allows detecting the UTF-16 encoding even without a BOM, using the algorithm recommended by the W3C. Previously this did not work because we ran the detection too late, when the `<` (0x3C) byte was already recognized as the start of a tag and the "start text" did not contain it.

There is one subtle thing here: when should we strip the BOM? Currently I chose behavior that is consistent across the `encoding` feature being enabled / disabled and parsing from bytes / from `&str`, but it is probably not the best choice. The available variants can be summarized along two axes: `from_reader` vs. `from_str`, and with / without `--features encoding`.

I have doubts about stripping the BOM from `&str` arguments, because it is highly likely that the BOM was already stripped when you got the string, and if there is another BOM at the beginning, it is probably part of the content (because the BOM is an ordinary character, U+FEFF, and can be used anywhere like any other character). However, when the `encoding` feature is not enabled, the `from_reader` and `from_str` cases are the same -- input should be in UTF-8. Should we strip the BOM in the `from_reader` case then?

This PR also adds a set of files in all supported encodings, most of which contain all characters that the corresponding encoding can handle. The utility project for generating them is in the `test-gen` folder.
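The BOM-less detection idea referenced above (the W3C recommendation, Appendix F of the XML spec) can be sketched as follows: with no BOM present, the byte layout of the leading `<?` of the XML declaration reveals the encoding family. This is an illustrative sketch, not quick-xml's actual code.

```rust
// Guess the encoding family from the first bytes when no BOM is present.
// Function name and labels are illustrative assumptions.
fn guess_without_bom(bytes: &[u8]) -> &'static str {
    match bytes {
        // '<' '?' as 16-bit big-endian units: 00 3C 00 3F
        [0x00, 0x3C, 0x00, 0x3F, ..] => "UTF-16BE (no BOM)",
        // '<' '?' as 16-bit little-endian units: 3C 00 3F 00
        [0x3C, 0x00, 0x3F, 0x00, ..] => "UTF-16LE (no BOM)",
        // "<?xm" as single bytes: UTF-8 or another ASCII-compatible encoding
        [0x3C, 0x3F, 0x78, 0x6D, ..] => "ASCII-compatible",
        _ => "unknown; assume UTF-8",
    }
}

fn main() {
    assert_eq!(guess_without_bom(&[0x3C, 0x00, 0x3F, 0x00]), "UTF-16LE (no BOM)");
    assert_eq!(guess_without_bom(&[0x00, 0x3C, 0x00, 0x3F]), "UTF-16BE (no BOM)");
    assert_eq!(guess_without_bom(b"<?xml"), "ASCII-compatible");
    println!("ok");
}
```

This is why running the check too late breaks it: once the parser has consumed the `<` byte, the distinctive leading pattern is gone.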