Correctly detect encoding even without BOM #465
Conversation
@@ -173,25 +180,26 @@ pub(crate) fn remove_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> &'
///
/// | Bytes       |Detected encoding
/// |-------------|------------------------------------------
/// |`FE FF ## ##`|UTF-16, big-endian
/// | **BOM**
/// |`FE_FF_##_##`|UTF-16, big-endian
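The table documents BOM-based detection. A minimal sketch of that idea (an illustration, not quick-xml's actual implementation) could look like this: inspect the leading bytes of the input and map a known BOM to an encoding name.

```rust
// Sketch of BOM-based encoding detection. The function name and the
// returned labels are illustrative assumptions, not quick-xml API.
fn detect_bom(bytes: &[u8]) -> Option<&'static str> {
    match bytes {
        // Check longer BOMs before any that share a prefix with them.
        [0xEF, 0xBB, 0xBF, ..] => Some("UTF-8"),
        [0xFE, 0xFF, ..] => Some("UTF-16, big-endian"),
        [0xFF, 0xFE, ..] => Some("UTF-16, little-endian"),
        _ => None,
    }
}

fn main() {
    // `FE FF ## ##` from the table above: UTF-16, big-endian.
    assert_eq!(
        detect_bom(&[0xFE, 0xFF, 0x00, 0x3C]),
        Some("UTF-16, big-endian")
    );
    // No BOM: detection must fall through to other heuristics.
    assert_eq!(detect_bom(b"<?xml"), None);
    println!("ok");
}
```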
Underscores were added because otherwise long text in the lower cells shrinks the first column and causes text wrapping, which does not look good.
What about BOM stripping? Should we strip the BOM when we parse from
I don't think so, personally - #461 should handle it so that the input itself never needs to be transformed. Unless you think it would be useful in the interim before that issue is dealt with.
Force-pushed from 0fb349e to 7520339
OK, then for now I'll leave it as implemented. The PR is ready for review.
Force-pushed from 7520339 to 55f0155
Thanks. I've been waiting for this to avoid overlapping work, as I needed to make a few similar changes - I will try to proceed with the decoding work this weekend.
## - [UTF-16LE]
## - [ISO-2022-JP]
##
## You should stop to process document when one of that encoding will be detected,
The suggestion makes sense in isolation, but I don't know that we want to contribute to even more churn than necessary? It's likely to either break or be unnecessary in the very next release.
Yes, I just want to explicitly warn users that UTF-16 and ISO-2022-JP are not supported for now. For example, parsing documents with Chinese characters represented as [ASCII byte, some byte] in UTF-16BE or [some byte, ASCII byte] in UTF-16LE can confuse the parser. That can be avoided if you stop processing such documents at the very beginning.
I want to make this change because I want to cut a release this weekend, and it is still unclear when the correct solution will be ready, so it is best to avoid using problematic encodings for now.
I hope that this note can be removed in the next release. It should not break anything -- once correct support is implemented, users will update and remove their guard code.
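The "guard code" users might write in the interim could be sketched like this (a hypothetical user-side check, not quick-xml API): reject input whose leading bytes indicate one of the encodings the parser cannot decode yet, before feeding the bytes to the parser.

```rust
// Hypothetical pre-parse guard; the function name is an assumption.
fn looks_unsupported(bytes: &[u8]) -> bool {
    matches!(
        bytes,
        // UTF-16 BOMs
        [0xFE, 0xFF, ..] | [0xFF, 0xFE, ..]
        // BOM-less UTF-16: the leading '<' (0x3C) is paired with a zero byte
        | [0x3C, 0x00, ..] | [0x00, 0x3C, ..]
        // ISO-2022-JP charset switches are ESC (0x1B) sequences; a document
        // that starts with one can be rejected (a heuristic, not a full check)
        | [0x1B, ..]
    )
}

fn main() {
    assert!(looks_unsupported(&[0x00, 0x3C, 0x00, 0x3F])); // "<?" in UTF-16BE
    assert!(!looks_unsupported(b"<?xml version=\"1.0\"?>"));
    println!("ok");
}
```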
I added a remark that the restriction is temporary and will be eliminated once #158 is fixed.
Thanks. I do think finishing #158 this weekend is plausible but I won't promise it.
Well, an intermediate "good enough to release" stage of it, in any case.
Or it would be, except for async. I keep forgetting about that...
Cargo.toml
Outdated
@@ -54,9 +54,54 @@ async-tokio = ["tokio"]
## UTF-16 will not work (therefore, `quick-xml` is not [standard compliant]).
##
## List of supported encodings includes all encodings supported by [`encoding_rs`]
## crate, that satisfied the restriction above.
## crate, that satisfied the restriction above. So, the following encodings are
## **not supported**:
There's a lot of redundancy between this, the lines above, and the lines below that can probably be cleaned up.
I tried to remove the redundancy; if you have something else in mind, please make a suggestion.
Changing it to approved so you can merge once the minor comments are addressed.
…oding` feature for them in the doc. This also allows showing docs for both variants of the methods.
After removing the StartText event in tafia#459, text events can be generated at the beginning of the stream.
`test-gen` is a project used to generate test files in the tests/documents/encoding directory. Failures (2): detect::utf16be, detect::utf16le.
Fixes 2 tests.
Force-pushed from 55f0155 to 6f303c6
The CDATA section was formed incorrectly and was instead recognized as a Start tag. The file introduced in PR tafia#465 was made manually rather than with the generator, because WHATWG does not define this encoding as a separate entry in index.json. Actually, this encoding is the same as ISO-8859-8, but influences layout direction when rendering text. Wikipedia: "The WHATWG Encoding Standard used by HTML5 treats ISO-8859-8 and ISO-8859-8-I as distinct encodings with the same mapping due to influence on the layout direction". So the generator was fixed and the file regenerated.
This PR correctly processes input and allows detecting the UTF-16 encoding even without a BOM, using the algorithm recommended by the W3C. Previously this did not work because we ran the detection too late, when the `<` (0x3C) byte was already recognized as the start of a tag and the "start text" did not contain it.

There is one subtle thing here: when should we strip the BOM? Currently I chose behavior that is consistent across the `encoding` feature being enabled / disabled and parsing from bytes / from `&str`, but it is probably not the best choice. The available variants can be summarized along two axes: `from_reader` vs. `from_str`, and with / without `--features encoding`.

I have doubts about stripping the BOM from `&str` arguments, because it is highly likely that the BOM was already stripped when you got the string, and if there is another BOM at the beginning, it is probably part of the content (because the BOM is an ordinary character, U+FEFF, and can be used anywhere like any other character). However, when the `encoding` feature is not enabled, the `from_reader` and `from_str` cases are the same -- input should be in UTF-8. Should we strip the BOM in the `from_reader` case then?

This PR also adds a set of files in all supported encodings, most of which contain all characters that the corresponding encoding can handle. The utility project for generating them is in the `test-gen` folder.
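The BOM-less detection idea referenced above (the W3C recommendation, Appendix F of the XML spec) can be sketched as follows: with no BOM present, the byte layout of the leading `<?` of the XML declaration reveals the encoding family. This is an illustrative sketch, not quick-xml's actual code.

```rust
// Guess the encoding family from the first bytes when no BOM is present.
// Function name and labels are illustrative assumptions.
fn guess_without_bom(bytes: &[u8]) -> &'static str {
    match bytes {
        // '<' '?' as 16-bit big-endian units: 00 3C 00 3F
        [0x00, 0x3C, 0x00, 0x3F, ..] => "UTF-16BE (no BOM)",
        // '<' '?' as 16-bit little-endian units: 3C 00 3F 00
        [0x3C, 0x00, 0x3F, 0x00, ..] => "UTF-16LE (no BOM)",
        // "<?xm" as single bytes: UTF-8 or another ASCII-compatible encoding
        [0x3C, 0x3F, 0x78, 0x6D, ..] => "ASCII-compatible",
        _ => "unknown; assume UTF-8",
    }
}

fn main() {
    assert_eq!(guess_without_bom(&[0x3C, 0x00, 0x3F, 0x00]), "UTF-16LE (no BOM)");
    assert_eq!(guess_without_bom(&[0x00, 0x3C, 0x00, 0x3F]), "UTF-16BE (no BOM)");
    assert_eq!(guess_without_bom(b"<?xml"), "ASCII-compatible");
    println!("ok");
}
```

This is why running the check too late breaks it: once the parser has consumed the `<` byte, the distinctive leading pattern is gone.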