[Fix Issue #1070] Jsoup parse string not skipping BOM character correctly #1073
Issue #1070 shows a case where Jsoup does not parse an HTML page correctly and puts all of the content in the body instead of the head. The problem is reproducible only when using `Jsoup.parse(response.body())`. The other method, `Jsoup.connect(url).get()`, does not reproduce it.

After some digging, it turns out that the first character of the HTML page is the BOM character `65279` (U+FEFF). The second method, `Jsoup.connect(url).get()`, internally uses `DataUtil.parseInputStream()`, which handles the BOM character correctly. However, `Jsoup.parse(string)` constructs an `HtmlTreeBuilder` to parse the document directly, so the BOM handling never runs.

This PR fixes the issue by introducing the same handling in the `TreeBuilder` constructor. Before it passes the input reader to the tokenizer, it reuses the same helper function, `DataUtil.detectCharsetFromBom`, to detect the BOM character and skip it if needed.

This PR also adds a new test case in `DataUtilTest` to check that `Jsoup.parse(string)` works correctly with a BOM. With the fix applied, the case from issue #1070 produces the correct result.

There have already been several efforts to fix problems caused by the BOM character:
- Issue #348 is the first issue that mentions the problem. It was fixed in commit 3f9f33d, but that commit only fixes `parseByteData` (which later became `parseInputStream`).
- Several later commits refactored the BOM handling code, such as c3cbe1b and 4eb4f2b.
- The latest issue that mentions the problem is Issue #1003, with the corresponding fix in 0f7e0cc. However, that commit still only fixes `parseInputStream`, which is not used by `Jsoup.parse(string)`.

@jhy I'm not sure whether the `TreeBuilder` class is the best place for the fix. I'm willing to put more effort into improving it if you could give some more advice. Thanks :)
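
For reference, the idea behind the fix can be sketched without Jsoup at all. This is only an illustration, not the code in this PR: `BomSketch`, `stripLeadingBom`, and `skipLeadingBom` are hypothetical names, and the sketch assumes the input has already been decoded to characters, so the BOM shows up as the single char U+FEFF (decimal 65279).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative sketch only; these names are hypothetical and not part of Jsoup.
class BomSketch {
    // U+FEFF (decimal 65279): the BOM as it appears in already-decoded text.
    static final char BOM = '\uFEFF';

    // Drop a single leading BOM from a string before handing it to a parser.
    static String stripLeadingBom(String html) {
        if (html != null && !html.isEmpty() && html.charAt(0) == BOM) {
            return html.substring(1);
        }
        return html;
    }

    // Same idea for a Reader: peek at the first char and rewind unless it is a BOM.
    // Assumes the reader supports mark/reset (StringReader and BufferedReader do).
    static Reader skipLeadingBom(Reader reader) throws IOException {
        if (!reader.markSupported()) {
            throw new IllegalArgumentException("reader must support mark/reset");
        }
        reader.mark(1);
        int first = reader.read();
        if (first != BOM) {
            reader.reset(); // not a BOM (or empty input): rewind to the start
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        String html = BOM + "<html><head><title>t</title></head><body></body></html>";
        System.out.println((int) html.charAt(0));                       // 65279
        System.out.println(stripLeadingBom(html).startsWith("<html>")); // true
        Reader r = skipLeadingBom(new StringReader(html));
        System.out.println((char) r.read());                            // <
    }
}
```

Until a fix lands, a caller could presumably apply something like `stripLeadingBom` to the response body before calling `Jsoup.parse(string)`, though that is a workaround rather than what this PR proposes.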