Fix bad unicode character issue #579

koechkevin · 2024-02-27T11:23:28Z

Description

Fixes an issue in our initial html2text-based implementation where Meedan Check returns a bad unicode character error.
Replaces html2text with trafilatura.
A formatted input from this post is now processed as shown in
output.txt

Fixes (issue)

Type of change

Bug fix (non-breaking change which fixes an issue)

Screenshots

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation

kilemensi · 2024-02-27T11:41:33Z

Before reviewing the code, what is the actual issue here @koechkevin? Does Meedan Check not accept links?

koechkevin · 2024-02-27T11:47:29Z

@kilemensi The way html2text handles some href links produces invalid unicode characters, which causes the error. Of all the options mentioned in the documentation, ignore_links is the only one that seemed to work for this specific case. Other uploads may still introduce another edge case.

kilemensi · 2024-02-27T12:02:18Z

The option may work, but I still don't understand the issue @koechkevin: why would links raise a unicode error? Maybe there is actually a unicode error in the original article itself?

Does Meedan Check expect unicode or ascii?
Is there a maximum length limit?

If Meedan Check accepts unicode:

Have we tried the unicode_snob option?
If that didn't work, did we try manually encoding, e.g. html2text(html).encode("utf-8", errors="replace").decode("utf-8", errors="ignore")?
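
A minimal sketch of the manual encoding round trip suggested above, using only the standard library. The function name `sanitize_text` is hypothetical, and whether Meedan Check would accept the result is an assumption that is not verified here. With `errors="replace"` on encoding, Python substitutes `?` for any code point that cannot be encoded as UTF-8, for example a lone surrogate.

```python
def sanitize_text(text: str) -> str:
    """Round-trip text through UTF-8, replacing unencodable code points.

    Hypothetical helper: illustrates the suggestion only, not the PR's code.
    """
    # Lone surrogates cannot be encoded as UTF-8; "replace" turns them into "?".
    encoded = text.encode("utf-8", errors="replace")
    # The bytes are valid UTF-8 at this point, so "ignore" is only a safety net.
    return encoded.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    print(sanitize_text("Cairo \ud800 Feb 9"))  # prints: Cairo ? Feb 9
```

Plain ASCII input, such as the Reuters URL discussed in this thread, passes through this round trip unchanged, so it would not help if the rejected characters are ordinary ASCII.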

koechkevin · 2024-02-27T13:19:33Z

@kilemensi this is the encoded URL that causes the issue after being parsed by html2text: https://www.reuters.com/world/middle-east/egypt-steps-up-security-border-israeli-offensive-gaza-nears-2024-02-09/#:~:text=CAIRO%2C%20Feb%209%20\(Reuters\),two%20Egyptian%20security%20sources%20said.

kilemensi · 2024-02-27T13:32:08Z

But this doesn't seem to have any unicode issues @koechkevin ... as a matter of fact, it's actually just ascii. How does it look after being parsed by html2text?

koechkevin · 2024-02-27T13:40:54Z

Yes, there is no unicode issue @kilemensi, but html2text adds escape characters around the brackets in (Reuters), which Meedan rejects.
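
To illustrate the escaping described here, a small sketch that removes backslashes placed directly before round brackets. The helper name `unescape_parens` is hypothetical and this is not what the PR ended up doing; it only shows what the escaped and unescaped forms look like.

```python
import re

# Matches a backslash that immediately precedes a round bracket.
_ESCAPED_PAREN = re.compile(r"\\([()])")


def unescape_parens(text: str) -> str:
    """Drop markdown-style backslash escapes in front of ( and )."""
    return _ESCAPED_PAREN.sub(r"\1", text)


if __name__ == "__main__":
    escaped = r"#:~:text=CAIRO%2C%20Feb%209%20\(Reuters\),two%20Egyptian"
    print(unescape_parens(escaped))
    # prints: #:~:text=CAIRO%2C%20Feb%209%20(Reuters),two%20Egyptian
```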

koechkevin · 2024-02-27T14:02:18Z

Egypt

koechkevin · 2024-03-01T07:54:31Z

@kilemensi based on caiosba's response, I suggest we go with this implementation because html2text renders links as markdown

kilemensi · 2024-03-01T07:58:44Z

I think caiosba is saying links are important @koechkevin:

"for the URLs, since those can be sent to end users on the tiplines, I suggest you simplify them and remove the fragments, in order to guarantee that WhatsApp linkify them correctly"

If by current implementation you mean removing links, doesn't that mean URLs won't be sent to the end users?
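
For reference, removing a fragment as caiosba suggests can be sketched with the standard library's `urllib.parse.urldefrag`. The wrapper name `strip_fragment` is hypothetical, and this is an illustration rather than code from this PR.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its #fragment part."""
    # urldefrag returns a (url, fragment) named tuple; keep only the URL.
    return urldefrag(url).url


if __name__ == "__main__":
    reuters = (
        "https://www.reuters.com/world/middle-east/"
        "egypt-steps-up-security-border-israeli-offensive-gaza-nears-2024-02-09/"
        "#:~:text=CAIRO%2C%20Feb%209%20(Reuters),two%20Egyptian%20security%20sources%20said."
    )
    print(strip_fragment(reuters))
```

Applied to the Reuters link from this thread, this drops everything from `#:~:text=` onward, including the bracketed (Reuters) part.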

koechkevin · 2024-03-01T08:39:43Z

@kilemensi yes

koechkevin · 2024-03-06T13:50:59Z

@kilemensi I have updated this PR.

As per this,

Links are useful and the content should be plaintext (no markdown). However, since html2text only converts to markdown, I have used Beautiful Soup to convert the HTML to text with the links (href and img src) appended, so we won't get bad unicode errors anymore.
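
A rough sketch of the "plaintext with appended links" idea. To keep it dependency-free it uses the standard library's `html.parser` as a stand-in for Beautiful Soup, so it is not the code in this PR; the class and function names are hypothetical, and the choice to put each link in parentheses after its text is an assumption.

```python
from html.parser import HTMLParser


class TextWithLinks(HTMLParser):
    """Collect visible text and append each href / img src in parentheses."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._hrefs = []  # stack of hrefs for currently open <a> tags

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self._hrefs.append(attributes.get("href"))
        elif tag == "img" and attributes.get("src"):
            self.parts.append(f" ({attributes['src']}) ")

    def handle_endtag(self, tag):
        if tag == "a" and self._hrefs:
            href = self._hrefs.pop()
            if href:
                self.parts.append(f" ({href})")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Convert HTML to single-spaced plaintext with links appended."""
    parser = TextWithLinks()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    sample = '<p>Read <a href="https://example.com/a">this</a> now.</p>'
    print(html_to_text(sample))  # prints: Read this (https://example.com/a) now.
```

Because no markdown is generated, nothing in the URL is backslash-escaped; the trade-off is that block structure such as paragraphs and lists is flattened.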

kilemensi · 2024-03-06T14:14:03Z

The Beautiful Soup output isn't very good @koechkevin... let's just pull out the big gun and use trafilatura then.

koechkevin · 2024-03-07T10:10:44Z

@kilemensi this is updated

kilemensi · 2024-03-07T10:32:09Z

Can you include an output sample in the PR description @koechkevin ?

kilemensi · 2024-03-07T12:03:47Z

Is output.txt the current output @koechkevin? Asking because I see links so not sure if this is better than html2text or not.

koechkevin · 2024-03-07T12:10:38Z

@kilemensi Yes, output.txt is the output of the current implementation. It is better than html2text because it does not convert the whole HTML to markdown; only the parts with links are formatted like markdown, and no bad unicode characters are included.

3rdparty/py/requirements-all.txt

pesacheck_meedan_bridge/py/main.py

Fix bad unicode character

f1d8de7

koechkevin added the bug Something isn't working label Feb 27, 2024

koechkevin self-assigned this Feb 27, 2024

koechkevin requested review from kilemensi and thepsalmist February 27, 2024 11:23

Pants format

bfb266f

Update version, use preserve links

96ecd49

Merge branch 'main' into bugfix/bad-unicode-characters

0f56bdb

Update latest version

5be3334

kilemensi reviewed Mar 7, 2024

Rename variables

338c16b

koechkevin requested a review from kilemensi March 7, 2024 12:58

kilemensi approved these changes Mar 7, 2024

Remove unneeded file

1ad1395

koechkevin merged commit 2a2d92f into main Mar 7, 2024
3 checks passed

koechkevin deleted the bugfix/bad-unicode-characters branch March 7, 2024 13:15
