Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bad unicode character issue #579

Merged
merged 7 commits into from
Mar 7, 2024
Merged

Conversation

koechkevin
Copy link
Contributor

@koechkevin koechkevin commented Feb 27, 2024

Description

  • Fix an issue with our initial implementation of html2text where meedan returns bad unicode character error.
  • Replaces html2text with trafilatura.
  • A formatted input from this post would now be processed as
    output.txt

Fixes (issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Screenshots

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

@koechkevin koechkevin added the bug Something isn't working label Feb 27, 2024
@koechkevin koechkevin self-assigned this Feb 27, 2024
@kilemensi
Copy link
Member

Before review the code, what is the actual issue here @koechkevin ? Meedan Check doesn't accept links?

@koechkevin
Copy link
Contributor Author

@kilemensi How html2text handles some href links created some invalid unicode characters causing the error. Of all the options mentioned in the documentation, ignore_links is the only that seemed to work for the specific case. Maybe another edge case will be introduced with other uploads

@kilemensi
Copy link
Member

The option may work but I still don't understand the issue @koechkevin: Why would links raise unicode error? May be there is actually unicode error in the original article itself?

  1. Does Meedan Check expect unicode or ascii?
  2. Is there a maximum length limit?

If Meedan Check accepts unicode,

  1. Have we tried the unicode_snob option?
  2. If this didn't work, did we try manually encoding e.g. html2text(html).encode("utf-8", errors="replace").decode("utf-8", errors="ignore")?

@koechkevin
Copy link
Contributor Author

@kilemensi
Copy link
Member

@kilemensi this is the encoded url that causes the issues after parse by html2text https://www.reuters.com/world/middle-east/egypt-steps-up-security-border-israeli-offensive-gaza-nears-2024-02-09/#:~:text=CAIRO%2C%20Feb%209%20(Reuters),two%20Egyptian%20security%20sources%20said.

But this doesn't seem to have any unicode issues @koechkevin ... as a matter of fact, it's actually just ascii. How does it look after being parsed by html2text?

@koechkevin
Copy link
Contributor Author

@kilemensi this is the encoded url that causes the issues after parse by html2text https://www.reuters.com/world/middle-east/egypt-steps-up-security-border-israeli-offensive-gaza-nears-2024-02-09/#:~:text=CAIRO%2C%20Feb%209%20(Reuters),two%20Egyptian%20security%20sources%20said.

But this doesn't seem to have any unicode issues @koechkevin ... as a matter of fact, it's actually just ascii. How does it look after being parsed by html2text?
Yes, no unicode issue @kilemensi but html2text adds escape characters around brackets on(Reuters) which is rejected by meedan.

@koechkevin
Copy link
Contributor Author

Egypt

@koechkevin
Copy link
Contributor Author

@kilemensi based on caiosba's response, I suggest we go with this implementation because html2text implements links as markdown

@kilemensi
Copy link
Member

@kilemensi based on caiosba's response, I suggest we go with this implementation because html2text implements links as markdown

I think caiosba's saying links are important @koechkevin :

for the URLs, since those can be sent to end users on the tiplines, I suggest you simplify them and remove the fragments, in order to guarantee that WhatsApp linkify them correctly

If by current implementation you mean removing links then doesn't that mean URLs won't be sent to the end users?

@koechkevin
Copy link
Contributor Author

@kilemensi based on caiosba's response, I suggest we go with this implementation because html2text implements links as markdown

I think caiosba's saying links are important @koechkevin :

for the URLs, since those can be sent to end users on the tiplines, I suggest you simplify them and remove the fragments, in order to guarantee that WhatsApp linkify them correctly

If by current implementation you mean removing links then doesn't that mean URLs won't be sent to the end users?

@kilemensi yes

@koechkevin
Copy link
Contributor Author

@kilemensi I have updated this PR.

As per this,

Links are useful and content should be plaintext (No markdown). However, since html2text only converts to markdown, I have used Beautiful Soup to convert this to text appended links(href & img src) so we won't be having bad unicode errors anymore.

@kilemensi
Copy link
Member

Links are useful and content should be plaintext (No markdown). However, since html2text only converts to markdown, I have used Beautiful Soup to convert this to text appended links(href & img src) so we won't be having bad unicode errors anymore.

beatufulsoup output isn't very good @koechkevin... lets just pull the big gun out and use trafilatura then.

@koechkevin
Copy link
Contributor Author

Links are useful and content should be plaintext (No markdown). However, since html2text only converts to markdown, I have used Beautiful Soup to convert this to text appended links(href & img src) so we won't be having bad unicode errors anymore.

beatufulsoup output isn't very good @koechkevin... lets just pull the big gun out and use trafilatura then.

@kilemensi this is updated

@kilemensi
Copy link
Member

@kilemensi this is updated

Can you include an output sample in the PR description @koechkevin ?

@kilemensi
Copy link
Member

Is output.txt the current output @koechkevin? Asking because I see links so not sure if this is better than html2text or not.

@koechkevin
Copy link
Contributor Author

Is output.txt the current output @koechkevin? Asking because I see links so not sure if this is better than html2text or not.

@kilemensi Yes, output.txt is the current implementation. It is better than html2text because, it does not process the whole html as a markdown but only parts with the links are formatted like markdown and no bad unicode characters included.

3rdparty/py/requirements-all.txt Outdated Show resolved Hide resolved
pesacheck_meedan_bridge/py/main.py Outdated Show resolved Hide resolved
pesacheck_meedan_bridge/py/main.py Outdated Show resolved Hide resolved
@koechkevin koechkevin requested a review from kilemensi March 7, 2024 12:58
3rdparty/py/requirements-all.txt Outdated Show resolved Hide resolved
@koechkevin koechkevin merged commit 2a2d92f into main Mar 7, 2024
3 checks passed
@koechkevin koechkevin deleted the bugfix/bad-unicode-characters branch March 7, 2024 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants