Fix bad unicode character issue #579
Conversation
Before reviewing the code, what is the actual issue here, @koechkevin? Does Meedan Check not accept links?
@kilemensi The way html2text handles some href links produced invalid Unicode characters, which caused the error. Of all the options mentioned in the documentation, ignore_links is the only one that seemed to work for this specific case. Other uploads may still introduce another edge case.
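For reference, a minimal sketch of what the ignore_links option does in html2text. The helper name and the sample HTML are made up for illustration and are not taken from this PR; the import is done lazily only so the snippet loads even where html2text is not installed.

```python
def html_to_markdown_without_links(html: str) -> str:
    """Convert HTML to Markdown, keeping link text but dropping the URLs.

    Hypothetical helper for illustration; not the code in this PR.
    """
    # Third-party dependency (pip install html2text), imported lazily.
    import html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = True  # emit "text" instead of "[text](url)"
    return converter.handle(html)


if __name__ == "__main__":
    sample = '<p>See <a href="https://example.com/a_(b)">this article</a>.</p>'
    print(html_to_markdown_without_links(sample))
```

The trade-off is that every URL is dropped from the output, not only the problematic ones.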
The option may work, but I still don't understand the issue, @koechkevin: why would links raise a Unicode error? Maybe there is actually a Unicode error in the original article itself?
If Meedan Check accepts unicode,
@kilemensi This is the encoded URL that causes the issue after being parsed by html2text: https://www.reuters.com/world/middle-east/egypt-steps-up-security-border-israeli-offensive-gaza-nears-2024-02-09/#:~:text=CAIRO%2C%20Feb%209%20\(Reuters\),two%20Egyptian%20security%20sources%20said.
But this doesn't seem to have any Unicode issues, @koechkevin ... as a matter of fact, it's actually just ASCII. How does it look after being parsed by html2text?
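The ASCII claim is easy to check. A small sketch using only the standard library; non_ascii_chars is a hypothetical helper name, and the URL is copied verbatim from the comment above, backslashes included.

```python
def non_ascii_chars(text: str) -> list:
    """Return (index, character) pairs for every non-ASCII character in text."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


# URL exactly as pasted in this thread (raw string keeps the backslashes).
URL = (
    r"https://www.reuters.com/world/middle-east/"
    r"egypt-steps-up-security-border-israeli-offensive-gaza-nears-2024-02-09/"
    r"#:~:text=CAIRO%2C%20Feb%209%20\(Reuters\),"
    r"two%20Egyptian%20security%20sources%20said."
)

if __name__ == "__main__":
    print(URL.isascii())         # True when every character is ASCII
    print(non_ascii_chars(URL))  # empty list when nothing is out of range
```

Running the same helper on the html2text output would show exactly which characters, if any, fall outside ASCII.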
@kilemensi Based on caiosba's response, I suggest we go with this implementation, because html2text renders links as Markdown.
I think caiosba is saying links are important, @koechkevin:
If by current implementation you mean removing links, then doesn't that mean URLs won't be sent to the end users?
@kilemensi Yes.
@kilemensi I have updated this PR. As per this, links are useful and the content should be plain text (no Markdown). However, since html2text only converts to Markdown, I have used Beautiful Soup to convert the HTML to plain text with the links (href and img src) appended, so we won't have bad Unicode errors anymore.
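A rough sketch of the "plain text with links appended" idea, for readers who want to see the shape of it. This is not the PR's code: it swaps Beautiful Soup for the standard library's html.parser so it runs without dependencies, and the output format (URL in parentheses after the link text) is an assumption, since the exact format is not shown in this thread.

```python
from html.parser import HTMLParser


class TextWithLinks(HTMLParser):
    """Collect visible text and append each href / img src in parentheses."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # text fragments and "(url)" markers, in order
        self._href_stack = []  # hrefs of the <a> tags currently open

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self._href_stack.append(attributes.get("href"))
        elif tag == "img" and attributes.get("src"):
            self.parts.append(f"({attributes['src']})")

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            href = self._href_stack.pop()
            if href:
                # The URL is appended untouched, with no Markdown escaping.
                self.parts.append(f"({href})")

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def html_to_text_with_links(html: str) -> str:
    """Hypothetical helper: HTML in, plain text with appended URLs out."""
    parser = TextWithLinks()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    sample = (
        '<p>Read <a href="https://example.com/a_(b)">the report</a> now.</p>'
        '<img src="https://example.com/pic.png">'
    )
    print(html_to_text_with_links(sample))
```

Because the URL is copied as-is rather than rendered as a Markdown link, characters such as parentheses are not backslash-escaped in the output.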
@kilemensi This is updated.
Can you include an output sample in the PR description, @koechkevin?
Is output.txt the current output, @koechkevin? Asking because I see links, so I'm not sure whether this is better than html2text or not.
@kilemensi Yes, output.txt is from the current implementation. It is better than html2text because it does not convert the whole HTML to Markdown: only the parts with links are formatted like Markdown, and no bad Unicode characters are included.
Description
html2text
with trafilatura
output.txt
Fixes (issue)