Missing images in full content mode #506

Closed
4 of 5 tasks
jocmp opened this issue Nov 14, 2024 · 14 comments

@jocmp
Owner

jocmp commented Nov 14, 2024

Background

Capy Reader uses a library called Readability4J that has a few rules to parse the article's full content.

Sometimes those rules fail, leading to missing images in Capy's full content mode. This is an annoying issue without a single fix-all solution. Every website is different and changes over time, which is part of the beauty and chaos of the web.

If you run into this issue with a feed, please post a link to the feed with an example to this thread. I'll track these to fix at some point in the future. Thanks!
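
When reporting a feed, it can help to list exactly which images the full content view dropped. The following is a minimal sketch in Python, not Capy Reader's or Readability4J's code, that compares `<img>` sources in the original page against those in the extracted content. The names `image_sources` and `missing_images` are hypothetical, and the sketch only looks at the `src` attribute, so lazy-loaded or `srcset`-only images would need extra handling.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are also routed here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html):
    """Return the list of image URLs found in an HTML string."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


def missing_images(original_html, extracted_html):
    """Return image URLs present in the original page but absent after extraction."""
    kept = set(image_sources(extracted_html))
    return [src for src in image_sources(original_html) if src not in kept]
```

Given the page HTML and the parser's output as strings, `missing_images(page_html, extracted_html)` returns the dropped image URLs, which can be pasted into a report alongside the feed link.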

Feeds

@jocmp added the "bug" label (Something isn't working) on Nov 14, 2024
@jocmp self-assigned this on Nov 14, 2024
@jocmp moved this to On Deck in Capy Reader on Nov 14, 2024
@jocmp removed the status in Capy Reader on Nov 27, 2024
@jocmp changed the title from "Investigate missing images" to "Missing images in full content mode" on Nov 27, 2024
@jocmp pinned this issue on Nov 27, 2024
@jocmp moved this to Parking Lot in Capy Reader on Nov 27, 2024
@PhilC813
Contributor

The articles' main image isn't shown in Capy's full content mode for the following feed:
https://mobilesyrup.com/feed/

Article example:
https://mobilesyrup.com/2024/11/28/google-releases-ai-generated-pieces-chess-game/

(I only noticed this today so maybe it used to work?)

HTML of the image:
<img fetchpriority="high" width="1867" height="1046" src="https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess.jpg" class="attachment-full size-full wp-post-image" alt="" decoding="async" srcset="https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess.jpg 1867w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-300x168.jpg 300w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-1024x574.jpg 1024w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-768x430.jpg 768w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-1536x861.jpg 1536w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-417x235.jpg 417w" sizes="(max-width: 1867px) 100vw, 1867px" />

@jocmp
Owner Author

jocmp commented Nov 30, 2024

@PhilC813 an update. I'm toying around with Mercury Parser again and seeing some potential upsides. Here's a comparison of a Les Versants article.

[Screenshots: Les Versants article, before and after]

[Screenshots: Mobile Syrup article, before and after]

@PhilC813
Contributor

Wow, this seems very promising.

Do you mind checking with this article?
https://mobilesyrup.com/2024/11/28/here-are-the-2024-staples-black-friday-deals/

It's an article with Black Friday deals, and the current parser basically removes all the bullet points in which the deals are listed 😅

@jocmp
Owner Author

jocmp commented Nov 30, 2024

The new parser skips over lists by default, but with a little bit of code it works: https://github.com/jocmp/capyreader/pull/569/files#diff-a5310ab57bf17835286b2a012ceca522b0f9af190ceeea2dcf80c52f82c6479dR41-R49
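
The linked PR is not reproduced here. As a generic illustration of exempting a tag from a cleanup rule, here is a toy cleaner in Python. It is not Mercury Parser's API or algorithm: the `ToyCleaner` class, the `clean` helper, and the `drop`/`keep` parameters are all made up for this sketch, and it assumes well-formed markup with explicit closing tags.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ToyCleaner(HTMLParser):
    """Drops whole subtrees for tags in `drop`, except tags listed in `keep`."""

    def __init__(self, drop, keep=()):
        super().__init__(convert_charrefs=True)
        self.drop = set(drop) - set(keep)  # `keep` overrides the drop rule
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if not self.skip_depth and tag not in self.drop:
                self.out.append(self.get_starttag_text())
            return
        if self.skip_depth or tag in self.drop:
            self.skip_depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean(html, drop, keep=()):
    """Return `html` with dropped subtrees removed."""
    cleaner = ToyCleaner(drop, keep)
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Calling `clean(page, drop={"ul", "aside"})` removes lists along with the sidebar, while `clean(page, drop={"ul", "aside"}, keep={"ul"})` keeps the bullet points and still removes the `<aside>`.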

@PhilC813
Contributor

PhilC813 commented Nov 30, 2024

So you can easily specify the <ul> tag as an exception, sweet. Frankly I don't really see a reason why they would be excluded by default. They are more likely to be content than ads.

Also, is there any parser that is still actively maintained? Mercury seems abandoned, like Readability4J. It's not necessarily a problem, but having an active project is always a plus.

@jocmp
Owner Author

jocmp commented Nov 30, 2024

Couldn't agree more. I think Mercury is the more extensible and maintainable of the two. I forked it and I'm working on bringing its dependencies up to date here: https://github.com/jocmp/mercury-parser.

@PhilC813
Contributor

PhilC813 commented Dec 4, 2024

I've updated the app to 2024.12.1080-dev and despite the reintroduction of Mercury, I'm not seeing the results you shared above with the quick check I've done with the feed "Les Versants".

Screenrecorder-20241204-011225.mp4

As you can see, in the same article you used for testing, the headline is still missing, and all those grey enclosures further down actually correspond to ad placements. Then there's the last ad of the page that does manage to render.


Also, it seems like the sticky configuration of the "Extract full content" button doesn't work properly in this build.

In an article, if you tap the button to turn it off, tap it again to turn it back on, and then move to another article in the same feed, the setting is off again when that article opens.

@jocmp
Owner Author

jocmp commented Dec 4, 2024

Let me take another look. I may be able to filter out those ad placements too. Just to make sure I'm testing the same thing, are you using a local account?

About the sticky config, I'm able to reproduce that bug. I'll follow up with a different ticket to fix that. #576

@PhilC813
Contributor

PhilC813 commented Dec 5, 2024

> Just to make sure I'm testing the same thing, are you using a local account?

I'm using Capy with my Feedbin account.

> About the sticky config, I'm able to reproduce that bug. I'll follow up with a different ticket to fix that. #576

Don't give up!! 😆

@jocmp
Owner Author

jocmp commented Dec 5, 2024

Aha, I use Feedbin's copy of Mercury Parser for those accounts. Local accounts rely on the Mercury Parser that I'm updating. So they're different right now.

I'll see what I can do to use the same version of the parser everywhere. It should result in a more consistent experience across the board.

@jocmp
Owner Author

jocmp commented Dec 7, 2024

@PhilC813 I enabled the updated Mercury Parser for Feedbin accounts in 2024.12.1081-dev and also fixed the sticky content bug. Let me know how it works for you!

@PhilC813
Contributor

PhilC813 commented Dec 7, 2024

Seeing some extremely positive results so far. I'm also seeing some YouTube videos that were filtered out before now being displayed properly. Solid update..!

@jocmp unpinned this issue on Dec 7, 2024
@jocmp added the "full content request" label and removed the "bug" label (Something isn't working) on Dec 7, 2024
@privacyadmin

privacyadmin commented Dec 21, 2024

Possible to fix articles for this domain?

Seems like all the text and images in their articles are missing/incomplete.

Below are some examples

https://www.hardwarezone.com.sg/feature-how-spot-potential-scam-messages-ios-and-android-singapore-rcs-sms

https://www.phoronix.com/news/Raspberry-Pi-HEVC-H265-Decode

@jocmp
Owner Author

jocmp commented Dec 22, 2024

Hey @privacyadmin, I'll take a look. Can you open a new issue for each of those feeds using this template? https://github.com/jocmp/capyreader/issues/new?labels=full%20content%20request&template=2-full-content-request.yml

I want to close out this mega-issue since it's hard to track.

@jocmp closed this as completed on Dec 22, 2024
@github-project-automation (bot) moved this from Parking Lot to Done in Capy Reader on Dec 22, 2024