-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conditionally fetch blog RSS feed - performance improvement #2831
Comments
Hello, I'm new to this codebase and interested in tackling this. Could I be assigned to this issue? Any additional guidance would be appreciated. Thanks! |
@mekarpeles Can I please take on this issue? |
@tfmorris @mekarpeles @xayhewalo Can you please assign me this issue and guide me further if I am struck at any problem? |
@tfmorris How would you feel about increasing the cache time from 5 minutes to 1 day? I agree it is wasteful to parse all the xml so often and I think it is be a fair tradeoff to increase the polling time. The alternative of getting/storing/sending the last modified time and checking the headers seems like a decent amount of work for relatively little benefit. |
Increasing the cache timeout sounds fine, but you might want to check with @mekarpeles concerning how much staleness he's willing to live with. Even increasing it to the 1 hr RSS polling interval would be a 12x reduction in work though. To be honest, I don't remember how I envisioned state management for this working, but you wouldn't need to look at response headers because you'd get an explicit 304 Not Modified error response. Having said that, the way things are set up now, you'd have already incurred a cache miss, so you'd no longer have the information that you need to avoid redoing the work. |
Hi hi! I took the liberty of opening up a PR that addresses this. Comments and concerns definitely appreciated! |
Presumably now we're only calling this when we recache the home page so it's probably not a huge cost. |
Ah, okay. Do you think it's still worth pursuing the PR, then? Also, curious if there's any other good first issue stuff, in that case... (most everything seems to be assigned and/or have an attached in-progress PR) |
I think that increasing the time that blog posts are cached to a day should solve this issue to everybody's satisfaction (anybody can correct me if I'm wrong). Staff will be able to evict this cache entry if we want the latest changes to be displayed immediately. |
Currently we unconditionally fetch and XML parse the RSS feed. We do cache the result, but only for 5 minutes and the feed changes on the scale of months, not minutes. We should use one of HTTP's conditional GET mechanisms to only fetch the payload when it's changed.
Describe the problem that you'd like solved
The payload includes
which could be used to tailor the polling period, but by the time we've done all the XML parsing, we've done the bulk of the work, so we should instead conditionally fetch the URL and only do the parsing if we get new data.
Proposal & Constraints
The response headers include Last-Modified and ETag:
and the Last-Modified time matches the payload build time, so we can use request headers of either If-Modified-Since or If-None-Match to conditionally GET the RSS feed, shortcutting any additional work if no payload is returned.
Additional context
Where we cache the blog posts:
openlibrary/openlibrary/plugins/upstream/utils.py
Lines 1429 to 1431 in 077bcfe
Stakeholders
The text was updated successfully, but these errors were encountered: