Conditionally fetch blog RSS feed - performance improvement #2831

Open
tfmorris opened this issue Jan 8, 2020 · 9 comments · May be fixed by #10015
Labels
Good First Issue - Easy issue. Good for newcomers. [managed]
Lead: @mekarpeles - Issues overseen by Mek (Staff: Program Lead) [managed]
Needs: Feedback - A proposed feature or bug resolution needs community feedback prior to forging ahead. [managed]
Priority: 3 - Issues that we can consider at our leisure. [managed]
Theme: Performance - Issues related to UI or Server performance. [managed]
Type: Refactor/Clean-up - Issues related to reorganization/clean-up of data or code (e.g. for maintainability). [managed]

Comments

@tfmorris
Contributor

tfmorris commented Jan 8, 2020

Currently we unconditionally fetch and XML-parse the RSS feed. We do cache the result, but only for 5 minutes, and the feed changes on the scale of months, not minutes. We should use one of HTTP's conditional GET mechanisms to fetch the payload only when it has changed.

Describe the problem that you'd like solved

The payload includes

<channel>
    <lastBuildDate>Wed, 27 Nov 2019 19:06:51 +0000</lastBuildDate>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>

which could be used to tailor the polling period, but by the time we've done all the XML parsing, we've done the bulk of the work, so we should instead conditionally fetch the URL and only do the parsing if we get new data.

Proposal & Constraints

The response headers include Last-Modified and ETag:

Last-Modified: Wed, 27 Nov 2019 19:06:51 GMT
ETag: "e520140570cbdfad48c7432feffb9d91-gzip"

and the Last-Modified time matches the payload build time, so we can use request headers of either If-Modified-Since or If-None-Match to conditionally GET the RSS feed, shortcutting any additional work if no payload is returned.
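A minimal sketch of the conditional GET described above, using only the Python standard library. The names `FeedResult`, `build_conditional_request`, and `fetch_if_changed` are hypothetical and not part of the Open Library codebase, and the `urlopen` parameter exists only so the function can be exercised without a network. Note that `urllib` surfaces a 304 as an `HTTPError`, so the sketch catches it instead of inspecting a status code.

```python
import urllib.error
import urllib.request
from typing import NamedTuple, Optional


class FeedResult(NamedTuple):
    """Hypothetical container: body is None when the server says 'not modified'."""
    body: Optional[bytes]
    etag: Optional[str]
    last_modified: Optional[str]


def build_conditional_request(url, etag=None, last_modified=None):
    """Attach whichever validators we remembered from the previous response."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None,
                     urlopen=urllib.request.urlopen):
    """Return the new payload and validators, or body=None on 304 Not Modified."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urlopen(request, timeout=10) as response:
            return FeedResult(
                body=response.read(),
                etag=response.headers.get("ETag"),
                last_modified=response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Nothing changed: keep the old validators and skip XML parsing.
            return FeedResult(None, etag, last_modified)
        raise
```

A caller would parse the XML only when `body` is not `None`, and would need to keep the returned `etag` and `last_modified` somewhere that outlives the cached posts.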

Additional context

Where we cache the blog posts:

_get_blog_feeds = cache.memcache_memoize(
    _get_blog_feeds, key_prefix="upstream.get_blog_feeds", timeout=5 * 60
)

Stakeholders

@tfmorris added the Theme: Performance label Jan 8, 2020
@xayhewalo added the Lead: @mekarpeles, Priority: 3, State: Backlogged, and Type: Refactor/Clean-up labels Jan 14, 2020
@mekarpeles added the Good First Issue label Dec 19, 2022
@BenjaminVC

Hello,

I'm new to this codebase and interested in tackling this. Could I be assigned to this issue? Any additional guidance would be appreciated.

Thanks!

@XiyueJasmineZhang

@mekarpeles Can I please take on this issue?

@tusharv01

@tfmorris @mekarpeles @xayhewalo Can you please assign me this issue and guide me further if I get stuck on any problem?

@RayBB
Collaborator

RayBB commented Mar 21, 2024

@tfmorris How would you feel about increasing the cache time from 5 minutes to 1 day?

I agree it is wasteful to parse all the XML so often, and I think it would be a fair tradeoff to increase the polling time.

The alternative of getting/storing/sending the last modified time and checking the headers seems like a decent amount of work for relatively little benefit.

@tfmorris
Contributor Author

Increasing the cache timeout sounds fine, but you might want to check with @mekarpeles concerning how much staleness he's willing to live with. Even increasing it to the 1 hr RSS polling interval would be a 12x reduction in work though.
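The 12x figure can be checked with a little arithmetic. The sketch below computes an upper bound on feed refreshes per day for each timeout discussed in this thread; `max_refreshes_per_day` is a hypothetical helper, and the bound assumes the cache entry is re-requested as soon as it expires.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_refreshes_per_day(timeout_seconds: int) -> int:
    """Upper bound on fetch-and-parse cycles per day for a given cache timeout."""
    return SECONDS_PER_DAY // timeout_seconds


current = max_refreshes_per_day(5 * 60)       # 5-minute timeout
hourly = max_refreshes_per_day(60 * 60)       # 1-hour timeout
daily = max_refreshes_per_day(24 * 60 * 60)   # 1-day timeout

print(current, hourly, daily)   # 288 24 1
print(current // hourly)        # 12
```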

To be honest, I don't remember how I envisioned state management for this working, but you wouldn't need to look at response headers because you'd get an explicit 304 Not Modified response. Having said that, the way things are set up now, you'd have already incurred a cache miss, so you'd no longer have the information that you need to avoid redoing the work.
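One possible way around the state-management problem above, sketched under assumptions: keep the parsed posts and the validators (`ETag`, `Last-Modified`) together, and let only a "last checked" timestamp expire. When the recheck interval elapses, the feed is revalidated, and the XML is parsed again only if a new body comes back. `RevalidatingFeedCache`, `fetch`, and `parse` are hypothetical names, and the sketch holds its state in process memory, whereas the existing code caches through memcache, so this is an illustration, not a drop-in change.

```python
import time
from typing import Any, Callable, Optional, Tuple

# fetch(etag, last_modified) -> (body or None if unchanged, etag, last_modified)
Fetch = Callable[
    [Optional[str], Optional[str]],
    Tuple[Optional[bytes], Optional[str], Optional[str]],
]


class RevalidatingFeedCache:
    """Keeps parsed posts indefinitely; only the freshness check expires."""

    def __init__(self, fetch: Fetch, parse: Callable[[bytes], Any],
                 recheck_seconds: float = 5 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._parse = parse
        self._recheck_seconds = recheck_seconds
        self._clock = clock
        self._posts: Any = None
        self._etag: Optional[str] = None
        self._last_modified: Optional[str] = None
        self._checked_at: Optional[float] = None

    def get_posts(self) -> Any:
        now = self._clock()
        fresh = (self._checked_at is not None
                 and now - self._checked_at < self._recheck_seconds)
        if fresh:
            return self._posts  # no network, no parsing

        body, etag, last_modified = self._fetch(self._etag, self._last_modified)
        if body is not None:
            # The feed changed (or this is the first fetch): parse and remember.
            self._posts = self._parse(body)
            self._etag, self._last_modified = etag, last_modified
        # On "not modified" the old posts and validators are simply reused.
        self._checked_at = now
        return self._posts
```

The design choice here is that a stale freshness marker costs one small conditional request, not a full download and parse.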

@RayBB added the Needs: Feedback label Mar 21, 2024
@astatide astatide linked a pull request Nov 8, 2024 that will close this issue
@astatide

astatide commented Nov 9, 2024

Hi hi! I took the liberty of opening up a PR that addresses this. Comments and concerns definitely appreciated!

#10015

@github-actions bot added the Needs: Response (Issues which require feedback from lead) label Nov 9, 2024
@mekarpeles
Member

mekarpeles commented Nov 11, 2024

Presumably now we're only calling this when we recache the home page so it's probably not a huge cost.

@astatide

Ah, okay. Do you think it's still worth pursuing the PR, then?

Also, curious if there's any other good first issue stuff, in that case... (most everything seems to be assigned and/or have an attached in-progress PR)

@jimchamp removed the Needs: Response (Issues which require feedback from lead) label Nov 13, 2024
@jimchamp
Collaborator

> Ah, okay. Do you think it's still worth pursuing the PR, then?

I think that increasing the time that blog posts are cached to a day should solve this issue to everybody's satisfaction (anybody can correct me if I'm wrong).

Staff will be able to evict this cache entry if we want the latest changes to be displayed immediately.


9 participants