There's still a need to bump the memcache size #1107

Open
jidanni opened this issue Jun 27, 2024 · 26 comments

@jidanni

jidanni commented Jun 27, 2024

Hello. In openstreetmap/openstreetmap-website#2457 I was told to open an issue here. But as it is getting a little over my head, I will just leave this here.

@tomhughes
Member

There is no evidence at all in the graphs that this is in fact an issue. I definitely see the issue that you are referring to but I am unable to explain it, as all evidence says it shouldn't be down to memcache.

@tbertels

tbertels commented Jul 6, 2024

Could these session disconnections be caused by server restarts, or does the server never restart?

@mmd-osm

mmd-osm commented Jul 6, 2024

I don't see any server restart in the stats, at least for the last 6 months: https://prometheus.openstreetmap.org/d/l4zgNUdMz/memcached?orgId=1&refresh=1m&from=now-6M&to=now

Also, the OP didn't provide any details on how frequently they have to log in again. There might be external factors, like cookies being removed by the browser or by some browser extension.

@jidanni
Author

jidanni commented Jul 7, 2024

I thought everybody else also has to log in again at least once every three or four days.
Maybe it's because I use various browsers on various devices. But why, on the same device, do I need to log in again after three or four days?
Anyway, you are welcome to check the logs to see why user jidanni has to log in again so often.

@tbertels

tbertels commented Jul 7, 2024

Which stat do you use to check if the server restarted?
Aren't these sudden drops in memory usage symptoms of a server restart?
Note that the dates are in the format month/day.
Copie d'écran_20240707_143049m

@mmd-osm

mmd-osm commented Jul 7, 2024

Ah, the link wasn't that helpful. There are about 11 memcached instances overall. However, for the 3 frontend servers, only 3 memcached instances (spike-06 ... spike-08) are relevant. Items in cache and memory usage are fairly stable for these three.

https://prometheus.openstreetmap.org/d/l4zgNUdMz/memcached?orgId=1&refresh=1m&from=now-6M&to=now&var-instance=spike-06&var-instance=spike-07&var-instance=spike-08

I think this should match the following config in chef: https://github.com/openstreetmap/chef/blob/45dc24b65b23a6c1dcc2f0ba2aa971563555c35e/roles/web.rb#L20

@tomhughes
Member

A restart would indeed lose all sessions but as @mmd-osm says it's only those three machines that we're talking about here and they last restarted in November last year:

image

At that time it took nearly two months for the caches to fill up which suggests that it should take about that long for things to get expired unless there has been a significant increase in the cache usage since.

@pnorman
Collaborator

pnorman commented Jul 12, 2024

The eviction rate has increased since November but it hasn't consistently been more than double. Commands/second has remained the same.

@tbertels

I logged back in 5 days ago: 1 day later my session was still active but today I'm logged out.
We can also see a dip today from ~100 million items in cache to ~66 million.

I suggest storing the sessions in the DB and using memcache only to speed up session checks for frequently used sessions.

@tomhughes
Member

One of the machines was rebooted yesterday while fighting the DDoS, so 1/3 of the cache entries were lost.

@mmd-osm

mmd-osm commented Jul 12, 2024

I'm wondering how many of these entries originate from CGImap (key prefix would be "cgimap:"). For some reason, these entries have the expiration value set to 0 (unlimited). This doesn't make a whole lot of sense for rate limiting requests, where the time at which these entries become irrelevant is known upfront.

@mmd-osm

mmd-osm commented Jul 13, 2024

At least when testing locally, I've noticed that every anonymous user creates a rails session without expiry (that's the "0" in "1 0 73" below), whereas logged in users have an entry with 4-5 weeks expiration.

Anonymous user sessions:

/usr/share/memcached/scripts/memcached-tool localhost:11211 dump 
Dumping memcache contents
add rails:session:2::2d28d018bdda81f05bae57ba42ee200a7a14af6df74134bb93ee82f99bf7baab 1 0 73
{I"_csrf_token:EFI"096xa2ms9DVncEF7CBUeBJ0wP9VYJrKO6lzxqDomep74;F

Logged in user:

Expires at 1723288155 = Sat Aug 10 13:09:15 CEST 2024

add rails:session:2::2d28d018bdda81f05bae57ba42ee200a7a14af6df74134bb93ee82f99bf7baab 1 1723288155 200
{	I"_csrf_token:EFI"096xa2ms9DVncEF7CBUeBJ0wP9VYJrKO6lzxqDomep74;FI"	user;FiI"fingerprint;FI"E....
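
For reference, the header lines in the dumps above follow the pattern "add <key> <flags> <exptime> <bytes>", so the expiry is the second-to-last field. Below is a minimal Ruby sketch for decoding it. The names DumpEntry and parse_dump_header are made up for illustration and are not part of memcached-tool, and the key in the usage line is shortened.

```ruby
# Hypothetical helper: decode one header line of a "memcached-tool dump".
# Assumed line format: add <key> <flags> <exptime> <bytes>
DumpEntry = Struct.new(:key, :flags, :exptime, :bytes) do
  # An exptime of 0 means the item never expires on its own.
  def expires_at
    exptime.zero? ? nil : Time.at(exptime).utc
  end
end

def parse_dump_header(line)
  verb, key, flags, exptime, bytes = line.split
  # Value lines (the marshalled session data) do not start with "add".
  return nil unless verb == "add" && bytes

  DumpEntry.new(key, Integer(flags), Integer(exptime), Integer(bytes))
end

entry = parse_dump_header("add rails:session:2::example 1 1723288155 200")
puts entry.expires_at # => 2024-08-10 11:09:15 UTC
```

The printed time is the same instant as the "Sat Aug 10 13:09:15 CEST 2024" quoted above, just expressed in UTC.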

@tomhughes
Member

Expiry shouldn't really matter that much because anything that isn't used just moves down the LRU list and gets discarded eventually when we need space for a new entry.

Logged in sessions (with "remember me" checked) do get an expiry of 28 days which matches the cookie expiry while other sessions (not logged in and logged in without "remember me" checked) actually don't have an expiry but issue a session cookie that expires when the browser is closed.

@mmd-osm

mmd-osm commented Jul 13, 2024

First of all, I find it a bit difficult to reason about the logged in sessions based on Prometheus stats, in particular after how many days these entries would be discarded.

memcached has an LRU crawler which reclaims expired entries even before they reach the end of the LRU list. With a non-zero TTL, we might get rid of many "non-logged in user" entries early on, before they can evict "logged in user" entries.
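
To illustrate the argument, here is a toy Ruby model of an LRU cache with an expiry sweep. It is only a sketch under strong assumptions: real memcached uses slab classes and a segmented LRU, the sweep below merely stands in for the LRU crawler, and the class name, capacity and TTL values are invented for the example.

```ruby
# Toy model only; not how memcached is actually implemented.
class ToyCache
  Entry = Struct.new(:expires_at)

  attr_reader :evictions

  def initialize(capacity)
    @capacity = capacity
    @entries = {} # Ruby hashes keep insertion order: first key = least recently used
    @evictions = 0
  end

  # A ttl of 0 means "never expires", mirroring memcached's exptime convention.
  def set(key, ttl, now)
    reclaim_expired(now)
    @entries.delete(key)
    if @entries.size >= @capacity
      @entries.delete(@entries.keys.first) # evict a live entry from the LRU tail
      @evictions += 1
    end
    @entries[key] = Entry.new(ttl.zero? ? nil : now + ttl)
  end

  def key?(key)
    @entries.key?(key)
  end

  private

  # Stand-in for the LRU crawler: drop everything already past its expiry.
  def reclaim_expired(now)
    @entries.delete_if { |_, entry| entry.expires_at && entry.expires_at <= now }
  end
end

# One logged-in session (28 days) followed by a stream of anonymous sessions.
def simulate(anonymous_ttl)
  cache = ToyCache.new(3)
  cache.set("user-session", 28 * 86_400, 0)
  5.times { |i| cache.set("anon-#{i}", anonymous_ttl, (i + 1) * 4000) }
  [cache.key?("user-session"), cache.evictions]
end

p simulate(0)    # anonymous sessions without expiry
p simulate(3600) # anonymous sessions with a one-hour TTL
```

In this toy setup the logged-in session is pushed out once the cache fills with never-expiring anonymous entries, while with a short anonymous TTL the sweep frees space first and nothing live is evicted. Whether production traffic behaves the same way is exactly what the Prometheus graphs would have to confirm.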

@mmd-osm

mmd-osm commented Jul 23, 2024

At the current growth rate, we will likely see some evictions in about 10 days (=21 days after last memcached restart).

@jidanni : did you notice any issues with lost login sessions in the last 8-9 days? If so, it can’t be memcached related…

@tomhughes
Member

It's not that simple because only one machine was reset I think? So only keys which hash to that machine are currently exempt from being evicted.
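
A rough Ruby sketch of why a single restart only affects part of the key space: each key is assigned to exactly one server by hashing. The plain CRC32-modulo scheme below is an assumption made for brevity; real memcached clients typically use a consistent-hashing ring, so the actual assignment on the frontends will differ. The helper names are hypothetical.

```ruby
require "zlib"

# Simplified key-to-server mapping (assumption: plain modulo hashing).
SERVERS = %w[spike-06 spike-07 spike-08].freeze

def server_for(key, servers = SERVERS)
  servers[Zlib.crc32(key) % servers.size]
end

# Keys stored on the restarted server are gone; all others are untouched.
def surviving_keys(keys, restarted_server)
  keys.reject { |key| server_for(key) == restarted_server }
end

keys = (1..300).map { |i| "rails:session:2::#{i}" }
lost = keys.size - surviving_keys(keys, "spike-07").size
puts "#{lost} of #{keys.size} sample keys lived on spike-07"
```

Only the sessions mapped to the restarted machine start with an empty cache; sessions on the other two machines remain subject to eviction as before.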

@mmd-osm

mmd-osm commented Jul 23, 2024

I think spike-06..08 were all restarted; the aggregated cached items count on Prometheus shows 0 entries about 10 days ago.

@jidanni
Author

jidanni commented Jul 24, 2024

@mmd-osm rather than using my misty memory,
surely there must be some internal logs you can check regarding me (user: jidanni)
that can give you precise details.

@mmd-osm

mmd-osm commented Jul 24, 2024

We want to hear from you first hand, as you’ve also raised the issue. Misty memory is ok. If you say it hasn’t bothered you recently then that’s good enough for now.

What we see in the charts right now is that no entries are being removed. So chances are that your session is still around.

@jidanni
Author

jidanni commented Jul 24, 2024

Okay. I will remember next time to report each and every incident right here to the thread.

@jidanni
Author

jidanni commented Aug 5, 2024

Okay. Just had to log in again as you can see in your logs perhaps.

@mmd-osm

mmd-osm commented Aug 5, 2024

Thank you for the feedback. This is not completely unexpected. Evicting entries started again on August 1st, even a bit sooner than estimated.

@jidanni
Author

jidanni commented Aug 31, 2024

On a laptop I hadn't used in five days:
Had to login again to OSM.
But didn't need to login again to GitHub to add this comment.

@mmd-osm

mmd-osm commented Oct 13, 2024

At least 8 other users have reported the same issue in https://community.openstreetmap.org/t/osm-webseite-standiges-login-notig/120072

All different browsers, not only Firefox. I could also reproduce it today on my mobile.

spike-0[6-8] have been seeing some cache evictions again for a few days:

image

image

Following up on my previous comment about getting rid of anonymous sessions as early as possible, we could check how the GitLab project addressed the issue. They have had similar issues with Redis, with unauthenticated users filling up the memory. The Redis and Memcached implementations should be fairly similar; ['rack.session.options'][:expire_after] is also used by the memcached client.

Initially, Gitlab added a special helper for this purpose: https://gitlab.com/gitlab-org/gitlab/-/blob/ee088fc0d53198016e245c515f28e03d8229e297/app/controllers/application_controller.rb#L29 and some PRs on the topic: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/88514/diffs

Helper: https://gitlab.com/gitlab-org/gitlab/-/blob/ee088fc0d53198016e245c515f28e03d8229e297/app/helpers/sessions_helper.rb#L17-41

More recently they seem to have moved it to its own Rack middleware to cover more scenarios: https://gitlab.com/gitlab-org/gitlab/-/commit/8c85364205ccb1f4602ab3543d10ff55295bd6cc

This might be worth checking out.
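
To make the idea concrete, here is a minimal Ruby sketch of such a middleware. It is not the GitLab code. The class name, the default TTL and the check for a "user" session key are assumptions for illustration; the only part taken from the discussion is that the session store reads ['rack.session.options'][:expire_after].

```ruby
# Hypothetical middleware: give sessions without a logged-in user a short TTL.
class AnonymousSessionExpiry
  DEFAULT_TTL = 3600 # seconds; an arbitrary value chosen for this sketch

  def initialize(app, ttl: DEFAULT_TTL)
    @app = app
    @ttl = ttl
  end

  def call(env)
    status, headers, body = @app.call(env)

    session = env["rack.session"]
    options = env["rack.session.options"]
    # Assumption: a logged-in session carries a "user" key.
    if session && options && !session.key?("user")
      options[:expire_after] = @ttl
    end

    [status, headers, body]
  end
end

# Usage with plain hashes standing in for the real session objects:
app = ->(_env) { [200, {}, ["ok"]] }
env = { "rack.session" => { "_csrf_token" => "x" }, "rack.session.options" => {} }
AnonymousSessionExpiry.new(app, ttl: 600).call(env)
p env["rack.session.options"] # => {:expire_after=>600} (Ruby 3.4+ prints {expire_after: 600})
```

As far as I understand, this would have to run inside the session store middleware, so that the option is set before the session is written back. It would probably also give the anonymous cookie the same expiry, which would need checking against the current session-cookie behaviour.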

@mmd-osm

mmd-osm commented Oct 14, 2024

I've adjusted the GitLab code a bit to work with the osm website: https://github.com/mmd-osm/openstreetmap-website/tree/patch/sessionexpiry

It's more of a proof of concept at this time, to demo the idea. I can create a PR to continue the discussion, if needed.
It should also not interfere with session_persistence.rb and session_methods.rb, which define a cookie expiration for logged-in users only.

For testing, I recommend checking the results of "memcached-tool localhost:11211 dump" after each activity, in particular the TTL value. That's the second-to-last value in each line starting with "add rails:session:2:..." (format: unix epoch).

/fyi: @AntonKhorev


Meanwhile, memcached has also been restarted or purged, so we're down to 0 evictions for the next few weeks.

@jidanni
Author

jidanni commented Oct 19, 2024

Today had to login again.
