Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

M_NOT_FOUND from synapse every time a new user tries to sign in #3354

Open
thefinn93 opened this issue Oct 9, 2024 · 2 comments
Open

M_NOT_FOUND from synapse every time a new user tries to sign in #3354

thefinn93 opened this issue Oct 9, 2024 · 2 comments
Labels
A-Jobs Related to asynchronous jobs T-Defect Something isn't working X-Blocked Blocked by another issue

Comments

@thefinn93
Copy link

thefinn93 commented Oct 9, 2024

Describe the bug
On our MAS instance, every time we attempt to onboard a new user, they are prevented from completing login and shown error request failed with status 404 Not Found: M_NOT_FOUND: Unknown user. A sample of the MAS logs when this happens:

2024-10-09T19:19:13.339723Z  INFO activity_tracker.flush: mas_handlers::activity_tracker::worker: crates/handlers/src/activity_tracker/worker.rs:219: Flushing 2 activity records to the database
2024-10-09T19:20:13.347047Z  INFO activity_tracker.flush: mas_handlers::activity_tracker::worker: crates/handlers/src/activity_tracker/worker.rs:219: Flushing 1 activity records to the database
2024-10-09T19:20:50.876417Z ERROR http.server.request{otel.kind="server" otel.name="POST /complete-compat-sso/:grant_id" network.protocol.name="http" network.protocol.version="1.1" http.request.method="POST" url.path="/complete-compat-sso/01J9SBK4850FS9RTKXBYEX7JD4" url.scheme="http" http.route="/complete-compat-sso/:grant_id" user_agent.original="Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Mobile Safari/537.36"}:handlers.compat.login_sso_complete.post{compat_sso_login.id=01J9SBK4850FS9RTKXBYEX7JD4}:homeserver.create_device{matrix.homeserver="devhack.net" matrix.mxid="@xxxx:devhack.net" matrix.device_id="LorCzfWcDe"}: mas_matrix_synapse: crates/matrix-synapse/src/lib.rs:345: error=Failed to create device in Synapse

Caused by:
    request failed with status 404 Not Found: M_NOT_FOUND: Unknown user
2024-10-09T19:20:50.876491Z ERROR http.server.request{otel.kind="server" otel.name="POST /complete-compat-sso/:grant_id" network.protocol.name="http" network.protocol.version="1.1" http.request.method="POST" url.path="/complete-compat-sso/01J9SBK4850FS9RTKXBYEX7JD4" url.scheme="http" http.route="/complete-compat-sso/:grant_id" user_agent.original="Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Mobile Safari/537.36"}:handlers.compat.login_sso_complete.post{compat_sso_login.id=01J9SBK4850FS9RTKXBYEX7JD4}: mas_handlers::compat::login_sso_complete: crates/handlers/src/compat/login_sso_complete.rs:117: error=Internal error: Failed to provision device (Failed to provision device

Caused by:
    0: Failed to create device in Synapse
    1: request failed with status 404 Not Found: M_NOT_FOUND: Unknown user)
2024-10-09T19:21:13.351558Z  INFO activity_tracker.flush: mas_handlers::activity_tracker::worker: crates/handlers/src/activity_tracker/worker.rs:219: Flushing 1 activity records to the database
2024-10-09T19:21:24.690731Z ERROR http.server.request{otel.kind="server" otel.name="POST /upstream/link/:link_id" network.protocol.name="http" network.protocol.version="1.1" http.request.method="POST" url.path="/upstream/link/01J9SBK5JD1WXZP5JMA1XHZM9T" url.scheme="http" http.route="/upstream/link/:link_id" user_agent.original="Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Mobile Safari/537.36"}:handlers.upstream_oauth2.link.post{upstream_oauth_link.id=01J9SBK5JD1WXZP5JMA1XHZM9T}: mas_handlers::upstream_oauth2::link: crates/handlers/src/upstream_oauth2/link.rs:469: error=Missing session cookie
2024-10-09T19:21:27.795058Z ERROR http.server.request{otel.kind="server" otel.name="GET /upstream/link/:link_id" network.protocol.name="http" network.protocol.version="1.1" http.request.method="GET" url.path="/upstream/link/01J9SBK5JD1WXZP5JMA1XHZM9T" url.scheme="http" http.route="/upstream/link/:link_id" user_agent.original="Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Mobile Safari/537.36"}:handlers.upstream_oauth2.link.get{upstream_oauth_link.id=01J9SBK5JD1WXZP5JMA1XHZM9T}: mas_handlers::upstream_oauth2::link: crates/handlers/src/upstream_oauth2/link.rs:182: error=Missing session cookie
2024-10-09T19:21:38.304873Z ERROR http.server.request{otel.kind="server" otel.name="POST /complete-compat-sso/:grant_id" network.protocol.name="http" network.protocol.version="1.1" http.request.method="POST" url.path="/complete-compat-sso/01J9SBNDEA00KYXQKPV53857EF" url.scheme="http" http.route="/complete-compat-sso/:grant_id" user_agent.original="Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Mobile Safari/537.36"}:handlers.compat.login_sso_complete.post{compat_sso_login.id=01J9SBNDEA00KYXQKPV53857EF}:homeserver.create_device{matrix.homeserver="devhack.net" matrix.mxid="@xxx:devhack.net" matrix.device_id="hYQWiVuKGg"}: mas_matrix_synapse: crates/matrix-synapse/src/lib.rs:345: error=Failed to create device in Synapse

Caused by:
    request failed with status 404 Not Found: M_NOT_FOUND: Unknown user
2024-10-09T19:21:38.304945Z ERROR http.server.request{otel.kind="server" otel.name="POST /complete-compat-sso/:grant_id" network.protocol.name="http" network.protocol.version="1.1" http.request.method="POST" url.path="/complete-compat-sso/01J9SBNDEA00KYXQKPV53857EF" url.scheme="http" http.route="/complete-compat-sso/:grant_id" user_agent.original="Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Mobile Safari/537.36"}:handlers.compat.login_sso_complete.post{compat_sso_login.id=01J9SBNDEA00KYXQKPV53857EF}: mas_handlers::compat::login_sso_complete: crates/handlers/src/compat/login_sso_complete.rs:117: error=Internal error: Failed to provision device (Failed to provision device

Caused by:
    0: Failed to create device in Synapse
    1: request failed with status 404 Not Found: M_NOT_FOUND: Unknown user)
2024-10-09T19:22:08.904046Z ERROR http.server.request{otel.kind="server" otel.name="POST /complete-compat-sso/:grant_id" network.protocol.name="http" network.protocol.version="1.1" http.request.method="POST" url.path="/complete-compat-sso/01J9SBPEFCDVKEZX7CGPTW0A46" url.scheme="http" http.route="/complete-compat-sso/:grant_id" user_agent.original="Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Mobile Safari/537.36"}:handlers.compat.login_sso_complete.post{compat_sso_login.id=01J9SBPEFCDVKEZX7CGPTW0A46}:homeserver.create_device{matrix.homeserver="devhack.net" matrix.mxid="@xxx:devhack.net" matrix.device_id="CJirStFj8Q"}: mas_matrix_synapse: crates/matrix-synapse/src/lib.rs:345: error=Failed to create device in Synapse

Caused by:
    request failed with status 404 Not Found: M_NOT_FOUND: Unknown user
2024-10-09T19:22:08.904114Z ERROR http.server.request{otel.kind="server" otel.name="POST /complete-compat-sso/:grant_id" network.protocol.name="http" network.protocol.version="1.1" http.request.method="POST" url.path="/complete-compat-sso/01J9SBPEFCDVKEZX7CGPTW0A46" url.scheme="http" http.route="/complete-compat-sso/:grant_id" user_agent.original="Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Mobile Safari/537.36"}:handlers.compat.login_sso_complete.post{compat_sso_login.id=01J9SBPEFCDVKEZX7CGPTW0A46}: mas_handlers::compat::login_sso_complete: crates/handlers/src/compat/login_sso_complete.rs:117: error=Internal error: Failed to provision device (Failed to provision device

Caused by:
    0: Failed to create device in Synapse
    1: request failed with status 404 Not Found: M_NOT_FOUND: Unknown user)
2024-10-09T19:22:13.360865Z  INFO activity_tracker.flush: mas_handlers::activity_tracker::worker: crates/handlers/src/activity_tracker/worker.rs:219: Flushing 2 activity records to the database
2024-10-09T19:23:00.007733Z  INFO job.run{otel.kind="consumer" job.id=JID-01J9SBR3H0XHFN50P04ZX7P87M job.attempts=0 job.name="cleanup-expired-tokens"}: mas_tasks::database: crates/tasks/src/database.rs:62: cleaned up expired tokens count=1
2024-10-09T19:23:13.368808Z  INFO activity_tracker.flush: mas_handlers::activity_tracker::worker: crates/handlers/src/activity_tracker/worker.rs:219: Flushing 1 activity records to the database

To Reproduce
Steps to reproduce the behavior:

  1. Have an account on our Keycloak instance created.
  2. Attempt to authenticate to our matrix server, browser is sent to MAS which sends you to Keycloak
  3. Authenticate to Keycloak, get sent back to MAS.
  4. Get error screen.

Expected behavior
User should be able to login to matrix normally.

Screenshots
image

Desktop (please complete the following information):
This seems to happen to people on many platforms.

Smartphone (please complete the following information):
This seems to happen to people on many platforms.

Additional context

  • Restarting the mas process (killing the pod) seems to fix the issue for the particular user. I'm not clear why.
  • When I asked about this in #matrix-auth:matrix.org the first time it happened, some workarounds were suggested. I tried manage provision-all-users at the time it was believed that the worker just got stuck. That seemed plausible, but it has since happened several times.
  • Since the first time it happened, we setup Prometheus metrics, which seem to show regularly scheduled cleanup-expired-tokens jobs continuing to complete, suggesting the jobs are not just stuck:
    image
    image
    According to the logs, the errors happened at 19:20:50, 19:21:38, 19:22:08 and 19:24:20. I restarted the process shortly after that.
  • As far as I can tell, this is happening every single time we onboard someone.
  • I'm not sure how to go about debugging this. I'm happy to create new test users, inspect the database, pull metrics and logs, dump configs, etc as needed.
@davidmehren
Copy link

We had similar errors, but resolved them by restarting MAS and running provision-all-users.

The whole thing felt a bit buggy for such a central authentication software. Maybe MAS can automatically try to reprovision the user in Synapse if it receives a 404 on device creation?

@sandhose
Copy link
Member

This is basically due to the job queue being unreliable.

See #2625 and #1490

We've started rewriting the job queue completely to make it reliable and properly retry lost/errored jobs, which should solve this issue

You should subscribe to #2785 to check with the progress

@sandhose sandhose added A-Jobs Related to asynchronous jobs T-Defect Something isn't working X-Blocked Blocked by another issue labels Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Jobs Related to asynchronous jobs T-Defect Something isn't working X-Blocked Blocked by another issue
Projects
None yet
Development

No branches or pull requests

3 participants