perf(Predictions.PubSub): Read stops + routes from global cache instead of API #206
Conversation
Load testing against this, I very quickly started to get errors joining the channel. I think we will need to either populate the cache on startup or have a fallback mechanism.
Added a call to recalculate as part of init, though I'm seeing a Splunk error. Potentially we should move checking for the presence of the global data into the health check endpoint, so that traffic only shifts to an instance once it has global data.
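For reference, a minimal sketch of that health-check idea, assuming a Plug-based /_health endpoint and that GlobalDataCache.default_key/0 is public; the module name and response details are placeholders, not the actual implementation:

```elixir
# Sketch only: report the instance unhealthy until the global data exists, so the
# load balancer only shifts traffic once the cache is warm. Names are assumed.
defmodule MobileAppBackendWeb.HealthPlug do
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    # Read the persistent_term directly so the health check never triggers a
    # synchronous fetch of its own.
    case :persistent_term.get(GlobalDataCache.default_key(), nil) do
      nil -> conn |> send_resp(503, "global data not loaded") |> halt()
      _data -> conn |> send_resp(200, "ok") |> halt()
    end
  end
end
```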
I think being able to use different keys and having a Mox mock are two different ways to solve the problem of providing test data from the GlobalDataCache to consumers, and I'm not quite sure if we need both of them, but I haven't finished thinking through which one would be enough.
If we report instances as healthy only once they have global data, that'll cause the Docker container smoke test in CI to fail again, unless we give that smoke test an API key, which might actually be the correct fix anyway; I think that would let us just fetch the data eagerly.
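As a point of comparison, here is roughly what the Mox route could look like, assuming the cache functions are defined as behaviour callbacks with a get_data/1 callback; the mock name, config keys, and data shape below are placeholders rather than the project's actual setup:

```elixir
# Sketch of the Mox approach; names and data shape are assumptions.

# test/test_helper.exs
Mox.defmock(GlobalDataCacheMock, for: GlobalDataCache)
Application.put_env(:mobile_app_backend, :global_data_cache, GlobalDataCacheMock)

# in a test case
test "channel join uses cached stops and routes" do
  Mox.expect(GlobalDataCacheMock, :get_data, fn _key ->
    # Return whatever test data the consumer under test needs.
    %{stops: %{}, routes: %{}}
  end)

  # ...exercise the code under test against the mock...
end
```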
update_data(state.key)

Process.send_after(self(), :recalculate, state.update_ms)

if :persistent_term.get(state.key, nil) do
update_data will only not call :persistent_term.put/2 if it crashes, so I don't think I understand when this check could fail.
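In other words, if update_data/1 always ends by writing the term unless it crashes, the recalculate handler could presumably drop the check entirely. A minimal sketch of that shape, under that assumption:

```elixir
# Sketch only: assumes update_data/1 always writes to :persistent_term unless it crashes.
@impl true
def handle_info(:recalculate, state) do
  update_data(state.key)
  Process.send_after(self(), :recalculate, state.update_ms)
  {:noreply, state}
end
```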
state = %State{
  key: opts[:key],
  update_ms: opts[:update_ms] || :timer.minutes(5)
}

Process.send_after(self(), :recalculate, :timer.seconds(1))
It's weird that this wasn't here before. This is really tough to test locally, since you can only know it's working if GTFS actually changes, but that might mean this never actually worked and was only ever calculating the global data once. Oops!
Added some light tests for this by checking that the message is sent.
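A rough sketch of that kind of test, assuming the State struct fields shown above and that update_data/1 can run (or is stubbed) in the test environment; because handle_info/2 is called directly in the test process, Process.send_after/3 targets the test pid and assert_receive can see the scheduled message:

```elixir
# Sketch only: struct and module names are assumed.
test "recalculating schedules the next recalculation" do
  state = %GlobalDataCache.State{key: :test_key, update_ms: 10}

  GlobalDataCache.handle_info(:recalculate, state)

  # The next :recalculate should be scheduled against the calling process.
  assert_receive :recalculate, 100
end
```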
end

@spec get_data(key()) :: data()
@impl true
def get_data(key \\ default_key()) do
  :persistent_term.get(key, nil) || update_data(key)
Maybe this update_data should be moved into a GenServer.call or something, so that if there are a dozen simultaneous calls to get_data/1 before data is loaded, they don't each call update_data/1.
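One possible shape for that, as a sketch (assuming the GenServer is registered under the module name): the first caller that misses the cache asks the server to load it, and since the server handles calls one at a time, concurrent callers wait on that single load instead of each fetching independently.

```elixir
# Sketch of serializing the initial load through the GenServer; names assumed.
def get_data(key \\ default_key()) do
  :persistent_term.get(key, nil) || GenServer.call(__MODULE__, {:load_data, key})
end

@impl true
def handle_call({:load_data, key}, _from, state) do
  # Re-check in case another caller already populated the term while we waited.
  data = :persistent_term.get(key, nil) || update_data(key)
  {:reply, data, state}
end
```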
I'd be somewhat worried about putting it into GenServer.call, since if it is slow for the first user, subsequent user requests will all fail too. I think the best bet is making sure that the data is populated first & removing the call to update_data from get_data.
Doing that asynchronously via the scheduled checks & preventing user traffic via a healthcheck seems like the safest approach to me - if the global data can't load immediately for some reason, it seems cleaner to keep trying to re-fetch without crashing. Maybe I'm overly wary of crashes, though. In any case, I think resolving that mechanism could be part of a separate PR so that the polling is in place.
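For illustration, the shape being argued for here, assuming startup population plus the health check guarantee the data exists before any user traffic arrives:

```elixir
# Sketch only: with the health check gating traffic, get_data/1 no longer needs a
# fetch fallback; :persistent_term.get/1 will raise if the data is somehow absent.
def get_data(key \\ default_key()) do
  :persistent_term.get(key)
end
```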
@boringcactus I'm going to break this PR up into separate ones so that we can clear out the immediate problem of setting up the timed refresh of this data.
I'm in favor of the health check approach over fetching the data in init and failing. In speaking to Paul about it (since he might be pitching in for that change anyway), it has the advantage over the init approach of faster deploys in the case that the request fails. Skate and API take that approach as well.
I don't think Glides has any particular application-logic-specific health checks in its /_health, which is probably why I hadn't thought of it, but I guess if that's common we may as well do it here. Good catch on fixing the refresh first - I had completely lost track of that in the context of the TestFlight public beta.
@boringcactus I like having the Mox available here so that it is possible to mock the higher-level function rather than having to insert all the required data in persistent term. It was especially helpful in the StreamSubscriber tests to be able to mock.

I don't think giving the Docker container an API key would solve the issue - I thought it couldn't make any network requests in CI (these V3 API requests should succeed without an API key anyway).
That makes sense. If we want to be able to unit test the

I'm not sure I'm aware of any issues with the Docker container not being able to make outgoing network requests in CI - we run load tests in CI against a real API instance, and it works fine. The API requests would succeed with no API key, but they can't succeed with no API URL.
Summary
Ticket: Predictions Scalability: new channel that publishes predictions updates in chunks
What is this PR for?
This PR incorporates the global cache data added in #200 since we found those API calls to be a main source of latency in load testing (notes).
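A very rough illustration of the intent (the helper names and data shape below are placeholders, not the actual diff): instead of making V3 API requests for stops and routes on every channel join, the predictions PubSub reads them out of the global cache that #200 keeps warm.

```elixir
# Placeholder names throughout; this shows the shape of the change, not the real code.

# before: API round trips on each channel join
stops = fetch_stops_from_api(stop_ids)
routes = fetch_routes_from_api(route_ids)

# after: read from the cache maintained by GlobalDataCache
%{stops: all_stops, routes: all_routes} = GlobalDataCache.get_data()
stops = Map.take(all_stops, stop_ids)
routes = Map.take(all_routes, route_ids)
```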