perf(Predictions.PubSub): Read stops + routes from global cache instead of API #206

Merged: 6 commits, Oct 2, 2024

KaylaBrady

Summary

Ticket: Predictions Scalability: new channel that publishes predictions updates in chunks

What is this PR for?

This PR reads stops and routes from the global data cache added in #200 instead of calling the API, since load testing showed those API calls to be a main source of latency (notes).

@KaylaBrady KaylaBrady requested a review from a team as a code owner September 25, 2024 12:53
@KaylaBrady KaylaBrady requested review from boringcactus and removed request for a team September 25, 2024 12:53
@KaylaBrady KaylaBrady added the deploy to dev-orange Automatically deploy this PR to dev-orange label Sep 25, 2024

Coverage of commit 4781abb

Summary coverage rate:
  lines......: 79.4% (1341 of 1689 lines)
  functions..: 70.1% (564 of 804 functions)
  branches...: no data found

Files changed coverage rate:
                                                                         |Lines       |Functions  |Branches    
  Filename                                                               |Rate     Num|Rate    Num|Rate     Num
  =============================================================================================================
  lib/mbta_v3_api/stop.ex                                                |97.7%     44|93.3%    15|    -      0
  lib/mobile_app_backend/global_data_cache.ex                            |79.1%     43|66.7%    15|    -      0
  lib/mobile_app_backend/predictions/pub_sub.ex                          |92.9%     70|88.2%    17|    -      0
  lib/mobile_app_backend/predictions/stream_subscriber.ex                |85.7%      7| 100%     1|    -      0

KaylaBrady commented Sep 25, 2024

When load testing against this branch, I very quickly started getting errors joining the channel:

87955d4a5a6b 13:02:37.701 [error] GenServer #PID<0.8821.0> terminating
87955d4a5a6b ** (MatchError) no match of right hand side value: {:error, %Req.TransportError{reason: :timeout}}
87955d4a5a6b (mobile_app_backend 0.1.0) lib/mobile_app_backend/global_data_cache.ex:194: MobileAppBackend.GlobalDataCache.Impl.fetch_route_patterns/0
87955d4a5a6b (mobile_app_backend 0.1.0) lib/mobile_app_backend/global_data_cache.ex:157: MobileAppBackend.GlobalDataCache.Impl.update_data/1
87955d4a5a6b (mobile_app_backend 0.1.0) lib/mobile_app_backend/predictions/pub_sub.ex:70: MobileAppBackend.Predictions.PubSub.subscribe_for_stops/1

I think we will need to either populate the cache on startup or have a fallback mechanism.
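The MatchError above comes from a hard match on a fetch that returned an error tuple. A minimal sketch of the fallback idea, assuming the fetch returns `{:ok, data}` or `{:error, reason}`: the module name `CacheFallbackSketch` and the injected `fetch_fun` are hypothetical stand-ins, not the project's `GlobalDataCache`.

```elixir
defmodule CacheFallbackSketch do
  @moduledoc """
  Hypothetical sketch: refresh a :persistent_term entry without crashing
  when the fetch fails. Not the project's GlobalDataCache.
  """

  require Logger

  # `fetch_fun` stands in for the real API call (an assumption) and must
  # return {:ok, data} or {:error, reason}.
  def update_data(key, fetch_fun) do
    case fetch_fun.() do
      {:ok, data} ->
        :persistent_term.put(key, data)
        {:ok, data}

      {:error, reason} ->
        # Leave any previously cached value in place instead of raising.
        Logger.warning("global data refresh failed: #{inspect(reason)}")
        {:error, reason}
    end
  end
end
```

With this shape a timeout leaves the previous cache entry untouched and the caller decides what to do with `{:error, reason}`. It does not help on a cold start, where there is nothing cached to fall back on.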

Coverage of commit b18f7cd

Summary coverage rate:
  lines......: 79.4% (1341 of 1689 lines)
  functions..: 70.1% (564 of 804 functions)
  branches...: no data found

Files changed coverage rate:
                                                                         |Lines       |Functions  |Branches    
  Filename                                                               |Rate     Num|Rate    Num|Rate     Num
  =============================================================================================================
  lib/mbta_v3_api/stop.ex                                                |97.7%     44|93.3%    15|    -      0
  lib/mobile_app_backend/global_data_cache.ex                            |79.1%     43|66.7%    15|    -      0
  lib/mobile_app_backend/predictions/pub_sub.ex                          |92.9%     70|88.2%    17|    -      0
  lib/mobile_app_backend/predictions/stream_subscriber.ex                |85.7%      7| 100%     1|    -      0

Coverage of commit 011cfda

Summary coverage rate:
  lines......: 79.3% (1342 of 1692 lines)
  functions..: 70.1% (564 of 804 functions)
  branches...: no data found

Files changed coverage rate:
                                                                         |Lines       |Functions  |Branches    
  Filename                                                               |Rate     Num|Rate    Num|Rate     Num
  =============================================================================================================
  lib/mbta_v3_api/stop.ex                                                |97.7%     44|93.3%    15|    -      0
  lib/mobile_app_backend/global_data_cache.ex                            |76.1%     46|66.7%    15|    -      0
  lib/mobile_app_backend/predictions/pub_sub.ex                          |92.9%     70|88.2%    17|    -      0
  lib/mobile_app_backend/predictions/stream_subscriber.ex                |85.7%      7| 100%     1|    -      0

KaylaBrady commented Sep 25, 2024

Added a call to recalculate as part of init, but I'm seeing this error in Splunk:
received unexpected message in handle_info/2: :recalculate

Potentially we should move checking for the presence of the global data into the health check endpoint so that traffic only shifts to an instance once it has global data.

boringcactus (Member) left a comment:

I think being able to use different keys and having a Mox mock are two different ways to solve the problem of providing test data from the GlobalDataCache to consumers, and I'm not quite sure if we need both of them, but I haven't finished thinking through which one would be enough.

If we report instances as healthy only once they have global data, the Docker container smoke test in CI will fail again unless we give that smoke test an API key. That might actually be the correct fix anyway, and I think it would let us just fetch the data eagerly.

lib/mobile_app_backend/global_data_cache.ex (outdated)
update_data(state.key)

Process.send_after(self(), :recalculate, state.update_ms)
if :persistent_term.get(state.key, nil) do
boringcactus (Member):

update_data only skips :persistent_term.put/2 if it crashes, so I don't see when this check could fail.


state = %State{
key: opts[:key],
update_ms: opts[:update_ms] || :timer.minutes(5)
}

Process.send_after(self(), :recalculate, :timer.seconds(1))
boringcactus (Member):

It's weird that this wasn't here before. This is really tough to test locally, since you can only know it's working if GTFS actually changes, but that might mean this never actually worked and was only ever calculating the global data once. Oops!

KaylaBrady (Collaborator, Author):

Added some light tests for this by checking that the message is sent.
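For reference, a minimal sketch of a self-rescheduling refresh loop of the kind discussed in this thread. The module name `RecalculateSketch` and the `:refresh` and `:first_update_ms` options are hypothetical. The 5-minute default and the 1-second initial delay mirror the snippet under review.

```elixir
defmodule RecalculateSketch do
  @moduledoc "Hypothetical sketch of a self-rescheduling refresh loop."
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      key: Keyword.fetch!(opts, :key),
      # Zero-arity function producing the data to cache (an assumption).
      refresh: Keyword.fetch!(opts, :refresh),
      update_ms: Keyword.get(opts, :update_ms, :timer.minutes(5)),
      first_update_ms: Keyword.get(opts, :first_update_ms, :timer.seconds(1))
    }

    # Schedule the first refresh shortly after startup.
    Process.send_after(self(), :recalculate, state.first_update_ms)
    {:ok, state}
  end

  # Without a clause for :recalculate, the default handle_info/2 injected by
  # `use GenServer` only logs "received unexpected message".
  @impl true
  def handle_info(:recalculate, state) do
    :persistent_term.put(state.key, state.refresh.())
    # Re-arm the timer so the refresh repeats instead of running once.
    Process.send_after(self(), :recalculate, state.update_ms)
    {:noreply, state}
  end
end
```

Calling `init/1` directly from a test process and asserting that `:recalculate` arrives is one light way to check the scheduling without waiting for GTFS to change.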

end

@spec get_data(key()) :: data()
@impl true
def get_data(key \\ default_key()) do
:persistent_term.get(key, nil) || update_data(key)
boringcactus (Member):

Maybe this update_data should be moved into a GenServer.call or something so that if there are a dozen simultaneous calls to get_data/1 before data is loaded they don't each call update_data/1.

KaylaBrady (Collaborator, Author):

I'd be somewhat worried about putting it into GenServer.call since if it is slow for the first user, subsequent user requests will all fail too.

I think the best bet is making sure that the data is populated first & removing the call to update_data from get_data.

Doing that asynchronously via the scheduled checks and keeping user traffic away via a health check seems like the safest approach to me: if the global data can't load immediately for some reason, it seems cleaner to keep retrying the fetch without crashing. Maybe I'm overly wary of crashes, though. In any case, I think resolving that mechanism could be part of a separate PR so that the polling is in place.

KaylaBrady (Collaborator, Author):

@boringcactus I'm going to break this PR up into separate ones so that we can clear out the immediate problem of setting up the timed refresh of this data.

I'm in favor of the health check approach over fetching the data in init and failing. In speaking to Paul about it (since he might be pitching in for that change anyway), it has the advantage over the init approach of faster deploys in the case that the request fails. Skate and API take that approach as well.

boringcactus (Member):

I don't think Glides has any particular application-logic-specific health checks in its /_health, which is probably why I hadn't thought of it, but I guess if that's common we may as well do it here. Good catch on fixing the refresh first - I had completely lost track of that in the context of the TestFlight public beta.
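A rough sketch of the health check idea from this thread, assuming the cache lives in `:persistent_term` under a known key. The module `HealthCheckSketch` and its function names are hypothetical, and how the status gets wired into the real `/_health` endpoint is left out.

```elixir
defmodule HealthCheckSketch do
  @moduledoc "Hypothetical sketch: readiness based on cached global data."

  # Readers never trigger a fetch; they only see what the refresher stored.
  def get_data(key), do: :persistent_term.get(key, nil)

  # An instance counts as healthy only once global data has been cached.
  def healthy?(key), do: not is_nil(get_data(key))

  # HTTP status a health endpoint could return for this instance.
  def status(key), do: if(healthy?(key), do: 200, else: 503)
end
```

Because `get_data/1` here never calls an update function, a dozen simultaneous readers before the first load all see `nil` rather than each starting their own fetch. Traffic would only shift to the instance once `status/1` returns 200.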

KaylaBrady (Collaborator, Author) commented:

@boringcactus I like having the Mox available here so that it is possible to mock the higher-level function rather than having to insert all the required data in persistent term. It was especially helpful in the StreamSubscriber tests to be able to mock route_ids_for_stops in a way that is consistent with mocking other data.

I don't think giving the Docker container an API key would solve the issue. I thought it couldn't make any network requests in CI (these V3 API requests should succeed without an API key anyway).

boringcactus (Member) commented:

That makes sense. If we want to be able to unit test the GlobalDataCache itself, we probably still need to be able to run it with arbitrary keys, but for testing things that call it, it's more work to make a cache key and fill the persistent data directly. If we're setting up Mox here for that purpose, this is probably the right place to rework the GlobalControllerTest to use Mox instead and roll back my changes to GlobalController to pick a cache key out of the connection assigns.

I'm not sure I'm aware of any issues with the Docker container not being able to make outgoing network requests in CI - we run load tests in CI against a real API instance, and it works fine. The API requests would succeed with no API key, but they can't succeed with no API URL.
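To illustrate the mocking approach discussed above, here is a minimal sketch of the behaviour-plus-swappable-implementation pattern that Mox builds on. Only the callback name `route_ids_for_stops` comes from the discussion. The module names, the `:sketch_app` config key, and the stubbed return value are hypothetical, and a hand-written stub stands in for a Mox mock so the example has no dependencies.

```elixir
defmodule GlobalDataSketch do
  @moduledoc "Hypothetical behaviour for illustration; names are assumed."

  @callback route_ids_for_stops([String.t()]) :: [String.t()]

  # Callers go through this function, so tests can swap the implementation
  # via application config instead of filling :persistent_term by hand.
  def route_ids_for_stops(stop_ids), do: impl().route_ids_for_stops(stop_ids)

  defp impl do
    Application.get_env(:sketch_app, :global_data, GlobalDataSketch.Stub)
  end
end

defmodule GlobalDataSketch.Stub do
  @moduledoc "Hand-written stand-in for what a Mox mock would provide."
  @behaviour GlobalDataSketch

  @impl true
  def route_ids_for_stops(_stop_ids), do: ["Red"]
end
```

A consumer test can then call `GlobalDataSketch.route_ids_for_stops(["place-pktrm"])` and get the stubbed routes back without touching the cache. With Mox, the stub module would typically be replaced by a mock defined with `Mox.defmock/2` for the same behaviour.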

@KaylaBrady KaylaBrady removed the deploy to dev-orange Automatically deploy this PR to dev-orange label Sep 26, 2024

Coverage of commit c30143a

Summary coverage rate:
  lines......: 79.6% (1353 of 1699 lines)
  functions..: 69.6% (567 of 815 functions)
  branches...: no data found

Files changed coverage rate:
                                                                         |Lines       |Functions  |Branches    
  Filename                                                               |Rate     Num|Rate    Num|Rate     Num
  =============================================================================================================
  lib/mbta_v3_api/stop.ex                                                |97.7%     44|87.5%    16|    -      0
  lib/mobile_app_backend/application.ex                                  |88.9%      9|50.0%     2|    -      0
  lib/mobile_app_backend/global_data_cache.ex                            |85.4%     48|86.7%    15|    -      0
  lib/mobile_app_backend/predictions/pub_sub.ex                          |92.9%     70|88.2%    17|    -      0
  lib/mobile_app_backend/predictions/stream_subscriber.ex                |85.7%      7| 100%     1|    -      0

Coverage of commit 05821ff

Summary coverage rate:
  lines......: 79.6% (1353 of 1699 lines)
  functions..: 69.6% (567 of 815 functions)
  branches...: no data found

Files changed coverage rate:
                                                                         |Lines       |Functions  |Branches    
  Filename                                                               |Rate     Num|Rate    Num|Rate     Num
  =============================================================================================================
  lib/mbta_v3_api/stop.ex                                                |97.7%     44|87.5%    16|    -      0
  lib/mobile_app_backend/application.ex                                  |88.9%      9|50.0%     2|    -      0
  lib/mobile_app_backend/global_data_cache.ex                            |85.4%     48|86.7%    15|    -      0
  lib/mobile_app_backend/predictions/pub_sub.ex                          |92.9%     70|88.2%    17|    -      0
  lib/mobile_app_backend/predictions/stream_subscriber.ex                |85.7%      7| 100%     1|    -      0

Coverage of commit f122fae

Summary coverage rate:
  lines......: 79.6% (1351 of 1697 lines)
  functions..: 69.6% (567 of 815 functions)
  branches...: no data found

Files changed coverage rate:
                                                                         |Lines       |Functions  |Branches    
  Filename                                                               |Rate     Num|Rate    Num|Rate     Num
  =============================================================================================================
  lib/mbta_v3_api/stop.ex                                                |97.7%     44|87.5%    16|    -      0
  lib/mobile_app_backend/application.ex                                  |88.9%      9|50.0%     2|    -      0
  lib/mobile_app_backend/global_data_cache.ex                            |85.4%     48|86.7%    15|    -      0
  lib/mobile_app_backend/predictions/pub_sub.ex                          |92.9%     70|88.2%    17|    -      0
  lib/mobile_app_backend/predictions/stream_subscriber.ex                |85.7%      7| 100%     1|    -      0
  lib/mobile_app_backend_web/controllers/global_controller.ex            | 100%      4|85.7%     7|    -      0

@KaylaBrady KaylaBrady added the deploy to dev-orange Automatically deploy this PR to dev-orange label Oct 1, 2024

github-actions bot commented Oct 1, 2024

Coverage of commit 24b01ce

Summary coverage rate:
  lines......: 79.7% (1354 of 1699 lines)
  functions..: 69.6% (568 of 816 functions)
  branches...: no data found

Files changed coverage rate:
                                                                         |Lines       |Functions  |Branches    
  Filename                                                               |Rate     Num|Rate    Num|Rate     Num
  =============================================================================================================
  lib/mbta_v3_api/stop.ex                                                |97.7%     44|87.5%    16|    -      0
  lib/mobile_app_backend/global_data_cache.ex                            |87.0%     46|86.7%    15|    -      0
  lib/mobile_app_backend/predictions/pub_sub.ex                          |92.9%     70|88.2%    17|    -      0
  lib/mobile_app_backend/predictions/stream_subscriber.ex                |85.7%      7| 100%     1|    -      0

github-actions bot commented Oct 2, 2024

Coverage of commit da02a7f

Summary coverage rate:
  lines......: 79.7% (1354 of 1699 lines)
  functions..: 69.6% (568 of 816 functions)
  branches...: no data found

Files changed coverage rate:
                                                                         |Lines       |Functions  |Branches    
  Filename                                                               |Rate     Num|Rate    Num|Rate     Num
  =============================================================================================================
  lib/mbta_v3_api/stop.ex                                                |97.7%     44|87.5%    16|    -      0
  lib/mobile_app_backend/global_data_cache.ex                            |87.0%     46|86.7%    15|    -      0
  lib/mobile_app_backend/predictions/pub_sub.ex                          |92.9%     70|88.2%    17|    -      0
  lib/mobile_app_backend/predictions/stream_subscriber.ex                |85.7%      7| 100%     1|    -      0

github-actions bot commented Oct 2, 2024

Coverage of commit b1e22d8

Summary coverage rate:
  lines......: 79.7% (1354 of 1699 lines)
  functions..: 69.6% (568 of 816 functions)
  branches...: no data found

Files changed coverage rate:
                                                                         |Lines       |Functions  |Branches    
  Filename                                                               |Rate     Num|Rate    Num|Rate     Num
  =============================================================================================================
  lib/mbta_v3_api/stop.ex                                                |97.7%     44|87.5%    16|    -      0
  lib/mobile_app_backend/global_data_cache.ex                            |87.0%     46|86.7%    15|    -      0
  lib/mobile_app_backend/predictions/pub_sub.ex                          |92.9%     70|88.2%    17|    -      0
  lib/mobile_app_backend/predictions/stream_subscriber.ex                |85.7%      7| 100%     1|    -      0

@KaylaBrady KaylaBrady merged commit 02d94a0 into main Oct 2, 2024
5 checks passed
@KaylaBrady KaylaBrady deleted the kb-pred-read-cache branch October 2, 2024 16:49