Issues with run times of data pulling #345

Open
aylapear opened this issue Nov 29, 2024 · 9 comments

@aylapear

The shinywqg shiny app uses the bcdata::bcdc_get_data() function to pull seven data sets from the BC Data Catalogue when the app launches.

Normally it takes about 14 seconds to pull the data with bcdata::bcdc_get_data(). Users have reported that it can sometimes take over 90 seconds or longer, often during periods when you would expect high-volume use, e.g., Monday mornings.

This issue has been difficult to diagnose because it is intermittent. I was able to simulate it by running the function in parallel: the more calls running at the same time, the longer the pull times for the seven data sets. A single run pulled the seven data sets in 14 seconds, while 10 simultaneous runs pushed the pull times up to 35 seconds.

App users have been reporting these slowdowns, so we are looking for ways to decrease the run time when multiple users are accessing the app.

@boshek
Collaborator

boshek commented Dec 3, 2024

👋 @aylapear. Great to hear that you are using bcdata! This is a bit difficult to diagnose and is possibly a shiny issue rather than a bcdata one. Would you be able to construct a reprex that is self-contained (i.e. not using shiny) and time it? I think one approach could be to run a bunch of bcdata commands in parallel and then sequentially and see what sort of timings you get.
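
A minimal sketch of that comparison (untested; the record IDs are three of the app's single-resource data sets, and any catalogue records would do):

library(parallel)

records <- c(
  "23ada5c3-67a6-4703-9369-c8d690b092e1",
  "a35c7d13-76dd-4c23-aab8-7b32b0310e2f",
  "d25b348f-a8da-41fc-8141-bd29df155e9c"
)

# Time one pull, returning elapsed wall-clock seconds.
pull_one <- function(rec) {
  unname(system.time(
    bcdata::bcdc_get_data(record = rec, show_col_types = FALSE)
  )["elapsed"])
}

# Sequential: one request at a time.
seq_times <- vapply(records, pull_one, numeric(1))

# Parallel: all requests in flight at once (mclapply forks, so this
# needs macOS/Linux; on Windows use parLapply with a cluster).
par_times <- unlist(mclapply(records, pull_one, mc.cores = length(records)))

sum(seq_times)  # wall time when the pulls run back-to-back
max(par_times)  # approximate wall time when they run concurrently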

@stephhazlitt
Member

stephhazlitt commented Dec 4, 2024

I strongly suspect this is an I/O issue (busy traffic to the BC Data Catalogue API and/or network speed when transferring the CSV data).

BC Stats has a few apps with the same challenge. We improved one by breaking up the data calls (the app uses tabs, and data is loaded only when a user opens a tab). It might be worth exploring storing the open-licensed data somewhere else for the app and seeing if that improves speed reliability, and/or, even better, whether it opens up options for changing the data store format to something much more efficient, e.g., parquet files, which means less data to move.
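
As a rough sketch of the parquet idea (the cache path and file name are placeholders; the record ID is one of the app's data sets):

library(arrow)

cache_path <- "data-cache/dataset.parquet"

# Refresh the cache only when the file is missing (a scheduled job
# could instead rewrite it on a timer).
if (!file.exists(cache_path)) {
  dat <- bcdata::bcdc_get_data("23ada5c3-67a6-4703-9369-c8d690b092e1")
  dir.create(dirname(cache_path), showWarnings = FALSE, recursive = TRUE)
  write_parquet(dat, cache_path)
}

dat <- read_parquet(cache_path)  # fast local read, no API call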

@smnorris

smnorris commented Dec 4, 2024

A single bucket for all open data would be pretty handy.
I'm starting to cache WFS data to object storage as parquet, but haven't got to the point of caching all data with a single repo of scheduled workflows and modifying all apps to hit the cache rather than using bcdata/WFS.
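
The pattern is roughly this with arrow (the bucket name and endpoint here are hypothetical):

library(arrow)

bucket <- s3_bucket(
  "open-data-cache",                                    # hypothetical bucket
  endpoint_override = "https://objectstore.example.ca"  # hypothetical endpoint
)

# Scheduled workflow: pull from the catalogue and refresh the cache.
dat <- bcdata::bcdc_get_data("23ada5c3-67a6-4703-9369-c8d690b092e1")
write_parquet(dat, bucket$path("dataset.parquet"))

# In the apps: read the cached parquet instead of hitting bcdata/WFS.
dat <- read_parquet(bucket$path("dataset.parquet"))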

@aylapear
Author

This is the reprex I used to prove to myself that it wasn't the shiny part causing the issue. I opened several copies of the script and ran them; as I opened more, the run times started to increase. It took at least 10 or so before I started to notice the issue.

# Record and resource IDs for the seven data sets the app pulls on launch.
# Only the first record needs an explicit resource ID; the others have a
# single resource each, so NULL is passed (tribble stores this as a list
# column).
df <- tibble::tribble(
  ~record,                                ~resource,
  "85d3990a-ec0a-4436-8ebd-150de3ba0747", "6f32a85b-a3d9-44c3-9a14-15175eba25b6",
  "23ada5c3-67a6-4703-9369-c8d690b092e1", NULL,
  "a35c7d13-76dd-4c23-aab8-7b32b0310e2f", NULL,
  "d25b348f-a8da-41fc-8141-bd29df155e9c", NULL,
  "065581bf-fd52-4aad-aa82-fc510c074cab", NULL,
  "6b1cc604-18d1-426c-9d9a-c31b5ba15a16", NULL,
  "2396b78b-8782-444d-bade-4487b75b789c", NULL
)

# Pull the seven data sets repeatedly, timing each individual pull.
list_main <- list()
for (i in 1:150) {
  list_in <- list()
  for (j in seq_len(nrow(df))) {
    start <- Sys.time()

    data <- bcdata::bcdc_get_data(
      record = df$record[j],
      resource = df$resource[[j]],
      show_col_types = FALSE
    )

    end <- Sys.time()
    # Force seconds so the difftime units cannot silently switch to minutes.
    list_in[[j]] <- as.numeric(difftime(end, start, units = "secs"))
  }
  list_main[[i]] <- list_in
}
beepr::beep(3)  # audible alert when the loop finishes

# Total pull time for all seven data sets, summarised over the first
# five iterations only.
run_times <- c()
for (i in 1:5) {
  run_times <- c(run_times, sum(unlist(list_main[[i]])))
}

mean(run_times)
min(run_times)
max(run_times)

@aylapear
Author

aylapear commented Dec 12, 2024

This is a new issue, and the app has grown since the initial design. The bcdata package is great and worked for this app for a long time without issues. If this issue is not going to go away, or may only get worse with higher user volumes, then the app will need to be refactored. I wanted to open the dialogue to see if others are having this issue and whether something within bcdata could be changed instead of having to refactor the app.

Glad people have responded indicating they have also had this issue and found some creative solutions as workarounds.

@stephhazlitt
Member

stephhazlitt commented Dec 12, 2024

I am not an expert in APIs, but I think one way to test whether the performance reduction comes from bcdata or from the CKAN API itself is to use the API directly (https://bcgov.github.io/data-publication/pages/dps_bcdc_api_w_how_to_use.html), ramp up downloads as you did in R, and see if the same pattern can be reproduced (or use curl in R instead of bcdata?). It might also be worth opening a ticket with the BC Data Catalogue team (here) to see if they have any insight they can share about API limits/throttling, performance changes with traffic, etc.
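
For the curl-in-R version, something like this (the URL is a placeholder; substitute a resource's actual download link from its catalogue page):

library(curl)

# Time a direct download, bypassing bcdata entirely, to see whether the
# slowdown reproduces against the raw file endpoint.
url <- "https://catalogue.data.gov.bc.ca/dataset/.../download/data.csv"

direct_time <- system.time(
  curl_download(url, destfile = tempfile(fileext = ".csv"))
)["elapsed"]

direct_time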

@boshek
Collaborator

boshek commented Dec 12, 2024

@aylapear can you elaborate on what you were doing here?

> I opened several copies of the script and ran them; as I opened more, the run times started to increase. It took at least 10 or so before I started to notice the issue.

Did you open up different R processes for this?
Does this script accurately reflect the situation in the shiny app? Does it really make 150 × 7 calls to the catalogue?

Also, can you post the times you saw when you ran the above script? On my network I get:

> mean(run_times)
[1] 7.441925
> min(run_times)
[1] 6.744544
> max(run_times)
[1] 9.745393

That variation does not seem too bad to me?

@aylapear
Author

@boshek Yes, multiple R processes were opened. I opened several instances of RStudio and ran the script once in each, so I had multiple RStudio windows/apps open running the script at the same time. I found that with only one or two RStudios open (and running the script) the run times were around 8 seconds, but once 10 RStudios were open (and running the script) the run times increased to around 22 seconds.

# when only 1 RStudio was open and running the script
> mean(run_times)
[1] 9.812262
> min(run_times)
[1] 8.494646
> max(run_times)
[1] 11.12988
# when 10 RStudios were open and running the script; this is the output of the RStudio window with the highest run times
> mean(run_times)
[1] 21.01758
> min(run_times)
[1] 19.57308
> max(run_times)
[1] 22.91434

The app does not make 150 × 7 calls to the catalogue; it makes 7 calls each time a user opens it. I made the 150 calls to capture the changes in run times that users reported at different times of day, and because it made it easy to open multiple R processes, get one running, then start the next, and ensure the first ones were still running by the time the last one started. The 150 is just an arbitrary number I picked.

The BC gov employees who manage the app have reported start-up times of over 90 seconds. I noted this when I first started investigating the issue, but at that time I did not have a system for tracking it, as I was unsure of the cause. When the app starts up it shows messages, so we can see that it launches; the next step is downloading the data from the data catalogue using the bcdata package, and this is where it would get stuck. It does not happen all the time, but it has happened often enough that users have reported it and been frustrated by it. @aazizish do you have additional comments to add to this discussion?
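
For anyone who wants to reproduce the multi-process contention without opening RStudio windows by hand, a sketch with callr (untested; the three record IDs are from the reprex above, and 10 sessions is arbitrary):

library(callr)

# One worker: pull a few of the data sets once and return the total
# elapsed wall-clock seconds. Everything is defined inside the function
# because it runs in a fresh background R session.
time_pulls <- function() {
  records <- c(
    "23ada5c3-67a6-4703-9369-c8d690b092e1",
    "a35c7d13-76dd-4c23-aab8-7b32b0310e2f",
    "d25b348f-a8da-41fc-8141-bd29df155e9c"
  )
  total <- system.time(
    for (rec in records) {
      bcdata::bcdc_get_data(record = rec, show_col_types = FALSE)
    }
  )
  unname(total["elapsed"])
}

# Start 10 background R processes at (nearly) the same time.
workers <- lapply(1:10, function(i) r_bg(time_pulls))

# Wait for all of them to finish, then collect per-session totals.
while (any(vapply(workers, function(p) p$is_alive(), logical(1)))) Sys.sleep(1)
times <- vapply(workers, function(p) p$get_result(), numeric(1))
summary(times)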

@aylapear
Author

I ran things again this morning.

I started by opening a single RStudio and running the script. It got these run times:

> mean(run_times)
[1] 9.979069
> min(run_times)
[1] 8.703425
> max(run_times)
[1] 11.25471

Then I opened 14 RStudios and ran the script in each one: I would start the script in one RStudio, then open the next RStudio and start the script, until all of the RStudio windows were running the script at the same time. The run times increased in all of them, with the highest being 34 seconds.

Here are the run times from each RStudio:

# 1
> mean(run_times)
[1] 27.63555
> min(run_times)
[1] 22.26966
> max(run_times)
[1] 31.86886
# 2
> mean(run_times)
[1] 26.7724
> min(run_times)
[1] 21.83208
> max(run_times)
[1] 31.58645
# 3
> mean(run_times)
[1] 26.25767
> min(run_times)
[1] 21.7672
> max(run_times)
[1] 34.06811
# 4
> mean(run_times)
[1] 20.97825
> min(run_times)
[1] 10.96962
> max(run_times)
[1] 29.96217
# 5
> mean(run_times)
[1] 27.60679
> min(run_times)
[1] 19.53013
> max(run_times)
[1] 31.99741
# 6
> mean(run_times)
[1] 26.25231
> min(run_times)
[1] 23.7442
> max(run_times)
[1] 28.55587
# 7
> mean(run_times)
[1] 25.43196
> min(run_times)
[1] 17.86646
> max(run_times)
[1] 32.56424
# 8
> mean(run_times)
[1] 25.17905
> min(run_times)
[1] 19.04733
> max(run_times)
[1] 33.04431
# 9
> mean(run_times)
[1] 26.19547
> min(run_times)
[1] 19.26298
> max(run_times)
[1] 33.63481
# 10
> mean(run_times)
[1] 28.74862
> min(run_times)
[1] 26.8705
> max(run_times)
[1] 31.36086
# 11
> mean(run_times)
[1] 26.50004
> min(run_times)
[1] 20.06559
> max(run_times)
[1] 30.62028
# 12
> mean(run_times)
[1] 25.53255
> min(run_times)
[1] 18.61949
> max(run_times)
[1] 34.37966
# 13
> mean(run_times)
[1] 29.09956
> min(run_times)
[1] 26.30623
> max(run_times)
[1] 32.22448
# 14
> mean(run_times)
[1] 26.4824
> min(run_times)
[1] 20.26294
> max(run_times)
[1] 33.2982
