Issues with run times of data pulling #345

Open
aylapear opened this issue Nov 29, 2024 · 9 comments

@aylapear

The shinywqg shiny app uses the bcdata::bcdc_get_data() function to pull seven data sets from the BC Data Catalogue when the app launches.

Normally it takes about 14 seconds to pull the data with bcdata::bcdc_get_data(). Users have reported that it can sometimes take over 90 seconds or longer, often during periods when you would expect high-volume use, e.g., Monday mornings.

This issue has been difficult to diagnose because it is intermittent. I was able to simulate it by running the function in parallel: the more calls running at the same time, the longer the pull times for the seven data sets. A single run pulled the seven data sets in 14 seconds, while 10 simultaneous runs pushed the pull times up to 35 seconds.

App users have been reporting these slowdowns, so we are looking for ways to decrease the run time when multiple users are accessing the app.

@boshek
Collaborator

boshek commented Dec 3, 2024

👋 @aylapear. Great to hear that you are using bcdata! This is a bit difficult to diagnose and is possibly a shiny issue rather than a bcdata one. Would you be able to construct a reprex that is self-contained (i.e. not using shiny) and time it? I think one approach could be to run a bunch of bcdata commands in parallel and then sequentially and see what sort of timings you get.
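
A minimal sketch of that comparison (untested; the record IDs are three of the app's single-resource data sets, and any catalogue records would do):

library(parallel)

records <- c(
  "23ada5c3-67a6-4703-9369-c8d690b092e1",
  "a35c7d13-76dd-4c23-aab8-7b32b0310e2f",
  "d25b348f-a8da-41fc-8141-bd29df155e9c"
)

# Time one pull, returning elapsed wall-clock seconds.
pull_one <- function(rec) {
  unname(system.time(
    bcdata::bcdc_get_data(record = rec, show_col_types = FALSE)
  )["elapsed"])
}

# Sequential: one request at a time.
seq_times <- vapply(records, pull_one, numeric(1))

# Parallel: all requests in flight at once (mclapply forks, so this
# needs macOS/Linux; on Windows use parLapply with a cluster).
par_times <- unlist(mclapply(records, pull_one, mc.cores = length(records)))

sum(seq_times)  # wall time when the pulls run back-to-back
max(par_times)  # approximate wall time when they run concurrently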

@stephhazlitt
Member

stephhazlitt commented Dec 4, 2024

I strongly suspect this is an I/O issue (busy traffic to the BC Data Catalogue API and/or network speed when transferring the CSV data).

BC Stats has a few apps with the same challenge. We improved one by breaking up the data calls (the app uses tabs, and data is loaded only when a user opens a tab). It might be worth exploring storing the open-licensed data somewhere else for the app and seeing if that improves speed reliability, and/or, even better, whether it opens up options for changing the data store format to something much more efficient, e.g., parquet files, which means less data to move.
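
As a rough sketch of the parquet idea (the cache path and file name are placeholders; the record ID is one of the app's data sets):

library(arrow)

cache_path <- "data-cache/dataset.parquet"

# Refresh the cache only when the file is missing (a scheduled job
# could instead rewrite it on a timer).
if (!file.exists(cache_path)) {
  dat <- bcdata::bcdc_get_data("23ada5c3-67a6-4703-9369-c8d690b092e1")
  dir.create(dirname(cache_path), showWarnings = FALSE, recursive = TRUE)
  write_parquet(dat, cache_path)
}

dat <- read_parquet(cache_path)  # fast local read, no API call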

@smnorris

smnorris commented Dec 4, 2024

A single bucket for all open data would be pretty handy.
I'm starting to cache WFS data to object storage as parquet, but haven't got to the point of caching all data with a single repo of scheduled workflows and modifying all apps to hit the cache rather than using bcdata/WFS.
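
The pattern is roughly this with arrow (the bucket name and endpoint here are hypothetical):

library(arrow)

bucket <- s3_bucket(
  "open-data-cache",                                    # hypothetical bucket
  endpoint_override = "https://objectstore.example.ca"  # hypothetical endpoint
)

# Scheduled workflow: pull from the catalogue and refresh the cache.
dat <- bcdata::bcdc_get_data("23ada5c3-67a6-4703-9369-c8d690b092e1")
write_parquet(dat, bucket$path("dataset.parquet"))

# In the apps: read the cached parquet instead of hitting bcdata/WFS.
dat <- read_parquet(bucket$path("dataset.parquet"))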

@aylapear
Author

This is the reprex I used to prove to myself that it wasn't the shiny part causing the issue. I opened several copies of the script and ran them; as I opened more, the run times started to increase. It took at least 10 or so before I started to notice the issue.

# Record and resource IDs for the seven data sets the app pulls on launch.
# Only the first record needs an explicit resource ID; the others have a
# single resource each, so NULL is passed (tribble stores this as a list
# column).
df <- tibble::tribble(
  ~record,                                ~resource,
  "85d3990a-ec0a-4436-8ebd-150de3ba0747", "6f32a85b-a3d9-44c3-9a14-15175eba25b6",
  "23ada5c3-67a6-4703-9369-c8d690b092e1", NULL,
  "a35c7d13-76dd-4c23-aab8-7b32b0310e2f", NULL,
  "d25b348f-a8da-41fc-8141-bd29df155e9c", NULL,
  "065581bf-fd52-4aad-aa82-fc510c074cab", NULL,
  "6b1cc604-18d1-426c-9d9a-c31b5ba15a16", NULL,
  "2396b78b-8782-444d-bade-4487b75b789c", NULL
)

# Pull the seven data sets repeatedly, timing each individual pull.
list_main <- list()
for (i in 1:150) {
  list_in <- list()
  for (j in seq_len(nrow(df))) {
    start <- Sys.time()

    data <- bcdata::bcdc_get_data(
      record = df$record[j],
      resource = df$resource[[j]],
      show_col_types = FALSE
    )

    end <- Sys.time()
    # Force seconds so the difftime units cannot silently switch to minutes.
    list_in[[j]] <- as.numeric(difftime(end, start, units = "secs"))
  }
  list_main[[i]] <- list_in
}
beepr::beep(3)  # audible alert when the loop finishes

# Total pull time for all seven data sets, summarised over the first
# five iterations only.
run_times <- c()
for (i in 1:5) {
  run_times <- c(run_times, sum(unlist(list_main[[i]])))
}

mean(run_times)
min(run_times)
max(run_times)

@aylapear
Author

aylapear commented Dec 12, 2024

This is a new issue, and the app has grown since the initial design. The bcdata package is great and worked for this app for a long time without issues. If this issue is not going to go away, or may only get worse with higher user volumes, then the app will need to be refactored. I wanted to open the dialogue to see if others are having this issue and whether something within bcdata could be changed instead of having to refactor the app.

Glad people have responded indicating they have also had this issue and found some creative solutions as workarounds.

@stephhazlitt
Member

stephhazlitt commented Dec 12, 2024

I am not an expert in APIs, but I think one way to test whether the performance reduction comes from bcdata or from the CKAN API itself is to use the API directly (https://bcgov.github.io/data-publication/pages/dps_bcdc_api_w_how_to_use.html), ramp up downloads as you did in R, and see if the same pattern can be reproduced (or use curl in R instead of bcdata?). It might also be worth opening a ticket with the BC Data Catalogue team (here) to see if they have any insight they can share about API limits/throttling, performance changes with traffic, etc.
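
For the curl-in-R version, something like this (the URL is a placeholder; substitute a resource's actual download link from its catalogue page):

library(curl)

# Time a direct download, bypassing bcdata entirely, to see whether the
# slowdown reproduces against the raw file endpoint.
url <- "https://catalogue.data.gov.bc.ca/dataset/.../download/data.csv"

direct_time <- system.time(
  curl_download(url, destfile = tempfile(fileext = ".csv"))
)["elapsed"]

direct_time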

@boshek
Collaborator

boshek commented Dec 12, 2024

@aylapear can you elaborate on what you were doing here?

> I opened several copies of the script and ran them; as I opened more, the run times started to increase. It took at least 10 or so before I started to notice the issue.

Did you open up different R processes for this?
Does this script accurately reflect the situation in the shiny app? Does it really make 150 × 7 calls to the catalogue?

Also, can you post the times you saw when you ran the above script? On my network I get:

> mean(run_times)
[1] 7.441925
> min(run_times)
[1] 6.744544
> max(run_times)
[1] 9.745393

That variation does not seem too bad to me?

@aylapear
Author

@boshek Yes, multiple R processes were opened. I opened several instances of RStudio and ran the script once in each, so I had multiple RStudio windows/apps open running the script at the same time. I found that with only one or two RStudios open (and running the script) the run times were around 8 seconds, but once 10 RStudios were open (and running the script) the run times increased to around 22 seconds.

# when only 1 RStudio was open and running the script
> mean(run_times)
[1] 9.812262
> min(run_times)
[1] 8.494646
> max(run_times)
[1] 11.12988
# when 10 RStudios were open and running the script; this is the output of the RStudio window with the highest run times
> mean(run_times)
[1] 21.01758
> min(run_times)
[1] 19.57308
> max(run_times)
[1] 22.91434

The app does not make 150 × 7 calls to the catalogue; it makes 7 calls each time a user opens it. I made the 150 calls to capture the changes in run times that users reported at different times of day, and because it made it easy to open multiple R processes, get one running, then start the next, and ensure the first ones were still running by the time the last one started. The 150 is just an arbitrary number I picked.

The BC gov employees who manage the app have reported start-up times of over 90 seconds. I noted this when I first started investigating the issue, but at that time I did not have a system for tracking it, as I was unsure of the cause. When the app starts up it shows messages, so we can see that it launches; the next step is downloading the data from the data catalogue using the bcdata package, and this is where it would get stuck. It does not happen all the time, but it has happened often enough that users have reported it and been frustrated by it. @aazizish do you have additional comments to add to this discussion?
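
For anyone who wants to reproduce the multi-process contention without opening RStudio windows by hand, a sketch with callr (untested; the three record IDs are from the reprex above, and 10 sessions is arbitrary):

library(callr)

# One worker: pull a few of the data sets once and return the total
# elapsed wall-clock seconds. Everything is defined inside the function
# because it runs in a fresh background R session.
time_pulls <- function() {
  records <- c(
    "23ada5c3-67a6-4703-9369-c8d690b092e1",
    "a35c7d13-76dd-4c23-aab8-7b32b0310e2f",
    "d25b348f-a8da-41fc-8141-bd29df155e9c"
  )
  total <- system.time(
    for (rec in records) {
      bcdata::bcdc_get_data(record = rec, show_col_types = FALSE)
    }
  )
  unname(total["elapsed"])
}

# Start 10 background R processes at (nearly) the same time.
workers <- lapply(1:10, function(i) r_bg(time_pulls))

# Wait for all of them to finish, then collect per-session totals.
while (any(vapply(workers, function(p) p$is_alive(), logical(1)))) Sys.sleep(1)
times <- vapply(workers, function(p) p$get_result(), numeric(1))
summary(times)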

@aylapear
Author

I ran things again this morning.

I started by opening a single RStudio and running the script. It got these run times:

> mean(run_times)
[1] 9.979069
> min(run_times)
[1] 8.703425
> max(run_times)
[1] 11.25471

Then I opened 14 RStudios and ran the script in each one: I would start the script in one RStudio, then open the next RStudio and start the script, until all of the RStudio windows were running the script at the same time. The run times increased in all of them, with the highest being 34 seconds.

Here are the run times from each RStudio:

# 1
> mean(run_times)
[1] 27.63555
> min(run_times)
[1] 22.26966
> max(run_times)
[1] 31.86886
# 2
> mean(run_times)
[1] 26.7724
> min(run_times)
[1] 21.83208
> max(run_times)
[1] 31.58645
# 3
> mean(run_times)
[1] 26.25767
> min(run_times)
[1] 21.7672
> max(run_times)
[1] 34.06811
# 4
> mean(run_times)
[1] 20.97825
> min(run_times)
[1] 10.96962
> max(run_times)
[1] 29.96217
# 5
> mean(run_times)
[1] 27.60679
> min(run_times)
[1] 19.53013
> max(run_times)
[1] 31.99741
# 6
> mean(run_times)
[1] 26.25231
> min(run_times)
[1] 23.7442
> max(run_times)
[1] 28.55587
# 7
> mean(run_times)
[1] 25.43196
> min(run_times)
[1] 17.86646
> max(run_times)
[1] 32.56424
# 8
> mean(run_times)
[1] 25.17905
> min(run_times)
[1] 19.04733
> max(run_times)
[1] 33.04431
# 9
> mean(run_times)
[1] 26.19547
> min(run_times)
[1] 19.26298
> max(run_times)
[1] 33.63481
# 10
> mean(run_times)
[1] 28.74862
> min(run_times)
[1] 26.8705
> max(run_times)
[1] 31.36086
# 11
> mean(run_times)
[1] 26.50004
> min(run_times)
[1] 20.06559
> max(run_times)
[1] 30.62028
# 12
> mean(run_times)
[1] 25.53255
> min(run_times)
[1] 18.61949
> max(run_times)
[1] 34.37966
# 13
> mean(run_times)
[1] 29.09956
> min(run_times)
[1] 26.30623
> max(run_times)
[1] 32.22448
# 14
> mean(run_times)
[1] 26.4824
> min(run_times)
[1] 20.26294
> max(run_times)
[1] 33.2982
