
Cache WFS data to object storage #349

Closed
smnorris opened this issue Dec 18, 2024 · 7 comments

smnorris commented Dec 18, 2024

As mentioned in #345, caching frequently accessed WFS datasets as files on object storage could work well to reduce the maintenance burden, WFS-related outages, and WFS server load.

Functional proof of concept:

https://github.com/smnorris/bcfishpass/blob/main/.github/workflows/replicate-monthly.yaml
https://github.com/smnorris/bcfishpass/blob/main/.github/workflows/replicate-weekly.yaml
https://github.com/smnorris/bcfishpass/blob/main/jobs/replicate_bcgw
https://github.com/smnorris/bcfishpass/blob/main/jobs/bcgw_sources.json

The existing structure presumes clients access the cache via a direct URL rather than through bcdata, but that could be tweaked.

WFS vs S3 cache for a ~370k record dataset:

$ time bcdata dump whse_fish.fiss_fish_obsrvtn_pnt_sp --query "POINT_TYPE_CODE = 'Observation'" > test.geojson
2024-12-18 12:40:36,988:INFO:bcdata.wfs: Total features requested: 377867

real	2m45.538s
user	0m18.398s
sys	0m1.821s

$ time curl -o test.parquet https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_fish.fiss_fish_obsrvtn_pnt_sp.parquet
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 19.2M  100 19.2M    0     0  10.7M      0  0:00:01  0:00:01 --:--:-- 10.7M

real	0m1.828s
user	0m0.306s
sys	0m0.156s
boshek (Collaborator) commented Dec 18, 2024

@smnorris can you clarify if you mean local or remote object storage?

smnorris (Author) commented Dec 18, 2024

Remote.
Local caching would be useful per application, but scheduled replication jobs loading frequently accessed data to remote object storage would let all open-data clients access the cached data via a direct URL rather than dealing with WFS.

I have not yet tinkered with optimizing the parquet files; the data volume is so trivial in most cases that it isn't a big deal.

This presumes that clients generally do not need 'live' downloads - data replicated on a scheduled basis by a centralized workflow would meet user needs. This is likely the case for 99% of WFS data - fire perimeters are an obvious exception.

smnorris (Author) commented:
FWIW, I could just add scheduled replication workflows to Python bcdata - but I don't have an NRS object storage bucket with that mandate, so I'm asking here 😁

ateucher (Collaborator) commented:
Hey @smnorris - love the idea, but I think it's out of scope for us here. If the mapping team does something like this though we will 100% make use of it!

ateucher closed this as not planned Dec 18, 2024
smnorris (Author) commented:
Too bad!
I'll continue to replicate/cache as much as I can justify per application.
Feel free to count this as another resource as long as the BC fish passage program/application is active. So far:

https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_admin_boundaries.clab_indian_reserves.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_admin_boundaries.clab_national_parks.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_basemapping.gba_local_reg_greenspaces_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_basemapping.gba_railway_structure_lines_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_basemapping.gba_railway_tracks_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_basemapping.gba_transmission_lines_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_basemapping.gns_geographical_names_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_environmental_monitoring.envcan_hydrometric_stn_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_fish.fiss_fish_obsrvtn_pnt_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_fish.fiss_obstacles_pnt_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_fish.fiss_stream_sample_sites_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_fish.pscis_assessment_svw.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_fish.pscis_design_proposal_svw.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_fish.pscis_habitat_confirmation_svw.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_fish.pscis_remediation_svw.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_forest_tenure.ften_range_poly_svw.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_forest_tenure.ften_road_section_lines_svw.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_imagery_and_base_maps.mot_road_structure_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_legal_admin_boundaries.abms_municipalities_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_mineral_tenure.og_petrlm_dev_rds_pre06_pub_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_mineral_tenure.og_road_segment_permit_sp.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_tantalis.ta_conservancy_areas_svw.parquet
https://nrs.objectstore.gov.bc.ca/bchamp/bcdata/whse_tantalis.ta_park_ecores_pa_svw.parquet

ateucher (Collaborator) commented:
I think the other thing that makes this less practical for us is that we support spatial and non-spatial querying (I can't remember if you do?), so we would have to build a whole other query backend for parquet. Not hard, but time is limited! :)

smnorris (Author) commented:
Good point - they won't work as a resource without that work.
I do support spatial and non-spatial filtering - but I just create the files with bcdata; I don't read them. That is what duckdb/sf/geopandas etc. are for.
