
Add fetch command to download expedition data based on space_time_region from schedule.yaml #83

Open · wants to merge 70 commits into main

Conversation

iuryt (Contributor) commented Nov 13, 2024

Based on the discussion from #68

Description

This PR updates the fetch command to download data specified by the expedition's area_of_interest. Key changes include:

  1. Integration with Schedule File: The fetch command now uses _get_schedule to load the Schedule YAML file from the specified expedition_dir. The Schedule now includes area_of_interest with spatial and temporal boundaries, which define the download region and timeframe for data subsets.

  2. Unified Data Download Configuration: All datasets, including bathymetry, are now defined in a single download_dict, making the code cleaner and allowing datasets to be added or modified without changes to the function logic (see the sketch after this list).

  3. Consistent Use of area_of_interest: All dataset downloads utilize area_of_interest for spatial (min/max latitude and longitude) and temporal (start/end time) boundaries, ensuring consistency across all downloaded data.
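
To make item 2 concrete, here is a minimal sketch of what such a unified download configuration could look like, assuming the copernicusmarine subset API; the dataset IDs and the download_dict/space_time_region/fetch_all names are illustrative, not the PR's actual code:

import copernicusmarine

# Every dataset, bathymetry included, is described by the same structure,
# so adding a dataset is a new dict entry rather than new function logic.
# The dataset IDs below are placeholders.
download_dict = {
    "bathymetry": {"dataset_id": "<bathymetry-dataset-id>", "variables": ["deptho"]},
    "physics": {"dataset_id": "<physics-dataset-id>", "variables": ["uo", "vo", "so", "thetao"]},
}

def fetch_all(space_time_region: dict, output_dir: str) -> None:
    for name, spec in download_dict.items():
        copernicusmarine.subset(
            dataset_id=spec["dataset_id"],
            variables=spec["variables"],
            minimum_longitude=space_time_region["minimum_longitude"],
            maximum_longitude=space_time_region["maximum_longitude"],
            minimum_latitude=space_time_region["minimum_latitude"],
            maximum_latitude=space_time_region["maximum_latitude"],
            start_datetime=space_time_region["start_time"],
            end_datetime=space_time_region["end_time"],
            output_filename=f"{name}.nc",
            output_directory=output_dir,
        )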

Should we delete the whole scripts/ folder?

Status

This is a draft PR. I have not tested running the fetch command with an example expedition_dir to ensure data is downloaded correctly based on area_of_interest.

iuryt (Contributor, Author) commented Nov 13, 2024

@ammedd and @VeckoTheGecko
What do you think of the way I am implementing this?
I was also wondering whether _get_schedule and _get_ship_config should be in do_expedition or somewhere else.

VeckoTheGecko (Collaborator) left a comment

I like the structure. Here's a preliminary review.

@VeckoTheGecko (comment marked as outdated)

VeckoTheGecko (Collaborator) commented Nov 15, 2024

> Should we delete the whole scripts/ folder?

I think so


@iuryt where do you think the data would be downloaded to? I think it would be nice to have it in a subfolder in the mission folder, and also somehow tie the spatial/temporal ranges to that dataset. Perhaps something like:

mission_folder
└── data
    └── vSgsJWk
        ├── a.nc
        ├── b.nc
        ├── c.nc
        ├── d.nc
        └── schedule.yaml

where vSgsJWk is a hash of the spatial/temporal information. If users change the region/time range, a new folder would be generated. schedule.yaml is just a copy of the file so that users know what spatial/temporal settings they had.

Not perfect (i.e., changing any of the domains results in a full data download) but at least it avoids re-downloading already downloaded data, and isolates the data files for different areas.
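
For illustration, such a hash could be derived along these lines (a sketch; the helper name, field handling, and hash length are arbitrary choices, not the PR's code):

import hashlib
import json

def region_hash(space_time_region: dict, length: int = 7) -> str:
    # Serialise deterministically so that key order cannot change the hash.
    blob = json.dumps(space_time_region, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()[:length]

# e.g. data_dir = mission_folder / "data" / region_hash(region)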

erikvansebille (Member) left a comment

Looks good! More comments from me below.

Comment on lines 107 to 108 of src/virtualship/cli/commands.py:
minimum_depth=0.49402499198913574,
maximum_depth=5727.9169921875,
erikvansebille (Member):

Do we always need to download full-depth datasets? @ammedd, would there also be cases (when not using a CTD?) where only upper/surface ocean data are needed? Or would that just overcomplicate the script?

iuryt (Contributor, Author):

We can add this very easily to the area_of_interest. Do we have checks to see if the data covers the survey?

Collaborator:

There are indeed (many) cases where the full depth is not needed. It would be good to check the ship_config for the maximum depth needed.

Collaborator:

Is this something that we should deal with in this PR? It would require some choices about which depths apply to which datasets/instruments, which might be better to discuss in another issue/future PR.

Collaborator:

No, thanks for bringing it up; let's discuss it in the meeting tomorrow and move it to a new PR.
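
For reference, the kind of check suggested above could look roughly like this (a hedged sketch with hypothetical names; the fallback value is the full-depth maximum quoted in the diff):

def required_max_depth(instrument_depths: dict[str, float],
                       full_depth: float = 5727.9169921875) -> float:
    """Deepest depth any configured instrument needs, capped at the full water column."""
    if not instrument_depths:
        return full_depth  # no instrument info: fall back to full depth
    return min(max(instrument_depths.values()), full_depth)

# e.g. required_max_depth({"ctd": 5000.0, "adcp": 1000.0}) == 5000.0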

iuryt (Contributor, Author) commented Nov 15, 2024

> > Should we delete the whole scripts/ folder?
>
> I think so
>
> @iuryt where do you think the data would be downloaded to? I think it would be nice to have it in a subfolder in the mission folder, and also somehow tie the spatial/temporal ranges to that dataset. Perhaps something like:
>
> mission_folder
> └── data
>     └── vSgsJWk
>         ├── a.nc
>         ├── b.nc
>         ├── c.nc
>         ├── d.nc
>         └── schedule.yaml
>
> where vSgsJWk is a hash of the spatial/temporal information. If users change the region/time range, a new folder would be generated. schedule.yaml is just a copy of the file so that users know what spatial/temporal settings they had.
>
> Not perfect (i.e., changing any of the domains results in a full data download) but at least it avoids re-downloading already downloaded data, and isolates the data files for different areas.

Thank you so much for your suggestions! I’ve been thinking about how to put them into action.

One idea I had is to keep all the data in a single folder and just check the metadata for any changes in the spatial-temporal domain. If the user changes the domain, we can easily overwrite the data. I’m curious, though—why might someone need access to the previous data? To make things easier and skip loading the metadata each time, we could simply check if the copied data/schedule.yaml matches the current schedule.yaml domain.

I really like the concept of using a hash, but I’m a bit concerned it might not be the friendliest option for those who aren’t familiar with it, especially students. It might require some extra explanation during lectures. I’d love to hear what @ammedd and @erikvansebille think about this!

erikvansebille (Member):

> I really like the concept of using a hash, but I'm a bit concerned it might not be the friendliest option for those who aren't familiar with it, especially students. It might require some extra explanation during lectures. I'd love to hear what @ammedd and @erikvansebille think about this!

Hi @iuryt, I agree that a hash-based folder name might not be very friendly for students to work with. Perhaps a timestamp-based folder would be friendlier? E.g. mission_folder/data/download_YYYYMMDD_HHMMSS/*.

That would also help easily locate the latest version of the downloaded data because the downloads would automatically be sorted.

iuryt (Contributor, Author) commented Nov 15, 2024

> I really like the concept of using a hash, but I'm a bit concerned it might not be the friendliest option for those who aren't familiar with it, especially students. It might require some extra explanation during lectures. I'd love to hear what @ammedd and @erikvansebille think about this!
>
> Hi @iuryt, I agree that a hash-based folder name might not be very friendly for students to work with. Perhaps a timestamp-based folder would be friendlier? E.g. mission_folder/data/download_YYYYMMDD_HHMMSS/*.
>
> That would also help easily locate the latest version of the downloaded data because the downloads would automatically be sorted.

That makes sense. But do you think it is helpful to keep past downloads?

erikvansebille (Member):

> But do you think it is helpful to keep past downloads?

Well, we don't want people to have to redownload every time they run the virtual ship, I assume. We could only keep the last download, but then why even have a folder per download (the idea of a hash in the first place)?

iuryt (Contributor, Author) commented Nov 15, 2024

> But do you think it is helpful to keep past downloads?
>
> Well, we don't want people to have to redownload every time they run the virtual ship, I assume. We could only keep the last download, but then why even have a folder per download (the idea of a hash in the first place)?

I was wondering if we could check for updates to the schedule.yaml file compared to the copy in the download folder. We could only overwrite it if there were changes in the spatial-temporal domain. This way, we wouldn’t need separate folders for each download. Let me know your thoughts.
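
A minimal sketch of that check, assuming the copied schedule sits next to the data and PyYAML is available (helper and key names are hypothetical):

from pathlib import Path

import yaml

def needs_download(schedule_path: Path, data_dir: Path) -> bool:
    copied = data_dir / "schedule.yaml"
    if not copied.exists():
        return True  # nothing has been downloaded yet
    current = yaml.safe_load(schedule_path.read_text())
    previous = yaml.safe_load(copied.read_text())
    # Only the spatial-temporal domain decides whether the data can be re-used.
    return current.get("space_time_region") != previous.get("space_time_region")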

VeckoTheGecko (Collaborator) commented Nov 18, 2024

I'm still a fan of hashes, as I think they simplify the implementation details and add a bit of future-proofing (i.e., what if we want to "force" new data downloads with a new version of virtualship because we add a new dataset, or the download command has changed? Doing so with a hash is a change to a single line of code in the hashing function. Also, what if a new version of VS changes the format of schedule.yaml?). The concept of hashing isn't important to students; all they need to know is: "Defining the spatial/temporal domain in schedule.yaml will make it so that data is downloaded to a unique folder. If you re-use the same domain, previous downloads will be found." If users want to look at the details of the download, they can look at schedule.yaml in the folder. A purely timestamp-based system would require loading each schedule, which can be sidestepped via hashing. We could have the best of both worlds by doing data/YYYYMMDD_HHMMSS_vSgsJWk so it's human-readable and functional with the hash.
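
A sketch of that combined naming (hash scheme and lengths are illustrative):

import hashlib
import json
from datetime import datetime

def download_dirname(space_time_region: dict) -> str:
    blob = json.dumps(space_time_region, sort_keys=True, default=str)
    digest = hashlib.sha256(blob.encode()).hexdigest()[:7]
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{digest}"  # e.g. "20250113_160600_a3f9c2d"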

This is all underpinned by whether re-using the data is important. My conversations with @ammedd, mentioning that these data downloads are sizeable and take a long time (if someone can advise how long/how large, since I don't know this), lead me to be inclined to save the downloads and re-use them as much as practical, and leave managing the storage space to the user. A "one folder per domain" approach takes more space, but allows users to change the domain as they see fit without worrying about data being cleared out by a separate run they did.

> One idea I had is to keep all the data in a single folder

As in, across different missions the data would also be saved to the same folder? (e.g., a folder that they're prompted to specify the first time they run VS?) I initially suggested in #83 (comment) that the data be downloaded into the mission folder so that it is very visible to the user and centralised. I think the mission folder approach is easier to implement (no need to worry about (a) permissions elsewhere on the file system, (b) users not knowing where the data is stored, (c) setting/saving application-level configuration), although users would need to manually copy data files between missions if they want to avoid re-downloading. I think that trade-off would be worth it.

Sorry about the wall of text :)

ammedd (Collaborator) commented Nov 18, 2024

Wow! So many ideas and so much work going on!

I like saving data to the mission folder, and the data/YYYYMMDD_vSgsJWk human-readable-plus-hash option sounds like the best of both worlds. Stating the date of the downloaded dataset, not the date of the download ;-)

If I remember correctly, downloads took up to 2 hours each last year, and my 256GB disk couldn't handle all the data needed for the 7 groups we had. I do think the download method (and speed) has improved since then.

erikvansebille (Member):

Maybe this is because the PR is still in draft, but when I tried to use it (to download data for a drifter simulation), I got the following error:

(parcels) erik ~/Codes/VirtualShip % virtualship fetch TrAtldrifters
Traceback (most recent call last):
  File "/Users/erik/anaconda3/envs/parcels/bin/virtualship", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: fetch() got an unexpected keyword argument 'path'

Any idea what's going on?

VeckoTheGecko (Collaborator) commented Jan 13, 2025

I have now added features for:

  • Automatic re-use of downloads with the same AOI
  • Handling of partial downloads (i.e., if the student cancels a download part way through, they will be prompted to manually delete it next time they run the tool; one possible approach is sketched after this list)
  • Updated to use copernicusmarine>=2
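
One possible shape for the partial-download handling (a sketch, not necessarily the PR's implementation): write a marker file only once every dataset has been fetched, and treat a data folder without the marker as incomplete.

from pathlib import Path

COMPLETE_MARKER = ".download_complete"

def download_into(data_dir: Path, do_download) -> None:
    marker = data_dir / COMPLETE_MARKER
    if marker.exists():
        return  # a previous download finished; re-use it
    if data_dir.exists():
        raise SystemExit(
            f"Found an incomplete download at {data_dir}. "
            "Please delete this folder and run `virtualship fetch` again."
        )
    data_dir.mkdir(parents=True)
    do_download(data_dir)  # fetch all datasets into data_dir
    marker.touch()  # only reached if every download succeeded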

Also mentioned in #83 (comment), but wondering:

> Should we leave the "only downloading up to a certain depth" for a future issue and PR so that we can scope it out and discuss it separately? If I'm not mistaken, from our discussion earlier, it didn't sound like the solution would be simple.

VeckoTheGecko marked this pull request as ready for review January 13, 2025 16:06
erikvansebille (Member) left a comment

Do we also need to update the information on docs.oceanparcels.org, via changes to README.md? E.g. give more info on the credentials in the virtualship fetch --help info

"""
Execute flow of getting credentials for use in the `fetch` command.

- If username and password are provided via CLI, use them (ignore the credentials file if exists).
erikvansebille (Member):

Maybe explain what CLI is? Not sure every user will know that.

Suggested change
- If username and password are provided via CLI, use them (ignore the credentials file if exists).
- If username and password are provided via Command Line Interface (CLI), use them (ignore the credentials file if exists).
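
For context, the precedence being documented could be implemented roughly like this (a sketch; function and file layout are hypothetical, and the prompt fallback mirrors the fetch docstring quoted further below):

from pathlib import Path

import yaml

def resolve_credentials(cli_username, cli_password, creds_file: Path):
    # 1. Explicit CLI arguments win, ignoring any credentials file.
    if cli_username is not None and cli_password is not None:
        return cli_username, cli_password
    # 2. Otherwise fall back to the YAML credentials file, if present.
    if creds_file.exists():
        creds = yaml.safe_load(creds_file.read_text())
        return creds["username"], creds["password"]
    # 3. Finally, prompt the user interactively (e.g. via click.prompt).
    raise RuntimeError("No credentials found; prompt the user here.")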

VeckoTheGecko (Collaborator) commented Jan 15, 2025

This was intended more as developer documentation than user documentation. What do we think in terms of user documentation? Perhaps I can write a documentation page? (I imagine we wouldn't want this information siloed in a notebook, but rather link out to the doc page; I imagine there would be other course content in a notebook.) Thoughts on how this fits with the course content structure, @ammedd?

Updating the docstring here won't appear to users unless it's also displayed in some sort of user documentation.

Let's put that doc page in a different PR; this is already big enough and getting a bit much 😅

@erikvansebille (comment marked as outdated)

Comment on lines +69 to +77
"""
Download input data for an expedition.

Entrypoint for the tool to download data based on space-time region provided in the
schedule file. Data is downloaded from Copernicus Marine, credentials for which can be
obtained via registration: https://data.marine.copernicus.eu/register . Credentials can
be provided on prompt, via command line arguments, or via a YAML config file. Run
`virtualship fetch` on an expedition for more info.
"""
Collaborator:

> Do we also need to update the information on docs.oceanparcels.org, via changes to README.md? E.g. give more info on the credentials in the virtualship fetch --help info

Oh yes, that was an item discussed yesterday. Updated; thoughts on this (which will be surfaced in the --help command)? I don't want to double up on the info, hence the "Run `virtualship fetch` on an expedition for more info." Not sure if "Credentials can be provided on prompt" can be worded better.

Collaborator:

I'll update the README once confirmed.

VeckoTheGecko changed the title from "Add fetch command to download expedition data based on area_of_interest from schedule.yaml" to "Add fetch command to download expedition data based on space_time_interest from schedule.yaml" on Jan 15, 2025
VeckoTheGecko (Collaborator):

By the way, I have also updated all references to "area of interest" to "space time region", as per the earlier review feedback.

VeckoTheGecko changed the title from "Add fetch command to download expedition data based on space_time_interest from schedule.yaml" to "Add fetch command to download expedition data based on space_time_region from schedule.yaml" on Jan 15, 2025