
Add fetch command to download expedition data based on space_time_region from schedule.yaml #83

Open · wants to merge 70 commits into main

Conversation

iuryt (Contributor) commented Nov 13, 2024

Based on the discussion from #68

Description

This PR updates the fetch command to download data specified by the expedition's area_of_interest. Key changes include:

  1. Integration with Schedule File: The fetch command now uses _get_schedule to load the Schedule YAML file from the specified expedition_dir. The Schedule now includes area_of_interest with spatial and temporal boundaries, which define the download region and timeframe for data subsets.

  2. Unified Data Download Configuration: All datasets, including bathymetry, are now defined in a single download_dict, making the code cleaner and allowing datasets to be added or modified without changes to the function logic (see the sketch after this list).

  3. Consistent Use of area_of_interest: All dataset downloads utilize area_of_interest for spatial (min/max latitude and longitude) and temporal (start/end time) boundaries, ensuring consistency across all downloaded data.
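
To make item 2 concrete, here is a minimal sketch of what such a unified download configuration could look like, assuming the copernicusmarine subset API; the dataset IDs and the download_dict/space_time_region/fetch_all names are illustrative, not the PR's actual code:

import copernicusmarine

# Every dataset, bathymetry included, is described by the same structure,
# so adding a dataset is a new dict entry rather than new function logic.
# The dataset IDs below are placeholders.
download_dict = {
    "bathymetry": {"dataset_id": "<bathymetry-dataset-id>", "variables": ["deptho"]},
    "physics": {"dataset_id": "<physics-dataset-id>", "variables": ["uo", "vo", "so", "thetao"]},
}

def fetch_all(space_time_region: dict, output_dir: str) -> None:
    for name, spec in download_dict.items():
        copernicusmarine.subset(
            dataset_id=spec["dataset_id"],
            variables=spec["variables"],
            minimum_longitude=space_time_region["minimum_longitude"],
            maximum_longitude=space_time_region["maximum_longitude"],
            minimum_latitude=space_time_region["minimum_latitude"],
            maximum_latitude=space_time_region["maximum_latitude"],
            start_datetime=space_time_region["start_time"],
            end_datetime=space_time_region["end_time"],
            output_filename=f"{name}.nc",
            output_directory=output_dir,
        )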

Should we delete the whole scripts/ folder?

Status

This is a draft PR. I have not tested running the fetch command with an example expedition_dir to ensure data is downloaded correctly based on area_of_interest.

iuryt (Contributor, Author) commented Nov 13, 2024

@ammedd and @VeckoTheGecko
What do you think of the way I am implementing this?
I was also wondering whether _get_schedule and _get_ship_config should be in do_expedition or somewhere else.

VeckoTheGecko (Collaborator) left a comment

I like the structure. Here's a preliminary review.

@VeckoTheGecko (comment marked as outdated)

VeckoTheGecko (Collaborator) commented Nov 15, 2024

> Should we delete the whole scripts/ folder?

I think so


@iuryt where do you think the data would be downloaded to? I think it would be nice to have it in a subfolder in the mission folder, and also somehow tie the spatial/temporal ranges to that dataset. Perhaps something like:

mission_folder
└── data
    └── vSgsJWk
        ├── a.nc
        ├── b.nc
        ├── c.nc
        ├── d.nc
        └── schedule.yaml

where vSgsJWk is a hash of the spatial/temporal information. If users change the region/time range, a new folder would be generated. schedule.yaml is just a copy of the file so that users know what spatial/temporal settings they had.

Not perfect (i.e., changing any of the domains results in a full data download) but at least it avoids re-downloading already downloaded data, and isolates the data files for different areas.
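
For illustration, such a hash could be derived along these lines (a sketch; the helper name, field handling, and hash length are arbitrary choices, not the PR's code):

import hashlib
import json

def region_hash(space_time_region: dict, length: int = 7) -> str:
    # Serialise deterministically so that key order cannot change the hash.
    blob = json.dumps(space_time_region, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()[:length]

# e.g. data_dir = mission_folder / "data" / region_hash(region)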

erikvansebille (Member) left a comment

Looks good! More comments from me below.

Comment on lines 107 to 108 of src/virtualship/cli/commands.py:
minimum_depth=0.49402499198913574,
maximum_depth=5727.9169921875,
erikvansebille (Member):

Do we always need to download full-depth datasets? @ammedd, would there also be cases (when not using a CTD?) where only upper/surface ocean data are needed? Or would that just overcomplicate the script?

iuryt (Contributor, Author):

We can add this very easily to the area_of_interest. Do we have checks to see if the data covers the survey?

Collaborator:

There are indeed (many) cases where the full depth is not needed. It would be good to check the ship_config for the maximum depth needed.

Collaborator:

Is this something that we should deal with in this PR? It would require some choices about which depths apply to which datasets/instruments, which might be better to discuss in another issue/future PR.

Collaborator:

No, thanks for bringing it up; let's discuss it in the meeting tomorrow and move it to a new PR.
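
For reference, the kind of check suggested above could look roughly like this (a hedged sketch with hypothetical names; the fallback value is the full-depth maximum quoted in the diff):

def required_max_depth(instrument_depths: dict[str, float],
                       full_depth: float = 5727.9169921875) -> float:
    """Deepest depth any configured instrument needs, capped at the full water column."""
    if not instrument_depths:
        return full_depth  # no instrument info: fall back to full depth
    return min(max(instrument_depths.values()), full_depth)

# e.g. required_max_depth({"ctd": 5000.0, "adcp": 1000.0}) == 5000.0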

iuryt (Contributor, Author) commented Nov 15, 2024

> > Should we delete the whole scripts/ folder?
>
> I think so
>
> @iuryt where do you think the data would be downloaded to? I think it would be nice to have it in a subfolder in the mission folder, and also somehow tie the spatial/temporal ranges to that dataset. Perhaps something like:
>
> mission_folder
> └── data
>     └── vSgsJWk
>         ├── a.nc
>         ├── b.nc
>         ├── c.nc
>         ├── d.nc
>         └── schedule.yaml
>
> where vSgsJWk is a hash of the spatial/temporal information. If users change the region/time range, a new folder would be generated. schedule.yaml is just a copy of the file so that users know what spatial/temporal settings they had.
>
> Not perfect (i.e., changing any of the domains results in a full data download) but at least it avoids re-downloading already downloaded data, and isolates the data files for different areas.

Thank you so much for your suggestions! I’ve been thinking about how to put them into action.

One idea I had is to keep all the data in a single folder and just check the metadata for any changes in the spatial-temporal domain. If the user changes the domain, we can easily overwrite the data. I’m curious, though—why might someone need access to the previous data? To make things easier and skip loading the metadata each time, we could simply check if the copied data/schedule.yaml matches the current schedule.yaml domain.

I really like the concept of using a hash, but I’m a bit concerned it might not be the friendliest option for those who aren’t familiar with it, especially students. It might require some extra explanation during lectures. I’d love to hear what @ammedd and @erikvansebille think about this!

erikvansebille (Member):

> I really like the concept of using a hash, but I'm a bit concerned it might not be the friendliest option for those who aren't familiar with it, especially students. It might require some extra explanation during lectures. I'd love to hear what @ammedd and @erikvansebille think about this!

Hi @iuryt, I agree that a hash-based folder name might not be very friendly for students to work with. Perhaps a timestamp-based folder would be friendlier? E.g. mission_folder/data/download_YYYYMMDD_HHMMSS/*.

That would also help easily locate the latest version of the downloaded data because the downloads would automatically be sorted.

iuryt (Contributor, Author) commented Nov 15, 2024

> I really like the concept of using a hash, but I'm a bit concerned it might not be the friendliest option for those who aren't familiar with it, especially students. It might require some extra explanation during lectures. I'd love to hear what @ammedd and @erikvansebille think about this!
>
> Hi @iuryt, I agree that a hash-based folder name might not be very friendly for students to work with. Perhaps a timestamp-based folder would be friendlier? E.g. mission_folder/data/download_YYYYMMDD_HHMMSS/*.
>
> That would also help easily locate the latest version of the downloaded data because the downloads would automatically be sorted.

That makes sense. But do you think it is helpful to keep past downloads?

erikvansebille (Member):

> But do you think it is helpful to keep past downloads?

Well, we don't want people to have to redownload every time they run the virtual ship, I assume. We could only keep the last download, but then why even have a folder per download (the idea of a hash in the first place)?

iuryt (Contributor, Author) commented Nov 15, 2024

> But do you think it is helpful to keep past downloads?
>
> Well, we don't want people to have to redownload every time they run the virtual ship, I assume. We could only keep the last download, but then why even have a folder per download (the idea of a hash in the first place)?

I was wondering if we could check for updates to the schedule.yaml file compared to the copy in the download folder. We could only overwrite it if there were changes in the spatial-temporal domain. This way, we wouldn’t need separate folders for each download. Let me know your thoughts.
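
A minimal sketch of that check, assuming the copied schedule sits next to the data and PyYAML is available (helper and key names are hypothetical):

from pathlib import Path

import yaml

def needs_download(schedule_path: Path, data_dir: Path) -> bool:
    copied = data_dir / "schedule.yaml"
    if not copied.exists():
        return True  # nothing has been downloaded yet
    current = yaml.safe_load(schedule_path.read_text())
    previous = yaml.safe_load(copied.read_text())
    # Only the spatial-temporal domain decides whether the data can be re-used.
    return current.get("space_time_region") != previous.get("space_time_region")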

VeckoTheGecko (Collaborator) commented Nov 18, 2024

I'm still a fan of hashes, as I think they simplify the implementation details and add a bit of future-proofing (i.e., what if we want to "force" new data downloads with a new version of virtualship because we add a new dataset, or the download command has changed? Doing so with a hash is a change to a single line of code in the hashing function. Also, what if a new version of VS changes the format of schedule.yaml?). The concept of hashing isn't important to students; all they need to know is: "Defining the spatial/temporal domain in schedule.yaml will make it so that data is downloaded to a unique folder. If you re-use the same domain, previous downloads will be found." If users want to look at the details of the download, they can look at schedule.yaml in the folder. A purely timestamp-based system would require loading each schedule, which can be sidestepped via hashing. We could have the best of both worlds by doing data/YYYYMMDD_HHMMSS_vSgsJWk so it's human-readable and functional with the hash.
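
A sketch of that combined naming (hash scheme and lengths are illustrative):

import hashlib
import json
from datetime import datetime

def download_dirname(space_time_region: dict) -> str:
    blob = json.dumps(space_time_region, sort_keys=True, default=str)
    digest = hashlib.sha256(blob.encode()).hexdigest()[:7]
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{digest}"  # e.g. "20250113_160600_a3f9c2d"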

This is all underpinned by whether re-using the data is important. My conversations with @ammedd, mentioning that these data downloads are sizeable and take a long time (if someone can advise how long/how large, since I don't know this), lead me to be inclined to save the downloads and re-use them as much as practical, and leave managing the storage space to the user. A "one folder per domain" approach takes more space, but allows users to change the domain as they see fit without worrying about data being cleared out by a separate run they did.

> One idea I had is to keep all the data in a single folder

As in, across different missions the data would also be saved to the same folder? (e.g., a folder that they're prompted to specify the first time they run VS?) I initially suggested in #83 (comment) that the data be downloaded into the mission folder so that it is very visible to the user and centralised. I think the mission folder approach is easier to implement (no need to worry about (a) permissions elsewhere on the file system, (b) users not knowing where the data is stored, (c) setting/saving application-level configuration), although users would need to manually copy data files between missions if they want to avoid re-downloading. I think that trade-off would be worth it.

Sorry about the wall of text :)

ammedd (Collaborator) commented Nov 18, 2024

Wow! So many ideas and so much work going on!

I like saving data to the mission folder, and the data/YYYYMMDD_vSgsJWk human-readable-plus-hash option sounds like the best of both worlds. Stating the date of the downloaded dataset, not the date of the download ;-)

If I remember correctly, downloads took up to 2 hours each last year, and my 256GB disk couldn't handle all the data needed for the 7 groups we had. I do think the download method (and speed) has improved since then.

erikvansebille (Member):

Maybe this is because the PR is still in draft, but when I tried to use it (to download data for a drifter simulation), I got the following error:

(parcels) erik ~/Codes/VirtualShip % virtualship fetch TrAtldrifters
Traceback (most recent call last):
  File "/Users/erik/anaconda3/envs/parcels/bin/virtualship", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: fetch() got an unexpected keyword argument 'path'

Any idea what's going on?

VeckoTheGecko (Collaborator) commented Jan 13, 2025

I have now added features for:

  • Automatic re-use of downloads with the same AOI
  • Handling of partial downloads (i.e., if the student cancels a download part way through, they will be prompted to manually delete it next time they run the tool; one possible approach is sketched after this list)
  • Updated to use copernicusmarine>=2
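
One possible shape for the partial-download handling (a sketch, not necessarily the PR's implementation): write a marker file only once every dataset has been fetched, and treat a data folder without the marker as incomplete.

from pathlib import Path

COMPLETE_MARKER = ".download_complete"

def download_into(data_dir: Path, do_download) -> None:
    marker = data_dir / COMPLETE_MARKER
    if marker.exists():
        return  # a previous download finished; re-use it
    if data_dir.exists():
        raise SystemExit(
            f"Found an incomplete download at {data_dir}. "
            "Please delete this folder and run `virtualship fetch` again."
        )
    data_dir.mkdir(parents=True)
    do_download(data_dir)  # fetch all datasets into data_dir
    marker.touch()  # only reached if every download succeeded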

Also mentioned in #83 (comment), but wondering:

> Should we leave the "only downloading up to a certain depth" for a future issue and PR so that we can scope it out and discuss it separately? If I'm not mistaken, from our discussion earlier, it didn't sound like the solution would be simple.

VeckoTheGecko marked this pull request as ready for review January 13, 2025 16:06
erikvansebille (Member) left a comment

Do we also need to update the information on docs.oceanparcels.org, via changes to README.md? E.g. give more info on the credentials in the virtualship fetch --help info

"""
Execute flow of getting credentials for use in the `fetch` command.

- If username and password are provided via CLI, use them (ignore the credentials file if exists).
erikvansebille (Member):

Maybe explain what CLI is? Not sure every user will know that.

Suggested change
- If username and password are provided via CLI, use them (ignore the credentials file if exists).
- If username and password are provided via Command Line Interface (CLI), use them (ignore the credentials file if exists).
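
For context, the precedence being documented could be implemented roughly like this (a sketch; function and file layout are hypothetical, and the prompt fallback mirrors the fetch docstring quoted further below):

from pathlib import Path

import yaml

def resolve_credentials(cli_username, cli_password, creds_file: Path):
    # 1. Explicit CLI arguments win, ignoring any credentials file.
    if cli_username is not None and cli_password is not None:
        return cli_username, cli_password
    # 2. Otherwise fall back to the YAML credentials file, if present.
    if creds_file.exists():
        creds = yaml.safe_load(creds_file.read_text())
        return creds["username"], creds["password"]
    # 3. Finally, prompt the user interactively (e.g. via click.prompt).
    raise RuntimeError("No credentials found; prompt the user here.")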

VeckoTheGecko (Collaborator) commented Jan 15, 2025

This was intended more as developer documentation than user documentation. What do we think in terms of user documentation? Perhaps I can write a documentation page? (I imagine we wouldn't want this information siloed in a notebook, but rather link out to the doc page; I imagine there would be other course content in a notebook.) Thoughts on how this fits with the course content structure, @ammedd?

Updating the docstring here won't appear to users unless it's also displayed in some sort of user documentation.

Let's put that doc page in a different PR; this is already big enough and getting a bit much 😅

@erikvansebille (comment marked as outdated)

Comment on lines +69 to +77
"""
Download input data for an expedition.

Entrypoint for the tool to download data based on space-time region provided in the
schedule file. Data is downloaded from Copernicus Marine, credentials for which can be
obtained via registration: https://data.marine.copernicus.eu/register . Credentials can
be provided on prompt, via command line arguments, or via a YAML config file. Run
`virtualship fetch` on an expedition for more info.
"""
Collaborator:

> Do we also need to update the information on docs.oceanparcels.org, via changes to README.md? E.g. give more info on the credentials in the virtualship fetch --help info

Oh yes, that was an item discussed yesterday. Updated; thoughts on this (which will be surfaced in the --help command)? I don't want to double up on the info, hence the "Run `virtualship fetch` on an expedition for more info." Not sure if "Credentials can be provided on prompt" can be worded better.

Collaborator:

I'll update the README once confirmed.

VeckoTheGecko changed the title from "Add fetch command to download expedition data based on area_of_interest from schedule.yaml" to "Add fetch command to download expedition data based on space_time_interest from schedule.yaml" on Jan 15, 2025
VeckoTheGecko (Collaborator):

By the way, I have also updated all references to "area of interest" to "space time region", as per the earlier review feedback.

VeckoTheGecko changed the title from "Add fetch command to download expedition data based on space_time_interest from schedule.yaml" to "Add fetch command to download expedition data based on space_time_region from schedule.yaml" on Jan 15, 2025