Add fetch command to download expedition data based on space_time_region from schedule.yaml #83
base: main
Conversation
@ammedd and @VeckoTheGecko
I like the structure. Here's a preliminary review
I think so @iuryt. Where do you think the data would be downloaded to? I think it would be nice to have it in a subfolder of the mission folder, and also somehow tie the spatial/temporal ranges to that dataset. Perhaps something like the per-domain folder layout sketched below.
Not perfect (i.e., changing any of the domains results in a full data download), but at least it avoids re-downloading already downloaded data and isolates the data files for different areas.
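A minimal sketch of what such a domain-keyed subfolder could look like, assuming a hypothetical `domain_folder` helper and a dict-like `space_time_region` (illustrative only, not the PR's actual code):

```python
import hashlib
import json
from pathlib import Path


def domain_folder(mission_dir: Path, space_time_region: dict) -> Path:
    """Return a data subfolder inside the mission folder, keyed by the domain.

    Hashing the serialized space-time region means any change to the domain maps
    to a fresh folder (and hence a fresh download), while repeated runs with the
    same domain reuse the already-downloaded files.
    """
    key = hashlib.sha256(
        json.dumps(space_time_region, sort_keys=True, default=str).encode()
    ).hexdigest()[:12]
    return mission_dir / "data" / key


# e.g. mission/data/<12-char hash>/
print(domain_folder(Path("mission"), {"min_lat": 40, "max_lat": 50, "min_lon": -30, "max_lon": -20}))
```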
Looks good! More comments from me below
src/virtualship/cli/commands.py (Outdated)
minimum_depth=0.49402499198913574,
maximum_depth=5727.9169921875,
Do we always need to download full-depth datasets? @ammedd, would there also be cases (when not using a CTD?) where only upper/surface ocean data are needed? Or would that just overcomplicate the script?
We can add this very easily to the `area_of_interest`. Do we have checks to see if the data covers the survey?
There are indeed (many) cases where the full depth is not needed. It would be good to check the `ship_config` for the maximum depth needed.
Is this something that we should deal with in this PR? It would require some choices about which depths to use for which datasets/instruments, which might be better to discuss in another issue/future PR.
No, thanks for bringing it up. Let's discuss it in the meeting tomorrow and move it to a new PR.
Thank you so much for your suggestions! I've been thinking about how to put them into action. One idea I had is to keep all the data in a single folder and just check the metadata for any changes in the spatial-temporal domain. If the user changes the domain, we can easily overwrite the data. I'm curious, though: why might someone need access to the previous data? To make things easier and skip loading the metadata each time, we could simply check whether the copied schedule has changed.

I really like the concept of using a hash, but I'm a bit concerned it might not be the friendliest option for those who aren't familiar with it, especially students. It might require some extra explanation during lectures. I'd love to hear what @ammedd and @erikvansebille think about this!
Hi @iuryt, I agree that a hash-based folder name might not be very friendly for students to work with. Perhaps a timestamp-based folder (e.g. named after the download time) would be friendlier? That would also help to easily locate the latest version of the downloaded data, because the downloads would automatically be sorted.
That makes sense. But do you think it is helpful to keep past downloads?
Well, we don't want people to have to re-download every time they run the virtual ship, I assume. We could keep only the last download, but then why even have a folder per download (the idea of a hash in the first place)?
I was wondering if we could check for updates to the schedule file instead.
I'm still a fan of hashes, as I think it simplifies the implementation details and adds a bit of futureproofing (e.g., what if we want to "force" new data downloads with a new version of virtualship because we added a new dataset, or the download command has changed? Doing so with a hash is a change to a single line of code in the hashing function. Also, what if a new version of VS changes the format of the schedule?).

This is all underpinned by whether re-using the data is important. My conversations with @ammedd, mentioning that these data downloads are sizeable and take a long time (if someone can advise how long/how large, since I don't know this), lead me to be inclined to save the downloads and re-use them as much as practical, and leave managing the storage space to the user. A "one folder per domain" approach takes more space, but allows the user to change the domain as they see fit without worrying about it being cleared out by a separate run they did.
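For illustration, forcing new downloads across versions could be done with a version salt inside the hashing function; a hedged sketch with hypothetical names, not this PR's code:

```python
import hashlib
import json

# Bumping this constant changes every hash, invalidating previously downloaded
# folders and forcing a fresh download (e.g. after adding a new dataset or
# changing the download format).
_DOWNLOAD_SCHEMA_VERSION = 1


def download_hash(space_time_region: dict) -> str:
    """Hash the space-time region together with a schema version."""
    payload = json.dumps(
        {"version": _DOWNLOAD_SCHEMA_VERSION, "region": space_time_region},
        sort_keys=True,
        default=str,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```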
As in, across different missions the data would also be saved to the same folder? (e.g., a folder that they're prompted to specify the first time they run VS?)

I initially suggested in #83 (comment) downloading the data into the mission folder so that it is very visible to the user and centralised. I think the mission folder approach is easier to implement (no need to worry about (a) permissions elsewhere on the file system, (b) users not knowing where the data is stored, (c) setting/saving application-level configuration), although users would need to manually copy data files between missions if they want to avoid re-downloading. I think that trade-off would be worth it.

Sorry about the wall of text :)
Wow! So many ideas/so much work going on! I like saving data to the mission folder and the proposed folder structure.

If I remember correctly, downloads took up to 2 hours each last year, and my 256 GB disk couldn't handle all the data needed for the 7 groups we had. I do think the download method (and speed) has improved since then.
Maybe this is because the PR is still in draft, but when I tried to use it (to download data for a drifter simulation), I got the following error:

(parcels) erik ~/Codes/VirtualShip % virtualship fetch TrAtldrifters
Traceback (most recent call last):
File "/Users/erik/anaconda3/envs/parcels/bin/virtualship", line 8, in <module>
sys.exit(cli())
^^^^^
File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: fetch() got an unexpected keyword argument 'path'

Any idea what's going on?
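For context (an assumption about the likely cause, not a confirmed diagnosis of this PR): click passes each declared argument to the command callback as a keyword named after the declaration, so a name mismatch between `@click.argument(...)` and the function signature produces exactly this TypeError. A minimal reproduction:

```python
import click


# click derives the keyword it passes to the callback from the declared
# argument name. If the decorator says "path" but the function signature says
# "expedition_dir", invoking the command raises:
#   TypeError: fetch() got an unexpected keyword argument 'path'
@click.command()
@click.argument("path")
def fetch(expedition_dir):
    click.echo(expedition_dir)


# Matching the names (both "path", or both "expedition_dir") fixes it.
@click.command()
@click.argument("expedition_dir")
def fetch_fixed(expedition_dir):
    click.echo(f"Fetching data for expedition at {expedition_dir}")


if __name__ == "__main__":
    fetch_fixed()
```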
I have now added features for:
Also mentioned in #83 (comment), but wondering: should we leave the "only downloading up to a certain depth" feature for a future issue and PR, so that we can scope it out and discuss it separately? If I'm not mistaken, from our discussion earlier, it didn't sound like the solution would be simple.
""" | ||
Execute flow of getting credentials for use in the `fetch` command. | ||
|
||
- If username and password are provided via CLI, use them (ignore the credentials file if exists). |
Maybe explain what CLI is? Not sure every user will know that.
- If username and password are provided via CLI, use them (ignore the credentials file if exists).
- If username and password are provided via Command Line Interface (CLI), use them (ignore the credentials file if exists).
This was intended more as developer documentation rather than user documentation. What do we think in terms of user documentation? Perhaps I can write a documentation page? (I imagine that we wouldn't want this information siloed in a notebook, but rather link out to the doc page; I imagine there would be other course content in a notebook.) Thoughts on how this fits with the course content structure @ammedd?
Updating the docstring here won't appear to users unless it's also displayed in some sort of user documentation.
Let's put that doc page in a different PR, this is already big enough and getting a bit much 😅
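As an illustrative sketch of the precedence described in that docstring (hypothetical helper name and file layout, not the implementation in this PR):

```python
from pathlib import Path

import yaml


def resolve_credentials(
    username: str | None,
    password: str | None,
    creds_file: Path,
) -> tuple[str, str]:
    """Resolve Copernicus Marine credentials.

    Precedence: values passed on the command line win; otherwise fall back to a
    YAML credentials file; otherwise prompt the user interactively.
    """
    if username is not None and password is not None:
        return username, password  # CLI arguments take priority
    if creds_file.exists():
        creds = yaml.safe_load(creds_file.read_text())
        return creds["username"], creds["password"]
    # Neither CLI arguments nor a credentials file: prompt.
    return input("Copernicus Marine username: "), input("Copernicus Marine password: ")
```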
""" | ||
Download input data for an expedition. | ||
|
||
Entrypoint for the tool to download data based on space-time region provided in the | ||
schedule file. Data is downloaded from Copernicus Marine, credentials for which can be | ||
obtained via registration: https://data.marine.copernicus.eu/register . Credentials can | ||
be provided on prompt, via command line arguments, or via a YAML config file. Run | ||
`virtualship fetch` on a expedition for more info. | ||
""" |
Do we also need to update the information on docs.oceanparcels.org, via changes to README.md? E.g. give more info on the credentials in the `virtualship fetch --help` info.
Oh yes, that was an item discussed yesterday. Updated - thoughts on this (which will be surfaced in the --help command)? I don't want to double up on the info, hence the "Run `virtualship fetch` on an expedition for more info." Not sure if "Credentials can be provided on prompt" can be worded better.
I'll update the readme once confirmed
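For orientation, a hypothetical skeleton of the `fetch` entry point described above; the option names, schedule key, and data folder are assumptions rather than the PR's actual signature:

```python
from pathlib import Path

import click
import yaml


@click.command()
@click.argument("expedition_dir", type=click.Path(exists=True, file_okay=False, path_type=Path))
@click.option("--username", default=None, help="Copernicus Marine username (can also be prompted or read from a config file).")
@click.option("--password", default=None, help="Copernicus Marine password.")
def fetch(expedition_dir: Path, username: str | None, password: str | None) -> None:
    """Download input data for an expedition."""
    schedule = yaml.safe_load((expedition_dir / "schedule.yaml").read_text())
    region = schedule["space_time_region"]  # assumed key name
    click.echo(f"Downloading data for {region} into {expedition_dir / 'data'}")
    # ...call copernicusmarine.subset(...) for each dataset here...


if __name__ == "__main__":
    fetch()
```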
Changed the title: fetch command to download expedition data based on area_of_interest from schedule.yaml → fetch command to download expedition data based on space_time_interest from schedule.yaml
By the way, I have also updated all references to "area of interest" to "space time region", as per the earlier review feedback.
Changed the title: fetch command to download expedition data based on space_time_interest from schedule.yaml → fetch command to download expedition data based on space_time_region from schedule.yaml
Based on the discussion from #68

Description

This PR updates the `fetch` command to download data specified by the expedition's `area_of_interest`. Key changes include:

- Integration with Schedule File: The `fetch` command now uses `_get_schedule` to load the `Schedule` YAML file from the specified `expedition_dir`. The `Schedule` now includes `area_of_interest` with spatial and temporal boundaries, which define the download region and timeframe for data subsets.
- Unified Data Download Configuration: All datasets, including bathymetry, are now defined in a single `download_dict`, making the code cleaner and allowing for easy addition or modification of datasets without changes to the function logic (see the sketch after this description).
- Consistent Use of `area_of_interest`: All dataset downloads utilize `area_of_interest` for spatial (min/max latitude and longitude) and temporal (start/end time) boundaries, ensuring consistency across all downloaded data.

Should we delete the whole `scripts/` folder?

Status

This is a draft PR. I have not tested running the `fetch` command with an example `expedition_dir` to ensure data is downloaded correctly based on `area_of_interest`.
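A hedged sketch of what a unified `download_dict` driving `copernicusmarine.subset` could look like; the dataset IDs, variable names, and output folder below are placeholders, not the PR's actual configuration:

```python
import copernicusmarine

# Placeholder space-time region; in the PR this comes from the schedule's area_of_interest.
space_time_region = {
    "minimum_longitude": -30.0,
    "maximum_longitude": -20.0,
    "minimum_latitude": 40.0,
    "maximum_latitude": 50.0,
    "start_datetime": "2023-01-01T00:00:00",
    "end_datetime": "2023-01-10T00:00:00",
}

# One entry per dataset: adding or modifying a dataset only touches this dict,
# not the download logic.
download_dict = {
    "physics": {
        "dataset_id": "cmems_mod_glo_phy_anfc_0.083deg_PT1H-m",  # placeholder ID
        "variables": ["uo", "vo", "thetao", "so"],
        "output_filename": "physics.nc",
    },
    "bathymetry": {
        "dataset_id": "cmems_mod_glo_phy_my_0.083deg_static",  # placeholder ID
        "variables": ["deptho"],
        "output_filename": "bathymetry.nc",
    },
}

for name, spec in download_dict.items():
    print(f"Downloading {name}...")
    copernicusmarine.subset(
        dataset_id=spec["dataset_id"],
        variables=spec["variables"],
        output_filename=spec["output_filename"],
        output_directory="data",
        **space_time_region,
    )
```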