Documenting data on Lorenz #5

Open
26 tasks done
VeckoTheGecko opened this issue Dec 3, 2024 · 11 comments

Comments

@VeckoTheGecko
Contributor

VeckoTheGecko commented Dec 3, 2024

Most of the datasets on Lorenz are undocumented. The ones that do have documentation either:

  • Have a README.txt or README.md file in the folder
  • Have a wiki page on the wiki

It would be good to have the metadata centralised on the Wiki so it's all in one place, where anyone can view it and those in the Parcels team can edit it.

Currently these are the instructions I have for adding a new dataset:

Details

To add a new dataset, follow these steps:

  1. Create a new page with a suitable title, and fill out the template below. If a field isn't relevant or is unknown, you can put - or ? respectively. Feel free to add additional information below as needed.
<!-- 1 sentence summary -->

- **Location on gemini**: /path/on/gemini/ <!-- output from `pwd` -->
- **Location on lorenz**: /path/on/lorenz/ <!-- output from `pwd` -->
- **Simulation**: -
- **Region**: Global
- **Period**: YYYY - YYYY
- **Frequency**: x hours
- **Variables**: 'eastward_eulerian_current_velocity', 'northward_eulerian_current_velocity'
- **Vertical levels**: Surface
- **Grid**: x degree
- **Source**: 'http://www.example.com/data_source'
- **Documentation**: 'http://www.example.com/data_source', 'https://doi.org/example'
- **Used by**: John Doe, Jane Doe
- **Links**: 
    - https://example.com/additional-link

<!-- You can add free form notes here if you need. e.g., instructions on how to download (a script etc). -->
  2. Add the new dataset to the list above, linking to the new page.
  3. Add a README.md file in new dataset folders on Gemini and Lorenz. This README on the servers should only contain the link to the newly created wiki page. This is important to ensure that information stays up-to-date, and all information is centralised in one place.
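
The last step (a README.md that holds only the wiki link) could be scripted. A minimal sketch, assuming a hypothetical helper named `write_readme_stub`; it is not part of any existing tooling and the URL in the usage note is a placeholder:

```python
from pathlib import Path


def write_readme_stub(dataset_dir, wiki_url):
    """Write a README.md containing only the wiki link (hypothetical helper)."""
    readme = Path(dataset_dir) / "README.md"
    # The README holds nothing but the link, so the wiki page stays the
    # single place where the dataset's metadata is maintained.
    readme.write_text(wiki_url.strip() + "\n", encoding="utf-8")
    return readme
```

For example, `write_readme_stub("/path/on/lorenz/", "https://example.com/wiki/page")` would create `/path/on/lorenz/README.md` with that one line, overwriting any existing README.md in the folder.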

@erikvansebille @michaeldenes Any thoughts on how the template can be improved? Is it versatile enough? The key-value metadata section of the template is based on what we already had in the wiki pages for some CMEMS and GlobCurrent datasets.

Once these READMEs are in place on Lorenz, users can run `find . -maxdepth 4 -name README.md | xargs -I{} sh -c 'echo "{}: $(cat {})"'`, which will provide the output:

./CMEMS/GLOBAL_ANALYSIS_FORECAST_PHY_001_024_SMOC: https://github.com/.../page-on-wiki
./CMEMS/NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013: https://github.com/.../page-on-wiki
./CMEMS/NORTHWESTSHELF_ANALYSIS_FORECAST_WAV_004_014: https://github.com/.../page-on-wiki
./CESM/Hist_LR: https://github.com/.../page-on-wiki
./CESM/PI-control_LR: https://github.com/.../page-on-wiki
./CESM/iHESP_HR_CESM: https://github.com/.../page-on-wiki

which I think would be a convenient way to bridge Lorenz and GitHub while keeping all the important info on GitHub. Open to other ideas though.
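
For anyone who would rather do the same lookup from Python, here is a rough sketch. The function name `collect_readme_links` is an assumption made for illustration, and it reports the folder containing each README.md rather than the path of the file itself:

```python
import os
from pathlib import Path


def collect_readme_links(root, max_depth=4):
    """Map each dataset folder to the contents of its README.md.

    Hypothetical helper; max_depth is counted the way `find -maxdepth`
    counts it, i.e. the README.md file itself is at most max_depth levels
    below root.
    """
    root = Path(root)
    links = {}
    for dirpath, dirnames, filenames in os.walk(root):
        rel = Path(dirpath).relative_to(root)
        file_depth = len(rel.parts) + 1  # depth of files in this directory
        if file_depth >= max_depth:
            dirnames[:] = []  # files further down would exceed max_depth
        else:
            dirnames.sort()  # walk subfolders in a stable order
        if "README.md" in filenames and file_depth <= max_depth:
            folder = rel.as_posix()
            key = "." if folder == "." else "./" + folder
            text = (Path(dirpath) / "README.md").read_text(encoding="utf-8")
            links[key] = text.strip()
    return links
```

Looping over `collect_readme_links(".").items()` and printing `f"{folder}: {link}"` gives one `folder: link` line per documented dataset.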

The following is the list of datasets on Lorenz and whether they have a wiki documentation page or not:

  • CMEMS/GLOBAL_ANALYSIS_FORECAST_PHY_001_024_SMOC
  • CMEMS/NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013
  • CMEMS/NORTHWESTSHELF_ANALYSIS_FORECAST_WAV_004_014
  • [already done] CMEMS/NWSHELF_MULTIYEAR_PHY_004_009
  • [already done] CMEMS/NWSHELF_REANALYSIS_WAV_004_015
  • CESM/Hist_LR
  • CESM/PI-control_LR
  • CESM/iHESP_HR_CESM
  • ERA5 (download script is in dataset folder)
  • ESA_WorldOceanCirculation/NorthAtlantic
  • ESA_WorldOceanCirculation/TropicalAtlantic
  • FES2014Data (consolidate README.txt into the wiki entry and delete)
  • [already done] GlobCurrent/v2p0/
  • GlobalFishingWatch
  • InternalTidalMixing (consolidate README.txt into the wiki entry and delete)
  • LLC4320_Galapagos
  • MITgcm_Channel (consolidate README.md into the wiki entry and delete)
  • MOi
  • MatroosWaddenSea (consolidate README.txt into the wiki entry and delete)
  • Miron_etal_2020
  • NEMO-MEDUSA
  • NEMO16_CMCC
  • NEMO4p2_CMCC
  • NEMO_Ensemble
  • NorKyst
  • eNATL60
@VeckoTheGecko
Contributor Author

Currently I'm just wanting to finalise the template

@michaeldenes
Member

Hey @VeckoTheGecko, this looks good! 👍

I suppose the github page has a 'history' to it, so we can see the last time the details were updated?

One thing to add would be a location of the mesh file. Perhaps the variables and details of the mesh file can be separated into a different table too?

Will there be a from_modulefile() type script in each data folder?

@VeckoTheGecko
Contributor Author

I suppose the github page has a 'history' to it, so we can see the last time the details were updated?

yes, so we get that for free :)

One thing to add would be a location of the mesh file. Perhaps the variables and details of the mesh file can be separated into a different table too?

Perhaps a **Mesh file location**: field? I think going further with values might be a bit overkill? (let me know, not sure having not worked with these datasets).

Also, I guess this would only apply to hydrodynamic datasets. Is the input_data folder used for other datasets as well?

Will there be a from_modulefile() type script in each data folder?

Yes that's the plan.

@erikvansebille
Member

Thanks @VeckoTheGecko, I like the template! A few points:

  1. Also add "doi" (if available), so that users can easily know how to cite the dataset
  2. The "Used by" gets outdated quite quickly, in my experience. People who are listed leave (so can't be asked for help), and others who use the dataset don't add their name. Not sure how to resolve this
  3. Also add "used by Parcels in" or something like that, that would then be a list of GitHub repos? Or would that have the same issue that it wouldn't get updated?
  4. Since Lorenz is now our main compute platform, I would move that to the top of the template
  5. Also provide information on how to load the dataset (using from_modulefile())?

@VeckoTheGecko
Contributor Author

One thing to add would be a location of the mesh file. Perhaps the variables and details of the mesh file can be separated into a different table too?

After discussing with @michaeldenes , I think let's leave this information to the modulefile so that it's in one place. Perhaps let's have a **Has modulefile**: attribute as well.

  1. Also add "doi" (if available), so that users can easily know how to cite the dataset

Can do!

  • The "Used by" gets outdated quite quickly, in my experience. People who are listed leave (so can't be asked for help), and others who use the dataset don't add their name. Not sure how to resolve this
  • Also add "used by Parcels in" or something like that, that would then be a list of GitHub repos? Or would that have the same issue that it wouldn't get updated?

I think having a "Used by" that also promotes linking to repos would be good. I don't know how to make sure that this is up to date though. I think the "Used in repo" option is more objective/clearer than the "Used by" as users might self disqualify for the latter.

4. Since Lorenz is now our main compute platform, I would move that to the top of the template

can do!

5. Also provide information on how to load the dataset (using from_modulefile())?

Can do. I think the from_modulefile() would be the default for all these datasets?

@VeckoTheGecko
Contributor Author

This is the updated description:

<!-- 1 sentence summary -->

- **Location on lorenz**: /path/on/lorenz/ <!-- output from `pwd` -->
- **Location on gemini**: /path/on/gemini/ <!-- output from `pwd` -->
- **Simulation**: -
- **Region**: Global
- **Period**: YYYY - YYYY
- **Frequency**: x hours
- **Variables**: 'eastward_eulerian_current_velocity', 'northward_eulerian_current_velocity'
- **Vertical levels**: Surface
- **Grid**: x degree
- **Source**: 'http://www.example.com/data_source'
- **DOI**: 'https://doi.org/example'
- **Documentation**: 'http://www.example.com/data_source'
- **Used by (name and/or repo)**: John Doe (in https://github.com/.../repo), Jane Doe
- **Has Parcels modulefile**: Yes/No
- **Links**: 
    - https://example.com/additional-link

<!-- You can add free form notes here if you need. e.g., instructions on how to download (a script etc). -->
Diff

diff --git a/Available-data.md b/Available-data.md
index ee966eb..d6f276a 100644
--- a/Available-data.md
+++ b/Available-data.md
@@ -57,8 +57,8 @@ To add a new dataset, follow these steps:
 <!-- 1 sentence summary -->
 
-- **Location on gemini**: /path/on/gemini/ <!-- output from `pwd` -->
 - **Location on lorenz**: /path/on/lorenz/ <!-- output from `pwd` -->
+- **Location on gemini**: /path/on/gemini/ <!-- output from `pwd` -->
 - **Simulation**: -
 - **Region**: Global
 - **Period**: YYYY - YYYY
@@ -67,8 +67,10 @@ To add a new dataset, follow these steps:
 - **Vertical levels**: Surface
 - **Grid**: x degree
 - **Source**: 'http://www.example.com/data_source'
-- **Documentation**: 'http://www.example.com/data_source', 'https://doi.org/example'
-- **Used by**: John Doe, Jane Doe
+- **DOI**: 'https://doi.org/example'
+- **Documentation**: 'http://www.example.com/data_source'
+- **Used by (name and/or repo)**: John Doe (in https://github.com/.../repo), Jane Doe
+- **Has Parcels modulefile**: Yes/No
 - **Links**: 
     - https://example.com/additional-link
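
To keep pages consistent with the updated template, the key-value section could be generated rather than typed by hand. A minimal sketch, assuming a hypothetical `render_wiki_page` helper; only the field names come from the template in this thread:

```python
# Field names follow the updated template; the helper itself is hypothetical.
TEMPLATE_FIELDS = [
    "Location on lorenz",
    "Location on gemini",
    "Simulation",
    "Region",
    "Period",
    "Frequency",
    "Variables",
    "Vertical levels",
    "Grid",
    "Source",
    "DOI",
    "Documentation",
    "Used by (name and/or repo)",
    "Has Parcels modulefile",
]


def render_wiki_page(summary, metadata, links=()):
    """Render a wiki page body from a dict of template fields.

    Fields missing from `metadata` are rendered as "?" (unknown); pass "-"
    explicitly for fields that are not relevant.
    """
    unexpected = set(metadata) - set(TEMPLATE_FIELDS)
    if unexpected:
        raise KeyError(f"Fields not in the template: {sorted(unexpected)}")
    lines = [summary, ""]
    for field in TEMPLATE_FIELDS:
        lines.append(f"- **{field}**: {metadata.get(field, '?')}")
    lines.append("- **Links**:")
    lines.extend(f"    - {url}" for url in links)
    return "\n".join(lines) + "\n"
```

Rejecting field names that are not in the template is a design choice here: it catches typos such as "Location on Lorenz" before they end up on the wiki.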
 

@erikvansebille
Member

LGTM! Only change: perhaps instead of a boolean "Has Parcels modulefile", give the /path/name to that file?

@michaeldenes
Member

  • The "Used by" gets outdated quite quickly, in my experience. People who are listed leave (so can't be asked for help), and others who use the dataset don't add their name. Not sure how to resolve this
  • Also add "used by Parcels in" or something like that, that would then be a list of GitHub repos? Or would that have the same issue that it wouldn't get updated?

I think having a "Used by" that also promotes linking to repos would be good. I don't know how to make sure that this is up to date though. I think the "Used in repo" option is more objective/clearer than the "Used by" as users might self disqualify for the latter.

I think this is a much nicer solution. A link to the repos that used said dataset (ideally in a published paper!) will show new users how to load and utilise such data. Good idea!

@VeckoTheGecko
Contributor Author

LGTM! Only change: perhaps instead of a boolean "Has Parcels modulefile", give the /path/name to that file?

Wouldn't the path always be at the lorenz folder path? (i.e., **Has Parcels modulefile (in Lorenz folder)**: Yes/No) Also, there is the difficulty that the module file will only be on Lorenz so it's not as useful to those not in the team. Not sure if there's anything we can do about that.

@erikvansebille
Member

Wouldn't the path always be at the lorenz folder path? (i.e., **Has Parcels modulefile (in Lorenz folder)**: Yes/No)

Yep indeed; let's only focus on Lorenz

Also, there is the difficulty that the module file will only be on Lorenz so it's not as useful to those not in the team. Not sure if there's anything we can do about that.

Well, this is the UtrechtTeam wiki, so (while it's public) I'd say the focus is on our team anyways

@sruehs

sruehs commented Dec 19, 2024

I also like the new template for the wiki page a lot and have a few additional suggestions:

  • The point "simulation" could be expanded. Normally, if we use model data, we need to know the model, the configuration, and the simulation (some of the configurations in the list actually have different simulations in sub-folders). Maybe adjust the template accordingly, so that this information is immediately available to the user?
  • To allow for specifications for a wider range of spatial and temporal output, maybe change (i) "frequency" to "temporal resolution" (and include info on whether the data contains snapshots or temporal averages), (ii) "grid" to "horizontal grid" (including information on grid type and resolution), and (iii) "vertical levels" to "vertical grid" (again containing information on the grid type and the resolution)


4 participants