Documenting data on Lorenz #5

VeckoTheGecko · 2024-12-03T14:46:47Z

Most of the datasets on Lorenz are undocumented. The ones that do have documentation either:

- have a README.txt or README.md file in the folder, or
- have a page on the wiki.

It would be good to have the metadata centralised on the wiki so it's all in one place, where anyone can view it and those in the Parcels team can edit it.

Currently these are the instructions I have for adding a new dataset:

Details

To add a new dataset, follow these steps:

Create a new page with a suitable title, and fill out the template below. If a field isn't relevant or is unknown, you can put - or ? respectively. Feel free to add additional information below as needed.

<!-- 1 sentence summary -->

- **Location on gemini**: /path/on/gemini/ <!-- output from `pwd` -->
- **Location on lorenz**: /path/on/lorenz/ <!-- output from `pwd` -->
- **Simulation**: -
- **Region**: Global
- **Period**: YYYY - YYYY
- **Frequency**: x hours
- **Variables**: 'eastward_eulerian_current_velocity', 'northward_eulerian_current_velocity'
- **Vertical levels**: Surface
- **Grid**: x degree
- **Source**: 'http://www.example.com/data_source'
- **Documentation**: 'http://www.example.com/data_source', 'https://doi.org/example'
- **Used by**: John Doe, Jane Doe
- **Links**: 
    - https://example.com/additional-link

<!-- You can add free form notes here if you need. e.g., instructions on how to download (a script etc). -->

Add the new dataset to the list above, linking to the new page.
Add a README.md file to the new dataset's folders on Gemini and Lorenz. This README on the servers should contain only the link to the newly created wiki page. This is important to ensure that the information stays up to date and is centralised in one place.
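The README step could be scripted so that the file on the server never holds anything but the link. The following is only a minimal sketch: the function name `write_link_readme` is hypothetical, and the path and URL passed to it are whatever applies to the dataset in question.

```python
from pathlib import Path


def write_link_readme(dataset_dir, wiki_url):
    """Write a README.md in `dataset_dir` that contains only the wiki link.

    Hypothetical helper, not part of any existing tooling.
    """
    readme = Path(dataset_dir) / "README.md"
    # One line only: the link to the wiki page, followed by a newline.
    readme.write_text(wiki_url.strip() + "\n")
    return readme
```

Running it once in the Lorenz folder and once in the Gemini folder with the same wiki URL would keep the two READMEs identical.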

@erikvansebille @michaeldenes Any thoughts on how the template can be improved? Is it versatile enough? The key-value metadata section of the template is based on what we already had in the wiki pages for some CMEMS and GlobCurrent datasets.

Once these READMEs are in place on Lorenz, users can run `find . -maxdepth 4 -name README.md | xargs -I{} sh -c 'echo "{}: $(cat {})"'`, which will provide the output:

./CMEMS/GLOBAL_ANALYSIS_FORECAST_PHY_001_024_SMOC: https://github.com/.../page-on-wiki
./CMEMS/NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013: https://github.com/.../page-on-wiki
./CMEMS/NORTHWESTSHELF_ANALYSIS_FORECAST_WAV_004_014: https://github.com/.../page-on-wiki
./CESM/Hist_LR: https://github.com/.../page-on-wiki
./CESM/PI-control_LR: https://github.com/.../page-on-wiki
./CESM/iHESP_HR_CESM: https://github.com/.../page-on-wiki

which I think would be a convenient way to bridge Lorenz and GitHub while keeping all the important info on GitHub. Open to other ideas though.
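For anyone who would rather do the same lookup from Python, here is a minimal sketch of the idea. It assumes each README.md contains only the wiki link, as proposed above; the function name `list_dataset_links` is hypothetical. Unlike the `find` one-liner, which prints the path of each README.md file, this sketch keys the result by dataset folder.

```python
from pathlib import Path


def list_dataset_links(root, max_depth=4):
    """Map each dataset folder under `root` to the link stored in its README.md.

    Hypothetical helper; assumes every README.md holds just one wiki link.
    """
    root = Path(root)
    links = {}
    for readme in sorted(root.rglob("README.md")):
        rel = readme.relative_to(root)
        # len(rel.parts) counts the folders plus the README itself,
        # which corresponds to the depth used by `find -maxdepth`.
        if len(rel.parts) > max_depth:
            continue
        links[rel.parent.as_posix()] = readme.read_text().strip()
    return links


# Example usage from the data root:
# for folder, link in list_dataset_links(".").items():
#     print(f"./{folder}: {link}")
```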

The following is the list of datasets on Lorenz and whether they have a wiki documentation page or not:


VeckoTheGecko · 2024-12-03T14:48:44Z

For now, I just want to finalise the template.

michaeldenes · 2024-12-03T15:26:06Z

Hey @VeckoTheGecko, this looks good! 👍

I suppose the github page has a 'history' to it, so we can see the last time the details were updated?

One thing to add would be a location of the mesh file. Perhaps the variables and details of the mesh file can be separated into a different table too?

Will there be a from_modulefile() type script in each data folder?

VeckoTheGecko · 2024-12-03T16:15:47Z

> I suppose the github page has a 'history' to it, so we can see the last time the details were updated?

Yes, so we get that for free :)

> One thing to add would be a location of the mesh file. Perhaps the variables and details of the mesh file can be separated into a different table too?

Perhaps a **Mesh file location**: field? I think going further with values might be a bit overkill (let me know; I'm not sure, having not worked with these datasets).

Also, I guess this would only apply to hydrodynamic datasets. Is the input_data folder used for other datasets as well?

> Will there be a from_modulefile() type script in each data folder?

Yes, that's the plan.

erikvansebille · 2024-12-04T07:07:27Z

Thanks @VeckoTheGecko, I like the template! A few points:

1. Also add "doi" (if available), so that users can easily know how to cite the dataset
2. The "Used by" gets outdated quite quickly, in my experience. People who are listed leave (so can't be asked for help), and others who use the dataset don't add their name. Not sure how to resolve this
3. Also add "used by Parcels in" or something like that, that would then be a list of GitHub repos? Or would that have the same issue that it wouldn't get updated?
4. Since Lorenz is now our main compute platform, I would move that to the top of the template
5. Also provide information on how to load the dataset (using from_modulefile())?

VeckoTheGecko · 2024-12-04T11:28:12Z

> One thing to add would be a location of the mesh file. Perhaps the variables and details of the mesh file can be separated into a different table too?

After discussing with @michaeldenes, I think we should leave this information to the modulefile so that it's in one place. Perhaps let's have a **Has modulefile**: attribute as well.

> Also add "doi" (if available), so that users can easily know how to cite the dataset

Can do!

> The "Used by" gets outdated quite quickly, in my experience. People who are listed leave (so can't be asked for help), and others who use the dataset don't add their name. Not sure how to resolve this

> Also add "used by Parcels in" or something like that, that would then be a list of GitHub repos? Or would that have the same issue that it wouldn't get updated?

I think having a "Used by" that also promotes linking to repos would be good. I don't know how to make sure that this is up to date though. I think the "Used in repo" option is more objective/clearer than the "Used by", as users might self-disqualify for the latter.

> 4. Since Lorenz is now our main compute platform, I would move that to the top of the template

Can do!

> 5. Also provide information on how to load the dataset (using from_modulefile())?

Can do. I think the from_modulefile() would be the default for all these datasets?

VeckoTheGecko · 2024-12-04T12:23:31Z

This is the updated description:

<!-- 1 sentence summary -->

- **Location on lorenz**: /path/on/lorenz/ <!-- output from `pwd` -->
- **Location on gemini**: /path/on/gemini/ <!-- output from `pwd` -->
- **Simulation**: -
- **Region**: Global
- **Period**: YYYY - YYYY
- **Frequency**: x hours
- **Variables**: 'eastward_eulerian_current_velocity', 'northward_eulerian_current_velocity'
- **Vertical levels**: Surface
- **Grid**: x degree
- **Source**: 'http://www.example.com/data_source'
- **DOI**: 'https://doi.org/example'
- **Documentation**: 'http://www.example.com/data_source'
- **Used by (name and/or repo)**: John Doe (in https://github.com/.../repo), Jane Doe
- **Has Parcels modulefile**: Yes/No
- **Links**: 
    - https://example.com/additional-link

<!-- You can add free form notes here if you need. e.g., instructions on how to download (a script etc). -->

Diff

diff --git a/Available-data.md b/Available-data.md
index ee966eb..d6f276a 100644
--- a/Available-data.md
+++ b/Available-data.md
@@ -57,8 +57,8 @@ To add a new dataset, follow these steps:
 <!-- 1 sentence summary -->
 
-- **Location on gemini**: /path/on/gemini/ <!-- output from `pwd` -->
 - **Location on lorenz**: /path/on/lorenz/ <!-- output from `pwd` -->
+- **Location on gemini**: /path/on/gemini/ <!-- output from `pwd` -->
 - **Simulation**: -
 - **Region**: Global
 - **Period**: YYYY - YYYY
@@ -67,8 +67,10 @@ To add a new dataset, follow these steps:
 - **Vertical levels**: Surface
 - **Grid**: x degree
 - **Source**: 'http://www.example.com/data_source'
-- **Documentation**: 'http://www.example.com/data_source', 'https://doi.org/example'
-- **Used by**: John Doe, Jane Doe
+- **DOI**: 'https://doi.org/example'
+- **Documentation**: 'http://www.example.com/data_source'
+- **Used by (name and/or repo)**: John Doe (in https://github.com/.../repo), Jane Doe
+- **Has Parcels modulefile**: Yes/No
 - **Links**: 
     - https://example.com/additional-link
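A quick way to check that a wiki page follows the updated template is to scan it for the bold field names. The sketch below is illustrative only: the field list is copied from the template as it stands in this comment (it would need adjusting if the template changes), and `missing_fields` is a hypothetical helper, not existing tooling.

```python
import re

# Field names taken from the updated template in this comment.
REQUIRED_FIELDS = [
    "Location on lorenz",
    "Location on gemini",
    "Simulation",
    "Region",
    "Period",
    "Frequency",
    "Variables",
    "Vertical levels",
    "Grid",
    "Source",
    "DOI",
    "Documentation",
    "Used by (name and/or repo)",
    "Has Parcels modulefile",
    "Links",
]

# Matches top-level lines of the form "- **Field name**: value".
FIELD_PATTERN = re.compile(r"^- \*\*([^*]+)\*\*:", re.MULTILINE)


def missing_fields(page_text):
    """Return the template fields that do not appear in `page_text`."""
    found = {match.group(1).strip() for match in FIELD_PATTERN.finditer(page_text)}
    return [field for field in REQUIRED_FIELDS if field not in found]
```

An empty list means every field is present; a value of `-` or `?` still counts as present, in line with the instructions for irrelevant or unknown fields.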

erikvansebille · 2024-12-04T13:11:56Z

LGTM! The only change: perhaps, instead of a boolean "Has Parcels modulefile", give the /path/name to that file?

michaeldenes · 2024-12-04T13:46:14Z

> The "Used by" gets outdated quite quickly, in my experience. People who are listed leave (so can't be asked for help), and others who use the dataset don't add their name. Not sure how to resolve this

> Also add "used by Parcels in" or something like that, that would then be a list of GitHub repos? Or would that have the same issue that it wouldn't get updated?

> I think having a "Used by" that also promotes linking to repos would be good. I don't know how to make sure that this is up to date though. I think the "Used in repo" option is more objective/clearer than the "Used by" as users might self disqualify for the latter.

I think this is a much nicer solution. A link to the repos that used said dataset (ideally in a published paper!) will show new users how to load and utilise such data. Good idea!

VeckoTheGecko · 2024-12-04T14:03:39Z

> LGTM! Only change perhaps instead of a boolean "Has Parcels module file", the /path/name to that file?

Wouldn't the path always be at the Lorenz folder path? (i.e., **Has Parcels modulefile (in Lorenz folder)**: Yes/No) Also, there is the difficulty that the module file will only be on Lorenz, so it's not as useful to those not in the team. Not sure if there's anything we can do about that.

erikvansebille · 2024-12-05T07:07:20Z

> Wouldn't the path always be at the lorenz folder path? (i.e., **Has Parcels modulefile (in Lorenz folder)**: Yes/No)

Yep indeed; let's only focus on Lorenz

> Also, there is the difficulty that the module file will only be on Lorenz so it's not as useful to those not in the team. Not sure if there's anything we can do about that.

Well, this is the UtrechtTeam wiki, so (while it's public) I'd say the focus is on our team anyway

sruehs · 2024-12-19T13:09:51Z

I also like the new template for the wiki page a lot and have a few additional suggestions:

- The point "simulation" could be expanded. Normally, if we use model data, we need to know the model, the configuration, and the simulation (some of the configurations in the list actually have different simulations in sub-folders). Maybe adjust the template accordingly, so that this information is immediately available to the user?
- To allow for specifications for a wider range of spatial and temporal output, maybe change (i) "frequency" to "temporal resolution" (and include info on whether the data contains snapshots or temporal averages), (ii) "grid" to "horizontal grid" (including information on grid type and resolution), and (iii) "vertical levels" to "vertical grid" (again containing information on the grid type and the resolution)
