Add Mini Well #15

MilesCranmer · 2024-12-14T07:19:50Z

This is a to-do item to add the "mini well" dataset. It's already generated on my user account; we just need to put it somewhere it can be downloaded.

Q: @rubenohana @LTMeyer @mikemccabe210 do you have a preferred API design for this? Perhaps the easiest is:

the-well-download --mini --base-path path/to/base --dataset active_matter --split train

where the --mini flag would enable the mini versions instead.

I suppose it should also be available on HF datasets since it's only 5 GB total?

The text was updated successfully, but these errors were encountered:

mikemccabe210 · 2024-12-15T21:38:24Z

That seems reasonable - I think we'd want it to function like a secondary Well path so if the mini flag is the only difference that seems like a good option. Maybe we'd want to add a "are you sure?" prompt with an explanation in case people accidentally elect to overwrite the full path which would be pretty painful.

Although then maybe the better option would be to inherently write to a "mini" directory so that someone would really have to work to overwrite the full well.

MilesCranmer · 2024-12-15T22:22:27Z

Although then maybe the better option would be to inherently write to a "mini" directory so that someone would really have to work to overwrite the full well.

That sounds like a good idea! Like we just modify path/to/base to path/to/base/mini inside the script (and maybe print a message to the user)

MilesCranmer · 2024-12-15T22:23:08Z

Actually maybe something better is we just modify all the dataset names? Like active_matter becomes mini_active_matter and so on. Then you can use the same base path and just pass mini=True to the data loader and it will switch.

mikemccabe210 · 2024-12-15T22:40:17Z

Yeah, I like that actually. We'd just want to make sure anything that's parsing the registry would know the difference, but that might be the safest option and also help avoid mistakes where someone is benchmarking on mini and thinks they're on full.

LTMeyer · 2024-12-16T13:02:56Z

I'm in favor of adding mini prefix to the dataset rather than overriding writing location. I trust user to know where they want to write their data.

We can update the registry for it to list also the mini datasets;
We can update the downloading script for it to download the mini datasets if the --mini option is provided, otherwise by default it downloads the full datasets.

MilesCranmer self-assigned this Dec 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Mini Well #15

Add Mini Well #15

MilesCranmer commented Dec 14, 2024

mikemccabe210 commented Dec 15, 2024 •

edited

Loading

MilesCranmer commented Dec 15, 2024

MilesCranmer commented Dec 15, 2024

mikemccabe210 commented Dec 15, 2024

LTMeyer commented Dec 16, 2024

Add Mini Well #15

Add Mini Well #15

Comments

MilesCranmer commented Dec 14, 2024

mikemccabe210 commented Dec 15, 2024 • edited Loading

MilesCranmer commented Dec 15, 2024

MilesCranmer commented Dec 15, 2024

mikemccabe210 commented Dec 15, 2024

LTMeyer commented Dec 16, 2024

mikemccabe210 commented Dec 15, 2024 •

edited

Loading