Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Mini Well #15

Open
MilesCranmer opened this issue Dec 14, 2024 · 5 comments
Open

Add Mini Well #15

MilesCranmer opened this issue Dec 14, 2024 · 5 comments
Assignees

Comments

@MilesCranmer
Copy link
Member

This is a to-do item to add the "mini well" dataset. It's already generated on my user account; we just need to put it somewhere it can be downloaded.

Q: @rubenohana @LTMeyer @mikemccabe210 do you have a preferred API design for this? Perhaps the easiest is:

the-well-download --mini --base-path path/to/base --dataset active_matter --split train

where the --mini flag would enable the mini versions instead.

I suppose it should also be available on HF datasets since it's only 5 GB total?

@MilesCranmer MilesCranmer self-assigned this Dec 14, 2024
@mikemccabe210
Copy link
Contributor

mikemccabe210 commented Dec 15, 2024

That seems reasonable - I think we'd want it to function like a secondary Well path so if the mini flag is the only difference that seems like a good option. Maybe we'd want to add a "are you sure?" prompt with an explanation in case people accidentally elect to overwrite the full path which would be pretty painful.

Although then maybe the better option would be to inherently write to a "mini" directory so that someone would really have to work to overwrite the full well.

@MilesCranmer
Copy link
Member Author

Although then maybe the better option would be to inherently write to a "mini" directory so that someone would really have to work to overwrite the full well.

That sounds like a good idea! Like we just modify path/to/base to path/to/base/mini inside the script (and maybe print a message to the user)

@MilesCranmer
Copy link
Member Author

Actually maybe something better is we just modify all the dataset names? Like active_matter becomes mini_active_matter and so on. Then you can use the same base path and just pass mini=True to the data loader and it will switch.

@mikemccabe210
Copy link
Contributor

Yeah, I like that actually. We'd just want to make sure anything that's parsing the registry would know the difference, but that might be the safest option and also help avoid mistakes where someone is benchmarking on mini and thinks they're on full.

@LTMeyer
Copy link
Collaborator

LTMeyer commented Dec 16, 2024

I'm in favor of adding mini prefix to the dataset rather than overriding writing location. I trust user to know where they want to write their data.

  • We can update the registry for it to list also the mini datasets;
  • We can update the downloading script for it to download the mini datasets if the --mini option is provided, otherwise by default it downloads the full datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants