Support data-directory for data-files #210

Open
mabruzzo opened this issue Jun 12, 2024 · 0 comments

@mabruzzo
Collaborator

Motivations

During initialization, the caller needs to specify the path to a data file. This can be tedious and annoying, especially during software testing of pygrackle or any downstream simulation code.

Description

Proposal: Introduce a feature that lets users tell grackle to search a data-directory for these data files. For concreteness:

  • maybe we default to ~/.grackle/tables
  • maybe we allow people to override this choice with the environment variable GRACKLE_DATA_HOME
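
As a rough illustration of the lookup proposed above, here is a minimal sketch in Python. The function name get_data_dir and its environ argument are hypothetical (nothing like this exists in grackle or pygrackle yet); only the default location ~/.grackle/tables and the variable GRACKLE_DATA_HOME come from this proposal. Treating an empty variable as unset is an extra assumption.

```python
import os
from pathlib import Path


def get_data_dir(environ=None):
    """Return the directory that would be searched for data files.

    Hypothetical helper, not part of grackle or pygrackle.
    `environ` exists only so the lookup can be tested without
    touching the real process environment.
    """
    env = os.environ if environ is None else environ
    override = env.get("GRACKLE_DATA_HOME")
    if override:  # assumption: an empty value counts as "not set"
        return Path(override).expanduser()
    # proposed default location
    return Path.home() / ".grackle" / "tables"
```

Keeping the lookup in one small function would make it easy to change the default later (for example, to follow platform conventions) without touching callers.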

We have 2 options with this feature:

  1. Make this feature exclusive to pygrackle (or at least start out that way).
    • To support installing pygrackle from PyPI (or conda), we should realistically provide some kind of routine to download the data files to some directory.
    • While we could have the function download the data to an arbitrary, user-defined location, I think it would be more ergonomic to support the option of writing it to a data-directory that pygrackle knows how to check.
  2. Support this feature in both grackle and pygrackle.
    • To support it in grackle, we might just add a parameter called data_file_in_data_dir with a default value of 0. With a value of 0, we maintain the existing behavior. But if the parameter has a value of 1, then we treat grackle_data_file as a relative path with respect to the data-directory.
    • In this case, we would need to support "installation" of data-files to the data-directory at build-time and/or provide a vanilla python script that can be used for this purpose (see footnote 1).
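
To make the semantics of the proposed data_file_in_data_dir parameter concrete, here is a minimal sketch written in Python for brevity (in grackle itself this logic would live in the C layer). The function resolve_data_file and its data_dir argument are hypothetical. Rejecting absolute paths when the flag is 1 is an assumption on my part, not something decided in this proposal.

```python
from pathlib import Path


def resolve_data_file(grackle_data_file, data_file_in_data_dir, data_dir):
    """Return the path that would actually be opened.

    Hypothetical helper illustrating the proposed semantics:
      - flag == 0: existing behavior, the path is used as given
      - flag == 1: the path is taken relative to the data-directory
    """
    path = Path(grackle_data_file)
    if data_file_in_data_dir == 0:
        return path
    if data_file_in_data_dir != 1:
        raise ValueError("data_file_in_data_dir must be 0 or 1")
    if path.is_absolute():
        # assumption: an absolute path contradicts "relative to data_dir"
        raise ValueError("grackle_data_file must be a relative path")
    return Path(data_dir) / path
```

For instance, resolve_data_file("example_table.h5", 1, "/opt/tables") would yield /opt/tables/example_table.h5, where both the file name and the directory are placeholders.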

Considerations

I think this feature could significantly improve grackle's ergonomics. But there are 2 key considerations:

  1. Policies of datafile-versioning and compatibility of grackle versions. Grackle's data files have been very stable over the years, so I'm not terribly worried. But, it's worth considering.

    • For example, what do we do if we want to update an existing datafile? (Say we uncover a bug or we want to modify the format for a newer version of grackle)
    • Will we replace the file? Or will we replace the file and retain the old-version with a different name? Or will we introduce the new version with a new name?
    • I think we already adopt the last option -- which avoids most issues
  2. What would be our policy for user-defined datafiles? Namely, what happens if our python script for downloading/installing datafiles to the data-directory encounters a name-collision?

    • I know that such data files are currently rare. But they do exist. And we definitely don't want to destructively overwrite someone's work.
    • Do we simply forbid placement of custom datafiles into this directory?
    • Or do we maybe promise to never install a datafile with the prefix "user"?
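
One possible shape for a non-destructive install step is sketched below, using only the Python standard library. The function install_data_file is hypothetical. The sketch combines two of the options above (refusing to overwrite an existing file, and never installing a file whose name starts with "user"); neither is a settled policy.

```python
import shutil
from pathlib import Path

# Prefix we would promise never to install (one option from the list above).
RESERVED_USER_PREFIX = "user"


def install_data_file(src, data_dir):
    """Copy `src` into `data_dir` without ever overwriting a file.

    Hypothetical helper. Raises ValueError for names with the reserved
    prefix and FileExistsError on a name collision.
    """
    src = Path(src)
    if src.name.startswith(RESERVED_USER_PREFIX):
        raise ValueError(f"refusing to install reserved name: {src.name}")
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    dest = data_dir / src.name
    # Mode "xb" creates the file exclusively: the open fails with
    # FileExistsError if dest already exists, so nothing is clobbered.
    with open(src, "rb") as fin, open(dest, "xb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Exclusive creation avoids a separate "check, then write" step, so a file that appears between the check and the copy still cannot be overwritten.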

Of course, there is also the question of whether the maintenance burden is worthwhile.

Feedback

I would be greatly appreciative of any amount of feedback! (Especially on whether to just support this in pygrackle or also in grackle)

Footnotes

  1. To avoid maintaining logic in 2 places (that needs to be consistent), this could be the same python file that is shipped as a part of pygrackle (but it probably needs to be vanilla python without any external dependencies).
