
Compress final output files #139

Open · wants to merge 2 commits into main
Conversation

@klamike (Collaborator) commented Oct 24, 2024

  • Uses the deflate/zlib integration in HDF5.jl to compress the datasets in the final output files. The highest compression level is enabled by default (see the sketch after this list).
  • Uses gzip compression via OPFGenerator.save_json by default for the reference case.json.gz file.
  • Run some benchmarks with various compression options
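
For reference, a minimal sketch of what the HDF5 side looks like with HDF5.jl's deflate filter. The file name, dataset name, and chunk shape below are illustrative only, not the actual output layout:

```julia
using HDF5

# Minimal sketch of writing a deflate-compressed dataset with HDF5.jl.
# "example.h5" and "pg" are hypothetical names; a real chunk shape would be
# chosen to match the access pattern rather than the whole array.
function save_compressed(path::AbstractString, data::Matrix{Float64}; level::Int=9)
    h5open(path, "w") do fid
        # The deflate filter requires chunked storage, so a chunk shape is given.
        dset = create_dataset(fid, "pg", Float64, size(data);
                              chunk=size(data), deflate=level)
        write(dset, data)
    end
    return path
end

save_compressed("example.h5", rand(89, 1024))
```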

@klamike requested a review from mtanneau on October 24, 2024 at 20:02
@mtanneau (Contributor)

Do you have a sense of how much space we save with this?


codecov bot commented Oct 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

@klamike (Collaborator, Author) commented Oct 25, 2024

Unfortunately the current approach does not work for the Vector{String} datasets (termination/result status codes). Since these are really enums, we can store them as integers instead, and then compression works well. We can store the mapping from Integer(instance) to String(instance) in the HDF5 dataset's attributes for readers to be able to convert back even if MOI changes the mapping. We also currently store the formulation name for each sample, which can be dropped.
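
Roughly what I have in mind (a sketch assuming HDF5.jl and MathOptInterface; the dataset and attribute names here are illustrative, not what the PR actually writes):

```julia
using HDF5
import MathOptInterface as MOI

# Store termination statuses as integer enum codes (which compress well), and
# record the code -> string mapping as attributes so readers can decode the
# values even if MOI renumbers the enum in a future release.
function write_status_codes(fid, statuses::Vector{MOI.TerminationStatusCode})
    codes = Int.(statuses)
    dset = create_dataset(fid, "termination_status", Int, size(codes);
                          chunk=size(codes), deflate=9)
    write(dset, codes)
    all_statuses = collect(instances(MOI.TerminationStatusCode))
    attrs(dset)["status_codes"]   = Int.(all_statuses)       # 0, 1, 2, ...
    attrs(dset)["status_strings"] = string.(all_statuses)    # "OPTIMIZE_NOT_CALLED", ...
    return dset
end
```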

Storing a Vector{Enum} feels like something HDF5 would support natively, so I looked into it... there does seem to be a relevant datatype in HDF5 itself, but neither the Julia nor the Python interface has nice support for it.

With the enum -> Integer change and level 9 compression, an 89_pegase dataset gets about 30% smaller.

@mtanneau (Contributor)

> Unfortunately the current approach does not work for the Vector{String} datasets (termination/result status codes).

Is it (technically) possible to compress only the numerical datasets? E.g., doing an eltype check and compressing only if the element type is numerical.
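
Something like the following sketch, perhaps (the helper name is hypothetical):

```julia
using HDF5

# Compress only numeric datasets; fall back to plain storage otherwise
# (e.g. for Vector{String} status codes).
function write_maybe_compressed(fid, name::String, data::AbstractArray; level::Int=9)
    if eltype(data) <: Number
        dset = create_dataset(fid, name, eltype(data), size(data);
                              chunk=size(data), deflate=level)
        write(dset, data)
    else
        fid[name] = data   # uncompressed write for non-numeric element types
    end
    return nothing
end
```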

> We can store the mapping from Integer(instance) to String(instance) in the HDF5 dataset's attributes for readers to be able to convert back even if MOI changes the mapping.

I'm not against storing the integer codes instead of the strings; it will also save a (tiny) bit of space. We can ask the JuMP devs whether changing the integer codes of an enum would be considered a breaking change.

> With the enum -> Integer change and level 9 compression, an 89_pegase dataset gets about 30% smaller.

That's not bad! For the record: this 30% should be compared with (and combined with) the savings from merging some fields (e.g., merging the duals of lower/upper bound constraints).
