Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added class gzFileLoader to read models in gzipped files #1403

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

lluisp
Copy link

@lluisp lluisp commented Jun 21, 2018

I added to io.h/io.cc a class gzFileLoader that behaves as textFileLoader but on gzipped model files.
The class uses boost::iostreams

It is not possible to perform tellg and seekg on gziped files, so I had to create a new class (that just reads and ignores the non-relevant sections).
If using tellg and seekg is not a must, both classes could be merged in a single one, since the only difference would be how the stream is opened. (I understand that skipping all the non-needed sections is slower, but loading a model is slower anyway so maybe it is not a big deal).
A single class could just open the stream depending on the file type, and then read the model transparently, regardless of whether it is gzipped or not. (or the derived classes could just open the stream, and all the loading work could be done in the parent class, avoindin code duplication)
Also, since boost::iostreams has other filters apart from gzip, this should allow to easily support other compression formats for model files.

The same idea could be used to textModelSaver to save directly in compressed formats.

Let me know if you are interested in developing this further, and how.
I'll be glad to contribute.

@neubig
Copy link
Contributor

neubig commented Jun 22, 2018

Thanks for the contribution! For a while now DyNet has not relied on boost for any of its core functionality, as boost has been a source of compile problems, etc.

Because the DyNet model format is actually quite simple, I think it might just be better to create a binary file format if you want faster saving/loading.

@lluisp
Copy link
Author

lluisp commented Jun 22, 2018

I saw that it already depends on boost (mp.h uses boost algorithm and boost-interprocess), so I though one more library would not make a difference.

My concern was not on faster loading, but on saving disk space and bandwidth (which is relevant when you want to distribute the models inside some other software, and don't want to lose users for being too space-greedy ;)
In my experience, binary formats are only slightly faster, but can not be inspected, while gziped text files are a bit smaller, not much slower, and can be inspected with zless or the like.

In any case, I could make this feature conditional, so it is only built if "ENABLE_BOOST" is set (the same that happens now with mp.cc). Would that suit you?

@pmichel31415
Copy link
Collaborator

Hi @lluisp thanks for the pull request! Sorry for the late reply. I think your suggestion of only building this if ENABLE_BOOST is set is a good compromise.This might be related to the CI errors as well.

@lluisp
Copy link
Author

lluisp commented Sep 9, 2018

great! It is already fixed in my fork.
Yes, CI fails because boost_iostreams is not installed on the environment.

@pmichel31415
Copy link
Collaborator

OK I think I figured out how this can be fixed:

  • For travis: add libboost-iostreams1.55-dev to this line
  • For appveyor (this was harder to spot): move (or copy?) this line up to after this line

Would you mind adding this? I think we'll be good to merge after that once the tests pass.

@lluisp
Copy link
Author

lluisp commented Sep 15, 2018

Cool, I fixed that. It is compiling now, let's see how it goes.
Thanks!

@lluisp
Copy link
Author

lluisp commented Sep 17, 2018

It seems that appveyor can not find boost::iostreams yet.

@pmichel31415
Copy link
Collaborator

Damn this is annoying. My last bet is to put the - cmd: set PATH=%BOOST_LIBRARYDIR%;%PATH% line right after the build: at line 48. If you don't have time to play around I can probably do it if you give me push access to this branch: https://help.github.com/articles/committing-changes-to-a-pull-request-branch-created-from-a-fork/

@lluisp
Copy link
Author

lluisp commented Sep 17, 2018 via email

@pmichel31415
Copy link
Collaborator

OK thanks I will take a look

@kwalcock
Copy link
Contributor

You may or may not want to look at https://github.com/kwalcock/dynet/blob/kwalcock-fatdynet/contrib/clulab/src/main/cpp/clulab-zip.cc. This implementation is based on zlib. There is additional access to the functionality through https://github.com/clulab/fatdynet.

@lluisp
Copy link
Author

lluisp commented Jan 28, 2021

Cool!
is it going to be in master at some moment ?

@kwalcock
Copy link
Contributor

I don't know about master potential. That's probably up to clab and it would take some coordination to get it integrated completely. We've been using the code for about a year and a half in order to access the DyNet models through jar files and it seems to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants