Adding zstd compression and decompression support for workload corpora #410

beaioun · 2023-11-08T09:24:55Z

Description

This PR adds support for zstd compression and decompression of workload corpora.

Issues Resolved

This PR aims to resolve issue #385.

Testing

The new functionality includes tests.

REQUEST FOR HELP:
I need help understanding compression in the io module. I have not found where the io.compress function is referenced, so I am not sure how zstd compression will work in this case. For now, I have included compress_zstd as a separate function to be called in actual use cases.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

IanHoang · 2023-11-09T16:42:18Z

osbenchmark/utils/io.py

@@ -344,6 +370,18 @@ def _do_decompress_manually_with_lib(target_directory, filename, compressed_file
        compressed_file.close()


+def _do_decompress_zstd(target_directory, filename):


Instead of creating a separate function for this, is it possible to implement this through _do_decompress_manually, as is done for other decompressors like bz2? This would remove boilerplate code.

Yes, I'll work on that, thanks.
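The review suggestion above (one shared code path for library-based decompressors instead of a function per format) could be sketched roughly as follows. This is a minimal illustration, not the code in osbenchmark/utils/io.py: the names `OPENERS`, `_open_zstd`, `decompress_with_lib`, and `decompress` are hypothetical, and the zstd branch assumes the third-party `zstandard` package (0.15 or newer, which provides `zstandard.open`) is installed.

```python
import bz2
import gzip
import os


def _open_zstd(path):
    # Assumption: the third-party `zstandard` package (0.15 or newer) is
    # installed. It is imported lazily so the other formats work without it.
    import zstandard
    return zstandard.open(path, "rb")


# Hypothetical registry: file extension -> callable returning a readable,
# file-like object that yields decompressed bytes.
OPENERS = {
    ".bz2": bz2.open,
    ".gz": gzip.open,
    ".zst": _open_zstd,
}


def decompress_with_lib(target_directory, filename, compressed_file, chunk_size=100 * 1024):
    """Stream an already-opened decompressor object into target_directory."""
    os.makedirs(target_directory, exist_ok=True)
    # Strip only the compression suffix, e.g. "docs.json.bz2" -> "docs.json".
    output_name = os.path.splitext(os.path.basename(filename))[0]
    output_path = os.path.join(target_directory, output_name)
    try:
        with open(output_path, "wb") as new_file:
            # Read fixed-size chunks so large corpora are never fully in memory.
            for data in iter(lambda: compressed_file.read(chunk_size), b""):
                new_file.write(data)
    finally:
        compressed_file.close()
    return output_path


def decompress(path, target_directory):
    """Pick an opener by extension and delegate to the shared copy loop."""
    extension = os.path.splitext(path)[1]
    if extension not in OPENERS:
        raise ValueError(f"Unsupported compression format: {extension!r}")
    return decompress_with_lib(target_directory, path, OPENERS[extension](path))
```

With this layout, supporting another format only means adding one entry to the registry; the copy loop and the cleanup logic stay shared.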

IanHoang · 2023-11-09T16:49:07Z

@beaioun You are correct, there is no io.compress(). As of now, compression is only used in metrics.py and corpus.py, and both call the zlib and bz2 libraries directly. Adding a separate compression function to io.py like you did doesn't hurt. The issue I wrote should have clarified this, but I can create a separate issue for it: it would be good to include different compression options in corpus.py. I'll explain this further in that issue, since it is out of scope for this PR!
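As a rough illustration of what selectable compression options might look like (this is a sketch, not the actual code in corpus.py or metrics.py), the snippet below picks a codec by name. The names `CODECS`, `compress_bytes`, and `decompress_bytes` are hypothetical, and the zstd entry assumes the third-party `zstandard` package is installed.

```python
import bz2
import zlib


def _zstd_compress(data):
    # Assumption: the third-party `zstandard` package is installed.
    import zstandard
    return zstandard.ZstdCompressor().compress(data)


def _zstd_decompress(data):
    # Assumption: the frame was produced by ZstdCompressor.compress(), which
    # records the content size that one-shot decompression relies on.
    import zstandard
    return zstandard.ZstdDecompressor().decompress(data)


# Hypothetical registry: algorithm name -> (compress, decompress) callables.
CODECS = {
    "bz2": (bz2.compress, bz2.decompress),
    "zlib": (zlib.compress, zlib.decompress),
    "zstd": (_zstd_compress, _zstd_decompress),
}


def _lookup(algorithm):
    if algorithm not in CODECS:
        raise ValueError(f"Unknown compression algorithm: {algorithm!r}")
    return CODECS[algorithm]


def compress_bytes(data, algorithm="bz2"):
    """Compress a bytes object with the named algorithm."""
    return _lookup(algorithm)[0](data)


def decompress_bytes(data, algorithm="bz2"):
    """Reverse compress_bytes for the same algorithm name."""
    return _lookup(algorithm)[1](data)
```

A caller would then pass the algorithm name from configuration, for example `compress_bytes(payload, "zlib")`, instead of importing a specific library at each call site.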

IanHoang

Just need to fix the pylint error in the CI Unittests and we should be good

beaioun · 2023-11-09T23:00:39Z

> @beaioun You are correct, there is no io.compress(). As of now, the only time compressing is used is in the metrics.py and corpus.py, but these call the libraries zlib and bz2 directly. Adding a separate function to io.py to compress like you did doesn't hurt. The issue I wrote should've clarified this but I can create a separate issue on this: it'd be good to include different compression options in corpus.py. I'll explain this further in a separate issue since this is out of scope for this PR!

Thank you @IanHoang for the clarification. I'll take a look at corpus.py and see if we can implement that.

beaioun · 2023-11-09T23:02:15Z

> Just need to fix the pylint error in the CI Unittests and we should be good

Thanks @IanHoang, will do

IanHoang · 2023-11-13T17:46:52Z

> > Just need to fix the pylint error in the CI Unittests and we should be good

> Thanks @IanHoang, will do

@beaioun I think it's fine to go ahead and fix the CI unit tests here, and then we can merge this in. We can open a separate issue for corpus.py and create a separate PR for it. This PR already has a good amount of changes.

beaioun · 2023-11-18T00:06:29Z

> > > Just need to fix the pylint error in the CI Unittests and we should be good

> > Thanks @IanHoang, will do

> @beaioun I think it's fine to go ahead and fix the CI unittests here and we can merge this in. We can open a separate issue for corpus.py and create a separate PR for it. This PR has a good amount of changes

Hey @IanHoang, it's been a few days, so I'll go ahead and commit the recent changes. I still have an unresolved "throughput throttled" error, but I suspect it has something to do with my PC setup. Let's see how the tests go, and I will fix it from there. 4f37b4b

IanHoang

Thanks for this!

Initial commit for adding zstd (de)compression support for workload c…

7c95c4b

…orpora Signed-off-by: beaioun <[email protected]>

beaioun requested review from IanHoang and gkamat as code owners November 8, 2023 09:24

IanHoang reviewed Nov 9, 2023

IanHoang requested changes Nov 9, 2023

Let me commit the changes first and see if the error will be cleared.…

4f37b4b

… If not I will fix from this point Signed-off-by: beaioun <[email protected]>

beaioun requested a review from IanHoang November 20, 2023 06:39

IanHoang approved these changes Nov 20, 2023

IanHoang merged commit 20d4928 into opensearch-project:main Nov 20, 2023
8 checks passed
