error while decoding serialized histogram produced by rust version #29

Open
tdyas opened this issue Jan 8, 2021 · 12 comments

@tdyas

tdyas commented Jan 8, 2021

I am generating histograms in Rust and am deserializing in Python using the HDR Histogram libraries for Rust and Python. The Rust code produces a byte array with the encoded histogram which ends up as a bytes instance in Python. (The project is a Python program that integrates with a Rust library via the cpython crate.)

It appears that the Python library can only decode the histogram if Rust encodes it using hdrhistogram::serialization::V2DeflateSerializer and the resulting bytes are then base64-encoded on the Python side (via base64.b64encode).

Without the base64 encoding, decoding with histogram = HdrHistogram.decode(encoded_histogram, b64_wrap=False) results in this error:

Traceback (most recent call last):
  ...
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 356, in decode
    hdr_payload = HdrPayload(8, compressed_payload=cpayload)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 146, in __init__
    self._decompress(compressed_payload)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 197, in _decompress
    self._data = zlib.decompress(compressed_payload)
zlib.error: Error -3 while decompressing data: incorrect header check

Using uncompressed encoding (via hdrhistogram::serialization::V2Serializer in Rust) and base64 encoding in Python results in this error:

Traceback (most recent call last):
  ...
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 346, in decode
    raise HdrCookieException()
hdrh.codec.HdrCookieException

And using uncompressed encoding without base64 results in:

Traceback (most recent call last):
  ...
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 356, in decode
    hdr_payload = HdrPayload(8, compressed_payload=cpayload)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 146, in __init__
    self._decompress(compressed_payload)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 197, in _decompress
    self._data = zlib.decompress(compressed_payload)
zlib.error: Error -3 while decompressing data: incorrect header check
@ahothan
Contributor

ahothan commented Jan 8, 2021

traceback 1: This looks like an issue with the compressed data (what is base 64 encoded).

traceback 2 and 3: uncompressed histogram is not a valid/supported format as far as I know

If you can provide an example of rust generated histoblob (base64 compressed) that fails decoding in python, I can have a closer look.
Have you tried decoding the same histoblob using other decoders (java, C, go...)?
Have you tried the reverse (decode in rust a histoblob generated by python library)?

@tdyas
Author

tdyas commented Jan 8, 2021

> traceback 1: This looks like an issue with the compressed data (what is base 64 encoded).

The data was the raw set of bytes for the histogram with no base64 encoding. The Rust encoder for compressed histograms does not appear to do base64 encoding. See https://github.com/HdrHistogram/HdrHistogram_rust/blob/89ea97afdfa543a6b7a0ebc8c7d03eddf66affb3/src/serialization/v2_deflate_serializer.rs#L75-L133

> traceback 2 and 3: uncompressed histogram is not a valid/supported format as far as I know

The Rust code is able to produce uncompressed histograms though. See https://github.com/HdrHistogram/HdrHistogram_rust/blob/89ea97afdfa543a6b7a0ebc8c7d03eddf66affb3/src/serialization/v2_serializer.rs#L67-L115

> If you can provide an example of rust generated histoblob (base64 compressed) that fails decoding in python, I can have a closer look.

The Rust side of the project is here: https://github.com/tdyas/pants/blob/9f4e51cb0bc0293e56c7fa6376f7530d008ceaf5/src/rust/engine/workunit_store/src/lib.rs#L730-L756

On the Python side, I need to call base64.b64encode on the raw bytes to get the base64 encoding. Then the Python decoder works.

> Have you tried decoding the same histoblob using other decoders (java, C, go...)?

I have not.

Maybe this is a bug in the Rust encoder where it fails to base64 encode?

> Have you tried the reverse (decode in rust a histoblob generated by python library)?

I have not. The Python code is the part of the project that uploads histograms out of the Pants build tool into a server for histograms collected in the Rust engine.

@marshallpierce

marshallpierce commented Jan 8, 2021

Histogram serialization does not involve base64; it just produces bytes. See EncodableHistogram#encodeIntoCompressedByteBuffer's implementations in the Java implementation. It may be base64'd later for transport in plain-text environments like a text histogram log, but that's separate -- it would be inefficient to always base64. edit: I misread; I thought there was confusion over whether the raw serialized form itself should always be base64'd.

There are 4 kinds of encoding: V0, V1, V2, V2+Deflate. The Rust implementation currently supports the latter two.

@ahothan
Contributor

ahothan commented Jan 8, 2021

maybe we can discuss this over https://gitter.im/HdrHistogram/HdrHistogram ?
python supports V2 which to my knowledge only supports compressed + base64 and optionally compressed without base64.

@ahothan
Contributor

ahothan commented Jan 8, 2021

@tdyas it looks like the only path that would work is if you generate on Rust side using hdrhistogram::serialization::V2DeflateSerializer (and without base64)
and use the python decode with b64_wrap=False

(was not clear above which format you were using when you say b64_wrap=False did not work)

To move forward, can you send an example of Rust generated compressed histogram (base64 version works) and I can have a look on my side why the python decode fails.

@tdyas
Author

tdyas commented Jan 10, 2021

Here is the failure with a compressed blob with a single observation (and the success once it has been base64 encoded). The value was produced by V2DeflateSerializer in the Rust library.

Python 3.8.6 (default, Nov  2 2020, 08:14:47)
[Clang 12.0.0 (clang-1200.0.32.21)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> encoded = b"\x1c\x84\x93\x14\x00\x00\x00\x1fx\x9c\x93i\x99,\xcc\xc0\xc0\xc0\xcc\x00\x010\x9a\x11J3\xd9\x7f\x800\xfe32\x01\x00E\x0c\x03\x81"
>>> from hdrh.histogram import HdrHistogram
>>> h = HdrHistogram.decode(encoded, b64_wrap=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "XXX/foo/lib/python3.8/site-packages/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 356, in decode
    hdr_payload = HdrPayload(8, compressed_payload=cpayload)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 146, in __init__
    self._decompress(compressed_payload)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 197, in _decompress
    self._data = zlib.decompress(compressed_payload)
zlib.error: Error -3 while decompressing data: incorrect header check
>>> import base64
>>> h = HdrHistogram.decode(base64.b64encode(encoded))
>>> h.get_total_count()
1
>>>

Here is the failure with an uncompressed blob produced by V2Serializer in the Rust library:

>>> encoded_uncompressed = b'\x1c\x84\x93\x13\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02?\xf0\x00\x00\x00\x00\x00\x00\xff\x01\x02'
>>> h = HdrHistogram.decode(encoded_uncompressed, b64_wrap=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "XXX/foo/lib/python3.8/site-packages/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 356, in decode
    hdr_payload = HdrPayload(8, compressed_payload=cpayload)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 146, in __init__
    self._decompress(compressed_payload)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 197, in _decompress
    self._data = zlib.decompress(compressed_payload)
zlib.error: Error -3 while decompressing data: incorrect header check
>>> h = HdrHistogram.decode(base64.b64encode(encoded_uncompressed))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "XXX/foo/lib/python3.8/site-packages/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 346, in decode
    raise HdrCookieException()
hdrh.codec.HdrCookieException

@ahothan
Contributor

ahothan commented Jan 12, 2021

Yes I got the backtraces but I really need to get a hold on the buffer you pass to decode() so I can try to reproduce on my computer and decode it manually.

h = HdrHistogram.decode(encoded, b64_wrap=False)

Can you copy the "encoded" buffer here in base64 format?

You can either print directly the result of hdrhistogram::serialization::V2DeflateSerializer with base64
or wrap in base64 the output of hdrhistogram::serialization::V2DeflateSerializer

@tdyas
Author

tdyas commented Jan 12, 2021

> Yes I got the backtraces but I really need to get a hold on the buffer you pass to decode() so I can try to reproduce on my computer and decode it manually.

The buffers are in there as Python bytes literals:

encoded = b"\x1c\x84\x93\x14\x00\x00\x00\x1fx\x9c\x93i\x99,\xcc\xc0\xc0\xc0\xcc\x00\x010\x9a\x11J3\xd9\x7f\x800\xfe32\x01\x00E\x0c\x03\x81"

and:

encoded_uncompressed = b'\x1c\x84\x93\x13\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02?\xf0\x00\x00\x00\x00\x00\x00\xff\x01\x02'

@tdyas
Author

tdyas commented Jan 12, 2021

And here they are converted to base64:

Python 3.8.6 (default, Nov  2 2020, 08:14:47)
[Clang 12.0.0 (clang-1200.0.32.21)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import base64
>>> encoded = b"\x1c\x84\x93\x14\x00\x00\x00\x1fx\x9c\x93i\x99,\xcc\xc0\xc0\xc0\xcc\x00\x010\x9a\x11J3\xd9\x7f\x800\xfe32\x01\x00E\x0c\x03\x81"
>>> base64.b64encode(encoded)
b'HISTFAAAAB94nJNpmSzMwMDAzAABMJoRSjPZf4Aw/jMyAQBFDAOB'

and:

>>> encoded_uncompressed = b'\x1c\x84\x93\x13\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02?\xf0\x00\x00\x00\x00\x00\x00\xff\x01\x02'
>>> base64.b64encode(encoded_uncompressed)
b'HISTEwAAAAMAAAAAAAAAAwAAAAAAAAABAAAAAAAAAAI/8AAAAAAAAP8BAg=='
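
Both base64 strings start with "HIST" because both blobs start with the same three cookie bytes (1c 84 93); only the fourth byte differs. A minimal sketch, using only the values posted in this thread, that tells the two encodings apart by their leading 4-byte cookie. The constant names and `classify` are made up for illustration and are not identifiers from hdrh or the Rust hdrhistogram crate:

```python
import base64
import struct

# Cookie values as they appear in the first four bytes of the two blobs above.
# Illustrative names only, not taken from either library.
V2_COOKIE = 0x1C849313          # blob produced by V2Serializer
V2_DEFLATE_COOKIE = 0x1C849314  # blob produced by V2DeflateSerializer


def classify(blob: bytes) -> str:
    """Label a serialized histogram by its big-endian 4-byte cookie."""
    (cookie,) = struct.unpack(">I", blob[:4])
    if cookie == V2_DEFLATE_COOKIE:
        return "V2+Deflate"
    if cookie == V2_COOKIE:
        return "V2"
    return "unknown"


compressed = base64.b64decode("HISTFAAAAB94nJNpmSzMwMDAzAABMJoRSjPZf4Aw/jMyAQBFDAOB")
uncompressed = base64.b64decode(
    "HISTEwAAAAMAAAAAAAAAAwAAAAAAAAABAAAAAAAAAAI/8AAAAAAAAP8BAg=="
)
print(classify(compressed))    # V2+Deflate
print(classify(uncompressed))  # V2
```

This is consistent with the HdrCookieException seen for the uncompressed blob: its cookie differs from the compressed one in the last byte.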

@ahothan
Contributor

ahothan commented Jan 18, 2021

ok here's what I found on the decode of a rust V2 compressed histogram.
The base64 encoded string (rust_compressed_b64) works fine when decoding on python:

from hdrh.histogram import HdrHistogram

def test_rust():
    # Base64-wrapped blob: decodes fine.
    rust_compressed_b64 = "HISTFAAAAB94nJNpmSzMwMDAzAABMJoRSjPZf4Aw/jMyAQBFDAOB"
    histogram = HdrHistogram.decode(rust_compressed_b64)

    # Same blob as raw bytes: fails with zlib.error (incorrect header check).
    rust_compressed = b"\x1c\x84\x93\x14\x00\x00\x00\x1fx\x9c\x93i\x99,\xcc\xc0\xc0\xc0\xcc\x00\x010\x9a\x11J3\xd9\x7f\x800\xfe32\x01\x00E\x0c\x03\x81"
    histogram = HdrHistogram.decode(rust_compressed, b64_wrap=False)

However, the non-base64 compressed buffer (rust_compressed) fails.
I added some traces to dump the buffer that is being decompressed in each case, and they do not match:

########BUFFER len=31
b'789c9369992cccc0c0c0cc0001309a114a33d97f8030fe33320100450c0381'
########BUFFER len=39
b'1c8493140000001f789c9369992cccc0c0c0cc0001309a114a33d97f8030fe33320100450c0381'

As you can see, the Rust compressed buffer is 8 bytes too long (at the start of the buffer), which explains why the zlib decompression fails.
These first 8 bytes are unexpected:
b'1c8493140000001f'

@marshallpierce

That's the v2 compressed cookie and the length. 0x1f is 31, which is the length of the buffer.
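
To make that concrete, here is a minimal standard-library sketch that splits the 39-byte buffer from the dump above into the 4-byte cookie, the 4-byte big-endian length, and the zlib payload. `split_header` is a made-up helper name, not part of hdrh:

```python
import struct
import zlib

# The 39-byte buffer from the hex dump above.
encoded = bytes.fromhex(
    "1c8493140000001f"
    "789c9369992cccc0c0c0cc0001309a114a33d97f8030fe33320100450c0381"
)


def split_header(blob: bytes):
    """Split a blob into (cookie, length, payload); both header fields are big-endian."""
    cookie, length = struct.unpack(">II", blob[:8])
    return cookie, length, blob[8:]


cookie, length, payload = split_header(encoded)
print(hex(cookie))           # 0x1c849314
print(length, len(payload))  # 31 31

# Only the part after the 8-byte header is a zlib stream.
inner = zlib.decompress(payload)
print(inner[:4].hex())       # 1c849313, the cookie of the uncompressed V2 form
```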

@tdyas
Author

tdyas commented Jan 18, 2021

> That's the v2 compressed cookie and the length. 0x1f is 31, which is the length of the buffer.

The base64 code path in the decode function seems to strip the header from the buffer, but the non-base64 code path does not:

    cpayload = b64decode[ext_header_size:]
else:
    cpayload = encoded_histogram
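
Given that asymmetry, one possible caller-side workaround is to strip the 8-byte header before using b64_wrap=False. The following is only a sketch under the assumption that the header is a 4-byte cookie followed by a 4-byte big-endian length, as described above; `strip_v2_deflate_header` and `HEADER_SIZE` are hypothetical names, not part of hdrh:

```python
import struct
import zlib

HEADER_SIZE = 8  # assumed layout: 4-byte cookie + 4-byte big-endian payload length


def strip_v2_deflate_header(blob: bytes) -> bytes:
    """Return only the zlib payload of a V2+Deflate blob, checking the length field."""
    if len(blob) < HEADER_SIZE:
        raise ValueError("blob is shorter than the 8-byte header")
    _cookie, length = struct.unpack(">II", blob[:HEADER_SIZE])
    payload = blob[HEADER_SIZE:]
    if length != len(payload):
        raise ValueError(f"header announces {length} bytes, got {len(payload)}")
    return payload


rust_compressed = bytes.fromhex(
    "1c8493140000001f"
    "789c9369992cccc0c0c0cc0001309a114a33d97f8030fe33320100450c0381"
)
cpayload = strip_v2_deflate_header(rust_compressed)

# zlib accepts the stripped payload; the full 39-byte buffer raises zlib.error.
decompressed = zlib.decompress(cpayload)
```

With hdrh installed, passing `cpayload` to `HdrHistogram.decode(..., b64_wrap=False)` should then follow the same path as the base64 case, but that is an inference from the snippet quoted above and has not been verified here. The route confirmed earlier in this thread is `HdrHistogram.decode(base64.b64encode(rust_compressed))`.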

caizixian added a commit to caizixian/dacapo-latency-dump-hdrh that referenced this issue Nov 29, 2021

3 participants