[DRIVERS-2926] [PYTHON-4577] BSON Binary Vector Subtype Support #1813

Merged
Changes from 28 commits
48 commits
245c869
First commit on DRIVERS-2926-BSON-Binary-Vectors
caseyclements Aug 22, 2024
031cd8c
Turns dtype into enum. Adds handling of padding, __eq__. Removal of n…
caseyclements Aug 23, 2024
8d4e8a2
Added docstring and comments
caseyclements Aug 23, 2024
2df0d6b
Changed order of BinaryVector and Binary in bson._ENCODERS to get tes…
caseyclements Aug 23, 2024
315a115
Changed order of BinaryVector and Binary in bson._ENCODERS to get tes…
caseyclements Aug 23, 2024
d74314d
json_util dumps/loads of BinaryVector
caseyclements Aug 23, 2024
27f13c8
Added bson_corpus tests. Needs more, and review of json_util
caseyclements Aug 24, 2024
263f8c7
Removed BinaryVector as separate class. Instead, Binary includes as_v…
caseyclements Sep 12, 2024
f8bcdef
Stop setting _USD_C to False
caseyclements Sep 13, 2024
5435785
mypy fixes
caseyclements Sep 13, 2024
5c4d152
Removed stub vector.json for bson_corpus tests
caseyclements Sep 13, 2024
f86d040
More tests
caseyclements Sep 13, 2024
adcb945
Added description of subtype 9 to bson.Binary docstring
caseyclements Sep 14, 2024
7986cc5
Addressed comments in docstrings.
caseyclements Sep 16, 2024
26b8398
Eased string comparison of exception in xfail in test_bson
caseyclements Sep 16, 2024
28de28a
Updates to docstrings of BinaryVector and BinaryVectorDtype
caseyclements Sep 17, 2024
68235b8
Simplified expected exception case. Will be refactored with yaml anyway.
caseyclements Sep 17, 2024
e2a1a3c
Added draft of test runner
caseyclements Sep 18, 2024
bf9758a
Added test cases: padding, and overflow
caseyclements Sep 19, 2024
e1590aa
Merge branch 'master' into DRIVERS-2926-BSON-Binary-Vectors
caseyclements Sep 19, 2024
c4c7af7
Cast Path to str
caseyclements Sep 19, 2024
de5a245
Simplified as_vector API
caseyclements Sep 20, 2024
43bcce4
Added test case: list of floats with dtype int8 raises exception
caseyclements Sep 20, 2024
41ee0bb
Set default padding to 0 in test runner
caseyclements Sep 20, 2024
9d52aeb
Updated test_bson for new as_vector API
caseyclements Sep 20, 2024
0d34464
Updated resync-specs.sh to include bson-binary-vector
caseyclements Sep 20, 2024
1d49656
Updated resync-specs.sh and test cases
caseyclements Sep 20, 2024
2af0ca4
Broke tests into 3 files by dtype
caseyclements Sep 20, 2024
c93bae1
Update bson/binary.py
caseyclements Sep 27, 2024
f374b5a
Removed json from test_bson_binary_vector and its jsons
caseyclements Sep 27, 2024
0db9866
Addition of Provision (BETA) specifiers change references to 4.10
caseyclements Sep 30, 2024
0532803
Add references to from_vector() and as_vector()
caseyclements Sep 30, 2024
3edeef6
Add subtype number in changelog
caseyclements Sep 30, 2024
d199597
Raise ValueErrors not AssertionErrors. Bumped from 4.9 to 4.10
caseyclements Sep 30, 2024
abc7cd3
Docstring for as_vector
caseyclements Sep 30, 2024
4550c20
Add slots for BinaryVector
caseyclements Sep 30, 2024
99d44e1
Check subtype before decoding
caseyclements Oct 1, 2024
001636d
Try slots with default padding
caseyclements Oct 1, 2024
637c474
Removed slots arg
caseyclements Oct 1, 2024
2d511f6
Update dataclass
caseyclements Oct 1, 2024
17e1d33
Remove uncompressed kwarg from as_vector
caseyclements Oct 1, 2024
ce5f3e3
Changed TypeError to ValueError
caseyclements Oct 1, 2024
edfe972
Updates after removing uncompressed
caseyclements Oct 1, 2024
8aaa2f6
Fixed expected exceptions in invalid test cases
caseyclements Oct 1, 2024
dfb322c
Merge branch 'master' into DRIVERS-2926-BSON-Binary-Vectors
blink1073 Oct 1, 2024
8946daf
padding is now Optional[int] = None
caseyclements Oct 1, 2024
9397129
padding does need to be an integer
caseyclements Oct 1, 2024
913403b
Removed unneeded ugly TYPE_FROM_HEX = {key.value: key for key in Bina…
caseyclements Oct 1, 2024
3 changes: 3 additions & 0 deletions .evergreen/resync-specs.sh
@@ -76,6 +76,9 @@ do
atlas-data-lake-testing|data_lake)
cpjson atlas-data-lake-testing/tests/ data_lake
;;
bson-binary-vector|bson_binary_vector)
cpjson bson-binary-vector/tests/ bson_binary_vector
;;
bson-corpus|bson_corpus)
cpjson bson-corpus/tests/ bson_corpus
;;
157 changes: 151 additions & 6 deletions bson/binary.py
@@ -13,7 +13,10 @@
# limitations under the License.
from __future__ import annotations

from typing import TYPE_CHECKING, Any, Tuple, Type, Union
import struct
from dataclasses import dataclass
from enum import Enum
from typing import TYPE_CHECKING, Any, Optional, Sequence, Tuple, Type, Union
from uuid import UUID

"""Tools for representing BSON binary data.
@@ -191,21 +194,76 @@ class UuidRepresentation:
"""


VECTOR_SUBTYPE = 9
"""BSON binary subtype for densely packed vector data.

.. versionadded:: 4.9
"""


USER_DEFINED_SUBTYPE = 128
"""BSON binary subtype for any user defined structure.
"""


class BinaryVectorDtype(Enum):
"""Datatypes of vector subtype.

:param FLOAT32: (0x27) Pack list of :class:`float` as float32
:param INT8: (0x03) Pack list of :class:`int` in [-128, 127] as signed int8
:param PACKED_BIT: (0x10) Pack list of :class:`int` in [0, 255] as unsigned uint8

The `PACKED_BIT` value represents a special case where vector values themselves
can only be of two values (0 or 1) but these are packed together into groups of 8,
a byte. In Python, these are displayed as ints in range [0, 255]

Each value is of type bytes with a length of one.

.. versionadded:: 4.9
"""

INT8 = b"\x03"
Member:

An enum of bytes is a bit unusual. Can we change it to use ints? eg INT8 = 3

Contributor Author:

The naming convention that Geert used has meaning in both the first and last 4 bits, so I'd prefer it to stay as-is.

Member:

I don't see how the interpretation of the values here matters. Are you saying this is a bit flag?

Member:

The user doesn't see the value locally either. Given that the bson spec itself always uses integers I think we should use the integer form.
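To make the trade-off in this thread concrete, here is a minimal sketch of the integer-valued alternative the reviewers ask for. The class name `IntBinaryVectorDtype` and the two helper functions are hypothetical and not part of the PR; only the three values (0x03, 0x27, 0x10) come from the diff.

```python
import struct
from enum import Enum
from typing import Tuple


class IntBinaryVectorDtype(Enum):
    """Hypothetical integer-valued variant of BinaryVectorDtype (illustration only)."""

    INT8 = 0x03
    FLOAT32 = 0x27
    PACKED_BIT = 0x10


def pack_header(dtype: IntBinaryVectorDtype, padding: int = 0) -> bytes:
    # With int values, both metadata fields are packed as unsigned bytes ("BB")
    # rather than the "<sB" format needed for bytes-valued members.
    return struct.pack("<BB", dtype.value, padding)


def unpack_header(payload: bytes) -> Tuple[IntBinaryVectorDtype, int]:
    dtype_value, padding = struct.unpack_from("<BB", payload, 0)
    # Enum lookup by value replaces a separate bytes-to-enum mapping.
    return IntBinaryVectorDtype(dtype_value), padding


header = pack_header(IntBinaryVectorDtype.FLOAT32)
print(header.hex())  # 2700
print(unpack_header(b"\x10\x03"))
```

The bytes on the wire are identical in both designs; the difference is only in how the enum members look to Python code.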

FLOAT32 = b"\x27"
PACKED_BIT = b"\x10"


# Map from bytes to enum value, for decoding.
DTYPE_FROM_HEX = {key.value: key for key in BinaryVectorDtype}


@dataclass
class BinaryVector:
"""Vector of numbers along with metadata for binary interoperability.

:param data: Sequence of numbers representing the mathematical vector.
:param dtype: The data type stored in binary
:param padding: The number of bits in the final byte that are to be ignored
when a vector element's size is less than a byte
and the length of the vector is not a multiple of 8.

.. versionadded:: 4.9
"""

data: Sequence[float | int]
Member:

Add __slots__ = ("data", "dtype", "padding") to reduce the memory overhead of using this class.

Contributor Author:

Sure. How do I do type annotation for that?

Member:

None needed, e.g.

__slots__ = ("__time", "__inc")

Contributor Author:

Strange. And you're cool with that?

Member:

We have to add them manually instead of using slots=True.

Contributor Author:

class BinaryVector:
    """**(BETA)** Vector of numbers along with metadata for binary interoperability.

    :param data (Sequence[float | int]): Sequence of numbers representing the mathematical vector.
    :param dtype (:class:`bson.Binary.BinaryVectorDtype`):  The data type stored in binary
    :param padding (Optional[int] = 0): The number of bits in the final byte that are to be ignored
      when a vector element's size is less than a byte
      and the length of the vector is not a multiple of 8. Default is 0.

    .. versionadded:: 4.10
    """

    __slots__ = ("data", "dtype", "padding")

    def __init__(self, data, dtype, padding=0):
        self.data = data
        self.dtype = dtype
        self.padding = padding

Is this right? @blink1073

Member:

yep

Member:

Oh I didn't realize __slots__ with dataclass was problematic.

Member:

It should be sorted now
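A short sketch of why `__slots__` and a dataclass default do not combine directly, which seems to be the problem this thread ran into. A field default such as `padding: int = 0` becomes a class attribute, and Python refuses to create a class whose `__slots__` entry has the same name as a class attribute. `@dataclass(slots=True)` avoids this but only exists on Python 3.10 and later. The class names below are hypothetical stand-ins, not the PR's code.

```python
from dataclasses import dataclass


class SlottedVector:
    """Hypothetical stand-in for BinaryVector with manually declared slots."""

    __slots__ = ("data", "dtype", "padding")

    def __init__(self, data, dtype, padding=0):
        self.data = data
        self.dtype = dtype
        self.padding = padding


def dataclass_with_default_and_slots_fails() -> bool:
    """Return True if combining __slots__ with a defaulted field raises ValueError."""
    try:

        @dataclass
        class Broken:
            __slots__ = ("data", "padding")
            data: list
            padding: int = 0  # the default is a class attribute named like a slot

    except ValueError:
        return True
    return False


v = SlottedVector([1, 2], "INT8")
print(hasattr(v, "__dict__"))                    # False
print(dataclass_with_default_and_slots_fails())  # True
```

The manual `__init__` in the thread sidesteps the conflict because the default lives in the function signature instead of on the class.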

dtype: BinaryVectorDtype
padding: Optional[int] = 0
Member:

Optional[int] -> int, unless padding=None is considered valid.

Member:

padding=None is expected when it is not a packed type

Member:

Is it? I see the code using padding=0 in that case.



class Binary(bytes):
"""Representation of BSON binary data.

This is necessary because we want to represent Python strings as
the BSON string type. We need to wrap binary data so we can tell
We want to represent Python strings as the BSON string type.
We need to wrap binary data so that we can tell
the difference between what should be considered binary data and
what should be considered a string when we encode to BSON.

Raises TypeError if `data` is not an instance of :class:`bytes`
or `subtype` is not an instance of :class:`int`.
Subtype 9 provides a space-efficient representation of 1-dimensional vector data.
Its data is prepended with two bytes of metadata.
The first (dtype) describes its data type, such as float32 or int8.
The second (padding) prescribes the number of bits to ignore in the final byte.
This is relevant when the element size of the dtype is not a multiple of 8.

Raises TypeError if `subtype` is not an instance of :class:`int`.
Raises ValueError if `subtype` is not in [0, 256).

.. note::
@@ -218,7 +276,10 @@ class Binary(bytes):
to use

.. versionchanged:: 3.9
Support any bytes-like type that implements the buffer protocol.
Support any bytes-like type that implements the buffer protocol.

.. versionchanged:: 4.9
Addition of vector subtype.
"""

_type_marker = 5
@@ -337,6 +398,90 @@ def as_uuid(self, uuid_representation: int = UuidRepresentation.STANDARD) -> UUI
f"cannot decode subtype {self.subtype} to {UUID_REPRESENTATION_NAMES[uuid_representation]}"
)

@classmethod
def from_vector(
cls: Type[Binary],
vector: list[int, float],
dtype: BinaryVectorDtype,
padding: Optional[int] = 0,
Member:

Optional[int] -> int

) -> Binary:
"""Create a BSON :class:`~bson.binary.Binary` of Vector subtype from a list of Numbers.

To interpret the representation of the numbers, a data type must be included.
See :class:`~bson.binary.BinaryVectorDtype` for available types and descriptions.

The dtype and padding are prepended to the binary data's value.

:param vector: List of values
:param dtype: Data type of the values
:param padding: For fractional bytes, number of bits to ignore at end of vector.
:return: Binary packed data identified by dtype and padding.

.. versionadded:: 4.9
"""
if dtype == BinaryVectorDtype.INT8: # pack ints in [-128, 127] as signed int8
format_str = "b"
assert not padding, f"padding does not apply to {dtype=}"
Member:

Same comment about assert.

elif dtype == BinaryVectorDtype.PACKED_BIT: # pack ints in [0, 255] as unsigned uint8
format_str = "B"
elif dtype == BinaryVectorDtype.FLOAT32: # pack floats as float32
format_str = "f"
assert not padding, f"padding does not apply to {dtype=}"
Member:

Same comment about assert.

else:
raise NotImplementedError("%s not yet supported" % dtype)

metadata = struct.pack("<sB", dtype.value, padding)
data = struct.pack(f"{len(vector)}{format_str}", *vector)
return cls(metadata + data, subtype=VECTOR_SUBTYPE)

def as_vector(self, uncompressed: Optional[bool] = False) -> BinaryVector:
Member:

Optional[bool] -> bool unless uncompressed=None is valid.

"""From the Binary, create a list of numbers, along with dtype and padding.


:param uncompressed: If true, return the true mathematical vector.
This is only necessary for datatypes where padding is applicable.
For example, setting this to True for a PACKED_BIT vector will result
in a List[int] of zeros and ones.
:return: List of numbers, along with dtype and padding.

.. versionadded:: 4.9
"""

Member:

We need to validate self.subtype == 9 here before attempting to decode.

Contributor Author:

@ShaneHarvey How do I do that?

Member:

Something like:

if self.subtype != VECTOR_SUBTYPE:
   raise ValueError(...)

Member:

Look at as_uuid it should be the same as that validation:

        if self.subtype not in ALL_UUID_SUBTYPES:
            raise ValueError(f"cannot decode subtype {self.subtype} as a uuid")
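As a standalone illustration of the check requested above, the following sketch validates the subtype before touching the two metadata bytes. `split_vector_payload` is a hypothetical free function rather than a method on `Binary`, and the error wording is an assumption modeled on the `as_uuid` message quoted in this thread.

```python
import struct

VECTOR_SUBTYPE = 9  # value taken from the diff


def split_vector_payload(data: bytes, subtype: int):
    """Hypothetical helper: validate the subtype, then split header and body."""
    if subtype != VECTOR_SUBTYPE:
        raise ValueError(f"cannot decode subtype {subtype} as a vector")
    if len(data) < 2:
        raise ValueError("vector payload must contain dtype and padding bytes")
    # Same "<sB" header layout as the diff: one dtype byte, one padding byte.
    dtype_byte, padding = struct.unpack_from("<sB", data, 0)
    return dtype_byte, padding, data[2:]


print(split_vector_payload(b"\x03\x00\x7f\x07", 9))
# (b'\x03', 0, b'\x7f\x07')
```

Without the first check, arbitrary binary data of another subtype would be interpreted as a dtype byte and a padding byte, and the failure would surface later as a confusing enum lookup error.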

position = 0
dtype, padding = struct.unpack_from("<sB", self, position)
position += 2
dtype = BinaryVectorDtype(dtype)
n_values = len(self) - position

if dtype == BinaryVectorDtype.INT8:
dtype_format = "b"
format_string = f"{n_values}{dtype_format}"
vector = list(struct.unpack_from(format_string, self, position))
return BinaryVector(vector, dtype, padding)

elif dtype == BinaryVectorDtype.FLOAT32:
n_bytes = len(self) - position
n_values = n_bytes // 4
assert n_bytes % 4 == 0
Member:

We should not use assert for validation data. If the argument type is incorrect we raise a TypeError, if the type is correct but the value is invalid we raise a ValueError.
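The TypeError/ValueError distinction described above can be sketched for the FLOAT32 length check. `unpack_float32_body` is a hypothetical helper, not code from the PR. One reason `assert` is unsuitable here is that assert statements are skipped entirely when Python runs with `-O`.

```python
import struct


def unpack_float32_body(body) -> list:
    """Hypothetical helper: decode little-endian float32 values with explicit checks."""
    if not isinstance(body, (bytes, bytearray)):
        # Wrong argument type: TypeError
        raise TypeError(f"body must be bytes or bytearray, not {type(body).__name__}")
    if len(body) % 4:
        # Right type, invalid value: ValueError
        raise ValueError(
            f"FLOAT32 vector body must be a multiple of 4 bytes, got {len(body)}"
        )
    return list(struct.unpack(f"<{len(body) // 4}f", body))


print(unpack_float32_body(bytes.fromhex("0000fe420000e040")))  # [127.0, 7.0]
```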

vector = list(struct.unpack_from(f"{n_values}f", self, position))
return BinaryVector(vector, dtype, padding)

elif dtype == BinaryVectorDtype.PACKED_BIT:
# data packed as uint8
dtype_format = "B"
unpacked_uint8s = list(struct.unpack_from(f"{n_values}{dtype_format}", self, position))
if not uncompressed:
return BinaryVector(unpacked_uint8s, dtype, padding)
else:
bits = []
for uint8 in unpacked_uint8s:
bits.extend([int(bit) for bit in f"{uint8:08b}"])
Member (ShaneHarvey, Oct 1, 2024):

Why add an uncompressed option here at all? It looks like this option is irreversible because from_vector does not support the same option. Even if it did, is this option useful outside of test code?

Member:

Yeah, I think we should leave out uncompressed, it was a quality of life addition, but is not likely to be standardized.

Contributor Author:

Yes. It's irreversible. We don't yet provide an API to go from a full vector of zeros and ones to a packed bit vector.

Contributor Author:

Going in, you'd be using something like numpy.packbits

Member:

Let's omit it since it adds unneeded complexity at this point.

Contributor Author:

Done
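For context on the "going in" direction discussed in this thread, here is a pure-Python sketch of packing a list of zeros and ones into bytes and back. Both helpers are hypothetical; the PR does not provide them. Bits are packed most significant bit first, which matches the `f"{uint8:08b}"` expansion used in the diff, and `padding` counts the unused low-order bits of the final byte.

```python
def pack_bits(bits):
    """Hypothetical helper: pack 0/1 values into bytes, most significant bit first.

    Returns (packed_bytes, padding).
    """
    if any(bit not in (0, 1) for bit in bits):
        raise ValueError("bits must contain only 0 and 1")
    padding = (-len(bits)) % 8
    padded = list(bits) + [0] * padding
    packed = bytearray()
    for start in range(0, len(padded), 8):
        byte = 0
        for bit in padded[start : start + 8]:
            byte = (byte << 1) | int(bit)
        packed.append(byte)
    return bytes(packed), padding


def unpack_bits(packed, padding):
    """Inverse of pack_bits: expand bytes to bits and drop the padded tail."""
    bits = [int(bit) for byte in packed for bit in f"{byte:08b}"]
    return bits[:-padding] if padding else bits


bits = [0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 13 bits
packed, padding = pack_bits(bits)
print(list(packed), padding)  # [127, 0] 3
print(unpack_bits(packed, padding) == bits)  # True
```

With a pair like this the expansion would be reversible, but as the reviewers note, shipping only the unpacking half would leave users with no supported way back.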

vector = bits[:-padding] if padding else bits
return BinaryVector(vector, dtype, padding)

else:
raise NotImplementedError("Binary Vector dtype %s not yet supported" % dtype.name)

@property
def subtype(self) -> int:
"""Subtype of this binary data."""
8 changes: 8 additions & 0 deletions doc/api/bson/binary.rst
@@ -21,6 +21,14 @@
.. autoclass:: UuidRepresentation
:members:

.. autoclass:: BinaryVectorDtype
:members:
:show-inheritance:

.. autoclass:: BinaryVector
:members:


.. autoclass:: Binary(data, subtype=BINARY_SUBTYPE)
:members:
:show-inheritance:
45 changes: 45 additions & 0 deletions test/bson_binary_vector/float32.json
@@ -0,0 +1,45 @@
{
"description": "Tests of Binary subtype 9, Vectors, with dtype FLOAT32",
"test_key": "vector",
"tests": [
{
"description": "Simple Vector FLOAT32",
"valid": true,
"vector": [127.0, 7.0],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "1C00000005766563746F72000A0000000927000000FE420000E04000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAP5CAADgQA==\", \"subType\": \"09\"}}}"
},
{
"description": "Empty Vector FLOAT32",
"valid": true,
"vector": [],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009270000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwA=\", \"subType\": \"09\"}}}"
},
{
"description": "Infinity Vector FLOAT32",
"valid": true,
"vector": ["-inf", 0.0, "inf"],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 0,
"canonical_bson": "2000000005766563746F72000E000000092700000080FF000000000000807F00",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"JwAAAID/AAAAAAAAgH8=\", \"subType\": \"09\"}}}"
},
{
"description": "FLOAT32 with padding",
"valid": false,
"vector": [127.0, 7.0],
"dtype_hex": "0x27",
"dtype_alias": "FLOAT32",
"padding": 3
}
]
}
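The FLOAT32 cases above can be reproduced by hand, which makes the payload layout (dtype byte, padding byte, packed values) easy to see. This is a minimal sketch, not the PR's `from_vector`: `encode_float32_vector` is a hypothetical function, and it forces little-endian packing with `<`, whereas the diff packs the values with native byte order.

```python
import base64
import struct

FLOAT32_DTYPE = b"\x27"  # dtype byte from the diff


def encode_float32_vector(values, padding=0):
    """Sketch of the subtype 9 payload: dtype byte, padding byte, packed floats."""
    if padding:
        raise ValueError("padding does not apply to FLOAT32")
    header = struct.pack("<sB", FLOAT32_DTYPE, padding)
    # "<" forces little-endian float32, matching the canonical_bson bytes above.
    body = struct.pack(f"<{len(values)}f", *values)
    return header + body


payload = encode_float32_vector([127.0, 7.0])
print(payload.hex())                       # 27000000fe420000e040
print(base64.b64encode(payload).decode())  # JwAAAP5CAADgQA==
```

The base64 string matches the `canonical_extjson` of the "Simple Vector FLOAT32" case, and a nonzero padding is rejected as in the "FLOAT32 with padding" case.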

59 changes: 59 additions & 0 deletions test/bson_binary_vector/int8.json
@@ -0,0 +1,59 @@
{
"description": "Tests of Binary subtype 9, Vectors, with dtype INT8",
"test_key": "vector",
"tests": [
{
"description": "Simple Vector INT8",
"valid": true,
"vector": [127, 7],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0,
"canonical_bson": "1600000005766563746F7200040000000903007F0700",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwB/Bw==\", \"subType\": \"09\"}}}"
},
{
"description": "Empty Vector INT8",
"valid": true,
"vector": [],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009030000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"AwA=\", \"subType\": \"09\"}}}"
},
{
"description": "Overflow Vector INT8",
"valid": false,
"vector": [128],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0
},
{
"description": "Underflow Vector INT8",
"valid": false,
"vector": [-129],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0
},
{
"description": "INT8 with padding",
"valid": false,
"vector": [127, 7],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 3
},
{
"description": "INT8 with float inputs",
"valid": false,
"vector": [127.77, 7.77],
"dtype_hex": "0x03",
"dtype_alias": "INT8",
"padding": 0
}
]
}
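The INT8 cases above can be checked with the standard library alone. The sketch below builds the payload, turns `struct.error` into `ValueError` for the invalid cases, and frames the payload as a one-element BSON document to compare against `canonical_bson`. Both function names are hypothetical, and the hand-rolled framing is for illustration only; real code would use the `bson` package.

```python
import struct

INT8_DTYPE = b"\x03"  # dtype byte from the diff
VECTOR_SUBTYPE = 9    # binary subtype from the diff


def encode_int8_vector(values, padding=0):
    """Sketch: build the subtype 9 payload for INT8, raising ValueError on bad input."""
    if padding:
        raise ValueError("padding does not apply to INT8")
    try:
        body = struct.pack(f"<{len(values)}b", *values)
    except struct.error as exc:
        # struct rejects out-of-range ints and non-integer inputs alike.
        raise ValueError(f"invalid INT8 vector: {exc}") from exc
    return struct.pack("<sB", INT8_DTYPE, padding) + body


def wrap_in_bson_document(key, payload, subtype=VECTOR_SUBTYPE):
    """Frame a binary payload as a one-element BSON document (element type 0x05)."""
    element = (
        b"\x05"
        + key.encode("utf-8")
        + b"\x00"
        + struct.pack("<i", len(payload))
        + bytes([subtype])
        + payload
    )
    # The total length counts its own 4 bytes plus the trailing NUL terminator.
    return struct.pack("<i", 4 + len(element) + 1) + element + b"\x00"


payload = encode_int8_vector([127, 7])
print(wrap_in_bson_document("vector", payload).hex().upper())
# 1600000005766563746F7200040000000903007F0700
```

The overflow, underflow, padding, and float-input cases all fail inside `encode_int8_vector` with `ValueError`, in line with the commit that replaced `AssertionError` and `TypeError` with `ValueError`.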

53 changes: 53 additions & 0 deletions test/bson_binary_vector/packed_bit.json
@@ -0,0 +1,53 @@
{
"description": "Tests of Binary subtype 9, Vectors, with dtype PACKED_BIT",
"test_key": "vector",
"tests": [
{
"description": "Simple Vector PACKED_BIT",
"valid": true,
"vector": [127, 7],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 0,
"canonical_bson": "1600000005766563746F7200040000000910007F0700",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAB/Bw==\", \"subType\": \"09\"}}}"
},
{
"description": "Empty Vector PACKED_BIT",
"valid": true,
"vector": [],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 0,
"canonical_bson": "1400000005766563746F72000200000009100000",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAA=\", \"subType\": \"09\"}}}"
},
{
"description": "PACKED_BIT with padding",
"valid": true,
"vector": [127, 7],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 3,
"canonical_bson": "1600000005766563746F7200040000000910037F0700",
"canonical_extjson": "{\"vector\": {\"$binary\": {\"base64\": \"EAN/Bw==\", \"subType\": \"09\"}}}"
},
{
"description": "Overflow Vector PACKED_BIT",
"valid": false,
"vector": [256],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 0
},
{
"description": "Underflow Vector PACKED_BIT",
"valid": false,
"vector": [-1],
"dtype_hex": "0x10",
"dtype_alias": "PACKED_BIT",
"padding": 0
}
]
}
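To show what `padding` means for PACKED_BIT, this sketch decodes the "PACKED_BIT with padding" payload above and expands it to individual bits. `decode_packed_bit_payload` is a hypothetical function, and the restriction of padding to [0, 7] is an assumption that the diff itself does not enforce.

```python
import base64


def decode_packed_bit_payload(payload: bytes):
    """Sketch: split a PACKED_BIT payload into (packed ints, padding, expanded bits)."""
    if len(payload) < 2:
        raise ValueError("payload must contain dtype and padding bytes")
    dtype_byte, padding = payload[0:1], payload[1]
    if dtype_byte != b"\x10":
        raise ValueError("not a PACKED_BIT payload")
    if not 0 <= padding <= 7:  # assumed bound, not taken from the diff
        raise ValueError("padding must be in [0, 7]")
    packed = list(payload[2:])
    # Expand each byte most significant bit first, as the diff's f"{uint8:08b}" does.
    bits = [int(bit) for byte in packed for bit in f"{byte:08b}"]
    if padding:
        bits = bits[:-padding]
    return packed, padding, bits


payload = base64.b64decode("EAN/Bw==")  # "PACKED_BIT with padding" test case
packed, padding, bits = decode_packed_bit_payload(payload)
print(packed, padding)  # [127, 7] 3
print(bits)             # [0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

The two stored bytes hold 16 bits, and a padding of 3 means only the first 13 belong to the mathematical vector.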
