Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRIVERS-2926] [PYTHON-4577] BSON Binary Vector Subtype Support #1813

Merged

Conversation

caseyclements
Copy link
Contributor

@caseyclements caseyclements commented Aug 24, 2024

Adds support for new subtype of Binary BSON subtype to be used for densely-packed 1d arrays (vectors).
Corresponding changes are in specifications repo PR #1658

bson/vector.py Outdated Show resolved Hide resolved
bson/vector.py Outdated Show resolved Hide resolved
bson/__init__.py Outdated Show resolved Hide resolved
bson/vector.py Outdated Show resolved Hide resolved
bson/binary.py Outdated Show resolved Hide resolved
bson/binary.py Outdated Show resolved Hide resolved
@caseyclements caseyclements marked this pull request as ready for review September 19, 2024 17:25
bson/binary.py Outdated Show resolved Hide resolved
doc/changelog.rst Outdated Show resolved Hide resolved
bson/binary.py Outdated
elif dtype == BinaryVectorDtype.FLOAT32:
n_bytes = len(self) - position
n_values = n_bytes // 4
assert n_bytes % 4 == 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not use assert for validation data. If the argument type is incorrect we raise a TypeError, if the type is correct but the value is invalid we raise a ValueError.

bson/binary.py Outdated
"""
if dtype == BinaryVectorDtype.INT8: # pack ints in [-128, 127] as signed int8
format_str = "b"
assert not padding, f"padding does not apply to {dtype=}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about assert.

bson/binary.py Outdated
format_str = "B"
elif dtype == BinaryVectorDtype.FLOAT32: # pack floats as float32
format_str = "f"
assert not padding, f"padding does not apply to {dtype=}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about assert.

.. versionadded:: 4.9
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to validate self.subtype == 9 here before attempting to decode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShaneHarvey How do I do that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like:

if self.subtype != VECTOR_SUBTYPE:
   raise ValueError(...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at as_uuid it should be the same as that validation:

        if self.subtype not in ALL_UUID_SUBTYPES:
            raise ValueError(f"cannot decode subtype {self.subtype} as a uuid")

.. versionadded:: 4.9
"""

INT8 = b"\x03"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An enum of bytes is a bit unusual. Can we change it to use ints? eg INT8 = 3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming convention that Geert used has meaning in both the first and last 4 bits, so I'd prefer it to stay as-is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how the interpretation of the values here matters. Are you saying this is a bit flag?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user doesn't see the value locally either. Given that the bson spec itself always uses integers I think we should use the integer form.

bson/binary.py Outdated
.. versionadded:: 4.9
"""

data: Sequence[float | int]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add __slots__ = ("data", "dtype", "padding") to reduce the memory overhead of using this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. How do I do type annotation for that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None needed, e.g.

__slots__ = ("__time", "__inc")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange. And you're cool with that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to add them manually instead of using slots=True.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class BinaryVector:
    """**(BETA)** Vector of numbers along with metadata for binary interoperability.

    :param data (Sequence[float | int]): Sequence of numbers representing the mathematical vector.
    :param dtype (:class:`bson.Binary.BinaryVectorDtype`):  The data type stored in binary
    :param padding (Optional[int] = 0): The number of bits in the final byte that are to be ignored
      when a vector element's size is less than a byte
      and the length of the vector is not a multiple of 8. Default is 0.

    .. versionadded:: 4.10
    """

    __slots__ = ("data", "dtype", "padding")

    def __init__(self, data, dtype, padding=0):
        self.data = data
        self.dtype = dtype
        self.padding = padding

Is this right? @blink1073

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't realize __slots__ with dataclass was problematic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be sorted now

.. versionadded:: 4.9
"""

INT8 = b"\x03"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how the interpretation of the values here matters. Are you saying this is a bit flag?

bson/binary.py Outdated

data: Sequence[float | int]
dtype: BinaryVectorDtype
padding: Optional[int] = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional[int] -> int, unless padding=None is considered valid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

padding=None is expected when it is not a packed type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it? I see the code using padding=0 in that case.

bson/binary.py Outdated
cls: Type[Binary],
vector: list[int, float],
dtype: BinaryVectorDtype,
padding: Optional[int] = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional[int] -> int

bson/binary.py Outdated
data = struct.pack(f"{len(vector)}{format_str}", *vector)
return cls(metadata + data, subtype=VECTOR_SUBTYPE)

def as_vector(self, uncompressed: Optional[bool] = False) -> BinaryVector:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional[bool] -> bool unless uncompressed=None is valid.

bson/binary.py Outdated
else:
bits = []
for uint8 in unpacked_uint8s:
bits.extend([int(bit) for bit in f"{uint8:08b}"])
Copy link
Member

@ShaneHarvey ShaneHarvey Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add an uncompressed option here at all? It looks like this option is irreversible because from_vector does not support the same option. Even if it did, is this option useful outside of test code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we should leave out uncompressed, it was a quality of life addition, but is not likely to be standardized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's irreversible. We don't yet provide an API to go from a full vector of zeros and ones to a packed bit vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going in, you'd be using something like numpy.packbits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's omit it since it adds unneeded complexity at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ShaneHarvey
Copy link
Member

Fixing the test suite and the Enum issue are my last remaining comments.

@blink1073
Copy link
Member

We're pressing ahead with the hex values in the Enum because they simplify the implementation of encode and decode.

@blink1073
Copy link
Member

Again, the user will only ever see BinaryVectorDtype.PACKED_BIT in their code.

Copy link
Member

@blink1073 blink1073 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@blink1073 blink1073 merged commit ae6cfd6 into mongodb:master Oct 1, 2024
26 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants