Merge pull request #56 from vmizg/pickleable
Awesome, this looks great. I'll see it gets published soon.
prashnts authored May 6, 2024
2 parents cf80f09 + 9deea41 commit 623ddbf
Showing 8 changed files with 120 additions and 25 deletions.
2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -1,6 +1,6 @@
 include CHANGELOG
 include LICENSE
 include AUTHORS
-include README.markdown
+include README.md
 recursive-include src *
 recursive-include tests *
70 changes: 55 additions & 15 deletions README.markdown → README.md
@@ -10,20 +10,19 @@ The goal of `pybloomfiltermmap3` is simple: to provide a fast, simple, scalable,
[![PyPI](https://img.shields.io/pypi/dw/pybloomfiltermmap3.svg)](https://pypi.python.org/pypi/pybloomfiltermmap3)
[![PyPI](https://img.shields.io/pypi/pyversions/pybloomfiltermmap3.svg)](https://pypi.python.org/pypi/pybloomfiltermmap3)


## Why pybloomfiltermmap3?

There are a couple of reasons to use this module:

-* It natively uses [mmaped files](http://en.wikipedia.org/wiki/Mmap).
-* It is fast (see [benchmarks](http://axiak.github.io/pybloomfiltermmap/#benchmarks)).
-* It natively does the set things you want a Bloom filter to do.

+- It natively uses [mmapped files](http://en.wikipedia.org/wiki/Mmap).
+- It is fast (see [benchmarks](http://axiak.github.io/pybloomfiltermmap/#benchmarks)).
+- It natively does the set things you want a Bloom filter to do.

## Quickstart

After you install, the interface to use is a cross between a file
interface and a set interface. As an example:

```python
>>> import pybloomfilter
>>> fruit = pybloomfilter.BloomFilter(100000, 0.1, '/tmp/words.bloom')
@@ -37,23 +36,36 @@ interface and a set interface. As an example:
```

To create an in-memory filter, simply omit the file location:

```python
>>> fruit = pybloomfilter.BloomFilter(10000, 0.1)
>>> fruit.add('apple')
>>> 'apple' in fruit
True
```

These in-memory filters can be pickled and reloaded:

```python
->>> cakes = pybloomfilter.BloomFilter(10000, 0.1)
>>> import pickle
>>> pickled_fruit = pickle.dumps(fruit)
>>> unpickled_fruit = pickle.loads(pickled_fruit)
>>> 'apple' in unpickled_fruit
True
```
-*Caveat*: it is currently not possible to persist this filter later.

+_Caveat_: it is currently not possible to persist this filter later as an mmap file.

## Docs

Current docs are available at [pybloomfiltermmap3.rtfd.io](https://pybloomfiltermmap3.readthedocs.io/en/latest).


## Install

To install:

```shell
$ pip install pybloomfiltermmap3
```bash
pip install pybloomfiltermmap3
```

and you should be set.
@@ -62,7 +74,6 @@ and you should be set.

This library is specifically meant for Python 3.5 and above. [As of 2020](https://www.python.org/doc/sunset-python-2/), we strongly advise you to switch to an actively maintained distribution of Python 3. If for any reason your current environment is restricted to Python 2, please see [pybloomfiltermmap](https://github.com/axiak/pybloomfiltermmap). Please note that the latter is not actively maintained and will lack bug fixes and new features.


## History and Future

[pybloomfiltermmap](https://github.com/axiak/pybloomfiltermmap) is an excellent Bloom filter implementation for Python 2 by [@axiak](https://github.com/axiak) and contributors. I, [@prashnts](https://github.com/prashnts), made initial changes to add support for Python 3 sometime in 2016 as the current [pybloomfiltermmap3](https://pypi.org/project/pybloomfiltermmap3/) on `PyPI`. Since then, with the help of contributors, there have been incremental improvements and bug fixes while maintaining the API from versions `0.4.x` and below.
@@ -71,18 +82,47 @@ Some new features and changes were first introduced in version `0.5.0`. From thi

Suggestions, bug reports, and / or patches are welcome!


## Contributions and development

When contributing, you should set up an appropriate Python 3 environment and install the dependencies listed in `requirements-dev.txt`.
Package installation depends on a generated `pybloomfilter.c` file, which requires the Cython module to be in your current environment.

### Environment setup

-## Maintainers
```bash
# Installs the venv and python3-dev packages
sudo apt install python3.10-venv python3-dev

# Creates a virtual env called "env"
python -m venv env

# Activates the created virtual env
source ./env/bin/activate
```

### Dependencies

```bash
python -m pip install --upgrade pip
pip install cython
```

-* [Prashant Sinha](https://github.com/prashnts)
-* [Vytautas Mizgiris](https://github.com/vmizg)
### Build

```bash
python setup.py develop
```

### Test

```bash
python setup.py test
```

## Maintainers

- [Prashant Sinha](https://github.com/prashnts)
- [Vytautas Mizgiris](https://github.com/vmizg)

## License

2 changes: 1 addition & 1 deletion setup.py
@@ -9,7 +9,7 @@
 here = os.path.dirname(__file__)

 # Get the long description from the README file
-with open(os.path.join(here, "README.markdown"), encoding="utf-8") as fp:
+with open(os.path.join(here, "README.md"), encoding="utf-8") as fp:
     long_description = fp.read()

 setup_kwargs = {}
5 changes: 4 additions & 1 deletion src/bloomfilter.c
@@ -9,7 +9,7 @@
 #include "bloomfilter.h"

 BloomFilter *bloomfilter_Create_Malloc(size_t max_num_elem, double error_rate,
-                                       BTYPE num_bits, int *hash_seeds, int num_hashes)
+                                       BTYPE num_bits, int *hash_seeds, int num_hashes, const char *data)
 {
     BloomFilter * bf = (BloomFilter *)malloc(sizeof(BloomFilter));
     MBArray * array;
@@ -35,6 +35,9 @@ BloomFilter *bloomfilter_Create_Malloc(size_t max_num_elem, double error_rate,
         bloomfilter_Destroy(bf);
         return NULL;
     }
+    if (data) {
+        memcpy(array->vector, data, num_bits / 8);
+    }

     bf->array = array;
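
Note on the `memcpy` length in the hunk above: `num_bits / 8` is integer division in C, so only whole bytes are copied, and `data` must hold at least that many bytes. A minimal Python model of that branch, as a sketch only: `copy_filter_bytes` is a hypothetical name, and the zero-filled destination buffer is an assumption, since the real allocation lives in `mbarray` and is not part of this diff.

```python
from typing import Optional


def copy_filter_bytes(num_bits: int, data: Optional[bytes]) -> bytearray:
    """Illustrative model of the new `if (data) memcpy(...)` branch."""
    n_bytes = num_bits // 8  # same truncation as C's num_bits / 8
    # Assumption: stand-in destination buffer, rounded up to whole bytes, zero-filled.
    vector = bytearray((num_bits + 7) // 8)
    if data is not None:  # a NULL pointer in C skips the copy entirely
        vector[:n_bytes] = data[:n_bytes]
    return vector


restored = copy_filter_bytes(16, b"\xff\x01")
print(restored.hex())  # ff01
```

With 20 bits, for example, only the first two bytes are copied and the last partial byte keeps whatever the buffer already held.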

2 changes: 1 addition & 1 deletion src/bloomfilter.h
@@ -28,7 +28,7 @@ typedef struct _BloomFilter BloomFilter;

 /* Create a bloom filter without a memory-mapped file backing it */
 BloomFilter *bloomfilter_Create_Malloc(size_t max_num_elem, double error_rate,
-                                       BTYPE num_bits, int *hash_seeds, int num_hashes);
+                                       BTYPE num_bits, int *hash_seeds, int num_hashes, const char * data);

 /* Create a bloom filter with a memory-mapped file backing it */
 BloomFilter *bloomfilter_Create_Mmap(size_t max_num_elem, double error_rate,
2 changes: 1 addition & 1 deletion src/cbloomfilter.pxd
@@ -48,7 +48,7 @@ cdef extern from "bloomfilter.h":
     BloomFilter * bloomfilter_Create_Malloc(long max_num_elem,
                                             double error_rate,
                                             long num_bits,
-                                            int * hash_seeds, int num_hashes)
+                                            int * hash_seeds, int num_hashes, char * data)
     void bloomfilter_Destroy(BloomFilter * bf)
     int bloomfilter_Add(BloomFilter * bf, Key * key)
     int bloomfilter_Test(BloomFilter * bf, Key * key)
48 changes: 43 additions & 5 deletions src/pybloomfilter.pyx
@@ -5,7 +5,6 @@ AUTHOR = "Michael Axiak"

__VERSION__ = VERSION


cimport cbloomfilter
cimport cpython

@@ -62,6 +61,9 @@ cdef class BloomFilter:
     :param list hash_seeds: optionally specify hash seeds to use for the
         hashing algorithm. Each hash seed must not exceed 32 bits. The number
         of hash seeds will determine the number of hashes performed.
+    :param bytes data_array: optionally specify the filter data array, same as
+        given by BloomFilter.data_array. Only valid for in-memory bloomfilters.
+        If provided, hash_seeds must be given too.
     **Note that we do not check capacity.** This is important, because
     we want to be able to support logical OR and AND (see :meth:`BloomFilter.union`
@@ -80,20 +82,35 @@ cdef class BloomFilter:
     cdef int _in_memory
     cdef int _oflags

-    def __cinit__(self, capacity, error_rate, filename=None, perm=0755, hash_seeds=None):
+    def __reduce__(self):
+        """Makes an in-memory BloomFilter pickleable."""
+        callable = BloomFilter
+        args = (self.capacity, self.error_rate, None, None, self.hash_seeds, self.data_array)
+        return (callable, args)


+    def __cinit__(self, capacity, error_rate, filename=None, perm=0755, hash_seeds=None, data_array=None):
         self._closed = 0
         self._in_memory = 0
         self._oflags = os.O_RDWR

         if capacity is NoConstruct:
             return

-        self._create(capacity, error_rate, filename, perm, hash_seeds)
+        self._create(capacity, error_rate, filename, perm, hash_seeds, data_array)


-    def _create(self, capacity, error_rate, filename=None, perm=0755, hash_seeds=None):
+    def _create(self, capacity, error_rate, filename=None, perm=0755, hash_seeds=None, data_array=None):
         cdef char * seeds
+        cdef char * data = NULL
         cdef long long num_bits

+        if data_array is not None:
+            if filename:
+                raise ValueError("data_array cannot be used for an mmapped filter.")
+            if hash_seeds is None:
+                raise ValueError("hash_seeds must be specified if a data_array is provided.")

         # Make sure that if the filename is defined, that the
         # file exists
         if filename and os.path.exists(filename):
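
Note on the `__reduce__` hook in the hunk above: pickle stores the returned `(callable, args)` pair and rebuilds the object by calling `callable(*args)` on load, which is why the constructor had to learn the `data_array` argument. The round trip can be sketched in pure Python. `ToyFilter` is a hypothetical stand-in written for this note, not the extension type; it copies only the argument order and the two `ValueError` guards from the diff.

```python
class ToyFilter:
    """Hypothetical pure-Python stand-in for an in-memory BloomFilter."""

    def __init__(self, capacity, error_rate, filename=None, perm=0o755,
                 hash_seeds=None, data_array=None):
        if data_array is not None:
            # Same two guards as the new checks in _create()
            if filename:
                raise ValueError("data_array cannot be used for an mmapped filter.")
            if hash_seeds is None:
                raise ValueError("hash_seeds must be specified if a data_array is provided.")
        self.capacity = capacity
        self.error_rate = error_rate
        self.hash_seeds = list(hash_seeds) if hash_seeds else []
        # Assumption: a fresh toy filter starts with 16 zero bytes.
        self.data_array = bytes(data_array) if data_array is not None else bytes(16)

    def __reduce__(self):
        # Positional order matches the constructor: filename and perm are None
        args = (self.capacity, self.error_rate, None, None,
                self.hash_seeds, self.data_array)
        return (ToyFilter, args)


original = ToyFilter(100, 0.1, hash_seeds=[7, 11, 13], data_array=b"\x01" * 16)

# What unpickling effectively does with the reduce value:
rebuild, args = original.__reduce__()
clone = rebuild(*args)
print(clone.data_array == original.data_array)  # True
```

Because the filename slot is always `None`, a filter rebuilt this way is in-memory, which matches the README caveat about not persisting it later as an mmap file.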
@@ -135,6 +152,10 @@ cdef class BloomFilter:
         # Minimum bit vector of 128 bits
         num_bits = max(num_hashes * bits_per_hash,128)

+        # Override calculated capacity if we are provided a data array
+        if data_array is not None:
+            num_bits = 8 * len(data_array)

         # print("k = %d m = %d n = %d p ~= %.8f" % (
         #   num_hashes, num_bits, capacity,
         #   (1.0 - math.exp(- float(num_hashes) * float(capacity) / num_bits))
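
Note on the override in the hunk above: despite the comment's wording, the value being replaced is `num_bits`, not the capacity. Taking the size from the payload means the bit count handed to the C layer always equals `8 * len(data_array)`, whatever the capacity and error rate would have produced. A small sketch of that decision, where `resolve_num_bits` is a hypothetical helper and the inputs are made-up numbers:

```python
def resolve_num_bits(num_hashes, bits_per_hash, data_array=None):
    """Illustrative version of the sizing logic shown in the hunk."""
    # Minimum bit vector of 128 bits, as in _create()
    num_bits = max(num_hashes * bits_per_hash, 128)
    if data_array is not None:
        # A restored filter takes its size from the payload itself
        num_bits = 8 * len(data_array)
    return num_bits


print(resolve_num_bits(3, 160))                        # 480
print(resolve_num_bits(3, 160, data_array=bytes(64)))  # 512
print(resolve_num_bits(1, 8))                          # 128
```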
@@ -153,18 +174,21 @@ cdef class BloomFilter:
                                                            num_hashes)
         else:
             self._in_memory = 1
+            if data_array is not None:
+                data = data_array
             self._bf = cbloomfilter.bloomfilter_Create_Malloc(capacity,
                                                        error_rate,
                                                        num_bits,
                                                        <int *>seeds,
-                                                       num_hashes)
+                                                       num_hashes, <const char *>data)
         if self._bf is NULL:
             if filename:
                 raise OSError(errno, '%s: %s' % (os.strerror(errno),
                                                  filename))
             else:
                 cpython.PyErr_NoMemory()


     def _open(self, filename, mode="rw"):
         # Should not overwrite
         mode = mode.replace("+", "")
@@ -202,6 +226,19 @@ cdef class BloomFilter:
         arr = (<char *>cbloomfilter.mbarray_CharData(self._bf.array))[start_pos:end_pos]
         return int.from_bytes(arr, byteorder="big", signed=False)

+    @property
+    def data_array(self):
+        """Bytes array of the Bloom filter contents.
+        """
+        self._assert_open()
+        start_pos = self._bf.array.preamblebytes
+        end_pos = start_pos + self._bf.array.bytes
+        arr = array.array('B')
+        arr.frombytes(
+            (<char *>cbloomfilter.mbarray_CharData(self._bf.array))[start_pos:end_pos]
+        )
+        return bytes(arr)

     @property
     def hash_seeds(self):
         """Integer seeds used for the random hashing. Returns a list of integers."""
@@ -345,6 +382,7 @@ cdef class BloomFilter:
             key.shash = item
             key.nhash = len(item)
         else:
+            # Warning! Only works reliably for objects whose hash is based on value not memory address.
             item = item_
             key.shash = NULL
             key.nhash = hash(item)
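
Note on the new warning comment: in this `else` branch the filter key comes from Python's `hash(item)`, so membership is only meaningful if equal values hash equally. A short sketch of the difference, using two hypothetical classes written for this note. In CPython the default `hash()` of a plain object is derived from its address, so a rebuilt or unpickled copy will generally not match what was added.

```python
class ByIdentity:
    """No __eq__/__hash__ override: the default hash follows object identity."""


class ByValue:
    """Hash and equality computed from the stored values."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, ByValue) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


a, b = ByIdentity(), ByIdentity()
print(hash(a) == hash(b))  # two distinct live objects: normally False in CPython

p, q = ByValue(1, 2), ByValue(1, 2)
print(hash(p) == hash(q))  # True
```

Objects like `ByValue` are the safe case for this branch; objects like `ByIdentity` would be added under one key and looked up under another.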
14 changes: 14 additions & 0 deletions tests/simpletest.py
@@ -2,6 +2,7 @@
 import string
 import unittest
 import tempfile
+import pickle
 from random import randint, choice, getrandbits

 import pybloomfilter
@@ -400,6 +401,19 @@ def test_approximate_size_after_union_called(self):
         intersection = len(bf1) + len(bf2) - len(union_bf)
         assert intersection == 11  # approximate size

+    def test_pickle(self):
+        bf = pybloomfilter.BloomFilter(100, 0.1)
+        bf.add('apple')
+        assert 'apple' in bf
+        assert 'hello' not in bf
+
+        pickled = pickle.dumps(bf)
+        unpickled = pickle.loads(pickled)
+        assert 'apple' in unpickled
+        assert 'hello' not in unpickled
+
+        # Expecting same hashing sequence
+        self.assertEqual(bf.bit_array, unpickled.bit_array)

 def suite():
     suite = unittest.TestSuite()
