Merge branch 'feature/twidi/range-indexes' into develop
twidi committed Jan 26, 2018
2 parents a9e0eb6 + f78da10 commit 08dade3
Showing 13 changed files with 2,201 additions and 208 deletions.
162 changes: 131 additions & 31 deletions doc/collections.rst
@@ -11,51 +11,44 @@ If fields are indexed, it's possible to make query to retrieve many of them, usi
The filtering has some limitations:

- you can only filter on fields with `indexable` and/or `unique` attributes set to True
- the filtering capabilities are limited and must be thought out from the beginning
- you can only filter on full values (`limpyd` doesn't provide filters like "startswith", "contains"...)
- all filters are "and"ed (see the example just below)
- no "not" (only able to find matching fields, not to exclude some)
- no "join" (filter on one model only)

The result of a call to the `collection` method is lazy. The query is only sent to Redis when the data is really needed, to display it or to do computations with it.

By default, a collection returns a list of primary keys for all the matching objects, but you can sort them, retrieve only a part, and/or directly get full instances instead of primary keys.

We will explain Filtering_, Sorting_, Slicing_, Instantiating_, Indexing_, and Laziness_ below, based on this example:

.. code:: python

    class Person(model.RedisModel):
        database = main_database
        firstname = fields.InstanceHashField(indexable=True)
        lastname = fields.InstanceHashField(indexable=True)
        nickname = fields.InstanceHashField(indexable=True, indexes=[TextRangeIndex])
        birth_year = fields.InstanceHashField(indexable=True, indexes=[NumberRangeIndex])

        def __repr__(self):
            return '<[%s] %s "%s" %s (%s)>' % tuple([self.pk.get()] + self.hmget('firstname', 'nickname', 'lastname', 'birth_year'))

    >>> Person(firstname='John', lastname='Smith', nickname='Joe', birth_year=1960)
    <[1] John "Joe" Smith (1960)>
    >>> Person(firstname='John', lastname='Doe', nickname='Jon', birth_year=1965)
    <[2] John "Jon" Doe (1965)>
    >>> Person(firstname='Emily', lastname='Smith', nickname='Emma', birth_year=1950)
    <[3] Emily "Emma" Smith (1950)>
    >>> Person(firstname='Susan', lastname='Doe', nickname='Sue', birth_year=1960)
    <[4] Susan "Sue" Doe (1960)>
Filtering
=========

@@ -108,33 +101,140 @@ Example:
>>> Person.collection(firstname='John').sort(by='lastname', alpha=True)
['2', '1']
>>> Person.collection(firstname='John').sort(by='lastname', alpha=True)[1:2]
['1']
>>> Person.collection().sort(by='birth_year')
['3', '1', '4', '2']
Instantiating
=============

If you want to retrieve already instantiated objects, instead of getting only primary keys and doing the instantiation yourself, simply call `instances()` on the result of the collection. The collection and its methods (`sort` and `instances`) return the collection itself, so you can chain calls:

.. code:: python

    >>> Person.collection(firstname='John')
    ['1', '2']
    >>> Person.collection(firstname='John').instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>]
    >>> Person.collection(firstname='John').instances().sort(by='lastname', alpha=True)
    [<[2] John "Jon" Doe (1965)>, <[1] John "Joe" Smith (1960)>]
    >>> Person.collection(firstname='John').sort(by='lastname', alpha=True).instances()
    [<[2] John "Jon" Doe (1965)>, <[1] John "Joe" Smith (1960)>]
    >>> Person.collection(firstname='John').sort(by='lastname', alpha=True).instances()[0]
    <[2] John "Jon" Doe (1965)>
Note that for each primary key retrieved from Redis, a real instance is created, with a check for the pk's existence. As this can lead to a lot of Redis calls (one per instance), if you are sure that all the primary keys really exist (which must be the case if nothing special was done), you can skip these tests by setting the `skip_exist_test` named argument to True when calling `instances`:
.. code:: python

    >>> Person.collection().instances(skip_exist_test=True)
Note that when you update an instance obtained with `skip_exist_test` set to True, the existence of the primary key will be checked before the update, and an exception will be raised if it is not found.
To cancel retrieving instances and get the default return format, call the `primary_keys` method:
.. code:: python

    >>> Person.collection().instances(skip_exist_test=True).primary_keys()
Indexing
========
By default, all fields with `indexable=True` use the default index, `EqualIndex`.
It only allows equality filtering (the only kind of filtering historically supported by `limpyd`), but it is fast.
To filter using this index, you simply pass the field and a value in the collection call:
.. code:: python

    >>> Person.collection(firstname='John').instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>]
But you can also state explicitly that you want an equality check by using the `__eq` suffix. All other indexes use different suffixes. This design is inspired by Django.
.. code:: python

    >>> Person.collection(firstname__eq='John').instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>]
If you want to do more advanced lookups on a field that contains text, you can use the `TextRangeIndex` (to import from `limpyd.indexes`), as we did for the `nickname` field.
It allows the same filtering as the default index, i.e. equality without a suffix or with the `__eq` suffix, but it is not as efficient.
So if equality filtering is all you need, do not use it.
Otherwise, you can take advantage of its capabilities, depending on the suffix you use:
- `__gt`: text "Greater Than" the given value
- `__gte`: "Greater Than or Equal"
- `__lt`: "Less Than"
- `__lte`: "Less Than or Equal"
- `__startswith`: text that starts with the given value
Texts are compared lexicographically, as seen by Redis and explained in its documentation:

    The elements are considered to be ordered from lower to higher strings as compared byte-by-byte using the memcmp() C function. Longer strings are considered greater than shorter strings if the common part is identical.
Some examples:

.. code:: python

    >>> Person.collection(nickname__startswith='Jo').instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>]
    >>> Person.collection(nickname__gte='Jo').instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>, <[4] Susan "Sue" Doe (1960)>]
    >>> Person.collection(nickname__gt='Jo').instances()
    [<[4] Susan "Sue" Doe (1960)>]
As with the normal index, you can filter several times on the same field (more than two filters on the same field doesn't really make sense):
.. code:: python

    >>> Person.collection(nickname__gte='E', nickname__lte='J').instances()
    [<[3] Emily "Emma" Smith (1950)>, <[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>]
This index works well for text but not for numbers, because lexicographically, 1000 < 11.
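
You can check this with a plain Python string comparison (a quick illustration, unrelated to limpyd itself):

.. code:: python

    >>> '1000' < '11'  # compared byte by byte: '0' < '1', so '1000' sorts first
    True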
For numbers, you can use the `NumberRangeIndex` (to import from `limpyd.indexes`).
It supports the same suffixes as `TextRangeIndex`, except `startswith`.
Some things to know about this index:

- values of a field that cannot be cast to a float are converted to 0 for indexing (the stored value doesn't change)
- negative numbers are, of course, supported
- numbers are saved as the score of a Redis sorted set, so a number is, in the index:

      represented as an IEEE 754 floating point number, that is able to represent precisely integer numbers between -(2^53) and +(2^53) included. In more practical terms, all the integers between -9007199254740992 and 9007199254740992 are perfectly representable. Larger integers, or fractions, are internally represented in exponential form, so it is possible that you get only an approximation of the decimal number, or of the very big integer.
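
A quick Python illustration of that precision limit (not limpyd-specific):

.. code:: python

    >>> float(2**53) == float(2**53 + 1)  # both round to the same IEEE 754 double
    True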
Some examples:

.. code:: python

    >>> Person.collection(birth_year__eq=1960).instances()
    [<[1] John "Joe" Smith (1960)>, <[4] Susan "Sue" Doe (1960)>]
    >>> Person.collection(birth_year__gt=1960).instances()
    [<[2] John "Jon" Doe (1965)>]
    >>> Person.collection(birth_year__gte=1960).instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>, <[4] Susan "Sue" Doe (1960)>]
    >>> Person.collection(birth_year__gt=1940, birth_year__lte=1950).instances()
    [<[3] Emily "Emma" Smith (1950)>]
And, of course, you can use fields with different indexes in the same query:
.. code:: python

    >>> Person.collection(birth_year__gte=1960, lastname='Doe', nickname__startswith='S').instances()
    [<[4] Susan "Sue" Doe (1960)>]
Laziness
========
The result of a collection is lazy. In fact, it is the collection itself, which is why we can chain calls to `sort` and `instances`.
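
For instance (a small sketch based on the `Person` data above; nothing is sent to Redis until the results are actually consumed):

.. code:: python

    >>> collection = Person.collection(lastname='Doe').sort(by='birth_year').instances()  # no query sent yet
    >>> list(collection)  # the query is executed here
    [<[4] Susan "Sue" Doe (1960)>, <[2] John "Jon" Doe (1965)>]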
115 changes: 91 additions & 24 deletions limpyd/collection.py
@@ -1,12 +1,17 @@
# -*- coding:utf-8 -*-
from __future__ import unicode_literals
from future.builtins import object
from collections import namedtuple


from limpyd.utils import unique_key
from limpyd.exceptions import *
from limpyd.fields import MultiValuesField


# A filter parsed by `_parse_filter_key`: the index to use for the lookup, the
# lookup suffix (e.g. 'gte'), extra key parts for multi-part fields, and the value.
ParsedFilter = namedtuple('ParsedFilter', ['index', 'suffix', 'extra_field_parts', 'value'])


class CollectionManager(object):
"""
Retrieve objects collection, optionally slice and order it.
@@ -23,10 +28,12 @@ class CollectionManager(object):
    Slicing a collection will force a sort.
    """

    _accepted_key_types = {'set'}  # Type of keys indexes are allowed to return

    def __init__(self, cls):
        self.cls = cls
        self._lazy_collection = {  # Store infos to make the requested collection
            'sets': [],  # store sets to use (we'll intersect them)
            'pks': set(),  # store special filter on pk
        }
        self._instances = False  # True when instances are asked
@@ -229,7 +236,31 @@ def _prepare_sets(self, sets):
        Must return a tuple with a set of redis set keys, and another with
        new temporary keys to drop at the end of _get_final_set
        """

        final_sets = set()
        tmp_keys = set()

        for set_ in sets:
            if isinstance(set_, str):
                # a raw redis key: use it as is
                final_sets.add(set_)
            elif isinstance(set_, ParsedFilter):
                # ask the index to compute (or retrieve) the key holding the
                # matching pks; it may be a temporary key to drop afterwards
                index_key, key_type, is_tmp = set_.index.get_filtered_key(
                    set_.suffix,
                    accepted_key_types=self._accepted_key_types,
                    *(set_.extra_field_parts + [set_.value])
                )
                if key_type not in self._accepted_key_types:
                    raise ValueError('The index key returned by the index %s is not valid' % (
                        set_.index.__class__.__name__
                    ))
                final_sets.add(index_key)
                if is_tmp:
                    tmp_keys.add(index_key)
            else:
                raise ValueError('Invalid filter type')

        return final_sets, tmp_keys

    def _get_final_set(self, sets, pk, sort_options):
        """
@@ -295,34 +326,70 @@ def _combine_sets(self, sets, final_set):
    def __call__(self, **filters):
        return self._add_filters(**filters)

    def _field_is_pk(self, field_name):
        """Check if the given name is the pk field, suffixed or not with "__eq" """
        if self.cls._field_is_pk(field_name):
            return True
        if field_name.endswith('__eq') and self.cls._field_is_pk(field_name[:-4]):
            return True
        return False

    def _parse_filter_key(self, key):
        # Each key can have an optional subpath.
        # We pass it as args to the field, which is responsible
        # for handling them.
        # We only manage here the suffix handled by a filter.

        key_path = key.split('__')
        field_name = key_path.pop(0)
        field = self.cls.get_field(field_name)

        if not field.indexable:
            raise ImplementationError(
                'Field %s.%s is not indexable' % (
                    field._model.__name__, field.name
                )
            )

        other_field_parts = key_path[:field._field_parts - 1]

        if len(other_field_parts) + 1 != field._field_parts:
            raise ImplementationError(
                'Unexpected number of parts in filter %s for field %s.%s' % (
                    key, field._model.__name__, field.name
                )
            )

        rest = key_path[field._field_parts - 1:]
        index_suffix = None if not rest else '__'.join(rest)
        index_to_use = None
        for index in field._indexes:
            if index.can_handle_suffix(index_suffix):
                index_to_use = index
                break

        if not index_to_use:
            raise ImplementationError(
                'No index found to manage filter %s for field %s.%s' % (
                    key, field._model.__name__, field.name
                )
            )

        return index_to_use, index_suffix, other_field_parts

    def _add_filters(self, **filters):
        """Define self._lazy_collection according to filters."""
        for key, value in filters.items():
            if self._field_is_pk(key):
                pk = self.cls.get_field('pk').normalize(value)
                self._lazy_collection['pks'].add(pk)
            else:
                # store the info to call the index later, in ``_prepare_sets``
                # (to avoid doing extra work if the collection is never called)
                index, suffix, extra_field_parts = self._parse_filter_key(key)
                parsed_filter = ParsedFilter(index, suffix, extra_field_parts, value)
                self._lazy_collection['sets'].append(parsed_filter)

        return self

    def __len__(self):