Merge branch 'feature/twidi/range-indexes' into develop
twidi committed Jan 26, 2018
2 parents a9e0eb6 + f78da10 commit 08dade3
Showing 13 changed files with 2,201 additions and 208 deletions.
162 changes: 131 additions & 31 deletions doc/collections.rst
@@ -11,51 +11,44 @@ If fields are indexed, it's possible to make query to retrieve many of them, usi
The filtering has some limitations:

- you can only filter on fields with `indexable` and/or `unique` attributes set to True
- the filtering capabilities are limited and must be thought out from the beginning
- you can only filter on full values (`limpyd` doesn't provide filters like "startswith", "contains"...)
- all filters are "and"ed (see the example just below)
- no "not" (only able to find matching fields, not to exclude some)
- no "join" (filter on one model only)

The result of a call to the `collection` method is lazy. The query is only sent to Redis when the data is really needed, to display it or to do computations with it.

By default, a collection returns a list of primary keys for all the matching objects, but you can sort them, retrieve only a part, and/or directly get full instances instead of primary keys.

We will explain Filtering_, Sorting_, Slicing_, Instantiating_, Indexing_, and Laziness_ below, based on this example:

.. code:: python

    class Person(model.RedisModel):
        database = main_database
        firstname = fields.InstanceHashField(indexable=True)
        lastname = fields.InstanceHashField(indexable=True)
        nickname = fields.InstanceHashField(indexable=True, indexes=[TextRangeIndex])
        birth_year = fields.InstanceHashField(indexable=True, indexes=[NumberRangeIndex])

        def __repr__(self):
            return '<[%s] %s "%s" %s (%s)>' % tuple([self.pk.get()] + self.hmget('firstname', 'nickname', 'lastname', 'birth_year'))

    >>> Person(firstname='John', lastname='Smith', nickname='Joe', birth_year=1960)
    <[1] John "Joe" Smith (1960)>
    >>> Person(firstname='John', lastname='Doe', nickname='Jon', birth_year=1965)
    <[2] John "Jon" Doe (1965)>
    >>> Person(firstname='Emily', lastname='Smith', nickname='Emma', birth_year=1950)
    <[3] Emily "Emma" Smith (1950)>
    >>> Person(firstname='Susan', lastname='Doe', nickname='Sue', birth_year=1960)
    <[4] Susan "Sue" Doe (1960)>
Filtering
=========

@@ -108,33 +101,140 @@ Example:
>>> Person.collection(firstname='John').sort(by='lastname', alpha=True)
['2', '1']
>>> Person.collection(firstname='John').sort(by='lastname', alpha=True)[1:2]
['1']
>>> Person.collection().sort(by='birth_year')
['3', '1', '4', '2']
Instantiating
=============

If you want to retrieve already instantiated objects, instead of getting only primary keys and doing the instantiation yourself, simply call `instances()` on the result of the collection. The collection and its methods (`sort` and `instances`) return the collection itself, so you can chain calls:

.. code:: python

    >>> Person.collection(firstname='John')
    ['1', '2']
    >>> Person.collection(firstname='John').instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>]
    >>> Person.collection(firstname='John').instances().sort(by='lastname', alpha=True)
    [<[2] John "Jon" Doe (1965)>, <[1] John "Joe" Smith (1960)>]
    >>> Person.collection(firstname='John').sort(by='lastname', alpha=True).instances()
    [<[2] John "Jon" Doe (1965)>, <[1] John "Joe" Smith (1960)>]
    >>> Person.collection(firstname='John').sort(by='lastname', alpha=True).instances()[0]
    <[2] John "Jon" Doe (1965)>
Note that for each primary key retrieved from Redis, a real instance is created, with a check for the pk's existence. As this can lead to a lot of Redis calls (one per instance), if you are sure that all the primary keys really exist (which must be the case if nothing special was done), you can skip these tests by setting the `skip_exist_test` named argument to True when calling `instances`:
.. code:: python

    >>> Person.collection().instances(skip_exist_test=True)
Note that when you update an instance obtained with `skip_exist_test` set to True, the existence of the primary key will be checked before the update, and an exception will be raised if it is not found.
To cancel retrieving instances and get the default return format, call the `primary_keys` method:
.. code:: python

    >>> Person.collection().instances(skip_exist_test=True).primary_keys()
Indexing
========
By default, all fields with `indexable=True` use the default index, `EqualIndex`.
It only allows equality filtering (the only kind of filtering historically supported by `limpyd`), but it is fast.
To filter using this index, you simply pass the field and a value in the collection call:
.. code:: python

    >>> Person.collection(firstname='John').instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>]
But you can also state explicitly that you want an equality check by using the `__eq` suffix. All other indexes use different suffixes. This design is inspired by Django.
.. code:: python

    >>> Person.collection(firstname__eq='John').instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>]
If you want to do more advanced lookups on a field that contains text, you can use the `TextRangeIndex` (to import from `limpyd.indexes`), as we did for the `nickname` field.
It allows the same filtering as the default index, i.e. equality without a suffix or with the `__eq` suffix, but it is not as efficient.
So if equality filtering is all you need, do not use it.
Otherwise, you can take advantage of its capabilities, depending on the suffix you use:
- `__gt`: text "Greater Than" the given value
- `__gte`: "Greater Than or Equal"
- `__lt`: "Less Than"
- `__lte`: "Less Than or Equal"
- `__startswith`: text that starts with the given value
Texts are compared lexicographically, as seen by Redis and explained in its documentation:

    The elements are considered to be ordered from lower to higher strings as compared byte-by-byte using the memcmp() C function. Longer strings are considered greater than shorter strings if the common part is identical.
Some examples:

.. code:: python

    >>> Person.collection(nickname__startswith='Jo').instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>]
    >>> Person.collection(nickname__gte='Jo').instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>, <[4] Susan "Sue" Doe (1960)>]
    >>> Person.collection(nickname__gt='Jo').instances()
    [<[4] Susan "Sue" Doe (1960)>]
As with the normal index, you can filter several times on the same field (more than two filters on the same field doesn't really make sense):
.. code:: python

    >>> Person.collection(nickname__gte='E', nickname__lte='J').instances()
    [<[3] Emily "Emma" Smith (1950)>, <[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>]
This index works well for text but not for numbers, because lexicographically, 1000 < 11.
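
You can check this with a plain Python string comparison (a quick illustration, unrelated to limpyd itself):

.. code:: python

    >>> '1000' < '11'  # compared byte by byte: '0' < '1', so '1000' sorts first
    True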
For numbers, you can use the `NumberRangeIndex` (to import from `limpyd.indexes`).
It supports the same suffixes as `TextRangeIndex`, except `startswith`.
Some things to know about this index:

- values of a field that cannot be cast to a float are converted to 0 for indexing (the stored value doesn't change)
- negative numbers are, of course, supported
- numbers are saved as the score of a Redis sorted set, so a number is, in the index:

      represented as an IEEE 754 floating point number, that is able to represent precisely integer numbers between -(2^53) and +(2^53) included. In more practical terms, all the integers between -9007199254740992 and 9007199254740992 are perfectly representable. Larger integers, or fractions, are internally represented in exponential form, so it is possible that you get only an approximation of the decimal number, or of the very big integer.
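
A quick Python illustration of that precision limit (not limpyd-specific):

.. code:: python

    >>> float(2**53) == float(2**53 + 1)  # both round to the same IEEE 754 double
    True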
Some examples:

.. code:: python

    >>> Person.collection(birth_year__eq=1960).instances()
    [<[1] John "Joe" Smith (1960)>, <[4] Susan "Sue" Doe (1960)>]
    >>> Person.collection(birth_year__gt=1960).instances()
    [<[2] John "Jon" Doe (1965)>]
    >>> Person.collection(birth_year__gte=1960).instances()
    [<[1] John "Joe" Smith (1960)>, <[2] John "Jon" Doe (1965)>, <[4] Susan "Sue" Doe (1960)>]
    >>> Person.collection(birth_year__gt=1940, birth_year__lte=1950).instances()
    [<[3] Emily "Emma" Smith (1950)>]
And, of course, you can use fields with different indexes in the same query:
.. code:: python

    >>> Person.collection(birth_year__gte=1960, lastname='Doe', nickname__startswith='S').instances()
    [<[4] Susan "Sue" Doe (1960)>]
Laziness
========
The result of a collection is lazy. In fact, it is the collection itself, which is why we can chain calls to `sort` and `instances`.
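
For instance (a small sketch based on the `Person` data above; nothing is sent to Redis until the results are actually consumed):

.. code:: python

    >>> collection = Person.collection(lastname='Doe').sort(by='birth_year').instances()  # no query sent yet
    >>> list(collection)  # the query is executed here
    [<[4] Susan "Sue" Doe (1960)>, <[2] John "Jon" Doe (1965)>]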
115 changes: 91 additions & 24 deletions limpyd/collection.py
@@ -1,12 +1,17 @@
# -*- coding:utf-8 -*-
from __future__ import unicode_literals
from future.builtins import object
from collections import namedtuple


from limpyd.utils import unique_key
from limpyd.exceptions import *
from limpyd.fields import MultiValuesField


# A filter parsed by `_parse_filter_key`: the index to use for the lookup, the
# lookup suffix (e.g. 'gte'), extra key parts for multi-part fields, and the value.
ParsedFilter = namedtuple('ParsedFilter', ['index', 'suffix', 'extra_field_parts', 'value'])


class CollectionManager(object):
"""
Retrieve objects collection, optionally slice and order it.
@@ -23,10 +28,12 @@ class CollectionManager(object):
    Slicing a collection will force a sort.
    """

    _accepted_key_types = {'set'}  # Type of keys indexes are allowed to return

    def __init__(self, cls):
        self.cls = cls
        self._lazy_collection = {  # Store infos to make the requested collection
            'sets': [],  # store sets to use (we'll intersect them)
            'pks': set(),  # store special filter on pk
        }
        self._instances = False  # True when instances are asked
@@ -229,7 +236,31 @@ def _prepare_sets(self, sets):
        Must return a tuple with a set of redis set keys, and another with
        new temporary keys to drop at the end of _get_final_set
        """

        final_sets = set()
        tmp_keys = set()

        for set_ in sets:
            if isinstance(set_, str):
                # a raw redis key: use it as is
                final_sets.add(set_)
            elif isinstance(set_, ParsedFilter):
                # ask the index to compute (or retrieve) the key holding the
                # matching pks; it may be a temporary key to drop afterwards
                index_key, key_type, is_tmp = set_.index.get_filtered_key(
                    set_.suffix,
                    accepted_key_types=self._accepted_key_types,
                    *(set_.extra_field_parts + [set_.value])
                )
                if key_type not in self._accepted_key_types:
                    raise ValueError('The index key returned by the index %s is not valid' % (
                        set_.index.__class__.__name__
                    ))
                final_sets.add(index_key)
                if is_tmp:
                    tmp_keys.add(index_key)
            else:
                raise ValueError('Invalid filter type')

        return final_sets, tmp_keys

    def _get_final_set(self, sets, pk, sort_options):
        """
@@ -295,34 +326,70 @@ def _combine_sets(self, sets, final_set):
    def __call__(self, **filters):
        return self._add_filters(**filters)

    def _field_is_pk(self, field_name):
        """Check if the given name is the pk field, suffixed or not with "__eq" """
        if self.cls._field_is_pk(field_name):
            return True
        if field_name.endswith('__eq') and self.cls._field_is_pk(field_name[:-4]):
            return True
        return False

    def _parse_filter_key(self, key):
        # Each key can have an optional subpath.
        # We pass it as args to the field, which is responsible
        # for handling them.
        # We only manage here the suffix handled by a filter.

        key_path = key.split('__')
        field_name = key_path.pop(0)
        field = self.cls.get_field(field_name)

        if not field.indexable:
            raise ImplementationError(
                'Field %s.%s is not indexable' % (
                    field._model.__name__, field.name
                )
            )

        other_field_parts = key_path[:field._field_parts - 1]

        if len(other_field_parts) + 1 != field._field_parts:
            raise ImplementationError(
                'Unexpected number of parts in filter %s for field %s.%s' % (
                    key, field._model.__name__, field.name
                )
            )

        rest = key_path[field._field_parts - 1:]
        index_suffix = None if not rest else '__'.join(rest)
        index_to_use = None
        for index in field._indexes:
            if index.can_handle_suffix(index_suffix):
                index_to_use = index
                break

        if not index_to_use:
            raise ImplementationError(
                'No index found to manage filter %s for field %s.%s' % (
                    key, field._model.__name__, field.name
                )
            )

        return index_to_use, index_suffix, other_field_parts

    def _add_filters(self, **filters):
        """Define self._lazy_collection according to filters."""
        for key, value in filters.items():
            if self._field_is_pk(key):
                pk = self.cls.get_field('pk').normalize(value)
                self._lazy_collection['pks'].add(pk)
            else:
                # store the info to call the index later, in ``_prepare_sets``
                # (to avoid doing extra work if the collection is never called)
                index, suffix, extra_field_parts = self._parse_filter_key(key)
                parsed_filter = ParsedFilter(index, suffix, extra_field_parts, value)
                self._lazy_collection['sets'].append(parsed_filter)

        return self

    def __len__(self):