Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Counterintuitive Behaviors of contains Filter Operator over Arrays in PostgreSQL Backend #6618

Open
2 tasks done
rabbull opened this issue Nov 19, 2024 · 2 comments
Open
2 tasks done

Comments

@rabbull
Copy link
Contributor

rabbull commented Nov 19, 2024

Describe the bug

The contains operator on the PostgreSQL backend lacks comprehensive documentation. To better understand its exact behavior, I conducted additional tests in #6617. However, some of the observ
/ed behaviors are counterintuitive, and these are detailed below:

1. Non-Existent attr_key

import pytest
from aiida.orm import Dict, QueryBuilder

@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.requires_psql
def test():
    Dict({
        'arr': [114, 514]
    }).store()

    qb = QueryBuilder().append(Dict, filters={
        'attributes.oops': {'contains': []},
    })
    print(len(qb.all()))  # prints 0

    qb = QueryBuilder().append(Dict, filters={
        'attributes.oops': {'!contains': []},
    })
    print(len(qb.all()))  # also prints 0

In a test where the attributes column does not contain a key named oops, the query executes successfully. However, the results are confusing: neither the affirmation nor the negation of the contains operation matches the entry. This behavior is unexpected and counterintuitive. What best fits my expectation is to fail loudly, e.g. to raise an Exception.

2. Nested Array

This is a known issue discussed in a previous thread on discourse.

@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.requires_psql
def test():
    Dict({
        'arr': [[1, 2], [3]]
    }).store()

    qb = QueryBuilder().append(Dict, filters={
        'attributes.arr': {'contains': [[4]]},
    })
    assert len(qb.all()) == 0  # OK

    qb = QueryBuilder().append(Dict, filters={
        'attributes.arr': {'contains': [[2]]},
    })
    assert len(qb.all()) == 0  # AssertionError: assert 1 == 0

When testing with nested arrays, the contains operation unexpectedly matches entries even when the contained elements do not strictly align with the expected structure. For example, a query attempting to match [2] against an array [[1, 2], [3]] may return a match, even though [2] is not directly an element of the array.

Note that this actually complies with PostgreSQL's native JSONB containment semantics:

postgres=# select '[[1, 2], [3]]'::jsonb @> '[[2]]'::jsonb;
 ?column?
----------
 t
(1 row)

postgres=# select '[[1, 2], [3]]'::jsonb @> '[[4]]'::jsonb;
 ?column?
----------
 f
(1 row)

However, this behavior contracts with the intution quite a lot, and consequently makes the abstraction hard to be understood well.

In addition, it should also be nice to mention in the documentation that contains doesn't care the order of arrays. See example below:

@pytest.mark.usefixtures('aiida_profile_clean')
@pytest.mark.requires_psql
def test():
    Dict({
        'arr': [[1, 2], [3]]
    }).store()

    qb = QueryBuilder().append(Dict, filters={
        'attributes.arr': {'contains': [[2, 1]]},
    })
    assert len(qb.all()) == 1  # OK

Steps to reproduce

See above.

Expected behavior

See above.

Your environment

  • Operating system [e.g. Linux]: Linux 6.11.6-arch1-1
  • Python version [e.g. 3.7.1]: 3.11.10
  • aiida-core version [e.g. 1.2.1]: main (779cc29)

Other relevant software versions, e.g. Postres & RabbitMQ

  • PostgreSQL version: PostgreSQL 16.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 14.2.1 20240805, 64-bit

Additional context

This might be issues of SQLAlchemy and need further investigation.

@rabbull
Copy link
Contributor Author

rabbull commented Nov 21, 2024

When implementing this operator on the SQLite backend, I gained a clearer understanding of its semantics, especially in the context of nested arrays. It's overly simplistic to treat this operation as a purely mathematical "is-subset" or "contains" comparator. Instead, if we conceptualize JSON objects as trees, this contains operator effectively checks whether the operand is a connected subgraph of the tree.

For instance, consider the JSON arrays [[1, 2], 3] and [[2]]. If we visualize them as trees (as shown below):

  • [[1, 2], 3]:

lhs

  • [[2]]:

rhs

It becomes evident that [[2]] is a sub-tree of [[1, 2], 3]. Consequently, the operation [[1, 2], 3] @> [[2]] returns true.

In contrast, if you examine [2]:

  • [2]:

image

You’ll notice it lacks an intermediate "ARRAY" node, so [[1, 2], 3] do not contains [2].

This becomes even more apparent in cases like [[3]] and other examples, so I would prefer not to elaborate too far.

This tree-based perspective also applies to JSON objects (dictionaries in Python). For instance:

  • {"data": {"name": "Alice", "gender": "female"}, "time_modified": "1732231289"}:
    image

  • {"data": {"gender": "female"}}:
    image

Here, {"data": {"gender": "female"}} is a sub-graph of the larger JSON object. Accordingly, PostgreSQL evaluates {"data": {"name": "Alice", "gender": "female"}, "time_modified": "1732231289"} @> {"data": {"gender": "female"}} as true.

Additionally, this tree-based interpretation clarifies why the order of elements within an array doesn’t matter: the structure is what’s being compared, not the sequence.

Another noteworthy property of this containment operator is that duplicate elements are ignored. By combining these two characteristics, the behavior of the operator becomes more intuitive and easier to understand.

cc @GeigerJ2

@rabbull rabbull removed the type/bug label Nov 22, 2024
@GeigerJ2
Copy link
Contributor

Thanks a lot for this deep-dive and taking the time to write down the explanation here, @rabbull! That actually makes a lot of sense. Feel free to add a pointer to this issue to the discourse thread you linked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants