Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement has_key for SQLite #6552

Open
Tracked by #6435
giovannipizzi opened this issue Jul 25, 2024 · 5 comments
Open
Tracked by #6435

Reimplement has_key for SQLite #6552

giovannipizzi opened this issue Jul 25, 2024 · 5 comments

Comments

@giovannipizzi
Copy link
Member

In #6256, a bug was identified in the previous implementation of the has_key for SQLite. The bug is described there.
To avoid returning wrong results, the feature has been disabled. We should re-enable it.

@giovannipizzi giovannipizzi added the type/feature request status undecided label Jul 25, 2024
@giovannipizzi
Copy link
Member Author

giovannipizzi commented Jul 25, 2024

Here I explain the issue - I describe a bit more as I go through my thought process, that might be relevant for the implementation of the contains operator (issue to be opened yet) - see #6494.

Old problem

Let's consider this DB (I'm using SQLite Browser)
Screenshot 2024-07-25 at 10 12 41

Let's first understand how json_each works in SQLite:

SELECT test_1.id, test_1.extras, anon_1.key, anon_1.value
FROM test AS test_1, json_each(JSON_QUOTE(JSON_EXTRACT(test_1.extras, '$'))) AS anon_1
WHERE json_type(test_1.extras, '$') = 'object'
Screenshot 2024-07-25 at 10 15 20

(here I'm just filtering on the first three rows that contain an object (i.e. a dict).

json_each has more columns, but for now I just focus on the key and value. Essentially, this is implicitly joining the two tables, and returning now not a row per entry in the test table, but a row per key in the top-level dictionary (in the extras column).

Therefore, the old query was simply filtering on these if there was a key with the correct value, something like

SELECT test_1.id, test_1.extras, anon_1.key, anon_1.value
FROM test AS test_1, json_each(JSON_QUOTE(JSON_EXTRACT(test_1.extras, '$'))) AS anon_1
WHERE anon_1.key = 'b'
Screenshot 2024-07-25 at 10 18 23

The problem was the negation: now, the negation is applied to the whole filter! So this would be

SELECT test_1.id, test_1.extras, anon_1.key, anon_1.value
FROM test AS test_1, json_each(JSON_QUOTE(JSON_EXTRACT(test_1.extras, '$'))) AS anon_1
WHERE anon_1.key != 'b'
Screenshot 2024-07-25 at 10 18 58

Indeed, this would return a result for every key-value pair where the key is not b, explaining #6256. This is clearly not what we want.

@giovannipizzi
Copy link
Member Author

First attempt to a solution

The first attempt to a solution I did is with an aggregation sum and a group_by on the ID, essentially counting all rows per entry (corresponding to a DbNode in our case). And filtering only on those that have exactly 1 match:

SELECT 
SUM(anon_1.key = 'b') as count_keys, test_1.id, test_1.extras
FROM test AS test_1, json_each(test_1.extras, '$') AS anon_1
GROUP BY test_1.id
HAVING  count_keys = 1
Screenshot 2024-07-25 at 10 22 05

This is correct, and now also the negation works!

SELECT 
SUM(anon_1.key = 'b') as count_keys, test_1.id, test_1.extras
FROM test AS test_1, json_each(test_1.extras, '$') AS anon_1
GROUP BY test_1.id
HAVING  count_keys != 1
Screenshot 2024-07-25 at 10 22 30

Note that it correctly shows lists as the has_key should only be applied to dictionaries. This is not yet the correct query as it does not show integers, strings, etc.. However, there is a more technical subtle detail: the way QueryBuilder is modularized and implemented in SQLAlchemy, we just want to add a filter, and this is passed up through 3-4 function calls, and the actual query is constructed in a very different function. Replacing the whole query with SUM, GROUP_BY, ... is very difficult (and I'm not even sure how this works, even in pure SQL, if we have many filters on various nodes).

@giovannipizzi
Copy link
Member Author

giovannipizzi commented Jul 25, 2024

The solution

The final solution I found is to use simply this query, using the exists statement of SQL - this does not require a GROUP BY and not even explicit joining!

select test_1.* from test as test_1
WHERE EXISTS (
  SELECT * from json_each(test_1.extras, '$') AS anon_1 
  WHERE anon_1.key = 'b'
)
Screenshot 2024-07-25 at 10 27 52

The negation becomes not exists, and it actually works pretty well, also on strings, ints, ... without special CASE statements!

select test_1.* from test as test_1
WHERE NOT EXISTS (
  SELECT * from json_each(test_1.extras, '$') AS anon_1 
  WHERE anon_1.key = 'b'
)
Screenshot 2024-07-25 at 10 28 48

@giovannipizzi
Copy link
Member Author

giovannipizzi commented Jul 25, 2024

Implementing in AiiDA (SQLAlchemy)

The implementation is now pretty straightforward - this seems to be doing the job:

        if operator == 'has_key':
            from sqlalchemy import select
            return select(database_entity).where(func.json_each(database_entity).table_valued('key', joins_implicitly=True).c.key == value).exists()               

How to test it

Here is an extended test from the one of @mbercx in #6256:

from aiida import orm, load_profile
from aiida.storage.sqlite_temp import SqliteTempBackend
temp_profile = SqliteTempBackend.create_profile('temp-profile')
load_profile(temp_profile, allow_switch=True)

node = orm.Int(1).store()
node.base.extras.set_many({'sub': {'a': 1, 'b': 2, 'c': 3}})
node = orm.Int(2).store()
node.base.extras.set_many({'sub': {'b': 2, 'c': 3}})
node = orm.Int(3).store()
node.base.extras.set_many({'sub': {'a': 1, 'b': 2}})
node = orm.Int(4).store()
node.base.extras.set_many({'sub': "a"})
node = orm.Int(5).store()
node.base.extras.set_many({'sub': 1})
node = orm.Int(6).store()
node.base.extras.set_many({'sub': ["a", "b"]})
node = orm.Int(7).store()
node.base.extras.set_many({'sub': ["c", "d"]})


qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"!has_key": "a"}}

qbuild.append(
    orm.Int,
    project=['id', 'attributes.value'],
    filters=filters,
)
print(qbuild.all())
print(qbuild.as_sql(True))

That prints

[[2, 2], [4, 4], [5, 5], [6, 6], [7, 7]]
SELECT db_dbnode_1.id, JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.attributes, '$."value"')) AS anon_1 
FROM db_dbnode AS db_dbnode_1 
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'data.core.int.%' ESCAPE '\' AND NOT (EXISTS (SELECT JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.extras, '$."sub"')) AS anon_2 
FROM json_each(JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.extras, '$."sub"'))) AS anon_3 
WHERE anon_3."key" = 'a'))

while replacing without negation: filters = {"extras.sub": {"has_key": "a"}}
returns

[[1, 1], [3, 3]]
SELECT db_dbnode_1.id, JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.attributes, '$."value"')) AS anon_1 
FROM db_dbnode AS db_dbnode_1 
WHERE CAST(db_dbnode_1.node_type AS VARCHAR) LIKE 'data.core.int.%' ESCAPE '\' AND (EXISTS (SELECT JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.extras, '$."sub"')) AS anon_2 
FROM json_each(JSON_QUOTE(JSON_EXTRACT(db_dbnode_1.extras, '$."sub"'))) AS anon_3 
WHERE anon_3."key" = 'a'))

This seems the correct result

@giovannipizzi
Copy link
Member Author

Conclusion

This provides the implementation that needs to be put in AiiDA. I didn't test it much more, and unfortunately I won't have time until the second half of August - so I'm putting this here in case someone who has time wants to implement it and add some tests (pinging @sphuber and @danielhollas as they were involved earlier, in addition to @mbercx)

Also pinging @khsrali @eimrek @GeigerJ2 and @agoscinski as this is relevant for the support of Materials Cloud REST APIs directly via SQLite backends, and also for later benchmarking of performance.

Note on tests

I would add many tests, querying directly extras, some sub key of extras, in various forms, both dicts and lists, ints, floats, strings, bools, none - to make sure the queries work, and also that the result is identical between SQLite and PSQL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants