Performance #216

skyrod-vactai · 2023-12-21T17:29:01Z

Hi All,
What an awesome project! I'm considering using it in my application so I'm testing it out.

What I'm considering is a graph database of attributes where there are no real "entities", just shared attributes. So, to that end the attributes relation is just [a_attr, a_val, b_attr, b_val] and that forms the edge between two attribute/value pairs.

I downloaded a freely available db of movies to test with. It contains about 70k records, each with about 10 attributes, so the relation has about 700k rows. I modeled this particular data with all the other attributes (cast, director, release_year, etc.) linked to the title of the movie. I cleaned the data a bit, imported it into Cozo and started querying. It works great, but it's quite a bit slower than I expected. Here's an example query:

Movies directed by Steven Spielberg

linked[a, v, oa, ov] := *attributes[a, v, oa, ov] or *attributes[oa, ov, a, v]
?[title] := linked['title', title, 'director', 'Steven Spielberg']

| Catch Me If You Can                                |
| Indiana Jones and the Kingdom of the Crystal Skull |
| Indiana Jones and the Last Crusade                 |
| Indiana Jones and the Raiders of the Lost Ark      |
| Indiana Jones and the Temple of Doom               |
| Jaws                                               |
| Lincoln                                            |
| Schindler's List                                   |
| The Adventures of Tintin                           |
| The BFG                                            |
| War Horse                                          |

The query takes about 300ms, which is disappointing because the database I'm planning to use with my app is around 100x larger and presumably would take way too long to query.

This is on a new laptop with a good CPU and SSD, using the SQLite backend.

I was expecting an index of all the rows by attribute, and another by value, which should make lookups like this quite fast: look up v=Spielberg and ov=Spielberg, there are only 11 of these, and that's the final result. Of course it needs to be filtered to return only those rows where Spielberg is the director (and not some other attribute), but that should be very fast.

What can I do to optimize the performance? Is there a way to find out what the bottleneck is here?

skyrod-vactai · 2023-12-21T20:49:53Z

I notice the performance is what I expect, if I don't use the or clause. However, I'm not sure how to better represent "undirected edges" - users don't necessarily know whether an attribute will be the 'a' or 'b' attribute in the relation, which is why that clause is in place - to check both possible positions. I think I could add both directions to the dataset, which would probably still be fairly fast to query, but that would double the storage requirements. I don't quite understand why checking both directions in the query is such a big drain on performance:

Without or (and getting the position correct in the query): "took":0.00141488

With the or clause (so that I can query without caring which position the attribute is in): "took":0.225032194

That's about 159x longer.

Edit: Experimenting further, I think the issue is really more about the order of the columns in the relation. Changing the order to put the attributes first (which tend to be fixed), seems to help some of the queries, even with the or clause.
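To illustrate why column order can matter this much, here is a minimal Python sketch. It assumes a store that keeps rows sorted by their key columns in declared order, so that only the leading bound columns can narrow a lookup. The `scan` helper and the toy rows are hypothetical and say nothing about how Cozo's planner actually handles the `or` clause.

```python
from bisect import bisect_left

def scan(sorted_rows, pattern):
    """Match `pattern` (None = unbound column) against key-sorted rows.

    Returns (matching_rows, rows_examined).
    """
    # Only the leading run of bound columns can be used as a search prefix.
    prefix = []
    for value in pattern:
        if value is None:
            break
        prefix.append(value)
    prefix = tuple(prefix)

    # Binary search to the first row that could start with the prefix.
    i = bisect_left(sorted_rows, prefix)
    matches, examined = [], 0
    while i < len(sorted_rows) and sorted_rows[i][:len(prefix)] == prefix:
        row = sorted_rows[i]
        examined += 1
        # Columns bound after the first unbound one are checked row by row.
        if all(p is None or p == c for p, c in zip(pattern, row)):
            matches.append(row)
        i += 1
    return matches, examined

# Toy rows in (a_attr, a_val, b_attr, b_val) order; not the real dataset.
rows = sorted([
    ("director", "Ridley Scott", "title", "Alien"),
    ("director", "Steven Spielberg", "title", "Jaws"),
    ("director", "Steven Spielberg", "title", "Lincoln"),
    ("release_year", "1975", "title", "Jaws"),
    ("release_year", "1979", "title", "Alien"),
    ("release_year", "2012", "title", "Lincoln"),
])

# Leading columns bound: the scan touches only the matching range.
leading_matches, leading_examined = scan(
    rows, ("director", "Steven Spielberg", "title", None))

# Only trailing columns bound: every row has to be examined.
trailing_matches, trailing_examined = scan(rows, (None, None, "title", "Jaws"))

print(leading_examined, trailing_examined)
```

In this toy model both lookups return two rows, but the first examines two rows and the second examines all six. On 700k rows that gap would be far larger, which is consistent with the observation that putting the usually-fixed columns first helps.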

Edit2: I see the bit about indices here: https://docs.cozodb.org/en/latest/stored.html#indices - it does still double the storage requirements, which means this isn't really an "index" as much as it is a re-arranged copy of the entire dataset (a book's index isn't as long as the whole book, it's just a list of pointers to the original data). Still, a handy feature.

I also think I was mistaken in extrapolating that increasing my db size by 100x might make query times too long. The attribute names in my test db are far less diverse than they will be in the prod db, so specifying an attribute in the prod db will narrow the candidates a lot more. So I will proceed with more involved tests!


aramallo Jun 11, 2024

> Edit2: I see the bit about indices here: https://docs.cozodb.org/en/latest/stored.html#indices - it does still double the storage requirements, which means this isn't really an "index" as much as it is a re-arranged copy of the entire dataset (a book's index isn't as long as the whole book, it's just a list of pointers to the original data). Still, a handy feature

Actually, it is a real index 😊 It's a type of index called a covering index, because it contains the data itself (usually, and in this case, in a different collation order than the original relation) and not just pointers to the data.
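The difference between the two kinds of index can be sketched in a few lines of Python. This is only an illustration with toy rows: `pointer_index`, `covering_index` and both lookup helpers are hypothetical names, not anything from Cozo.

```python
from bisect import bisect_left

# Base relation in (a_attr, a_val, b_attr, b_val) order; toy rows only.
base = [
    ("director", "Steven Spielberg", "title", "Jaws"),
    ("director", "Steven Spielberg", "title", "Lincoln"),
    ("release_year", "1975", "title", "Jaws"),
]

# Pointer-style index (the "book index"): maps (b_attr, b_val) to row
# positions in `base`. Small, but every hit needs a second hop into `base`.
pointer_index = {}
for pos, (a_attr, a_val, b_attr, b_val) in enumerate(base):
    pointer_index.setdefault((b_attr, b_val), []).append(pos)

# Covering index: the same rows re-sorted with the b columns first.
# As large as `base`, but a lookup can be answered from the index alone.
covering_index = sorted(
    (b_attr, b_val, a_attr, a_val) for a_attr, a_val, b_attr, b_val in base
)

def lookup_via_pointers(b_attr, b_val):
    """Find (a_attr, a_val) pairs by following positions back into `base`."""
    return [base[pos][:2] for pos in pointer_index.get((b_attr, b_val), [])]

def lookup_via_covering(b_attr, b_val):
    """Find (a_attr, a_val) pairs with one range scan over the index."""
    prefix = (b_attr, b_val)
    i = bisect_left(covering_index, prefix)
    found = []
    while i < len(covering_index) and covering_index[i][:2] == prefix:
        found.append(covering_index[i][2:])
        i += 1
    return found

print(lookup_via_pointers("title", "Jaws"))
print(lookup_via_covering("title", "Jaws"))
```

Both lookups return the same pairs. The covering index pays for that with one stored row per base row, which matches the doubled storage noted in the quoted edit.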
