Is there a configuration option to prevent the database from stumbling over entities with large lists? #1690
-
While inserting entities that contain a relatively large list of maps, I'm seeing XTDB allocate a lot of memory and take ages to finish indexing (at 100% CPU usage, too). Is there a way I can tell XTDB to not bother indexing the list? Or do you see any other ways to mitigate such an issue?

The simplest case to reproduce the issue is:

```clojure
;; assumes (:require [crux.api :as crux]); (db-node) is an application
;; helper that returns the started node
(def doc {:name "Max"
          :hours-slept (map (fn [_]
                              {:date "2022-01-14"
                               :hours 6.3})
                            (range 5500))})

(defn- uuid [] (java.util.UUID/randomUUID))

(doseq [_ (range (* 50 1000))]
  (crux/submit-tx (db-node) [[:crux.tx/put (merge doc {:crux.db/id (uuid)})]]))
```

Versions used:

```clojure
[pro.juxt.crux/crux-core "1.18.1"]
[pro.juxt.crux/crux-lucene "1.18.1"]
[pro.juxt.crux/crux-rocksdb "1.18.1"]
```

And the configuration I'm using is:

```clojure
{:rocksdb {:crux/module 'crux.rocksdb/->kv-store
           :db-dir (u/join-path store-data-dir "db-dir-1")}
 :crux.lucene/lucene-store {:db-dir (u/join-path store-data-dir "lucene-dir")}
 :crux/tx-log {:kv-store :rocksdb}
 :crux/document-store {:kv-store :rocksdb}
 :crux/index-store {:kv-store :rocksdb}}
```

Working on updating the app to use the latest release to see if this issue persists.
Replies: 2 comments
-
I've created a new sample application using the latest version of XTDB. If anyone has any ideas or pointers on what one can do to improve this, that'd be amazing.
-
Hi @philippkueng - thanks for the question and for providing a very clear repro!

For the index-store in general, 'no', although Lucene indexing can be configured (via a custom indexer) to ignore certain attributes; see the example in xtdb/modules/lucene/test/xtdb/lucene/extension_test.clj (line 210 at 087b5c7).
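For illustration, here is a minimal sketch of such a filtering indexer. The `LuceneIndexer` protocol (`index!`/`evict!`) and the `:indexer` module key are based on that test; the namespace, record name, and the assumption that `docs` arrives as a map of id to document are mine and should be checked against the linked code:

```clojure
;; Hypothetical sketch, not verified against the 1.18.1 API: a wrapper
;; indexer that drops chosen attributes before delegating to another indexer.
(ns my.app.lucene
  (:require [crux.lucene :as l]))

(defrecord FilteringIndexer [ignored-attrs wrapped]
  l/LuceneIndexer
  (index! [_ index-writer docs]
    ;; assumed shape: docs is a map of id -> document
    (l/index! wrapped index-writer
              (into {}
                    (map (fn [[id doc]] [id (apply dissoc doc ignored-attrs)]))
                    docs)))
  (evict! [_ index-writer eids]
    (l/evict! wrapped index-writer eids)))

;; Wiring it in would look roughly like (module key names are assumptions):
;; {:crux.lucene/lucene-store
;;  {:db-dir "lucene-dir"
;;   :indexer {:crux/module 'my.app.lucene/->filtering-indexer}}}
;; where ->filtering-indexer constructs a FilteringIndexer with
;; :hours-slept in ignored-attrs and the default indexer as wrapped.
```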
You can often avoid some encoding overheads (during indexing, at least) by pushing the list into a nested structure, since top-level values (including elements of vectors and lists) get encoded individually, e.g.:

```clojure
(def doc {:name "Max"
          :hours-slept {:nested-val (map (fn [_]
                                           {:date "2022-01-14"
                                            :hours 6.3})
                                         (range 5500))}})
```

...however this doesn't seem to help much in this scenario.

Another mitigation strategy is to serialize the data into userspace byte buffers, which shifts the burden of encoding/decoding away from XT entirely, e.g. using Nippy (which is what is used internally):

```clojure
(def doc {:name "Max"
          :hours-slept (juxt.clojars-mirrors.nippy.v3v1v1.taoensso.nippy/fast-freeze
                        (map (fn [_]
                               {:date "2022-01-14"
                                :hours 6.3})
                             (range 5500)))})
```
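For completeness, reading such a document back requires thawing the blob in application code; a small usage sketch, reusing the bundled Nippy namespace from above (`some-id` is a placeholder for a real `:crux.db/id`):

```clojure
(require '[juxt.clojars-mirrors.nippy.v3v1v1.taoensso.nippy :as nippy])

;; fetch the entity and thaw the frozen list back into Clojure data
(let [doc (crux/entity (crux/db (db-node)) some-id)]
  (nippy/fast-thaw (:hours-slept doc)))
```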
The Nippy approach looks to have a very large positive impact. You could perhaps consider storing these blobs across many smaller, separate entities and joining them when needed; a rough sketch follows below.
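As an illustration of that idea, with made-up attribute names (`:sleep/parent`, `:sleep/idx`, `:sleep/blob`), an arbitrary chunk size, and the `uuid`/`(db-node)` helpers from the question:

```clojure
;; Hypothetical sketch: freeze smaller sub-lists into separate child
;; entities keyed back to a parent entity, then reassemble on read.
(require '[juxt.clojars-mirrors.nippy.v3v1v1.taoensso.nippy :as nippy])

(def parent-id (uuid))

(let [entries (map (fn [_] {:date "2022-01-14" :hours 6.3}) (range 5500))]
  (crux/submit-tx
   (db-node)
   (into [[:crux.tx/put {:crux.db/id parent-id, :name "Max"}]]
         (map-indexed (fn [idx chunk]
                        [:crux.tx/put {:crux.db/id (uuid)
                                       :sleep/parent parent-id
                                       :sleep/idx idx
                                       :sleep/blob (nippy/fast-freeze chunk)}])
                      (partition-all 1000 entries)))))

;; joining the chunks back together when needed
(let [db (crux/db (db-node))]
  (->> (crux/q db
               '{:find [?e ?idx]
                 :in [?parent]
                 :where [[?e :sleep/parent ?parent]
                         [?e :sleep/idx ?idx]]}
               parent-id)
       (sort-by second)
       (mapcat (fn [[e _]]
                 (nippy/fast-thaw (:sleep/blob (crux/entity db e)))))))
```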
Note that there are still hashing costs (during submit and indexing) for byte buffers, but this is much less expensive than encoding. You can also perhaps help the situation by disabling or adjusting Lucene's indexing (e.g. via the custom indexer mentioned above).

Longer term, I suspect there are a few possibilities for internal changes that we could make, for instance using multiple threads to perform the encoding work in parallel; currently, both the submission and indexing pipelines are single-threaded. I have just opened #1692 to track proposals (and planned work) in this area.