Is there a configuration option to prevent the database from stumbling over entities with large lists? #1690
-
While inserting entities that contain a relatively large list of maps, I'm seeing XTDB allocate a lot of memory and take ages to finish indexing (at 100% CPU usage, too). Is there a way I can tell XTDB to not bother indexing the list? Or do you see any other ways to mitigate such an issue?

The simplest case to reproduce the issue is:

```clojure
;; assumes (:require [crux.api :as crux]); (db-node) is an application
;; helper that returns the started node
(def doc {:name "Max"
          :hours-slept (map (fn [_]
                              {:date "2022-01-14"
                               :hours 6.3})
                            (range 5500))})

(defn- uuid [] (java.util.UUID/randomUUID))

(doseq [_ (range (* 50 1000))]
  (crux/submit-tx (db-node) [[:crux.tx/put (merge doc {:crux.db/id (uuid)})]]))
```

Versions used:

```clojure
[pro.juxt.crux/crux-core "1.18.1"]
[pro.juxt.crux/crux-lucene "1.18.1"]
[pro.juxt.crux/crux-rocksdb "1.18.1"]
```

And the configuration I'm using is:

```clojure
{:rocksdb {:crux/module 'crux.rocksdb/->kv-store
           :db-dir (u/join-path store-data-dir "db-dir-1")}
 :crux.lucene/lucene-store {:db-dir (u/join-path store-data-dir "lucene-dir")}
 :crux/tx-log {:kv-store :rocksdb}
 :crux/document-store {:kv-store :rocksdb}
 :crux/index-store {:kv-store :rocksdb}}
```

Working on updating the app to use the latest release to see if this issue persists.
Replies: 2 comments
-
I've created a new sample application using the latest version of XTDB. If anyone has any ideas or pointers on what one can do to improve this, that'd be amazing.
-
Hi @philippkueng - thanks for the question and for providing a very clear repro!

For the index-store in general, 'no', although Lucene indexing can be configured (via a custom indexer) to ignore certain attributes; see the example in xtdb/modules/lucene/test/xtdb/lucene/extension_test.clj (line 210 at 087b5c7).
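For illustration, here is a minimal sketch of such a filtering indexer. The `LuceneIndexer` protocol (`index!`/`evict!`) and the `:indexer` module key are based on that test; the namespace, record name, and the assumption that `docs` arrives as a map of id to document are mine and should be checked against the linked code:

```clojure
;; Hypothetical sketch, not verified against the 1.18.1 API: a wrapper
;; indexer that drops chosen attributes before delegating to another indexer.
(ns my.app.lucene
  (:require [crux.lucene :as l]))

(defrecord FilteringIndexer [ignored-attrs wrapped]
  l/LuceneIndexer
  (index! [_ index-writer docs]
    ;; assumed shape: docs is a map of id -> document
    (l/index! wrapped index-writer
              (into {}
                    (map (fn [[id doc]] [id (apply dissoc doc ignored-attrs)]))
                    docs)))
  (evict! [_ index-writer eids]
    (l/evict! wrapped index-writer eids)))

;; Wiring it in would look roughly like (module key names are assumptions):
;; {:crux.lucene/lucene-store
;;  {:db-dir "lucene-dir"
;;   :indexer {:crux/module 'my.app.lucene/->filtering-indexer}}}
;; where ->filtering-indexer constructs a FilteringIndexer with
;; :hours-slept in ignored-attrs and the default indexer as wrapped.
```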
You can often avoid some encoding overheads (during indexing, at least) by pushing the list into a nested structure, since top-level values (including elements of vectors and lists) get encoded individually, e.g.:

```clojure
(def doc {:name "Max"
          :hours-slept {:nested-val (map (fn [_]
                                           {:date "2022-01-14"
                                            :hours 6.3})
                                         (range 5500))}})
```

...however this doesn't seem to help much in this scenario.

Another mitigation strategy is to serialize the data into userspace byte buffers, which shifts the burden of encoding/decoding away from XT entirely, e.g. using Nippy (which is what is used internally):

```clojure
(def doc {:name "Max"
          :hours-slept (juxt.clojars-mirrors.nippy.v3v1v1.taoensso.nippy/fast-freeze
                        (map (fn [_]
                               {:date "2022-01-14"
                                :hours 6.3})
                             (range 5500)))})
```
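For completeness, reading such a document back requires thawing the blob in application code; a small usage sketch, reusing the bundled Nippy namespace from above (`some-id` is a placeholder for a real `:crux.db/id`):

```clojure
(require '[juxt.clojars-mirrors.nippy.v3v1v1.taoensso.nippy :as nippy])

;; fetch the entity and thaw the frozen list back into Clojure data
(let [doc (crux/entity (crux/db (db-node)) some-id)]
  (nippy/fast-thaw (:hours-slept doc)))
```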
The Nippy approach looks to have a very large positive impact. You could perhaps consider storing these blobs across many smaller, separate entities and joining them when needed; a rough sketch follows below.
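As an illustration of that idea, with made-up attribute names (`:sleep/parent`, `:sleep/idx`, `:sleep/blob`), an arbitrary chunk size, and the `uuid`/`(db-node)` helpers from the question:

```clojure
;; Hypothetical sketch: freeze smaller sub-lists into separate child
;; entities keyed back to a parent entity, then reassemble on read.
(require '[juxt.clojars-mirrors.nippy.v3v1v1.taoensso.nippy :as nippy])

(def parent-id (uuid))

(let [entries (map (fn [_] {:date "2022-01-14" :hours 6.3}) (range 5500))]
  (crux/submit-tx
   (db-node)
   (into [[:crux.tx/put {:crux.db/id parent-id, :name "Max"}]]
         (map-indexed (fn [idx chunk]
                        [:crux.tx/put {:crux.db/id (uuid)
                                       :sleep/parent parent-id
                                       :sleep/idx idx
                                       :sleep/blob (nippy/fast-freeze chunk)}])
                      (partition-all 1000 entries)))))

;; joining the chunks back together when needed
(let [db (crux/db (db-node))]
  (->> (crux/q db
               '{:find [?e ?idx]
                 :in [?parent]
                 :where [[?e :sleep/parent ?parent]
                         [?e :sleep/idx ?idx]]}
               parent-id)
       (sort-by second)
       (mapcat (fn [[e _]]
                 (nippy/fast-thaw (:sleep/blob (crux/entity db e)))))))
```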
Note that there are still hashing costs (during submit and indexing) for byte buffers, but this is much less expensive than encoding. You can also perhaps help the situation by disabling or adjusting Lucene's indexing (e.g. via the custom indexer mentioned above).

Longer term, I suspect there are a few possibilities for internal changes that we could make, for instance using multiple threads to perform the encoding work in parallel; currently, both the submission and indexing pipelines are single-threaded. I have just opened #1692 to track proposals (and planned work) in this area.