Versioning and garbage collector for persistent-sorted-set backends. (#232)

* Add pss garbage collector implementation and tests.

* Add gc files.

* Track all written DB snapshots and walk along them for GC.

* Improve and simplify gc. Add versioning fns.

* Add gc test coverage and complete first implementation.

* Add documentation.

* Improve versioning API, always store commits directly.

* Ensure that transactor thread is not out of sync.

* Layout tests better. Factorize update-and-flush-db better.

* Misnaming.

Co-authored-by: Judith <[email protected]>
whilo and jsmassa authored Nov 18, 2022
1 parent 6c1468f commit 7db474f
Showing 20 changed files with 710 additions and 112 deletions.
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -115,13 +115,15 @@ Refer to the docs for more information:

- [backend development](./doc/backend-development.md)
- [benchmarking](./doc/benchmarking.md)
- [garbage collection](./doc/gc.md)
- [contributing to Datahike](./doc/contributing.md)
- [configuration](./doc/config.md)
- [differences to Datomic](./doc/datomic_differences.md)
- [entity spec](./doc/entity_spec.md)
- [logging and error handling](./doc/logging_and_error_handling.md)
- [schema flexibility](./doc/schema.md)
- [time variance](./doc/time_variance.md)
- [versioning](./doc/versioning.md)


For simple examples have a look at the projects in the `examples` folder.
53 changes: 53 additions & 0 deletions doc/gc.md
@@ -0,0 +1,53 @@
# Garbage collection

**This is an experimental feature. Please try it out in a test environment and provide feedback.**

Datahike uses persistent data structures for its memory. In a persistent memory
model each update to the database efficiently creates a new copy that shares
most of its structure with the previous version. A good explanation of how
persistent data structures share structure and are updated can be found
[here](https://hypirion.com/musings/understanding-persistent-vector-pt-1). This
means that you can always access all old versions of a Datahike database, and
Datahike provides a distinct set of historical query capabilities, both in the
form of its [historical indices](./time_variance.md), which support queries over
the full history, and in the form of its
[git-like versioning functionality](./versioning.md) for individual snapshots. A
potential downside of the latter is that all old versions of your database are
kept around, so storage requirements grow with usage. To remove old versions of
a Datahike database you can apply garbage collection.

Once no process reads from a database version anymore, it can be considered
garbage. To remove such versions you can use the garbage collector. You can run
it on a database `db` as follows:

~~~clojure
(require '[datahike.experimental.gc :refer [gc!]]
         '[superv.async :refer [<?? S]])

(<?? S (gc! db))
~~~

This will garbage collect any branches you might have created and deleted by
now, but it will not delete old db values (snapshots) that are still in the
store. The call returns the set of all deleted storage blobs. You can also run
the collector concurrently in the background by removing the blocking operator
`<??`. It requires no coordination, operates on metadata only, and should not
slow down the transactor or other readers.

Next, let's assume that you do not want to keep old data around much longer and
want to invalidate all readers that still access trees older than the current
`db`:

~~~clojure
(let [now (java.util.Date.)]
  (gc! db now)) ;; collect all versions unreachable before `now`
~~~

Datahike provides open, uncoordinated and scalable read access to the indices.
Be aware that long-running processes might still need access to old versions,
and pick a conservative grace period in such cases. The garbage collector makes
sure that any value that was accessible within the provided time window stays
accessible.
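A conservative grace period of, say, one day can be expressed by passing a
cutoff date in the past, following the `gc!` call shown above (a sketch; the
24-hour window is an illustrative choice):

~~~clojure
;; collect only versions that were already unreachable one day ago,
;; giving long-running readers a 24h grace period
(let [grace-ms (* 24 60 60 1000)
      cutoff   (java.util.Date. (- (System/currentTimeMillis) grace-ms))]
  (<?? S (gc! db cutoff)))
~~~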

The garbage collector is tested, but there might still be problems, so please
reach out if you experience any!
2 changes: 1 addition & 1 deletion doc/time_variance.md
@@ -42,7 +42,7 @@ system. Use `db` for this view. The following example shows a simple interaction
{:db/ident :age
:db/valueType :db.type/long
:db/cardinality :db.cardinality/one}])

(def cfg {:store {:backend :mem :id "current-db"} :initial-tx schema})

;; create our temporal database
71 changes: 71 additions & 0 deletions doc/versioning.md
@@ -0,0 +1,71 @@
# Versioning

**This is an experimental feature. Please try it out in a test environment and provide feedback.**

Since Datahike has a persistent memory model it can be used similarly to
[git](https://git-scm.com/). While using databases with different underlying
stores is the general way to combine data in Datahike and should be preferred
for separate data sets, in cases where you want to evolve a single database the
structural sharing of its indices has unique advantages. Git is efficient and
fast because it does not need to copy shared data on each operation. So in cases
where you want to evolve a database with new data, but don't want to write it
directly into the main database, you can `branch!` and evolve a copy of the
database that behaves like the main branch under `:db`. After you have evolved
the database you can decide what data to retain and then `merge!` it back. You
can also take any in-memory DB value and dump it into a durable branch with
`force-branch!`. To inspect the write history use `branch-history`.

Consider the following example:

~~~clojure
(require '[superv.async :refer [<?? S]]
         '[datahike.api :as d]
         '[datahike.experimental.versioning
           :refer [branch! branch-history delete-branch! force-branch! merge!
                   branch-as-db commit-as-db commit-id parent-commit-ids]])

(let [cfg    {:store {:backend :file
                      :path "/tmp/dh-versioning-test"}
              :keep-history? true
              :schema-flexibility :write
              :index :datahike.index/persistent-set}
      conn   (do
               (d/delete-database cfg)
               (d/create-database cfg)
               (d/connect cfg))
      schema [{:db/ident :age
               :db/cardinality :db.cardinality/one
               :db/valueType :db.type/long}]
      _      (d/transact conn schema)
      store  (:store @conn)]
  (branch! conn :db :foo) ;; new branch :foo; copies the branch root, no new commit
  (let [foo-conn (d/connect (assoc cfg :branch :foo))] ;; connect to it
    (d/transact foo-conn [{:age 42}]) ;; transact some data
    ;; extract data from :foo by query
    ;; ...
    ;; and decide to merge it into :db
    (merge! conn #{:foo} [{:age 42}]))
  (count (parent-commit-ids @conn)) ;; => 2, as :db got merged from :foo and :db
  ;; check that the commit stored is the same db as conn
  (= (commit-as-db store (commit-id @conn)) (branch-as-db store :db) @conn) ;; => true
  (count (<?? S (branch-history conn))) ;; => 4 commits now on both branches
  (force-branch! @conn :foo2 #{:foo}) ;; persist an in-memory DB value as durable branch :foo2
  (delete-branch! conn :foo))
~~~

Here we create a database as usual, but then we create a branch `:foo`, write to
it and merge it back. A simple query that extracts, in transactable form, all
data that is in a `branch1` db but not in `branch2`:

~~~clojure
(d/q '[:find ?db-add ?e ?a ?v ?t
       :in $ $2 ?db-add
       :where
       [$ ?e ?a ?v ?t]
       [(not= :db/txInstant ?a)]
       (not [$2 ?e ?a ?v ?t])]
     branch1 branch2 :db/add)
~~~

but you might want to be more selective when creating the data for `merge!`. We
are very interested in what you are planning to do with this functionality, so
please reach out if you have ideas or experience problems!
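As a sketch of being more selective, you could filter the diff down to the
attributes you care about before handing it to `merge!` (the `:age` attribute,
the `:foo` branch and the db names here are illustrative, not part of the API):

~~~clojure
(let [diff    (d/q '[:find ?e ?a ?v ?t
                     :in $ $2
                     :where
                     [$ ?e ?a ?v ?t]
                     [(not= :db/txInstant ?a)]
                     (not [$2 ?e ?a ?v ?t])]
                   branch1-db branch2-db)
      tx-data (for [[e a v t] diff
                    :when (= a :age)] ;; keep only the attributes you care about
                [:db/add e a v t])]
  (merge! conn #{:foo} (vec tx-data)))
~~~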
5 changes: 5 additions & 0 deletions src/datahike/config.cljc
@@ -19,6 +19,7 @@
(s/def ::search-cache-size nat-int?)
(s/def ::store-cache-size pos-int?)
(s/def ::crypto-hash? boolean?)
(s/def ::branch keyword?)
(s/def ::entity (s/or :map associative? :vec vector?))
(s/def ::initial-tx (s/nilable (s/or :data (s/coll-of ::entity) :path string?)))
(s/def ::name string?)
@@ -44,6 +45,7 @@
::crypto-hash?
::initial-tx
::name
::branch
::middleware]))

(s/def :deprecated/schema-on-read boolean?)
@@ -76,6 +78,7 @@
:initial-tx initial-tx
:schema-flexibility (if (true? schema-on-read) :read :write)
:crypto-hash? false
:branch :db
:search-cache-size default-search-cache-size
:store-cache-size default-store-cache-size})

@@ -118,6 +121,7 @@
:search-cache-size default-search-cache-size
:store-cache-size default-store-cache-size
:crypto-hash? false
:branch :db
:index-config (di/default-index-config default-index)})

(defn remove-nils
@@ -154,6 +158,7 @@
:schema-flexibility (keyword (:datahike-schema-flexibility env :write))
:index index
:crypto-hash? false
:branch :db
:search-cache-size (int-from-env :datahike-search-cache-size default-search-cache-size)
:store-cache-size (int-from-env :datahike-store-cache-size default-store-cache-size)
:index-config (if-let [index-config (map-from-env :datahike-index-config nil)]
