Versioning and garbage collector for persistent-sorted-set backends. (#232)

* Add pss garbage collector implementation and tests.

* Add gc files.

* Track all written DB snapshots and walk along them for GC.

* Improve and simplify gc. Add versioning fns.

* Add gc test coverage and complete first implementation.

* Add documentation.

* Improve versioning API, always store commits directly.

* Ensure that transactor thread is not out of sync.

* Layout tests better. Factorize update-and-flush-db better.

* Misnaming.

Co-authored-by: Judith <[email protected]>
whilo and jsmassa authored Nov 18, 2022
1 parent 6c1468f commit 7db474f
Showing 20 changed files with 710 additions and 112 deletions.
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -115,13 +115,15 @@ Refer to the docs for more information:

- [backend development](./doc/backend-development.md)
- [benchmarking](./doc/benchmarking.md)
- [garbage collection](./doc/gc.md)
- [contributing to Datahike](./doc/contributing.md)
- [configuration](./doc/config.md)
- [differences to Datomic](./doc/datomic_differences.md)
- [entity spec](./doc/entity_spec.md)
- [logging and error handling](./doc/logging_and_error_handling.md)
- [schema flexibility](./doc/schema.md)
- [time variance](./doc/time_variance.md)
- [versioning](./doc/versioning.md)


For simple examples have a look at the projects in the `examples` folder.
53 changes: 53 additions & 0 deletions doc/gc.md
@@ -0,0 +1,53 @@
# Garbage collection

**This is an experimental feature. Please try it out in a test environment and provide feedback.**

Datahike uses persistent data structures for its memory. In a persistent memory
model each update to the database efficiently creates a new copy that shares
most of its structure with the previous version. A good explanation of how
persistent data structures share structure and are updated can be found
[here](https://hypirion.com/musings/understanding-persistent-vector-pt-1). This
means that you can always access all old versions of a Datahike database, and
Datahike provides a distinct set of historical query capabilities, both in the
form of its [historical indices](./time_variance.md), which support queries over
the full history, and in the form of its
[git-like versioning functionality](./versioning.md) for individual snapshots. A
potential downside of the latter is that all old versions of your database are
kept around, so storage requirements grow with usage. To remove old versions of
a Datahike database you can apply garbage collection.

Once no process reads from a database version anymore, it can be considered
garbage. To remove such versions you can use the garbage collector. You can run
it on a database `db` as follows:

~~~clojure
(require '[datahike.experimental.gc :refer [gc!]]
         '[superv.async :refer [<?? S]])

(<?? S (gc! db))
~~~

This will garbage collect any branches you might have created and deleted by
now, but it will not delete old db values (snapshots) that are still in the
store. The call returns the set of all deleted storage blobs. You can also run
the collector concurrently in the background by removing the blocking operator
`<??`. It requires no coordination, operates on metadata only, and should not
slow down the transactor or other readers.

Next, let's assume that you do not want to keep old data around much longer and
want to invalidate all readers that still access trees older than the current
`db`:

~~~clojure
(let [now (java.util.Date.)]
  (gc! db now)) ;; collect all versions unreachable before `now`
~~~

Datahike provides open, uncoordinated and scalable read access to the indices.
Be aware that long-running processes might still need access to old versions,
and pick a conservative grace period in such cases. The garbage collector makes
sure that any value that was accessible within the provided time window stays
accessible.
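A conservative grace period of, say, one day can be expressed by passing a
cutoff date in the past, following the `gc!` call shown above (a sketch; the
24-hour window is an illustrative choice):

~~~clojure
;; collect only versions that were already unreachable one day ago,
;; giving long-running readers a 24h grace period
(let [grace-ms (* 24 60 60 1000)
      cutoff   (java.util.Date. (- (System/currentTimeMillis) grace-ms))]
  (<?? S (gc! db cutoff)))
~~~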

The garbage collector is tested, but there might still be problems, so please
reach out if you experience any!
2 changes: 1 addition & 1 deletion doc/time_variance.md
@@ -42,7 +42,7 @@ system. Use `db` for this view. The following example shows a simple interaction
{:db/ident :age
:db/valueType :db.type/long
:db/cardinality :db.cardinality/one}])

(def cfg {:store {:backend :mem :id "current-db"} :initial-tx schema})

;; create our temporal database
71 changes: 71 additions & 0 deletions doc/versioning.md
@@ -0,0 +1,71 @@
# Versioning

**This is an experimental feature. Please try it out in a test environment and provide feedback.**

Since Datahike has a persistent memory model it can be used similarly to
[git](https://git-scm.com/). While using databases with different underlying
stores is the general way to combine data in Datahike and should be preferred
for separate data sets, in cases where you want to evolve a single database the
structural sharing of its indices has unique advantages. Git is efficient and
fast because it does not need to copy shared data on each operation. So in cases
where you want to evolve a database with new data, but don't want to write it
directly into the main database, you can `branch!` and evolve a copy of the
database that behaves like the main branch under `:db`. After you have evolved
the database you can decide what data to retain and then `merge!` it back. You
can also take any in-memory DB value and dump it into a durable branch with
`force-branch!`. To inspect the write history use `branch-history`.

Consider the following example:

~~~clojure
(require '[superv.async :refer [<?? S]]
         '[datahike.api :as d]
         '[datahike.experimental.versioning
           :refer [branch! branch-history delete-branch! force-branch! merge!
                   branch-as-db commit-as-db commit-id parent-commit-ids]])

(let [cfg    {:store {:backend :file
                      :path "/tmp/dh-versioning-test"}
              :keep-history? true
              :schema-flexibility :write
              :index :datahike.index/persistent-set}
      conn   (do
               (d/delete-database cfg)
               (d/create-database cfg)
               (d/connect cfg))
      schema [{:db/ident :age
               :db/cardinality :db.cardinality/one
               :db/valueType :db.type/long}]
      _      (d/transact conn schema)
      store  (:store @conn)]
  (branch! conn :db :foo) ;; new branch :foo; copies the branch root, no new commit
  (let [foo-conn (d/connect (assoc cfg :branch :foo))] ;; connect to it
    (d/transact foo-conn [{:age 42}]) ;; transact some data
    ;; extract data from :foo by query
    ;; ...
    ;; and decide to merge it into :db
    (merge! conn #{:foo} [{:age 42}]))
  (count (parent-commit-ids @conn)) ;; => 2, as :db got merged from :foo and :db
  ;; check that the commit stored is the same db as conn
  (= (commit-as-db store (commit-id @conn)) (branch-as-db store :db) @conn) ;; => true
  (count (<?? S (branch-history conn))) ;; => 4 commits now on both branches
  (force-branch! @conn :foo2 #{:foo}) ;; persist an in-memory DB value as durable branch :foo2
  (delete-branch! conn :foo))
~~~

Here we create a database as usual, but then we create a branch `:foo`, write to
it and merge it back. A simple query that extracts, in transactable form, all
data that is in a `branch1` db but not in `branch2`:

~~~clojure
(d/q '[:find ?db-add ?e ?a ?v ?t
       :in $ $2 ?db-add
       :where
       [$ ?e ?a ?v ?t]
       [(not= :db/txInstant ?a)]
       (not [$2 ?e ?a ?v ?t])]
     branch1 branch2 :db/add)
~~~

but you might want to be more selective when creating the data for `merge!`. We
are very interested in what you are planning to do with this functionality, so
please reach out if you have ideas or experience problems!
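As a sketch of being more selective, you could filter the diff down to the
attributes you care about before handing it to `merge!` (the `:age` attribute,
the `:foo` branch and the db names here are illustrative, not part of the API):

~~~clojure
(let [diff    (d/q '[:find ?e ?a ?v ?t
                     :in $ $2
                     :where
                     [$ ?e ?a ?v ?t]
                     [(not= :db/txInstant ?a)]
                     (not [$2 ?e ?a ?v ?t])]
                   branch1-db branch2-db)
      tx-data (for [[e a v t] diff
                    :when (= a :age)] ;; keep only the attributes you care about
                [:db/add e a v t])]
  (merge! conn #{:foo} (vec tx-data)))
~~~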
5 changes: 5 additions & 0 deletions src/datahike/config.cljc
@@ -19,6 +19,7 @@
(s/def ::search-cache-size nat-int?)
(s/def ::store-cache-size pos-int?)
(s/def ::crypto-hash? boolean?)
(s/def ::branch keyword?)
(s/def ::entity (s/or :map associative? :vec vector?))
(s/def ::initial-tx (s/nilable (s/or :data (s/coll-of ::entity) :path string?)))
(s/def ::name string?)
@@ -44,6 +45,7 @@
::crypto-hash?
::initial-tx
::name
::branch
::middleware]))

(s/def :deprecated/schema-on-read boolean?)
@@ -76,6 +78,7 @@
:initial-tx initial-tx
:schema-flexibility (if (true? schema-on-read) :read :write)
:crypto-hash? false
:branch :db
:search-cache-size default-search-cache-size
:store-cache-size default-store-cache-size})

@@ -118,6 +121,7 @@
:search-cache-size default-search-cache-size
:store-cache-size default-store-cache-size
:crypto-hash? false
:branch :db
:index-config (di/default-index-config default-index)})

(defn remove-nils
@@ -154,6 +158,7 @@
:schema-flexibility (keyword (:datahike-schema-flexibility env :write))
:index index
:crypto-hash? false
:branch :db
:search-cache-size (int-from-env :datahike-search-cache-size default-search-cache-size)
:store-cache-size (int-from-env :datahike-store-cache-size default-store-cache-size)
:index-config (if-let [index-config (map-from-env :datahike-index-config nil)]
