Versioning and garbage collector for persistent-sorted-set backends. (#232)

* Add pss garbage collector implementation and tests.
* Add gc files.
* Track all written DB snapshots and walk along them for GC.
* Improve and simplify gc. Add versioning fns.
* Add gc test coverage and complete first implementation.
* Add documentation.
* Improve versioning API, always store commits directly.
* Ensure that transactor thread is not out of sync.
* Layout tests better. Factorize update-and-flush-db better.
* Misnaming.

Co-authored-by: Judith <[email protected]>
Showing 20 changed files with 710 additions and 112 deletions.
@@ -0,0 +1,53 @@
# Garbage collection

**This is an experimental feature. Please try it out in a test environment and provide feedback.**

Datahike uses persistent data structures to update its memory. In a persistent
memory model each update efficiently creates a new copy of the database that
shares unchanged structure with its predecessor. A good explanation of how
persistent data structures share structure and are updated can be found
[here](https://hypirion.com/musings/understanding-persistent-vector-pt-1). As a
consequence you can always access all old versions of a Datahike database, and
Datahike provides a distinct set of historical query abilities, both in the form
of its [historical indices](./time_variances.md), which support queries over the
full history, and in the form of its [git-like versioning
functionality](./versioning.md) for individual snapshots. A potential downside
of the latter is that all old versions of your database have to be kept around,
so storage requirements grow with usage. To remove old versions of a Datahike
database you can apply garbage collection.

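As a small illustration of the persistent memory model described above, here is a minimal sketch that uses plain Clojure collections rather than Datahike itself. The names `v1`, `v2`, `m1` and `m2` are made up for this example.

~~~clojure
;; Plain Clojure collections, used only to illustrate persistence.
(def v1 [1 2 3])
(def v2 (conj v1 4))        ; "update" returns a new version

;; The old version is still fully accessible after the update.
(def old-version-intact? (= v1 [1 2 3]))

;; Unchanged parts are shared between versions instead of being copied.
(def m1 {:a {:big "subtree"} :b 1})
(def m2 (assoc m1 :b 2))
(def subtree-shared? (identical? (:a m1) (:a m2)))
~~~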
A database version that no process reads from anymore can be considered garbage.
To remove such versions you can use the garbage collector. You can run it on a
database `db` as follows:

~~~clojure
(require '[datahike.experimental.gc :refer [gc!]]
         '[superv.async :refer [<?? S]])

(<?? S (gc! db))
~~~

This will garbage collect any branches you have created and deleted by now, but
it will not otherwise delete any old db values (snapshots) that are still in the
store. The call returns the set of all deleted storage blobs. You can also run
the collector concurrently in the background by removing the blocking operator
`<??`. It requires no coordination, operates on metadata only and should not
slow down the transactor or other readers.

Next, let's assume that you do not want to keep any old data around much longer
and want to invalidate all readers that still access trees older than the
current `db`.

~~~clojure
(let [now (java.util.Date.)]
  (gc! db now))
~~~

Datahike provides open, uncoordinated and scalable read access to its indices.
You therefore need to be aware that long-running processes might still need
access to old versions, and you should pick conservative grace periods in such
cases. The garbage collector makes sure that any value that was accessible in
the time window provided stays accessible.

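A conservative grace period could be derived from the current time as in the following sketch. The helper `grace-cutoff` and the seven-day period are assumptions made for illustration. The sketch also assumes that the date passed to `gc!` marks the start of the window that stays accessible, as the paragraph above suggests.

~~~clojure
(import '(java.util Date)
        '(java.time Duration))

(defn grace-cutoff
  "Hypothetical helper: returns a java.util.Date lying `days` days before `now`."
  [^Date now days]
  (Date/from (.minus (.toInstant now) (Duration/ofDays days))))

(def now (Date.))
(def cutoff (grace-cutoff now 7))

;; Assumed usage, mirroring the two-argument call shown above:
;; (gc! db cutoff)
~~~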
The garbage collector is tested, but there might still be problems, so please
reach out if you experience any!
@@ -0,0 +1,71 @@
# Versioning

**This is an experimental feature. Please try it out in a test environment and provide feedback.**

Since Datahike has a persistent memory model, it can be used similarly to
[git](https://git-scm.com/). Using databases with different underlying stores is
the general way to combine data in Datahike and should be preferred for separate
data sets. In cases where you want to evolve a single database, however, the
structural sharing of its indices has unique advantages. Git is efficient and
fast because it does not need to copy shared data on each operation. So in cases
where you want to evolve a database with new data, but don't want to write it
directly into the main database, you can `branch!` and evolve a copy of the
database that behaves like the main branch under `:db`. After you have evolved
the database you can decide what data to retain and then `merge!` it back. You
can also take any in-memory DB value and dump it into a durable branch with
`force-branch!`. To inspect the write history use `branch-history`.

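The reason why branching can be cheap under structural sharing can be sketched with a toy model. This is not Datahike's implementation: the atom `store`, the function `toy-branch!` and the map-shaped "databases" are all hypothetical stand-ins built from plain Clojure data.

~~~clojure
;; Toy model, not Datahike's implementation: a "store" maps branch names
;; to immutable database values (plain maps here).
(def store (atom {:db {:alice {:age 41}}}))

(defn toy-branch!
  "Points branch `to` at the same value as branch `from`; nothing is copied."
  [store from to]
  (swap! store (fn [branches] (assoc branches to (get branches from)))))

(toy-branch! store :db :foo)

;; Right after branching both names refer to the very same value.
(def shared-after-branch? (identical? (:db @store) (:foo @store)))

;; Writing to :foo leaves :db untouched ...
(swap! store assoc-in [:foo :bob] {:age 42})
(def main-untouched? (nil? (get-in @store [:db :bob])))

;; ... while unchanged parts are still shared between the two branches.
(def still-shared? (identical? (get-in @store [:db :alice])
                               (get-in @store [:foo :alice])))
~~~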
The following example illustrates this workflow:

~~~clojure
(require '[superv.async :refer [<?? S]]
         '[datahike.api :as d]
         ;; commit-id is used below; it is assumed here to live in the same namespace
         '[datahike.experimental.versioning :refer [branch! branch-history delete-branch! force-branch! merge!
                                                    branch-as-db commit-as-db parent-commit-ids commit-id]])

(let [cfg {:store {:backend :file
                   :path "/tmp/dh-versioning-test"}
           :keep-history? true
           :schema-flexibility :write
           :index :datahike.index/persistent-set}
      conn (do
             (d/delete-database cfg)
             (d/create-database cfg)
             (d/connect cfg))
      schema [{:db/ident :age
               :db/cardinality :db.cardinality/one
               :db/valueType :db.type/long}]
      _ (d/transact conn schema)
      store (:store @conn)]
  (branch! conn :db :foo) ;; new branch :foo, does not create new commit, just copies
  (let [foo-conn (d/connect (assoc cfg :branch :foo))] ;; connect to it
    (d/transact foo-conn [{:age 42}]) ;; transact some data
    ;; extract data from :foo by query
    ;; ...
    ;; and decide to merge it into :db
    (merge! conn #{:foo} [{:age 42}]))
  (count (parent-commit-ids @conn)) ;; => 2, as :db got merged from :foo and :db
  ;; check that the commit stored is the same db as conn
  (= (commit-as-db store (commit-id @conn)) (branch-as-db store :db) @conn) ;; => true
  (count (<?? S (branch-history conn))) ;; => 4 commits now on both branches
  (force-branch! @conn :foo2 #{:foo}) ;; put whatever DB value you have created in memory
  (delete-branch! conn :foo))
~~~

Here we create a database as usual, then create a branch `:foo`, write to it and
merge it back. A simple query that extracts, in transactable form, all data that
is in a `branch1` db but not in a `branch2` db is

~~~clojure
(d/q '[:find ?db-add ?e ?a ?v ?t
       :in $ $2 ?db-add
       :where
       [$ ?e ?a ?v ?t]
       [(not= :db/txInstant ?a)]
       (not [$2 ?e ?a ?v ?t])]
     branch1 branch2 :db/add)
~~~

but you might want to be more selective when creating the data for `merge!`. We
are very interested in what you are planning to do with this functionality, so
please reach out if you have ideas or experience problems!
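What the difference query above computes can be sketched on plain Clojure sets. This is only a stand-in for illustration: the datoms, entity ids and transaction ids below are invented, and `missing-in` is a hypothetical helper, not part of Datahike.

~~~clojure
(require '[clojure.set :as set])

;; Hypothetical datoms as [e a v t] tuples; in Datahike these would come
;; from the two branch databases, not from literal sets.
(def branch1-datoms #{[1 :age 42 100]
                      [2 :age 23 101]
                      [101 :db/txInstant 0 101]})
(def branch2-datoms #{[1 :age 42 100]})

(defn missing-in
  "Datoms of `from` that are absent in `other`, without :db/txInstant
  datoms, each prefixed with :db/add."
  [from other]
  (->> (set/difference from other)
       (remove (fn [[_ a]] (= a :db/txInstant)))
       (map (fn [[e a v t]] [:db/add e a v t]))
       set))

(def tx-data (missing-in branch1-datoms branch2-datoms))
;; tx-data => #{[:db/add 2 :age 23 101]}
~~~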