FDBLucene is a new project to store Lucene indexes in FoundationDB while providing high performance for both indexing and searching.
Requires Apache Maven to build:
Run mvn clean install -DskipTests
Requires a local FoundationDB cluster to be installed and running.
Run mvn test to run the unit tests included in src/test.
This repository contains two different approaches to storing Lucene indexes in FoundationDB. At this time the FDBDirectory approach is the active candidate.
The FDBIndexReader and FDBIndexWriter classes implement a subset of Lucene's features.
The principal advantage of this approach over FDBDirectory is that it removes the need for an exclusive writer. Multiple instances of FDBIndexWriter can safely add, update or delete documents from the same index concurrently.
DATA.md describes the format of the keys and values used to build the index and serve requests.
We are no longer using:
- the IndexWriter class
- the notion of a Directory
- the notion of a Codec
- Field numbers
FDBIndex{Reader,Writer} implement only a subset of Lucene's features, though more may be added over time. DocValues and Points are not completely supported, but numeric lookup and range querying are possible with FDBNumericPoint, and sorting by number is possible with the standard NumericDocValuesField class.
FDBDirectory is a full implementation of Lucene's Directory abstraction.
Lucene expects to write to disk (via a file system) and uses an inverted index for this reason. To balance the optimal on-disk format with the need to efficiently update an index, Lucene creates multiple "segments". Each of these segments is an index in its own right, though Lucene makes it easy to search across all segments.
Because Lucene assumes a file system, it defines its own transactional semantics. Firstly, a lock file is used to ensure there is only a single writer to the index at a time. Secondly, data that is written to a file is not required to be visible until the file is closed. Finally, there is a central file (called the segments file) which names the other files in the directory which constitute the index. This allows Lucene to build files in the index without making them immediately visible. The segments file is itself updated atomically.
These design decisions within Lucene guide us to where, and whether, to
apply FDB transactional semantics. When writing to a new file, for
example, we have no need to put a transaction around the data we're
writing. FoundationDB, of course, requires one, but it has no semantic
meaning to Lucene. We can therefore buffer as much data as we like to
form an optimal transaction size. In contrast, the rename method is
atomic.
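The buffering idea above can be sketched roughly as follows. This is not FDBLucene's actual code: the class name BufferedWriterSketch, the flushThreshold parameter and the committedBatches list are all hypothetical, and an in-memory list stands in for FoundationDB transactions. It only illustrates that write boundaries chosen by Lucene need not match commit boundaries.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each entry in committedBatches stands in for one
// FoundationDB transaction. No real FoundationDB API is used here.
class BufferedWriterSketch {
    private final int flushThreshold; // assumed target bytes per transaction, must be > 0
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    final List<byte[]> committedBatches = new ArrayList<>();

    BufferedWriterSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Lucene may call write with chunks of any size; we regroup them.
    void write(byte[] data) {
        int offset = 0;
        while (offset < data.length) {
            int room = flushThreshold - buffer.size();
            int n = Math.min(room, data.length - offset);
            buffer.write(data, offset, n);
            offset += n;
            if (buffer.size() == flushThreshold) {
                flush();
            }
        }
    }

    // Commit whatever is buffered as one simulated transaction.
    void flush() {
        if (buffer.size() > 0) {
            committedBatches.add(buffer.toByteArray());
            buffer.reset();
        }
    }

    // Closing the file commits the final, possibly short, batch.
    void close() {
        flush();
    }

    public static void main(String[] args) {
        BufferedWriterSketch w = new BufferedWriterSketch(4);
        w.write(new byte[3]);
        w.write(new byte[3]);
        w.write(new byte[4]);
        w.close();
        System.out.println(w.committedBatches.size() + " simulated transactions");
    }
}
```

Because the file is not required to be visible until it is closed, nothing in this sketch needs the batches to be atomic as a group; only the final step that makes the file visible would need a single transaction.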
FDBDirectory stores all its data in FoundationDB using a user-specified key prefix, represented as a Subspace. Each file within the index is given a unique number, generated by a per-index counter entry. Binary data within the file are stored as pages. This is essentially the large value pattern described at https://apple.github.io/foundationdb/largeval.html.
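As a rough illustration of the paging scheme just described, the sketch below splits a file's bytes into fixed-size pages keyed by file number and page number. Everything here is an assumption made for illustration: the PAGE_SIZE value, the string key layout and the class name PagedFileSketch are hypothetical, and a TreeMap stands in for FoundationDB's ordered key space. The real key and value formats are not shown in this document.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap stands in for the ordered key space
// under the index's Subspace. Key layout and page size are assumptions.
class PagedFileSketch {
    static final int PAGE_SIZE = 4; // deliberately tiny, for illustration only

    // Assumed key layout: "<fileNumber>/<zero-padded pageNumber>".
    static String pageKey(long fileNumber, int pageNumber) {
        return String.format("%d/%08d", fileNumber, pageNumber);
    }

    // Split the file's bytes into pages and store one entry per page.
    static void writeFile(Map<String, byte[]> store, long fileNumber, byte[] data) {
        for (int offset = 0, page = 0; offset < data.length; offset += PAGE_SIZE, page++) {
            int end = Math.min(offset + PAGE_SIZE, data.length);
            store.put(pageKey(fileNumber, page), Arrays.copyOfRange(data, offset, end));
        }
    }

    // Random access: locate the page holding the position, then index into it.
    static byte readByte(Map<String, byte[]> store, long fileNumber, long position) {
        int page = (int) (position / PAGE_SIZE);
        int offsetInPage = (int) (position % PAGE_SIZE);
        return store.get(pageKey(fileNumber, page))[offsetInPage];
    }

    public static void main(String[] args) {
        Map<String, byte[]> store = new TreeMap<>();
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        writeFile(store, 7L, data);
        System.out.println("pages stored: " + store.size());
        System.out.println("byte at position 9: " + readByte(store, 7L, 9));
    }
}
```

Keying pages by file number rather than file name means a rename only has to touch the entry that maps the name to the number, not the pages themselves; that is a design observation about this sketch, not a statement about FDBLucene's internals.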
Lucene creates empty files, fills them with data by appending, and
then closes them. The files are never updated again. They are
therefore highly cacheable. FDBLucene exploits this property by
caching every page that it reads from any file. The behaviour and
capacity of that cache are configurable by the user, as FDBLucene uses
Apache JCS (http://commons.apache.org/jcs/). The cache for an
individual file is only valid until the enclosing Directory is closed
in order to avoid any cache coherency issues if an index is deleted
and recreated.
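The caching behaviour described above might look something like the following minimal sketch. It deliberately does not use Apache JCS; a small LRU map built on LinkedHashMap is swapped in so the example stays self-contained. The class name PageCacheSketch, the maxPages parameter and the loader callback are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a per-Directory page cache. An access-ordered
// LinkedHashMap stands in for Apache JCS; it is not what FDBLucene uses.
class PageCacheSketch {
    private final Map<String, byte[]> pages;
    int loads = 0; // counts how often the backing store was consulted

    PageCacheSketch(final int maxPages) {
        // accessOrder=true gives least-recently-used eviction order.
        this.pages = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxPages;
            }
        };
    }

    // Files are immutable once closed, so a cached page never goes stale
    // while this cache (and its enclosing Directory) is alive.
    byte[] get(long fileNumber, int pageNumber, Function<String, byte[]> loader) {
        String key = fileNumber + "/" + pageNumber;
        byte[] page = pages.get(key);
        if (page == null) {
            page = loader.apply(key); // stands in for a read from the database
            loads++;
            pages.put(key, page);
        }
        return page;
    }

    // Dropping everything on close avoids serving pages from an index
    // that was deleted and recreated in the meantime.
    void close() {
        pages.clear();
    }

    public static void main(String[] args) {
        PageCacheSketch cache = new PageCacheSketch(2);
        cache.get(1L, 0, k -> new byte[] {42});
        cache.get(1L, 0, k -> new byte[] {42});
        System.out.println("loads: " + cache.loads);
    }
}
```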