Index management happens through normal Postgres SQL commands.
To create a ZomboDB index, use the form:
CREATE INDEX idxname
ON table
USING zombodb (zdb('table', table.ctid), zdb(table))
WITH (url='...',
shards=N,
replicas=N,
preference='...',
options='...');
The (zdb('table', ctid), zdb(table))
construct is required and causes the index to be a multi-"column" functional index.
The output of the zdb(table)
function is what is actually indexed -- which is a JSON-formatted version of each row.
The WITH
settings are:
url
required: The base url of the primary entry point into your Elasticsearch cluster. For example:http://192.168.0.75:9200/
. The value must end with a forward slash (/
). If the value isdefault
, then the Postgres GUC settingzombodb.default_elasticsearch_url
(frompostgresql.conf
) will be used. See below for that GUC setting.shards
optional: The number of Elasticsearch shards to use. The default is5
. Changing this value requires aREINDEX
before the new value becomes live.replicas
optional: The number of Elasticsearch replicas to use. The default is1
and allowed values are[0..64]
. Changing this property requires that you callzdb_update_mapping(tablename_containg_index)
to push the setting to Elasticsearch.store
optional: The default isfalse
, but if set totrue
, then ZomboDB will store the raw JSON for each row in Elasticsearch (as the "_source" field). Changing the value on an existing index requiresREINDEX
. If set totrue
, this setting will allow you to use external search and visulation tools like Kibana with ZomboDB-managed indexes. Note that ZomboDB-managed indexes contain dead or not-yet-visible rows, and querying the indexes via external tools will make these rows visible to those tools.
-
options
optional:options
is a ZomboDB-specific string that allows you to define how this index relates to other indexes. This is an advanced-use feature and is documented here. -
shadow
optional (mutually exclusive withurl
): The name of an existing ZomboDB index that this index should use, but likely with a different set of options. This too is an advanced-use feature. -
alias
optional: Thealias
option can be used to assign an Elasticsearch index alias to the index ZomboDB creates. This is typically only important when using Citus or when using Postgres table inheritence to do partitioning. You can set this value to whatever you want, but it should be unique across your Elasticsearch cluster. -
field_lists
optional: Allows to define lists fields that, when queried, are dynamically expanded to search their defined list of other fields. The syntax for this setting is:field_lists='fake_field1=[a, b, c], fake_field2=[d,e,f], ...'
. This can be useful, for example, for searching all "date" fields at once, or defining a set of fields that represent "names" or "locations". Note that each field in a list must be of the same underlying Postgres data type. -
block_routing_field
optional: When set to a column name whose type isint2
,int4
, orint8
, ZomboDB will use that column's value to route documents to shards in blocks of 100,000. Essentially, this calculated value becomes the_routing
for each document, and when used with a column that is frequently used to do cross-index joining (via theoptions
index links above), can greatly improve cross-index joins. ZomboDB will group rows together in blocks of 100,000 using_routing = (long) (field_value / 100_000L)
and then Elasticsearch's normal routing hashing kicks in to determine which shard gets that block of documents. Read more about this, and how it improves cross-index join performance in BLOCK-ROUTING.md and generally about cross-index joins in CROSS-INDEX-JOINS.md. Changing the value of this field requires aREINDEX
. -
always_resolve_joins
optional: The default isfalse
which allows ZomboDB to directly query a linked index for aggregates andzdb_estimate_count()
when the query only queries fields from one linked index (seeoptions
under Advanced Settings below along with the INDEX-OPTIONS documentation). If your links/joins are always 1-to-1, this is perfectly safe and quite a performance improvement. If your links/joins are 1-to-many (or many-to-1) this is not safe and you should set the value totrue
.
preference
optional: The Elasticsearch search preference to use. The default isnull
, meaning no search preference is used.compression_level
optional: ZomboDB can use "deflate" compression when communicating with Elasticsearch via HTTP. The default compression level is zero (ie, disabled), but a value between1
and9
will enable compression. If you enable compression, you also need to sethttp.compression: true
inelasticsearch.yml
for each node in your cluster.bulk_concurrency
optional: Specifies the maximum number of concurrent HTTP requests, per Postgres backend, to use when making "batch" changes, which include CREATE INDEX/REINDEX, INSERT, UPDATE, COPY statements. The default is12
and allowed values are in the set[1..1024]
batch_size
optional: Specifies the size, in bytes, for batch POST data. Affects CREATE INDEX/REINDEX, INSERT, UPDATE, COPY, and VACUUM statements. The default is8388608
bytes (8MB) and allowed values are[1k..1Gb]
. Note that ZomboDB will potentially consumebatch_size * bulk_concurrency
bytes of memory during any given statement, per backend connection.refresh_interval
optional: This setting directly relates to Elasticsearch'sindex.refresh_interval
. The default value is-1
, meaning ZomboDB will control index refreshes such that modified rows are always immediately visible. If a time delay is okay, change this setting to something like5s
for five seconds. This can help improve single-record INSERT/UPDATE performance at the expense of when the rows are actually searchable. Changing this property requires that you callzdb_update_mapping(tablename_containg_index)
to push the setting to Elasticsearch.ignore_visibility
optional: Controls, on a per-index level, if queries that require row visibility information to be MVCC-correct should honor it. The default forignore_visibility
isfalse
, meaning all queries are MVCC-correct. Set this totrue
if you don't need exact values for aggregates andzdb_estimate_count()
.optimize_after
optional: An integer value that represents the number of deleted documents in the Elasticsearch index after which ZomboDB shoud expunge deleted documents. The default is zero, meaning never do this (i.e., let Elasticsearch manage it on its own segment merge schedule), but specifying a value can help keep performance in check for tables that experience high UPDATE/DELETE loads.
zombodb.batch_mode
: This is a boolean "GUC" that controls how ZomboDB sends index changes to Elasticsearch. The default forzombodb.batch_mode
isfalse
which means that ZomboDB sends index changes to ES on a per-statement level and flushes the remote Elasticsearch index at the end of each statement. Setting this totrue
will cause ZomboDB to batch index changes through the life of the transaction, and the final set of changes won't be available for search untilCOMMIT
. When set totrue
, ZomboDB delays flushing the index until transactionCOMMIT
. This can be changed interactively by issuingSET zombodb.batch_mode = true;
zombodb.ignore_visibility
: This is a boolean "GUC" that controls if ZomboDB will honor MVCC visibility rules. The default isfalse
meaning it will, but you canSET zombodb.ignore_visibility = true;
if you don't mind having dead/invisible rows counted in aggregates andzdb_estimate_count()
. This is similiar to the index-level setting of the same name, but can be controlled per session.zombodb.log_level
: This is an enumerated "GUC" that controls ZomboDB's Postgres logging level. It supports the same set of values that Postgres' other log levels support, and its default value isDEBUG1
. Possible values are:
values in order of decreasing detail:
debug5
debug4
debug3
debug2
debug1* -- default value
info
notice
warning
log
This setting can be changed per-session or in `postgresql.conf` for all sessions.
To drop a ZomboDB index, use Postgres' standard DROP INDEX
command:
DROP INDEX idxname;
This removes the index from Postgres' system catalogs but does not also delete the index from Elasticsearch. You must do that manually, for example:
$ curl -XDELETE http://cluster.ip:9200/db.schema.table.index
NOTE: This may change in the future such that the Elasticsearch search index is deleted. It's currently unclear which direction is most helpful.
To reindex an exising ZomboDB index, simply use Postgres' standard REINDEX command. All the various forms of INDEX
, TABLE
, and DATABASE
are fully supported.
You can use Postgres' ALTER INDEX
command to change any of the WITH
settings defined above. For example:
ALTER INDEX idxname SET (replicas=3);
ALTER INDEX idxname RESET (preference);
Chagning non-structure settings such as replicas
and refresh_interval
do not require a reindex, but do require you call zdb_update_mapping(tablename)
to push these changes to Elasticsearch.
However, a REINDEX
is required after changing an index setting that would affect the physical structure of the underlying Elasticsearch index, specifically shards
. This is a limitation of Elasticsearch in that shard count is fixed at index creation time.
The various forms of ALTER TABLE that add/drop columns or change column types are supported. Note that if added columns have a default value or if the type change isn't directly cast-able, Postgres will automatically rebuild the index, so these operations could take a significant amount of time.
####WARNING:
Renaming columns is not supported. What will happen is that newly INSERTed/UPDATEd rows will use the new column name but existing rows will use the old name, making searching difficult.
If you need rename a column, you'll need to manually issue a REINDEX
.
Perhaps in the future, ZomboDB can transparently alias fields to avoid this situation, but at present, it's not the case.
Running a standard VACUUM
on a table with a ZomboDB index does the minimum amount of work to remove dead rows from the backing Elastisearch index and shoud happen in a reasonable amount of time (depending, of course, on the update frequency).
Running a VACUUM FULL
on a table with a ZomboDB index, on the otherhand, is functionally equilivant to running a REINDEX
on the table, which means a VACUUM FULL
could take a long time to complete.
For tables with ZomboDB indexes that are frequently UPDATEd, it's important that autovacuum be configured to be as aggressive as your I/O subsystem can afford. This is because certain types of ZomboDB queries need to build an "invisibility map", which necessitates looking at every heap page that is not known to be all-visibile.