Smarter Datastore Indexing #240

jqnatividad · 2020-02-27T21:38:54Z

With a lot of CKAN installations now also using the Datastore, and with the robust XLoader that can now load large datasets reliably, one common complaint I still see is how Datastore Search is done.

Right now, an FTS index is created for each column in the datastore after insert For large datasets, I've seen indices that are 2 to 3x larger than the datasets themselves, and even XLoader spends a lot of time creating the indices asynchronusly after the postgres copy.

Has the team considered engaging the portal admin/manager the option to manage datastore indices better?

One workflow I can envision is:

large dataset is loaded using xloader/postgres copy
the admin populates the data dictionary. While doing so, specifying which columns should be indexed for search. This also makes sense since far more efficient timestamp/numeric indices are created compared to an FTS index when specifying those datatypes.
CKAN uses the new data dictionary annotations to cast the data types and create the indices asynchronously.
Further, specifying datatypes and index creation for a column, gives the admin the opportunity to specify if a Filter UI can be exposed for that column.

One benefit of this approach is that it also incentivizes data dictionary maintenance, making it not just a documentation artifact, but an easy way to "manage" the datastore without exposing/requiring the dataset maintainer to know SQL.

And with Filtering UI like ckan/ckan#4406, and the SearchPanes plugin for the datatables viewer, this could be the first installment to also improving the filtering experience for CKAN.

An added benefit of this approach is that it can even help with the "select count(*)" performance issue. I can imagine exposing a way to declaring the primary key in the data dictionary. Doing so not only creates the index for the resource, but also allows CKAN to use the reltuples estimate in pg_class to get the rowcount in a consistent, performant manner - as each index also has an entry in pg_class, and doing an ANALYZE TABLE COLUMN is faster than doing an ANALYZE TABLE.

cc @davidread @wardi @amercader

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Smarter Datastore Indexing #240

Smarter Datastore Indexing #240

jqnatividad commented Feb 27, 2020 •

edited

Loading

Smarter Datastore Indexing #240

Smarter Datastore Indexing #240

Comments

jqnatividad commented Feb 27, 2020 • edited Loading

jqnatividad commented Feb 27, 2020 •

edited

Loading