You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
With a lot of CKAN installations now also using the Datastore, and with the robust XLoader that can now load large datasets reliably, one common complaint I still see is how Datastore Search is done.
Right now, an FTS index is created for each column in the datastore after insert For large datasets, I've seen indices that are 2 to 3x larger than the datasets themselves, and even XLoader spends a lot of time creating the indices asynchronusly after the postgres copy.
Has the team considered engaging the portal admin/manager the option to manage datastore indices better?
One workflow I can envision is:
large dataset is loaded using xloader/postgres copy
the admin populates the data dictionary. While doing so, specifying which columns should be indexed for search. This also makes sense since far more efficient timestamp/numeric indices are created compared to an FTS index when specifying those datatypes.
CKAN uses the new data dictionary annotations to cast the data types and create the indices asynchronously.
Further, specifying datatypes and index creation for a column, gives the admin the opportunity to specify if a Filter UI can be exposed for that column.
One benefit of this approach is that it also incentivizes data dictionary maintenance, making it not just a documentation artifact, but an easy way to "manage" the datastore without exposing/requiring the dataset maintainer to know SQL.
And with Filtering UI like ckan/ckan#4406, and the SearchPanes plugin for the datatables viewer, this could be the first installment to also improving the filtering experience for CKAN.
An added benefit of this approach is that it can even help with the "select count(*)" performance issue. I can imagine exposing a way to declaring the primary key in the data dictionary. Doing so not only creates the index for the resource, but also allows CKAN to use the reltuples estimate in pg_class to get the rowcount in a consistent, performant manner - as each index also has an entry in pg_class, and doing an ANALYZE TABLE COLUMN is faster than doing an ANALYZE TABLE.
With a lot of CKAN installations now also using the Datastore, and with the robust XLoader that can now load large datasets reliably, one common complaint I still see is how Datastore Search is done.
Right now, an FTS index is created for each column in the datastore after insert For large datasets, I've seen indices that are 2 to 3x larger than the datasets themselves, and even XLoader spends a lot of time creating the indices asynchronusly after the postgres copy.
Has the team considered engaging the portal admin/manager the option to manage datastore indices better?
One workflow I can envision is:
One benefit of this approach is that it also incentivizes data dictionary maintenance, making it not just a documentation artifact, but an easy way to "manage" the datastore without exposing/requiring the dataset maintainer to know SQL.
And with Filtering UI like ckan/ckan#4406, and the SearchPanes plugin for the datatables viewer, this could be the first installment to also improving the filtering experience for CKAN.
An added benefit of this approach is that it can even help with the "select count(*)" performance issue. I can imagine exposing a way to declaring the primary key in the data dictionary. Doing so not only creates the index for the resource, but also allows CKAN to use the
reltuples
estimate inpg_class
to get the rowcount in a consistent, performant manner - as each index also has an entry in pg_class, and doing an ANALYZE TABLE COLUMN is faster than doing an ANALYZE TABLE.cc @davidread @wardi @amercader
The text was updated successfully, but these errors were encountered: