
CartoDB Surrogate Keys

Alejandro Martínez edited this page Apr 20, 2015 · 1 revision

Surrogate Keys

Intro

The surrogate key concept comes from the database world. From Wikipedia: "A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database."

With that concept in mind, we can tag and control resources with surrogate keys: any request that has associated resources will be tagged with their surrogate keys. Those surrogate keys allow us to invalidate the cached requests in our cache layers.

To support the built-in Varnish cache layer, we invalidate manually from Windshaft and SQL API using a regular expression.

Surrogate keys, however, are designed with other platforms in mind: those that support individual hash keys, such as Fastly or Varnish Plus with the Hash Ninja plugin.

Why a new header

Currently we are using another header for cache invalidation, the X-Cache-Channel header.

It has the format: ${DB_NAME}:${TABLE_NAME}[,${TABLE_NAME}...]. That lets us tag and invalidate resources based on a user database name and a list of user tables associated with the resources. However, not all resources are associated with those entities (database names + database tables), so we can only cache and invalidate the resources that we can associate with them.
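As an illustration of the format above, here is a minimal sketch that builds and parses an X-Cache-Channel value. The function names are hypothetical and are not taken from Windshaft or SQL API; the sketch also assumes that database and table names contain no colons or commas.

```python
def build_cache_channel(db_name, table_names):
    """Build a value shaped like ${DB_NAME}:${TABLE_NAME}[,${TABLE_NAME}...]."""
    if not table_names:
        # The format requires at least one table after the colon.
        raise ValueError("at least one table name is required")
    return "{}:{}".format(db_name, ",".join(table_names))


def parse_cache_channel(value):
    """Split a header value back into (db_name, [table_name, ...])."""
    db_name, separator, tables = value.partition(":")
    if not separator or not tables:
        raise ValueError("expected '<db_name>:<table>[,<table>...]'")
    return db_name, tables.split(",")
```

A resource with no database or table has nothing to put on either side of the colon, which is the restriction described above.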

We could integrate the surrogate key concept into that header, but for the sake of simplicity (and to stay compatible with solutions other than Varnish) we've decided to add a Surrogate-Key header.

Header size limit

Fastly's Surrogate-Key header has a fixed length limit of 1024 bytes (including spaces and hex-encoded values), so we set our limit to the same value for compatibility.

Surrogate-Key header is limited to 1024 bytes, including spaces and hex-encoded values.
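A minimal sketch of how that limit could be enforced when assembling the header value. It assumes keys are joined with single spaces (as in the examples on this page) and that going over the limit should be treated as an error; the function and constant names are hypothetical.

```python
MAX_SURROGATE_KEY_BYTES = 1024  # limit stated above, spaces included


def build_surrogate_key_header(keys):
    """Join keys with spaces and refuse values above the byte limit."""
    value = " ".join(keys)
    size = len(value.encode("utf-8"))  # measure bytes, not characters
    if size > MAX_SURROGATE_KEY_BYTES:
        raise ValueError(
            "Surrogate-Key value is {} bytes, limit is {}".format(
                size, MAX_SURROGATE_KEY_BYTES
            )
        )
    return value
```

With 8-byte keys such as n:p2Wovq plus one separating space each, this leaves room for roughly a hundred keys per response.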

Format

To avoid wasting space, surrogate keys should be as short as possible. However, really short keys make collisions more likely, so keys have to be long enough to keep collisions rare while staying short enough to fit many of them in the header. Check git's abbreviated hash changes.

Another important thing to cover is visibility. If we use only a hash substring, it won't be possible to determine the type of the associated resource. So we are going to use namespaces for the different resource types, in combination with a big enough base64 substring of a sha256 (or another cryptographic hash function) hash built from the representative object attributes. The namespace will be delimited by a colon (:). For instance, for a named map resource we will use n (n for named) as the namespace, in combination with substring(base64(hash(named map owner and named map name)), 0, 6).

So for a named map with owner=foo and name=bar the surrogate key will be n:p2Wovq. With the n namespace it is easier to know that the key has a named map resource associated, and collisions are limited to other named maps rather than other resource types.

For an implementation example, check Windshaft-cartodb's cache/model/named_maps_entry.js.
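The scheme can also be sketched in a few lines of Python. This is an illustration of the format described above, not a port of named_maps_entry.js: in particular, joining owner and name with a colon before hashing is an assumption, so the keys it produces may differ from the ones Windshaft-cartodb emits.

```python
import base64
import hashlib


def short_hash(target, length=6):
    """Return the first `length` characters of base64(sha256(target))."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")[:length]


def named_map_key(owner, name):
    """Build a key in the 'n' namespace for a named map."""
    # Assumption: owner and name are joined with ":" before hashing.
    return "n:" + short_hash("{}:{}".format(owner, name))
```

Note that standard base64 output can contain + and / in addition to letters and digits, so keys are not restricted to alphanumeric characters.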

About collisions

It's very important to understand the consequences of surrogate key collisions. If the surrogate key you choose is prone to collisions and you plan to invalidate cached resources based on that key, it's crucial that invalidating an object never results in a permanent loss of data and that the object can be regenerated.

For instance in the named maps example, although far from ideal, if we invalidate a layergroup instance from user foo because a named map from user bar results in the same surrogate key for the layergroup request, we can live with it because the tiles can be generated again.
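To get a feel for how often such collisions might happen, the standard birthday approximation can be applied to a 6-character base64 key, which carries 36 bits. This is a rough estimate that assumes hash outputs are uniformly distributed; the function name and defaults are illustrative only.

```python
import math


def collision_probability(num_keys, key_chars=6, bits_per_char=6):
    """Approximate chance that at least two of `num_keys` keys collide.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2N)),
    where N is the number of distinct keys available in a namespace.
    """
    if num_keys < 2:
        return 0.0
    space = 2 ** (key_chars * bits_per_char)
    exponent = num_keys * (num_keys - 1) / (2.0 * space)
    return -math.expm1(-exponent)  # same as 1 - exp(-exponent), more precise
```

Under these assumptions, 100,000 keys in a single namespace give about a 7% chance of at least one collision, which is acceptable only because the affected resources can be regenerated.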

Transition

Windshaft-cartodb and CartoDB-SQL-API

In order to transition from the current X-Cache-Channel header, we would have to start tagging each request with several surrogate keys, one for each table associated with it. Choosing a key with few collisions is very important here, because table names will be very similar across many users, so it will probably require more than one namespace. We currently use surrogate keys only for named map template instantiations on Windshaft-cartodb:

  • n namespace: named map

CartoDB Rails application

We use the following specific namespaces for objects:

  • rv: namespace for referring to a visualization. The hash of the visualization ID has to be appended to it.

Generic namespaces:

  • rj: namespace for viz.json pages.
  • rp: namespace for cacheable public pages (embeddable maps/public map pages).

So, for example, a viz.json for a visualization with ID "foo" will have the Surrogate-Key: rj rv:LCa0a2

This way we could:

  • invalidate everything related to a visualization by knowing its ID.
  • or invalidate all public pages or all viz.json outputs.
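The sketch below shows how these namespaces combine and how invalidating by key would behave. It is written in Python rather than Ruby, purely as an illustration: the function names and the in-memory cache are hypothetical, and the hash is assumed to be the first 6 characters of base64(sha256(ID)), which reproduces the rv:LCa0a2 example above.

```python
import base64
import hashlib


def short_hash(value, length=6):
    """First `length` characters of base64(sha256(value))."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")[:length]


def vizjson_surrogate_key(visualization_id):
    """Generic 'rj' key plus the specific 'rv:<hash>' key."""
    return " ".join(["rj", "rv:" + short_hash(visualization_id)])


def purge(cache, key):
    """Drop every cached entry tagged with `key`; return the purged URLs.

    `cache` is a hypothetical stand-in for a cache layer: a dict that
    maps a URL to the Surrogate-Key value of the cached response.
    """
    purged = [url for url, tags in cache.items() if key in tags.split()]
    for url in purged:
        del cache[url]
    return purged


cache = {
    "/viz/foo/viz.json": vizjson_surrogate_key("foo"),  # "rj rv:LCa0a2"
    "/viz/other/viz.json": vizjson_surrogate_key("other"),
}
print(purge(cache, "rv:LCa0a2"))  # ['/viz/foo/viz.json']
```

Purging with rj instead would drop every viz.json entry at once, which corresponds to the second option in the list above.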