CartoDB Surrogate Keys
The surrogate key concept comes from the database world. From Wikipedia: a surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database.
With that concept in mind, we can tag and control resources with surrogate keys: any request that has associated resources will be tagged with their surrogate keys, and those surrogate keys will allow us to invalidate the cached requests in our cache layers.
We manually invalidate from Windshaft and the SQL API using a regular expression, so that invalidation is supported on the built-in Varnish cache layer.
Surrogate keys, however, are designed with other platforms in mind: those that support individual hash keys, such as Fastly or Varnish Plus with the Hash Ninja plugin.
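As a rough illustration of regex-based invalidation, the sketch below matches one surrogate key as a whole token inside a space-separated header value. The function names are hypothetical and the pattern is an assumption; it is not the expression Windshaft or the SQL API actually send to Varnish.

```python
import re


def surrogate_key_pattern(key: str) -> "re.Pattern[str]":
    """Build a regex that matches `key` only as a complete header token."""
    # Lookarounds are used instead of \b because keys may contain ':', '+'
    # or '/', which are not word characters.
    return re.compile(r"(?<!\S)" + re.escape(key) + r"(?!\S)")


def should_invalidate(header_value: str, key: str) -> bool:
    """Return True if a cached object tagged with `header_value` carries `key`."""
    return surrogate_key_pattern(key).search(header_value) is not None
```

Matching whole tokens matters here: a plain substring search for rv:LCa0 would also hit an object tagged rv:LCa0a2.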
Currently we use another header for cache invalidation, the X-Cache-Channel header.
It has the format: ${DB_NAME}:${TABLE_NAME}[,${TABLE_NAME}...]. That lets us tag and invalidate resources based on a user database name and a list of user tables associated with the resources. However, not all resources are associated with those entities (database names + database tables), so we can only cache and invalidate resources that we can associate with them.
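The header format above can be sketched as a pair of helpers that build and parse the value. The function names and the sample database and table names in the usage note are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def build_cache_channel(db_name: str, tables: List[str]) -> str:
    """Build a value of the form DB_NAME:TABLE_NAME[,TABLE_NAME...]."""
    if not tables:
        # The format needs at least one table after the colon.
        raise ValueError("at least one table name is required")
    return db_name + ":" + ",".join(tables)


def parse_cache_channel(value: str) -> Tuple[str, List[str]]:
    """Split a header value back into (database name, list of table names)."""
    db_name, _, table_list = value.partition(":")
    return db_name, [table for table in table_list.split(",") if table]
```

For example, build_cache_channel("user_db", ["places", "roads"]) gives user_db:places,roads, and parse_cache_channel reverses it.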
We could integrate the surrogate key concept into that header, but for the sake of simplicity (and to stay compatible with solutions other than Varnish) we've decided to add a Surrogate-Key header.
Fastly's Surrogate-Key header has a fixed length limitation of 1024 bytes (including spaces and hex-encoded values). So we are going to set our limit to the same value for compatibility.
Surrogate-Key header is limited to 1024 bytes, including spaces and hex-encoded values.
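A minimal sketch of enforcing that limit when assembling the header value, assuming keys are joined with single spaces. The constant and function names are hypothetical.

```python
from typing import Iterable

MAX_SURROGATE_KEY_BYTES = 1024  # limit stated above, spaces included


def build_surrogate_key_header(keys: Iterable[str]) -> str:
    """Join keys with spaces and refuse values over the byte limit."""
    # dict.fromkeys drops duplicate keys while preserving their order.
    value = " ".join(dict.fromkeys(keys))
    size = len(value.encode("utf-8"))
    if size > MAX_SURROGATE_KEY_BYTES:
        raise ValueError(
            "Surrogate-Key value is %d bytes, limit is %d"
            % (size, MAX_SURROGATE_KEY_BYTES)
        )
    return value
```

With 8-byte keys such as n:p2Wovq, the limit allows at most 113 keys per response (113 × 8 bytes plus 112 separating spaces is 1016 bytes; a 114th key would bring the value to 1025 bytes).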
To avoid wasting space, surrogate keys should be as short as possible. Very short keys make collisions more likely, though, so there is a trade-off: keys must be long enough to keep collisions rare while staying as short as possible. For a comparable trade-off, check git's abbreviated hash changes.
Another important thing to cover is visibility. If we use only a hash substring, it won't be possible to determine the type of the associated resource, so we are going to use namespaces for different resources, in combination with a big enough base64 substring of a sha256 (or another cryptographic hash function) hash built from the representative object attributes. The namespace will be delimited by a colon (:). For instance, for a named map resource we will use n (n for named) as the namespace, in combination with substring(base64(hash(named map owner and named map name)), 0, 6).
So for a named map with owner=foo and name=bar the surrogate key will be n:p2Wovq. With the n namespace it is easier to know that the key has a named map resource associated, and collisions will be limited to other named maps rather than other resource types.
For an implementation example, check Windshaft-cartodb's cache/model/named_maps_entry.js.
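The key scheme described above can be sketched as follows. How the owner and name are combined before hashing is an assumption here (joined with a colon), so this sketch is not guaranteed to reproduce the n:p2Wovq value; the referenced named_maps_entry.js remains the authoritative implementation. The function names are hypothetical.

```python
import base64
import hashlib


def short_hash(value: str, length: int = 6) -> str:
    """Return the first `length` characters of base64(sha256(value))."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")[:length]


def named_map_key(owner: str, name: str) -> str:
    """Build a surrogate key in the 'n' (named map) namespace."""
    # Assumption: owner and name are joined with ':' before hashing.
    return "n:" + short_hash(owner + ":" + name)
```

Calling named_map_key("foo", "bar") always returns the same 8-character key: the n: prefix followed by six base64 characters.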
It's very important to understand the consequences of surrogate key collisions. If the surrogate key you choose is prone to collisions and you plan to invalidate cached resources based on that key, it's crucial that invalidating those objects cannot cause a permanent loss of data and that they can be regenerated.
For instance, in the named maps example, although far from ideal, if we invalidate a layergroup instance from user foo because a named map from user bar results in the same surrogate key for the layergroup request, we can live with it because the tiles can be generated again.
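To get a feel for how likely such collisions are, the usual birthday approximation can be applied to a 6-character base64 key (64^6 = 2^36 possible values). This is a back-of-the-envelope sketch that assumes hash outputs are uniformly distributed; the function name is hypothetical.

```python
import math


def collision_probability(num_keys: int, key_chars: int = 6) -> float:
    """Approximate chance that at least two of `num_keys` keys collide.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2N)), where N is the
    number of possible keys in one namespace (64 ** key_chars).
    """
    space = 64 ** key_chars
    exponent = -num_keys * (num_keys - 1) / (2 * space)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(exponent)
```

Under those assumptions, a namespace holding roughly 300,000 keys has about a 50% chance of containing at least one colliding pair, which is why invalidation must only ever touch data that can be regenerated.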
In order to transition from the current X-Cache-Channel header we would have to start tagging with several surrogate keys, one for each table associated with the request. Choosing a key with low collisions is very important here because table names will be very similar across many users. So it will probably require more than one namespace.
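One way such a transition could look is sketched below: each table in an X-Cache-Channel value becomes its own surrogate key. The t namespace and the choice to hash the database name together with the table name are assumptions made for illustration, not a decided design.

```python
import base64
import hashlib
from typing import List

TABLE_NAMESPACE = "t"  # hypothetical namespace, not defined by this document


def table_surrogate_keys(cache_channel: str, length: int = 6) -> List[str]:
    """Derive one surrogate key per table from a DB_NAME:TABLE[,TABLE...] value."""
    db_name, _, table_list = cache_channel.partition(":")
    keys = []
    for table in table_list.split(","):
        if not table:
            continue
        # Hashing the database name with the table name keeps identically
        # named tables of different users from sharing a key.
        digest = hashlib.sha256((db_name + ":" + table).encode("utf-8")).digest()
        short = base64.b64encode(digest).decode("ascii")[:length]
        keys.append(TABLE_NAMESPACE + ":" + short)
    return keys
```

Including the database name in the hashed value is one way to address the concern about similar table names across users.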
We currently use Surrogate-Keys only for named map template instantiations on Windshaft-cartodb:
- n namespace: named map
We use the following specific namespaces for objects:
- rv: namespace for referring to a visualization. The hash of the visualization ID must be appended to it.
Generic namespaces:
- rj: namespace for viz.json pages.
- rp: namespace for cacheable public pages (embeddable maps/public map pages)
So, for example, a viz.json for a visualization with ID "foo" will have the Surrogate-Key: rj rv:LCa0a2
This way we could:
- invalidate everything related to a visualization by knowing its ID.
- or invalidate all public pages or all viz.json outputs.
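The two invalidation modes listed above can be illustrated with a toy in-memory cache that indexes entries by surrogate key. This is only a sketch of the idea, not how Varnish or Fastly store objects; the class name and the URLs in the usage note are hypothetical.

```python
from collections import defaultdict


class TaggedCache:
    """Toy cache whose entries can be purged by surrogate key."""

    def __init__(self):
        self._entries = {}                # url -> cached body
        self._by_key = defaultdict(set)   # surrogate key -> urls tagged with it

    def store(self, url, body, surrogate_key_header):
        """Cache `body` and index it under every key in the header value."""
        self._entries[url] = body
        for key in surrogate_key_header.split():
            self._by_key[key].add(url)

    def purge(self, key):
        """Drop every entry tagged with `key`; return how many were removed."""
        removed = 0
        for url in self._by_key.pop(key, set()):
            if url in self._entries:
                del self._entries[url]
                removed += 1
        return removed

    def __contains__(self, url):
        return url in self._entries
```

Storing a viz.json under "rj rv:LCa0a2" and a public page under "rp rv:LCa0a2" means purge("rv:LCa0a2") removes both, while purge("rj") removes every cached viz.json regardless of visualization.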