Metastore removal
The Metastore is an optional GeoWebCache module responsible for handling meta-information attached to each generated tile, in particular:
- identify groups of request parameters that lead to the generation of different tile contents, associating them with a unique numerical identifier
- perform tile locking during tile creation, thus making sure two GeoWebCache instances won't generate the same tile at the same time
- track tile creation time, which is then used to perform tile expiration
The metastore is coded on top of H2, with some parts being H2 specific and others being generic SQL. While the H2 database can be clustered, there are issues with its usage:
- reports of database corruption over time, together with a database format that varies from release to release, make it unattractive to enterprise customers
- in large setups the DBMS is normally a given, so imposing a specific database is not an option
However, in discussion with the GWC maintainer, the desire to remove the MetaStore entirely emerged. The MetaStore currently works as part of the StorageBroker, which uses the MetaStore and the BlobStore to perform its actions. Most of the GWC code talks to the StorageBroker, so once it has been turned into an interface it is easy to replace it with a different implementation.
The plan is then as follows:
- make the StorageBroker into an interface
- the old StorageBroker, MetaStore and BlobStore would form the "LegacyStorageBroker", which is going to be kept around and used to access and convert legacy tile sets
- the new StorageBroker would simply use an improved BlobStore that handles, directly on the file system, all the functionality previously managed by the legacy MetaStore and BlobStore
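The plan above can be illustrated with a minimal sketch. It is written in Python for brevity and is not GWC code: the method names, the `TileKey` and `Params` shapes, the `BlobOnlyStorageBroker` name and the `serve` helper are all hypothetical, and in-memory dictionaries stand in for the MetaStore and the BlobStore. It only shows how callers that depend on the interface keep working when the implementation is swapped.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple

Params = Tuple[Tuple[str, str], ...]            # hypothetical: (name, value) pairs
TileKey = Tuple[str, str, int, int, int, str]   # hypothetical: layer, gridSet, z, x, y, format


class StorageBroker(ABC):
    """The broker as an interface: callers depend only on these methods."""

    @abstractmethod
    def get(self, key: TileKey, params: Params = ()) -> Optional[bytes]:
        ...

    @abstractmethod
    def put(self, key: TileKey, blob: bytes, params: Params = ()) -> None:
        ...


class LegacyStorageBroker(StorageBroker):
    """Old behaviour: parameters are mapped to a numeric id before storage."""

    def __init__(self) -> None:
        self._param_ids: Dict[Params, int] = {}   # stands in for the MetaStore
        self._blobs: Dict[tuple, bytes] = {}      # stands in for the BlobStore

    def get(self, key, params=()):
        # Unknown parameter sets have no id, so the lookup simply misses.
        return self._blobs.get((key, self._param_ids.get(params)))

    def put(self, key, blob, params=()):
        # Stands in for the database sequence that hands out parameter ids.
        param_id = self._param_ids.setdefault(params, len(self._param_ids) + 1)
        self._blobs[(key, param_id)] = blob


class BlobOnlyStorageBroker(StorageBroker):
    """New behaviour: a single store addressed by key and parameters."""

    def __init__(self) -> None:
        self._blobs: Dict[tuple, bytes] = {}

    def get(self, key, params=()):
        return self._blobs.get((key, params))

    def put(self, key, blob, params=()):
        self._blobs[(key, params)] = blob


def serve(broker: StorageBroker, key: TileKey) -> bytes:
    """Caller code: identical for both implementations."""
    tile = broker.get(key)
    if tile is None:
        tile = b"rendered"   # placeholder for actual tile rendering
        broker.put(key, tile)
    return tile
```

Because `serve` only sees the interface, either broker can be passed in without changing the calling code.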
The parameter handling in the current MetaStore takes all the extra parameters and associates them with a single numeric value, a long generated by a database sequence, which is known as the parameterId.
The parameterId is used in two places in GWC:
- to build the full path to a tile
- to identify the tile set in the disk quota module
The current full path to a tile looks as follows:
layerName/gridSet_zoom[_hexParamId]/hx_hy/x_y.format
where:
half = 2 << (z / 2)
hx = x / half
hy = y / half
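As a worked example of the layout and formulas above, here is a small Python sketch. The function name `tile_path` and its arguments are hypothetical, the divisions are taken to be integer divisions, and any zero padding or name sanitising that a real implementation might apply is left out.

```python
def tile_path(layer, grid_set, z, x, y, fmt, param_id=None):
    """Build layerName/gridSet_zoom[_hexParamId]/hx_hy/x_y.format."""
    half = 2 << (z // 2)      # same as 2 * 2**(z // 2)
    hx = x // half            # integer division, as assumed above
    hy = y // half
    zoom_dir = f"{grid_set}_{z}"
    if param_id is not None:
        zoom_dir += f"_{param_id:x}"   # parameter id rendered in hexadecimal
    return f"{layer}/{zoom_dir}/{hx}_{hy}/{x}_{y}.{fmt}"


# z = 4 gives half = 2 << 2 = 8, so x = 10 falls in hx = 1 and y = 7 in hy = 0.
print(tile_path("states", "EPSG_4326", 4, 10, 7, "png"))
# prints "states/EPSG_4326_4/1_0/10_7.png"
print(tile_path("states", "EPSG_4326", 4, 10, 7, "png", param_id=255))
# prints "states/EPSG_4326_4_ff/1_0/10_7.png"
```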
The parameter id thus avoids collisions between tile sets using different parameters. The new system will identify the parameters entirely on disk:
- the group of parameters is hashed with the SHA1 algorithm, generating a hash that is very resilient to collisions (they are unlikely, but not impossible)
- the new on disk layout looks as follows:
layerName/gridSet_zoom[_paramsSHA1]/hx_hy/x_y.format
layerName/gridSet_zoom[_paramsSHA1]/params.txt
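A minimal Python sketch of the hashing step follows. The text does not say how the parameters are serialised before hashing, so the canonical form used here (keys sorted, one `key=value` per line) is an assumption, as are the function names.

```python
import hashlib


def params_text(params):
    """Clear text form of the parameters (assumed: sorted key=value lines)."""
    return "\n".join(f"{key}={params[key]}" for key in sorted(params))


def params_sha1(params):
    """Hexadecimal SHA1 digest of the clear text form (40 characters)."""
    return hashlib.sha1(params_text(params).encode("utf-8")).hexdigest()


def zoom_dir(grid_set, z, params=None):
    """Directory name gridSet_zoom[_paramsSHA1]."""
    name = f"{grid_set}_{z}"
    if params:
        name += "_" + params_sha1(params)
    return name
```

Sorting the keys makes the digest independent of the order in which the request lists its parameters, so the same parameter set always maps to the same directory.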
The params.txt file contains the clear text version of the parameters. In the unlikely case of a collision, with two parameter sets ending up with the same SHA1 value, the code will create a lock file at layerName/gridSet_zoom[_paramsSHA1]/lock.txt, which prevents two GWC instances from trying to create new directories at the same time, and will try to allocate a structure like:
layerName/gridSet_zoom_paramsSHA1_cnt/hx_hy/x_y.format
layerName/gridSet_zoom_paramsSHA1_cnt/params.txt
where cnt is a progressive counter. The lock will be released once the first free number is found on the file system.
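The collision handling described above could look roughly like the following Python sketch. The function names, the polling interval and the timeout are assumptions. The lock is taken by creating lock.txt exclusively, and the sketch does not protect the very first creation of the plain SHA1 directory, since the text only mentions locking in the collision case.

```python
import os
import time


def _claim(path, params_text):
    """True if path holds params_text; creates the directory when it is free."""
    if not os.path.isdir(path):
        os.makedirs(path)
        with open(os.path.join(path, "params.txt"), "w", encoding="utf-8") as f:
            f.write(params_text)
        return True
    try:
        with open(os.path.join(path, "params.txt"), encoding="utf-8") as f:
            return f.read() == params_text
    except FileNotFoundError:
        return False   # directory without params.txt: treat as a mismatch


def allocate_param_dir(layer_dir, base_name, params_text, timeout=5.0):
    """Return the directory name used for params_text under layer_dir."""
    first = os.path.join(layer_dir, base_name)
    if _claim(first, params_text):
        return base_name

    # Collision: base_name exists but stores different parameters.
    lock_path = os.path.join(first, "lock.txt")
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes the call fail if another instance holds the lock.
            os.close(os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(lock_path)
            time.sleep(0.05)
    try:
        cnt = 1
        while True:
            name = f"{base_name}_{cnt}"
            if _claim(os.path.join(layer_dir, name), params_text):
                return name   # matching directory, or first free number
            cnt += 1
    finally:
        os.remove(lock_path)   # release the lock once a directory is settled
```

Calling the function again with parameters that already own a numbered directory returns that same directory, because the scan checks params.txt before moving on to the next counter value.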
When searching for tiles with a certain parameter set, GWC will first look for the plain SHA1 directory and check that the parameters actually match the contents of params.txt; if they do not, it will fall back on a linear search of the similarly named directories. To facilitate and speed up those checks, the StorageBroker will keep an in-memory cache of the available paramsSHA1_cnt combinations, falling back on a disk check only in case of a miss.
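The lookup can be sketched in Python as below. The class name `ParamDirCache`, its method names and the use of a plain dictionary as the cache are assumptions; eviction and invalidation are not covered.

```python
import os


class ParamDirCache:
    """Maps (base_name, params_text) to the directory holding those tiles."""

    def __init__(self, layer_dir):
        self.layer_dir = layer_dir
        self._known = {}   # in-memory cache of resolved directories

    def find(self, base_name, params_text):
        key = (base_name, params_text)
        if key in self._known:
            return self._known[key]   # cache hit: no disk access
        # Miss: check the plain SHA1 directory first, then base_name_cnt in order.
        names = os.listdir(self.layer_dir) if os.path.isdir(self.layer_dir) else []
        prefix = base_name + "_"
        numbered = sorted(
            (n for n in names if n.startswith(prefix) and n[len(prefix):].isdigit()),
            key=lambda n: int(n[len(prefix):]),
        )
        for name in [base_name] + numbered:
            if self._read_params(name) == params_text:
                self._known[key] = name
                return name
        return None   # no directory stores these parameters yet

    def _read_params(self, name):
        path = os.path.join(self.layer_dir, name, "params.txt")
        try:
            with open(path, encoding="utf-8") as f:
                return f.read()
        except (FileNotFoundError, NotADirectoryError):
            return None
```

Only successful lookups are cached in this sketch, so a parameter set that has no directory yet is checked on disk again the next time it is requested.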
It is assumed that this will perform nicely because:
- the SHA1 computation is fast
- the cache will save significant amounts of disk access
- the SHA1 algorithm offers very good collision prevention in any case (1 / 2^51)
The tile locking performed by the metastore serves two purposes:
- it prevents two instances of GWC from computing the same missing tile
- it avoids issues with two GWC instances writing to the same target file
The first is considered to be a non-issue: the percentage of tiles that actually get computed in parallel by different GWC instances is minimal, since:
- users normally access different parts of the map, and when they don't, conflicts arise only while the missing tiles are being computed, so there is little effective duplication of work
- seeds have the potential to duplicate a lot of work, but two instances of GWC never seed the same layer at the same time; at the time of writing they would not be able to share the workload, and if GWC evolves in that direction we'll also make sure to orchestrate the various instances