Thread deadlock in LayerMetadataStore when writing parameters to a temp properties file #1276

Closed
brettniven opened this issue May 31, 2024 · 5 comments


@brettniven

LayerMetadataStore sometimes appears to encounter a thread deadlock when writing parameters to a temp properties file. The problem is quite rare (~1 in every few million tile requests, which is once or twice a week for us in production), but when it does occur it effectively brings the server to a halt.

Our setup:

  • GWC embedded in GeoServer, 1.24.x, using the Kartoza Docker image
  • We're hosting in Azure, using an Azure File Share for GWC tile persistence
  • We're using FileBlobStore, which uses LayerMetadataStore for layers that have CQL filters (we have many such layers)
  • This issue only occurs when the MemoryBlobStore is turned off. We don't see it when the MemoryBlobStore is on because it limits the delegate store to a single thread, which makes deadlocks impossible but leads to poor performance. See: MemoryBlobStore limits the delegate store to a single thread #1275
  • We're in the process of trialling the Azure BlobStore plugin to avoid this issue

Symptoms:

  • GeoServer/GWC becomes unresponsive to HTTP requests, requiring a restart to resolve
  • Some time earlier (~30 minutes for us), our Azure metrics detect possible thread deadlocks
  • The available threads gradually decrease until all threads in the pool are waiting on the same lock

The issue is difficult to reproduce, so I’ve not included a test.

I do, however, have a Java Flight Recorder output which detected the deadlock. I'll attempt to attach it here:
1e16e3f4984b4c37b3690c7369987cfb.jfr.zip
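
As an aside, for anyone wanting to monitor for this condition without a full JFR recording: the snippet below is my own sketch (not GWC code) and reports the same kind of ownable-synchronizer deadlock via the standard ThreadMXBean API; it could be scheduled periodically.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// My own monitoring sketch, not GWC code: reports deadlocks that involve ownable
// synchronizers (e.g. ReentrantReadWriteLock), which is what the JFR dump shows.
public class DeadlockProbe {

    public static void dumpDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads(); // null when none found
        if (deadlocked == null) {
            return;
        }
        // Request locked monitors and locked synchronizers so lock ownership is visible
        for (ThreadInfo info : threads.getThreadInfo(deadlocked, true, true)) {
            System.out.printf("%s is waiting for %s held by %s%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}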

I can attempt to assist with a fix, but I can't see precisely where the problem is. Below, I've done my best to highlight roughly where the problem lies, and I'm seeking any thoughts/ideas. My best guess is that, after obtaining a hash, the code uses a potentially unsafe array of locks.
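
To illustrate what I mean, here is a minimal, hypothetical sketch of a hash-striped lock array (the names and structure are mine, not the actual LayerMetadataStore code). If two threads each take the write lock on one stripe and then wait for a second stripe that the other thread already holds, they deadlock in exactly the way the JFR output below shows:

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only - not the actual LayerMetadataStore code.
public class StripedLockSketch {

    private static final int N_LOCKS = 32;
    private final ReentrantReadWriteLock[] locks = new ReentrantReadWriteLock[N_LOCKS];

    public StripedLockSketch() {
        for (int i = 0; i < N_LOCKS; i++) {
            locks[i] = new ReentrantReadWriteLock();
        }
    }

    // Each file maps to a stripe via its hash; distinct files can share a stripe.
    private ReentrantReadWriteLock lockFor(String file) {
        return locks[Math.floorMod(file.hashCode(), N_LOCKS)];
    }

    // Thread A: holds the stripe for fileX, waits for the stripe for fileY.
    // Thread B: holds the stripe for fileY, waits for the stripe for fileX.
    // Neither can proceed - the AB/BA pattern visible in the JFR dump.
    public void writeHoldingTwoStripes(String outerFile, String innerFile) {
        ReentrantReadWriteLock outer = lockFor(outerFile);
        ReentrantReadWriteLock inner = lockFor(innerFile);
        outer.writeLock().lock();
        try {
            inner.writeLock().lock(); // blocks forever if another thread took these
            try {                     // two stripes in the opposite order
                // ... write a temp properties file, then move it into place ...
            } finally {
                inner.writeLock().unlock();
            }
        } finally {
            outer.writeLock().unlock();
        }
    }
}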

The Deadlock, per Java Flight Recorder

From the JFR output, the proof of the deadlock is as follows (which, by itself, is not overly helpful):

Found one Java-level deadlock:
=============================
"http-nio-8080-exec-1":
  waiting for ownable synchronizer 0x00000005c2e59ca0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "http-nio-8080-exec-44"
"http-nio-8080-exec-44":
  waiting for ownable synchronizer 0x00000005c2e59f70, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "http-nio-8080-exec-13"
"http-nio-8080-exec-13":
  waiting for ownable synchronizer 0x00000005c2e59ca0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "http-nio-8080-exec-44"

The relevant stack traces are those of threads 13 and 44, which are each waiting on a lock the other holds. I've not included the full traces for brevity; they are visible in the JFR zip:

at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base/AbstractQueuedSynchronizer.java:917)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base/AbstractQueuedSynchronizer.java:1240)
at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base/ReentrantReadWriteLock.java:959)
at org.geowebcache.storage.blobstore.file.LayerMetadataStore.writeMetadataFile(LayerMetadataStore.java:250)
at org.geowebcache.storage.blobstore.file.LayerMetadataStore.writeTempMetadataFile(LayerMetadataStore.java:234)
at org.geowebcache.storage.blobstore.file.LayerMetadataStore.writeMetadataOptimisticLock(LayerMetadataStore.java:175)
at org.geowebcache.storage.blobstore.file.LayerMetadataStore.putEntry(LayerMetadataStore.java:118)
at org.geowebcache.storage.blobstore.file.FileBlobStore.putLayerMetadata(FileBlobStore.java:677)
...
org.geowebcache.storage.blobstore.file.FileBlobStore.put(FileBlobStore.java:491)

Relevant sections in the code

Any thoughts welcome. In the meantime, we'll be trialling the Azure BlobStore plugin to get around this issue.
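
For what it's worth, one generic way to rule out this kind of AB/BA cycle, sketched here against my hypothetical striping example above rather than the actual code, is to always acquire multiple stripes in a fixed global order (e.g. by stripe index):

// Sketch only, as a method on the hypothetical StripedLockSketch class above:
// acquire the two stripes in a fixed order (by index) so that two threads needing
// the same pair of stripes can never cross-block each other.
public void writeWithOrderedStripes(String fileA, String fileB) {
    int a = Math.floorMod(fileA.hashCode(), N_LOCKS);
    int b = Math.floorMod(fileB.hashCode(), N_LOCKS);
    int first = Math.min(a, b);
    int second = Math.max(a, b);
    locks[first].writeLock().lock();
    try {
        if (second != first) {
            locks[second].writeLock().lock();
        }
        try {
            // ... write a temp properties file, then move it into place ...
        } finally {
            if (second != first) {
                locks[second].writeLock().unlock();
            }
        }
    } finally {
        locks[first].writeLock().unlock();
    }
}

With a fixed ordering, any two threads that need the same pair of stripes always contend on the lower-indexed stripe first, so a circular wait cannot form.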

@aaime
Member

aaime commented May 31, 2024

I'm not sure it's the same, but this reminds me a lot of the issue fixed in this pull request.

The fix was first released in 1.25.1 a few days ago, and should be part of the 1.24.x series next month.

Also, I recommend using a nightly build if you're testing the Azure blob store, or the test could become expensive: #1149 (that issue has been fixed in the meantime, but the fix has also only been released in 1.25.1 so far).

@brettniven
Author

Thanks, Andrea, for your assistance.

Concerning the two issues mentioned above (parameter storage for FileBlobStore, the ListBlobs issue for AzureBlobStore), is there an intended 1.24.4 release date (so GeoServer 2.24.4, I guess) that these may be included in?
For reference:

  • We've had some decent success trialling the Azure BlobStore plugin (a nightly/custom build of 1.24.x), where we can attain much higher throughput. We are, however, hitting issues with the truncate performance of that plugin, which for the most part we can likely address by altering config and our delegate logic (I won't digress here, but we can see the plugin attempts to delete all possible tiles as opposed to prefetching them, which has both pros and cons)
  • I'm hopeful the FileBlobStore enhancement you've made may improve our FileBlobStore throughput and also resolve the locking issue
  • Unfortunately, we can't upgrade to 1.25.x easily, as we're encountering ClassLoader issues with plugins

@aaime
Member

aaime commented Jun 7, 2024

1.24.4 should be released around the 18th of the month.

About the truncation being inefficient: we're aware of it and have some ideas, but are waiting for funding to show up. If you're up for making pull requests on your own, I'm happy to explain some of the most immediate changes that would improve truncate performance.

Classloader issues with plugins... are you using GWC along with GeoServer and with some community modules in the mix? On 1.25.x we just merged a rather large PR that overhauls how community modules are packaged, which should help in that respect: geoserver/geoserver#7714

@brettniven
Author

Nice! Yes, I can potentially contribute. I'm waiting to see where we land in the next couple of weeks. We have to scale for an expected load increase imminently. We may end up with an interim solution and then a future plan.

Yes, our setup is GWC embedded in GeoServer, with some community plugins. I believe my colleague may have asked about the ClassLoader issues, maybe on the mailing list. I need to catch up on that aspect.

@brettniven
Author

Some positive feedback on this one. With the previously mentioned fix for #880, in PR #1230, we now see substantial performance improvements with FileBlobStore: anywhere between 2x and 12x the throughput (there are so many variables in our setup that I can't put a precise figure on it, but this, with other config tweaks to allow us to scale, brings us close to the 12x mark). I've tested this with both the 1.24.x nightly builds and with 1.24.4 now that it's been released.

It also seems this thread deadlock should no longer be possible, as I can see that the code path in question should no longer be traversed. I can't verify that 100%, though, until we give this a solid hit-out for a decent period in our production environment.

@aaime closed this as completed Jun 29, 2024