Thread deadlock in LayerMetadataStore when writing parameters to a temp properties file #1276

Closed
brettniven opened this issue May 31, 2024 · 5 comments


@brettniven

LayerMetadataStore sometimes appears to encounter a thread deadlock when writing parameters to a temp properties file. The problem is quite rare (~1 in every few million tile requests, which is once or twice a week for us in production), but when it does occur it effectively brings the server to a halt.

Our setup:

  • GWC embedded in GeoServer, 1.24.x, using the Kartoza Docker image
  • We're hosting in Azure, using an Azure File Share for GWC tile persistence
  • We're using FileBlobStore, which uses LayerMetadataStore for layers that have CQL filters (we have many such layers)
  • This issue only occurs when the MemoryBlobStore is turned off. We don't see it when the MemoryBlobStore is on because it limits the delegate store to a single thread, which makes deadlocks impossible but leads to poor performance. See: MemoryBlobStore limits the delegate store to a single thread #1275
  • We're in the process of trialling the Azure BlobStore plugin to avoid this issue

Symptoms:

  • GeoServer/GWC becomes unresponsive to HTTP requests, requiring a restart to resolve
  • Some time earlier (~30 minutes for us), our Azure metrics detect possible thread deadlocks
  • The available threads gradually decrease until all threads in the pool are waiting on the same lock

The issue is difficult to reproduce, so I’ve not included a test.

I do, however, have a Java Flight Recorder output which detected the deadlock. I'll attempt to attach it here:
1e16e3f4984b4c37b3690c7369987cfb.jfr.zip
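
As an aside, for anyone wanting to monitor for this condition without a full JFR recording: the snippet below is my own sketch (not GWC code) and reports the same kind of ownable-synchronizer deadlock via the standard ThreadMXBean API; it could be scheduled periodically.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// My own monitoring sketch, not GWC code: reports deadlocks that involve ownable
// synchronizers (e.g. ReentrantReadWriteLock), which is what the JFR dump shows.
public class DeadlockProbe {

    public static void dumpDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads(); // null when none found
        if (deadlocked == null) {
            return;
        }
        // Request locked monitors and locked synchronizers so lock ownership is visible
        for (ThreadInfo info : threads.getThreadInfo(deadlocked, true, true)) {
            System.out.printf("%s is waiting for %s held by %s%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}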

I can attempt to assist with a fix, but I can't see precisely where the problem is. Below, I've done my best to highlight roughly where the problem lies, and I'm seeking any thoughts/ideas. My best guess is that, after obtaining a hash, the code uses a potentially unsafe array of locks.
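
To illustrate what I mean, here is a minimal, hypothetical sketch of a hash-striped lock array (the names and structure are mine, not the actual LayerMetadataStore code). If two threads each take the write lock on one stripe and then wait for a second stripe that the other thread already holds, they deadlock in exactly the way the JFR output below shows:

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only - not the actual LayerMetadataStore code.
public class StripedLockSketch {

    private static final int N_LOCKS = 32;
    private final ReentrantReadWriteLock[] locks = new ReentrantReadWriteLock[N_LOCKS];

    public StripedLockSketch() {
        for (int i = 0; i < N_LOCKS; i++) {
            locks[i] = new ReentrantReadWriteLock();
        }
    }

    // Each file maps to a stripe via its hash; distinct files can share a stripe.
    private ReentrantReadWriteLock lockFor(String file) {
        return locks[Math.floorMod(file.hashCode(), N_LOCKS)];
    }

    // Thread A: holds the stripe for fileX, waits for the stripe for fileY.
    // Thread B: holds the stripe for fileY, waits for the stripe for fileX.
    // Neither can proceed - the AB/BA pattern visible in the JFR dump.
    public void writeHoldingTwoStripes(String outerFile, String innerFile) {
        ReentrantReadWriteLock outer = lockFor(outerFile);
        ReentrantReadWriteLock inner = lockFor(innerFile);
        outer.writeLock().lock();
        try {
            inner.writeLock().lock(); // blocks forever if another thread took these
            try {                     // two stripes in the opposite order
                // ... write a temp properties file, then move it into place ...
            } finally {
                inner.writeLock().unlock();
            }
        } finally {
            outer.writeLock().unlock();
        }
    }
}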

The Deadlock, per Java Flight Recorder

From the JFR output, the proof of the deadlock is as follows (which, by itself, is not overly helpful):

Found one Java-level deadlock:
=============================
"http-nio-8080-exec-1":
  waiting for ownable synchronizer 0x00000005c2e59ca0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "http-nio-8080-exec-44"
"http-nio-8080-exec-44":
  waiting for ownable synchronizer 0x00000005c2e59f70, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "http-nio-8080-exec-13"
"http-nio-8080-exec-13":
  waiting for ownable synchronizer 0x00000005c2e59ca0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "http-nio-8080-exec-44"

The relevant stack traces are those of threads 13 and 44, which are each waiting on a lock the other holds. I've not included the full traces for brevity; they are visible in the JFR zip:

at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base/AbstractQueuedSynchronizer.java:917)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base/AbstractQueuedSynchronizer.java:1240)
at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base/ReentrantReadWriteLock.java:959)
at org.geowebcache.storage.blobstore.file.LayerMetadataStore.writeMetadataFile(LayerMetadataStore.java:250)
at org.geowebcache.storage.blobstore.file.LayerMetadataStore.writeTempMetadataFile(LayerMetadataStore.java:234)
at org.geowebcache.storage.blobstore.file.LayerMetadataStore.writeMetadataOptimisticLock(LayerMetadataStore.java:175)
at org.geowebcache.storage.blobstore.file.LayerMetadataStore.putEntry(LayerMetadataStore.java:118)
at org.geowebcache.storage.blobstore.file.FileBlobStore.putLayerMetadata(FileBlobStore.java:677)
...
org.geowebcache.storage.blobstore.file.FileBlobStore.put(FileBlobStore.java:491)

Relevant sections in the code

Any thoughts welcome. In the meantime, we'll be trialling the Azure BlobStore plugin to get around this issue.
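
For what it's worth, one generic way to rule out this kind of AB/BA cycle, sketched here against my hypothetical striping example above rather than the actual code, is to always acquire multiple stripes in a fixed global order (e.g. by stripe index):

// Sketch only, as a method on the hypothetical StripedLockSketch class above:
// acquire the two stripes in a fixed order (by index) so that two threads needing
// the same pair of stripes can never cross-block each other.
public void writeWithOrderedStripes(String fileA, String fileB) {
    int a = Math.floorMod(fileA.hashCode(), N_LOCKS);
    int b = Math.floorMod(fileB.hashCode(), N_LOCKS);
    int first = Math.min(a, b);
    int second = Math.max(a, b);
    locks[first].writeLock().lock();
    try {
        if (second != first) {
            locks[second].writeLock().lock();
        }
        try {
            // ... write a temp properties file, then move it into place ...
        } finally {
            if (second != first) {
                locks[second].writeLock().unlock();
            }
        }
    } finally {
        locks[first].writeLock().unlock();
    }
}

With a fixed ordering, any two threads that need the same pair of stripes always contend on the lower-indexed stripe first, so a circular wait cannot form.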

@aaime
Member

aaime commented May 31, 2024

I'm not sure it's the same, but this reminds me a lot of the issue fixed in this pull request.

The fix was first released in 1.25.1 a few days ago, and should be part of the 1.24.x series next month.

Also, I recommend using a nightly build if you're testing the Azure blob store, or the test could become expensive: #1149 (that issue has been fixed in the meantime, but the fix has also only been released in 1.25.1 so far).

@brettniven
Author

Thanks, Andrea, for your assistance.

Concerning the two issues mentioned above (parameter storage for FileBlobStore, the ListBlobs issue for AzureBlobStore), is there an intended 1.24.4 release date (so GeoServer 2.24.4, I guess) that these may be included in?
For reference:

  • We've had some decent success trialling the Azure BlobStore plugin (a nightly/custom build of 1.24.x), where we can attain much higher throughput. We are, however, hitting issues with the truncate performance of that plugin, which for the most part we can likely address by altering config and our delegate logic (I won't digress here, but we can see the plugin attempts to delete all possible tiles as opposed to prefetching them, which has both pros and cons)
  • I'm hopeful the FileBlobStore enhancement you've made may improve our FileBlobStore throughput and also resolve the locking issue
  • Unfortunately, we can't upgrade to 1.25.x easily, as we're encountering ClassLoader issues with plugins

@aaime
Member

aaime commented Jun 7, 2024

1.24.4 should be released around the 18th of the month.

About the truncation being inefficient: we're aware of it and have some ideas, but are waiting for funding to show up. If you're up for making pull requests on your own, I'm happy to explain some of the most immediate changes that would improve truncate performance.

Classloader issues with plugins... are you using GWC along with GeoServer and with some community modules in the mix? On 1.25.x we just merged a rather large PR that overhauls how community modules are packaged, which should help in that respect: geoserver/geoserver#7714

@brettniven
Author

Nice! Yes, I can potentially contribute. I'm waiting to see where we land in the next couple of weeks. We have to scale for an expected load increase imminently. We may end up with an interim solution and then a future plan.

Yes, our setup is GWC embedded in GeoServer, with some community plugins. I believe my colleague may have asked about the ClassLoader issues, maybe on the mailing list. I need to catch up on that aspect.

@brettniven
Author

Some positive feedback on this one. With the previously mentioned fix for #880, in PR #1230, we now see substantial performance improvements with FileBlobStore: anywhere between 2x and 12x the throughput (there are so many variables in our setup that I can't put a precise figure on it, but this, with other config tweaks to allow us to scale, brings us close to the 12x mark). I've tested this with both the 1.24.x nightly builds and with 1.24.4 now that it's been released.

It also seems this thread deadlock should no longer be possible, as I can see that the code path in question should no longer be traversed. I can't verify that 100%, though, until we give this a solid hit-out for a decent period in our production environment.

@aaime closed this as completed Jun 29, 2024