Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug#180_Stuck_threads_due_to_WeakHashMap #181

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lananda
Copy link

@lananda lananda commented Jul 19, 2022

No description provided.

Copy link
Contributor

@gsmet gsmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I added a few comments.

private final WeakHashMap<ClassLoader, SoftReference<List<ValidationProvider<?>>>> providersPerClassloader =
new WeakHashMap<>();
private final Map<ClassLoader, SoftReference<List<ValidationProvider<?>>>> providersPerClassloader =
Collections.synchronizedMap(new WeakHashMap<ClassLoader, SoftReference<List<ValidationProvider<?>>>>());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you're not doing something wrong in your code? Because write operations to this map should be extremely rare and if you have issues, I wonder if you are not doing something wrong.
That being said, I think the change makes sense.

@yrodiere could you have a look at this change too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gsmet thank you for the comments
there are few issues related to WeakHashMap not being synchronized , I have checked with JDK team too

eclipse-ee4j/metro-jax-ws#61
spring-projects/spring-loaded#194
https://www.adam-bien.com/roller/abien/entry/endless_loops_in_unsychronized_weakhashmap

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying this is not an issue. I'm saying that you shouldn't have it as you shouldn't call the method accessing this often: my wild guess is that, if you're hitting this, you are creating ValidatorFactory instances far too often. Your ValidatorFactory should be created once and for all and shared. Creating a new ValidatorFactory each time will cost you as it won't have the bean metadata cached.

Comment on lines +173 to +194
@Test
public void testWeakHashMapSynchronization() {
try {
Validation validation = new Validation();
GenericBootstrap f1 = Validation.byDefaultProvider();
System.out.println(f1.getClass().getDeclaredField("defaultResolver"));
Class<?>[] classes = validation.getClass().getDeclaredClasses();
for (int i = 0; i < classes.length; i++) {
if (classes[i].getSimpleName().equals("GetValidationProviderListAction")) {
Field field = classes[i].getDeclaredField("providersPerClassloader");
field.setAccessible(true);
Constructor<?> constructor = classes[i].getDeclaredConstructor();
constructor.setAccessible(true);
Object obj = constructor.newInstance();
assertTrue(field.get(obj).getClass().getName().equals("java.util.Collections$SynchronizedMap"));
}
}
} catch (Exception e) {
e.printStackTrace();
fail();
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks but you can drop the test, it really has no value. If we were doing that for everything, it would just be a big nightmare to maintain.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay i will remove the test

Copy link
Contributor

@yrodiere yrodiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this is supposed to solve the problem exactly.

As far as I can see, the code that accesses this map is already synchronized:

		private synchronized List<ValidationProvider<?>> getCachedValidationProviders(ClassLoader classLoader) {
			SoftReference<List<ValidationProvider<?>>> ref = providersPerClassloader.get( classLoader );
			return ref != null ? ref.get() : null;
		}
		private synchronized void cacheValidationProviders(ClassLoader classLoader, List<ValidationProvider<?>> providers) {
			providersPerClassloader.put( classLoader, new SoftReference<>( providers ) );
		}

So, in theory, adding one more level of synchronization wouldn't hurt (except performance-wise), but I don't see how it could help?

Also... I'd really like more info about #180. It mentions "stuck threads", but if you're talking about a deadlock, there should be multiple threads involved, and I only see one in the issue description... ?

@lananda
Copy link
Author

lananda commented Jul 20, 2022

I'm not sure how this is supposed to solve the problem exactly.

As far as I can see, the code that accesses this map is already synchronized:

		private synchronized List<ValidationProvider<?>> getCachedValidationProviders(ClassLoader classLoader) {
			SoftReference<List<ValidationProvider<?>>> ref = providersPerClassloader.get( classLoader );
			return ref != null ? ref.get() : null;
		}
		private synchronized void cacheValidationProviders(ClassLoader classLoader, List<ValidationProvider<?>> providers) {
			providersPerClassloader.put( classLoader, new SoftReference<>( providers ) );
		}

So, in theory, adding one more level of synchronization wouldn't hurt (except performance-wise), but I don't see how it could help?

Also... I'd really like more info about #180. It mentions "stuck threads", but if you're talking about a deadlock, there should be multiple threads involved, and I only see one in the issue description... ?

Here (getCachedValidationProviders) the synchronization is on class monitor - this is not going to help this situation, the HashMap itself is not synchronized & that is leading to the issue.
this has been reviewed by JDK team too. they suggest the Validation.java WeakHashMap not being synchronized is the root cause.

@lananda
Copy link
Author

lananda commented Jul 20, 2022

I have updated the bug with stack of other deadlocked thread

@yrodiere
Copy link
Contributor

Here (getCachedValidationProviders) the synchronization is on class monitor - this is not going to help this situation, the HashMap itself is not synchronized & that is leading to the issue.

Those two methods are not synchronizing on the class, but on the GetValidationProviderListAction instance (they're instance methods). What you synchronize on should not matter, as long as all threads accessing the same resource use the same lock, and that's the case for getCachedValidationProviders and cacheValidationProviders.

That being said, I see there's a static clearCache method which bypasses the instance synchronization and indeed only synchronizes on the class. Is it possible that your problem is with that method?

this has been reviewed by JDK team too. they suggest the Validation.java WeakHashMap not being synchronized is the root cause.

I'm not disputing the fact that WeakHashMap is not thread-safe, which is what your leaks seem to point at. WeakHashMap is not thread-safe, that's a fact.

What I'm arguing is that we already synchronize access to that map, so adding Collections.synchronizedMap should not be necessary.

But as I said above, the clearCache method seems problematic, so if you can confirm that method caused your problem, fixing synchronization in that method should fix your problem as well.

@gsmet
Copy link
Contributor

gsmet commented Jul 20, 2022

Apart from @yrodiere 's comment, I would repeat what I initially said: these particular methods should generally be called once in the lifetime of an application.
I wonder if you are not creating EntityManagerFactory and ValidatorFactory far more often than you should be. I already said that several times but you're ignoring this point and I think it's important.

Not saying that we shouldn't fix a synchronization issue if we end up finding one - for now, it's not clear if there is one. But my point is that even if we had one, you shouldn't experience it, except if you're doing very weird things that I wouldn't recommend if you want your application to perform well.

@lananda
Copy link
Author

lananda commented Jul 20, 2022

Apart from @yrodiere 's comment, I would repeat what I initially said: these particular methods should generally be called once in the lifetime of an application. I wonder if you are not creating EntityManagerFactory and ValidatorFactory far more often than you should be. I already said that several times but you're ignoring this point and I think it's important.

Not saying that we shouldn't fix a synchronization issue if we end up finding one - for now, it's not clear if there is one. But my point is that even if we had one, you shouldn't experience it, except if you're doing very weird things that I wouldn't recommend if you want your application to perform well.

I have requested client to confirm on this API usage. I will check that.

@yrodiere
Copy link
Contributor

That being said, I see there's a static clearCache method which bypasses the instance synchronization and indeed only synchronizes on the class.

I opened #182 to fix that. But again, I'm not sure that's your problem, since your thread dump in #180 was incomplete :)

@lananda
Copy link
Author

lananda commented Jul 20, 2022

Here (getCachedValidationProviders) the synchronization is on class monitor - this is not going to help this situation, the HashMap itself is not synchronized & that is leading to the issue.

Those two methods are not synchronizing on the class, but on the GetValidationProviderListAction instance (they're instance methods). What you synchronize on should not matter, as long as all threads accessing the same resource use the same lock, and that's the case for getCachedValidationProviders and cacheValidationProviders.

That being said, I see there's a static clearCache method which bypasses the instance synchronization and indeed only synchronizes on the class. Is it possible that your problem is with that method?

this has been reviewed by JDK team too. they suggest the Validation.java WeakHashMap not being synchronized is the root cause.

I'm not disputing the fact that WeakHashMap is not thread-safe, which is what your leaks seem to point at. WeakHashMap is not thread-safe, that's a fact.

What I'm arguing is that we already synchronize access to that map, so adding Collections.synchronizedMap should not be necessary.

But as I said above, the clearCache method seems problematic, so if you can confirm that method caused your problem, fixing synchronization in that method should fix your problem as well.

Thanks, I have updated the stack, the issue is not happening during clearCache.

@lananda
Copy link
Author

lananda commented Oct 11, 2022

FullThreadDump_Validtion_Hashmap.txt

gsmet yrodiere
I have now attached the full thread dump for this issue. could you please review once.
I think the fix in clearCache is unrelated to the current issue, as this clearcache invocation is nowhere in the call stack & this issue is happening during initialization. i have checked the code EntityManagerFactory and ValidatorFactory are called only once in application lifecycle

@yrodiere
Copy link
Contributor

Thanks.

To clarify: the only reason I'm reluctant to merge this patch is that it doesn't seem to fix anything. I'd like to understand how the patch can fix anything, and ideally also the problem itself.

As you can clearly see from your own thread dump, access to the hash map is already protected by a lock:

	at java.util.WeakHashMap.get(WeakHashMap.java:403)
	at javax.validation.Validation$GetValidationProviderListAction.getCachedValidationProviders(Validation.java:368)
	- locked <0x00000005f44e3e10> (a javax.validation.Validation$GetValidationProviderListAction)

And after #182, that lock is being used by every method that could possibly access that map. So wrapping the map with Collections.synchronizedMap is unlikely to change anything to the situation; you'll just have two locks instead of one.

I think the fix in clearCache is unrelated to the current issue, as this clearcache invocation is nowhere in the call stack

As far as I understand, clearCache doesn't need to be visible in your thread dump; it's enough that someone called clearCache concurrently to your stuck thread at some point. I.e. the thread that called clearCache might have moved on, it's not the one being stuck.

i have checked the code EntityManagerFactory and ValidatorFactory are called only once in application lifecycle

That's good.

Did you also check that clearCache is never called until shutdown?

Be aware that this code will be invoked reflectively, see

/**
* Not a public API; it can be used reflectively by code that integrates with Jakarta Bean Validation, e.g. application
* servers, to clear the provider cache maintained by the default provider resolver.
* <p>
* This is a strictly unsupported API, its definition may be changed or removed at any time. Its purpose is to
* explore the addition of standardized methods for dealing with provider caching in a future revision of the spec.
*/
@SuppressWarnings("unused")
private static void clearDefaultValidationProviderResolverCache() {
GetValidationProviderListAction.INSTANCE.clearCache();
}

Also, considering the history behind clearCache, it's entirely possible that WebLogic clears the cache by directly retrieving the map through reflection, thereby bypassing the locks. If that's the case, I would suggest reporting a bug to WebLogic. If they go that far, they might as well call clearDefaultValidationProviderResolverCache reflectively, like other application servers such as WildFly do.

Finally... did you try to run your application with an updated bean-validation JAR that includes my patch (#182), to see if the problem still occurs?

@lananda
Copy link
Author

lananda commented Oct 11, 2022

this is a known issue about WeakHashMap, Here are few issues where WeakHashMap is not synchronized is reported
eclipse-ee4j/metro-jax-ws#61
spring-projects/spring-loaded#194
https://www.adam-bien.com/roller/abien/entry/endless_loops_in_unsychronized_weakhashmap

Here is the problem as it showed up in Nashorn.
https://bugs.openjdk.java.net/browse/JDK-8075006
fixed Nashorn.
http://hg.openjdk.java.net/jdk9/jdk9/nashorn/rev/78f82d897305

The Java SE API publicizes that WeakHashMap is not synchronized.
In this case, it is being called by
javax.validation.Validation$GetValidationProviderListAction.

Hence the fix is suggested at Validation. This been suggested after review from JDK developers.

Weblogic is not calling clearCache using reflection, We are unable to reproduce the issue at will, hence unable confirm the fix of #182

@yrodiere
Copy link
Contributor

this is a known issue about WeakHashMap

Yes, I've already followed your links and I understand the problem. The question is whether your patch would solve it, considering we already have an equivalent solution (synchronized keywords on methods accessing the map) in place.

fixed Nashorn.
http://hg.openjdk.java.net/jdk9/jdk9/nashorn/rev/78f82d897305

Great minds think alike, then, because they just synchronized all methods accessing the WeakHashMap. Which is what Bean Validation does, too, in javax.validation.Validation$GetValidationProviderListAction.

The Java SE API publicizes that WeakHashMap is not synchronized.

Indeed. We do synchronize our own calls to WeakHashMap, though.

In this case, it is being called by javax.validation.Validation$GetValidationProviderListAction.

... whose methods accessing the WeakHashMap are all synchronized.

Weblogic is not calling clearCache using reflection

Calling clearCache would be fine, at least after my patch. I was saying that if WebLogic clears the cache by directly retrieving the map through reflection, then that is a problem, and WebLogic shouldn't do that. I can't check the source code of WebLogic, though, so I don't know what's going on.

We are unable to reproduce the issue at will, hence unable confirm the fix of #182

Ok, so as I understand it, you can't confirm my patch solves your problem. I infer you cannot confirm that your patch solves the problem either.

That would be fine if I could understand how on earth your patch could solve the problem, but I don't. Let's hope someone else does and will be able to merge your patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants