Bug#180_Stuck_threads_due_to_WeakHashMap #181

lananda · 2022-07-19T10:04:03Z

No description provided.

gsmet

Thanks. I added a few comments.

gsmet · 2022-07-19T10:44:02Z

src/main/java/jakarta/validation/Validation.java

-		private final WeakHashMap<ClassLoader, SoftReference<List<ValidationProvider<?>>>> providersPerClassloader =
-				new WeakHashMap<>();
+		private final Map<ClassLoader, SoftReference<List<ValidationProvider<?>>>> providersPerClassloader =
+				Collections.synchronizedMap(new WeakHashMap<ClassLoader, SoftReference<List<ValidationProvider<?>>>>());


Are you sure you're not doing something wrong in your code? Because write operations to this map should be extremely rare and if you have issues, I wonder if you are not doing something wrong.
That being said, I think the change makes sense.

@yrodiere could you have a look at this change too?

@gsmet thank you for the comments
there are few issues related to WeakHashMap not being synchronized , I have checked with JDK team too

eclipse-ee4j/metro-jax-ws#61
spring-projects/spring-loaded#194
https://www.adam-bien.com/roller/abien/entry/endless_loops_in_unsychronized_weakhashmap

I'm not saying this is not an issue. I'm saying that you shouldn't have it as you shouldn't call the method accessing this often: my wild guess is that, if you're hitting this, you are creating ValidatorFactory instances far too often. Your ValidatorFactory should be created once and for all and shared. Creating a new ValidatorFactory each time will cost you as it won't have the bean metadata cached.

gsmet · 2022-07-19T10:44:48Z

src/test/java/jakarta/validation/ValidationTest.java

+	@Test
+	public void testWeakHashMapSynchronization() {
+		try {
+			Validation validation = new Validation();
+			GenericBootstrap f1 = Validation.byDefaultProvider();
+			System.out.println(f1.getClass().getDeclaredField("defaultResolver"));
+			Class<?>[] classes = validation.getClass().getDeclaredClasses();
+			for (int i = 0; i < classes.length; i++) {
+				if (classes[i].getSimpleName().equals("GetValidationProviderListAction")) {
+					Field field = classes[i].getDeclaredField("providersPerClassloader");
+					field.setAccessible(true);
+					Constructor<?> constructor = classes[i].getDeclaredConstructor();
+					constructor.setAccessible(true);
+					Object obj = constructor.newInstance();
+					assertTrue(field.get(obj).getClass().getName().equals("java.util.Collections$SynchronizedMap"));
+				}
+			}
+		} catch (Exception e) {
+			e.printStackTrace();
+			fail();
+		} 
+	}


Thanks but you can drop the test, it really has no value. If we were doing that for everything, it would just be a big nightmare to maintain.

okay i will remove the test

yrodiere

I'm not sure how this is supposed to solve the problem exactly.

As far as I can see, the code that accesses this map is already synchronized:

		private synchronized List<ValidationProvider<?>> getCachedValidationProviders(ClassLoader classLoader) {
			SoftReference<List<ValidationProvider<?>>> ref = providersPerClassloader.get( classLoader );
			return ref != null ? ref.get() : null;
		}
		private synchronized void cacheValidationProviders(ClassLoader classLoader, List<ValidationProvider<?>> providers) {
			providersPerClassloader.put( classLoader, new SoftReference<>( providers ) );
		}

So, in theory, adding one more level of synchronization wouldn't hurt (except performance-wise), but I don't see how it could help?

Also... I'd really like more info about #180. It mentions "stuck threads", but if you're talking about a deadlock, there should be multiple threads involved, and I only see one in the issue description... ?

lananda · 2022-07-20T09:00:50Z

I'm not sure how this is supposed to solve the problem exactly.

As far as I can see, the code that accesses this map is already synchronized:
		private synchronized List<ValidationProvider<?>> getCachedValidationProviders(ClassLoader classLoader) {
			SoftReference<List<ValidationProvider<?>>> ref = providersPerClassloader.get( classLoader );
			return ref != null ? ref.get() : null;
		}
		private synchronized void cacheValidationProviders(ClassLoader classLoader, List<ValidationProvider<?>> providers) {
			providersPerClassloader.put( classLoader, new SoftReference<>( providers ) );
		}
So, in theory, adding one more level of synchronization wouldn't hurt (except performance-wise), but I don't see how it could help?

Also... I'd really like more info about #180. It mentions "stuck threads", but if you're talking about a deadlock, there should be multiple threads involved, and I only see one in the issue description... ?

Here (getCachedValidationProviders) the synchronization is on class monitor - this is not going to help this situation, the HashMap itself is not synchronized & that is leading to the issue.
this has been reviewed by JDK team too. they suggest the Validation.java WeakHashMap not being synchronized is the root cause.

lananda · 2022-07-20T09:06:52Z

I have updated the bug with stack of other deadlocked thread

yrodiere · 2022-07-20T09:23:33Z

Here (getCachedValidationProviders) the synchronization is on class monitor - this is not going to help this situation, the HashMap itself is not synchronized & that is leading to the issue.

Those two methods are not synchronizing on the class, but on the GetValidationProviderListAction instance (they're instance methods). What you synchronize on should not matter, as long as all threads accessing the same resource use the same lock, and that's the case for getCachedValidationProviders and cacheValidationProviders.

That being said, I see there's a static clearCache method which bypasses the instance synchronization and indeed only synchronizes on the class. Is it possible that your problem is with that method?

this has been reviewed by JDK team too. they suggest the Validation.java WeakHashMap not being synchronized is the root cause.

I'm not disputing the fact that WeakHashMap is not thread-safe, which is what your leaks seem to point at. WeakHashMap is not thread-safe, that's a fact.

What I'm arguing is that we already synchronize access to that map, so adding Collections.synchronizedMap should not be necessary.

But as I said above, the clearCache method seems problematic, so if you can confirm that method caused your problem, fixing synchronization in that method should fix your problem as well.

gsmet · 2022-07-20T09:31:15Z

Apart from @yrodiere 's comment, I would repeat what I initially said: these particular methods should generally be called once in the lifetime of an application.
I wonder if you are not creating EntityManagerFactory and ValidatorFactory far more often than you should be. I already said that several times but you're ignoring this point and I think it's important.

Not saying that we shouldn't fix a synchronization issue if we end up finding one - for now, it's not clear if there is one. But my point is that even if we had one, you shouldn't experience it, except if you're doing very weird things that I wouldn't recommend if you want your application to perform well.

lananda · 2022-07-20T09:33:42Z

Apart from @yrodiere 's comment, I would repeat what I initially said: these particular methods should generally be called once in the lifetime of an application. I wonder if you are not creating EntityManagerFactory and ValidatorFactory far more often than you should be. I already said that several times but you're ignoring this point and I think it's important.

Not saying that we shouldn't fix a synchronization issue if we end up finding one - for now, it's not clear if there is one. But my point is that even if we had one, you shouldn't experience it, except if you're doing very weird things that I wouldn't recommend if you want your application to perform well.

I have requested client to confirm on this API usage. I will check that.

yrodiere · 2022-07-20T09:39:47Z

That being said, I see there's a static clearCache method which bypasses the instance synchronization and indeed only synchronizes on the class.

I opened #182 to fix that. But again, I'm not sure that's your problem, since your thread dump in #180 was incomplete :)

lananda · 2022-07-20T09:42:45Z

Here (getCachedValidationProviders) the synchronization is on class monitor - this is not going to help this situation, the HashMap itself is not synchronized & that is leading to the issue.

Those two methods are not synchronizing on the class, but on the GetValidationProviderListAction instance (they're instance methods). What you synchronize on should not matter, as long as all threads accessing the same resource use the same lock, and that's the case for getCachedValidationProviders and cacheValidationProviders.

That being said, I see there's a static clearCache method which bypasses the instance synchronization and indeed only synchronizes on the class. Is it possible that your problem is with that method?

this has been reviewed by JDK team too. they suggest the Validation.java WeakHashMap not being synchronized is the root cause.

I'm not disputing the fact that WeakHashMap is not thread-safe, which is what your leaks seem to point at. WeakHashMap is not thread-safe, that's a fact.

What I'm arguing is that we already synchronize access to that map, so adding Collections.synchronizedMap should not be necessary.

But as I said above, the clearCache method seems problematic, so if you can confirm that method caused your problem, fixing synchronization in that method should fix your problem as well.

Thanks, I have updated the stack, the issue is not happening during clearCache.

lananda · 2022-10-11T06:13:36Z

FullThreadDump_Validtion_Hashmap.txt

gsmet yrodiere
I have now attached the full thread dump for this issue. could you please review once.
I think the fix in clearCache is unrelated to the current issue, as this clearcache invocation is nowhere in the call stack & this issue is happening during initialization. i have checked the code EntityManagerFactory and ValidatorFactory are called only once in application lifecycle

yrodiere · 2022-10-11T08:49:56Z

Thanks.

To clarify: the only reason I'm reluctant to merge this patch is that it doesn't seem to fix anything. I'd like to understand how the patch can fix anything, and ideally also the problem itself.

As you can clearly see from your own thread dump, access to the hash map is already protected by a lock:

	at java.util.WeakHashMap.get(WeakHashMap.java:403)
	at javax.validation.Validation$GetValidationProviderListAction.getCachedValidationProviders(Validation.java:368)
	- locked <0x00000005f44e3e10> (a javax.validation.Validation$GetValidationProviderListAction)

And after #182, that lock is being used by every method that could possibly access that map. So wrapping the map with Collections.synchronizedMap is unlikely to change anything to the situation; you'll just have two locks instead of one.

I think the fix in clearCache is unrelated to the current issue, as this clearcache invocation is nowhere in the call stack

As far as I understand, clearCache doesn't need to be visible in your thread dump; it's enough that someone called clearCache concurrently to your stuck thread at some point. I.e. the thread that called clearCache might have moved on, it's not the one being stuck.

i have checked the code EntityManagerFactory and ValidatorFactory are called only once in application lifecycle

That's good.

Did you also check that clearCache is never called until shutdown?

Be aware that this code will be invoked reflectively, see

validation/src/main/java/jakarta/validation/Validation.java

Lines 157 to 167 in 1aac339

    
           	/** 
        
           	 * Not a public API; it can be used reflectively by code that integrates with Jakarta Bean Validation, e.g. application 
        
           	 * servers, to clear the provider cache maintained by the default provider resolver. 
        
           	 * <p> 
        
           	 * This is a strictly unsupported API, its definition may be changed or removed at any time. Its purpose is to 
        
           	 * explore the addition of standardized methods for dealing with provider caching in a future revision of the spec. 
        
           	 */ 
        
           	@SuppressWarnings("unused") 
        
           	private static void clearDefaultValidationProviderResolverCache() { 
        
           		GetValidationProviderListAction.INSTANCE.clearCache(); 
        
           	}

Also, considering the history behind clearCache, it's entirely possible that WebLogic clears the cache by directly retrieving the map through reflection, thereby bypassing the locks. If that's the case, I would suggest reporting a bug to WebLogic. If they go that far, they might as well call clearDefaultValidationProviderResolverCache reflectively, like other application servers such as WildFly do.

Finally... did you try to run your application with an updated bean-validation JAR that includes my patch (#182), to see if the problem still occurs?

lananda · 2022-10-11T11:59:55Z

this is a known issue about WeakHashMap, Here are few issues where WeakHashMap is not synchronized is reported
eclipse-ee4j/metro-jax-ws#61
spring-projects/spring-loaded#194
https://www.adam-bien.com/roller/abien/entry/endless_loops_in_unsychronized_weakhashmap

Here is the problem as it showed up in Nashorn.
https://bugs.openjdk.java.net/browse/JDK-8075006
fixed Nashorn.
http://hg.openjdk.java.net/jdk9/jdk9/nashorn/rev/78f82d897305

The Java SE API publicizes that WeakHashMap is not synchronized.
In this case, it is being called by
javax.validation.Validation$GetValidationProviderListAction.

Hence the fix is suggested at Validation. This been suggested after review from JDK developers.

Weblogic is not calling clearCache using reflection, We are unable to reproduce the issue at will, hence unable confirm the fix of #182

yrodiere · 2022-10-18T10:06:54Z

this is a known issue about WeakHashMap

Yes, I've already followed your links and I understand the problem. The question is whether your patch would solve it, considering we already have an equivalent solution (synchronized keywords on methods accessing the map) in place.

fixed Nashorn.
http://hg.openjdk.java.net/jdk9/jdk9/nashorn/rev/78f82d897305

Great minds think alike, then, because they just synchronized all methods accessing the WeakHashMap. Which is what Bean Validation does, too, in javax.validation.Validation$GetValidationProviderListAction.

The Java SE API publicizes that WeakHashMap is not synchronized.

Indeed. We do synchronize our own calls to WeakHashMap, though.

In this case, it is being called by javax.validation.Validation$GetValidationProviderListAction.

... whose methods accessing the WeakHashMap are all synchronized.

Weblogic is not calling clearCache using reflection

Calling clearCache would be fine, at least after my patch. I was saying that if WebLogic clears the cache by directly retrieving the map through reflection, then that is a problem, and WebLogic shouldn't do that. I can't check the source code of WebLogic, though, so I don't know what's going on.

We are unable to reproduce the issue at will, hence unable confirm the fix of #182

Ok, so as I understand it, you can't confirm my patch solves your problem. I infer you cannot confirm that your patch solves the problem either.

That would be fine if I could understand how on earth your patch could solve the problem, but I don't. Let's hope someone else does and will be able to merge your patch.

Bug#180_Stuck_threads_due_to_WeakHashMap

76cce18

gsmet requested changes Jul 19, 2022

View reviewed changes

yrodiere suggested changes Jul 19, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug#180_Stuck_threads_due_to_WeakHashMap #181

Bug#180_Stuck_threads_due_to_WeakHashMap #181

lananda commented Jul 19, 2022

gsmet left a comment

gsmet Jul 19, 2022

lananda Jul 19, 2022

gsmet Jul 19, 2022

gsmet Jul 19, 2022

lananda Jul 19, 2022

yrodiere left a comment

lananda commented Jul 20, 2022

lananda commented Jul 20, 2022

yrodiere commented Jul 20, 2022

gsmet commented Jul 20, 2022

lananda commented Jul 20, 2022

yrodiere commented Jul 20, 2022

lananda commented Jul 20, 2022

lananda commented Oct 11, 2022

yrodiere commented Oct 11, 2022

lananda commented Oct 11, 2022

yrodiere commented Oct 18, 2022

Bug#180_Stuck_threads_due_to_WeakHashMap #181

Are you sure you want to change the base?

Bug#180_Stuck_threads_due_to_WeakHashMap #181

Conversation

lananda commented Jul 19, 2022

gsmet left a comment

Choose a reason for hiding this comment

gsmet Jul 19, 2022

Choose a reason for hiding this comment

lananda Jul 19, 2022

Choose a reason for hiding this comment

gsmet Jul 19, 2022

Choose a reason for hiding this comment

gsmet Jul 19, 2022

Choose a reason for hiding this comment

lananda Jul 19, 2022

Choose a reason for hiding this comment

yrodiere left a comment

Choose a reason for hiding this comment

lananda commented Jul 20, 2022

lananda commented Jul 20, 2022

yrodiere commented Jul 20, 2022

gsmet commented Jul 20, 2022

lananda commented Jul 20, 2022

yrodiere commented Jul 20, 2022

lananda commented Jul 20, 2022

lananda commented Oct 11, 2022

yrodiere commented Oct 11, 2022

lananda commented Oct 11, 2022

yrodiere commented Oct 18, 2022