None recovering Redis command timed out #3010

mshahmoradi87 · 2024-10-11T12:07:49Z

Bug Report

One container starts to have "Redis command timed out" and it does not recover.

Current Behavior

we have started to use aws redis serveless, since the switch from redis cluster-mode enabled to redis serverless, out of the blue one client from many tasks running starts to have command timed out continuously. Depending on the service and size of objects and if put or get commands are being timed out, we see different side effects, sometimes it results in #705, sometimes only increased latency.

Note that in the services this happens, usually there are more than 12 tasks running, and only one container has this issue.

I could correlate any CPU spike with the event causing the timeouts.

org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 5 second(s)
	at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:70)
	at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:41)
	at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
	at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:42)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:277)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.await(LettuceConnection.java:1085)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.lambda$doInvoke$4(LettuceConnection.java:938)
	at org.springframework.data.redis.connection.lettuce.LettuceInvoker$Synchronizer.invoke(LettuceInvoker.java:665)
	at org.springframework.data.redis.connection.lettuce.LettuceInvoker.just(LettuceInvoker.java:125)
	at org.springframework.data.redis.connection.lettuce.LettuceHashCommands.hSet(LettuceHashCommands.java:61)
	at org.springframework.data.redis.connection.DefaultedRedisConnection.hSet(DefaultedRedisConnection.java:1332)
	at org.springframework.data.redis.connection.DefaultStringRedisConnection.hSet(DefaultStringRedisConnection.java:631)
	at org.springframework.data.redis.core.DefaultHashOperations.lambda$put$14(DefaultHashOperations.java:254)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:224)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:191)
	at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:97)
	at org.springframework.data.redis.core.DefaultHashOperations.put(DefaultHashOperations.java:253)



	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 5 second(s)
	at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:59)
	at io.lettuce.core.internal.Futures.awaitOrCancel(Futures.java:246)
	at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:74)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.await(LettuceConnection.java:1083)
	... 22 common frames omitted`

Expected behavior/code

the expected behaviour is that with periodic_refresh, the issue does not persist.

Environment

aws redis serverless 7.1
<lettuce.version>6.3.2.RELEASE</lettuce.version>
<netty.version>4.1.109.Final</netty.version>

related configurations:
ssl: true
timeout: 5s
connect-timeout: 500ms
additional:
dns-ttl: 5s
periodic-refresh: 10s
reconnect-delay-min: 100ms
reconnect-delay-max: 5s
read-from: replicaPreferred

Possible Solution

RedisCommandTimeoutException intermittently:
in some cases this could be related, as we sometimes see timeouts for a minute, and they are recovered, however in this case it persists.

Lettuce cannot recover from connection problems:
maybe setting TCP_USER_TIMEOUT would help, default is 1 minutes, so for the recovering case, maybe this helps, but for the cases that this continues and gets stacked probably not.

The text was updated successfully, but these errors were encountered:

tishun added for: team-attention An issue we need to discuss as a team to make progress status: waiting-for-triage labels Oct 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

None recovering Redis command timed out #3010

None recovering Redis command timed out #3010

mshahmoradi87 commented Oct 11, 2024

None recovering Redis command timed out #3010

None recovering Redis command timed out #3010

Comments

mshahmoradi87 commented Oct 11, 2024

Bug Report

Current Behavior

Expected behavior/code

Environment

Possible Solution