Handle Elasticache upgrade with low downtime #1907

Open
JulesClaussen opened this issue Aug 20, 2024 · 0 comments

Hey everyone,

We have been using ioredis for a while now, and it works fine except for one specific point.
We run ElastiCache (Redis OSS) on AWS with cluster mode disabled, Multi-AZ enabled, and automatic failover enabled.
We have one primary node and one replica node.
We are using ioredis 5.2.4.

The client configuration is quite basic (TypeScript, Nest application):

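new Redis({
    host: env.get('REDIS_HOSTNAME'),
    port: env.get('REDIS_PORT'),
    password: env.get('REDIS_PASSWORD'),
    // An empty object enables TLS with default settings
    tls: env.get('REDIS_TLS') === 'true' ? {} : undefined,
    // Only select a database when dbEnv holds a number
    ...(!!dbEnv && isNumber(dbEnv) && { db: parseInt(dbEnv, 10) }),
});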

Where REDIS_HOSTNAME is the primary endpoint from AWS.
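
For readers who want to reproduce the setup, here is a minimal sketch of how such an options object could be built from plain environment values. The helper name `buildRedisOptions`, the `REDIS_DB` variable name, and the fallback defaults are assumptions for illustration and are not taken from the application above; the option shape is declared locally so the sketch compiles without importing ioredis.

```typescript
// Subset of the connection options used in this issue, declared locally.
interface RedisConnectionOptions {
  host: string;
  port: number;
  password?: string | undefined;
  tls?: object | undefined;
  db?: number | undefined;
}

// Hypothetical environment shape; REDIS_DB is an assumed variable name.
interface RedisEnv {
  REDIS_HOSTNAME?: string;
  REDIS_PORT?: string;
  REDIS_PASSWORD?: string;
  REDIS_TLS?: string;
  REDIS_DB?: string;
}

// Hypothetical helper: turns string environment values into typed options.
function buildRedisOptions(env: RedisEnv): RedisConnectionOptions {
  const db = Number.parseInt(env.REDIS_DB ?? '', 10);
  return {
    host: env.REDIS_HOSTNAME ?? 'localhost',
    port: Number.parseInt(env.REDIS_PORT ?? '6379', 10),
    password: env.REDIS_PASSWORD,
    // An empty object enables TLS with default settings
    tls: env.REDIS_TLS === 'true' ? {} : undefined,
    // Only select a database when REDIS_DB parses to a non-negative integer
    ...(Number.isInteger(db) && db >= 0 ? { db } : {}),
  };
}
```

The result would be passed to the client constructor in place of the inline object literal.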

Whenever we upgrade Redis (even for minor and release versions), Redis is unavailable for about 10 minutes. The upgrade takes around 30 minutes in total, and for roughly 10 of those minutes commands fail with errors such as:

-READONLY You can't write against a read only replica.
    at parseError (/app/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/app/node_modules/redis-parser/lib/parser.js:302:14)

We have tried using reconnectOnError, but without success:

reconnectOnError(err) {
    const targetError = 'READONLY';
    if (err.message.includes(targetError)) {
        return true;
    }
    return false;
},
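
One variation that may be worth trying: the ioredis README states that returning `2` from `reconnectOnError` reconnects and also resends the command that failed, whereas `true` (or `1`) only reconnects. Below is a minimal sketch of that variant as a standalone function; the name `reconnectOnReadonly` is made up for illustration, and whether it shortens the outage during an ElastiCache upgrade has not been verified here.

```typescript
// Return values accepted by the `reconnectOnError` option per the ioredis README.
type ReconnectDecision = boolean | 1 | 2;

// Hypothetical handler: reconnect and resend the failed command on READONLY.
function reconnectOnReadonly(err: Error): ReconnectDecision {
  // READONLY means the connection currently points at a read-only replica.
  return err.message.includes('READONLY') ? 2 : false;
}
```

It would be wired in as `reconnectOnError: reconnectOnReadonly`. Note that a fresh connection can still reach the same read-only node if the primary endpoint has not yet been repointed, so this alone may not be enough.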

As we understand the documentation, retryStrategy is supposed to reconnect after a minute, so we haven't tried setting it.
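
For comparison, the default `retryStrategy` shown in the ioredis README is, as best I recall, a short linear backoff capped at 2 seconds rather than a one-minute wait; this is worth checking against the installed version. A sketch of that function:

```typescript
// Delay in milliseconds before reconnection attempt number `times`:
// 50 ms per attempt so far, capped at 2000 ms.
function retryStrategy(times: number): number {
  return Math.min(times * 50, 2000);
}
```

As far as I understand it, `retryStrategy` is only consulted when the connection is lost, so it would not fire on a READONLY reply from a node that is still reachable.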

Is there a way to handle this, or is it currently not possible?
Also, is there an easy way to test this? Running a failover manually from the AWS console does not reproduce the issue for some reason: in that case the failover happens quickly and the application only fails for about a minute.
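
One way to exercise the application-side handling without touching AWS is to simulate the error with a stub. The sketch below does not reproduce the ElastiCache upgrade itself; it only shows a writer that fails with READONLY a fixed number of times and a retry wrapper around it. Both `makeFlakyWriter` and `retryOnReadonly` are hypothetical names, not part of ioredis.

```typescript
// Stub standing in for a Redis write: throws READONLY for the first
// `failures` calls, then succeeds.
function makeFlakyWriter(failures: number): () => Promise<string> {
  let calls = 0;
  return async () => {
    calls += 1;
    if (calls <= failures) {
      throw new Error("READONLY You can't write against a read only replica.");
    }
    return 'OK';
  };
}

// Retries `op` on READONLY errors, waiting `delayMs` between attempts,
// and gives up after `maxAttempts` attempts in total.
async function retryOnReadonly<T>(
  op: () => Promise<T>,
  maxAttempts: number,
  delayMs: number,
): Promise<T> {
  for (let attempt = 1; ; attempt += 1) {
    try {
      return await op();
    } catch (err) {
      const readonly = err instanceof Error && err.message.includes('READONLY');
      // Rethrow unrelated errors, or READONLY once the attempts are used up.
      if (!readonly || attempt >= maxAttempts) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

For example, `retryOnReadonly(makeFlakyWriter(2), 5, 100)` resolves on the third attempt. Against a local Redis, issuing `REPLICAOF` on the node the client is connected to should, as far as I know, produce the same READONLY reply for writes, which may be a closer reproduction than the stub.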

Cheers,
Jules
