
fix(migrations): split redis migrations into up and teardown #12983

Merged: 1 commit into master on May 6, 2024

Conversation

@nowNick (Contributor) commented May 6, 2024

Summary

"UP" phase should only contain non-destructive operations. "TEARDOWN" (or "FINISH") phase should be used to change/delete data.

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong, or the skip-changelog label has been added to the PR if a changelog is unnecessary (see README.md)
  • N/A There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

KAG-4419
#12978

@nowNick nowNick marked this pull request as ready for review May 6, 2024 14:54
@nowNick nowNick marked this pull request as draft May 6, 2024 14:54
@nowNick nowNick force-pushed the fix/split-redis-migrations-to-up-teardown branch from b9cddbd to b2489ca Compare May 6, 2024 14:55
@github-actions github-actions bot added the cherry-pick kong-ee label (schedule this PR for cherry-picking to kong/kong-ee) May 6, 2024
"UP" phase should only contain non-destructive operations.
"TEARDOWN" (or "FINISH") phase should be used to change/delete data.

KAG-4419
@nowNick nowNick force-pushed the fix/split-redis-migrations-to-up-teardown branch from b2489ca to 77d80b7 Compare May 6, 2024 15:00
@nowNick nowNick marked this pull request as ready for review May 6, 2024 15:16
@nowNick nowNick requested a review from hanshuebner May 6, 2024 15:26
@locao locao merged commit 49aa233 into master May 6, 2024
37 checks passed
@locao locao deleted the fix/split-redis-migrations-to-up-teardown branch May 6, 2024 18:14
@team-gateway-bot (Collaborator)

Successfully created cherry-pick PR for master:

@team-gateway-bot (Collaborator)

Successfully created backport PR for release/3.6.x:

@team-gateway-bot (Collaborator)

Successfully created backport PR for release/3.7.x:

@bungle (Member) commented May 6, 2024

@nowNick @locao @hanshuebner @samugi, I want to double-check whether this is re-entrant. I feel it is not. For example: upgrade to the latest version, add a couple of plugins (acme, rate-limiting), and then run kong migrations finish -f (or run it multiple times), OR run kong migrations up -f multiple times before running kong migrations finish. What happens then? (It might work, but it is not obvious.)

My concern is what happens if this is executed multiple times:

          UPDATE plugins
          SET config =
              jsonb_set(
                config,
                '{storage_config,redis}',
                config #> '{storage_config, redis}'
                || jsonb_build_object(
                  'password', config #> '{storage_config, redis, auth}',
                  'server_name', config #> '{storage_config, redis, ssl_server_name}',
                  'extra_options', jsonb_build_object(
                    'scan_count', config #> '{storage_config, redis, scan_count}',
                    'namespace', config #> '{storage_config, redis, namespace}'
                  )
                )
              )
            WHERE name = 'acme';

Especially when it is executed after:

            UPDATE plugins
            SET config =
              config
                #- '{storage_config,redis,auth}'
                #- '{storage_config,redis,ssl_server_name}'
                #- '{storage_config,redis,scan_count}'
                #- '{storage_config,redis,namespace}'
            WHERE name = 'acme';
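
For illustration, the hazard comes from how PostgreSQL handles missing paths: once the teardown has removed the old keys, config #> '{storage_config, redis, auth}' returns SQL NULL, jsonb_build_object turns that NULL into a JSON null, and the || merge then writes it over the already-migrated value. A minimal reproduction (a sketch, not code from this PR):

          -- after teardown the legacy 'auth' key is gone, so the path lookup yields NULL
          SELECT jsonb_build_object(
            'password',
            '{"storage_config":{"redis":{}}}'::jsonb #> '{storage_config,redis,auth}'
          );
          -- result: {"password": null}, which '||' would merge over the migrated password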

(To be honest, I do not fully understand why we even migrate this, as shorthands can do it without migrations. I guess this is a good example of things starting to go wrong when we try to add migrations, and of why we avoid running migrations on these JSON blobs in general.)

@StarlightIbuki (Contributor)

+1 with @bungle. The up part seems OK to execute multiple times, but not after the teardown has been executed: that would lose all the Redis settings. The statements need careful checks before they make any changes. Please fix this and then backport it, @nowNick.
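
A minimal sketch of such a guard (an assumption, not code from this PR): restrict the UPDATE to rows that still carry at least one of the legacy keys, so that running the up phase again after the teardown becomes a no-op instead of overwriting the migrated values with nulls.

          -- hypothetical re-entrant variant of the acme up statement
          UPDATE plugins
          SET config =
              jsonb_set(
                config,
                '{storage_config,redis}',
                config #> '{storage_config, redis}'
                || jsonb_build_object(
                  'password', config #> '{storage_config, redis, auth}',
                  'server_name', config #> '{storage_config, redis, ssl_server_name}',
                  'extra_options', jsonb_build_object(
                    'scan_count', config #> '{storage_config, redis, scan_count}',
                    'namespace', config #> '{storage_config, redis, namespace}'
                  )
                )
              )
          WHERE name = 'acme'
            -- only rows that still have any of the old keys are rewritten
            AND config #> '{storage_config,redis}' ?| ARRAY['auth', 'ssl_server_name', 'scan_count', 'namespace'];

A stricter variant could guard each copied field individually, but even this single condition would keep repeated runs of kong migrations up -f, and runs after kong migrations finish, from clobbering data that has already been moved.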

@StarlightIbuki (Contributor)

BTW, is it OK for new nodes to see the old fields in the config object? That will happen when the teardown has not been run yet.

@outsinre (Contributor) commented May 7, 2024

From a quick pass through the code, this PR follows the official guideline of splitting the migration into up and teardown: the non-destructive additions go into up and the destructive changes into teardown.

However, it does not resolve the timeout issues reported in #12978. Nowadays it is quite common for customers to have hundreds of rate-limiting plugin instances.

Maybe we should adopt @bungle's suggestion and use shorthand_fields, like this:

shorthand_fields = {

@hanshuebner (Contributor)

However, it does not resolve the timeout issues reported in #12978. Nowadays it is quite common for customers to have hundreds of rate-limiting plugin instances.

#12761 addresses the timeout issue in the sense that it makes the environment variable KONG_PG_TIMEOUT work in addition to the --db-timeout option. It may be worthwhile to change the default database timeout for command line operations to a higher value.

Maybe we should adopt @bungle's suggestion and use shorthand_fields, like this:

shorthand_fields = {

Can you elaborate on how shorthand_fields can help with the timeout issue?

@hanshuebner (Contributor)

Can you elaborate on how shorthand_fields can help with the timeout issue?

@bungle explained: We should generally avoid data migrations if possible, as they are scary and failure-prone. In this case, the new version can work with the old configuration values and shorthand_fields, so a data migration might not even be needed.

@bungle (Member) commented May 7, 2024

BTW, is it OK for new nodes to see the old fields in the config object? That will happen when the teardown has not been run yet.

It is perfectly fine. New nodes will remove unknown fields on read (which does not matter much, as new nodes have no code that uses the old fields; on read we sanitize the data according to the schemas). And with shorthands they should also be able to migrate old values to new ones, which has the benefit that new (and old) nodes work correctly even when the finish migrations have not been run yet.

@bungle (Member) commented May 7, 2024

We should generally avoid data migrations if possible, as they are scary and failure-prone. In this case, the new version can work with the old configuration values and shorthand_fields, so a data migration might not even be needed.

Yes, this is a bit debatable, as some people feel that leaving old data in the db is a kind of cruft we need to get rid of, or that it will cause other problems. The old values stay there only as long as the entity is not saved; when the entity is saved, the old (and now unknown) fields are removed from the db. Meanwhile we can migrate them over to the new fields, without even checking whether the new fields are already there because, as I explained, the old ones will be gone on save.

I also agree that we need to get rid of this "cruft", but perhaps that is best done in a major version. When we get rid of cruft like this in finish:

            UPDATE plugins
            SET config =
              config
                #- '{storage_config,redis,auth}'
                #- '{storage_config,redis,ssl_server_name}'
                #- '{storage_config,redis,scan_count}'
                #- '{storage_config,redis,namespace}'
            WHERE name = 'acme';

It is a destructive operation: now the only way to get back to an older Kong version is to restore the db. If we did not do this at all, it would just be a matter of switching the version back, and that would work even if the new fields already had the data copied over because, again, old nodes remove unknown fields on read.

We could also add a new command to remove the old fields (essentially we could just loop through the entities and call update), for example:

kong database vacuum

(I agree this fights against "let's keep everything clean", but it is quite practical: it enables more ways to recover when things go wrong, and frankly I have hardly ever heard of "the cruft" in the data causing us issues. So I am just weighing the benefits of destructive operations in minor versions; even in major versions we used to do these destructive things separated across two releases, i.e. in steps.)

@bungle (Member) commented May 7, 2024

Migrations can go wrong in many unexpected ways, especially data migrations. We have seen timeouts, locks harming production machines, processes getting killed mid-run, network issues, and Postgres version issues (a migration works on my machine but not on the customer's version of pg). We have also had to rewrite migrations because of bugs, and then somehow support the people who had already run the buggy migration, if possible, etc. The biggest issue for us is that we do not know the data: it can be anything from a couple of entities to perhaps millions of entities.
