
fix(migrations): split redis migrations into up and teardown #12983

Merged: 1 commit into master on May 6, 2024

Conversation

@nowNick (Contributor) commented May 6, 2024

Summary

"UP" phase should only contain non-destructive operations. "TEARDOWN" (or "FINISH") phase should be used to change/delete data.

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong, or the skip-changelog label has been added to the PR if a changelog is unnecessary (see README.md)
  • N/A There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

KAG-4419
#12978

@nowNick nowNick marked this pull request as ready for review May 6, 2024 14:54
@nowNick nowNick marked this pull request as draft May 6, 2024 14:54
@nowNick nowNick force-pushed the fix/split-redis-migrations-to-up-teardown branch from b9cddbd to b2489ca Compare May 6, 2024 14:55
@github-actions github-actions bot added the cherry-pick kong-ee label (schedule this PR for cherry-picking to kong/kong-ee) May 6, 2024
"UP" phase should only contain non-destructive operations.
"TEARDOWN" (or "FINISH") phase should be used to change/delete data.

KAG-4419
@nowNick nowNick force-pushed the fix/split-redis-migrations-to-up-teardown branch from b2489ca to 77d80b7 Compare May 6, 2024 15:00
@nowNick nowNick marked this pull request as ready for review May 6, 2024 15:16
@nowNick nowNick requested a review from hanshuebner May 6, 2024 15:26
@locao locao merged commit 49aa233 into master May 6, 2024
37 checks passed
@locao locao deleted the fix/split-redis-migrations-to-up-teardown branch May 6, 2024 18:14
@team-gateway-bot (Collaborator)

Successfully created cherry-pick PR for master:

@team-gateway-bot (Collaborator)

Successfully created backport PR for release/3.6.x:

@team-gateway-bot (Collaborator)

Successfully created backport PR for release/3.7.x:

@bungle (Member) commented May 6, 2024

@nowNick @locao @hanshuebner @samugi, I want to double-check whether this is re-entrant. I feel it is not. For example: upgrade to the latest version, add a couple of plugins (acme, rate-limiting), and then run kong migrations finish -f (or run it multiple times), OR run kong migrations up -f multiple times before running kong migrations finish. What happens then? (It might work, but it is not obvious.)

My concern is what happens if this is executed multiple times:

          UPDATE plugins
          SET config =
              jsonb_set(
                config,
                '{storage_config,redis}',
                config #> '{storage_config, redis}'
                || jsonb_build_object(
                  'password', config #> '{storage_config, redis, auth}',
                  'server_name', config #> '{storage_config, redis, ssl_server_name}',
                  'extra_options', jsonb_build_object(
                    'scan_count', config #> '{storage_config, redis, scan_count}',
                    'namespace', config #> '{storage_config, redis, namespace}'
                  )
                )
              )
            WHERE name = 'acme';

Especially when it is executed after:

            UPDATE plugins
            SET config =
              config
                #- '{storage_config,redis,auth}'
                #- '{storage_config,redis,ssl_server_name}'
                #- '{storage_config,redis,scan_count}'
                #- '{storage_config,redis,namespace}'
            WHERE name = 'acme';
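
For illustration, the hazard comes from how PostgreSQL handles missing paths: once the teardown has removed the old keys, config #> '{storage_config, redis, auth}' returns SQL NULL, jsonb_build_object turns that NULL into a JSON null, and the || merge then writes it over the already-migrated value. A minimal reproduction (a sketch, not code from this PR):

          -- after teardown the legacy 'auth' key is gone, so the path lookup yields NULL
          SELECT jsonb_build_object(
            'password',
            '{"storage_config":{"redis":{}}}'::jsonb #> '{storage_config,redis,auth}'
          );
          -- result: {"password": null}, which '||' would merge over the migrated password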

(To be honest, I do not fully understand why we even migrate this, as shorthands can do it without migrations. I guess this is a good example of things starting to go wrong when we try to add migrations, and of why we avoid running migrations on these JSON blobs in general.)

@StarlightIbuki (Contributor)

+1 with @bungle. The up part seems OK to execute multiple times, but not after the teardown has been executed: that would lose all the Redis settings. The statements need careful checks before they make any changes. Please fix this and then backport it, @nowNick.
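
A minimal sketch of such a guard (an assumption, not code from this PR): restrict the UPDATE to rows that still carry at least one of the legacy keys, so that running the up phase again after the teardown becomes a no-op instead of overwriting the migrated values with nulls.

          -- hypothetical re-entrant variant of the acme up statement
          UPDATE plugins
          SET config =
              jsonb_set(
                config,
                '{storage_config,redis}',
                config #> '{storage_config, redis}'
                || jsonb_build_object(
                  'password', config #> '{storage_config, redis, auth}',
                  'server_name', config #> '{storage_config, redis, ssl_server_name}',
                  'extra_options', jsonb_build_object(
                    'scan_count', config #> '{storage_config, redis, scan_count}',
                    'namespace', config #> '{storage_config, redis, namespace}'
                  )
                )
              )
          WHERE name = 'acme'
            -- only rows that still have any of the old keys are rewritten
            AND config #> '{storage_config,redis}' ?| ARRAY['auth', 'ssl_server_name', 'scan_count', 'namespace'];

A stricter variant could guard each copied field individually, but even this single condition would keep repeated runs of kong migrations up -f, and runs after kong migrations finish, from clobbering data that has already been moved.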

@StarlightIbuki (Contributor)

BTW, is it OK for new nodes to see the old fields in the config object? That will happen when the teardown has not been run yet.

@outsinre (Contributor) commented May 7, 2024

From a quick pass through the code, this PR follows the official guideline of splitting the migration into up and teardown: the non-destructive additions go into up and the destructive changes into teardown.

However, it does not resolve the timeout issues reported in #12978. Nowadays it is quite common for customers to have hundreds of rate-limiting plugin instances.

Maybe we should adopt @bungle's suggestion and use shorthand_fields, like this:

shorthand_fields = {

@hanshuebner (Contributor)

However, it does not resolve the timeout issues reported in #12978. Nowadays it is quite common for customers to have hundreds of rate-limiting plugin instances.

#12761 addresses the timeout issue in the sense that it makes the environment variable KONG_PG_TIMEOUT work in addition to the --db-timeout option. It may be worthwhile to change the default database timeout for command line operations to a higher value.

Maybe we should adopt @bungle's suggestion and use shorthand_fields, like this:

shorthand_fields = {

Can you elaborate on how shorthand_fields can help with the timeout issue?

@hanshuebner (Contributor)

Can you elaborate on how shorthand_fields can help with the timeout issue?

@bungle explained: We should generally avoid data migrations if possible, as they are scary and failure-prone. In this case, the new version can work with the old configuration values and shorthand_fields, so a data migration might not even be needed.

@bungle (Member) commented May 7, 2024

BTW, is it OK for new nodes to see the old fields in the config object? That will happen when the teardown has not been run yet.

It is perfectly fine. New nodes will remove unknown fields on read (which does not matter much, as new nodes have no code that uses the old fields; on read we sanitize the data according to the schemas). And with shorthands they should also be able to migrate old values to new ones, which has the benefit that new (and old) nodes work correctly even when the finish migrations have not been run yet.

@bungle (Member) commented May 7, 2024

We should generally avoid data migrations if possible, as they are scary and failure-prone. In this case, the new version can work with the old configuration values and shorthand_fields, so a data migration might not even be needed.

Yes, this is a bit debatable, as some people feel that leaving old data in the db is a kind of cruft we need to get rid of, or that it will cause other problems. The old values stay there only as long as the entity is not saved; when the entity is saved, the old (and now unknown) fields are removed from the db. Meanwhile we can migrate them over to the new fields, without even checking whether the new fields are already there because, as I explained, the old ones will be gone on save.

I also agree that we need to get rid of this "cruft", but perhaps that is best done in a major version. When we get rid of cruft like this in finish:

            UPDATE plugins
            SET config =
              config
                #- '{storage_config,redis,auth}'
                #- '{storage_config,redis,ssl_server_name}'
                #- '{storage_config,redis,scan_count}'
                #- '{storage_config,redis,namespace}'
            WHERE name = 'acme';

It is a destructive operation: now the only way to get back to an older Kong version is to restore the db. If we did not do this at all, it would just be a matter of switching the version back, and that would work even if the new fields already had the data copied over because, again, old nodes remove unknown fields on read.

We could also add a new command to remove the old fields (essentially we could just loop through the entities and call update), for example:

kong database vacuum

(I agree this fights against "let's keep everything clean", but it is quite practical: it enables more ways to recover when things go wrong, and frankly I have hardly ever heard of "the cruft" in the data causing us issues. So I am just weighing the benefits of destructive operations in minor versions; even in major versions we used to do these destructive things separated across two releases, i.e. in steps.)

@bungle (Member) commented May 7, 2024

Migrations can go wrong in many unexpected ways, especially data migrations. We have seen timeouts, locks harming production machines, processes getting killed mid-run, network issues, and Postgres version issues (a migration works on my machine but not on the customer's version of pg). We have also had to rewrite migrations because of bugs, and then somehow support the people who had already run the buggy migration, if possible, etc. The biggest issue for us is that we do not know the data: it can be anything from a couple of entities to perhaps millions of entities.
