Improve availability during pd rolling restart #8748

Open
lhy1024 opened this issue Oct 28, 2024 · 1 comment · May be fixed by #8749
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

Contributor

lhy1024 commented Oct 28, 2024

Enhancement Task

After a PD instance is restarted, it may take some time to load region information. If a PD instance becomes leader before its region information is fully loaded and up to date, it cannot serve region-related requests correctly.

When updating or upgrading PD, TiDB Operator restarts PD instances one by one: once the current PD instance is ready, the next one is restarted. However, this readiness check does not cover region information sync.

A case with problems:

  1. All PD instances have been running for a while, with all region information synced
  2. PD-2 is the leader, and an update operation is triggered
  3. TiDB Operator calls the PD API to transfer the leader from PD-2 to PD-1
  4. TiDB Operator restarts PD-2 and waits for PD-2 to be ready (but does not additionally wait for region information sync)
  5. TiDB Operator calls the PD API to transfer the leader from PD-1 to PD-2
  6. Because the region information in PD-2 is not yet synced, problems occur

At this point, PD-2 has been elected leader, but it cannot provide region query services until its regions are loaded. It can only provide TSO services.
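The gap between "ready" and "regions loaded" in the case above can be sketched in Go as follows. This is only an illustrative model: the `instance` type, its fields, and both functions are hypothetical and are not part of PD or TiDB Operator. PD-3 is added purely for the example.

```go
package main

import "fmt"

// instance is a hypothetical, simplified view of one PD member as an
// operator might track it. It is not a real PD or TiDB Operator type.
type instance struct {
	Name          string
	Ready         bool // passed the existing readiness check
	RegionsLoaded bool // finished loading region information (assumed signal)
}

// canBecomeLeader reports whether leadership may be transferred to in.
// Readiness alone is not enough: regions must also be loaded.
func canBecomeLeader(in instance) bool {
	return in.Ready && in.RegionsLoaded
}

// pickTransferTarget returns the first member other than the current
// leader that can safely take over leadership.
func pickTransferTarget(members []instance, current string) (string, bool) {
	for _, m := range members {
		if m.Name != current && canBecomeLeader(m) {
			return m.Name, true
		}
	}
	return "", false
}

func main() {
	members := []instance{
		{Name: "PD-1", Ready: true, RegionsLoaded: true},
		{Name: "PD-2", Ready: true, RegionsLoaded: false}, // just restarted
		{Name: "PD-3", Ready: true, RegionsLoaded: true},
	}
	// PD-1 currently leads. PD-2 is ready but is skipped as a target
	// because its regions are not loaded yet.
	target, ok := pickTransferTarget(members, "PD-1")
	fmt.Println(target, ok)
}
```

With only PD-1 and PD-2 in the list, no target is returned, which corresponds to waiting instead of transferring the leader back to PD-2 at step 5.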

The current workaround is to wait a while after each PD instance is restarted during the rolling restart, until region loading is complete, and only then let it become the leader.
Typically, for a cluster with 10 million regions, five to ten minutes is enough.

In the longer term, however, it would be more flexible for PD to provide an interface for querying whether region loading is complete.

lhy1024 added the type/enhancement label Oct 28, 2024
Contributor Author

lhy1024 commented Oct 28, 2024

Currently there are three options for where to expose this:

  1. Put it under a new API.
  2. Put it under /status. The name fits, but /status currently returns compiled (build) information.
  3. Put it under /health. /health queries the status of each node through the /ping interface and returns information about all the nodes, so unless we add a new API, it cannot report whether region loading is complete.

lhy1024 linked a pull request Oct 28, 2024 that will close this issue