You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This issue is to explain how I wish to implement fast revert for ADD|DROP PARTITION operations. Some initial context: #10315
The problem
Consider the following range partitioned table:
CREATETABLEtp (
id INTNOT NULL,
ts TIMESTAMPNOT NULL DEFAULT CURRENT_TIMESTAMP,
primary key (id)
)
PARTITION BY RANGE (id) (
PARTITION p1 VALUES LESS THAN (10),
PARTITION p2 VALUES LESS THAN (20),
PARTITION p3 VALUES LESS THAN (30),
PARTITION p4 VALUES LESS THAN (40),
PARTITION p5 VALUES LESS THAN (50),
PARTITION p6 VALUES LESS THAN (60)
);
Assuming the table is very large, and assuming the user wants to routinely rotate it, we can expect periodic ALTER TABLE ... DROP PARTITION and ALTER TABLE ... ADD PARTITION DDLs, which the user expect to operate very quickly. Moreover, in the spirit of Online DDL, we want to be able to revert such statements.
Nothing is trivial here and the problems begin with almost the simplest query.
We cannot use good-old online schema change to add/drop partitions because it's slow
The canonical way for OSC is to create a new empty shadow table, alter that table, then copy over the data from the original table. If the table is large, then the operation is slow. But range partitioning's greatest incentive is that we can operate on smaller tables (partitions), and where operations are proportional to the size of affected partitions, not of the entire table. In particular, ALTER TABLE ... ADD PARTITION is a near instant operation. But OSC can take hours to complete it.
We cannot use good-old online schema change to drop partitions because it's incorrect
This is true for all OSC tools. Consider that we issue a ALTER TABLE tp DROP PARTITION p1 on our sample table above. Alkso, consider the table is populated, and has an id value of 2, which belongs in p1.
The user assumes that dropping p1 also drops the row where id = 2. But in an OSC we build a new table:
CREATETABLEtp (
id INTNOT NULL,
ts TIMESTAMPNOT NULL DEFAULT CURRENT_TIMESTAMP,
primary key (id)
)
PARTITION BY RANGE (id) (
PARTITION p2 VALUES LESS THAN (20),
PARTITION p3 VALUES LESS THAN (30),
PARTITION p4 VALUES LESS THAN (40),
PARTITION p5 VALUES LESS THAN (50),
PARTITION p6 VALUES LESS THAN (60)
);
(p1 is missing) and then copy data over from the original table. The partition p1 now comfortably hosts our row (id = 2 is LESS THAN (20)). We end up with no rows removed!
We therefore need to have a special support for DROP PARTITION statements
But as we'll soon show, we also need a special support for ADD PARTITION. But, before that:
DROP PARTITION is not fast
In MySQL, DROP PARTITION maps to a DROP TABLE. This operation is proportional to the size of the table. Even with recent MySQL 8.0.23 fix to DROP TABLE performance, there is still the issue of removing the table from the filesyste, In MySQL, a DROP PARTITION locks the entire table for writes for the duration of the operation. This can be seconds or minutes. This is too much. We want Online DDL to be non-blocking.
DROP PARTITION is not revertible
Vitess's REVERT not only brings back the previous schema, but also the data. But a DROP PARTITION loses the data. We want to be able to keep hold of it.
Proposal: fast and safe DROP PARTITION implementation
We turn DROP PARTITION into the following multi-step operation. Illustrated by our sample table and query ALTER TABLE tp DROP PARTITION p1:
CREATE TABLE _shadow_p1 LIKE tp
ALTER TABLE _shadow_p1 REMOVE PARTITIONING
ALTER TABLE tp EXCHANGE PARTITION p1 WITH TABLE _shadow_p1
ALTER TABLE tp DROP PARTITION p1
The above sequence of events is fast and safe. We swap the existing partition with an empty table (fast). We then drop the now-empty partition (fast) and are left with a copy of the data (safe).
_shadow_p1 is a pseudocode name for the table. In reality, this will be a _vt_HOLD_* table and will be garbage collected at a later stage.
DROP PARTITION has no counter-query
We'd like to think that the counter-query to a ALTER TABLE ... DROP PARTITION is a ALTER TABLE ... ADD PARTITION, but that is incorrect in range partitioned tables. You may only ADD PARTITION to the end of the partition range, and so this can only (kind of) work if you dropped the last partition_.
Proposal: fast and safe REVERT for DROP PARTITION (last partition) implementation
Assuming we just dropped the last partition, the REVERT is not just about creating a new partition in its place, but also about populating it. Suggested flow for reverting ALTER TABLE tp DROP PARTITION p6 (last partition inour table):
ALTER TABLE tp ADD PARTITION p6 (PARTITION p6 VALUES LESS THAN (60))
ALTER TABLE tp EXCHANGE PARTITION p6 WITH TABLE _shadow_p6
DROP TABLE _shadow_p6
(this assumes _shadow_p6 was created by a previous DROP PARTITION)
This is a fast and safe operation. It is not atomic: there can be a moment where the user reads empty data from p6 just as it is created and before swapped with _shadow_p6. But this behavior is consistent with not having a partition in the first place.
Important notes:
The table has no memory of p6 and has lost its definition. We must be able to retain the definition PARTITION p6 VALUES LESS THAN (60) somewhere.
Because the operation is not atomic, we need to audit our steps so that if the tablet crashes between ADD PARTITION and EXCHANGE PARTITION, the successor tablet can still complete the opration from that point.
Proposal: fast and safe REVERT for DROP PARTITION (non-last partition) implementation
If we drop p1, we can't ADD it back. Same for p2 though p5. To recreate the partitions we'd have to use a REORGANIZE PARTITION statement. But that is proportional to the partition size and can take a long time.
We now illustrate the steps to REVERT a ALTER TABLE tp DROP PARTITION p1 migration:
CREATE TABLE _shadow_p1 LIKE tp
ALTER TABLE _shadow_p1 REMOVE PARTITIONING
CREATE TABLE _shadow_p2 LIKE tp
ALTER TABLE _shadow_p2 REMOVE PARTITIONING
ALTER TABLE tp EXCHANGE PARTITION p2 WITH TABLE _shadow_p2
ALTER TABLE tp REORGANIZE PARTITION p2 INTO (PARTITION p1 VALUES LESS THAN (10), PARTITION p2 VALUES LESS THAN (20))
ALTER TABLE tp EXCHANGE PARTITION p2 WITH TABLE _shadow_p2
ALTER TABLE tp EXCHANGE PARTITION p1 WITH TABLE _shadow_p1
The above is fast, but not atomic/safe, and we will iterate on it. But first, some analysis:
Because REORGANIZE PARTITION is proportional to table size, we don't want to run it on a populated partition.
This is why we exchange p2 with an empty table. REORGANIZE PARTITION is fast on the now empty partition.
We then repopulate p2, and then repopulate p1.
Important notes:
We need to somehow aquire the information that p2 was the next partition after p1. If we want to drop p4 for example, then we need to know that p5 was its subsequent partition.
We need to be able to restore the p1 definition, which was lost from the table.
There's a point in time where p2 is suddenly empty. This is an unexpected and undesired behavior.
The operation is non-atomic. If the tablet fails after EXCHANGE PARTITION p2 WITH TABLE _shadow_p2 then its successor needs to resume operation or revert it, to gain back the data lost from p2.
Making the operation atomic to the user
To prevent the user from suddenly reading empty data from p2, or inserting/updating data in that partition, we have to prevent reads and writes to the table, momentarily. This wil be achieved by using query buffering similarly to how we cut-over a VReplication migration.
Reverting a ADD PARTITION
While a ADD PARTITION migration creates an empty new partition, the user may populate the partition at any time. The REVERT of a ADD PARTITION is not to DROP it. Rather, we want to invoke our Online DDL equivalent of DROP PARTITION, as illustrated above.
Summary
There is a lot of state management to handle: a dropped partition's definition, where the data is stashed, journaling of non-atomic steps, state of query buffering. The solution is complex but achievable.
The text was updated successfully, but these errors were encountered:
Status: we have implemented fast DROP PARTITION and ADD PARTITION, supported by schemadiff and used by Vitess with a special, yet-undocumented ddl_strategy flag. We will continue to experiment with that until we either document the flag or just activate the new behavior by default without flags.
REVERT for those operations is ON HOLD because of the complexity it takes; it works, but auditing and recovering from a failed/interrupted REVERT introduces some difficult scenarios.
This issue is to explain how I wish to implement fast revert for
ADD|DROP PARTITION
operations. Some initial context: #10315The problem
Consider the following range partitioned table:
Assuming the table is very large, and assuming the user wants to routinely rotate it, we can expect periodic
ALTER TABLE ... DROP PARTITION
andALTER TABLE ... ADD PARTITION
DDLs, which the user expect to operate very quickly. Moreover, in the spirit of Online DDL, we want to be able to revert such statements.Nothing is trivial here and the problems begin with almost the simplest query.
We cannot use good-old online schema change to add/drop partitions because it's slow
The canonical way for OSC is to create a new empty shadow table, alter that table, then copy over the data from the original table. If the table is large, then the operation is slow. But range partitioning's greatest incentive is that we can operate on smaller tables (partitions), and where operations are proportional to the size of affected partitions, not of the entire table. In particular,
ALTER TABLE ... ADD PARTITION
is a near instant operation. But OSC can take hours to complete it.We cannot use good-old online schema change to drop partitions because it's incorrect
This is true for all OSC tools. Consider that we issue a
ALTER TABLE tp DROP PARTITION p1
on our sample table above. Alkso, consider the table is populated, and has an id value of2
, which belongs inp1
.The user assumes that dropping
p1
also drops the row whereid = 2
. But in an OSC we build a new table:(
p1
is missing) and then copy data over from the original table. The partitionp1
now comfortably hosts our row (id = 2
isLESS THAN (20)
). We end up with no rows removed!We therefore need to have a special support for DROP PARTITION statements
But as we'll soon show, we also need a special support for
ADD PARTITION
. But, before that:DROP PARTITION is not fast
In MySQL,
DROP PARTITION
maps to aDROP TABLE
. This operation is proportional to the size of the table. Even with recent MySQL8.0.23
fix toDROP TABLE
performance, there is still the issue of removing the table from the filesyste, In MySQL, aDROP PARTITION
locks the entire table for writes for the duration of the operation. This can be seconds or minutes. This is too much. We want Online DDL to be non-blocking.DROP PARTITION is not revertible
Vitess's
REVERT
not only brings back the previous schema, but also the data. But aDROP PARTITION
loses the data. We want to be able to keep hold of it.Proposal: fast and safe DROP PARTITION implementation
We turn
DROP PARTITION
into the following multi-step operation. Illustrated by our sample table and queryALTER TABLE tp DROP PARTITION p1
:CREATE TABLE _shadow_p1 LIKE tp
ALTER TABLE _shadow_p1 REMOVE PARTITIONING
ALTER TABLE tp EXCHANGE PARTITION p1 WITH TABLE _shadow_p1
ALTER TABLE tp DROP PARTITION p1
The above sequence of events is fast and safe. We swap the existing partition with an empty table (fast). We then drop the now-empty partition (fast) and are left with a copy of the data (safe).
_shadow_p1
is a pseudocode name for the table. In reality, this will be a_vt_HOLD_*
table and will be garbage collected at a later stage.DROP PARTITION has no counter-query
We'd like to think that the counter-query to a
ALTER TABLE ... DROP PARTITION
is aALTER TABLE ... ADD PARTITION
, but that is incorrect in range partitioned tables. You may onlyADD PARTITION
to the end of the partition range, and so this can only (kind of) work if you dropped the last partition_.Proposal: fast and safe REVERT for DROP PARTITION (last partition) implementation
Assuming we just dropped the last partition, the
REVERT
is not just about creating a new partition in its place, but also about populating it. Suggested flow for revertingALTER TABLE tp DROP PARTITION p6
(last partition inour table):ALTER TABLE tp ADD PARTITION p6 (PARTITION p6 VALUES LESS THAN (60))
ALTER TABLE tp EXCHANGE PARTITION p6 WITH TABLE _shadow_p6
DROP TABLE _shadow_p6
(this assumes
_shadow_p6
was created by a previousDROP PARTITION
)This is a fast and safe operation. It is not atomic: there can be a moment where the user reads empty data from
p6
just as it is created and before swapped with_shadow_p6
. But this behavior is consistent with not having a partition in the first place.Important notes:
p6
and has lost its definition. We must be able to retain the definitionPARTITION p6 VALUES LESS THAN (60)
somewhere.ADD PARTITION
andEXCHANGE PARTITION
, the successor tablet can still complete the opration from that point.Proposal: fast and safe REVERT for DROP PARTITION (non-last partition) implementation
If we drop
p1
, we can'tADD
it back. Same forp2
thoughp5
. To recreate the partitions we'd have to use aREORGANIZE PARTITION
statement. But that is proportional to the partition size and can take a long time.We now illustrate the steps to
REVERT
aALTER TABLE tp DROP PARTITION p1
migration:CREATE TABLE _shadow_p1 LIKE tp
ALTER TABLE _shadow_p1 REMOVE PARTITIONING
CREATE TABLE _shadow_p2 LIKE tp
ALTER TABLE _shadow_p2 REMOVE PARTITIONING
ALTER TABLE tp EXCHANGE PARTITION p2 WITH TABLE _shadow_p2
ALTER TABLE tp REORGANIZE PARTITION p2 INTO (PARTITION p1 VALUES LESS THAN (10), PARTITION p2 VALUES LESS THAN (20))
ALTER TABLE tp EXCHANGE PARTITION p2 WITH TABLE _shadow_p2
ALTER TABLE tp EXCHANGE PARTITION p1 WITH TABLE _shadow_p1
The above is fast, but not atomic/safe, and we will iterate on it. But first, some analysis:
REORGANIZE PARTITION
is proportional to table size, we don't want to run it on a populated partition.p2
with an empty table.REORGANIZE PARTITION
is fast on the now empty partition.p2
, and then repopulatep1
.Important notes:
p2
was the next partition afterp1
. If we want to dropp4
for example, then we need to know thatp5
was its subsequent partition.p1
definition, which was lost from the table.p2
is suddenly empty. This is an unexpected and undesired behavior.EXCHANGE PARTITION p2 WITH TABLE _shadow_p2
then its successor needs to resume operation or revert it, to gain back the data lost fromp2
.Making the operation atomic to the user
To prevent the user from suddenly reading empty data from
p2
, or inserting/updating data in that partition, we have to prevent reads and writes to the table, momentarily. This wil be achieved by using query buffering similarly to how we cut-over aVReplication
migration.Reverting a ADD PARTITION
While a
ADD PARTITION
migration creates an empty new partition, the user may populate the partition at any time. TheREVERT
of aADD PARTITION
is not toDROP
it. Rather, we want to invoke our Online DDL equivalent ofDROP PARTITION
, as illustrated above.Summary
There is a lot of state management to handle: a dropped partition's definition, where the data is stashed, journaling of non-atomic steps, state of query buffering. The solution is complex but achievable.
The text was updated successfully, but these errors were encountered: