Overview of the Issue
#16997 introduced a preference to not promote hosts taking backups during reparents. However, we have noticed some issues with it, namely:
Incorrect passing of the flag indicating whether a tablet is taking a backup
The information about whether a tablet is taking a backup is stored in ReplicationStatusResponse and StopReplicationAndGetStatusResponse, even though a BackupRunning field has also been added to replicationdata.Status. The reparenting code, however, does not have access to ReplicationStatusResponse.BackupRunning, because the gRPC TabletManagerClient only returns ReplicationStatusResponse.Status. This leads to incorrect decisions about which host to promote.
vtctld crashes when certain calls from vtctld to tablets fail during ERS
During ERS, when running stopReplicationAndBuildStatusMaps, calls to TabletManagerClient.StopReplicationAndGetStatus can fail. The code then attempts to call a method on a nil struct, which segfaults the vtctld process. See here.
Reproduction Steps
The issue can be verified by running the local installation as described here, triggering backups, and then calling PlannedReparentShard and EmergencyReparentShard.
Binary Version
This has been seen in the latest dev code for v22.
Operating System and Environment details
Log Fragments
No response