
Add the ability to automate and schedule backups #553

Merged 45 commits into main on Jun 3, 2024

Conversation

frouioui
Member

@frouioui frouioui commented May 13, 2024

Description

This Pull Request adds a new CRD called VitessBackupSchedule. Its main goal is to automate and schedule backups of Vitess, taking backups of the Vitess cluster at regular intervals based on a given cron schedule and strategy. Like most other components of the vitess-operator, this new CRD is managed by the VitessCluster: the VitessCluster controller is responsible for the whole lifecycle (creation, update, deletion) of the VitessBackupSchedule objects in the cluster. Inside the VitessCluster it is possible to define several VitessBackupSchedules as a list, allowing for multiple concurrent backup schedules.

Among other things, the VitessBackupSchedule object is responsible for creating Kubernetes Jobs at the desired time, based on the user-defined schedule. It also keeps track of older jobs and deletes them once they are too old, according to the user-defined parameters (successfulJobsHistoryLimit & failedJobsHistoryLimit). The jobs created by the VitessBackupSchedule object use the vtctld Docker image and execute a shell command that is generated from the user-defined strategies. The end user can define as many backup strategies per schedule as they want; each strategy mirrors what vtctldclient can do, the Backup and BackupShard commands are available, and a map of extra flags lets the user pass as many flags as they want to vtctldclient.

A new end-to-end test is added to our BuildKite pipeline as part of this Pull Request to test the proper behavior of this new CRD.

Related PRs

Demonstration

For this demonstration I have set up a Vitess cluster by following the steps in the getting started guide, up until the very last step where we must apply the 306_down_shard_0.yaml file. My cluster is then composed of 2 keyspaces: customer with 2 shards, and commerce, which is unsharded. I then modify the 306_down_shard_0.yaml file to contain the new backup schedule, as seen in the snippet right below. We want to create two schedules, one per keyspace. The keyspace customer will have two backup strategies: one for each shard.

apiVersion: planetscale.com/v2
kind: VitessCluster
metadata:
  name: example
spec:
  backup:
    engine: xtrabackup
    locations:
      - volume:
          hostPath:
            path: /backup
            type: Directory
    schedules:
      - name: "every-minute-customer"
        schedule: "* * * * *"
        resources:
          requests:
            cpu: 100m
            memory: 1024Mi
          limits:
            memory: 1024Mi
        successfulJobsHistoryLimit: 2
        failedJobsHistoryLimit: 3
        strategies:
          - name: BackupShard
            keyspaceShard: "customer/-80"
          - name: BackupShard
            keyspaceShard: "customer/80-"
      - name: "every-minute-commerce"
        schedule: "* * * * *"
        resources:
          requests:
            cpu: 100m
            memory: 1024Mi
          limits:
            memory: 1024Mi
        successfulJobsHistoryLimit: 2
        failedJobsHistoryLimit: 3
        strategies:
          - name: BackupShard
            keyspaceShard: "commerce/-"
  images:

Once the cluster is stable and all tablets are serving and ready, I re-apply my YAML file with the backup configuration:

$ kubectl apply -f test/endtoend/operator/306_down_shard_0.yaml 
vitesscluster.planetscale.com/example configured

Immediately, I can check that the new VitessBackupSchedule objects have been created:

$ kubectl get VitessBackupSchedule 
NAME                                          AGE
example-vbsc-every-minute-commerce-ac6ff735   7s
example-vbsc-every-minute-customer-8aaaa771   7s

Now I want to check the pods in which the jobs created by VitessBackupSchedule are running. After about 2 minutes, we can see four pods, two for each schedule. The pods are marked as Completed as they have finished their job.

$ kubectl get pods
NAME                                                           READY   STATUS             RESTARTS        AGE
...
example-vbsc-every-minute-commerce-ac6ff735-1715897700-nkfzx   0/1     Completed          0              79s
example-vbsc-every-minute-commerce-ac6ff735-1715897760-qr4hp   0/1     Completed          0              19s
example-vbsc-every-minute-customer-8aaaa771-1715897700-rbsmd   0/1     Completed          0              79s
example-vbsc-every-minute-customer-8aaaa771-1715897760-kzn8t   0/1     Completed          0              19s
...

Now let's check our backup:

$ ls -l vtdataroot/backup/example/commerce/- vtdataroot/backup/example/customer/80- vtdataroot/backup/example/customer/-80 

vtdataroot/backup/example/commerce/-:
total 0
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:15 2024-05-16.221502.zone1-0790125915
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:16 2024-05-16.221602.zone1-0790125915

vtdataroot/backup/example/customer/-80:
total 0
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:15 2024-05-16.221502.zone1-2289928654
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:16 2024-05-16.221601.zone1-2289928654

vtdataroot/backup/example/customer/80-:
total 0
drwxr-xr-x  11 florentpoinsard  staff  352 May 16 16:15 2024-05-16.221511.zone1-4277914223
drwxr-xr-x  10 florentpoinsard  staff  320 May 16 16:16 2024-05-16.221609.zone1-2298643297

$ kubectl get vtb --no-headers
example-commerce-x-x-20240516-221502-2f185d5b-1854be28    2m7s
example-commerce-x-x-20240516-221602-2f185d5b-0a248174    67s
example-customer-80-x-20240516-221511-fefbca6f-8ede9c7d   2m7s
example-customer-80-x-20240516-221609-89028361-d9d1c1e4   67s
example-customer-x-80-20240516-221502-887d89ce-2fc618f4   2m7s
example-customer-x-80-20240516-221601-887d89ce-5b5b0acb   66s

frouioui added 20 commits May 8, 2024 10:30
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
@frouioui frouioui changed the title Add VitessBackupSchedule VitessBackupSchedule add the ability to automate backups May 16, 2024
@frouioui frouioui changed the title VitessBackupSchedule add the ability to automate backups Add the ability to automate and schedule backups May 16, 2024
Signed-off-by: Florent Poinsard <[email protected]>
@frouioui frouioui marked this pull request as ready for review May 17, 2024 17:01
@frouioui frouioui requested a review from mattlord May 17, 2024 17:03
@frouioui
Member Author

In commit bc74ab4, I have applied one of the most important suggestions discussed above, which is to remove the BackupTablet strategy in favor of BackupKeyspace and BackupCluster. The strategies can be used as follows:

# BackupKeyspace
        strategies:
          - name: BackupKeyspace
            cluster: "example"
            keyspace: "customer"
# BackupCluster
        strategies:
          - name: BackupCluster
            cluster: "example"

Meanwhile, the BackupShard strategy does not change. When run, we can see the following command-line arguments in the job's pod, which get executed upon creation of the container:

# BackupKeyspace
Args:
      /bin/sh
      -c
      /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/-80 && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/80-
# BackupCluster
Args:
      /bin/sh
      -c
      /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard commerce/- && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/-80 && /vt/bin/vtctldclient --server=example-vtctld-625ee430:15999 BackupShard customer/80-
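The chained `BackupShard` invocations shown above can be reproduced with a small sketch; `buildBackupCommand` is a hypothetical helper, with the server address and binary path taken from the example output rather than from the operator's code:

```go
package main

import (
	"fmt"
	"strings"
)

// buildBackupCommand joins one BackupShard invocation per keyspace/shard
// with " && ", matching the Args shown above.
func buildBackupCommand(server string, keyspaceShards []string) string {
	var cmd strings.Builder
	for i, ks := range keyspaceShards {
		if i > 0 {
			cmd.WriteString(" && ")
		}
		fmt.Fprintf(&cmd, "/vt/bin/vtctldclient --server=%s BackupShard %s", server, ks)
	}
	return cmd.String()
}

func main() {
	fmt.Println(buildBackupCommand("example-vtctld-625ee430:15999",
		[]string{"customer/-80", "customer/80-"}))
}
```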

cc @maxenglander @mattlord


// Cluster defines on which cluster you want to take the backup.
// This field is mandatory regardless of the chosen strategy.
Cluster string `json:"cluster"`
Contributor

i'm not sure i follow why this is necessary. my mental model is that a user defines []VitessBackupScheduleTemplate on the ClusterBackupSpec, and so implicitly each VitessBackupScheduleStrategy will be associated with the cluster where ClusterBackupSpec is defined.

Member Author

That's a good point @maxenglander, it is pretty useless. I ended up removing that field from VitessBackupScheduleStrategy and adding it to VitessBackupScheduleSpec. The VitessCluster controller fills in that new field when it creates a new VitessBackupSchedule object; that way, VitessBackupSchedule is still able to select existing components by their cluster names, avoiding fetching the wrong data when multiple VitessClusters are running in the same Kubernetes cluster.

See b30aa09 for the change.

frouioui added 4 commits May 28, 2024 15:06
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
@frouioui
Member Author

another thought, might be nice to give users a way to assign annotations, and one or more affinity selection options, to the backup runner pods. that way they can influence things like scheduling and eviction.

for example, users might not want backup runner pods running on the same nodes as vttablet pods. and they might not want the backup runner pods to get evicted by an unrelated pod after they've been running for a long time.

In e6946fb I have added affinity and annotations to the VitessBackupScheduleTemplate, allowing the user to configure the affinity and annotations they want for the pods that take backups.

@frouioui frouioui requested review from mattlord and maxenglander May 30, 2024 18:29
pkg/apis/planetscale/v2/labels.go
return err
}
if jobStartTime.Add(time.Minute * time.Duration(timeout)).Before(time.Now()) {
if err := r.client.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground)); err != nil {
Contributor

seems like a good thing to have a metric for

Member Author
@frouioui frouioui May 30, 2024

fixed via 46b6967 + 5809cbd

return job, nil
}

func (r *ReconcileVitessBackupsSchedule) createJobPod(ctx context.Context, vbsc *planetscalev2.VitessBackupSchedule, name string) (pod corev1.PodSpec, err error) {
Contributor

might be worth adding a note about that in the release notes. i expect it will be a common issue people run into.

if shardIndex > 0 || ksIndex > 0 {
cmd.WriteString(" && ")
}
createVtctldClientCommand(&cmd, vtctldclientServerArg, strategy.ExtraFlags, ks.name, shard)
Contributor

am i reading this right that it will be taking a backup of each keyspace and shard in sequence? that doesn't seem ideal to me because if each shard takes an hour to backup, and there are 32 shards, then the backup of the first shard and last shard will be more than a day apart.

i think it would be better if there were at least the option of BackupCluster and BackupKeyspace to backup all keyspaces and shards in parallel.

might be better to limit this PR to only support BackupShard for now, and add support for the other options after more consideration into how to implement BackupKeyspace and BackupCluster.
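The parallel option the reviewer describes could look roughly like this in Go. This is a sketch under the assumption that `backupFn` wraps a vtctldclient call; none of these names come from the operator:

```go
package main

import (
	"fmt"
	"sync"
)

// backupShardsInParallel launches one backup per shard concurrently
// instead of chaining them with "&&", so the first and last shard
// backups start at roughly the same time.
func backupShardsInParallel(shards []string, backupFn func(shard string) error) []error {
	errs := make([]error, len(shards))
	var wg sync.WaitGroup
	for i, shard := range shards {
		wg.Add(1)
		go func(i int, shard string) {
			defer wg.Done()
			errs[i] = backupFn(shard)
		}(i, shard)
	}
	wg.Wait()
	return errs
}

func main() {
	shards := []string{"customer/-80", "customer/80-", "commerce/-"}
	errs := backupShardsInParallel(shards, func(shard string) error {
		// stand-in for: vtctldclient BackupShard <shard>
		return nil
	})
	failures := 0
	for _, err := range errs {
		if err != nil {
			failures++
		}
	}
	fmt.Println("failures:", failures) // failures: 0
}
```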

Member Author

Let's do that: remove those two strategies as part of this PR, and I will work on a subsequent PR to add them back with a better approach. This PR is getting lengthy already.

Member Author

Fixed via 70ba063

Collaborator
@mattlord mattlord May 31, 2024

IMO BackupAllShardsInKeyspace and BackupAllShardsInCluster are better names. It may seem nitty, but I think it's important as it reflects what it actually is: independent backups of the shards. i.e. it is NOT a single consistent backup of the keyspace or cluster at any physical or logical point in time.

Member Author
@frouioui frouioui May 31, 2024

I ended up removing Keyspace and Cluster strategies in this PR as it will require a bigger refactoring. I am keeping that in mind for when we add them though.

@frouioui frouioui requested a review from maxenglander May 30, 2024 21:24
Contributor
@maxenglander maxenglander left a comment

one last thought, lgtm overall

Signed-off-by: Florent Poinsard <[email protected]>
Collaborator
@mattlord mattlord left a comment

Nice work on this, @frouioui ! ❤️ I only had a few nits/comments that you can address as you feel is best.

.buildkite/pipeline.yml
take into account when using this feature:

- If you are using the `xtrabackup` engine, your vttablet pods will need more memory; think about provisioning more for them.
- If you are using the `builtin` engine, you will lose a replica during the backup; think about adding a new tablet.
Collaborator

I think there's a minimum healthy tablet setting? If so, worth mentioning that here IMO.

Member Author

There is not

test/endtoend/backup_schedule_test.sh (outdated)
Comment on lines +613 to +621
ks := keyspace{
name: item.Spec.Name,
}
for shardName := range item.Status.Shards {
ks.shards = append(ks.shards, shardName)
}
if len(ks.shards) > 0 {
result = append(result, ks)
}
Collaborator

Curious why we don't do this instead:

		for shardName := range item.Status.Shards {
			ks.shards = append(result, &keyspace{
			name: item.Spec.Name,
			shards: shardName,
		})
		}

The other allocations/copying seems unnecessary at first glance. When combined with the single shot precise allocation it should be more efficient.

Member Author

I am not sure I understand what you are suggesting. We still want to create one keyspace object per item in ksList.Items, and for all the shards in that item we want to append to keyspace.shards.
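The behavior the author describes, one keyspace entry per item with all of that item's shards appended, can be shown with a standalone sketch; the `item` type and `collectKeyspaces` helper are simplified stand-ins for `ksList.Items` and the surrounding function:

```go
package main

import "fmt"

// keyspace mirrors the small struct used in the snippet above.
type keyspace struct {
	name   string
	shards []string
}

// item is a simplified stand-in for one entry of ksList.Items.
type item struct {
	name   string
	shards []string
}

// collectKeyspaces builds one keyspace per item, appending every shard
// of that item; items with no shards are skipped, matching the original.
func collectKeyspaces(items []item) []keyspace {
	result := make([]keyspace, 0, len(items))
	for _, it := range items {
		ks := keyspace{name: it.name}
		ks.shards = append(ks.shards, it.shards...)
		if len(ks.shards) > 0 {
			result = append(result, ks)
		}
	}
	return result
}

func main() {
	got := collectKeyspaces([]item{
		{name: "customer", shards: []string{"-80", "80-"}},
		{name: "commerce", shards: []string{"-"}},
	})
	fmt.Println(len(got), "keyspaces") // 2 keyspaces
}
```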

frouioui added 3 commits June 3, 2024 12:04
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
@frouioui frouioui merged commit f754509 into main Jun 3, 2024
10 checks passed
@frouioui frouioui deleted the scheduled-backups branch June 3, 2024 21:06
@frouioui frouioui mentioned this pull request Dec 2, 2024
Labels: enhancement (New feature or request)