Disaster Recovery

Azimuth uses Velero as a disaster recovery solution. Velero provides the ability to back up Kubernetes API resources to an object store and has a plugin-based system to enable snapshotting of a cluster's persistent volumes.

Warning

Backup and restore is only available for production-grade HA installations of Azimuth.

The Azimuth playbooks install Velero on the HA management cluster and the Velero CLI tool on the seed node. Once configured with the appropriate credentials, the installation process creates a Schedule on the HA cluster, which triggers a daily backup at midnight and cleans up backups that are more than one week old.

The AWS Velero plugin provides S3 support and the CSI plugin provides volume snapshots. The CSI plugin relies on Kubernetes' generic support for volume snapshots, which the Cinder CSI plugin implements for OpenStack.

Configuration

To enable backup and restore functionality, the following variables should be set in your environment:

environments/my-site/inventory/group_vars/all/variables.yml
velero_enabled: true
velero_s3_url: <object-store-endpoint-url>
velero_bucket_name: <name-of-an-existing-bucket>
environments/my-site/inventory/group_vars/all/secrets.yml
velero_aws_access_key_id: <S3-access-key-id>
velero_aws_secret_access_key: <S3-secret-value>

Danger

The S3 credentials should be kept secret. If you want to keep them in Git, which is recommended, then they must be encrypted.

Velero CLI

The Velero installation process also installs the Velero CLI on the Azimuth seed node, which can be used to inspect the state of the backups:

On the seed node, with the kubeconfig for the HA cluster exported
# List the configured backup locations
velero backup-location get

# List the backups and their statuses
velero backup get

See velero -h for other useful commands.
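
For a closer look at a single backup, the Velero CLI also provides describe and logs subcommands. The sketch below wraps them in a small shell function; inspect_backup is a hypothetical helper name, not part of Azimuth or Velero, and the backup name is whatever velero backup get reports.

```shell
# Hypothetical helper: show the details and logs for one backup.
# Usage: inspect_backup <backup-name>
inspect_backup() {
  if [ -z "${1:-}" ]; then
    echo "usage: inspect_backup <backup-name>" >&2
    return 1
  fi
  # Summary of the backup, with additional per-item details
  velero backup describe "$1" --details
  # Logs recorded while the backup ran
  velero backup logs "$1"
}
```

As with the commands above, this is meant to be run on the seed node with the kubeconfig for the HA cluster exported.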

Restoring from a backup

To restore from a backup, you must first know the name of the target backup. This can be inferred from the object names in S3 if the Velero CLI is no longer available.
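
As a rough illustration of inferring names from the bucket: Velero normally stores each backup under a backups/<backup-name>/ prefix (optionally below a configured prefix), and backups created by a schedule are named <schedule-name>-<timestamp>. Assuming that layout, the hypothetical backup_names function below reads object keys on stdin, for example from your S3 client's listing, and prints the distinct backup names.

```shell
# Hypothetical helper: extract backup names from S3 object keys on stdin.
# Assumes keys look like [<prefix>/]backups/<backup-name>/<file>.
backup_names() {
  awk -F/ '{
    # Look for a "backups" path segment followed by a name and a file
    for (i = 1; i < NF - 1; i++)
      if ($i == "backups") { print $(i + 1); break }
  }' | sort -u
}

# Example with two keys belonging to the same scheduled backup
printf '%s\n' \
  'backups/default-20240409000000/velero-backup.json' \
  'backups/default-20240409000000/default-20240409000000.tar.gz' |
  backup_names
# prints: default-20240409000000
```

For backups from a single schedule, the timestamp suffix means the sorted output should list the most recent backup last.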

Once you have the name of the backup to restore, run the following command with your environment activated (similar to a provision):

ansible-playbook stackhpc.azimuth_ops.restore \
  -e velero_restore_backup_name=<backup name>

This will provision a new HA cluster, restore the backup onto it and then bring the installation up to date with your configuration.

Performing ad-hoc backups

To perform an ad-hoc backup using the same configuration parameters as the installed backup schedule, run the following Velero CLI command from the seed node:

On the seed node, with the kubeconfig for the HA cluster exported
velero backup create --from-schedule default

This will begin the backup process in the background. The status of this backup (and others) can be viewed with the velero backup get command shown above.

Tip

Ad-hoc backups will have the same time-to-live as the configured schedule backups (default = 7 days). To change this, pass the --ttl <duration> option (for example, --ttl 72h) to the velero backup create command.
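
As a sketch of overriding the time-to-live: the --ttl option takes a duration such as 72h. The hypothetical ttl_for_days helper below converts whole days into that form, and adhoc_backup (also a hypothetical name) passes the result to the command shown above. The schedule name default is assumed.

```shell
# Hypothetical helper: convert whole days into an hour-based duration (3 -> 72h)
ttl_for_days() {
  echo "$(( $1 * 24 ))h"
}

# Hypothetical wrapper: ad-hoc backup from the "default" schedule, kept for N days
adhoc_backup() {
  velero backup create --from-schedule default --ttl "$(ttl_for_days "$1")"
}
```

On the seed node, adhoc_backup 3 would request a backup that expires after 72 hours.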

Modifying the backup schedule

The following config options are available for modifying the regular backup schedule:

environments/my-site/inventory/group_vars/all/variables.yml
# Whether or not to perform scheduled backups
velero_backup_schedule_enabled: true
# Name of the backup schedule Kubernetes resource
velero_backup_schedule_name: default
# Schedule to use for backups (defaults to every day at midnight)
# See https://en.wikipedia.org/wiki/Cron for format options
velero_backup_schedule_timings: "0 0 * * *"
# Time-to-live for existing backups (defaults to 1 week)
# See https://pkg.go.dev/time#ParseDuration for duration format options
velero_backup_schedule_ttl: "168h"

Note

Setting velero_backup_schedule_enabled: false does not prevent the backup schedule from being installed - instead it sets the schedule state to paused.

This allows for ad-hoc backups to still be run on demand using the configured backup parameters.
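
To confirm how the schedule is currently configured on the cluster, the Velero CLI can list and describe schedules. A minimal sketch, assuming the default schedule name default from the options above; show_schedule is a hypothetical helper name.

```shell
# Hypothetical helper: show the status and full spec of a backup schedule.
# Falls back to the schedule name "default" when no name is given.
show_schedule() {
  name="${1:-default}"
  # One-line summary, including the schedule's status and cron expression
  velero schedule get "$name"
  # Full details, including the TTL applied to the backups it creates
  velero schedule describe "$name"
}
```

Run it on the seed node with the kubeconfig for the HA cluster exported, for example after changing velero_backup_schedule_enabled and re-running the playbooks.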


Last update: April 9, 2024
Created: April 9, 2024