Backup and Restore High Availability PostgreSQL Clusters

Sysdig provides a tool to back up and restore the high availability (HA) PostgreSQL cluster in your Sysdig on-prem deployment.

Use the tool in the following scenarios:

  • The PostgreSQL cluster cannot start, for example, because of corrupted data, and you need to restore the databases in a new Postgres HA instance.

  • The Kubernetes cluster is not fully functional, and you need to recreate all Postgres databases in a new cluster for fast recovery.

Back Up a PostgreSQL HA Cluster

Use the Installer to deploy the backup tool in the cluster so that it creates periodic backups to Amazon S3 or S3-compatible object storage. You can later use the most recent copy to restore the databases. The backup tool is installed as a CronJob called pg-backup-ha-cronjob in the sysdigcloud namespace. By default, it creates a backup every 6 hours.
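
To confirm what was deployed, you can inspect the CronJob directly. The sketch below wraps the query in a small shell function; the function name show_backup_schedule is only an illustration and is not part of the backup tool. It prints the configured schedule and, if the CronJob has run at least once, the time it was last scheduled.

```shell
# Print the cron schedule and the last scheduled run of the backup CronJob.
# The function name is illustrative; the CronJob name and namespace are the
# ones described in this topic.
show_backup_schedule() {
  kubectl get cronjob pg-backup-ha-cronjob -n sysdigcloud \
    -o jsonpath='{.spec.schedule}{"\n"}{.status.lastScheduleTime}{"\n"}'
}
```

Run show_backup_schedule from a shell that has kubectl access to the cluster.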

Prerequisites

  • An S3 or S3-compatible bucket.
  • An AWS Access Key ID and AWS Secret Access Key with appropriate privileges to use the bucket.
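
Before installing the backup tool, it can help to confirm that the credentials can reach the bucket. The following is a minimal sketch that assumes the AWS CLI is installed on your workstation; the backup tool itself does not require it. The function name check_backup_bucket and its arguments are placeholders. The check only confirms that the bucket exists and is accessible; it does not prove write permission. For S3-compatible storage you would typically also need to pass --endpoint-url.

```shell
# Check that the bucket exists and is reachable with the current credentials.
# Usage: check_backup_bucket <bucket> <region>
# Expects AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the environment.
# Assumption: the AWS CLI ("aws") is installed; it is not part of the tool.
check_backup_bucket() {
  bucket="$1"
  region="$2"
  if aws s3api head-bucket --bucket "$bucket" --region "$region"; then
    echo "bucket $bucket is reachable"
  else
    echo "cannot access bucket $bucket" >&2
    return 1
  fi
}
```

For example: check_backup_bucket example-bucket us-east-1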

Backup Configuration

Use the configuration parameters in the values.yaml file associated with the Installer to configure the backup operation. The parameters are described in the Configuration Parameters section.

Manually Trigger a Backup

You can trigger a backup on demand using the following command:

kubectl create job pg-backup --from=cronjob/pg-backup-ha-cronjob -n sysdigcloud
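
If you script the on-demand backup, you may want to block until it finishes. The sketch below combines the command above with kubectl wait. The function name run_backup_now is illustrative, and the 30-minute default timeout is an arbitrary assumption that you should size to your databases. Note that if the job fails, waiting for the complete condition does not return early; it runs until the timeout expires.

```shell
# Create an on-demand backup job from the CronJob and wait for it to complete.
# Usage: run_backup_now [job-name] [timeout]
# The default timeout is an assumption, not a value taken from the tool.
run_backup_now() {
  job="${1:-pg-backup}"
  timeout="${2:-30m}"
  kubectl create job "$job" --from=cronjob/pg-backup-ha-cronjob -n sysdigcloud &&
    kubectl wait --for=condition=complete "job/$job" -n sysdigcloud --timeout="$timeout"
}
```

Job names must be unique within the namespace, so pass a different name (for example, run_backup_now pg-backup-2) if an earlier pg-backup job still exists.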

Verify Backup Operation

In the sysdigcloud namespace, run the following commands, replacing <backup-pod-name> with the name of the pod that ran the latest backup job:

kubectl get pods -n sysdigcloud | grep "pg-backup-ha-cronjob"
kubectl logs <backup-pod-name> -n sysdigcloud

The pod log provides details about the backup operation. A successful backup job generates logs similar to this:


2023-11-07T23:06:21+00:00 - INFO - Checking envs
2023-11-07T23:06:21+00:00 - INFO - Validating S3 Bucket
2023-11-07T23:06:21+00:00 - INFO - Aws: S3 region is: us-east-1

2023-11-07T23:06:21+00:00 - INFO - Starting
2023-11-07T23:06:21+00:00 - INFO - Checking envs
2023-11-07T23:06:21+00:00 - INFO - Connecting to S3 and backing up
2023-11-07T23:06:21+00:00 - INFO - Done
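
The check above can also be scripted. The sketch below defines two illustrative helper functions (their names are not part of the tool): one finds the most recently created backup pod, and the other scans a log for the final Done line. Treating that line as the success marker is an assumption based on the sample log above. Pods created by a manually triggered job named pg-backup are named after that job, so they would need a different grep pattern.

```shell
# Print the name of the most recently created backup pod (as "pod/<name>").
latest_backup_pod() {
  kubectl get pods -n sysdigcloud --sort-by=.metadata.creationTimestamp -o name |
    grep "pg-backup-ha-cronjob" | tail -n 1
}

# Read pod logs on stdin; succeed only if an "INFO - Done" line is present.
# Assumption: "Done" marks success, as in the sample log.
backup_log_ok() {
  grep -q -- "- INFO - Done"
}
```

For example: kubectl logs "$(latest_backup_pod)" -n sysdigcloud | backup_log_ok && echo "backup completed"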

Restore a PostgreSQL HA Backup

The restore tool runs as a Kubernetes Job. Because datastore restoration is carried out only on request, it is not bundled in the Installer binary. You trigger the database restore operation by applying the YAML file given below. The duration of the restoration process varies depending on the size of the databases.

Running the restore requires scaling down all deployments in the sysdigcloud namespace, as well as the sysdigcloud-netsec-ingest StatefulSet. This helps ensure an error-free database restoration.

This topic assumes that the most recent backup can be found in the S3 bucket, in the path indicated in the Backup section.

Scale Down the Workloads

  1. Check the number of replicas of the StatefulSet sysdigcloud-netsec-ingest:

    kubectl get sts sysdigcloud-netsec-ingest -n sysdigcloud
    

    An example output:

    NAME                        READY   AGE
    sysdigcloud-netsec-ingest   1/1     4h11m
    

    Note it down for future use.

  2. Check the number of ready replicas for all the Sysdig deployments:

    kubectl get deploy -n sysdigcloud
    

    An example output:

     NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
     ingress-default-backend                            1/1     1            1           4h7m
     registry-scanner-api                               2/2     2            2           3h55m
     sysdig-alert-manager                               1/1     1            1           4h4m                      
    

    Note them down for future use.

  3. Scale down the workloads:

    kubectl scale deployment --replicas 0 --all -n sysdigcloud
    kubectl scale sts sysdigcloud-netsec-ingest --replicas=0 -n sysdigcloud
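
Rather than noting the counts by hand in steps 1 and 2, you could capture them in a file before scaling down. The following is a sketch only: the function name list_replicas and the output file name replicas.txt are hypothetical. It records the desired replica count (.spec.replicas) of each workload, which normally equals the second number in the READY column.

```shell
# Print "<kind>/<name> <replicas>" for every Deployment and for the
# sysdigcloud-netsec-ingest StatefulSet in the sysdigcloud namespace.
list_replicas() {
  kubectl get deploy -n sysdigcloud \
    -o jsonpath='{range .items[*]}deployment/{.metadata.name} {.spec.replicas}{"\n"}{end}'
  kubectl get sts sysdigcloud-netsec-ingest -n sysdigcloud \
    -o jsonpath='statefulset/{.metadata.name} {.spec.replicas}{"\n"}'
}
```

Run list_replicas > replicas.txt before step 3, and keep the file until the restore is finished.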
    

Apply the Kubernetes Job

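Save the following example Job manifest to a file, replace the bucket, region, path, and AWS credential values with your own, and apply the file to the cluster in the sysdigcloud namespace.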

An example Job for Restore:

apiVersion: batch/v1
kind: Job
metadata:
  name: pg-restore-ha-job
  namespace: sysdigcloud  
  generateName: pg-restore-ha  
spec:
  ttlSecondsAfterFinished: 200
  template:
    spec:
      restartPolicy: Never
      containers:
        - image: quay.io/sysdig/postgres-backup-onprem:0.1.3
          name: pg-backup-ha
          command: ["/usr/local/bin/pg-restore.sh"]
          env:
            - name: TZ
              value: Etc/UTC
            - name: LOGICAL_BACKUP_PROVIDER
              value: s3
            - name: LOGICAL_BACKUP_S3_BUCKET
              value: example-bucket
            - name: LOGICAL_BACKUP_S3_REGION
              value: us-east-1
            - name: LOGICAL_BACKUP_PATH
              value: "demo-path"
            - name: PGPORT
              value: "5432"
            - name: PGHOST
              value: sysdigcloud-postgres-cluster
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: root.sysdigcloud-postgres-cluster.credentials.postgresql.acid.zalan.do
                  key: username
            - name: PGDATABASE
              value: postgres
            - name: PGSSLMODE
              value: require
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: root.sysdigcloud-postgres-cluster.credentials.postgresql.acid.zalan.do
                  key: password
            - name: AWS_ACCESS_KEY_ID
              value: XXXXXXXXXX
            - name: AWS_SECRET_ACCESS_KEY
              value: YYYYYYYYYY
      imagePullSecrets:
        - name: sysdigcloud-pull-secret              
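
The manifest can be applied with kubectl apply. The sketch below assumes you saved it to a local file (the function name run_restore and any file name you pass are your own choice) and waits for the Job named pg-restore-ha-job to complete. The 2-hour default timeout is an arbitrary assumption. Because the manifest sets ttlSecondsAfterFinished to 200, Kubernetes deletes the Job and its pod about 200 seconds after it finishes, so collect the logs promptly.

```shell
# Apply the restore Job manifest and wait for the Job to complete.
# Usage: run_restore <manifest-file> [timeout]
# The default timeout is an assumption; size it to your databases.
run_restore() {
  manifest="$1"
  timeout="${2:-2h}"
  kubectl apply -f "$manifest" -n sysdigcloud &&
    kubectl wait --for=condition=complete job/pg-restore-ha-job -n sysdigcloud --timeout="$timeout"
}
```

For example: run_restore pg-restore-ha-job.yaml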

Verify Restore Operation

Run the following commands in the sysdigcloud namespace to get the name of the pod that runs the restore job and to view its logs, replacing <restore-pod-name> with that name:

kubectl get pods -n sysdigcloud | grep "pg-restore-ha-job"
kubectl logs <restore-pod-name> -n sysdigcloud

The pod logs indicate whether the job completed successfully. A successful restore generates logs similar to this:

2024-01-07T09:00:00+00:00 - INFO - Starting
2024-01-07T09:00:00+00:00 - INFO - Checking envs
2024-01-07T09:00:00+00:00 - INFO - Connecting to S3 and restoring
2024-01-07T09:20:00+00:00 - INFO - Done
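
Besides reading the logs, you can query the status conditions of the Job. The sketch below uses two illustrative helper functions (not part of the tool) that read the standard Complete and Failed conditions of a Kubernetes Job. Once the TTL configured on the Job has removed it, the query reports "not found".

```shell
# Print the status ("True", "False", or empty) of one Job condition type.
job_condition() {
  kubectl get job pg-restore-ha-job -n sysdigcloud \
    -o jsonpath="{.status.conditions[?(@.type==\"$1\")].status}"
}

# Summarize the state of the restore Job in one word or phrase.
restore_status() {
  if ! kubectl get job pg-restore-ha-job -n sysdigcloud >/dev/null 2>&1; then
    echo "not found"
  elif [ "$(job_condition Complete)" = "True" ]; then
    echo "complete"
  elif [ "$(job_condition Failed)" = "True" ]; then
    echo "failed"
  else
    echo "in progress"
  fi
}
```

Run restore_status while the Job exists to see whether it is complete, failed, or still in progress.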

Scale Up the Workloads

When the job is complete, scale up each deployment and StatefulSet by using the number of replicas noted down earlier.

Deployments

For example:

kubectl scale deployment registry-scanner-api --replicas 2 -n sysdigcloud
kubectl scale deployment ingress-default-backend  --replicas 1 -n sysdigcloud
kubectl scale deployment sysdig-alert-manager --replicas 1 -n sysdigcloud

StatefulSet

For example:

kubectl scale sts sysdigcloud-netsec-ingest --replicas=1 -n sysdigcloud
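
If many deployments are involved, scaling them one by one is error-prone. The sketch below assumes you recorded the replica counts in a text file, one workload per line, in the form <kind>/<name> <replicas>; both that format and the function name restore_replicas are illustrative assumptions, not part of the tool.

```shell
# Read lines of "<kind>/<name> <replicas>" on stdin and scale each workload
# in the sysdigcloud namespace back to the recorded replica count.
restore_replicas() {
  while read -r resource replicas; do
    # Skip blank lines.
    [ -n "$resource" ] || continue
    # Redirect stdin so kubectl cannot consume the remaining input lines.
    kubectl scale "$resource" --replicas="$replicas" -n sysdigcloud </dev/null
  done
}
```

For example, with a file replicas.txt that contains the lines "deployment/registry-scanner-api 2" and "statefulset/sysdigcloud-netsec-ingest 1", run restore_replicas < replicas.txt.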

Configuration Parameters

Parameter | Value | Example
logicalBackupS3Bucket | The AWS S3 bucket name. | example-bucket
logicalBackupS3Region | The AWS Region where the S3 bucket resides. | us-east-1
logicalBackupPath | The path to the backup files. |
logicalBackupProvider | AWS S3 | S3
awsAccessKeyID | AWS Access Key ID |
awsSecretAccessKey | AWS Secret Access Key |
deploymentEnvironment | Determines whether the backups include all databases or only the Sysdig databases. If the value is left blank, all databases are backed up. If the value is set to sysdig_databases, all databases are backed up except template0, template1, and postgres. | sysdig_databases
enabled | Determines whether the backup is enabled or disabled. The value can be enabled or disabled. The default is enabled. | enabled
schedule | The frequency of the database backup operation, as a cron expression. | "* */6 * * *"