BackHub audit log data loss incident on March 14, 2024

Summary

On March 14, 2024, during a live Disaster Recovery (DR) test for the BackHub system, the data store for the live audit log was deleted in error. While investigating what had happened, we discovered that no backup of the audit log data store existed, and all audit log records for the BackHub infrastructure were lost.

What Happened

We perform bi-annual DR testing of all our application services and infrastructure. During this testing, we discovered that we had never done a full AWS regional failure DR test for the BackHub infrastructure (which handles backups of GitHub), so we proceeded to do one. While the DR testing was successful and resulted in good updates to our DR procedures, the teardown of the DR testing region revealed a problem with one of the data stores used by BackHub.

BackHub uses three main data stores for permanent storage:

For backup metadata and customer information, an AWS RDS database is utilized
For the actual GitHub backups themselves, AWS EBS storage is utilized
For the audit logging (backup complete, backup failed, restore complete, etc.), AWS SimpleDB is utilized

All of the infrastructure for the environments is managed using infrastructure as code, specifically with Terraform.

SimpleDB

SimpleDB is an AWS service that launched in 2007 as a ‘simple’ NoSQL data store. It has since been superseded by DynamoDB as the main NoSQL offering from AWS and indeed is no longer accessible in the AWS console. Within Rewind, the BackHub infrastructure is the only component using SimpleDB as a data store. It is this service which is used to hold the audit log records for BackHub.

Within the BackHub infrastructure, the backup services are deployed in multiple regions for data residency purposes. However, the core ‘administration’ database is hosted in a single region with snapshots replicated for DR purposes. Within Terraform, the configuration for SimpleDB looks like this:

module "simpledb" {
    source = "../modules/simpledb"
    providers = {
      aws = aws.eu-west-1
    }
}

This means that, whatever provider is used for the backup services, SimpleDB is always referenced in the eu-west-1 region.
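The aws.eu-west-1 reference in the module block points at an aliased provider configuration. As an illustration only, the provider setup could look roughly like the sketch below. The variable name region and the exact provider blocks are assumptions made for this example, not a copy of the BackHub configuration.

```hcl
# Hypothetical variable: the region of the environment being deployed,
# for example the DR test region.
variable "region" {
  type = string
}

# Default provider: most resources follow the environment's region.
provider "aws" {
  region = var.region
}

# Aliased provider: always eu-west-1, whatever var.region is set to.
# The module block above selects this one via aws = aws.eu-west-1.
provider "aws" {
  alias  = "eu-west-1"
  region = "eu-west-1"
}
```

With a layout like this, changing the region variable for a DR run moves the backup services but leaves the SimpleDB module pointed at the live eu-west-1 domain.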

Usually, when a terraform apply is run for a pre-existing resource, the apply fails and the resource must be imported into the Terraform state. That is not the case with SimpleDB - the resource is added to the existing state with no conflicts. To someone not familiar with the fine details of the Terraform template, it appears as if a new SimpleDB instance has been created in the newly configured DR region. The problem arises when a terraform destroy is run to remove the DR testing infrastructure - the SimpleDB instance in eu-west-1 is destroyed, rather than the DR copy of the SimpleDB instance that the operator expects.
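One general Terraform safeguard against this kind of accidental deletion is the lifecycle prevent_destroy setting. The sketch below assumes the module wraps an aws_simpledb_domain resource (available in AWS provider versions that still support SimpleDB). The resource label and domain name are hypothetical, and the block is an illustration rather than a description of the guardrails mentioned under Lessons Learned.

```hcl
# Sketch of a resource inside ../modules/simpledb (contents assumed).
resource "aws_simpledb_domain" "audit_log" {
  # Hypothetical domain name, used only for this example.
  name = "backhub-audit-log"

  lifecycle {
    # Terraform rejects any plan that would destroy this resource,
    # including the plan produced by terraform destroy.
    prevent_destroy = true
  }
}
```

With prevent_destroy set, tearing down a DR environment that references the domain fails with an error instead of deleting it. The operator would then have to deliberately remove the resource from that environment's state first (for example with terraform state rm), which makes the shared domain visible before anything is lost.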

SimpleDB Backups

Rewind has extensive policies and procedures around backup and restore of critical infrastructure and data stores, and these are tested bi-annually in a process known as the “restore-a-thon”. During this process, we verify we can restore everything we have backed up - both in the same region and in a replica region within AWS. Despite all of this, we found we had no backup of the SimpleDB database used for audit logging, and hence no restoration of the now destroyed database was possible. We also looked at the possibility of re-creating the audit log database from regular log messages emitted by BackHub, but found that the log messages do not contain enough information to reconstruct the audit log records.

Lessons Learned and Actions

We are applying the following learnings and actions from this incident:

SimpleDB has no built-in backup and restore process
- Action: We will create our own tooling to facilitate this
While we have guardrails in place to prevent deletion of data stores, SimpleDB was not in this policy
- Action: We have added SimpleDB to the guardrails around deletion
SimpleDB backup and restore testing should be performed at the same interval as other data stores
- Action: SimpleDB is being added to the regular backup and restore testing process
All data stores should be re-audited for backup and restore capabilities and procedures
- Action: All data stores are being re-audited to ensure full backup and restore processes and procedures exist

Posted Mar 19, 2024 - 09:42 EDT

Resolved

We have identified and remediated the problem with the audit log for the BackHub system. A postmortem will be published for this incident.

Posted Mar 19, 2024 - 09:33 EDT

Identified

We have identified the issue with the BackHub audit log and are considering mitigations.

Posted Mar 18, 2024 - 10:33 EDT

Investigating

We are currently investigating an issue where some customers are missing audit log data in the BackHub product.

Posted Mar 15, 2024 - 16:16 EDT