Notes on AWS, Big Data, Machine Learning and Leadership: AWS Disaster Recovery

Concepts

RPO - recovery point objective (how much data loss is acceptable, e.g. updates from the last 5 minutes can be lost)

RTO - recovery time objective (the maximum time it may take to restore the system after a failure, e.g. 5 hours)
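
To make the two objectives concrete, the sketch below checks a simple periodic-backup plan against them. The function name `meets_objectives` and the sample backup figures are illustrative assumptions; the worst-case data loss is assumed to equal the interval between two backups.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Return (rpo_ok, rto_ok) for a simple periodic-backup plan.

    Assumption: the worst-case data loss equals the time between two
    backups, and restore_duration is a measured or estimated figure.
    """
    rpo_ok = backup_interval <= rpo
    rto_ok = restore_duration <= rto
    return rpo_ok, rto_ok


# Example objectives from the definitions: RPO 5 minutes, RTO 5 hours.
rpo = timedelta(minutes=5)
rto = timedelta(hours=5)

# A nightly backup that restores in 3 hours meets the RTO but not the RPO.
nightly = meets_objectives(timedelta(hours=24), timedelta(hours=3), rpo, rto)
print(nightly)  # (False, True)
```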

Scenarios

Backup & Restore ($)
- Glacier
- S3
- Storage Gateway
  - Cached volumes (S3 is authority)
  - VTL - backup target for tapes
- Transfer
  - DX (Direct Connect)
  - Import/Export
  - Internet
- Long RTO, High RPO
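
One way S3 and Glacier might be combined for backups is a lifecycle rule that moves older backups to Glacier. The sketch below only builds the configuration dictionary; the helper name `backup_lifecycle_rules`, the rule ID, the prefix and the day counts are all assumptions for illustration.

```python
def backup_lifecycle_rules(prefix, days_to_glacier, days_to_expire):
    """Build an S3 lifecycle configuration for backups stored under prefix."""
    if days_to_expire <= days_to_glacier:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "archive-backups",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move backups to Glacier once they are days_to_glacier old.
                "Transitions": [
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"}
                ],
                # Delete them entirely once they are days_to_expire old.
                "Expiration": {"Days": days_to_expire},
            }
        ]
    }


config = backup_lifecycle_rules("backups/", 30, 365)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-dr-backups",  # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```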
Pilot Light ($$)
- Minimal version of critical core components always running
  - Database server (e.g. asynchronous replication)
- Preconfigured server AMIs - can be started quickly
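
Starting servers from a preconfigured AMI can be sketched as below. The helper `launch_from_ami` is hypothetical; `ec2_client` is assumed to behave like `boto3.client("ec2")`, whose `run_instances` call is the only API used.

```python
def launch_from_ami(ec2_client, image_id, instance_type, count):
    """Start `count` instances from a preconfigured AMI and return their IDs.

    ec2_client is assumed to behave like boto3.client("ec2").
    """
    response = ec2_client.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=count,  # fail rather than launch fewer than requested
        MaxCount=count,
    )
    return [instance["InstanceId"] for instance in response["Instances"]]
```

Passing the client in keeps the function easy to exercise with a stub before it is pointed at a real recovery region.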
Warm standby ($$$)
- Scaled down version of environment running all the time
- Can be used for non-production work
  - Test, QA, internal use
- Patch on same schedule as primary
- App tier writes to both the primary and the standby site
- Upon recovery
  - add more servers
  - resize instances
  - switch DNS to warm standby
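
The recovery steps could be scripted roughly as follows. This is a sketch under several assumptions: the standby fleet is assumed to run in an Auto Scaling group, the DNS record is assumed to be a CNAME, and the function and parameter names are made up. Resizing instances is not covered. The two clients are assumed to behave like `boto3.client("autoscaling")` and `boto3.client("route53")`.

```python
def promote_warm_standby(autoscaling_client, route53_client, *, group_name,
                         capacity, hosted_zone_id, record_name, standby_dns):
    """Scale up the standby fleet, then point DNS at it."""
    # Step 1: add more servers by raising the Auto Scaling group's size.
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=capacity,
        MaxSize=capacity,
        DesiredCapacity=capacity,
    )
    # Step 2: switch DNS to the warm standby.
    route53_client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "fail over to warm standby",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": 60,  # short TTL so clients pick up the change quickly
                    "ResourceRecords": [{"Value": standby_dns}],
                },
            }],
        },
    )
```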
Multi-Site ($$$$)
- active-active
- Two regions at the same time
- Route53 for weighted routing
- Complex state management (database reconciliation)
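
Weighted routing across two regions can be sketched as a Route 53 change batch with one weighted record per site. The helper name `weighted_record_changes`, the site identifiers and the DNS names are assumptions for illustration; the dictionary follows the shape expected by the `change_resource_record_sets` call.

```python
def weighted_record_changes(record_name, endpoints, ttl=60):
    """Build a Route 53 ChangeBatch with one weighted CNAME per site.

    endpoints maps a site identifier to a (dns_name, weight) pair.
    """
    changes = []
    for site_id, (dns_name, weight) in sorted(endpoints.items()):
        if not 0 <= weight <= 255:
            raise ValueError("Route 53 weights must be between 0 and 255")
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": site_id,  # distinguishes the weighted records
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": dns_name}],
            },
        })
    return {"Changes": changes}


# Split traffic evenly between two hypothetical regional endpoints.
batch = weighted_record_changes("app.example.com.", {
    "us-east-1": ("east.example.com", 50),
    "eu-west-1": ("west.example.com", 50),
})
```

Each site receives traffic in proportion to its weight divided by the sum of all weights, so setting one weight to 0 drains that site.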

RDS

Multi-AZ synchronously replicates data between AZs
- Eliminates I/O freezes during backup
- Minimizes outage during maintenance windows (a brief failover instead of full downtime)
Automatic failover
- AZ outage
- Primary DB fails
- DB instance type is changed
- OS of DB is undergoing patching
- Manual failover initiated (reboot with failover)
- Failover typically takes ~60s
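
A manual failover (reboot with failover) can be triggered through the RDS API, for example to rehearse a failover. In this sketch the helper name `force_failover` is made up, and `rds_client` is assumed to behave like `boto3.client("rds")`.

```python
def force_failover(rds_client, db_instance_id):
    """Reboot a Multi-AZ instance with failover and return its reported status."""
    response = rds_client.reboot_db_instance(
        DBInstanceIdentifier=db_instance_id,
        ForceFailover=True,  # only valid for Multi-AZ deployments
    )
    return response["DBInstance"]["DBInstanceStatus"]
```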
Daily automated backup (within a 30-minute "backup window")
- Retention period configurable - number of days
- When an instance is deleted, all automated backups are deleted by default (only manual snapshots remain)
- An automated DB snapshot can be copied to a manual snapshot (retained indefinitely)
RDS archives logs (point in time recovery)
- Ability to recover to arbitrary point in time in the retention period (RPO 5 minutes)
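
A point-in-time restore always creates a new DB instance. The sketch below builds the arguments for the RDS `restore_db_instance_to_point_in_time` call; the helper name and the instance identifiers are assumptions, and requiring a timezone-aware timestamp is a design choice of this sketch rather than an RDS rule.

```python
from datetime import datetime, timezone


def point_in_time_restore_args(source_id, target_id, restore_time=None):
    """Build keyword arguments for restore_db_instance_to_point_in_time.

    With restore_time=None the latest restorable time is requested.
    """
    args = {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,  # a new instance is created
    }
    if restore_time is None:
        args["UseLatestRestorableTime"] = True
    else:
        if restore_time.tzinfo is None:
            raise ValueError("restore_time must be timezone-aware")
        args["RestoreTime"] = restore_time
    return args


# Restore to a specific moment (hypothetical identifiers and timestamp).
args = point_in_time_restore_args(
    "prod-db", "prod-db-restored",
    datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
# With boto3: boto3.client("rds").restore_db_instance_to_point_in_time(**args)
```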
For MySQL, use InnoDB (backup and point-in-time restore are not reliable with MyISAM)
On restore it is possible to change many options
- engine edition (e.g. SQL Server Standard -> SQL Server Enterprise)
Snapshots can be copied cross-region
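
A cross-region snapshot copy is requested from the destination region. In this sketch the helper name is made up, `dest_rds_client` is assumed to behave like `boto3.client("rds", region_name=<destination>)`, and the snapshot is assumed to be unencrypted (an encrypted one would also need a KMS key in the destination region).

```python
def copy_db_snapshot_cross_region(dest_rds_client, source_snapshot_arn,
                                  source_region, target_snapshot_id):
    """Copy an RDS snapshot into the region dest_rds_client points at."""
    response = dest_rds_client.copy_db_snapshot(
        # For a cross-region copy the source must be given as an ARN.
        SourceDBSnapshotIdentifier=source_snapshot_arn,
        TargetDBSnapshotIdentifier=target_snapshot_id,
        SourceRegion=source_region,
    )
    return response["DBSnapshot"]["DBSnapshotIdentifier"]
```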

Glacier

DynamoDB

AMI

EBS

AFR (annual failure rate) - 0.1-0.5% (if < 20GB modified since last snapshot)
Snapshots reduce AFR
RAID does not help much, as all disks in the array are likely to fail together
Replicated within AZ (mirrored using proprietary technology)
Snapshots can be copied cross-region
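
Taking an EBS snapshot and copying it to another region might look like the sketch below. The helper name and descriptions are assumptions; `source_ec2` and `dest_ec2` are assumed to behave like `boto3.client("ec2")` clients for the source and destination regions.

```python
def snapshot_and_copy(source_ec2, dest_ec2, volume_id, source_region):
    """Snapshot a volume, wait for completion, then copy it cross-region.

    Returns (source_snapshot_id, copied_snapshot_id).
    """
    snapshot = source_ec2.create_snapshot(
        VolumeId=volume_id,
        Description="DR snapshot of " + volume_id,
    )
    snapshot_id = snapshot["SnapshotId"]

    # A snapshot must be complete before it can be copied.
    source_ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # The copy is requested from the destination region.
    copy = dest_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description="DR copy of " + snapshot_id,
    )
    return snapshot_id, copy["SnapshotId"]
```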

Business Continuity
