Concepts
RPO - recovery point objective (how much data loss is acceptable, e.g. updates from the last 5 minutes can be lost)
RTO - recovery time objective (the maximum acceptable time to restore the data from backup, e.g. 5 hours)
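A minimal sketch of how the two objectives can be checked against a backup plan. It assumes the worst-case data loss equals the interval between backups; the function names and the sample plan are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Data written just after a backup is lost if disaster strikes just
    before the next one, so the worst case equals the backup interval."""
    return backup_interval


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Return (rpo_ok, rto_ok) for a proposed backup plan."""
    rpo_ok = worst_case_data_loss(backup_interval) <= rpo
    rto_ok = restore_duration <= rto
    return rpo_ok, rto_ok


# Hypothetical plan: hourly backups, restore measured at 3 hours,
# checked against the example targets (RPO 5 minutes, RTO 5 hours).
result = meets_objectives(
    backup_interval=timedelta(hours=1),
    restore_duration=timedelta(hours=3),
    rpo=timedelta(minutes=5),
    rto=timedelta(hours=5),
)
print(result)  # (False, True)
```

Hourly backups miss a 5-minute RPO even though the restore fits inside the RTO, which is why tight RPOs usually need replication or log shipping rather than periodic backups.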
Scenarios
- Backup&Restore ($)
- Glacier
- S3
- Storage Gateway
- Cached volumes (S3 is authority)
- VTL - backup target for tapes
- Transfer
- DX
- Import/Export
- Internet
- Long RTO, High RPO
- Pilot Light ($$)
- Minimal version of critical core components always running
- Database server (e.g. asynchronous replication)
- Preconfigured server AMIs - can be started quickly
- Warm standby ($$$)
- Scaled down version of environment running all the time
- Can be used for non-production work
- Patch on same schedule as primary
- App tier & standby site write to both primary and standby
- Upon recovery
- add more servers
- resize instances
- Switch DNS to warm standby
- Multi-Site ($$$$)
- active-active
- Two regions at the same time
- Route53 for weighted routing
- Complex state management (database reconciliation)
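A sketch of weighted routing between two sites. It only builds the change batch as a plain dictionary, in the shape the Route 53 ChangeResourceRecordSets API expects, and does not call AWS. The record name, site labels, and endpoint names are hypothetical.

```python
def weighted_change_batch(record_name, targets, ttl=60):
    """Build a change batch with one weighted CNAME record per site.

    targets maps a site label to a (dns_name, weight) pair. Route 53
    accepts weights from 0 to 255 and splits traffic in proportion.
    """
    changes = []
    for site, (dns_name, weight) in targets.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {site} must be in 0..255")
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": site,  # distinguishes records sharing a name
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": dns_name}],
            },
        })
    return {"Comment": "weighted routing sketch", "Changes": changes}


def traffic_share(targets):
    """Fraction of traffic per site: weight / sum of all weights."""
    total = sum(weight for _, weight in targets.values())
    if total == 0:
        # With every weight at 0, traffic is spread equally.
        return {site: 1 / len(targets) for site in targets}
    return {site: weight / total for site, (_, weight) in targets.items()}


# Hypothetical active-active setup with an even split.
sites = {
    "us-east-1": ("east.example.com", 50),
    "eu-west-1": ("west.example.com", 50),
}
batch = weighted_change_batch("app.example.com.", sites)
print(traffic_share(sites))
```

The same builder covers the warm standby switch: setting the failed site's weight to 0 and the standby's to a non-zero value moves all traffic across once the TTL expires.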
RDS
- Multi-AZ synchronously replicates data between AZs
- Eliminates I/O freezes during backup
- Eliminates outage during maintenance windows
- Automatic failover
- AZ outage
- Primary DB fails
- DB instance type is changed
- OS of DB is undergoing patching
- Manual failover initiated (reboot with failover)
- Failover ~60s
- Daily backup (within a 30-minute "backup window")
- Retention period configurable - number of days
- When an instance is deleted, all automated backups are deleted (only manual snapshots remain)
- Automated DB snapshot can be copied (retain indefinitely)
- RDS archives logs (point in time recovery)
- Ability to recover to arbitrary point in time in the retention period (RPO 5 minutes)
- Use InnoDB on MySQL (MyISAM will not work)
- On restore possible to change many options
- engine (SQL Server Standard -> SQL Server Enterprise)
- Snapshots can be copied cross-region
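A sketch of validating a point-in-time restore request before sending it. It assumes the restorable window runs from the start of the retention period up to about 5 minutes before now, as the notes suggest. The helper name and instance identifiers are hypothetical; the dictionary keys match the parameters of boto3's `restore_db_instance_to_point_in_time`, which this snippet does not call.

```python
from datetime import datetime, timedelta, timezone


def point_in_time_restore_args(source_id, target_id, restore_time, now,
                               retention_days, lag=timedelta(minutes=5)):
    """Check a restore time against the retention window and build the
    request arguments. The restore always creates a new instance."""
    earliest = now - timedelta(days=retention_days)
    latest = now - lag
    if not earliest <= restore_time <= latest:
        raise ValueError(
            f"restore time must be between {earliest} and {latest}")
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
        "UseLatestRestorableTime": False,
    }


# Hypothetical request: restore to two hours ago, 7-day retention.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
args = point_in_time_restore_args(
    source_id="prod-db",
    target_id="prod-db-restored",
    restore_time=now - timedelta(hours=2),
    now=now,
    retention_days=7,
)
print(args["RestoreTime"].isoformat())
```

With boto3 installed and credentials configured, the result could be passed as `rds.restore_db_instance_to_point_in_time(**args)`, where `rds` is an RDS client.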
S3
- Very high durability (11 9's)
- Reduced Redundancy Storage is cheaper but has lower durability
- Replicated within region
- Possible to replicate cross-region
- Access time: instant
Glacier
- Very high durability (like S3)
- Access time: a few hours
DynamoDB
- Backed by SSD
- Replicated in 3 AZs in a region
- Data Pipeline supports
- cross-region DynamoDB copy (with filtering)
- copy to S3
AMI
- Can be copied cross-region
EBS
- AFR (annual failure rate) - 0.1-0.5% (if < 20GB modified since last snapshot)
- Snapshots reduce AFR
- RAID does not offer that much help as all disks in array are likely to fail together
- Replicated within AZ (mirrored using proprietary technology)
- Snapshots can be copied cross-region
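A worked example of what a per-volume AFR means across a fleet. It uses an AFR of 0.5%, a value within the range quoted above, and assumes volumes fail independently. That assumption is a simplification, since the notes point out that disks in one array tend to fail together. The function names and fleet sizes are hypothetical.

```python
def prob_any_failure(afr: float, volumes: int) -> float:
    """Probability that at least one of `volumes` independent volumes
    fails within a year, given a per-volume annual failure rate."""
    return 1.0 - (1.0 - afr) ** volumes


def expected_failures(afr: float, volumes: int) -> float:
    """Expected number of volume failures per year across the fleet."""
    return afr * volumes


AFR = 0.005  # 0.5% per volume per year

for fleet in (1, 10, 100, 1000):
    print(
        f"{fleet:>5} volumes: "
        f"expected failures/year = {expected_failures(AFR, fleet):.2f}, "
        f"P(at least one failure) = {prob_any_failure(AFR, fleet):.1%}"
    )
```

Even a small per-volume rate makes a loss somewhere in a large fleet likely within a year, which is the case for regular snapshots.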
Business Continuity
- How the business continues to operate in the event of a disaster
- Fire, theft, flood