Tuesday, 26 April 2016

AWS Disaster Recovery


Concepts
RPO - recovery point objective (how much data loss is acceptable, e.g. updates from the last 5 minutes can be lost)
RTO - recovery time objective (the maximum acceptable time to restore the data from backup, e.g. 5 hours)
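
The two objectives can be checked against a concrete setup with a few lines of arithmetic. The sketch below is only an illustration: the function name `meets_objectives` and the sample backup intervals and restore durations are assumptions, not part of any AWS API. It treats the backup interval as the worst-case data loss and compares it with the RPO, then compares the restore duration with the RTO.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Return (rpo_ok, rto_ok) for a simple backup-and-restore setup.

    Worst-case data loss equals the time between two backups, so the
    backup interval must not exceed the RPO. The restore must finish
    within the RTO.
    """
    rpo_ok = backup_interval <= rpo
    rto_ok = restore_duration <= rto
    return rpo_ok, rto_ok


# Objectives taken from the examples in the notes: RPO 5 minutes, RTO 5 hours.
rpo = timedelta(minutes=5)
rto = timedelta(hours=5)

# Hypothetical setup 1: nightly backups, restore takes about 3 hours.
nightly = meets_objectives(timedelta(hours=24), timedelta(hours=3), rpo, rto)

# Hypothetical setup 2: logs shipped every 5 minutes, same restore time.
log_shipping = meets_objectives(timedelta(minutes=5), timedelta(hours=3), rpo, rto)

print("nightly backups:", nightly)
print("log shipping:", log_shipping)
```

In this toy model the nightly backup misses the RPO but meets the RTO, while shipping logs every 5 minutes meets both.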

Scenarios
  • Backup & Restore ($)
    • Glacier
    • S3
    • Storage Gateway
      • Cached volumes (S3 is authority)
      • VTL (virtual tape library) - backup target replacing physical tapes
    • Transfer
      • Direct Connect (DX)
      • AWS Import/Export
      • Internet
    • Long RTO, High RPO
  • Pilot Light ($$)
    • Minimal version of critical core components always running
      • Database server (e.g. asynchronous replication)
    • Preconfigured server AMIs - can be started quickly
  • Warm standby ($$$)
    • Scaled down version of environment running all the time
    • Can be used for non-production work
      • Test, QA, internal use
    • Patch on same schedule as primary
    • App tier and standby site write to both primary and standby
    • Upon recovery
      • Add more servers
      • Resize instances
      • Switch DNS to warm standby
  • Multi-Site ($$$$)
    • Active-active
    • Two regions at the same time
    • Route 53 for weighted routing
    • Complex state management (database reconciliation)
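
Weighted routing, used above both for the multi-site split and for switching DNS to a standby, comes down to simple proportions: a record is returned with probability equal to its weight divided by the sum of all weights. The following is a minimal sketch of that arithmetic only. It does not call Route 53, the record names and weights are hypothetical, and the all-zero-weights case is deliberately not modelled.

```python
from fractions import Fraction


def traffic_share(weights):
    """Map each record name to its share of DNS responses.

    Each record's share is weight / sum of all weights in the group.
    """
    total = sum(weights.values())
    if total == 0:
        # Not modelled in this sketch.
        raise ValueError("at least one weight must be non-zero")
    return {name: Fraction(weight, total) for name, weight in weights.items()}


# Hypothetical active-active setup: two regions share the traffic equally.
active_active = traffic_share({"eu-west-1": 50, "us-east-1": 50})

# Hypothetical warm standby: weight 0 keeps the standby out of rotation,
# and swapping the weights moves all traffic to it on failover.
before_failover = traffic_share({"primary": 100, "standby": 0})
after_failover = traffic_share({"primary": 0, "standby": 100})

for label, shares in [("active-active", active_active),
                      ("before failover", before_failover),
                      ("after failover", after_failover)]:
    print(label, {name: float(share) for name, share in shares.items()})
```

Keep in mind that DNS caching means a weight change takes effect only as record TTLs expire, so the TTL adds to the effective RTO.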

RDS
  • Multi-AZ synchronously replicates data between AZs
    • Eliminates I/O freezes during backup
    • Minimizes outage during maintenance windows (a failover to the standby still occurs)
  • Automatic failover
    • AZ outage
    • Primary DB fails
    • DB instance type is changed
    • OS of DB is undergoing patching
    • Manual failover initiated (reboot with failover)
    • Failover ~60s
  • Daily backup (within a 30-minute "backup window")
    • Retention period configurable - number of days
    • When deleting instance all automatic backups are deleted (only manual snapshots remain)
    • Automated DB snapshot can be copied (retain indefinitely)
  • RDS archives logs (point-in-time recovery)
    • Ability to recover to an arbitrary point in time within the retention period (RPO 5 minutes)
  • For MySQL, use InnoDB (backup and point-in-time restore are not reliable with MyISAM)
  • On restore it is possible to change many options
    • engine edition (SQL Server Standard -> SQL Server Enterprise)
  • Snapshots can be copied cross-region
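
The point-in-time recovery window described above can be pictured as an interval that ends a few minutes before "now" and reaches back as far as the retention period. The sketch below models only that interval. The function name `can_restore_to` is hypothetical, the 5-minute lag is taken from the RPO figure in these notes, and no RDS API is called.

```python
from datetime import datetime, timedelta


def can_restore_to(target, now, retention_days, lag=timedelta(minutes=5)):
    """Check whether `target` lies inside the restorable window.

    The window is modelled as [now - retention_days, now - lag].
    The default lag mirrors the 5-minute RPO from the notes and is
    an assumption of this sketch.
    """
    earliest = now - timedelta(days=retention_days)
    latest = now - lag
    return earliest <= target <= latest


now = datetime(2016, 4, 26, 12, 0)

# Yesterday morning: inside a 7-day retention period.
inside = can_restore_to(datetime(2016, 4, 25, 9, 30), now, retention_days=7)

# Two minutes ago: newer than the latest restorable time.
too_recent = can_restore_to(datetime(2016, 4, 26, 11, 58), now, retention_days=7)

# Sixteen days ago: older than the retention period.
too_old = can_restore_to(datetime(2016, 4, 10, 0, 0), now, retention_days=7)

print(inside, too_recent, too_old)
```

A longer retention period widens the window backwards, while the lag at the recent end is what bounds the RPO.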

S3
  • Very high durability (11 9's)
    • Reduced Redundancy Storage is cheaper but has lower durability
  • Replicated within region
  • Possible to replicate cross-region
    • Asynchronous
  • Access time: instant
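
To get a feel for what eleven 9s of durability means, it helps to turn the figure into an expected number of lost objects per year. The sketch below assumes the durability figure is an annual, per-object probability and that losses are independent. The function name and the object count are illustrative.

```python
from fractions import Fraction


def expected_annual_losses(object_count, nines=11):
    """Expected number of objects lost per year at the given durability.

    A durability of N nines is treated as an annual loss probability
    of 10**-N per object.
    """
    loss_probability = Fraction(1, 10 ** nines)
    return object_count * loss_probability


# Hypothetical bucket holding ten million objects.
losses_per_year = expected_annual_losses(10_000_000)
years_per_lost_object = 1 / losses_per_year

print("expected losses per year:", float(losses_per_year))
print("years per lost object:", int(years_per_lost_object))
```

Under these assumptions, ten million objects translate to an expected loss of one object roughly every 10,000 years.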

Glacier
  • Very high durability (like S3)
  • Access time: a few hours

DynamoDB
  • Backed by SSD
  • Replicated in 3 AZs in a region
  • Data Pipeline supports
    • cross-region DynamoDB copy (with filtering)
    • copy to S3

AMI
  • Can be copied cross-region

EBS
  • AFR (annual failure rate) - 0.1-0.5% (if < 20GB modified since last snapshot)
  • Snapshots reduce AFR
  • RAID does not offer that much help as all disks in array are likely to fail together
  • Replicated within AZ (mirrored using proprietary technology)
  • Snapshots can be copied cross-region
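
A per-volume AFR looks small, but it adds up across a fleet. The sketch below estimates the chance that at least one volume fails within a year, using an AFR of 0.5% purely as an illustrative value and assuming failures are independent, which is a simplification. The function name is hypothetical.

```python
def p_any_failure(volume_count, afr):
    """Probability that at least one of `volume_count` volumes fails in a year.

    Assumes each volume fails independently with probability `afr`,
    so the chance that none fails is (1 - afr) ** volume_count.
    """
    return 1 - (1 - afr) ** volume_count


# Illustrative AFR of 0.5% per volume.
single = p_any_failure(1, 0.005)
fleet = p_any_failure(100, 0.005)

print(f"single volume: {single:.2%}")
print(f"fleet of 100:  {fleet:.2%}")
```

With 100 volumes the chance of losing at least one in a year is roughly 39% under these assumptions, which is why regular snapshots matter even when the individual AFR is low.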

Business Continuity

  • How the business continues to operate in the event of a disaster
  • Fire, theft, flood
