Saturday, 24 February 2018

AWS EC2 (Auto Scaling)


Use cases
  • Scaling activity
  • Instance replacement

Auto Scaling Group    
  • Group of resources
  • Associated with either
    • Launch Configuration
    • Launch Template
  • Group size
    • min
    • max
    • desired
      • used for manual scaling and scheduled scaling
      • must satisfy min <= desired <= max
  • Networking
    • VPC and subnets
    • AZ
    • (Optional) Placement Group
      • Cannot specify multiple AZ if used
  • (Optional) ELB
  • Healthcheck
    • Type
    • Grace Period
  • Cooldown period
  • Termination Policies
  • Suspended Processes
  • Instance Protection
    • Protects from termination on scale-in
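
The group size rule above (min <= desired <= max) can be checked before any API call is made. This is a minimal sketch; `validate_group_size` is a hypothetical helper written for these notes, not part of any AWS SDK.

```python
def validate_group_size(min_size: int, max_size: int, desired: int) -> None:
    """Raise ValueError unless 0 <= min_size <= desired <= max_size."""
    if min_size < 0:
        raise ValueError("min_size must not be negative")
    if not (min_size <= desired <= max_size):
        raise ValueError(
            f"desired={desired} must satisfy {min_size} <= desired <= {max_size}"
        )


# A valid combination passes silently; an invalid one raises.
validate_group_size(min_size=1, max_size=4, desired=2)
```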

Notification
  • Sends SNS notification on following instance events
    • launch
    • terminate
    • failed to launch
    • failed to terminate

Launch Configuration
  • Template (blueprint) for instance to be launched by AS
    • AMI
    • Instance type
    • Spot Instances (Yes/No)
    • Detailed monitoring
    • Role
    • Public IP assignment
    • Additional Storage (EBS)
    • Security Groups
    • Keypair
  • Each ASG has exactly 1 (current) launch configuration
    • Cannot be edited - must be cloned
    • Updating the Launch Configuration does NOT impact existing instances (only newly launched ones)
  • As of 2018 recommended to use Launch Templates

Healthcheck
  • Instance starts as healthy
  • Type 
    • EC2 (default)
      • System Status
      • Instance Status
    • ELB (optional)
      • AS Reports instance as unhealthy if ELB reports OutOfService
      • Combined with EC2 healthcheck (logical "AND")
    • Custom healthcheck 
      • Manually notify AS that an instance is healthy/unhealthy (set-instance-health)
      • Overrides the health status
      • e.g. Can be used to mark the instance as healthy when it is rebooted
  • HealthcheckGracePeriod
    • Amount of time to wait before AS starts relying on Healthcheck
      • By default assumes Healthy
    • Used when new instance is started to give it time to prepare itself
    • If lifecycle hook is attached the time starts AFTER it is completed
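
The health rules above can be summarised as a small function. This is a simplified sketch of the notes, not of the actual Auto Scaling implementation: `is_healthy` and its parameters are hypothetical names, and passing an empty `elb_states` list stands for a group that uses the EC2 check only.

```python
from typing import Sequence


def is_healthy(
    ec2_ok: bool,
    elb_states: Sequence[str],
    seconds_in_service: float,
    grace_period: float,
) -> bool:
    """Combine EC2 and ELB health the way the notes describe it."""
    # Inside the grace period the instance is assumed healthy.
    if seconds_in_service < grace_period:
        return True
    # Every attached load balancer must report InService (logical AND).
    elb_ok = all(state == "InService" for state in elb_states)
    return ec2_ok and elb_ok
```

For example, `is_healthy(True, ["InService", "OutOfService"], 600, 300)` evaluates to `False`, because one load balancer reports the instance as out of service after the grace period has ended.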

Healthcheck Replacement
  • When instance is marked unhealthy it is immediately marked for termination
    • e.g. a stopped instance
    • You can try to call SetInstanceHealth with "Healthy", but there is a race condition with the termination
  • Subsequently new activity to launch replacement is initiated
  • EIP and EBS volumes are NOT attached to replacement instance
    • You can handle this via User data script

Auto Scaling Instance Lifecycle 
  • Pending (instance launched/attached)
    • Pending:Wait (lifecycle hook: 1h)
    • Pending:Proceed
  • InService  (passed healthcheck)
  • EnteringStandby
  • Standby
    • Call "ExitStandby" to return to Pending
  • Terminating
    • Terminating:Wait (lifecycle hook: 1h)
    • Terminating:Proceed
  • Terminated
  • Detaching
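
The lifecycle above can be written down as a transition table. This sketch covers only the happy-path transitions listed in these notes (it ignores failure paths such as an abandoned launch hook); the `TRANSITIONS` table and `is_valid_path` are illustrative names.

```python
# Allowed next states for each lifecycle state (happy path only).
TRANSITIONS = {
    "Pending": {"Pending:Wait", "InService"},
    "Pending:Wait": {"Pending:Proceed"},
    "Pending:Proceed": {"InService"},
    "InService": {"EnteringStandby", "Terminating", "Detaching"},
    "EnteringStandby": {"Standby"},
    "Standby": {"Pending"},
    "Terminating": {"Terminating:Wait", "Terminated"},
    "Terminating:Wait": {"Terminating:Proceed"},
    "Terminating:Proceed": {"Terminated"},
    "Detaching": {"Detached"},
    "Terminated": set(),
    "Detached": set(),
}


def is_valid_path(states):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(
        nxt in TRANSITIONS.get(cur, set()) for cur, nxt in zip(states, states[1:])
    )
```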

Lifecycle hooks
  • Custom code executed when instance enters "Wait" state
    • launch - Pending:Wait, e.g.
      • Install additional software with CodeDeploy
      • Fill-up cache
    • terminate - Terminating:Wait, e.g.
      • Analyze crashed instance
      • Retrieve Logs
      • Copy data out of instance
  • AS sends notification to Notification Target:
    • Targets
      • SQS
      • SNS
      • CloudWatch Events (e.g. Lambda)
  • Code must return result CompleteLifecycleAction
    • CONTINUE
    • ABANDON   
      • On Terminating:Wait the instance still terminates, but no remaining lifecycle hooks are executed
  • Timeout: default 60 minutes
    • Can be extended with RecordLifecycleActionHeartbeat
      • Max 48 hours 
  • Cooldown starts AFTER hook is completed
    • AS is frozen for a longer period of time
  • Max 50 hooks per Auto Scaling group
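
The code that handles a hook has to report its result back, and may extend the timeout while it works. The sketch below assumes `client` is a boto3 Auto Scaling client (`boto3.client("autoscaling")`) that the caller passes in; the function names `complete_hook` and `extend_hook` and all argument values are made up for illustration.

```python
def complete_hook(client, group_name, hook_name, instance_id, succeeded):
    """Report the hook result so the instance can leave its Wait state."""
    result = "CONTINUE" if succeeded else "ABANDON"
    client.complete_lifecycle_action(
        AutoScalingGroupName=group_name,
        LifecycleHookName=hook_name,
        InstanceId=instance_id,
        LifecycleActionResult=result,
    )
    return result


def extend_hook(client, group_name, hook_name, instance_id):
    """Send a heartbeat to restart the hook timeout while work is ongoing."""
    client.record_lifecycle_action_heartbeat(
        AutoScalingGroupName=group_name,
        LifecycleHookName=hook_name,
        InstanceId=instance_id,
    )
```

Passing the client in (rather than creating it inside the functions) keeps the sketch easy to try against a stub before pointing it at a real account.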

Instance actions
  • Attach existing EC2 instance
    • Must not be part of other ASG
    • AMI must exist
    • Increases Desired Count
    • Lifecycle
      • Pending
        • Call "AttachInstances"
  • Detach EC2 instance from ASG
    • Can be used to move to different ASG
    • Lifecycle
      • Detaching
        • Call "DetachInstances"
      • Detached
    • Specify if you want to decrement "Desired"
  • Standby
    • manually take instance out of ASG
    • Instance is deregistered from ELB (if applicable)
    • No healthcheck performed 
    • Use cases
      • Update or modify instance
      • Troubleshoot
    • By default "Desired" decremented (i.e. no replacement launched)
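
Standby is convenient to wrap in a context manager so the instance always returns to the group, even if the maintenance step fails. This is a sketch under the assumption that `client` is a boto3 Auto Scaling client supplied by the caller; the `standby` helper itself is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def standby(client, group_name, instance_id, decrement_desired=True):
    """Put an instance into Standby for the duration of the with-block."""
    client.enter_standby(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        # True means no replacement is launched while the instance is out.
        ShouldDecrementDesiredCapacity=decrement_desired,
    )
    try:
        yield instance_id
    finally:
        # Always hand the instance back, even if the maintenance step raised.
        client.exit_standby(
            InstanceIds=[instance_id],
            AutoScalingGroupName=group_name,
        )
```

Usage would look like `with standby(client, "web", "i-123"): ...`, with the update or troubleshooting step inside the block.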

ELB integration
  • ASG can use multiple ELBs
    • Max 50 per ASG
    • If any healthcheck fails the instance is marked unhealthy by ASG
      • even if all other ELBs consider it OK
    • Use case
      • Each ELB has a different SSL certificate associated
  • ELB points to ASG rather than specific instances inside it
  • ASG can re-use
    • ELB healthcheck     
    • Connection Draining (waits before termination)


Scaling
  • Methods
    • Manual
    • Schedule
    • Dynamic
      • Simple
      • Step
      • Target Tracking

Scaling Manually
  • Manually change the size of the ASG
    • desired count

Scaling by Schedule
  • When you can predict exact dates
  • Maximum 125 scheduled actions per Auto Scaling group (roughly 4 per day over a 31-day month)
  • Similar to programming a room thermostat
  • Group size properties change (min, desired, max)
  • Types
    • One time - start time
    • Recurring  - cron syntax
      • There is no YEAR field

Scaling by Policy (dynamic)
  • Scaling Adjustment Types
    • ChangeInCapacity(+/- number_of_instances)
    • ExactCapacity(number_of_instances)
    • PercentChangeInCapacity(+/- percent_change_in_capacity)
      • MinAdjustmentMagnitude (minimum number of instances)
      • Rounding
        • (0,1) => 1 and (-1,0) => -1
        • (-inf,-1)u(1,+inf) => drop fraction part (truncate toward zero)
  • Policy Type
    • Simple Scaling - single adjustment 
      • Supports any ALARM
      • When breach defined adjustment occurs (e.g. 3->8)
      • Cooldown supported
    • Step Scaling
      • Recommended
      • One or more steps
      • Responds to the magnitude of the Alarm (not just binary: ALARM/OK)
      • Warm-up supported
    • Target Tracking
      • Supported metrics
        • ALB Request Count per Target
        • Average CPU Utilization
        • Average Network In
        • Average Network Out
      • When breached scale-out occurs 
        • Works like a thermostat ("I want average CPU Utilization to stay at about 50%")
      • Warm-up supported
      • Scale-in can be disabled
  • Based on CloudWatch Alarms
    • e.g. CPU Utilization, ELB Latency, ELB RequestCount, SQS ApproximateNumberOfMessagesVisible
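
The three adjustment types and the rounding rule for percentages can be worked through in code. This is a sketch of the arithmetic as described in these notes; `new_capacity` is a hypothetical function, and clamping the result to the group's min/max is included as an assumption about how the final size is bounded.

```python
import math


def new_capacity(current, adjustment_type, value, min_size, max_size,
                 min_adjustment_magnitude=0):
    """Return the group size after applying one scaling adjustment."""
    if adjustment_type == "ExactCapacity":
        target = value
    elif adjustment_type == "ChangeInCapacity":
        target = current + value
    elif adjustment_type == "PercentChangeInCapacity":
        raw = current * value / 100.0
        if 0 < raw < 1:
            delta = 1            # small positive change rounds up to 1
        elif -1 < raw < 0:
            delta = -1           # small negative change rounds to -1
        else:
            delta = math.trunc(raw)  # otherwise drop the fraction part
        # Enforce the minimum step size, keeping the direction of the change.
        if 0 < abs(delta) < min_adjustment_magnitude:
            delta = int(math.copysign(min_adjustment_magnitude, delta))
        target = current + delta
    else:
        raise ValueError(f"unknown adjustment type: {adjustment_type}")
    # The group never leaves its configured bounds.
    return max(min_size, min(max_size, target))
```

With 10 instances, a +5% change is 0.5 and rounds to +1 (11 instances), while a -25% change is -2.5 and truncates to -2 (8 instances).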

Oscillations
  • Adding/removing instances changes the state of the system
    • This may cause oscillation behavior
  • In order to damp oscillations two mechanisms are provided
    • Cooldown
      • Period of time to wait before another scaling action
      • How long to wait before previous action gives result
      • Damps oscillations
      • Supported for Simple Scaling Policy
      • Locks the entire ASG
    • Warm-up
      • Supported by Step Scaling Policy and Target Tracking Policy 
      • Period of time after adding new instance when it is not counted towards aggregated metrics
      • Prevents adding or terminating too many instances

Termination Policy
  • How AWS decides which instance to terminate on scale-in
  • Firstly - always try to balance AZ (choose random AZ if all have the same instance count)
  • Secondly 
    • Default (OCR)
      • OldestLaunchConfiguration
      • ClosestToNextInstanceHour
      • Random
    • Custom
      • OldestInstance
      • NewestInstance
      • OldestLaunchConfiguration
      • ClosestToNextInstanceHour
  • Multiple policies can be associated with ASG
    • e.g. "OldestLaunchConfiguration", "NewestInstance", "Default"
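
The default policy (AZ balance first, then Oldest launch configuration, Closest to next instance hour, Random) can be sketched as a chain of filters. The `Instance` fields below are hypothetical stand-ins for data AWS tracks internally, and the function is an illustration of the order of the rules, not the real selection algorithm.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    az: str
    launch_config_age: int      # hypothetical: larger means older configuration
    seconds_to_next_hour: int   # hypothetical: time left in the billed hour


def pick_for_termination(instances, rng=random):
    """Pick one instance following the default termination policy order."""
    # 1. Balance AZs: take the AZ with the most instances (random on a tie).
    counts = Counter(i.az for i in instances)
    most = max(counts.values())
    az = rng.choice(sorted(a for a, n in counts.items() if n == most))
    candidates = [i for i in instances if i.az == az]
    # 2. Oldest launch configuration.
    oldest = max(i.launch_config_age for i in candidates)
    candidates = [i for i in candidates if i.launch_config_age == oldest]
    # 3. Closest to the next instance hour.
    closest = min(i.seconds_to_next_hour for i in candidates)
    candidates = [i for i in candidates if i.seconds_to_next_hour == closest]
    # 4. Random among whatever is left.
    return rng.choice(candidates)
```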

Auto Scaling Processes
  • Independent processes (workers) that perform state transitions 
  • Can be individually suspended/resumed (e.g. for debugging)
    • Administrative suspension
      • All processes in the group are suspended
      • Applied by AWS when the group has failed to launch instances for 24h
      • Can be resumed
  • Types
    • Launch - add new instances to the group
    • Terminate - removes instances from the group
    • Healthcheck - checks the health status
    • ReplaceUnhealthy
      • Uses: Healthcheck, Terminate, Launch
    • AZRebalance
      • Balance instance count between AZ
        • When AZ is removed from a group
        • AZ is failing or has recovered
        • Instance is explicitly terminated
      • Uses: Launch (before termination)
        • Unlike Healthcheck Replacement that kills the instance first
    • AlarmNotification
      • Accepts and reacts to CW Alarms associated with a group
      • Required for executing policies based on ALARM triggers
    • ScheduledActions
      • Performs scheduled actions
    • AddToLoadBalancer
      • adds launched instances to ELB

CloudWatch metrics
  • AutoScaling maintains aggregated instance metrics for all instances in the group (e.g. CPUUtilization)
    • Identical to EC2 but dimension is ASG (not instanceId)
  • Auto Scaling Metrics
    • GroupMinSize
    • GroupMaxSize
    • GroupDesiredCapacity
    • GroupInServiceInstances
    • ...
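
Reading the aggregated instance metric differs from a per-instance query only in the dimension. The sketch below builds the parameters for the boto3 CloudWatch call `get_metric_statistics`; `group_cpu_query` is a hypothetical helper, and the 5-minute period is an arbitrary choice for the example.

```python
from datetime import datetime, timedelta, timezone


def group_cpu_query(group_name, minutes=60, now=None):
    """Build a CloudWatch query for average CPU across a whole ASG."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        # The dimension is the group name rather than an InstanceId.
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": group_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 300,
        "Statistics": ["Average"],
    }
```

The caller would pass the result on with something like `cloudwatch.get_metric_statistics(**group_cpu_query("web"))`, where `cloudwatch` is a boto3 CloudWatch client.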

Spot Instances
  • Can be used with ASG
  • Require separate Launch Configuration
    • Specify bid price
      • Cannot be modified as Launch Configuration is immutable
    • Cannot mix on-demand and spot
  • When spot instance is interrupted AS tries to launch replacement
