Notes on AWS, Big Data, Machine Learning and Leadership: AWS EC2 (Auto Scaling)

Use cases

Scaling activity
Instance replacement

Auto Scaling Group

Group of resources
Associated with either
- Launch Configuration
- Launch Template
Group size
- min
- max
- desired
  - used for manual scaling and scheduled scaling
  - must satisfy min <= desired <= max
Networking
- VPC and subnets
- AZ
- (Optional) Placement Group
  - Cannot specify multiple AZ if used
(Optional) ELB
Healthcheck
- Type
- Grace Period
Cooldown period
Termination Policies
Suspended Processes
Instance Protection
- Protects from termination on scale-in
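
The group size rule above (min <= desired <= max) can be sketched as a small check. This is a minimal illustration, not AWS code: the GroupSize class and clamp_desired helper are hypothetical names, and clamping is shown only as one way a requested size could be bounded by min and max.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupSize:
    """Size settings of a hypothetical Auto Scaling group."""
    min_size: int
    max_size: int
    desired: int

    def is_valid(self) -> bool:
        # The constraint from the notes: min <= desired <= max (sizes are non-negative).
        return 0 <= self.min_size <= self.desired <= self.max_size


def clamp_desired(requested: int, min_size: int, max_size: int) -> int:
    """Pull a requested desired count back into the [min, max] range."""
    return max(min_size, min(requested, max_size))
```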

Notification

Sends SNS notification on following instance events
- launch
- terminate
- failed to launch
- failed to terminate

Launch Configuration

Template (blueprint) for instance to be launched by AS
- AMI
- Instance type
- Spot Instances (Yes/No)
- Detailed monitoring
- Role
- Public IP assignment
- Additional Storage (EBS)
- Security Groups
- Keypair
Each ASG has exactly 1 (current) launch configuration
- Cannot be edited - must be cloned
- Updating the Launch Configuration does NOT impact existing instances (only newly launched instances use it)
As of 2018 recommended to use Launch Templates

Healthcheck

Instance starts as healthy
Type
- EC2 (default)
  - System Status
  - Instance Status
- ELB (optional)
  - AS Reports instance as unhealthy if ELB reports OutOfService
  - Combined with EC2 healthcheck (logical "AND")
- Custom healthcheck
  - Manually notify AS that an instance is healthy/unhealthy (set-instance-health)
  - Overrides the health status
  - e.g. Can be used to mark the instance as healthy when it is rebooted
HealthcheckGracePeriod
- Amount of time to wait before AS starts relying on Healthcheck
  - By default assumes Healthy
- Used when new instance is started to give it time to prepare itself
- If lifecycle hook is attached the time starts AFTER it is completed
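
How the health check inputs above combine can be sketched as one function. This is a simplified model under stated assumptions: the function name and parameters are hypothetical, and giving the custom override priority over the grace period is a simplification of the "overrides the health status" note.

```python
from typing import Optional


def effective_health(ec2_ok: bool,
                     elb_in_service: bool,
                     use_elb_check: bool,
                     seconds_since_launch: float,
                     grace_period: float,
                     custom_override: Optional[str] = None) -> str:
    """Return "Healthy" or "Unhealthy" for one instance (illustrative model)."""
    # A custom health status set manually overrides everything else.
    if custom_override is not None:
        return custom_override
    # During the grace period the instance is assumed healthy.
    if seconds_since_launch < grace_period:
        return "Healthy"
    # EC2 check is always used; the ELB check is ANDed in when enabled.
    healthy = ec2_ok and (elb_in_service or not use_elb_check)
    return "Healthy" if healthy else "Unhealthy"
```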

Healthcheck Replacement

When instance is marked unhealthy it is immediately marked for termination
- e.g. when the instance is stopped
- You can try to set it back to healthy with SetInstanceHealth (HealthStatus=Healthy), but this races with the termination
Subsequently new activity to launch replacement is initiated
EIP and EBS volumes are NOT attached to replacement instance
- You can handle this via User data script

Auto Scaling Instance Lifecycle

Pending (instance launched/attached)
- Pending:Wait (lifecycle hook: 1h)
- Pending:Proceed
InService (passed healthcheck)
EnteringStandby
Standby
- Call "ExitStandby" to return to Pending
Terminating
- Terminating:Wait (lifecycle hook: 1h)
- Terminating:Proceed
Terminated
Detaching
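
The lifecycle above reads naturally as a state machine. The sketch below transcribes the states from the list into a transition table; the table is an interpretation of these notes (including the direct Pending to InService and Terminating to Terminated paths when no hook is attached, and a final Detached state), not an official definition.

```python
# Allowed transitions, transcribed from the lifecycle list above.
TRANSITIONS = {
    "Pending": {"Pending:Wait", "InService"},
    "Pending:Wait": {"Pending:Proceed"},
    "Pending:Proceed": {"InService"},
    "InService": {"EnteringStandby", "Terminating", "Detaching"},
    "EnteringStandby": {"Standby"},
    "Standby": {"Pending"},
    "Terminating": {"Terminating:Wait", "Terminated"},
    "Terminating:Wait": {"Terminating:Proceed"},
    "Terminating:Proceed": {"Terminated"},
    "Detaching": {"Detached"},
    "Terminated": set(),
    "Detached": set(),
}


def is_valid_path(states):
    """True if every consecutive pair of states is an allowed transition."""
    return all(b in TRANSITIONS.get(a, set())
               for a, b in zip(states, states[1:]))
```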

Lifecycle hooks

Custom code executed when instance enters "Wait" state
- launch - Pending:Wait, e.g.
  - Install additional software with CodeDeploy
  - Fill-up cache
- terminate - Terminating:Wait, e.g.
  - Analyze crashed instance
  - Retrieve Logs
  - Copy data out of instance
AS sends notification to Notification Target:
- Targets
  - SQS
  - SNS
  - CloudWatch Events (e.g. Lambda)
Code must return result CompleteLifecycleAction
- CONTINUE
- ABANDON
  - On Terminating:Wait the instance still terminates, but no other lifecycle hooks are executed
Timeout: default 60 minutes
- Can be extended with RecordLifecycleActionHeartbeat
  - Max 48 hours
Cooldown starts AFTER hook is completed
- AS is frozen for a longer period of time
Max 50 hooks per Auto Scaling group
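
A hook handler typically reads the notification and answers with CompleteLifecycleAction. The sketch below only builds the request parameters from a notification body, so it runs without AWS access. The message field names (LifecycleHookName, AutoScalingGroupName, LifecycleActionToken) are my understanding of the notification format and should be checked against a real message; build_completion is a hypothetical helper.

```python
import json


def build_completion(message_body: str, succeeded: bool) -> dict:
    """Build CompleteLifecycleAction parameters from a hook notification."""
    msg = json.loads(message_body)
    return {
        "LifecycleHookName": msg["LifecycleHookName"],
        "AutoScalingGroupName": msg["AutoScalingGroupName"],
        "LifecycleActionToken": msg["LifecycleActionToken"],
        # CONTINUE lets the instance proceed; ABANDON gives up on it.
        "LifecycleActionResult": "CONTINUE" if succeeded else "ABANDON",
    }


# Example notification body (values are made up for illustration).
example = json.dumps({
    "LifecycleHookName": "install-software",
    "AutoScalingGroupName": "web-asg",
    "LifecycleActionToken": "token-123",
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_LAUNCHING",
    "EC2InstanceId": "i-0123456789abcdef0",
})
params = build_completion(example, succeeded=True)
```

With boto3 available, the resulting dict could be passed as keyword arguments to the Auto Scaling client's complete_lifecycle_action call.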

Instance actions

Attach existing EC2 instance
- Must not be part of other ASG
- AMI must exist
- Increases Desired Count
- Lifecycle
  - Pending
    - Call "AttachInstances"
Detach EC2 instance from ASG
- Can be used to move to different ASG
- Lifecycle
  - Detaching
    - Call "DetachInstances":
  - Detached
- Specify if you want to decrement "Desired"
Standby
- manually take instance out of ASG
- Instance is deregistered from ELB (if applicable)
- No healthcheck performed
- Use cases
  - Update or modify instance
  - Troubleshoot
- By default "Desired" decremented (i.e. no replacement launched)
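
The effect of these actions on the desired count can be sketched with a toy model. The Group class and its methods are hypothetical; they only mirror the bookkeeping described above (attach increments desired, detach and standby optionally decrement it), and replacements_needed shows when a replacement would be launched.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Toy model of an ASG's desired count and instance sets."""
    desired: int
    in_service: set = field(default_factory=set)
    standby: set = field(default_factory=set)

    def attach(self, instance_id: str) -> None:
        # Attaching an existing instance increases the desired count.
        self.in_service.add(instance_id)
        self.desired += 1

    def detach(self, instance_id: str, decrement_desired: bool) -> None:
        self.in_service.remove(instance_id)
        if decrement_desired:
            self.desired -= 1

    def enter_standby(self, instance_id: str, decrement_desired: bool = True) -> None:
        self.in_service.remove(instance_id)
        self.standby.add(instance_id)
        if decrement_desired:
            self.desired -= 1

    def replacements_needed(self) -> int:
        # Instances the group would launch to get back to the desired count.
        return max(0, self.desired - len(self.in_service))
```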

ELB integration

ASG can use multiple ELBs
- Max 50 per ASG
- If any healthcheck fails the instance is marked unhealthy by ASG
  - even if all other ELBs consider it OK
- Use case
  - Each ELB has a different SSL certificate associated
ELB points to ASG rather than specific instances inside it
ASG can re-use
- ELB healthcheck
- Connection Draining (waits before termination)

Scaling

Methods
- Manual
- Schedule
- Dynamic
  - Simple
  - Step
  - Target Tracking

Scaling Manually

Manually change the size of the ASG
- desired count

Scaling by Schedule

When you can predict exact dates
Maximum 125 scheduled actions per Auto Scaling group (roughly 4 per day over a 31-day month)
Similar to programming a room thermostat
Group size properties change (min, desired, max)
Types
- One time - start time
- Recurring - cron syntax
  - There is no YEAR field
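
A recurring scheduled action can be sketched as a parameter set with a five-field cron expression (no year field). The scheduled_action helper is hypothetical and only validates and assembles values; the key names follow the PutScheduledUpdateGroupAction parameters as I understand them.

```python
def scheduled_action(name: str, group: str, recurrence: str,
                     min_size: int, max_size: int, desired: int) -> dict:
    """Validate and assemble parameters for a recurring scheduled action."""
    # Cron syntax here has five fields: minute hour day-of-month month day-of-week.
    if len(recurrence.split()) != 5:
        raise ValueError("recurrence must have exactly 5 cron fields (no year)")
    if not (min_size <= desired <= max_size):
        raise ValueError("desired must be between min and max")
    return {
        "ScheduledActionName": name,
        "AutoScalingGroupName": group,
        "Recurrence": recurrence,
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
    }


# Scale up every weekday at 08:00 (names and sizes are made up).
morning = scheduled_action("weekday-morning", "web-asg", "0 8 * * 1-5", 2, 10, 6)
```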

Scaling by Policy (dynamic)

Scaling Adjustment Types
- ChangeInCapacity(+/- number_of_instances)
- ExactCapacity(number_of_instances)
- PercentChangeInCapacity(+/- percent_change_in_capacity)
  - MinAdjustmentMagnitude (minimum number of instances to change)
  - Rounding
    - (0,1) => 1
    - (-1,0) => -1
    - (-inf,-1) u (1,+inf) => drop the fractional part (truncate toward zero)
Policy Type
- Simple Scaling - single adjustment
  - Supports any ALARM
  - When breach defined adjustment occurs (e.g. 3->8)
  - Cooldown supported
- Step Scaling
  - Recommended
  - One or more steps
  - Responds to the magnitude of the Alarm (not just binary: ALARM/OK)
  - Warm-up supported
- Target Tracking
  - Supported metrics
    - ALB Request Count per Target
    - Average CPU Utilization
    - Average Network In
    - Average Network Out
  - Scales out when the metric rises above the target (and scales in when it falls below)
    - Works like a thermostat ("I want average CPU Utilization to stay at about 50%")
  - Warm-up supported
  - Scale-in can be disabled
Based on CloudWatch Alarms
- e.g. CPU Utilization, ELB Latency, ELB RequestCount, SQS ApproximateNumberOfMessagesVisible
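
The three adjustment types and the percent rounding rule can be sketched as one function. This is a minimal model under stated assumptions: apply_adjustment is a hypothetical helper, fractional changes between 0 and 1 are rounded to 1 (and between -1 and 0 to -1), larger values are truncated toward zero, and the result is kept within min and max.

```python
import math


def apply_adjustment(current: int, adj_type: str, value: float,
                     min_size: int, max_size: int,
                     min_magnitude: int = 0) -> int:
    """Return the new capacity after one scaling adjustment (illustrative)."""
    if adj_type == "ExactCapacity":
        target = int(value)
    elif adj_type == "ChangeInCapacity":
        target = current + int(value)
    elif adj_type == "PercentChangeInCapacity":
        raw = current * value / 100.0
        if raw == 0:
            delta = 0
        elif 0 < raw < 1:
            delta = 1            # small positive change rounds up to 1
        elif -1 < raw < 0:
            delta = -1           # small negative change rounds to -1
        else:
            delta = math.trunc(raw)  # drop the fractional part
        # MinAdjustmentMagnitude: change by at least this many instances.
        if delta != 0 and abs(delta) < min_magnitude:
            delta = min_magnitude if delta > 0 else -min_magnitude
        target = current + delta
    else:
        raise ValueError(f"unknown adjustment type: {adj_type}")
    # The group never goes outside its min/max bounds.
    return max(min_size, min(target, max_size))
```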

Oscillations

Adding/removing instances changes the state of the system
- This may cause oscillation behavior
In order to damp oscillations two mechanisms are provided
- Cooldown
  - Period of time to wait before another scaling action
  - Gives the previous action time to take effect before the next one is evaluated
  - Damps oscillations
  - Supported for Simple Scaling Policy
  - Locks the entire ASG
- Warm-up
  - Supported by Step Scaling Policy and Target Tracking Policy
  - Period of time after adding new instance when it is not counted towards aggregated metrics
  - Prevents adding or terminating too many instances
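
Both damping mechanisms can be sketched in a few lines. The helpers below are hypothetical: cooldown_allows blocks a new scaling action until the cooldown has elapsed (300 seconds is used as the default here), and aggregated_cpu leaves instances that are still warming up out of the averaged metric.

```python
from typing import List, Optional, Tuple


def cooldown_allows(now: float, last_activity_end: Optional[float],
                    cooldown: float = 300.0) -> bool:
    """True if a new scaling action may start (simple scaling cooldown)."""
    return last_activity_end is None or now - last_activity_end >= cooldown


def aggregated_cpu(instances: List[Tuple[float, float]], now: float,
                   warmup: float) -> Optional[float]:
    """Average CPU over instances past their warm-up.

    instances: list of (launch_time_seconds, cpu_percent).
    Returns None if no instance has finished warming up.
    """
    warmed = [cpu for launched, cpu in instances if now - launched >= warmup]
    return sum(warmed) / len(warmed) if warmed else None
```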

Termination Policy

How AWS decides which instance to terminate on scale-in
Firstly - always try to balance AZ (choose random AZ if all have the same instance count)
Secondly
- Default (OCR)
  - OldestLaunchConfiguration
  - ClosestToNextInstanceHour
  - Random
- Custom
  - OldestInstance
  - NewestInstance
  - OldestLaunchConfiguration
  - ClosestToNextInstanceHour
Multiple policies can be associated with an ASG
- e.g. ("OldestLaunchConfiguration", "NewestInstance", "Default")
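
The default selection order described above (balance AZs first, then oldest launch configuration, then closest to the next instance hour, then random) can be sketched as a filter chain. The Instance fields and pick_for_termination are hypothetical; in particular launch_config_age is a made-up numeric stand-in for "how old the launch configuration is".

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    instance_id: str
    az: str
    launch_config_age: int      # hypothetical: larger means older launch configuration
    minutes_to_next_hour: int   # minutes until the next billed instance hour


def pick_for_termination(instances: List[Instance],
                         rng: Optional[random.Random] = None) -> Instance:
    """Pick one instance to terminate on scale-in (default-policy sketch)."""
    rng = rng or random.Random()
    # 1. Balance AZs: pick an AZ with the most instances (random on ties).
    counts = Counter(i.az for i in instances)
    most = max(counts.values())
    az = rng.choice(sorted(a for a, n in counts.items() if n == most))
    pool = [i for i in instances if i.az == az]
    # 2. OldestLaunchConfiguration.
    oldest = max(i.launch_config_age for i in pool)
    pool = [i for i in pool if i.launch_config_age == oldest]
    # 3. ClosestToNextInstanceHour.
    closest = min(i.minutes_to_next_hour for i in pool)
    pool = [i for i in pool if i.minutes_to_next_hour == closest]
    # 4. Random among whatever is left.
    return rng.choice(pool)
```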

Auto Scaling Processes

Independent processes (workers) that perform state transitions
Can be individually suspended/resumed (e.g. for debugging)
- Administrative suspension
  - All processes in the group are suspended
  - Applied by AWS when the group has failed to launch instances for 24h
  - Can be resumed
Types
- Launch - add new instances to the group
- Terminate - removes instances from the group
- Healthcheck - checks the health status
- ReplaceUnhealthy
  - Uses: Healthcheck, Terminate, Launch
- AZRebalance
  - Balance instance count between AZ
    - When AZ is removed from a group
    - AZ is failing or has recovered
    - Instance is explicitly terminated
  - Uses: Launch (before termination)
    - Unlike Healthcheck Replacement that kills the instance first
- AlarmNotification
  - Accepts and reacts on CW Alarms associated with a group
  - Required for executing policies based on ALARM triggers
- ScheduledActions
  - Performs scheduled actions
- AddToLoadBalancer
  - adds launched instances to ELB

CloudWatch metrics

AutoScaling maintains aggregated instance metrics for all instances in the group (e.g. CPUUtilization)
- Identical to EC2 but the dimension is AutoScalingGroupName (not InstanceId)
Auto Scaling Metrics
- GroupMinSize
- GroupMaxSize
- GroupDesiredCapacity
- GroupInServiceInstances
- ...

Spot Instances

Can be used with ASG
Require separate Launch Configuration
- Specify bid price
  - Cannot be modified as Launch Configuration is immutable
- Cannot mix on-demand and spot
When spot instance is interrupted AS tries to launch replacement

Saturday, 24 February 2018