Notes on AWS, Big Data, Machine Learning and Leadership: February 2018

Monday, 26 February 2018

AWS IoT Device Management

Overview

Tool to manage millions of IoT devices
Onboarding
- templates for bulk-registration (device provisioning)
- heterogeneous devices
  - Amazon FreeRTOS based
  - Greengrass based
Organization
- Extends IoT Device Registry
- Hierarchical model
  - Policies can be set on sub-hierarchies
  - Groups
- Queries (fleet search)
  - e.g. device type, firmware version
Monitoring
- Gathers telemetry (real-time connection, status, authentication)
Remote Management

Job

Can target specific device groups
Examples
- Selective fleet update (OTA)
  - software/firmware update
- Collect diagnostics (e.g. engine started >= 20K times)
- Device reboot

Amazon FreeRTOS

Overview

Based on FreeRTOS kernel
Runs on edge device (IoT)
Allows connecting the device to
- AWS Service
- More powerful device (via Greengrass)
Extends via packages
- Security library
- Connectivity
- Secure Updateability (can update itself)

FreeRTOS

Popular Real-Time Operating System
Open Source
De-facto standard for micro-controllers
- Texas Instruments, Microchip, NXP Semiconductors, ...

AWS Greengrass

Greengrass Core

Runtime allowing code execution on IoT devices
Allows running Lambda functions locally on the device
Local MQTT messaging across network
Secure communication with cloud
Device with Greengrass may act as a gateway for less powerful devices
Over-The-Air (OTA) Updates
Supported devices
- >= 128 MB memory
- x86/ARM (>=1GHz)
- Example
  - RaspberryPi
Use cases
- Filter/aggregate device data to transmit only what is necessary
  - Reduces cost
- Offline processing (connectivity may be slow or intermittent)
- Can be deployed on Snowball Edge

Greengrass Group

Collection of
- Greengrass Cores
- Devices (Amazon FreeRTOS, IoT Device)
Things can communicate with each other

References

https://aws.amazon.com/blogs/aws/aws-greengrass-run-aws-lambda-functions-on-connected-devices/

AWS Step Functions

Overview

Orchestration layer
AWS manages the state between invocations
Suitable for Serverless Lambda functions
Simplified version of Simple Workflow Service
- Uses SWF behind the scenes
Uses Amazon States Language (JSON) to express state

State

Element of state machine (workflow)
- Task (do some work)
- Choice
- Stop (Fail, Succeed)
- Pass input->output
- Wait (delay timer)
- Parallel

Task

Lambda function
Activity
- Performed by worker (EC2, ECS, mobile device)
- Uses long polling

Error handling

Retry
- Task and Parallel states
- ErrorEquals (what exception to handle)
- IntervalSeconds (delay before the first retry)
- MaxAttempts
- BackoffRate (how quickly interval increases)
Catch
- Fallback state
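
To get a feel for how the Retry fields interact, here is a small sketch (not part of Step Functions itself) that computes the wait before each retry. `retry_delays` is a hypothetical helper; the field names and the defaults used for omitted fields are assumed from the Amazon States Language.

```python
def retry_delays(retrier):
    """Return the wait (in seconds) before each retry attempt.

    Assumed Amazon States Language defaults for omitted fields:
    IntervalSeconds=1, MaxAttempts=3, BackoffRate=2.0.
    """
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    backoff = retrier.get("BackoffRate", 2.0)
    # Each retry waits BackoffRate times longer than the previous one.
    return [interval * backoff ** n for n in range(attempts)]


retrier = {
    "ErrorEquals": ["States.Timeout"],
    "IntervalSeconds": 3,
    "MaxAttempts": 2,
    "BackoffRate": 1.5,
}
print(retry_delays(retrier))  # [3.0, 4.5]
```

Once the attempts are exhausted, a matching Catch (if any) routes execution to its fallback state.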

References

https://www.youtube.com/watch?v=75MRve4nv8s

Saturday, 24 February 2018

AWS Storage Gateway

Overview

Enables hybrid storage architectures
- Move data to AWS for big data, cloud bursting or migration
- Backup, archive, DR
- Tiered storage (on-premise, cloud)
Uses native AWS storage
- S3
- EBS Snapshots
Efficient data transfer
- Reduces bandwidth usage
Local caching

AWS Storage Gateway VM

Virtual appliance downloaded from AWS
Acts as a facade so that client applications on-premise need no change
- Standard storage protocols
Installed on a host on-premise
- Needs VMware ESXi or Microsoft Hyper-V hypervisor
- Possible to install on EC2
  - e.g. for PoC purpose
Activation
- specify IP address, name, timezone
- AWS region to store snapshots
- associates gateway with AWS account
Must have access to disk subsystem: SAN, NAS or DAS

File Gateway

Exposes NFS mount target
- NFS v3/4.1
- Mounts S3 as a file system
1-1 mapping between S3 object and file
- Including metadata
Data stored in S3 bucket and can be accessed directly
- Fine-grained control, lifecycle, CRR, etc.
- EFS does not provide this
- NFS client can access any data in the bucket
  - Including created outside of File Gateway, e.g.
    - Replicated from other bucket
    - Imported via Snowball

Volume Gateway

Exposes iSCSI mount point (block storage)
- Initiator: on-premise Application Server
- Target: Storage Gateway
Data stored on S3 in opaque format
- unlike File Gateway
  - no direct access to objects
  - stored in AWS buckets
- compression at-rest and in-transit
On-premise volume can be backed up to EBS snapshot
- Restored as EBS volume
Max 1 PB of volume
Mode
- Stored
- Cached

Volume Gateway (Stored)

Local disk: "source of truth"
- S3: continuous asynchronous backup of on-premise volume
EBS snapshots can be created
- ad-hoc
- scheduled
Size
- Max 16TB per volume (EBS volume limit)
- Max 32 volumes (=Max 512TB)
Use case
- Offsite backup
  - On restore: everything downloaded

Volume Gateway (Cached)

S3: "source of truth"
- Local disk: cache of data
Minimize need for on-premise scaling
Size
- Max 32TB per volume
- Max 32 volumes (=Max 1PB)
Data stored on S3 as "Volume Storage"
- Supports Point In Time Snapshot of data in S3 -> EBS Snapshot
  - Can be restored to "Volume Storage"
  - If < 16 TB can also be restored to EBS volume
Allocated on-premise storage best practices
- Cache Storage - local cache of frequently accessed data
  - optimize performance for iSCSI
  - durable data storage on-premise
  - Allow at least 20% of entire storage volume
  - Use RAID5 or RAID6
  - When it fills up with dirty data, iSCSI writes are blocked
- Upload Buffer - queue for getting data up to S3 (asynchronous)
  - At least 150 GB recommended
  - Optimize performance for S3
  - When it fills up, data is uploaded directly from Cache Storage, but point-in-time (PIT) snapshots cannot be taken during that time
- Separate Cache/Buffer to different spindles
Dirty data - data put in Cached Volume that is not yet uploaded to S3
Avoid
- Windows full format: it initializes every block, so you start paying for the used storage (use quick format instead)
- Full antivirus sweep: it ruins the cache
Use cases
- Large data set but small working set
- Moving from Volume (Stored) to Volume (Cached)
- Backup
  - On restore: nothing downloaded (empty cache)

Tape Gateway

Drop-in replacement for physical tape infrastructure
Exposes iSCSI interface
- Media Changer
- Tape Drive
Virtual Tape (VT)
- Analogous to physical cartridge
- Size: 100 GB to 2.5 TB
- States
  - AVAILABLE - application may write to it
  - IN TRANSIT TO VTS - uploading data to AWS
  - ARCHIVING - upload to AWS complete. Archiving
  - ARCHIVED - in Glacier
Virtual Tape Library (VTL)
- Analogous to Physical Library (with robotic arms and tape drives)
- Many existing backup tools supported (e.g. Dell, Veritas, etc.)
- Max 1500 Virtual Tapes (total 150TB) in library
  - Unlimited number in AWS
- Drives
  - Tape Drive
    - I/O and Seek
    - Max 10
    - Responds to SCSI commands
  - Media Changer (robotic arm)
    - Max 1
    - "Inserts" Virtual Tape into Tape Drive
Virtual Tape Shelf (VTS)
- Analogous to off-site tape holding facility
- When backup software ejects the tape it is moved to VTS
- Backed by Glacier
Use case
- Replacing physical tape infrastructure

AWS EC2 (Auto Scaling)

Use cases

Scaling activity
Instance replacement

Auto Scaling Group

Group of resources
Associated with either
- Launch Configuration
- Launch Template
Group size
- min
- max
- desired
  - used for manual scaling and scheduled scaling
  - must satisfy min <= desired <= max
Networking
- VPC and subnets
- AZ
- (Optional) Placement Group
  - Cannot specify multiple AZ if used
(Optional) ELB
Healthcheck
- Type
- Grace Period
Cooldown period
Termination Policies
Suspended Processes
Instance Protection
- Protects from termination on scale-in

Notification

Sends SNS notification on following instance events
- launch
- terminate
- failed to launch
- failed to terminate

Launch Configuration

Template (blueprint) for instance to be launched by AS
- AMI
- Instance type
- Spot Instances (Yes/No)
- Detailed monitoring
- Role
- Public IP assignment
- Additional Storage (EBS)
- Security Groups
- Keypair
Each ASG has exactly 1 (current) launch configuration
- Cannot be edited - must be cloned
- Switching to a new Launch Configuration does NOT impact existing instances (only newly launched ones use it)
As of 2018 recommended to use Launch Templates

Healthcheck

Instance starts as healthy
Type
- EC2 (default)
  - System Status
  - Instance Status
- ELB (optional)
  - AS Reports instance as unhealthy if ELB reports OutOfService
  - Combined with EC2 healthcheck (logical "AND")
- Custom healthcheck
  - Manually notify AS that an instance is healthy/unhealthy (set-instance-health)
  - Overrides the health status
  - e.g. Can be used to mark the instance as healthy when it is rebooted
HealthcheckGracePeriod
- Amount of time to wait before AS starts relying on Healthcheck
  - By default assumes Healthy
- Used when new instance is started to give it time to prepare itself
- If lifecycle hook is attached the time starts AFTER it is completed

Healthcheck Replacement

When instance is marked unhealthy it is immediately marked for termination
- e.g. when the instance is stopped
- You can attempt to call SetInstanceHealth with "Healthy" but there is a race condition
Subsequently new activity to launch replacement is initiated
EIP and EBS volumes are NOT attached to replacement instance
- You can handle this via User data script

Auto Scaling Instance Lifecycle

Pending (instance launched/attached)
- Pending:Wait (lifecycle hook: 1h)
- Pending:Proceed
InService (passed healthcheck)
EnteringStandby
Standby
- Call "ExitStandby" to return to Pending
Terminating
- Terminating:Wait (lifecycle hook: 1h)
- Terminating:Proceed
Terminated
Detaching

Lifecycle hooks

Custom code executed when instance enters "Wait" state
- launch - Pending:Wait, e.g.
  - Install additional software with CodeDeploy
  - Fill-up cache
- terminate - Terminating:Wait, e.g.
  - Analyze crashed instance
  - Retrieve Logs
  - Copy data out of instance
AS sends notification to Notification Target:
- Targets
  - SQS
  - SNS
  - CloudWatch Events (e.g. Lambda)
Code must return result CompleteLifecycleAction
- CONTINUE
- ABANDON
  - On Terminating:Wait it still terminates but no other lifecycle hooks are executed
Timeout: default 60 minutes
- Can be extended with RecordLifecycleActionHeartbeat
  - Max 48 hours
Cooldown starts AFTER hook is completed
- AS is frozen for a longer period of time
Max 50 hooks per Auto Scaling group
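
As a rough illustration of the hook flow above, the sketch below builds the arguments for CompleteLifecycleAction from a hook notification. It assumes the CloudWatch Events payload shape (a `detail` object carrying the hook name, group name, token and instance id); `completion_params`, `handler` and `prepare` are hypothetical names, and the AWS call is only made if a client is passed in.

```python
def completion_params(event, healthy):
    """Build the arguments for CompleteLifecycleAction from a hook event."""
    detail = event["detail"]
    return {
        "LifecycleHookName": detail["LifecycleHookName"],
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        # CONTINUE lets the instance proceed; ABANDON gives up on it.
        "LifecycleActionResult": "CONTINUE" if healthy else "ABANDON",
    }


def handler(event, context, autoscaling=None, prepare=lambda instance_id: True):
    """Lambda-style entry point; `prepare` stands in for the custom work
    (install software, warm a cache, collect logs) and returns True on success."""
    ok = prepare(event["detail"]["EC2InstanceId"])
    params = completion_params(event, ok)
    if autoscaling is not None:
        # With boto3 this would be: autoscaling = boto3.client("autoscaling")
        autoscaling.complete_lifecycle_action(**params)
    return params
```

If the custom work may outlast the hook timeout, the same token can be used to send a heartbeat before completing the action.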

Instance actions

Attach existing EC2 instance
- Must not be part of other ASG
- AMI must exist
- Increases Desired Count
- Lifecycle
  - Pending
    - Call "AttachInstances"
Detach EC2 instance from ASG
- Can be used to move to different ASG
- Lifecycle
  - Detaching
    - Call "DetachInstances":
  - Detached
- Specify if you want to decrement "Desired"
Standby
- manually take instance out of ASG
- Instance is deregistered from ELB (if applicable)
- No healthcheck performed
- Use cases
  - Update or modify instance
  - Troubleshoot
- By default "Desired" decremented (i.e. no replacement launched)

ELB integration

ASG can use multiple ELBs
- Max 50 per ASG
- If any healthcheck fails the instance is marked unhealthy by ASG
  - even if all other ELBs consider it OK
- Use case
  - Each ELB has a different SSL certificate associated
ELB points to ASG rather than specific instances inside it
ASG can re-use
- ELB healthcheck
- Connection Draining (waits before termination)

Scaling

Methods
- Manual
- Schedule
- Dynamic
  - Simple
  - Step
  - Target Tracking

Scaling Manually

Manually change the size of the ASG
- desired count

Scaling by Schedule

When you can predict exact dates
Maximum 125 scheduled actions per Auto Scaling group (about 4 per day over a 31-day month)
Similar to programming a room thermostat
Group size properties change (min, desired, max)
Types
- One time - start time
- Recurring - cron syntax
  - There is no YEAR field

Scaling by Policy (dynamic)

Scaling Adjustment Types
- ChangeInCapacity(+/- number_of_instances)
- ExactCapacity(number_of_instances)
- PercentChangeInCapacity(+/- percent_change_in_capacity)
  - MinAdjustmentMagnitude (minimum number of instances)
  - Rounding
    - (0,1) => 1 and (-1,0) => -1
    - (-inf,-1)u(1,+inf) => drop fraction part (cast)
Policy Type
- Simple Scaling - single adjustment
  - Supports any ALARM
  - When breach defined adjustment occurs (e.g. 3->8)
  - Cooldown supported
- Step Scaling
  - Recommended
  - One or more steps
  - Responds to the magnitude of the Alarm (not just binary: ALARM/OK)
  - Warm-up supported
- Target Tracking
  - Supported metrics
    - ALB Request Count per Target
    - Average CPU Utilization
    - Average Network In
    - Average Network Out
  - When breached scale-out occurs
    - Works like thermostat ("I want average CPU Utilization to be < 50%")
  - Warm-up supported
  - Scale-in can be disabled
Based on CloudWatch Alarms
- e.g. CPU Utilization, ELB Latency, ELB RequestCount, SQSNumberOfMessagesVisible
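
The PercentChangeInCapacity rounding rules are easy to get wrong, so here is a minimal sketch of how the instance delta could be computed. `percent_change` is a hypothetical helper, and the rules encoded are my reading of the documented behavior: fractions smaller than one instance round away from zero to +/-1, larger values drop the fractional part, and MinAdjustmentMagnitude acts as a floor on the size of the change.

```python
import math


def percent_change(current_capacity, percent, min_adjustment_magnitude=0):
    """Return the number of instances to add (positive) or remove (negative)."""
    raw = current_capacity * percent / 100.0
    if raw == 0:
        return 0
    if abs(raw) < 1:
        # (0,1) => 1 and (-1,0) => -1
        delta = 1 if raw > 0 else -1
    else:
        # Drop the fractional part (round towards zero).
        delta = math.trunc(raw)
    if abs(delta) < min_adjustment_magnitude:
        delta = min_adjustment_magnitude if raw > 0 else -min_adjustment_magnitude
    return delta


print(percent_change(10, 25))      # 2  (2.5 rounded down)
print(percent_change(3, 10))       # 1  (0.3 rounded up to a whole instance)
print(percent_change(10, 10, 2))   # 2  (1 raised to the minimum magnitude)
```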

Oscillations

Adding/removing instances changes the state of the system
- This may cause oscillation behavior
In order to damp oscillations two mechanisms are provided
- Cooldown
  - Period of time to wait before another scaling action
  - How long to wait before previous action gives result
  - Damps oscillations
  - Supported for Simple Scaling Policy
  - Locks the entire ASG
- Warm-up
  - Supported by Step Scaling Policy and Target Tracking Policy
  - Period of time after adding new instance when it is not counted towards aggregated metrics
  - Prevents adding or terminating too many instances

Termination Policy

How AWS decides which instance to terminate on scale-in
Firstly - always try to balance AZ (choose random AZ if all have the same instance count)
Secondly
- Default (OCR)
  - OldestLaunchConfiguration
  - ClosestToNextInstanceHour
  - Random
- Custom
  - OldestInstance
  - NewestInstance
  - OldestLaunchConfiguration
  - ClosestToNextInstanceHour
Multiple policies can be associated with ASG
- e.g. ("OldestLaunchConfiguration", "NewestInstance", "Default")
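
The default selection order can be sketched as a series of filters. This is a simplified model, not the actual Auto Scaling implementation: the instance fields (`id`, `az`, `lc_created`, `minutes_running`) are hypothetical, and the list of instances is assumed to be non-empty.

```python
import random
from collections import Counter


def pick_instance_to_terminate(instances, rng=random):
    """Apply the default order: AZ balance, oldest launch configuration,
    closest to the next instance hour, then random."""
    # 1. Balance AZs: take the AZ with the most instances (random on ties).
    counts = Counter(i["az"] for i in instances)
    top = max(counts.values())
    az = rng.choice(sorted(a for a, n in counts.items() if n == top))
    pool = [i for i in instances if i["az"] == az]

    # 2. OldestLaunchConfiguration (smaller lc_created means older).
    oldest = min(i["lc_created"] for i in pool)
    pool = [i for i in pool if i["lc_created"] == oldest]

    # 3. ClosestToNextInstanceHour: furthest into the current billing hour.
    closest = max(i["minutes_running"] % 60 for i in pool)
    pool = [i for i in pool if i["minutes_running"] % 60 == closest]

    # 4. Random among whatever is left.
    return rng.choice(pool)["id"]


fleet = [
    {"id": "i-a", "az": "a", "lc_created": 1, "minutes_running": 10},
    {"id": "i-b", "az": "a", "lc_created": 1, "minutes_running": 55},
    {"id": "i-c", "az": "a", "lc_created": 2, "minutes_running": 59},
    {"id": "i-d", "az": "b", "lc_created": 1, "minutes_running": 58},
]
print(pick_instance_to_terminate(fleet))  # i-b
```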

Auto Scaling Processes

Independent processes (workers) that perform state transitions
Can be individually suspended/resumed (e.g. for debugging)
- Administrative suspension
  - All processes in the group are suspended
  - When fail to launch instance for 24h
  - Can be resumed
Types
- Launch - add new instances to the group
- Terminate - removes instances from the group
- Healthcheck - checks the health status
- ReplaceUnhealthy
  - Uses: Healthcheck, Terminate, Launch
- AZRebalance
  - Balance instance count between AZ
    - When AZ is removed from a group
    - AZ is failing or has recovered
    - Instance is explicitly terminated
  - Uses: Launch (before termination)
    - Unlike Healthcheck Replacement that kills the instance first
- AlarmNotification
  - Accepts and reacts on CW Alarms associated with a group
  - Required for executing policies based on ALARM triggers
- ScheduledActions
  - Performs scheduled actions
- AddToLoadBalancer
  - adds launched instances to ELB

CloudWatch metrics

AutoScaling maintains aggregated instance metrics for all instances in the group (e.g. CpuUtilization)
- Identical to EC2 but dimension is ASG (not instanceId)
Auto Scaling Metrics
- GroupMinSize
- GroupMaxSize
- GroupDesiredSize
- GroupInService
- ...

Spot Instances

Can be used with ASG
Require separate Launch Configuration
- Specify bid price
  - Cannot be modified as Launch Configuration is immutable
- Cannot mix on-demand and spot
When spot instance is interrupted AS tries to launch replacement

Wednesday, 21 February 2018

AWS Glacier

Model

Vault
- name may be the same across regions (unique per region)
- container for archives
- analogous to S3 bucket
- max 1000 per region
Archive
- base unit storage in Glacier
- immutable (create/delete only)
- can be any data (photo, document, etc.)
  - best practice: aggregate data into .zip or .tar
- 32 kB metadata overhead
  - Recommended >= 1MB per object
- max 40TB per archive
- Upload
  - Single max 4GB
  - Recommended multi-part for > 100 MB
    - Compute and supply tree-hash
      - Hash for each megabyte segment and combine in tree fashion
Inventory
- Updated once per day
- List of all archives
- Inventory date not changed if no add/delete of archives
- Format: CSV or JSON
- A similar concept now exists for S3 (S3 Inventory)

Jobs

Executed asynchronously (Job ID returned)
Associated with vault
- Multiple jobs may be in-progress
When it completes user can download the output (available for 24h)
Types
- Archive Retrieval
  - entire archive or a byte range of it
- Inventory Retrieval (list of archives)
  - filter can be applied (e.g. archive creation date)
May have SNS notifications enabled

Upload (Tree Hash)

On upload include 2 headers
- x-amz-content-sha256
  - hash of entire payload used for signature calculation
- x-amz-sha256-tree-hash
  - specific to archive upload
  - main benefit - avoids re-reading a (potentially big) file to calculate its hash
    - it's computed piece-meal
  - for each chunk of 1MB compute hash (last may be < 1 MB)
    - build the next level of tree (compute hash again)
      - repeat until you reach top (root)
  - Examples
    - Single request (6.5 MB)
      - 1 request (SHA256 computed 13 times)
    - Multi-part
      - 2 requests, each carrying the tree hash of its own part
      - Complete Multipart Upload (tree hash of entire archive)
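
The tree-hash procedure described above can be sketched with the standard library alone. `tree_hash` is a hypothetical helper that returns the root hash together with the number of SHA-256 computations, which makes it possible to check the "13 times" figure for a 6.5 MB payload; it assumes 1 MiB chunks and that an unpaired node is carried up to the next level unchanged.

```python
import hashlib

MIB = 1024 * 1024


def tree_hash(data, chunk_size=MIB):
    """Return (root hash as hex, number of SHA-256 computations)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    # Leaf level: one hash per chunk (the last chunk may be shorter).
    level = [hashlib.sha256(chunk).digest() for chunk in chunks]
    calls = len(level)
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                # Parent = hash of the two concatenated child digests.
                parents.append(hashlib.sha256(level[i] + level[i + 1]).digest())
                calls += 1
            else:
                # Odd node out is promoted without rehashing.
                parents.append(level[i])
        level = parents
    return level[0].hex(), calls


root, calls = tree_hash(b"\0" * (13 * MIB // 2))  # 6.5 MiB payload
print(calls)  # 13 (7 leaves + 3 + 2 + 1)
```

For a payload of at most one chunk the tree hash is simply the SHA-256 of the data.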

Vault Access Policy

Resource-based policy
- similar to bucket policy

Vault Lock Policy

Similar to vault access policy
Enforce compliance requirements
- e.g. WORM (Write Once Read Many)
Once policy is locked it cannot be edited
- Stronger control than vault access policy
Use case
- time-based data retention rules (deny deletes) but allow read access
  - Combine vault lock policy (deny delete) and vault access policy (read)
- Compliance
Process
- Initiate lock
  - Sets to IN_PROGRESS and returns LockId
  - Validate and test your policy
  - 24 hours timeout (abort)
- Complete the lock process
Policy elements
- Resource (vault)
- Conditions
  - glacier:ArchiveAgeInDays, glacier:ResourceTag
- Action

Pricing

Storage
- ~20% of S3 Standard
- ~50% of S3 IA
Depends on Access Frequency
- Bulk 5-12h - cheapest
- Standard 3-5h
- Expedited 1-5 minutes
  - Up to 250MB objects
    - Larger take linearly longer
  - Provisioned Capacity Unit available

Glacier Select

Filtering on Glacier side
Similar to S3 Select
- Pattern matching
- Auditing
- Data integration
Allows retrieving a subset of an archive

Integrated with S3 (storage class)
Lifecycle configuration can transition objects from S3 storage classes to the Glacier storage class (reading them back requires a restore)

AWS Snowball (Edge)

Overview

Evolution of original Snowball device

Main Differences

Form factor
Bigger capacity: 100 TB
Better network connectivity:
- 10 or 25 Gb SFP
- 40 Gb QSFP+
- 3G
- WiFi

Clustering

Multiple Snowballs constitute a cluster
Treated as Network Attached Storage (NAS)
Cannot be used to import/export data
- on-premise storage only
Redundancy
- 45TB/100TB per node usable
- Survives
  - 1 node crash (normal operation)
  - 2 node crash (read-only mode)

File interface

Access via NFS
- Preserves file metadata as object attributes
Stored files can be accessed via AWS Storage Gateway
Cannot be accessed via EFS

Storage Endpoints

S3
- subset of REST: LIST, GET, PUT, DELETE, HEAD, MultiPart upload
NFS

Local processing

Embedded Greengrass Core (IoT) to run Lambda
Lambda function pre-deployed on device
- Python
- Max 128 MB RAM
Triggered on PutObject
Use cases
- filter, clean, analyze, track data

References

https://aws.amazon.com/snowball-edge/faqs/

AWS Snowball

Overview

Transferring large amount of data offline
- Import data to S3
- Export data from S3
"Never underestimate the bandwidth of a truck full of hard drives".
Devices "snowballs" can be combined together to create a really big "SNOWBALL"
Transferring 50TB on 150 Mbps link (50% utilization) takes roughly 62 days
Management API
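
The transfer-time estimate is simple arithmetic and can be reproduced with a few lines. `transfer_days` is a hypothetical helper and assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits/s); with binary units the result is a few days higher, but either way it comes out at roughly two months.

```python
def transfer_days(terabytes, link_mbps, utilization=1.0):
    """Days needed to push `terabytes` over a link at the given utilization."""
    bits = terabytes * 10**12 * 8
    effective_bps = link_mbps * 10**6 * utilization
    seconds = bits / effective_bps
    return seconds / 86400


print(round(transfer_days(50, 150, 0.5), 1))  # 61.7
```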

Device

50/80 TB
Tamper proof
All data encrypted
Network adapters (SFP+, RJ45) @10GbE
e-ink for display

Snowball client

Runs on-premise
Used to unlock device
Transfers data onto device
Supports HDFS
Works best if OS supports AES-NI

S3 Adapter

Exposes S3 compatible endpoint
Existing tools can just point to Snowball IP

Job (import)

workflow for handling data import
- Create job
- Snowball shipped to customer
- Download client
- Plug device to the network/machine
  - place deep in the stack to prevent bottlenecks
- Client uploads data onto device
  - Encrypts using KMS
  - Parallel upload
- Ship device back to AWS
- AWS team plugs in to their network and uploads to S3 (in future EBS)

References

https://aws.amazon.com/blogs/aws/aws-snowball-edge-more-storage-local-endpoints-lambda-functions/

AWS EFS

Overview

Network File System
Shared between instances
NFS v4.0/4.1 compatible
- Alternative standard is CIFS (SMB extension) on Windows
- Both are "on-the-wire" protocols (i.e. data serialization)
Distributed across many servers
- Aggregate I/O throughput (10 GB/s+)
- Size scales to 1 PB+

Filesystem (max 10 per AWS account)

uses fopen/fclose/fwrite (POSIX compliant)
shared access to file
modifications in situ ("in place")
Alternatives
- EBS which is block store (lower level)
- S3 object store (uses GETs/PUTs) - (higher level)

File ingest

EFS File Sync
- Tool to copy data in parallel
- 5x faster than standard linux tools
- Supported by AWS console
Command line tools
- rsync - single-threaded, very chatty
- cp - single-threaded, faster than rsync
- GNU parallel - shell tool to run command in parallel
- mcp - multi-threaded, drop-in replacement for cp, developed by NASA
- fpart - multi-threaded rsync
- s3cp + GNU parallel (to copy S3 -> EFS)

Mount Target

Endpoint for connecting to EFS
Each AZ has its own endpoint
- When multiple subnets in AZ use arbitrary subnet
  - IP Address assigned from the subnet
- DNS assigned automatically
- Avoids inter-AZ traffic (paid)
Has Security Group
Mounting
- Manually: mount -t nfs4 DNS "mount-point"
- On reboot: fstab (nfs defaults auto 0 0)
- On launch: cloud-init

Use cases

Oracle
SAP
Legacy applications
WordPress
JIRA storage
Shared or clustered databases
Shared dataset when you want to modify files in situ
Overflow
DR

Security

Initially 755 root root
UID and GID are used (not user names)
- Turn off Id Mapper
No identity authentication (anybody can claim to be root)
Permissions are cached
chown_restricted
- "giving away" files not permitted
- root can change owner
- root/owner can change owning group
  - if the owner changes it, they must also be a member of the target group
No root squashing
- remote "root" is also a "root" on the EFS (i.e. can change file ownership)
  - i.e. no way to isolate data from 2 EC2 instances
Uses Security Groups (TCP 2049)
Access from On-premise possible
- DX
- 3rd party VPN (but not VGW)

Sizing

Each object
- Metadata: 2KiB
- Data: increments of 4KiB
Metered information may not be real time
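
A quick sketch of the per-object metering rule above (2 KiB of metadata plus data rounded up to 4 KiB increments). `metered_size` is a hypothetical helper, not an EFS API.

```python
KIB = 1024


def metered_size(data_bytes):
    """Metered bytes for one object: 2 KiB metadata + data in 4 KiB steps."""
    blocks = -(-data_bytes // (4 * KIB))  # ceiling division
    return 2 * KIB + blocks * 4 * KIB


print(metered_size(1))     # 6144  (2 KiB + one 4 KiB block)
print(metered_size(4097))  # 10240 (2 KiB + two 4 KiB blocks)
```

Many tiny files are therefore metered at several times their raw size.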

Performance

SSD based
Parallelizable (like S3)
Modes
- General purpose
  - Low Latency
  - Limited throughput (max 7K ops/sec)
- Max I/O
  - Large scale and data-heavy applications
  - Higher latency per ops
  - The higher IO size the higher the throughput
Burstable Throughput
- Minimum burst: 100 MiB/s
- Burst of 100 MiB/s per TiB of storage (e.g. 10 TiB can burst to 10 * 100 = 1000 MiB/s)
- Credit earned: 50 MiB/s per 1 TiB
  - Cap: can burst max 12h / day
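
The burst figures above fit together as follows; this is a back-of-the-envelope sketch using only the numbers in these notes, and `efs_throughput` is a hypothetical helper. The hours-per-day value is the time for which the credits earned over a full day can sustain the burst rate.

```python
def efs_throughput(size_tib):
    """Return (baseline MiB/s, burst MiB/s, sustainable burst hours per day)."""
    baseline = 50.0 * size_tib             # credits earned: 50 MiB/s per TiB
    burst = max(100.0, 100.0 * size_tib)   # every file system can burst to 100 MiB/s
    hours = 24.0 * baseline / burst
    return baseline, burst, hours


print(efs_throughput(10))   # (500.0, 1000.0, 12.0)
print(efs_throughput(0.5))  # (25.0, 100.0, 6.0)
```

At 1 TiB and above the ratio is always one half, which is where the 12 hours per day cap comes from; smaller file systems can burst for proportionally less time.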

Backup

Must be deployed by customer
CloudFormation template available (Lambda, SNS, EC2, DynamoDB, S3)

Legacy Approaches

Linux
- Use storage optimized instances in RAID0 array
- DRBD
  - Replicate blocks between AZs sync
  - Replicate blocks to EBS async -> Snapshot
- For really large stores use GlusterFS: 2PB
  - On Windows DNS round-robin required as there is no client
  - Slow for small files
  - Native x64 Linux client recommended
    - Linux client gets the list of all servers and load balances itself
      - similar to AWS Memcached auto-discovery client

Windows
- DFS (Distributed File System) got improved in Windows 2012
- Samba Client writes/reads synchronously
  - Must be SMB v3 (which means Windows 2012 must be used everywhere)

Monday, 19 February 2018

AWS Amazon MQ

Overview

AWS-managed version of Apache ActiveMQ
Access to ActiveMQ console
Use case
- Migration of existing application (typically enterprise)
Alternatives
- SQS

Similar model to RDS
- Active Broker
- Standby Broker
- Failover automatic

Apache ActiveMQ

Features
- Persistent and Transient messages
- Local and distributed transactions (XA)
- Queues & Topics
- Unlimited message size/retention
Protocols
- JMS
- MQTT
- AMQP
- NMS
- STOMP
- WebSocket

References

https://www.youtube.com/watch?v=dCucC1SKkvI

AWS SQS (FIFO)

Overview

Type of SQS queue (compare: Standard)
Stronger guarantees on ordering and exactly-once processing
- works under specific conditions
- exactly-once delivery is generally impossible in distributed systems

Ordering

MessageGroupId parameter applied on SendMessage*
- Partitions the messages into multiple groups (each group preserves ordering)
Sender
- To be meaningful each sender should be single-thread and use distinct MessageGroupId
Receiver
- ReceiveMessage may return messages from multiple distinct MessageGroupIds
  - No control over which ones are returned
  - These MessageGroupIds become "blocked"
    - Until their messages are deleted (acknowledged), other receivers are "partially blocked"

Rate Limit

Unlike Standard, FIFO imposes rate limits
Max 300 requests/second

Send Deduplication

SQS drops duplicate messages (from sender)
Types
- ContentBasedDeduplication
  - SQS calculates SHA256 over message content (not attributes)
- MessageDeduplicationId
  - Explicitly passed by Sender
  - Overrides ContentBasedDeduplication
Detection for 5 minutes
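
To make the deduplication rules concrete, here is a toy in-memory model (not the SQS API). `FifoDedupModel` is hypothetical; it assumes the deduplication id is either passed explicitly or derived as SHA-256 of the body, and that a repeat of the same id within 5 minutes is dropped.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 5 * 60


class FifoDedupModel:
    """Toy model of FIFO send deduplication."""

    def __init__(self):
        self._seen = {}  # deduplication id -> time it was last accepted

    def send(self, body, now, dedup_id=None):
        """Return True if the message is accepted, False if dropped."""
        # An explicit MessageDeduplicationId overrides the content hash.
        key = dedup_id or hashlib.sha256(body.encode("utf-8")).hexdigest()
        last = self._seen.get(key)
        if last is not None and now - last < DEDUP_WINDOW_SECONDS:
            return False
        self._seen[key] = now
        return True


queue = FifoDedupModel()
print(queue.send("order-1", now=0))    # True
print(queue.send("order-1", now=10))   # False (duplicate inside the window)
print(queue.send("order-1", now=300))  # True  (window has elapsed)
```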

Receive Deduplication

ReceiveRequestAttemptId
- Can be passed in ReceiveMessage
- Useful when processing crashed and we retry
  - Allows skipping the normal VisibilityTimeout

References

https://aws.amazon.com/blogs/developer/how-the-amazon-sqs-fifo-api-works/

AWS SQS

Model

Queue - identified by url
- e.g. http://sqs.us-east-1.amazonaws.com/123456789012/queue2
- if idle may get deleted after 30 days
Message
- Max: 256KB of text. Larger can be managed via S3
  - SQS message is a pointer to S3 object
- Max 10 messages in single request (Send/Receive)
- Has uniquely assigned MessageId (max 100 characters)
- MD5 of the message is returned on Receive
- Receipt handle
  - Returned when a message is received
    - If received many times: latest handle is valid
  - Needed to delete a message
  - Max 1024 characters
- Retention Period
  - Default: 4d
  - Range: 60s-14d
Visibility Timeout
- After receive the message remains in the queue but is "invisible" for others to receive
- Prevents multiple consumers processing the same message (i.e. reservation)
- Invisible = In Flight
- Timeout can be updated per queue/per message
- VisibilityTimeout=0: "I do not want to process it, just peeking"
- Default: 30s
- Range: 0s-12h
Types
- Standard
  - Ordering: NOT guaranteed (message can arrive in any order)
    - Producer can include sequence# and consumer can reorder itself (like TCP does)
  - At-least-once delivery
    - Possible to get duplicates as messages stored on multiple servers and receive/delete may not reach all of them
    - Processing should be idempotent
- FIFO
  - See SQS(FIFO)

Message Attributes

Sent along with message but separate from message body
Max 10 attributes per message
Can be used for structured metadata (timestamp, geospatial data, identifiers)
Structure
- Name
- Type - String, Number, Binary
  - CustomType (e.g. Binary.gif, Binary.jpeg, Number.float) - type trait
- Value

Pricing

Requests
- 1 request = max 64KB chunk so 256 kB message = 4 requests
Data transfer
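
The 64 KB chunk rule translates into a one-line calculation. `billed_requests` is a hypothetical helper that assumes 64 KB means 64 * 1024 bytes and that even an empty call counts as one request.

```python
CHUNK_BYTES = 64 * 1024


def billed_requests(payload_bytes):
    """Number of requests a single API call is billed as."""
    return max(1, -(-payload_bytes // CHUNK_BYTES))  # ceiling division


print(billed_requests(256 * 1024))  # 4
print(billed_requests(1))           # 1
```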

Polling

Short (standard)
- Sample random servers (e.g. A,B)
- May not retrieve messages even if they exist (e.g. on C)
- WaitTimeSeconds = 0 or queue attribute ReceiveMessageWaitTimeSeconds = 0
Long
- wait until message is available or request times-out
- checks ALL the servers (A,B,C) unlike "Short polling"
- WaitTimeSeconds: 1-20 has priority over ReceiveMessageWaitTimeSeconds
- can be set on:
  - ReceiveMessage
  - CreateQueue
  - SetQueueAttributes
- Default: 0
- Range: 0-20s
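
A consumer built on long polling typically looks like the sketch below. `drain_once` and `handle` are hypothetical names; `sqs` is expected to behave like `boto3.client("sqs")`, whose `receive_message` and `delete_message` calls are used with their documented parameters.

```python
def drain_once(sqs, queue_url, handle, wait_seconds=20):
    """Long-poll once, process each message, then delete it.

    Returns the number of messages processed.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,        # per-request maximum
        WaitTimeSeconds=wait_seconds,  # 1-20 enables long polling
    )
    processed = 0
    # The "Messages" key is absent when the poll times out empty.
    for message in response.get("Messages", []):
        handle(message["Body"])
        # Delete only after successful processing; otherwise the message
        # becomes visible again once its visibility timeout expires.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        processed += 1
    return processed
```

Passing `wait_seconds=0` turns this back into short polling, which may return nothing even when messages exist.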

Delivery Delay

Default: 0
Range: 0-15min
Delayed Queue
Message Timer

Batching

Reduces cost (pricing based on requests not individual message)
SendMessageBatch
DeleteMessageBatch
ReceiveMessage already processes up to 10 messages (no batch counterpart)
AmazonSQSBufferedAsyncClient in Java

Dead Letter Queue (DLQ)

Enable Redrive Policy
- Target queue ARN
- Configure maximum number of receives before message is sent to DLQ
Retention based on original creation date
- DLQ should typically have longer retention
Requires separate consumer process for this queue
Allows isolating failed messages - "poison pills"
- Delete never happened for them
AWS Console "peek" counts as Receive
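
The redrive policy is passed to the source queue as a JSON string attribute. A minimal sketch, where `redrive_attributes` is a hypothetical helper and the ARN is a placeholder:

```python
import json


def redrive_attributes(dlq_arn, max_receives):
    """Build the queue attribute that enables a dead letter queue."""
    policy = {
        "deadLetterTargetArn": dlq_arn,
        # Number of receives after which a message is moved to the DLQ.
        "maxReceiveCount": str(max_receives),
    }
    return {"RedrivePolicy": json.dumps(policy)}


attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:my-dlq", 5)
print(attrs["RedrivePolicy"])
```

With boto3 the result would be passed as `Attributes` to `set_queue_attributes` on the source queue.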

SNS Integration

Topic subscription
Fan-outs
- Image uploaded event sent to SNS
  - SQS: generate thumbnail
  - SQS: image recognition
  - SQS: indexing

Encryption

Stored in encrypted form on SQS Servers
- Encrypted
  - Message body
- Not Encrypted
  - Queue metadata
  - Message metadata (message Id, timestamp, attributes)
  - Per-queue metrics
SSE-KMS
- AWS-managed CMK
- Custom CMK
Data Key Reuse Period
- "Data Key" caching - configurable
- Shorter -> more expensive -> better protection
  - KMS has limit 100 TPS

Permissions

Resource level permission (similar to bucket policy)
e.g. Grant other AWS accounts access
- Also anonymous access
- Supports conditions

Saturday, 17 February 2018

Misc (Cloud Migration)

Unobvious on-premise costs

Labor
- broken disks
- patching hosts
- servers going offline
Network
- Bandwidth needs (peak/average)
Capacity/Utilization
- Cost of overprovisioning
- How much buffer capacity
  - what to do when exceeded
Availability
- Do you have DR
Power
- What is peak/average power requirements
- HVAC
- 2N redundancy (everything is duplicated)
Space
- How much can you grow

Migration Bubble

Temporarily increased costs
- Planning and assessment
- Duplicate environments
- Staff Training
- Migration Consulting
- 3rd Party Tooling
- Lease penalties
  - Due to early termination of contracts

Methodology

Evaluate
- Migration Readiness Assessment (MRA)
  - Are we ready?
  - Do we have all resources?
Plan
- MRA and Planning Engagement
Design
Migrate
- Execution
  - Discovery
    - What do I have
  - Grouping / Mapping
    - Servers into application stacks
    - Always migrate verticals
      - Horizontal move (e.g. DB first) unrealistic
  - Migration Strategy
    - See below (6R Approach)
  - Data Migration
  - Landing Zone
  - Server Migrations
  - Database Migrations
Optimize

Migration Strategy (6R Approach)

Retain (Revisit)
- Do not touch
  - e.g. mainframe that must stay
Remove
- Decommission
Rehost
- aka Lift&Shift
- Minimal changes to application
  - typically configuration only
- Storage migration needed
- e.g.
  - Import VM as-is
Replatform
- Up-version of OS
- Data Conversion
  - e.g. MySQL -> RDS (PostgreSQL)
Refactor
- Change to use Cloud native mechanisms, e.g.
  - multi-region
  - Auto-Scaling
Rearchitecture
- Complete rewrite
  - e.g. Server -> Serverless

Difficulties in migration

Coordination
Lack of testing
Server downtime during cutover

AWS CloudFront

Model

Distribution
- Web: DNS starts with "d" (download)
- RTMP: DNS starts with "s" (streaming)
Origin - place where the authority files are stored
- S3
- Custom Web Server
Behavior
- How CF behaves when receives request
- Path pattern - specifies requests the behaviors applies to
- Examples
  - Forward Headers/Cookies
  - Minimum/Default/Maximum TTL
  - Restrict Viewer Access (signed Urls only)
Integrated with WAF
Supports HTTP/2
Supports IPv6
- IP-restricted signed URLs are not recommended with IPv6 enabled, as a viewer may arrive over either IPv4 or IPv6

Lambda@Edge

Ability to run lambda on CF
Events
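- Viewer Request (after CloudFront receives the request from the viewer)
- Origin Request (before CloudFront forwards the request to the origin)
- Origin Response (after CloudFront receives the response from the origin)
- Viewer Response (before CloudFront returns the response to the viewer)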
Use cases
- Inspect/rewrite cookies and URLs
- Support legacy urls
- Make HTTP requests to third parties
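
As an illustration of the "support legacy urls" use case, the sketch below rewrites the request URI in a request-triggered function. It assumes a Python runtime is available for Lambda@Edge and relies on the request living under `Records[0].cf.request` in the event; the `LEGACY_PATHS` mapping is purely hypothetical.

```python
# Hypothetical mapping of legacy paths to their current locations.
LEGACY_PATHS = {"/old-blog": "/blog", "/about.php": "/about"}


def handler(event, context):
    """Rewrite legacy URIs and hand the request back to CloudFront."""
    request = event["Records"][0]["cf"]["request"]
    # Unknown paths pass through unchanged.
    request["uri"] = LEGACY_PATHS.get(request["uri"], request["uri"])
    return request
```

Returning the request object lets CloudFront continue processing with the rewritten URI.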

Forwarding Requests

Origin does not see all request data
Forwardable
- Headers (All, Whitelisted Only)
- Cookies
- Query parameters
Forwarding allows caching different object versions based on the forwarded value
- Increases cache memory footprint
Use cases
- Prevent hotlinking
- Allow CORS for everyone
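
To illustrate why forwarding increases the cache footprint, here is a simplified model in which every forwarded value becomes part of the cache key. The function and its parameters are illustrative only, not CloudFront's actual implementation.

```python
def cache_key(path, headers, query, forwarded_headers=(), forward_query=False):
    """Build a cache key from the path plus only the values that are forwarded."""
    lowered = {name.lower(): value for name, value in headers.items()}
    parts = [path]
    for name in sorted(h.lower() for h in forwarded_headers):
        parts.append(f"{name}={lowered.get(name, '')}")
    if forward_query:
        parts.extend(f"{k}={v}" for k, v in sorted(query.items()))
    return "|".join(parts)


site_a = {"Origin": "https://a.example", "User-Agent": "x"}
site_b = {"Origin": "https://b.example", "User-Agent": "y"}

# Nothing forwarded: both requests share one cached object.
print(cache_key("/app.js", site_a, {}) == cache_key("/app.js", site_b, {}))
# Origin header forwarded: one cached variant per distinct Origin value.
print(cache_key("/app.js", site_a, {}, ["Origin"]))
print(cache_key("/app.js", site_b, {}, ["Origin"]))
```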

TTL

Obeys origin response headers (Cache-Control max-age, s-maxage, Expires)
- max-age is recommended
Behavior can override: minimum TTL, default TTL, maximum TTL
- e.g. in cases when Origin does not set it properly
TTL-0
- Used for Dynamic Content
- CloudFront still caches the content
- Makes a conditional GET (If-Modified-Since) to the origin every time
  - gives origin a chance to signal content hasn't changed
  - this saves bandwidth as Origin does not have to resend the page
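
A simplified sketch of how the behavior's TTL settings could interact with the origin's Cache-Control header: the origin value is clamped between minimum and maximum TTL, and the default TTL applies when the origin sends nothing usable. This ignores Expires and directives such as no-cache, so treat it as an approximation of the rules.

```python
def effective_ttl(cache_control, minimum_ttl, default_ttl, maximum_ttl):
    """Return the number of seconds an object would be cached under simplified rules."""
    directives = {}
    for part in (cache_control or "").split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value

    # s-maxage targets shared caches, so it takes precedence over max-age.
    for key in ("s-maxage", "max-age"):
        value = directives.get(key, "")
        if value.isdigit():
            return max(minimum_ttl, min(int(value), maximum_ttl))

    # Origin gave no usable lifetime: fall back to the behavior's default TTL.
    return max(minimum_ttl, min(default_ttl, maximum_ttl))


print(effective_ttl("max-age=60", 0, 86400, 31536000))
print(effective_ttl("max-age=10", 300, 86400, 31536000))
print(effective_ttl(None, 0, 86400, 31536000))
```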

Origin

S3
- Origin Access Identity (OAI)
  - special CF user associated with customer Distribution-Origin
    - ```
    "Principal":{"CanonicalUser":"79a59d8f8d5218e7cd47ef2be"},
```
- change S3 bucket policy to only allow OAI
Custom (customer own Web Server)
Multiple origins
- First match (based on path) wins
- Requires cache behavior for each origin
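
A minimal sketch of a bucket policy that only lets the OAI read objects, built as a Python dict for readability. The bucket name is hypothetical and the canonical user ID is the placeholder from the snippet above; the statement ID is arbitrary.

```python
import json


def oai_bucket_policy(bucket, canonical_user_id):
    """Build an S3 bucket policy granting read access to a CloudFront OAI only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCloudFrontOAIRead",
                "Effect": "Allow",
                "Principal": {"CanonicalUser": canonical_user_id},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


policy = oai_bucket_policy("example-bucket", "79a59d8f8d5218e7cd47ef2be")
print(json.dumps(policy, indent=2))
```

With no other Allow statements and public access removed, objects are then reachable only through the distribution.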

Signing

Restrict access with signed urls and/or signed cookies
- Date/Time
- IPs
- Requires: CloudFront Key-Pair
Signed Url
- Restrict access to individual files
- Query parameters: Expires, Policy, Signature, Key-Pair-Id
Signed Cookies
- Not supported for RTMP
- No need to change urls
- Restrict access to multiple files at once
  - e.g. HLS stream (multiple file segments)
- User authenticates on customer site which sets her browser's Signed Cookie
Process
- Create public-private key-pair
- Upload to account (via Console)
- Indicate which AWS accounts can sign (Trusted Signer)
- Create policy document (i.e. rules of access)
  - SHA1 of policy document signed with private key
- Include encoded policy document + signature as query string parameters
- CloudFront verifies policy/signature on access
Account Id added to
- Web - behavior (can have multiple behaviors)
- RTMP - distribution
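
The process above can be sketched as follows, using a custom policy with an expiry time and an optional IP range. The RSA-SHA1 signing step needs a crypto library, so it is passed in as a `sign` callable here (an assumption of this sketch); the domain, key pair ID and the fake signer in the usage lines are placeholders.

```python
import base64
import json

# CloudFront uses a URL-safe base64 variant: "+" -> "-", "=" -> "_", "/" -> "~".
_CF_SAFE = str.maketrans("+=/", "-_~")


def cf_b64(data: bytes) -> str:
    """Base64-encode and replace characters that are invalid in a query string."""
    return base64.b64encode(data).decode("ascii").translate(_CF_SAFE)


def build_policy(resource, expires_epoch, source_ip=None):
    """Create the policy document (rules of access) as compact JSON."""
    condition = {"DateLessThan": {"AWS:EpochTime": expires_epoch}}
    if source_ip:
        condition["IpAddress"] = {"AWS:SourceIp": source_ip}
    policy = {"Statement": [{"Resource": resource, "Condition": condition}]}
    return json.dumps(policy, separators=(",", ":"))


def signed_url(url, policy, key_pair_id, sign):
    """Append Policy, Signature and Key-Pair-Id query parameters to the url.

    `sign` must take the policy bytes and return the RSA-SHA1 signature bytes
    made with the private key of the CloudFront key pair.
    """
    policy_bytes = policy.encode("utf-8")
    separator = "&" if "?" in url else "?"
    return (
        f"{url}{separator}Policy={cf_b64(policy_bytes)}"
        f"&Signature={cf_b64(sign(policy_bytes))}"
        f"&Key-Pair-Id={key_pair_id}"
    )


# Usage with placeholder values and a fake signer (NOT a real signature).
policy = build_policy("https://d111111abcdef8.cloudfront.net/video/*", 1767225600, "203.0.113.0/24")
url = signed_url(
    "https://d111111abcdef8.cloudfront.net/video/a.ts",
    policy,
    "APKAEXAMPLE",
    sign=lambda data: b"\xfb\xff",
)
print(url)
```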

Trusted Signer

AWS account with an active CloudFront Key Pair
Key Pair allowed for root account only (not IAM user)
Max 2 active key pairs at a time
Possible to upload your own RSA key

Geoblocking

Built-in: country-level (~99.8% accuracy)
ThirdParty
- Use your webserver to build links

Compression

Supported natively
Compressed by edge locations
Compressible files 1,000 bytes - 10,000,000 bytes
ETAGs are stripped (as "compressed vs non-compressed" should have different values)
Enabled on Behavior
Custom Origin Compressions
- Still use when file type not supported by CF

Invalidation

Expensive
Supports wildcards
- e.g. "/images/hi-res/*"

SSL

Custom SSL certificates
- Dedicated IP - $600/month
- SNI - only newer browsers support them
Supports ACM certificates
Supports Redirection HTTP->HTTPS on the edge
Communication to Origin
- Match Viewer
- Enforce HTTPS
- Enforce HTTP

Pricing

Classes
- All (us, eu, ap-northeast, ap-southeast-1, ap-southeast-2, sa-east)
- 200 (us, eu, ap-northeast, ap-southeast-1)
- 100 (us, eu)
- Viewers in locations not covered in price class see larger latency
Reserved Capacity available

Can act as reverse proxy

May sit in front of dynamic website
- cache only certain portions based on rules
- works like Varnish
Header "X-Forwarded-Proto" useful to decide http vs https on origin

Header manipulation

Custom Headers can be added/overridden
Use case:
- Add X-Shared-Secret=****** to allow Origin to verify the request is from CF
- Add CORS headers (bypass user)
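
On the origin side, the shared-secret check could look roughly like this. The header name comes from the note above; the secret value is a placeholder and would normally be loaded from configuration rather than hard-coded.

```python
import hmac

EXPECTED_SECRET = "example-secret"  # placeholder; load from secure config in practice


def is_from_cloudfront(headers):
    """Return True when the request carries the secret header CloudFront adds."""
    lowered = {name.lower(): value for name, value in headers.items()}
    supplied = lowered.get("x-shared-secret", "")
    # compare_digest avoids leaking how many characters matched via timing.
    return hmac.compare_digest(supplied.encode(), EXPECTED_SECRET.encode())


print(is_from_cloudfront({"X-Shared-Secret": "example-secret"}))
print(is_from_cloudfront({"Host": "origin.example.com"}))
```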

Field-level Encryption

Ability to encrypt HTML Forms fields at the edge
Keys
- Public RSA provided by customer to CloudFront
- Private RSA secured by user (e.g. Parameter Store + KMS)
Use cases
- PCI compliance

References

https://aws.amazon.com/blogs/aws/lambdaedge-intelligent-processing-of-http-requests-at-the-edge/

AWS EC2 (VM Import/Export)

Overview

Allows importing on-premise VM images
Not integrated into AWS Console
- Must use API
Recommended to use Server Migration Service (GA: 2016)

Virtualization Platforms

VMWare
Citrix Xen
Microsoft Hyper-V

VM Image Formats

OVA - Open Virtual Appliance
- Supports multiple disks
VMDK - Stream-optimized ESX Virtual Machine Disk
- Compatible with VMWare ESX/vSphere
VHD/VHDX - fixed and dynamic image formats
- Compatible with Microsoft Hyper-V, Citrix Xen
Raw

Importing VMs

Stop source VM
Prepare source VM (disable firewall, enable remote access, etc.)
Export VM from the virtualization environment
- If using VMWare vCenter you can use AWS Connector for vCenter
- Result: image (see VM Image Formats)
Import into S3 VM catalog
- ec2-import-instance
  - starts a new task
    - check status: ec2-describe-conversion-tasks
  - once completed instance is launched in stopped state
Create AMI
- Optionally copy AMI to other regions
Limitations
- HVM only with 64-bit
- BYOL for RHEL
- Expanded image cannot exceed 1TiB
- Single ENI
- No SR-IOV (Enhanced Networking)
  - Except for Windows 2012

Importing Instance

Imports as EC2 Instance

Importing Volume

Similar to importing VM image
Imports single disk (creates EBS snapshot)

Exporting VM

Export only if you previously imported it
Create bucket with upload/delete & view permissions for AWS account

vCenter

AWS has special management plugin (AWS VM Connector)
- Standalone VM (OVA file)
- Supports monitoring, tagging
- Verifies various permissions along the way
Process
- vSphere client authorizes the import
  - AWS Management Portal for vCenter
    - Verifies permissions and returns a token
- vSphere client sends the request (+token) to the AWS Connector
- AWS Connector starts the migration
  - Returns taskId to the vSphere client

AWS Server Migration Service

Overview

Lift&Shift server migration
Supports incremental migration of live servers
Orchestrates large-scale migrations (see Migration Hub)
Tracks progress
Produces AMI
- Removes unwanted VM tools before creating the AMI
Replication schedules
Supports
- VMWare
- Microsoft Hyper-V
Alternatives
- VM Import API

Agentless

No need to install agent on each source server
Connector
- Needs to be installed on the host (hypervisor)
- Captures block-level changes

Incremental replication

Constantly hydrates data into target environment
Requires one-time snapshot
Limits network bandwidth
Frequency configurable (e.g. every 2h)
- Every run produces a separate AMI
Duration 90 days
Shortens the cut-over time
- "Last sync" only

Licensing

Auto
- Windows - AWS will provide license
  - AWS will charge for license
- Linux - BYOL
AWS
BYOL

References

https://www.youtube.com/watch?v=kxtcB21xXzM

AWS Database Migration Service

Overview

Online database migration
Creates tables, loads data, keeps it in sync
Agentless (uses Replication Instance)

Engines

Supports
- MySQL
- MariaDB
- PostgreSQL
- Oracle
- SQL Server
- Aurora
- Redshift

Model

Environments
- on-premise
- AWS (EC2, RDS)
Heterogeneous migration possible (e.g. Oracle -> MySQL)
Does not propagate most of DDL (indexes, sprocs, etc.)
- Must be replicated manually
To obtain changes it calls database's change capture API
- Oracle: LogMiner
- MySQL: binlog API
- PostgreSQL: replication slots

Endpoint

"Connection String" to the database
Required for source/target

Replication Instance

EC2 instance
Performs actual work
Similar to data pipeline
Instance Types
- C4 (e.g. dms.c4.large), 100GB gp2
- T2 (e.g. dms.t2.micro), 50GB gp2
Public IP - needed only when source/target accessible over Internet

Migration Task

Which schema/tables to migrate
Migration Type
- Full Load, Ongoing Replication
- Ongoing Replication
- Full Load
Endpoints (source/target)
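
As an illustration, the "which schema/tables" part of a task is expressed as table-mapping selection rules, and the migration type as one of three API values. The sketch below builds such a document in Python; the schema and table names are hypothetical.

```python
import json

# Console label -> value used by the DMS API for the migration type.
MIGRATION_TYPES = {
    "Full Load": "full-load",
    "Ongoing Replication": "cdc",
    "Full Load, Ongoing Replication": "full-load-and-cdc",
}


def selection_rules(schema, tables):
    """Build table mappings that include the given tables of one schema."""
    rules = []
    for rule_id, table in enumerate(tables, start=1):
        rules.append(
            {
                "rule-type": "selection",
                "rule-id": str(rule_id),
                "rule-name": str(rule_id),
                "object-locator": {"schema-name": schema, "table-name": table},
                "rule-action": "include",
            }
        )
    return {"rules": rules}


# "%" is the wildcard, so "order%" selects every table starting with "order".
mappings = selection_rules("sales", ["customers", "order%"])
print(json.dumps(mappings, indent=2))
print(MIGRATION_TYPES["Full Load, Ongoing Replication"])
```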

Encryption

Encrypts storage used by replication instance and endpoint connection
Uses KMS (aws/dms)

Use Cases

One time migration
Continuous replication (e.g. prod->test)
Move shards between servers
Moving off commercial database
Hybrid deployments
Integrate tables from 3rd party tools (e.g. PeopleSoft)

References

https://www.youtube.com/watch?v=nMTtQOtgp9g&t=2551s

AWS Migration Hub

Overview

Centralized Single pane of glass for tracking migration process
Integration with other tools
Multi-region
Provides reporting
- Views
  - Project Management
  - Engineering
- Can be extended via API

AWS Tools Integrations

AWS Server Migration Service
AWS Database Migration Service
Application Discovery Service
- Integrated (used to be stand-alone service)
- Responsibilities
  - Identify servers
  - Measure performance
    - Time-series utilization tracking
      - Helps right-size instances
  - Captures dependencies
    - Network communication between servers
- Modes
  - Agentless
    - VMware vCenter only
      - Installed on VMware host (hypervisor)
    - Easy to deploy but limited data
  - Agent
    - Installed on each server
    - Runs in user space

3rd Party Tools Integrations

CloudEndure
Atadata

References

https://www.youtube.com/watch?v=ij43Vn-2qcU

Thursday, 15 February 2018

AWS ECS (Task)

Task

Instantiated task definition
Running docker instances (containers) specified in definition
- One or more containers per task
Entire task is run on a single Container Instance
Two tasks (from the same definition) may run on different container instances

Task Definition

JSON representation of a docker run CMD
Required to run Docker container
Grouping of related containers that run together e.g.
- container: wordpress (PHP)
- container: mysql
Each container in the group lands on the same instance
If you use the same name ECS versions it and always uses the latest
Must be registered with ECS
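
A sketch of what a task definition for the wordpress + mysql grouping above might contain, written as a Python dict that serializes to the JSON document. Image names, memory/cpu values, the password and the host path are illustrative assumptions, not recommendations.

```python
import json

task_definition = {
    "family": "wordpress",  # name under which revisions are tracked
    "containerDefinitions": [
        {
            "name": "wordpress",
            "image": "wordpress",
            "memory": 256,
            "cpu": 10,
            "essential": True,
            "links": ["mysql"],  # reach the mysql container without port mappings
            "portMappings": [
                {"hostPort": 80, "containerPort": 80, "protocol": "tcp"}
            ],
        },
        {
            "name": "mysql",
            "image": "mysql",
            "memory": 256,
            "cpu": 10,
            "essential": True,
            "environment": [
                {"name": "MYSQL_ROOT_PASSWORD", "value": "change-me"}
            ],
            "mountPoints": [
                {"sourceVolume": "db-data", "containerPath": "/var/lib/mysql"}
            ],
        },
    ],
    # Volume backed by a directory on the container instance (hypothetical path).
    "volumes": [{"name": "db-data", "host": {"sourcePath": "/ecs/db-data"}}],
}

print(json.dumps(task_definition, indent=2))
```

Registering this document again under the same family would create a new revision.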

Task Definition (structure)

Family
- Name of the task definition - can have revisions
Container Definitions
- basic
  - name - can be used in links section
  - image - url to repository
  - memory - at least 4MiB
  - port mappings - bridge container host port with container port
    - hostPort -> containerPort
    - protocol
- advanced
  - cpu - sharing CPU
  - essential - whether a failed container should terminate the whole task
  - links - allows 2 containers to communicate without port mappings
  - ...
Volumes
- Data volumes to be attached to container
- Volumes are associated with container instance
- Use cases
  - persistence (container disks are ephemeral)
  - Sharing scratch area