Monday, 26 February 2018

AWS IoT Device Management


Overview
  • Tool to manage millions of IoT devices
  • Onboarding
    • templates for bulk-registration (device provisioning)
    • heterogeneous devices
      • Amazon FreeRTOS based
      • Greengrass based
  • Organization
    • Extends IoT Device Registry
    • Hierarchical model
      • Policies can be set on sub-hierarchies
      • Groups
    • Queries (fleet search)
      • e.g. device type, firmware version
  • Monitoring
    • Gathers telemetry (real-time connection, status, authentication)
  • Remote Management

Job
  • Can target specific device groups
  • Examples
    • Selective fleet update (OTA)
      • software/firmware update
    • Collect diagnostics (e.g. engine started >= 20K times)
    • Device reboot

Amazon FreeRTOS


Overview
  • Based on FreeRTOS kernel 
  • Runs on edge device (IoT)
  • Connects the device to
    • AWS Service
    • More powerful device (via Greengrass)
  • Extends via packages
    • Security library
    • Connectivity
    • Secure Updateability (can update itself)

FreeRTOS
  • Popular Real-Time Operating System 
  • Open Source
  • De-facto standard for micro-controllers
    • Texas Instruments, Microchip, NXP Semiconductors,....

AWS Greengrass

Greengrass Core
  • Runtime allowing code execution on IoT
  • Runs Lambda functions locally on the device
  • Local MQTT messaging across network
  • Secure communication with cloud
  • Device with Greengrass may act as a gateway for less powerful devices
  • Over-The-Air (OTA) Updates
  • Supported devices
    • >= 128 MB memory
    • x86/ARM (>=1GHz)
    • Example
      • RaspberryPi
  • Use cases
    • Filter/aggregate device data to only transmit necessary 
      • Reduces cost
    • Offline processing (connectivity may be slow or intermittent)
    • Can be deployed on Snowball Edge

Greengrass Group
  • Collection of
    • Greengrass Cores
    • Devices (Amazon FreeRTOS, IoT Device)
  • Things can communicate with each other



AWS Step Functions


Overview
  • Orchestration layer
  • AWS manages the state between invocations
  • Suitable for Serverless Lambda functions
  • Simplified version of Simple Workflow Service
    • Uses SWF behind the scenes
  • Uses Amazon States Language (JSON) to express state

State
  • Element of state machine (workflow)
    • Task (do some work)
    • Choice
    • Stop (Fail, Succeed)
    • Pass input->output 
    • Wait (delay timer)
    • Parallel 

Task
  • Lambda function
  • Activity
    • Performed by worker (EC2, ECS, mobile device)
    • Uses long polling

Error handling
  • Retry
    • Task and Parallel states
    • ErrorEquals (what exception to handle)
    • IntervalSeconds (delay before the first retry)
    • MaxAttempts
    • BackoffRate (how quickly interval increases)
  • Catch
    • Fallback state
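
A minimal sketch of how the Retry and Catch fields above fit into a Task state, written as a Python dict that serializes to Amazon States Language. The Lambda ARN, the state names and the `retry_delays` helper are hypothetical and only illustrate how BackoffRate multiplies the interval on each retry.

```python
import json

# Hypothetical state machine: one Task with Retry and Catch, plus a fallback state.
definition = {
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            # Placeholder ARN, not a real function.
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-order",
            "Retry": [{
                "ErrorEquals": ["States.Timeout"],
                "IntervalSeconds": 3,
                "MaxAttempts": 2,
                "BackoffRate": 1.5,
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Fallback"}],
            "End": True,
        },
        "Fallback": {"Type": "Fail", "Error": "OrderFailed", "Cause": "Retries exhausted"},
    },
}


def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Delay in seconds before each retry attempt."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


print(json.dumps(definition, indent=2))
print(retry_delays(3, 2, 1.5))
```

With these values the first retry waits 3 seconds and the second 4.5 seconds; after that the Catch clause routes execution to the fallback state.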


Saturday, 24 February 2018

AWS Storage Gateway

Overview
  • Enables hybrid storage architectures 
    • Move data to AWS for Big data / cloud bursting migration
    • Backup, archive, DR
    • Tiered storage (on-premise, cloud)
  • Uses native AWS storage
    • S3
    • EBS Snapshots
  • Efficient data transfer
    • Reduces bandwidth usage
  • Local caching

AWS Storage Gateway VM
  • Virtual appliance downloaded from AWS
  • Acts as a facade so that client applications on-premise need no change
    • Standard storage protocols
  • Installed on a host on-premise 
    • Needs VMware or Microsoft Hyper-V hypervisor
    • Possible to install on EC2
      • e.g. for PoC purpose
  • Activation
    • specify IP address, name, timezone
    • AWS region to store snapshots
    • associates gateway with AWS account
  • Must have access to disk subsystem: SAN, NAS or DAS

File Gateway
  • Exposes NFS mount target
    • NFS v3/4.1 
    • Mounts S3 as a file system
  • 1-1 mapping between S3 object and file
    • Including metadata
  • Data stored in S3 bucket and can be accessed directly
    • Fine-grained control, lifecycle, CRR, etc.
    • EFS does not provide this
    • NFS client can access any data in the bucket
      • Including created outside of File Gateway, e.g.
        • Replicated from other bucket
        • Imported via Snowball

Volume Gateway
  • Exposes iSCSI mount point (block storage)
    • Initiator: on-premise Application Server
    • Target: Storage Gateway
  • Data stored on S3 in opaque format 
    • unlike File Gateway
      • no direct access to objects
      • stored in AWS buckets 
    • compression at-rest and in-transit
  • On-premise volume can be backed up to EBS snapshot
    • Restored as EBS volume
  • Max 1 PB of volume
  • Mode
    • Stored
    • Cached

Volume Gateway (Stored)
  • Local disk: "source of truth"
    • S3: continuous asynchronous backup of on-premise volume
  • EBS snapshots can be created
    • ad-hoc
    • scheduled
  • Size
    • Max 16TB per volume (EBS volume limit)
    • Max 32 volumes (=Max 512TB)
  • Use case
    • Offsite backup
      • On restore: everything downloaded

Volume Gateway (Cached)
  • S3: "source of truth"
    • Local disk: cache of data
  • Minimize need for on-premise scaling
  • Size
    • Max 32TB per volume
    • Max 32 volumes (=Max 1PB)
  • Data stored on S3 as "Volume Storage"
    • Supports Point In Time Snapshot of data in S3 -> EBS Snapshot
      • Can be restored to "Volume Storage"
      • If < 16 TB can also be restored to EBS volume
  • Allocated on-premise storage best practices
    • Cache Storage - local cache of frequently accessed data
      • optimize performance for iSCSI
      • durable data storage on-premise
      • Allow at least 20% of entire storage volume
      • Use RAID5 or RAID6
      • When it fills up with dirty data, iSCSI writes are blocked
    • Upload Buffer  - queue to get data up to S3 (asynchronous).
      • At least 150GBs recommended
      • Optimize performance for S3 
      • When it fills up, data is uploaded directly from Cache Storage, but point-in-time snapshots cannot be taken during that time
    • Separate Cache/Buffer to different spindles
  • Dirty data - data put in Cached Volume that is not yet uploaded to S3
  • Avoid
    • Windows full format as it initializes blocks and you start paying for used storage (use quickformat instead)
    • full antivirus sweep as it ruins the cache
  • Use cases
    • Large data set but small working set 
    • Moving from Volume (Stored) to Volume (Cached)
    • Backup
      • On restore: nothing downloaded (empty cache)
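
The sizing rules of thumb above can be put into a small calculator. This is only a sketch under the assumptions stated in the notes (cache at least 20% of total volume size, upload buffer at least 150 GB, treated here as GiB); the function and constant names are made up for illustration.

```python
# Per-gateway limits quoted in the notes (volumes x size per volume, in TiB).
STORED_MAX_TIB = 32 * 16   # 512 TiB
CACHED_MAX_TIB = 32 * 32   # 1024 TiB = 1 PiB


def cached_gateway_minimums(total_volume_gib):
    """Minimum on-premise allocations for a cached gateway, per the rules of thumb above."""
    return {
        "cache_gib": total_volume_gib / 5,   # at least 20% of the total volume size
        "upload_buffer_gib": 150,            # recommended floor; GB/GiB distinction ignored
    }


print(cached_gateway_minimums(10 * 1024))  # 10 TiB of cached volumes
```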

Tape Gateway
  • Drop-in replacement for physical tape infrastructure
  • Exposes iSCSI interface
    • Media Changer
    • Tape Drive
  • Virtual Tape (VT)
    • Analogous to physical cartridge
    • Size 100 GiB-2.5 TiB
    • States
      • AVAILABLE - application may write to it
      • IN TRANSIT TO VTS - uploading data to AWS
      • ARCHIVING - upload to AWS complete. Archiving
      • ARCHIVED - in Glacier
  • Virtual Tape Library (VTL)
    • Analogous to Physical Library (with robotic arms and tape drives)
    • Many existing backup tools supported (e.g. Dell, Veritas, etc.)    
    • Max 1500 Virtual Tapes (total 150TB) in library 
      • Unlimited number in AWS
    • Drives
      • Tape Drive
        • I/O and Seek
        • Max 10
        • Responds to SCSI commands
      • Media Changer  (robotic arm)
        • Max 1
        • "Inserts" Virtual Tape into Tape Drive
  • Virtual Tape Shelf (VTS)
    • Analogous to off-site tape holding facility
    • When backup software ejects the tape it is moved to VTS
    • Backed by Glacier
  • Use case 
    • Replacing physical tape infrastructure


AWS EC2 (Auto Scaling)


Use cases
  • Scaling activity
  • Instance replacement

Auto Scaling Group    
  • Group of resources
  • Associated with either
    • Launch Configuration
    • Launch Template
  • Group size
    • min
    • max
    • desired
      • used for manual scaling and scheduled scaling
      • must satisfy min <= desired <= max
  • Networking
    • VPC and subnets
    • AZ
    • (Optional) Placement Group
      • Cannot specify multiple AZ if used
  • (Optional) ELB
  • Healthcheck
    • Type
    • Grace Period
  • Cooldown period
  • Termination Policies
  • Suspended Processes
  • Instance Protection
    • Protects from termination on scale-in

Notification
  • Sends SNS notification on following instance events
    • launch
    • terminate
    • failed to launch
    • failed to terminate

Launch Configuration
  • Template (blueprint) for instance to be launched by AS
    • AMI
    • Instance type
    • Spot Instances (Yes/No)
    • Detailed monitoring
    • Role
    • Public IP assignment
    • Additional Storage (EBS)
    • Security Groups
    • Keypair
  • Each ASG has exactly 1 (current) launch configuration
    • Cannot be edited - must be cloned
    • Updating the Launch Configuration does NOT impact existing instances
  • As of 2018 recommended to use Launch Templates

Healthcheck
  • Instance starts as healthy
  • Type 
    • EC2 (default)
      • System Status
      • Instance Status
    • ELB (optional)
      • AS Reports instance as unhealthy if ELB reports OutOfService
      • Combined with EC2 healthcheck (logical "AND")
    • Custom healthcheck 
      • Manually notify AS that an instance is healthy/unhealthy (set-instance-health)
      • Overrides the health status
      • e.g. Can be used to mark the instance as healthy when it is rebooted
  • HealthcheckGracePeriod
    • Amount of time to wait before AS starts relying on Healthcheck
      • By default assumes Healthy
    • Used when new instance is started to give it time to prepare itself
    • If lifecycle hook is attached the time starts AFTER it is completed

Healthcheck Replacement
  • When instance is marked unhealthy it is immediately marked for termination
    • e.g Stop instance
    • You can attempt to call "SetInstanceHealth=Healthy" but there is a race condition
  • Subsequently new activity to launch replacement is initiated
  • EIP and EBS volumes are NOT attached to replacement instance
    • You can handle this via User data script

Auto Scaling Instance Lifecycle 
  • Pending (instance launched/attached)
    • Pending:Wait (lifecycle hook: 1h)
    • Pending:Proceed
  • InService  (passed healthcheck)
  • EnteringStandby
  • Standby
    • Call "ExitStandby" to return to Pending
  • Terminating
    • Terminating:Wait (lifecycle hook: 1h)
    • Terminating:Proceed
  • Terminated
  • Detaching

Lifecycle hooks
  • Custom code executed when instance enters "Wait" state
    • launch - Pending:Wait, e.g.
      • Install additional software with CodeDeploy
      • Fill-up cache
    • terminate - Terminating:Wait, e.g.
      • Analyze crashed instance
      • Retrieve Logs
      • Copy data out of instance
  • AS sends notification to Notification Target:
    • Targets
      • SQS
      • SNS
      • CloudWatch Events (e.g. Lambda)
  • Code must return result CompleteLifecycleAction
    • CONTINUE
    • ABANDON   
      • On Terminating:Wait it still terminates but no other lifecycle hooks are executed
  • Timeout: default 60 minutes
    • Can be extended with RecordLifecycleActionHeartbeat
      • Max 48 hours 
  • Cooldown starts AFTER hook is completed
    • AS is frozen for a longer period of time
  • Max 50 hooks per Auto Scaling group
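
A hedged sketch of the "code must return a result" step: a handler reads the hook notification and reports CONTINUE or ABANDON. The `complete_hook` helper, the `RecordingClient` stub and the sample message are hypothetical; in real use the client would be `boto3.client("autoscaling")`, whose `complete_lifecycle_action` call takes these parameters.

```python
import json


def complete_hook(autoscaling_client, notification_body, result="CONTINUE"):
    """Report the outcome of a lifecycle action back to Auto Scaling."""
    msg = json.loads(notification_body)
    params = {
        "LifecycleHookName": msg["LifecycleHookName"],
        "AutoScalingGroupName": msg["AutoScalingGroupName"],
        "LifecycleActionToken": msg["LifecycleActionToken"],
        "LifecycleActionResult": result,  # "CONTINUE" or "ABANDON"
    }
    autoscaling_client.complete_lifecycle_action(**params)
    return params


class RecordingClient:
    """Stand-in for the real Auto Scaling client so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def complete_lifecycle_action(self, **kwargs):
        self.calls.append(kwargs)


# Sample notification body with placeholder values.
sample_body = json.dumps({
    "LifecycleHookName": "retrieve-logs",
    "AutoScalingGroupName": "example-asg",
    "LifecycleActionToken": "00000000-0000-0000-0000-000000000000",
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    "EC2InstanceId": "i-0123456789abcdef0",
})

client = RecordingClient()
print(complete_hook(client, sample_body))
```

Passing the client in as an argument keeps the handler testable without AWS credentials.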

Instance actions
  • Attach existing EC2 instance
    • Must not be part of other ASG
    • AMI must exist
    • Increases Desired Count
    • Lifecycle
      • Pending
        • Call "AttachInstances"
  • Detach EC2 instance from ASG
    • Can be used to move to different ASG
    • Lifecycle
      • Detaching
        • Call "DetachInstances":
      • Detached
    • Specify if you want to decrement "Desired"
  • Standby
    • manually take instance out of ASG
    • Instance is deregistered from ELB (if applicable)
    • No healthcheck performed 
    • Use cases
      • Update or modify instance
      • Troubleshoot
    • By default "Desired" decremented (i.e. no replacement launched)

ELB integration
  • ASG can use multiple ELBs
    • Max 50 per ASG
    • If any healthcheck fails the instance is marked unhealthy by ASG
      • even if all other ELBs consider it OK
    • Use case
      • Each ELB has a different SSL certificate associated
  • ELB points to ASG rather than specific instances inside it
  • ASG can re-use
    • ELB healthcheck     
    • Connection Draining (waits before termination)


Scaling
  • Methods
    • Manual
    • Schedule
    • Dynamic
      • Simple
      • Step
      • Target Tracking

Scaling Manually
  • Manually change the size of the ASG
    • desired count

Scaling by Schedule
  • When you can predict exact dates
  • Maximum 125 scheduled actions (4*31) per month
  • Similar to programming a room thermostat
  • Group size properties change (min, desired, max)
  • Types
    • One time - start time
    • Recurring  - cron syntax
      • There is no YEAR field

Scaling by Policy (dynamic)
  • Scaling Adjustment Types
    • ChangeInCapacity(+/- number_of_instances)
    • ExactCapacity(number_of_instances)
    • PercentChangeInCapacity(+/- percent_change_in_capacity)
      • MinAdjustmentMagnitude (minimum number of instances)
      • Rounding
        • (0,1) => 1 and (-1,0) => -1
        • (-inf,-1)u(1,+inf) => drop fraction part (cast)
  • Policy Type
    • Simple Scaling - single adjustment 
      • Supports any ALARM
      • When breach defined adjustment occurs (e.g. 3->8)
      • Cooldown supported
    • Step Scaling
      • Recommended
      • One or more steps
      • Responds to the magnitude of the Alarm (not just binary: ALARM/OK)
      • Warm-up supported
    • Target Tracking
      • Supported metrics
        • ALB Request Count per Target
        • Average CPU Utilization
        • Average Network In
        • Average Network Out
      • When breached scale-out occurs 
        • Works like thermostat ("I want average CPU Utilization to be < 50%")
      • Warm-up supported
      • Scale-in can be disabled
  • Based on CloudWatch Alarms
    • e.g. CPU Utilization, ELB Latency, ELB RequestCount, SQSNumberOfMessagesVisible
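
The PercentChangeInCapacity rounding can be sketched as below. This assumes fractions between 0 and 1 become 1, fractions between 0 and -1 become -1, and larger values simply lose their fractional part; the function name is hypothetical.

```python
import math


def percent_change_adjustment(current_capacity, percent, min_adjustment_magnitude=0):
    """Instances to add (positive) or remove (negative) for PercentChangeInCapacity."""
    raw = current_capacity * percent / 100.0
    if raw == 0:
        return 0
    if abs(raw) < 1:
        change = 1 if raw > 0 else -1   # a fraction of an instance still moves by one
    else:
        change = math.trunc(raw)        # otherwise drop the fractional part
    if abs(change) < min_adjustment_magnitude:
        # Enforce the minimum step size while keeping the direction.
        change = int(math.copysign(min_adjustment_magnitude, change))
    return change


print(percent_change_adjustment(10, 25))    # 2.5 instances requested
print(percent_change_adjustment(4, 10))     # 0.4 instances requested
print(percent_change_adjustment(10, -25))   # -2.5 instances requested
```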

Oscillations
  • Adding/removing instances changes the state of the system
    • This may cause oscillation behavior
  • In order to damp oscillations two mechanisms are provided
    • Cooldown
      • Period of time to wait before another scaling action
      • How long to wait before previous action gives result
      • Damps oscillations
      • Supported for Simple Scaling Policy
      • Locks the entire ASG
    • Warm-up
      • Supported by Step Scaling Policy and Target Tracking Policy 
      • Period of time after adding new instance when it is not counted towards aggregated metrics
      • Prevents adding or terminating too many instances

Termination Policy
  • How AWS decides which instance to terminate on scale-in
  • Firstly - always try to balance AZ (choose random AZ if all have the same instance count)
  • Secondly 
    •  Default (OCR)
      • OldestLaunchConfiguration
      • ClosestToNextInstanceHour
      • Random
    • Custom
      • OldestInstance
      • NewestInstance
      • OldestLaunchConfiguration
      • ClosestToNextInstanceHour
  • Multiple policies can be associated with ASG
    • e.g. ("OldestLaunchConfiguration", "NewestInstance", "Default")
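
A simplified model of the default termination policy described above: balance AZs first, then apply OldestLaunchConfiguration, ClosestToNextInstanceHour and finally Random. The instance dict keys and the function name are hypothetical, and tied AZs are pooled together rather than one being picked at random, so treat this as an approximation of the real behaviour.

```python
import random
from collections import Counter


def pick_instance_to_terminate(instances, now, rng=random):
    """instances: dicts with 'id', 'az', 'lc_created', 'launched' (epoch seconds)."""
    # 1. Balance AZs: only consider the AZ(s) holding the most instances.
    per_az = Counter(i["az"] for i in instances)
    busiest = max(per_az.values())
    pool = [i for i in instances if per_az[i["az"]] == busiest]

    # 2. OldestLaunchConfiguration.
    oldest_lc = min(i["lc_created"] for i in pool)
    pool = [i for i in pool if i["lc_created"] == oldest_lc]

    # 3. ClosestToNextInstanceHour: largest elapsed part of the current hour.
    def into_hour(i):
        return (now - i["launched"]) % 3600

    closest = max(into_hour(i) for i in pool)
    pool = [i for i in pool if into_hour(i) == closest]

    # 4. Random among whatever is still tied.
    return rng.choice(pool)


fleet = [
    {"id": "a", "az": "az-1", "lc_created": 1, "launched": 0},
    {"id": "b", "az": "az-1", "lc_created": 2, "launched": 0},
    {"id": "c", "az": "az-2", "lc_created": 1, "launched": 0},
]
print(pick_instance_to_terminate(fleet, now=1000)["id"])
```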

Auto Scaling Processes
  • Independent processes (workers) that perform state transitions 
  • Can be individually suspended/resumed (e.g. for debugging)
    • Administrative suspension
      • All processes in the group are suspended
      • When the group fails to launch instances for 24h
      • Can be resumed
  • Types
    • Launch - add new instances to the group
    • Terminate - removes instances from the group
    • Healthcheck - checks the health status
    • ReplaceUnhealthy
      • Uses: Healthcheck, Terminate, Launch
    • AZRebalance
      • Balance instance count between AZ
        • When AZ is removed from a group
        • AZ is failing or has recovered
        • Instance is explicitly terminated
      • Uses: Launch (before termination)
        • Unlike Healthcheck Replacement that kills the instance first
    • AlarmNotification
      • Accepts and reacts on CW Alarms associated with a group
      • Required for executing policies based on ALARM triggers
    • ScheduledActions
      • Performs scheduled actions
    • AddToLoadBalancer
      • adds launched instances to ELB

CloudWatch metrics
  • Auto Scaling maintains aggregated instance metrics for all instances in the group (e.g. CPUUtilization)
    • Identical to EC2 but dimension is ASG (not instanceId)
  • Auto Scaling Metrics
    • GroupMinSize
    • GroupMaxSize
    • GroupDesiredCapacity
    • GroupInServiceInstances
    • ...

Spot Instances


  • Can be used with ASG
  • Require separate Launch Configuration
    • Specify bid price
      • Cannot be modified as Launch Configuration is immutable
    • Cannot mix on-demand and spot
  • When spot instance is interrupted AS tries to launch replacement

Wednesday, 21 February 2018

AWS Glacier


Model
  • Vault
    • name may be the same across regions (unique per region)
    • container for archives
    • analogous to S3 bucket
    • max 1000 per region
  • Archive
    • base unit storage in Glacier
    • immutable (create/delete only)
    • can be any data (photo, document, etc.)
      • best practice: aggregate data into .zip or .tar
    • 32 kB metadata overhead
      • Recommended >= 1MB per object
    • max 40TB per archive
    • Upload
      • Single max 4GB
      • Recommended multi-part for > 100 MB
        • Compute and supply tree-hash
          • Hash for each megabyte segment and combine in tree fashion
  • Inventory
    • Updated once per day
    • List of all archives
    • Inventory date not changed if no add/delete of archives
    • Format: CSV or JSON
    • A similar concept now exists for S3 (S3 Inventory)

Jobs
  • Executed asynchronously (Job ID returned)
  • Associated with vault
    • Multiple jobs may be in-progress
  • When it completes user can download the output (available for 24h)
  • Types
    • Archive Retrieval
      • entire archive or a byte range of the archive
    • Inventory Retrieval (list of archives)
      • filter can be applied (e.g. archive creation date)
  • May have SNS notifications enabled

Upload (Tree Hash)
  • On upload include 2 headers
    • x-amz-content-sha256
      • hash of entire payload used for signature calculation
    • x-amz-sha256-tree-hash
      • specific to archive upload
      • main benefit - avoids re-reading a (potentially big) file to calculate its hash
        • it's computed piece-meal
      • for each chunk of 1MB compute hash (last may be < 1 MB)
        • build the next level of tree (compute hash again)
          • repeat until you reach top (root)
      • Examples
        • Single request (6.5 MB)
          • 1 request (SHA256 computed 13 times)
        • Multi-part
          • 2 requests each has hash-tree of corresponding parts
          • Complete Multipart Upload (tree hash of entire archive)
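
The tree-hash procedure above can be sketched with the standard library alone. The function name is made up; the second return value counts SHA-256 computations so the 6.5 MB example (13 hashes) can be checked. An unpaired node is assumed to be carried up to the next level unchanged.

```python
import hashlib

MIB = 1024 * 1024


def tree_hash(data):
    """Return (hex tree hash, number of SHA-256 computations) for a bytes payload."""
    # Leaf level: one SHA-256 per 1 MiB chunk (the last chunk may be shorter).
    level = [hashlib.sha256(data[i:i + MIB]).digest()
             for i in range(0, len(data), MIB)] or [hashlib.sha256(b"").digest()]
    count = len(level)
    # Combine adjacent pairs until a single root remains.
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(hashlib.sha256(level[i] + level[i + 1]).digest())
                count += 1
            else:
                next_level.append(level[i])  # odd node is carried up as-is
        level = next_level
    return level[0].hex(), count


root, hashes = tree_hash(b"\0" * int(6.5 * MIB))
print(root, hashes)
```

For a payload of at most 1 MiB the tree hash equals the plain SHA-256 of the payload, which is a convenient sanity check.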

Vault Access Policy
  • Resource-based policy
    • similar to bucket policy

Vault Lock Policy
  • Similar to vault access policy
  • Enforce compliance requirements
    • e.g. WORM (Write Once Read Many)
  • Once policy is locked it cannot be edited
    • Stronger control than vault access policy
  • Use case
    • time-based data retention rules (deny deletes) but allow read access
      • Combine vault lock policy (deny delete) and vault access policy (read)
    • Compliance
  • Process
    • Initiate lock
      • Sets to IN_PROGRESS and returns LockId
      • Validate and test your policy
      • 24 hours timeout (abort)
    • Complete the lock process
  • Policy elements
    • Resource (vault)
    • Conditions
      • glacier:ArchiveAgeInDays, glacier:ResourceTag
    • Action

Pricing
  • Storage 
    • ~20% of S3 Standard
    • ~50% of S3 IA
  • Depends on Access Frequency
    • Bulk 5-12h - cheapest
    • Standard 3-5h
    • Expedited 1-5 minutes
      • Up to 250MB objects
        • Larger take linearly longer
      • Provisioned Capacity Unit available

Glacier Select
  • Filtering on Glacier side
  • Similar to S3 Select
    • Pattern matching
    • Auditing 
    • Data integration
  • Retrieves only a subset of an object

S3


  • Integrated with S3 (storage class)
  • Lifecycle configuration can transition objects from S3 storage classes to the Glacier storage class (getting data back requires a restore)

AWS Snowball (Edge)

Overview
  • Evolution of original Snowball device

Main Differences
  • Form factor
  • Bigger capacity: 100 TB
  • Better network connectivity:
    • 10 or 25 Gb SFP
    • 40 Gb QSFP+
    • 3G
    • WiFi

Clustering
  • Multiple Snowballs constitute a cluster
  • Treated as Network Attached Storage (NAS)
  • Cannot be used to import/export data
    • on-premise storage only
  • Redundancy
    • 45TB/100TB per node usable
    • Survives
      • 1 node crash (normal operation)
      • 2 node crash (read-only mode)

File interface
  • Access via NFS
    • Preserves file metadata as object attributes
  • Stored files can be accessed via AWS Storage Gateway
  • Cannot be accessed via EFS

Storage Endpoints 
  • S3
    • subset of REST: LIST, GET, PUT, DELETE, HEAD, MultiPart upload
  • NFS

Local processing
  • Embedded Greengrass Core (IoT) to run Lambda
  • Lambda function pre-deployed on device
    • Python
    • Max 128 MB RAM
  • Triggered on PutObject
  • Use cases
    • filter, clean, analyze, track data



AWS Snowball

Overview
  • Transferring large amount of data offline
    • Import data to S3
    • Export data from S3
  • "Never underestimate the bandwidth of a truck full of hard drives".
  • Devices "snowballs" can be combined together to create a really big "SNOWBALL"
  • Transferring 50TB on 150 Mbps link (50% utilization) takes 63 days
  • Management API
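
The transfer-time estimate above is easy to reproduce. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits/s); the function name is hypothetical. The result lands close to the figure quoted above, and the small difference depends on whether TB is read as decimal or binary.

```python
def transfer_days(data_tb, link_mbps, utilization):
    """Days needed to push data_tb terabytes over a link at the given utilization."""
    bits = data_tb * 1e12 * 8
    effective_bps = link_mbps * 1e6 * utilization
    return bits / effective_bps / 86400


print(round(transfer_days(50, 150, 0.5), 1))
```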

Device
  • 50/80 TB 
  • Tamper proof
  • All data encrypted
  • Network adapters (SFP+, RJ45) @10GbE
  • e-ink for display

Snowball client
  • Runs on-premise
  • Used to unlock device
  • Transfers data onto device
  • Supports HDFS
  • Works best if OS supports AES-NI

S3 Adapter
  • Exposes S3 compatible endpoint
  • Existing tools can just point to Snowball IP

Job (import)
  • workflow for handling data import
    • Create job
    • Snowball shipped to customer
    • Download client
    • Plug device to the network/machine
      • place deep in the stack to prevent bottlenecks
    • Client uploads data onto device
      • Encrypts using KMS
      • Parallel upload
    • Ship device back to AWS 
    • AWS team plugs in to their network and uploads to S3 (in future EBS)


AWS EFS

Overview
  • Network File System
  • Shared between instances
  • NFS v4.0/4.1 compatible
    • Alternative standard is CIFS (SMB extension) on Windows
    • Both are "on-the-wire" protocols (i.e. data serialization)
  • Distributed across many servers
    • Aggregate I/O throughput (10 GB/s+)
    • Size scales to 1 PB+

Filesystem (max 10 per AWS account)
  • uses fopen/fclose/fwrite (POSIX compliant)
  • shared access to file 
  • modifications in situ ("in place")
  • Alternatives
    • EBS which is block store (lower level)
    • S3 object store (uses GETs/PUTs) - (higher level)

File ingest
  • EFS File Sync
    • Tool to copy data in parallel
    • 5x faster than standard linux tools
    • Supported by AWS console
  • Command line tools
    • rsync - single-threaded, very chatty
    • cp - single-threaded, faster than rsync
    • GNU parallel - shell tool to run command in parallel
    • mcp - multi-threaded, drop-in replacement for cp, developed by NASA
    • fpart - multi-threaded rsync
    • s3cp + GNU parallel (to copy S3 -> EFS)

Mount Target
  • Endpoint for connecting to EFS
  • Each AZ has its own endpoint
    • When multiple subnets in AZ use arbitrary subnet
      • IP Address assigned from the subnet
    • DNS assigned automatically
    • Avoids inter-AZ traffic (paid) 
  • Has Security Group
  • Mounting
    • Manually: mount -t nfs4 DNS "mount-point"
    • On reboot: fstab (nfs defaults auto 0 0)
    • On launch: cloud-init

Use cases
  • Oracle
  • SAP
  • Legacy applications
  • WordPress
  • JIRA storage
  • Shared or clustered databases
  • Shared dataset when you want to modify files in situ
  • Overflow 
  • DR

Security
  • Initially 755 root root
  • UID and GID are used (not user names)
    • Turn off Id Mapper
  • No identity authentication (anybody can claim to be root)
  • Permissions are cached
  • chown_restricted 
    • "giving away" files not permitted
    • root can change owner
    • root/owner can change owning group
      • if the owner changes it, they must also be a member of the target group
  • No root squashing
    • remote "root" is also a "root" on the EFS (i.e. can change file ownership)
      • i.e. no way to isolate data from 2 EC2 instances
  • Uses Security Groups (TCP 2049)
  • Access from On-premise possible
    • DX
    • 3rd party VPN (but not VGW)

Sizing
  • Each object
    • Metadata:  2KiB 
    • Data: increments of 4KiB
  • Metered information may not be real time
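
A small sketch of the per-object metering rule above: 2 KiB of metadata plus data rounded up to the next 4 KiB. The function name is made up, and real metering may differ in detail.

```python
KIB = 1024


def metered_size(file_bytes):
    """Approximate metered bytes for one file under the rule described above."""
    data_blocks = -(-file_bytes // (4 * KIB))   # ceiling division to whole 4 KiB blocks
    return 2 * KIB + data_blocks * 4 * KIB


for size in (0, 1, 4096, 4097):
    print(size, metered_size(size))
```

Many tiny files therefore cost noticeably more than their raw byte count suggests: a 1-byte file is metered as 6 KiB.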

Performance
  • SSD based
  • Parallelizable (like S3)
  • Modes
    • General purpose
      • Low Latency
      • Limited throughput (max 7K ops/sec)
    • Max I/O
      • Large scale and data-heavy applications
      • Higher latency per ops
      • The higher IO size the higher the throughput
  • Burstable Throughput
    • Minimum burst: 100 MiB/s
    • Burst of 100 MiB/s per TiB of storage (e.g. 10 TiB can burst to 10 * 100 = 1000 MiB/s)
    • Credit earned: 50 MiB/s per 1TiB
      • Cap: can burst max 12h / day
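
The burst figures above can be tied together in a few lines. This sketch assumes credits accrue at 50 MiB/s per TiB, the burst rate is 100 MiB/s per TiB with a floor of 100 MiB/s, and daily burst time is the ratio of the two; the function name is hypothetical.

```python
def efs_throughput(size_tib):
    """Return (baseline MiB/s, burst MiB/s, burst hours per day) for a file system."""
    baseline = 50.0 * size_tib               # credits accrue at 50 MiB/s per TiB
    burst = max(100.0, 100.0 * size_tib)     # burst floor of 100 MiB/s
    hours = 24.0 * baseline / burst          # a day's credits spent at the burst rate
    return baseline, burst, hours


print(efs_throughput(10))    # large file system
print(efs_throughput(0.5))   # small file system, limited by the 100 MiB/s floor
```

File systems of 1 TiB and above hit the 12 hours per day cap, while smaller ones can burst for proportionally less time.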

Backup
  • Must be deployed by customer
  • CloudFormation template available (Lambda, SNS, EC2, DynamoDB, S3)

Legacy Approaches
  • Linux
    • Use storage optimized instances in RAID0 array
    • DRBD
      • Replicate blocks between AZs sync
      • Replicate blocks to EBS async -> Snapshot
    • For really large stores use GlusterFS: 2PB
      • On Windows DNS round-robin required as there is no client
      • Slow for small files
      • Native x64 Linux client recommended
        • Linux client gets the list of all servers and load balances itself
          • similar to AWS Memcached auto-discovery client
  • Windows
    • DFS (Distributed File System) got improved in Windows 2012
    • Samba Client writes/reads synchronously
      • Must be SMB v3  (which means Windows 2012 must be used everywhere)


Monday, 19 February 2018

AWS Amazon MQ

Overview
  • AWS-managed version of Apache ActiveMQ
  • Access to ActiveMQ console
  • Use case
    • Migration of existing application (typically enterprise)
  • Alternatives
    • SQS

HA
  • Similar model to RDS
    • Active Broker
    • Standby Broker
    • Failover automatic

Apache ActiveMQ
  • Features
    • Persistent and Transient messages
    • Local and distributed transactions (XA)
    • Queues & Topics
    • Unlimited message size/retention
  • Protocols
    • JMS
    • MQTT
    • AMQP
    • NMS
    • STOMP
    • WebSocket



AWS SQS (FIFO)

Overview
  • Type of SQS queue (compare: Standard)
  • Stronger guarantees on ordering and exactly-once processing
    • works under specific conditions
    • exactly-once delivery is generally impossible in distributed systems

Ordering
  • MessageGroupId parameter applied on SendMessage*
    • Partitions the messages into multiple groups (each group preserves ordering)
  • Sender
    • To be meaningful each sender should be single-thread and use distinct MessageGroupId
  • Receiver
    • ReceiveMessage may return multiple distinct MessageGroupIds
      • No control which ones are returned
      • These MessageGroupIds become "blocked"
        • Until their messages are deleted (acknowledged), other Receivers cannot get further messages from those groups

Rate Limit
  • Unlike Standard, FIFO imposes rate limits
  • Max 300 requests/second

Send Deduplication 
  • SQS drops duplicate messages (from sender)
  • Types
    • ContentBasedDeduplication
      • SQS calculates SHA256 over message content (not attributes)
    • MessageDeduplicationId
      • Explicitly passed by Sender
      • Overrides ContentBasedDeduplication
  • Detection for 5 minutes
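
A toy model of send-side deduplication as described above: the key is either an explicit MessageDeduplicationId or the SHA-256 of the body, and duplicates are dropped for 5 minutes. The class is purely illustrative and says nothing about how SQS implements this internally; timestamps are passed in explicitly to keep the sketch deterministic.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 5 * 60


class FifoDedupSketch:
    """Toy model of FIFO send deduplication."""

    def __init__(self):
        self._seen = {}  # deduplication key -> time it was first accepted

    def send(self, body, now, dedup_id=None):
        """Return True if the message is accepted, False if dropped as a duplicate."""
        # An explicit MessageDeduplicationId wins over the content hash.
        key = dedup_id or hashlib.sha256(body.encode("utf-8")).hexdigest()
        first_seen = self._seen.get(key)
        if first_seen is not None and now - first_seen < DEDUP_WINDOW_SECONDS:
            return False
        self._seen[key] = now
        return True


queue = FifoDedupSketch()
print(queue.send("order-1", now=0))     # accepted
print(queue.send("order-1", now=100))   # same content inside the window
print(queue.send("order-1", now=301))   # window has passed
```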

Receive Deduplication
  • ReceiveRequestAttemptId
    • Can be passed in ReceiveMessage
    • Useful when processing crashed and we retry
      • Retrying with the same id returns the same messages without waiting for the VisibilityTimeout


AWS SQS

Model
  • Queue - identified by url
  • Message
    • Max: 256KB of text. Larger can be managed via S3
      • SQS message is a pointer to S3 object
    • Max 10 messages in single request (Send/Receive)
    • Has uniquely assigned MessageId (max 100 characters)
    • MD5 of the message is returned on Receive
    • Receipt handle
      • Returned when a message is received
        • If received many times: latest handle is valid
      • Needed to delete a message
      • Max 1024 characters
    • Retention Period
      • Default: 4d
      • Range: 60s-14d
  • Visibility Timeout
    • After receive the message remains in the queue but is "invisible" for others to receive
    • Prevents multiple consumers processing the same message (i.e. reservation)
    • Invisible = In Flight
    • Timeout can be updated per queue/per message
    • VisibilityTimeout=0: "I do not want to process it, just peeking"
    • Default: 30s
    • Range: 0s-12h
  • Types
    • Standard
      • Ordering: NOT guaranteed (message can arrive in any order)
        • Producer can include sequence# and consumer can reorder itself (like TCP does)
      • At-least-once delivery
        • Possible to get duplicates as messages stored on multiple servers and receive/delete may not reach all of them
        • Processing should be idempotent
    • FIFO
      • See SQS(FIFO)

Message Attributes
  • Sent along with message but separate from message body
  • Max 10 attributes per message
  • Can be used for structured metadata (timestamp, geospatial data, identifiers)
  • Structure
    • Name
    • Type - String, Number, Binary
      • CustomType (e.g. Binary.gif, Binary.jpeg, Number.float) - type trait
    • Value

Pricing
  • Requests
    • 1 request = max 64KB chunk so 256 kB message  = 4 requests
  • Data transfer
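
The request-based billing rule above (one billed request per 64 KB chunk) can be checked with a one-line calculation. The function name is made up and 64 KB is taken as 64 * 1024 bytes.

```python
def billed_requests(payload_bytes):
    """Number of billed requests: one per started 64 KiB chunk, minimum one."""
    chunk = 64 * 1024
    return max(1, -(-payload_bytes // chunk))   # ceiling division


print(billed_requests(256 * 1024))
print(billed_requests(1))
```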

Polling
  • Short (standard)
    • Sample random servers (e.g. A,B)
    • May not retrieve messages even if they exist (e.g. on C)
    • WaitTimeSeconds = 0 or queue attribute ReceiveMessageWaitTimeSeconds = 0
  • Long
    • wait until message is available or request times-out
    • checks ALL the servers (A,B,C) unlike "Short polling"
    • WaitTimeSeconds: 1-20 has priority over ReceiveMessageWaitTimeSeconds 
    • can be set on:
      • ReceiveMessage
      • CreateQueue
      • SetQueueAttributes
    • Default: 0
    • Range: 0-20s

Delivery Delay
  • Default: 0
  • Range: 0-15min
  • Delayed Queue
  • Message Timer

Batching
  • Reduces cost (pricing based on requests not individual message)
  • SendMessageBatch
  • DeleteMessageBatch
  • ReceiveMessage already processes up to 10 messages (no batch counterpart)
  • AmazonSQSBufferedAsyncClient in Java
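
A sketch of client-side batching: message bodies are split into groups of at most 10, shaped like the entries SendMessageBatch expects. The helper name is hypothetical; with boto3 each batch could then be passed as `Entries` to `send_message_batch`.

```python
def to_batches(bodies, batch_size=10):
    """Split message bodies into entry lists of at most batch_size messages."""
    batches = []
    for start in range(0, len(bodies), batch_size):
        chunk = bodies[start:start + batch_size]
        # Each entry needs an Id that is unique within its batch.
        batches.append([{"Id": str(i), "MessageBody": body}
                        for i, body in enumerate(chunk)])
    return batches


batches = to_batches(["message-%d" % n for n in range(25)])
print([len(batch) for batch in batches])
```

Sending 25 messages this way costs 3 requests instead of 25, as long as each batch stays within the payload limit.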

Dead Letter Queue (DLQ)
  • Enable Redrive Policy
    • Target queue ARN
    • Configure maximum number of receives before message is sent to DLQ
  • Retention based on original creation date
    • DLQ should typically have longer retention
  • Requires separate consumer process for this queue
  • Isolates failed messages - "poison pills"
    • Delete never happened for them
  • AWS Console "peek" counts as Receive
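
The redrive policy above is set as a queue attribute whose value is itself a JSON string. A minimal sketch, with a placeholder queue ARN and a hypothetical helper name:

```python
import json


def redrive_attributes(dlq_arn, max_receive_count):
    """Queue attributes that enable a redrive policy pointing at the given DLQ."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,          # target queue ARN
            "maxReceiveCount": max_receive_count,    # receives before moving to the DLQ
        })
    }


# Placeholder ARN for illustration only.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:example-dlq", 5)
print(attrs)
```

The resulting dict is the shape that would be passed as `Attributes` when creating the source queue or updating its attributes.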

SNS Integration
  • Topic subscription
  • Fan-outs
    • Image uploaded event sent to SNS
      • SQS: generate thumbnail
      • SQS: image recognition
      • SQS: indexing

Encryption
  • Stored in encrypted form on SQS Servers
    • Encrypted
      • Message body
    • Not Encrypted
      • Queue metadata
      • Message metadata (message Id, timestamp, attributes)
      • Per-queue metrics
  • SSE-KMS
    • AWS-managed CMK
    • Custom CMK
  • Data Key Reuse Period
    • "Data Key" caching - configurable
    • Shorter -> more expensive -> better protection
      • KMS has limit 100 TPS

Permissions


  • Resource level permission (similar to bucket policy)
  • e.g. Grant other AWS accounts access
    • Also anonymous access
    • Supports conditions
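
A queue policy looks much like a bucket policy. The sketch below builds one that lets another account send messages, optionally restricted by source IP. Account ID, ARN and CIDR are placeholders, and the helper name is made up for this example.

```python
import json
from typing import Optional


def cross_account_send_policy(queue_arn: str, account_id: str,
                              source_ip_cidr: Optional[str] = None) -> str:
    """Return a queue policy granting sqs:SendMessage to another account."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        "Action": "sqs:SendMessage",
        "Resource": queue_arn,
    }
    if source_ip_cidr is not None:
        # Optional condition: only accept requests from this address range.
        statement["Condition"] = {"IpAddress": {"aws:SourceIp": source_ip_cidr}}
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]})


print(cross_account_send_policy(
    "arn:aws:sqs:us-east-1:123456789012:orders", "210987654321", "203.0.113.0/24"))
```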

Saturday, 17 February 2018

Misc (Cloud Migration)


Unobvious on-premise costs
  • Labor
    • broken disks
    • patching hosts
    • servers going offline
  • Network 
    • Bandwidth needs (peak/average)
  • Capacity/Utilization
    • Cost of overprovisioning
    • How much buffer capacity
      • what to do when exceeded
  • Availability
    • Do you have DR
  • Power
    • What is peak/average power requirements
    • HVAC
    • 2N redundancy (everything is duplicated)
  • Space
    • How much can you grow

Migration Bubble
  • Temporarily increased costs
    • Planning and assessment
    • Duplicate environments
    • Staff Training
    • Migration Consulting
    • 3rd Party Tooling
    • Lease penalties
      • Due to early termination of contracts

Methodology
  • Evaluate
    • Migration Readiness Assessment (MRA)
      • Are we ready?
      • Do we have all resources?
  • Plan
    • MRA and Planning Engagement
  • Design
  • Migrate
    • Execution
      • Discovery
        • What do I have
      • Grouping / Mapping
        • Servers into application stacks
        • Always migrate verticals
          • Horizontal move (e.g. DB first) unrealistic
      • Migration Strategy
        • See below (6R Approach )
      • Data Migration
      • Landing Zone
      • Server Migrations
      • Database Migrations
  • Optimize

Migration Strategy (6R Approach)
  • Retain (Revisit)
    • Do not touch
      • e.g. mainframe that must stay
  • Remove
    • Decommission
  • Rehost
    • aka Lift&Shift
    • Minimal changes to application
      • typically configuration only
    • Storage migration needed
    • e.g. 
      • Import VM as-is 
  • Replatform
    • Up-version of OS
    • Data Conversion
      • e.g. MySQL -> RDS (PostgreSQL)
  • Refactor
    • Change to use Cloud native mechanisms, e.g.
      • multi-region
      • Auto-Scaling
  • Rearchitecture
    • Complete rewrite
      • e.g. Server -> Serverless

Difficulties in migration
  • Coordination
  • Lack of testing
  • Server downtime during cutover


AWS CloudFront

Model
  • Distribution
    • Web: DNS starts with "d" (download)
    • RTMP: DNS starts with "s" (streaming)
  • Origin - place where the authority files are stored
    • S3
    • Custom Web Server
  • Behavior
    • How CF behaves when it receives a request
    • Path pattern - specifies which requests the behavior applies to
    • Examples
      • Forward Headers/Cookies
      • Minimum/Default/Maximum TTL
      • Restrict Viewer Access (signed URLs only)
  • Integrated with WAF
  • Supports HTTP/2 
  • Supports IPv6
    • Not recommended together with IP-restricted signed URLs, as the same viewer may arrive over IPv4 or IPv6

Lambda@Edge
  • Ability to run lambda on CF
  • Events 
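    • Viewer Request (after CF receives the request from the viewer)
    • Origin Request (before CF forwards the request to the origin)
    • Origin Response (after CF receives the response from the origin)
    • Viewer Response (before CF returns the response to the viewer)
  • Use cases
    • Inspect/rewrite cookies and URLs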
    • Support legacy urls
    • Make HTTP requests to third parties
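
The "support legacy urls" use case could look roughly like the viewer-request handler below. It assumes a Python runtime is available for Lambda@Edge and relies on the request sitting at Records[0].cf.request in the event; the path mapping is purely hypothetical.

```python
# Hypothetical mapping of retired paths to their current locations.
LEGACY_PATHS = {
    "/old-about.html": "/about",
    "/products.php": "/products",
}


def handler(event, context):
    """Viewer-request handler: rewrite known legacy paths, pass others through."""
    request = event["Records"][0]["cf"]["request"]
    request["uri"] = LEGACY_PATHS.get(request["uri"], request["uri"])
    # Returning the request object lets CloudFront continue processing it.
    return request


sample_event = {"Records": [{"cf": {"request": {"uri": "/old-about.html", "headers": {}}}}]}
print(handler(sample_event, None)["uri"])  # /about
```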

Forwarding Requests
  • Origin does not see all request data 
  • Forwardable
    • Headers (All, Whitelisted Only)
    • Cookies
    • Query parameters
  • Forwarding allows caching different object version based on value
    • Increases cache memory footprint
  • Use cases
    • Prevent hotlinking
    • Allow CORS for everyone

TTL
  • Obey origin response headers (Cache-Control max-age, s-maxage, Expires)
    • max-age is recommended
  • Behavior can override: minimum TTL, default TTL, maximum TTL
    • e.g. in cases when Origin does not set it properly
  • TTL-0
    • Used for Dynamic Content
    • CloudFront still caches the content
    • Sends a conditional GET (If-Modified-Since) to the origin every time
      • gives origin a chance to signal content hasn't changed
      • this saves bandwidth as Origin does not have to resend the page
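
A simplified model of how the three behavior TTLs interact with the origin's max-age: the origin value is clamped between minimum and maximum, and the default applies only when the origin sends no caching header. This is a sketch under that assumption, and it ignores s-maxage, Expires and no-cache directives.

```python
from typing import Optional


def effective_ttl(origin_max_age: Optional[int],
                  minimum: int, default: int, maximum: int) -> int:
    """Return the number of seconds an object stays cached at the edge."""
    if origin_max_age is None:
        # Origin sent no caching header, so the behavior's default applies.
        return default
    # Otherwise clamp the origin's value into the [minimum, maximum] window.
    return min(max(origin_max_age, minimum), maximum)


print(effective_ttl(None, minimum=0, default=86400, maximum=31536000))  # 86400
print(effective_ttl(10, minimum=60, default=86400, maximum=3600))       # 60
print(effective_ttl(999999, minimum=0, default=86400, maximum=3600))    # 3600
```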

Origin
  • S3
    • Origin Access Identity (OAI)
      • special CF user associated with customer Distribution-Origin
        • "Principal":{"CanonicalUser":"79a59d8f8d5218e7cd47ef2be"},
      • change S3 bucket policy to only allow OAI
  • Custom (customer own Web Server)
  • Multiple origins
    • First match (based on path) wins
    • Requires cache behavior for each origin
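
The "first match wins" rule for multiple origins can be approximated with an ordered list of path patterns. The patterns and origin names below are hypothetical, and fnmatch is only a stand-in for CloudFront's own wildcard matching (it also accepts [] sets, which is more than path patterns need).

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Ordered cache behaviors: (path pattern, origin name). Order matters.
BEHAVIORS: List[Tuple[str, str]] = [
    ("/images/hi-res/*", "s3-hi-res"),
    ("/images/*", "s3-assets"),
    ("/api/*", "custom-web-server"),
]
DEFAULT_ORIGIN = "custom-web-server"  # plays the role of the default behavior


def pick_origin(path: str) -> str:
    """Return the origin of the first behavior whose pattern matches the path."""
    for pattern, origin in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return origin
    return DEFAULT_ORIGIN


print(pick_origin("/images/hi-res/cat.png"))  # s3-hi-res
print(pick_origin("/images/logo.png"))        # s3-assets
print(pick_origin("/index.html"))             # custom-web-server
```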

Signing
  • Restrict access with signed URLs and/or signed cookies
    • Date/Time
    • IPs
    • Requires: CloudFront Key-Pair
  • Signed Url
    • Restrict access to individual files
    • Query parameters: Expires, Policy, Signature, Key-Pair-Id
  • Signed Cookies
    • Not supported for RTMP
    • No need to change urls
    • Restrict access to multiple files at once
      • e.g. HLS stream (multiple file segments)
    • User authenticates on the customer's site, which sets the signed cookie in the browser
  • Process
    • Create public-private key-pair
    • Upload to account (via Console)
    • Indicate which AWS accounts can sign (Trusted Signer)
    • Create policy document (i.e. rules of access)
      • SHA1 of policy document signed with private key
    • Include encoded policy document + signature as query string parameters
    • CloudFront verifies policy/signature on access 
  • Trusted signer account ID is added to
    • Web - behavior  (can have multiple behaviors)
    • RTMP - distribution
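
The policy and encoding steps of the process above can be sketched with the standard library. The sketch builds a custom policy (expiry plus optional IP restriction) and applies CloudFront's URL-safe base64 variant, which replaces "+", "=" and "/" with "-", "_" and "~". The RSA-SHA1 signature step is deliberately left out because it needs a crypto library; the URL, timestamp and CIDR are placeholders.

```python
import base64
import json
from typing import Optional


def custom_policy(resource: str, expires_epoch: int,
                  source_ip_cidr: Optional[str] = None) -> str:
    """Build a compact policy document stating who may fetch what, until when."""
    condition = {"DateLessThan": {"AWS:EpochTime": expires_epoch}}
    if source_ip_cidr is not None:
        condition["IpAddress"] = {"AWS:SourceIp": source_ip_cidr}
    policy = {"Statement": [{"Resource": resource, "Condition": condition}]}
    # Compact separators strip the whitespace from the document.
    return json.dumps(policy, separators=(",", ":"))


def cloudfront_b64(data: bytes) -> str:
    """Base64-encode, then swap characters that are unsafe in a query string."""
    encoded = base64.b64encode(data).decode("ascii")
    return encoded.translate(str.maketrans("+=/", "-_~"))


policy = custom_policy("https://d111111abcdef8.cloudfront.net/videos/*",
                       1519689600, "203.0.113.0/24")
print("Policy=" + cloudfront_b64(policy.encode("utf-8")))
```

The same encoding would be applied to the signature bytes before appending them as the Signature parameter.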

Trusted Signer
  • AWS account with an active CloudFront Key Pair
  • Key Pair allowed for root account only (not IAM user)
  • Max 2 active key pairs at a time
  • Possible to upload your own RSA key

Geoblocking
  • Built-in: country-level (~99.8% accuracy)
  • ThirdParty
    • Use your webserver to build links 

Compression
  • Supported natively 
  • Compressed by edge locations
  • Compressible files 1,000 bytes - 10,000,000 bytes
  • ETags are stripped (as compressed and non-compressed versions should have different values)
  • Enabled on Behavior
  • Custom Origin Compressions
    • Still useful when the file type is not compressed by CF

Invalidation
  • Expensive
  • Supports wildcards
    • e.g. "/images/hi-res/*"

SSL
  • Custom SSL certificates 
    • Dedicated IP - $600/month
    • SNI - only newer browsers support it
  • Supports ACM certificates
  • Supports Redirection HTTP->HTTPS on the edge
  • Communication to Origin
    • Match Viewer
    • Enforce HTTPS
    • Enforce HTTP

Pricing
  • Classes 
    • All (us, eu, ap-northeast, ap-southeast-1, ap-southeast-2, sa-east)
    • 200 (us, eu, ap-northeast, ap-southeast-1)
    • 100 (us, eu)
    • Viewers in locations not covered by the price class see higher latency
  • Reserved Capacity available

Can act as reverse proxy
  • May sit in front of dynamic website
    • cache only certain portions based on rules
    • works like Varnish
  • Header "X-Forwarded-Proto" useful to decide http vs https on origin

Header manipulation
  • Custom Headers can be added/overridden
  • Use case:
    • Add X-Shared-Secret=****** to let the Origin verify that the request comes from CF
    • Add CORS headers (bypass user)
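
On the origin side, the shared-secret check might look like the sketch below. The secret value is a placeholder and the function name is made up; a real server would load the secret from configuration and may expose header names in lower case.

```python
import hmac

EXPECTED_SECRET = "example-secret"  # placeholder, not a real value


def is_from_cloudfront(headers: dict) -> bool:
    """Accept the request only if it carries the header CloudFront was told to add."""
    supplied = headers.get("X-Shared-Secret", "")
    # compare_digest avoids leaking the secret through timing differences.
    return hmac.compare_digest(supplied.encode("utf-8"),
                               EXPECTED_SECRET.encode("utf-8"))


print(is_from_cloudfront({"X-Shared-Secret": "example-secret"}))  # True
print(is_from_cloudfront({}))                                     # False
```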

Field-level Encryption
  • Ability to encrypt HTML Forms fields at the edge
  • Keys
    • Public RSA provided by customer to CloudFront
    • Private RSA secured by user (e.g. Parameter Store + KMS)
  • Use cases
    • PCI compliance


AWS EC2 (VM Import/Export)


Overview
  • Allows importing on-premise VM images
  • Not integrated into AWS Console
    • Must use API
  • Recommended to use Server Migration Service (GA: 2016)

Virtualization Platforms
  • VMware
  • Citrix Xen
  • Microsoft Hyper-V

VM Image Formats
  • OVA - Open Virtual Appliance
    • Supports multiple disks
  • VMDK - Stream-optimized ESX Virtual Machine Disk 
    • Compatible with VMware ESX/vSphere
  • VHD/VHDX - fixed and dynamic image formats
    • Compatible with Microsoft Hyper-V, Citrix Xen
  • Raw

Importing VMs
  • Stop source VM
  • Prepare source VM (disable firewall, enable remote access, etc.)
  • Export VM from the virtualization environment
    • If using VMware vCenter you can use AWS Connector for vCenter
    • Result: image (see VM Image Formats)
  • Import into S3 VM catalog 
    • ec2-import-instance
      • starts a new task 
        • check status: ec2-describe-conversion-tasks
      • once completed instance is launched in stopped state
  • Create AMI
    • Optionally copy AMI to other regions
  • Limitations
    • HVM only with 64-bit
    • BYOL for RHEL
    • Expanded image cannot exceed 1TiB
    • Single ENI
    • No SR-IOV (Enhanced Networking)
      • Except for Windows 2012

Importing Instance
  • Imports as EC2 Instance

Importing Volume
  • Similar to importing VM image
  • Imports single disk (creates EBS snapshot)

Exporting VM
  • Export only if you previously imported it
  • Create bucket with upload/delete & view permissions for AWS account

vCenter


  • AWS has special management plugin (AWS VM Connector)
    • Standalone VM (OVA file)
    • Supports monitoring, tagging
    • Verifies various permissions along the way
  • Process
    • vSphere client authorizes the import
      • AWS Management Portal for vCenter
        • Verifies permissions and returns a token
    • vSphere client sends the request (+token) to the AWS Connector
    • AWS Connector starts the migration
      • Returns taskId to the vSphere client

AWS Server Migration Service


Overview
  • Lift&Shift server migration
  • Supports incremental migration of live servers
  • Orchestrates large-scale migrations (see Migration Hub)
  • Tracks progress
  • Produces AMI
    • Removes unwanted VM tools before AMI
  • Replication schedules
  • Supports
    • VMware
    • Microsoft Hyper-V
  • Alternatives
    • VM Import API

Agentless
  • No need to install agent on each source server
  • Connector
    • Needs to be installed on the host (hypervisor)
    • Captures block-level changes

Incremental replication
  • Constantly hydrates data into target environment
  • Requires one-time snapshot
  • Limits network bandwidth
  • Frequency configurable (e.g. every 2h)
    • Every run is separate AMI
  • Duration 90 days
  • Shortens the cut-over time
    • "Last sync" only

Licensing
  • Auto
    • Windows - AWS will provide license
      • AWS will charge for license
    • Linux - BYOL
  • AWS
  • BYOL


AWS Database Migration Service

Overview
  • Online database migration
  • Creates tables, loads data, keeps it in sync
  • Agentless (uses Replication Instance)

Engines
  • Supports
    • MySQL
    • MariaDB
    • PostgreSQL
    • Oracle
    • SQL Server
    • Aurora
    • Redshift

Model
  • Environments
    • on-premise
    • AWS (EC2, RDS)
  • Heterogeneous migration possible (e.g. Oracle -> MySQL)
  • Does not propagate most DDL objects (indexes, sprocs, etc.)
    • Must be replicated manually
  • To obtain changes it calls database's change capture API
    • Oracle: LogMiner
    • MySQL:  binlog API
    • PostgreSQL: replication slots

Endpoint
  • "Connection String" to the database
  • required for source/target

Replication Instance
  • EC2 instance
  • Performs actual work
  • Similar to data pipeline
  • Instance Types
    • C4 (e.g. dms.c4.large), 100GB gp2
    • T2 (e.g. dms.t2.micro), 50GB gp2
  • Public IP - needed only when source/target accessible over Internet

Migration Task
  • Which schema/tables to migrate
  • Migration Type
    • Full Load, Ongoing Replication
    • Ongoing Replication
    • Full Load
  • Endpoints (source/target)
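
The pieces of a migration task can be sketched as the parameters one would hand to DMS when creating a replication task. The table-mapping JSON uses a single selection rule; all ARNs and the schema name are placeholders, and the helper names are hypothetical. No AWS call is made.

```python
import json

# Full load only, ongoing replication only, or both.
MIGRATION_TYPES = {"full-load", "cdc", "full-load-and-cdc"}


def table_mappings(schema: str, table: str = "%") -> str:
    """Selection rule picking which schema/tables to migrate (% is a wildcard)."""
    rule = {
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "1",
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
    }
    return json.dumps({"rules": [rule]})


def migration_task_params(source_arn: str, target_arn: str, instance_arn: str,
                          schema: str, migration_type: str) -> dict:
    """Collect the task settings: endpoints, type and table selection."""
    if migration_type not in MIGRATION_TYPES:
        raise ValueError(f"unknown migration type: {migration_type}")
    return {
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": migration_type,
        "TableMappings": table_mappings(schema),
    }


params = migration_task_params(
    "arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    "arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    "arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    schema="hr", migration_type="full-load-and-cdc")
print(params["MigrationType"])
```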

Encryption
  • Encrypts storage used by replication instance and endpoint connection
  • Uses KMS (aws/dms) 

Use Cases
  • One time migration
  • Continuous replication (e.g. prod->test)
  • Move shards between servers
  • Moving off commercial database
  • Hybrid deployments
  • Integrate tables from 3rd party tools (e.g. PeopleSoft)



AWS Migration Hub


Overview
  • Centralized single pane of glass for tracking the migration process
  • Integration with other tools
  • Multi-region
  • Provides reporting
    • Views
      • Project Management
      • Engineering 
    • Can be extended via API


AWS Tools Integrations
  • AWS Server Migration Service
  • AWS Database Migration Service
  • Application Discovery Service
    • Integrated (used to be stand-alone service)
    • Responsibilities
      • Identify servers
      • Measure performance
        • Time-series utilization tracking
          • Helps right-size instances
      • Captures dependencies
        • Network communication between servers
    • Modes
      • Agentless
        • VMware vCenter only
          • Installed on VMware host (hypervisor)
        • Easy to deploy but limited data
      • Agent
        • Installed on each server
        • Runs in user space


3rd Party Tools Integrations
  • CloudEndure
  • Atadata


Thursday, 15 February 2018

AWS ECS (Task)


Task
  • Instantiated task definition
  • Running docker instances (containers) specified in definition
    • One or more containers per task
  • Entire task is run on a single Container Instance
  • Two tasks (from the same definition) may run on different container instances 

Task Definition
  • JSON representation of a docker run CMD
  • Required to run Docker container
  • Grouping of related containers that run together e.g.
    • container: wordpress (PHP)
    • container: mysql
  • Each container in the group lands on the same instance
  • If you use the same name ECS versions it and always uses the latest
  • Must be registered with ECS
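
A minimal sketch of such a definition for the wordpress + mysql pair, written as a Python dict that serializes to the JSON document. Image names, memory sizes, ports and the password are illustrative values only.

```python
import json

task_definition = {
    "family": "wordpress",  # re-registering the same family creates a new revision
    "containerDefinitions": [
        {
            "name": "wordpress",
            "image": "wordpress",
            "memory": 256,
            "essential": True,
            "links": ["mysql"],  # reach the mysql container without port mappings
            "portMappings": [
                {"hostPort": 80, "containerPort": 80, "protocol": "tcp"}
            ],
        },
        {
            "name": "mysql",
            "image": "mysql",
            "memory": 256,
            "essential": True,
            # Placeholder credentials, for illustration only.
            "environment": [{"name": "MYSQL_ROOT_PASSWORD", "value": "example"}],
        },
    ],
}

print(json.dumps(task_definition, indent=2))
```

Both containers belong to one task, so they land on the same container instance.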

Task Definition (structure)
  • Family
    • Name of the task definition - can have revisions
  • Container Definitions
    • basic
      • name - can be used in the links section
      • image - url to repository
      • memory  - at least 4MiB
      • port mappings - bridge container host port with container port
        • hostPort -> containerPort
        • protocol 
    • advanced
      • cpu - sharing CPU
      • essential - whether a failed container should stop the whole task
      • links - allows 2 containers to communicate without port mappings
      • ...
  • Volumes
    • Data volumes to be attached to container
    • Volumes are associated with container instance
    • Use cases
      • persistence (container disks are ephemeral)
      • Sharing scratch area