Notes on AWS, Big Data, Machine Learning and Leadership: 2017

Sunday, 31 December 2017

AWS CloudWatch (Events)

Overview

AWS Resources publish information about state changes as CloudWatch Events
Target can execute action upon event
Rule can route event to Target
Use cases
- Invoke Lambda to modify DNS when EC2 instance is launched
- Direct CloudTrail records to Kinesis
- Run SSM command when instance is launched
- Log AWS API Calls
Near real-time
At-least-once trigger (a rule may fire more than once for the same event)

Event

Triggered by:
- AWS resource changes state, e.g.
  - EC2 instance pending->running
  - ASG launches or terminates an instance
  - EBS created a snapshot
  - Code Deploy instance state change
  - Sign-in to AWS Management Console
  - [many other AWS Services]
- AWS CloudTrail
  - Can be used as intermediary
  - Read/Write calls supported by CloudTrail can be relayed as Events
- Customer code publishes event (PutEvents)
- Scheduled (self-triggered)
  - Cron expressions
  - Rate expressions
Uses JSON format
Can contain custom payload (useful for Lambda)
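
A minimal sketch of a scheduled rate expression and a custom event carrying its own payload. The entry keys (Source, DetailType, Detail) follow the PutEvents request shape; the source name, detail type, payload and helper names are hypothetical and only there for illustration.

```python
import json


def rate_expression(value: int, unit: str) -> str:
    """Build a rate expression such as 'rate(5 minutes)'.

    The unit is singular when the value is 1 and plural otherwise.
    """
    if value < 1:
        raise ValueError("rate value must be a positive integer")
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour' or 'day'")
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def custom_event_entry(source: str, detail_type: str, payload: dict) -> dict:
    """Build one PutEvents entry; the custom payload travels as a JSON string."""
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(payload),
    }


entry = custom_event_entry(
    "com.example.orders",                 # hypothetical source name
    "order-created",                      # hypothetical detail type
    {"orderId": 42, "priority": "high"},  # hypothetical payload
)
print(rate_expression(5, "minute"))  # rate(5 minutes)
```

With boto3 installed and credentials configured, an entry like this could be sent with `boto3.client("events").put_events(Entries=[entry])`.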

Event Bus

Each AWS account has default bus
Allows sending events to receiver AWS account
- On receiver account specify permissions
- Create a rule
- Attach foreign Event Bus as a target

Rule

Matches incoming events and routes them to targets
Matching is unordered (rules are not evaluated in any particular order)
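
A simplified sketch of how a rule's event pattern could be matched against an event. It assumes the basic pattern shape only (leaf values are lists of accepted values, nested objects match recursively) and ignores the richer matching features of the real service. The rule names and target labels are made up.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Return True if every field named in the pattern matches the event.

    Fields that the pattern does not mention are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: descend into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of accepted values.
            return False
    return True


def route(rules: dict, event: dict) -> list:
    """Collect the targets of every matching rule; rule order has no meaning."""
    targets = []
    for rule in rules.values():
        if matches(rule["pattern"], event):
            targets.extend(rule["targets"])
    return sorted(targets)  # sorted only to make the result deterministic


event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"},
}
rules = {  # hypothetical rules and target labels
    "update-dns": {
        "pattern": {"source": ["aws.ec2"], "detail": {"state": ["running"]}},
        "targets": ["lambda:update-dns"],
    },
    "audit": {"pattern": {"source": ["aws.ec2"]}, "targets": ["kinesis:audit"]},
    "s3-only": {"pattern": {"source": ["aws.s3"]}, "targets": ["sns:s3-alerts"]},
}
print(route(rules, event))  # ['kinesis:audit', 'lambda:update-dns']
```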

Target

Receives event as JSON
- AWS Systems Manager (Run Command)
- EC2 API calls
- ECS tasks
- Lambda
- Kinesis Streams
- SNS
- [other AWS Services]
- Event Bus in another account

References

http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/DocumentHistory_cwe.html

AWS ELB(NLB)

Overview

Operates at OSI Layer 4 (connection level)
- TCP: IP + Port
- Layer 3 would be just IP
Full control over IP addresses
- Single IP address per AZ (VPC subnet)
  - EIP possible to attach
  - No CNAME resolution
Long-running connections (months) supported
- Normally idles after timeout
- Use cases: IoT, gaming, messaging
- No idle-timeout configuration
Zonality
- No cross-zone balancing
  - But fails over to another AZ if all targets unhealthy (Route 53)

Limitations

No SSL termination
No Backend server encryption

Types

Internet-facing
Internal

Target Types

Instance Id or IP (just like ALB)

Performance

Scales to millions of requests
Very low latency
Handles volatile traffic well
- Sudden spike (e.g. "flash sales")

Client source IP

Unlike other ELB types, it preserves the client source IP address
Only applies to targets registered by instance ID (not IP targets)
Proxy Protocol still available
No need for X-Forwarded-For

Monitoring

VPC flow logs (instead of access logs)
CloudWatch

Healthchecks

Network level
- Observes normal (organic) traffic to target
Application level (like CLB/ALB)
- Synthetic

Pricing

NLCU
- 100K active connections / minute
- 800 new connections (flows) / second
- 2.22 Mbps (1 GB / h)
Highest dimension used (like in ALB)

References

https://aws.amazon.com/elasticloadbalancing/details/#compare

AWS ELB(ALB)

Overview

Layer 7 (advanced)
- Content based routing
Evaluates listener rules
Use cases
- Single LB fronting different types of services (e.g. website, api)
- Microservices in containers (integrated with ECS)
Improved performance over ELB (cheaper)
Integrated with WAF
IPv6 support

Types

Internet facing
Internal

Limitations

No backend authentication (unlike CLB)

Listeners

HTTP/HTTPS
- Ports 1-65535
HTTPS
- Multiple certificates possible (SNI)
  - Smart selection if
WebSockets
- HTTP (ws://) or HTTPS (wss://)
HTTP/2
- HTTPS listeners only
- Server-Push not available
Has Listener Rules (1+)

Listener Rule

Contains
- Priority
- Action
  - Always forward request
- Optional Host
  - Host-based routing
- Optional Path
  - Path-based routing
Default rule has no conditions (catch-all)
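
A rough sketch of listener rule evaluation: rules are tried in priority order (lowest number first) and the default rule catches whatever is left. `fnmatchcase` only approximates the load balancer's wildcard matching, and the rule layout and target group names are hypothetical.

```python
from fnmatch import fnmatchcase


def pick_target_group(rules: list, default_target_group: str,
                      host: str, path: str) -> str:
    """Return the target group of the first rule whose conditions all match."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        host_pattern = rule.get("host")  # optional host condition
        path_pattern = rule.get("path")  # optional path condition
        host_ok = host_pattern is None or fnmatchcase(host.lower(), host_pattern)
        path_ok = path_pattern is None or fnmatchcase(path, path_pattern)
        if host_ok and path_ok:
            return rule["forward_to"]
    # The default rule has no conditions, so it matches everything else.
    return default_target_group


rules = [  # hypothetical rules
    {"priority": 20, "path": "/img/*", "forward_to": "images-tg"},
    {"priority": 10, "host": "api.example.com", "forward_to": "api-tg"},
]
print(pick_target_group(rules, "web-tg", "api.example.com", "/v1/users"))  # api-tg
print(pick_target_group(rules, "web-tg", "www.example.com", "/img/a.png"))  # images-tg
print(pick_target_group(rules, "web-tg", "www.example.com", "/"))  # web-tg
```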

Target

Type
- EC2 instance
- IP address
  - Inside/outside VPC (e.g. on-premise)
  - IP must be private
    - ClassicLink instances
    - Peered-VPC
    - On-premise instances (Direct Connect/VPN)
      - Use case: migrate-to-cloud/burst-to-cloud/fail-over-to-cloud
State
- draining
Same target may be registered multiple times (different ports) e.g. microservices

Target Group

Set of targets
Listener rule forwards traffic to Target Group
Has its own HealthCheck
- If no healthy targets still routes traffic
You don't need to take the whole instance out of rotation
May be attached to Auto Scaling Group

Request Tracing

LB injects a header X-Amzn-Trace-Id
Supports chaining: Field={Root, Self}
Visible in Access Logs ("trace_id")
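
A small sketch of how a backend might split the X-Amzn-Trace-Id header into its fields. It assumes the header is a semicolon-separated list of Key=Value pairs; the sample value is illustrative.

```python
def parse_trace_id(header_value: str) -> dict:
    """Split 'Key=Value;Key=Value' into a dict, skipping malformed parts."""
    fields = {}
    for part in header_value.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep and key:
            fields[key] = value
    return fields


sample = "Self=1-67891234-12456789abcdef012345678;Root=1-67891233-abcdef012345678912345678"
trace = parse_trace_id(sample)
print(trace["Root"])  # 1-67891233-abcdef012345678912345678
```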

Sticky Sessions

Only LB cookie supported (AWSALB)
Websockets are inherently sticky (long-lasting connection)

Healthchecks

Ability to define "successful" HTTP status codes

Pricing

Per-hour fee
LCU
- Dimensions
  - 3000 Active Connections per minute
  - 25 new connections established per second
    - Certificate key size matters (shorter = cheaper)
  - 1000 rule evaluations per second
  - Data transferred 2.22 Mbps (= 1 GB per hour)
- Highest dimension used to evaluate number of LCUs consumed
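
A worked sketch of the "highest dimension wins" rule, using the per-LCU figures listed above and assuming the rule-evaluation figure is per second. The usage numbers are invented, and real billing has further details (e.g. free rule evaluations) that are not modelled here.

```python
# Capacity of one LCU in each dimension (figures from the notes above).
LCU_DIMENSIONS = {
    "new_connections_per_second": 25,
    "active_connections_per_minute": 3000,
    "rule_evaluations_per_second": 1000,
    "gb_per_hour": 1,
}


def lcus_consumed(usage: dict) -> float:
    """Return the LCU count for the dimension with the highest usage."""
    return max(
        usage.get(name, 0) / per_lcu
        for name, per_lcu in LCU_DIMENSIONS.items()
    )


usage = {  # hypothetical hourly averages
    "new_connections_per_second": 100,      # 100 / 25   = 4.0 LCUs
    "active_connections_per_minute": 6000,  # 6000 / 3000 = 2.0 LCUs
    "rule_evaluations_per_second": 500,     # 500 / 1000 = 0.5 LCUs
    "gb_per_hour": 2,                       # 2 / 1      = 2.0 LCUs
}
print(lcus_consumed(usage))  # 4.0
```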

References

Thursday, 28 December 2017

AWS Batch

Overview

Simplifies running batch jobs
Provisions EC2 resources
- Allows specifying % of spot instances
Lower level than Hadoop
No additional pricing
Uses ECS container instances to execute jobs
Scales to 100K+ jobs

Batch Computing Advantages

You can shift computing when it is cheaper
Avoids idling resources + higher efficiency
Enables prioritization

Use Cases

File uploaded to S3
- SNS notification
  - Lambda submits a batch job

Job (what)

Unit of work
- Shell script
- Linux executable
- Container image (docker)
  - Pulled from internal/external registry
Runs as containerized application
Has AWS Job Id and Name
Can reference other jobs (dependencies)
- You can chain multiple jobs
Parameters can get overridden

Job Definition (how)

"Blueprint for resources"
Name:RevisionNumber
Hardware requirements
- vCPU
- Memory
Mount points
Environment variables
jobRoleARN - permission passed to the container

Job States

SUBMITTED
- Added to the queue
- Upon evaluation by Job Scheduler transitions to
  - PENDING (has dependencies)
  - RUNNABLE (no dependencies)
PENDING
- cannot run due to dependencies
- if a dependency fails, the dependent job moves to FAILED, too
RUNNABLE
- no outstanding dependencies
- can be started as soon as resources are available
STARTING
- scheduled on the host
- container initialization is underway; transitions to
  - RUNNING
RUNNING
- Running as a container job on ECS container instance
SUCCEEDED
- Job completed with Exit code = 0
- Logs available in CloudWatch Logs
FAILED
- All available attempts failed
- Retry
  - Trigger
    - Exit Code != 0
    - EC2 instance failure
    - AWS failure
  - Attempts
    - Default: 1, Max: 10
    - AWS_BATCH_JOB_ATTEMPT environment variable passed to the container
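
The state list above can be read as a small state machine. A sketch that encodes only the transitions mentioned in these notes (the real service may allow more) plus the retry bookkeeping, assuming attempt numbering starts at 1. The function names are made up.

```python
# Allowed transitions, as described in the notes above.
TRANSITIONS = {
    "SUBMITTED": {"PENDING", "RUNNABLE"},
    "PENDING": {"RUNNABLE", "FAILED"},
    "RUNNABLE": {"STARTING"},
    "STARTING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED"},
    "SUCCEEDED": set(),
    "FAILED": set(),
}


def is_valid_path(states: list) -> bool:
    """Check that each consecutive pair of states is an allowed transition."""
    return all(
        nxt in TRANSITIONS.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )


def attempts_left(current_attempt: int, max_attempts: int = 1) -> int:
    """Remaining retries, given the attempt number seen by the container."""
    if not 1 <= max_attempts <= 10:
        raise ValueError("max_attempts must be between 1 and 10")
    return max(0, max_attempts - current_attempt)


happy_path = ["SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING", "SUCCEEDED"]
print(is_valid_path(happy_path))               # True
print(is_valid_path(["SUBMITTED", "RUNNING"]))  # False
print(attempts_left(1, max_attempts=3))         # 2
```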

Job Queue

Place where submitted jobs reside until scheduled
Priority value associated
Has Compute Environments associated
- Ordered, Max 3

Scheduler

Attached to a Job Queue
Decides when and where jobs are run (i.e. what resources)
Dependency-aware
Runs queues according to priorities
FIFO
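
A toy model of the scheduling idea, not the actual scheduler: queues are visited by priority (assuming a higher number means higher priority), jobs within a queue are taken in FIFO order, and a job is skipped until its dependencies have succeeded. Every job is assumed to succeed instantly, and all names are hypothetical.

```python
from collections import deque


def schedule_order(queues: dict) -> list:
    """Return the order in which jobs would be started.

    queues maps a queue name to (priority, [(job_name, depends_on), ...]).
    """
    pending = {name: deque(jobs) for name, (_, jobs) in queues.items()}
    by_priority = sorted(queues, key=lambda name: queues[name][0], reverse=True)
    done, order = set(), []
    progress = True
    while progress:
        progress = False
        for name in by_priority:
            # First job in FIFO order whose dependencies are all satisfied.
            for job in list(pending[name]):
                job_name, depends_on = job
                if set(depends_on) <= done:
                    pending[name].remove(job)
                    order.append(job_name)
                    done.add(job_name)
                    progress = True
                    break
            if progress:
                break  # restart from the highest-priority queue
    return order


queues = {
    "high": (10, [("h1", []), ("h2", ["l1"])]),
    "low": (1, [("l1", []), ("l2", [])]),
}
print(schedule_order(queues))  # ['h1', 'l1', 'h2', 'l2']
```

Note how h2 waits for l1 even though it sits in the higher-priority queue.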

Compute Environment

Same as ECS Cluster
Set of compute resources
Types
- Managed
  - Specific Instance Types (multiple) or The Newest
  - Min/Max/Desired vCPUs
- Unmanaged
  - AWS Batch creates ECS Cluster
  - Use when you need special resources (e.g. EFS, Dedicated Hosts)

Array Job

Collection or
Examples (embarrassingly parallel)
- Monte Carlo simulations
- Parametric sweeps
- Large rendering jobs
Submitted like a single job
- Specify array size
- AWS_BATCH_JOB_ARRAY_INDEX passed to container
Parent Array Job has normal AWS Batch Id (e.g. 1)
- Children have index appended (e.g. 1:0)
Dependency type
- SEQUENTIAL
  - A:1 cannot start until A:0 succeeds
- N_TO_N
  - Allows running multi-stage processing
  - Each job corresponds to input split
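
A sketch of how a child job might pick its share of the work from AWS_BATCH_JOB_ARRAY_INDEX. The list of input splits and the helper names are hypothetical; the child id format follows the "1:0" example above.

```python
import os


def child_job_id(parent_id: str, index: int) -> str:
    """Child ids are the parent id with the array index appended."""
    return f"{parent_id}:{index}"


def split_for_this_child(splits: list, environ=None) -> str:
    """Pick the input split that corresponds to this child's array index."""
    env = os.environ if environ is None else environ
    # Falls back to index 0 when run outside of an array job.
    index = int(env.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return splits[index]


splits = ["input/part-0", "input/part-1", "input/part-2"]  # hypothetical splits
fake_env = {"AWS_BATCH_JOB_ARRAY_INDEX": "2"}
print(child_job_id("1", 0))                    # 1:0
print(split_for_this_child(splits, fake_env))  # input/part-2
```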

References

Wednesday, 27 September 2017

AWS EMR

Model

EMR is more than "map reduce"
Hadoop
- Moniker for all Open Source big data projects (ecosystem)
  - Extract: Sqoop, MapReduce API
  - Transform & Load: Spark, Cascading, Pig, MR
  - Data Warehouse (file formats): Parquet, ORC, Seq, Text
  - Report Generation: Hive, Spark, Cascading, Pig
  - Ad hoc analysis: Presto, Hive, Spark-SQL, Lingual, Impala
- Distributed storage and compute
EMR manages Hadoop cluster
- Deploying software bits
- Managing nodes lifecycle
- AWS runs customized version of Hadoop - new release every month
- Uses Amazon Linux (A-Linux)
EMR also supports non-Hadoop distribution MapR
- no-NameNode architecture
- can tolerate multiple failures with automatic failover/failback

Cluster

Collection of nodes
Master node - management, coordination of slaves
- Not much processing power required
- Do not use spot instances
Slave nodes
- Core nodes - run tasks and store data
  - Processing Power + Storage
- Task nodes (optional) - run tasks
  - Processing Power (no storage)
- Failed slaves are not automatically replaced
Use cases
- Job flow engine (i.e. schedule jobs)
- Long running cluster (shared EMR cluster that stays up)
  - e.g. for Facebook Presto
  - can use (blue/green) deployment for new cluster

Security

Security Groups
- Master - ingress
  - SSH
  - Various IP ranges belonging to AWS
- Slave
IAM Roles
- EMR Role - EMR service access on your behalf (i.e. running nodes)
- EC2 Instance profile - associated with running EMR nodes (i.e. what can be accessed by EC2 instance)
- Auto Scaling role - allows Auto Scaling to interact with EMR

Job

Workflow that represents program executed by EMR
Consists of series of steps
- Step types
  - Streaming program
    - reads standard input
    - runs mapper
    - runs reducer
    - writes to standard output
  - Hive program
  - Pig program
  - Spark application
  - Custom JAR
    - Java program
    - Bash script
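
A minimal word-count sketch of the streaming model: the mapper and reducer only deal with lines of text, and the framework sorts mapper output by key in between. Here the sort is simulated locally with `sorted`; the function names are made up, and a real step would wire these functions to standard input and output.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input must already be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce on one machine."""
    return list(reducer(sorted(mapper(lines))))


print(run_locally(["Big data", "big cluster"]))
# ['big\t2', 'cluster\t1', 'data\t1']
```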

Processing Data

Submit jobs directly to installed app (e.g. Hive, Pig)
- SSH to master
- Access tools
Running steps

Cluster lifecycle (Job flow)

STARTING: AWS provisions clusters, installs Hadoop
BOOTSTRAPPING: install additional apps
RUNNING: runs all the steps
After steps are completed
- WAITING: if long running persistent cluster
  - SHUTTING_DOWN: manually terminated
    - TERMINATED
- SHUTTING_DOWN
  - COMPLETED

Cost

EMR
S3
EC2
- Spot Instances
  - Hadoop is already interruptible so a good fit
  - Do not use spot for master node

Storage

Hadoop HDFS
- native filesystem (also used for HBase)
- cannot decouple storage from compute
- ephemeral (lost when cluster terminated)
- useful for caching intermediate results
- replicates data between nodes
- Node Types
  - DataNode - stores files' blocks (64MB)
  - NameNode - master for DataNode (tracks which block is where)
EMRFS
- S3 single source of truth: data lake
- Multiple clusters can work on the same data
- Consistent View
  - DynamoDB based index
  - Very fast index
- Copy
  - s3distcp - efficient, parallel copy of data S3 <-> EMR cluster
Combination
- S3 as input/output
- HDFS intermediate results

Hadoop YARN

"Yet Another Resource Negotiator"
Component for managing resources
- nodes
- allocating tasks
It can be used to run application not related to Hadoop MapReduce, e.g.
- Apache Tez
- Apache Spark

Tools

Hue - UI for Hadoop
Hive
- Uses SQL syntax to generate map reduce jobs
- Code generation
- Schedule with the engine
- Quite slow
- extensible with Java
- complex user defined types
- can access DynamoDB, S3
- can process very large amounts of data
Impala
- SQL-like language
- In-memory
- Uses hive metadata
- Bypasses Hadoop MapReduce
- Only works with HDFS
Facebook Presto
- In-memory
- Can work with Hive tables
- Very low latency
- Query directly against S3
- Bypasses MapReduce

References