Overview
- Managed deployment of Big Data solutions (Hadoop ecosystem, Spark, Presto, etc.)
- EMR used to stand for Elastic MapReduce, but covers much more nowadays
- Hadoop
- Moniker for all Open Source big data projects (ecosystem)
- Extract: Sqoop (e.g. accessing data from MySQL), MapReduce API
- Transform & Load: Spark, Cascading, Pig, MR
- Data Warehouse (file formats): Parquet, ORC, Seq, Text
- Report Generation: Hive, Spark, Cascading, Pig
- Ad hoc analysis: Presto, Hive, Spark-SQL, Lingual, Impala
- Distributed storage and compute
- AWS runs customized version of Hadoop - new release every month
- EMR manages cluster
- Deploying software bits
- Managing nodes lifecycle
- Uses Amazon Linux (A-Linux)
Cluster
- Collection of nodes
- Master node - management, coordination of slaves
- Not much processing power required
- Do not use spot instances
- Slave nodes
- Core nodes - run tasks and store data
- Processing Power + Storage
- Do not scale down gracefully (requires data reshuffle), hence Task nodes
- Task nodes (optional) - run tasks
- Processing Power (no storage)
- Good fit for Spot instances
- Failed slaves are not automatically replaced
- Use cases
- Job flow engine (launch mode: "step execution")
- Long running cluster (launch mode: "cluster")
- e.g. for Facebook Presto
- can use (blue/green) deployment for new cluster
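The node layout above can be sketched as a cluster request. This is a minimal sketch of the dict shape accepted by boto3's `run_job_flow`; the cluster name, instance types, counts, and log bucket are hypothetical placeholders, not prescriptions.

```python
# Sketch of an EMR cluster request mirroring the node layout above.
# Names (cluster name, log bucket) are hypothetical; the dict follows the
# shape expected by boto3's emr.run_job_flow().
cluster_request = {
    "Name": "example-long-running-cluster",        # hypothetical
    "ReleaseLabel": "emr-5.36.0",
    "LogUri": "s3://my-example-bucket/emr-logs/",  # hypothetical bucket
    "Instances": {
        "InstanceGroups": [
            # Master: coordination only, modest size, never spot
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            # Core: compute + HDFS storage, on-demand (scale-down needs reshuffle)
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2,
             "Market": "ON_DEMAND"},
            # Task: compute only, safe to run on spot
            {"Name": "Task", "InstanceRole": "TASK",
             "InstanceType": "m5.xlarge", "InstanceCount": 4,
             "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running ("cluster" launch mode)
    },
}

# Only the TASK group uses the spot market:
spot_roles = [g["InstanceRole"]
              for g in cluster_request["Instances"]["InstanceGroups"]
              if g["Market"] == "SPOT"]
print(spot_roles)  # ['TASK']
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` instead would give the "step execution" behaviour: the cluster terminates once all steps finish.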
Security
- Security Groups
- Master - ingress
- SSH
- Various IP ranges belonging to AWS
- Slave
- Ingress from Master (intra-cluster traffic)
- IAM Roles
- EMR Role - EMR service role to access AWS on your behalf (e.g. launching instance)
- EC2 Instance profile - associated with running EMR nodes (i.e. what can be accessed by EC2 instance)
- Auto Scaling role - allows Autoscaling interact with EMR
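For the EC2 instance profile above, the role must be assumable by the EC2 service. This is a hypothetical trust policy sketch; the service role would use `elasticmapreduce.amazonaws.com` as the principal instead.

```python
import json

# Hypothetical trust policy for the EC2 instance profile role:
# it lets the EMR nodes (EC2 instances) assume the role.
ec2_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
print(json.dumps(ec2_trust_policy, indent=2))
```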
- Encryption
- At-rest
- S3 (EMRFS)
- Local disk
- In-Transit
- Shuffle traffic between nodes (Tez, Spark, MapReduce)
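Both directions are declared in an EMR security configuration. The sketch below follows the EMR security-configuration JSON shape as I understand it; the KMS key ARN and certificate location are hypothetical.

```python
# Sketch of an EMR security configuration covering at-rest and in-transit
# encryption; field names are an assumption based on the EMR
# security-configuration JSON format, values are placeholders.
security_configuration = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            # S3 side (EMRFS): server-side encryption with S3-managed keys
            "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
            # Local disks on the nodes: KMS-based encryption
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE",  # hypothetical
            },
        },
        # In-transit: covers shuffle traffic for Tez / Spark / MapReduce
        "EnableInTransitEncryption": True,
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://my-example-bucket/certs.zip",  # hypothetical
            },
        },
    },
}
```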
Job
- Workflow that represents a program executed by EMR
- Consists of a series of steps
- Step types
- Streaming program
- reads standard input
- runs mapper
- runs reducer
- writes to standard output
- Hive program
- Pig program
- Spark application
- Custom JAR
- Java program
- Bash script
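The streaming step above can be simulated in-process. In a real streaming step the mapper reads lines from standard input and emits `word\t1` pairs to standard output, Hadoop sorts the pairs, and the reducer sums counts per key; here the three stages are wired together locally as a minimal word-count sketch.

```python
import itertools

def mapper(lines):
    # Emit (word, 1) for every word, like a streaming mapper writing
    # "word\t1" to stdout.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(sorted_pairs):
    # Hadoop delivers pairs sorted by key; sum counts per word.
    for word, group in itertools.groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

lines = ["to be or not to be"]
counts = dict(reducer(sorted(mapper(lines))))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```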
Cluster lifecycle (Job flow)
- STARTING: AWS provisions clusters, installs Hadoop
- BOOTSTRAPPING: install additional apps
- RUNNING: runs all the steps
- After steps are completed
- WAITING: if long running persistent cluster
- SHUTTING_DOWN: if transient cluster, or when manually terminated
- TERMINATED (all steps report COMPLETED)
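The lifecycle above can be written down as a small transition table. This is an illustrative sketch of the states listed in these notes, not the EMR API's exact state machine.

```python
# Cluster lifecycle from the notes as a transition table (illustrative).
transitions = {
    "STARTING": ["BOOTSTRAPPING"],
    "BOOTSTRAPPING": ["RUNNING"],
    "RUNNING": ["WAITING", "SHUTTING_DOWN"],  # WAITING only for long-running clusters
    "WAITING": ["RUNNING", "SHUTTING_DOWN"],  # new steps arrive, or manual termination
    "SHUTTING_DOWN": ["TERMINATED"],
    "TERMINATED": [],
}

# A transient (step-execution) cluster skips WAITING entirely:
path = ["STARTING", "BOOTSTRAPPING", "RUNNING", "SHUTTING_DOWN", "TERMINATED"]
assert all(b in transitions[a] for a, b in zip(path, path[1:]))
```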
Autoscaling
- Based on AWS Application Auto Scaling
- Adds/removes cluster nodes
- For YARN: pending containers is a good metric to watch
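A scaling rule keyed on pending YARN containers might look like the sketch below. Field names follow the EMR automatic-scaling policy JSON shape as I understand it; the capacities, threshold, and cooldown are illustrative assumptions.

```python
# Sketch of an EMR automatic-scaling policy watching pending YARN
# containers; shape and thresholds are assumptions, not a tested config.
scaling_policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
    "Rules": [{
        "Name": "ScaleOutOnPendingContainers",
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": 2,  # add two task nodes per trigger
                "CoolDown": 300,
            },
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "MetricName": "ContainerPending",  # YARN containers waiting for resources
                "Namespace": "AWS/ElasticMapReduce",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 75.0,
                "Period": 300,
                "Statistic": "AVERAGE",
            },
        },
    }],
}
```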
Cost
- EMR service
- S3
- Data access
- Logs
- EC2
- Spot Instances
- Hadoop is already interruptible so a good fit
- Do not use spot for master node
- Types
- Traditional "Spot" (maximum bid price)
- Instance fleet
- Similar to "Spot Fleets" but not identical
- Heterogeneous fleets
- Spot blocks
- Specify duration (max 6h)
- Instance will not be interrupted
- Provisioning timeout
- When AWS can't get you spot
- Terminate ("I can wait")
- Switch to On-Demand ("I can't wait")
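An instance fleet combining the ideas above (heterogeneous spot capacity plus a provisioning timeout) might be sketched like this. The shape follows the EMR InstanceFleet API as I understand it; types, weights, and the timeout are illustrative.

```python
# Sketch of a task instance fleet with the provisioning-timeout behaviour
# described above ("I can't wait" -> SWITCH_TO_ON_DEMAND). Field names are
# an assumption based on the EMR InstanceFleet API.
task_fleet = {
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 8,
    "TargetOnDemandCapacity": 0,
    # Heterogeneous fleet: EMR fills capacity from whichever types it can get
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
        {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
    ],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "TimeoutDurationMinutes": 20,
            "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # or TERMINATE_CLUSTER ("I can wait")
        },
    },
}
```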
Storage
- Hadoop HDFS
- native filesystem (also used for HBase)
- cannot decouple storage from compute
- ephemeral (lost when cluster is terminated)
- useful for caching intermediate results
- replicates data between nodes
- Node Types
- DataNode - stores file blocks (64 MB; 128 MB default in Hadoop 2+)
- NameNode - master for DataNode
- Tracks which file block is on which DataNode
- EMRFS
- S3 single source of truth (data lake)
- HDFS-compatible filesystem implementation backed by S3
- Multiple clusters can work on the same data
- Consistent View
- list, read-after-write
- DynamoDB based index tracks S3 changes
- Very fast index
- Combination
- S3 as input/output
- HDFS intermediate results
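A quick back-of-envelope for the HDFS figures above: files are split into fixed-size blocks and each block is replicated across DataNodes. The replication factor and file size below are illustrative assumptions.

```python
import math

block_mb = 64     # block size from the notes (newer Hadoop defaults to 128 MB)
replication = 3   # common HDFS replication factor (assumption)
file_mb = 1000    # a hypothetical ~1 GB input file

# Number of blocks, and raw cluster storage after replication
blocks = math.ceil(file_mb / block_mb)
raw_storage_mb = blocks * block_mb * replication
print(blocks, raw_storage_mb)  # 16 blocks, 3072 MB of raw storage
```

This is one reason the combination pattern above works well: keeping only intermediate results in HDFS limits how much replicated local storage the core nodes need.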
Hadoop YARN
- "Yet Another Resource Negotiator"
- Component for managing resources
- nodes
- allocating tasks
- It can be used to run applications not related to Hadoop MapReduce, e.g.
- Apache Tez
- Apache Spark
Tools
- Hue - UI for Hadoop
- Hive
- Suitable for batch processing
- We can run 6 Hive queries on a good day
- Executes step serially (one-by-one)
- All the mappers run
- When done, all the reducers can run
- All the mappers on the next step, etc.
- All the intermediate steps are stored on disks
- Designed for fault-tolerance so it can recover when something goes wrong
- Uses SQL syntax to generate map reduce jobs
- Generates code for MapReduce jobs
- Schedules them with the execution engine
- Quite slow
- extensible with Java
- complex user defined types
- can access DynamoDB, S3
- can process very large amounts of data
- Impala
- SQL-like language
- In-memory
- Uses hive metadata
- Bypasses Hadoop MapReduce
- Only works with HDFS
- Facebook Presto
- In-memory
- Can work with Hive tables
- Very low latency
- Query directly against S3
- Bypasses MapReduce
- Alternative
- Athena