Wednesday, 27 September 2017

AWS EMR

Model
  • EMR is more than "map reduce"
  • Hadoop
    • Umbrella term for the ecosystem of open-source big data projects
      • Extract: Sqoop, MapReduce API
      • Transform & Load: Spark, Cascading, Pig, MR
      • Data Warehouse (file formats): Parquet, ORC, Seq, Text
      • Report Generation: Hive, Spark, Cascading, Pig
      • Ad hoc analysis: Presto, Hive, Spark-SQL, Lingual, Impala
    • Distributed storage and compute
  • EMR manages Hadoop cluster
    • Deploying software bits
    • Managing the node lifecycle
    • AWS runs customized version of Hadoop - new release every month
    • Uses Amazon Linux (A-Linux)
  • EMR also supports the MapR distribution of Hadoop
    • NameNode-less architecture
    • can tolerate multiple failures with automatic failover/failback

Cluster
  • Collection of nodes
  • Master node - management, coordination of slaves
    • Not much processing power required
    • Do not use spot instances
  • Slave nodes
    • Core nodes - run tasks and store data
      • Processing Power + Storage
    • Task nodes (optional) - run tasks
      • Processing Power (no storage)
    • Failed slaves are not automatically replaced
  • Use cases
    • Job flow engine (i.e. schedule jobs)
    • Long running cluster (shared EMR cluster that stays up)
      • e.g. for Facebook Presto
      • can use blue/green deployment to roll out a new cluster
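
The master/core/task split above can be sketched as an instance-group layout in the shape boto3's `run_job_flow` expects for `InstanceGroups`; instance types, counts, and bid price here are made-up examples, not recommendations:

```python
# Hypothetical EMR instance-group layout matching the notes above:
# on-demand master (never spot), on-demand core (compute + HDFS),
# spot task nodes (compute only, safe to lose).
instance_groups = [
    {
        "Name": "Master",
        "InstanceRole": "MASTER",
        "InstanceType": "m4.large",   # little processing power required
        "InstanceCount": 1,
        "Market": "ON_DEMAND",        # do not use spot for the master
    },
    {
        "Name": "Core",
        "InstanceRole": "CORE",
        "InstanceType": "m4.xlarge",  # processing power + storage
        "InstanceCount": 2,
        "Market": "ON_DEMAND",
    },
    {
        "Name": "Task",
        "InstanceRole": "TASK",
        "InstanceType": "m4.xlarge",  # processing power, no HDFS storage
        "InstanceCount": 4,
        "Market": "SPOT",             # interruptible work is a good spot fit
        "BidPrice": "0.10",           # hypothetical bid
    },
]
```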

Security
  • Security Groups
    • Master - ingress
      • SSH
      • Various IP ranges belonging to AWS
    • Slave
  • IAM Roles
    • EMR Role - EMR service access on your behalf (i.e. running nodes)
    • EC2 Instance profile - associated with running EMR nodes (i.e. what can be accessed by EC2 instance)
    • Auto Scaling role - allows Auto Scaling to interact with EMR


Job
  • Workflow that represents a program executed by EMR
  • Consists of a series of steps
    • Step types
      • Streaming program
        • reads standard input
        • runs mapper
        • runs reducer
        • writes to standard output
      • Hive program
      • Pig program
      • Spark application
      • Custom JAR
        • Java program
        • Bash script
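
A step is submitted as a small JSON document; on release-label clusters the common pattern is to go through `command-runner.jar`. A minimal sketch (the bucket and script path are hypothetical):

```python
# Hypothetical Spark step in the shape EMR's AddJobFlowSteps API expects.
step = {
    "Name": "Nightly aggregation",
    "ActionOnFailure": "CONTINUE",  # or TERMINATE_CLUSTER / CANCEL_AND_WAIT
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit", "--deploy-mode", "cluster",
            "s3://my-bucket/jobs/aggregate.py",  # hypothetical script location
        ],
    },
}
```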

Processing Data
  • Submit jobs directly to installed app (e.g. Hive, Pig)
    • SSH to master
    • Access tools
  • Running steps

Cluster lifecycle (Job flow)
  • STARTING: AWS provisions clusters, installs Hadoop
  • BOOTSTRAPPING: install additional apps
  • RUNNING: runs all the steps
  • After steps are completed
    • WAITING: if long running persistent cluster
      • SHUTTING_DOWN: manually terminated
        • TERMINATED
    • SHUTTING_DOWN
      • COMPLETED
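
The lifecycle above can be written as a transition table: a persistent cluster goes to WAITING after its steps and shuts down only when terminated manually, while a transient cluster shuts itself down once the steps complete. A small reachability check over that table:

```python
# State transitions from the notes above (simplified; no failure states).
transitions = {
    "STARTING": ["BOOTSTRAPPING"],
    "BOOTSTRAPPING": ["RUNNING"],
    "RUNNING": ["WAITING", "SHUTTING_DOWN"],   # persistent vs transient cluster
    "WAITING": ["RUNNING", "SHUTTING_DOWN"],   # new steps, or manual termination
    "SHUTTING_DOWN": ["TERMINATED", "COMPLETED"],
}

def can_reach(src, dst):
    """Depth-first check that dst is reachable from src."""
    seen, frontier = set(), [src]
    while frontier:
        state = frontier.pop()
        if state == dst:
            return True
        if state in seen:
            continue
        seen.add(state)
        frontier.extend(transitions.get(state, []))
    return False
```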

Cost
  • EMR
  • S3
  • EC2
    • Spot Instances
      • Hadoop tasks can be retried on interruption, so spot is a good fit
      • Do not use spot for master node


Storage
  • Hadoop HDFS
    • native filesystem (also used for HBase)
    • cannot decouple storage from compute 
    • ephemeral (lost when the cluster is terminated)
    • useful for caching intermediate results
    • replicates data between nodes 
    • Node Types
      • DataNode - stores file blocks (64 MB default in Hadoop 1; 128 MB in Hadoop 2)
      • NameNode - master for DataNode (tracks which block is where)
  • EMRFS
    • S3 single source of truth: data lake
    • Multiple clusters can work on the same data
    • Consistent View 
      • DynamoDB based index
      • Very fast index
    • Copy
      • s3distcp - efficient, parallel copy of data S3 <-> EMR cluster
  • Combination
    • S3 as input/output
    • HDFS intermediate results
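
The combination pattern, sketched as s3-dist-cp step arguments: copy the input from S3 into HDFS, keep intermediate results in HDFS, then push the final output back to S3 (bucket and paths are hypothetical):

```python
# Hypothetical s3-dist-cp invocations bracketing a job that reads and
# writes HDFS; these would run as command-runner.jar step Args on EMR.
copy_in = ["s3-dist-cp", "--src", "s3://my-bucket/input/", "--dest", "hdfs:///input/"]
copy_out = ["s3-dist-cp", "--src", "hdfs:///output/", "--dest", "s3://my-bucket/output/"]
```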

Hadoop YARN
  • "Yet Another Resource Negotiator"
  • Component for managing resources
    • nodes
    • allocating tasks
  • It can be used to run applications other than Hadoop MapReduce, e.g.
    • Apache Tez 
    • Apache Spark

Tools
  • Hue - UI for Hadoop
  • Hive
    • Uses SQL syntax to generate map reduce jobs
    • Code generation
    • Schedule with the engine
    • Quite slow
    • extensible with Java
    • complex user defined types
    • can access DynamoDB, S3
    • can process very large amounts of data
  • Impala
    • SQL-like language
    • In-memory
    • Uses the Hive metastore
    • Bypasses Hadoop MapReduce
    • Only works with HDFS
  • Facebook Presto
    • In-memory 
    • Can work with Hive tables
    • Very low latency
    • Query directly against S3
    • Bypasses MapReduce
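
Since Hive compiles SQL into MapReduce jobs, it is typically driven on EMR as a step as well; a sketch using the `hive-script` runner (script location is hypothetical):

```python
# Hypothetical Hive step: Hive compiles the .hql script into MapReduce jobs.
hive_step = {
    "Name": "Hive report",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hive-script", "--run-hive-script",
            "--args", "-f", "s3://my-bucket/queries/report.hql",  # hypothetical
        ],
    },
}
```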