Thursday, 8 March 2018

AWS EMR

Overview
  • Managed deployment of Big Data solutions (Hadoop ecosystem, Spark, Presto etc.)
    • Used to stand for Elastic MapReduce, but it covers much more nowadays
  • Hadoop
    • Moniker for the whole ecosystem of open-source big data projects
      • Extract: Sqoop (e.g. access data from MySQL), MapReduce API
      • Transform & Load: Spark, Cascading, Pig, MR
      • Data Warehouse (file formats): Parquet, ORC, Seq, Text
      • Report Generation: Hive, Spark, Cascading, Pig
      • Ad hoc analysis: Presto, Hive, Spark-SQL, Lingual, Impala
    • Distributed storage and compute
    • AWS runs a customized version of Hadoop - new release every month
  • EMR manages cluster
    • Deploying software bits
    • Managing nodes lifecycle 
    • Uses Amazon Linux (A-Linux)

Cluster
  • Collection of nodes
  • Master node - management, coordination of slaves
    • Not much processing power required
    • Do not use spot instances
  • Slave nodes
    • Core nodes - run tasks and store data
      • Processing Power + Storage
      • Do not scale down gracefully (requires data reshuffling), hence Task nodes
    • Task nodes (optional) - run tasks
      • Processing Power (no storage)
      • Good fit for Spot instances
    • Failed slaves are not automatically replaced
  • Use cases
    • Job flow engine (launch mode: "step execution")
    • Long running cluster (launch mode: "cluster")
      • e.g. for Facebook Presto
      • can use blue/green deployment to roll out a new cluster
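A sketch of how the node types above map to instance groups in a boto3 `run_job_flow` request (the instance types, counts, and bid price are illustrative assumptions, not recommendations):

```python
# Illustrative instance-group layout in the shape boto3's
# emr.run_job_flow(Instances={"InstanceGroups": ...}) expects.
# Built locally only; nothing is submitted to AWS here.
instance_groups = [
    {
        "Name": "Master",
        "InstanceRole": "MASTER",
        "InstanceType": "m5.xlarge",   # little processing power needed
        "InstanceCount": 1,
        "Market": "ON_DEMAND",         # never Spot: losing the master kills the cluster
    },
    {
        "Name": "Core",
        "InstanceRole": "CORE",
        "InstanceType": "m5.2xlarge",
        "InstanceCount": 2,
        "Market": "ON_DEMAND",         # holds HDFS blocks, scales down poorly
    },
    {
        "Name": "Task",
        "InstanceRole": "TASK",
        "InstanceType": "m5.2xlarge",
        "InstanceCount": 4,
        "Market": "SPOT",              # no storage, safe to interrupt
        "BidPrice": "0.10",            # hypothetical max bid (USD/hour)
    },
]

# Only the Task group runs on Spot.
spot_roles = {g["InstanceRole"] for g in instance_groups if g["Market"] == "SPOT"}
```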

Security
  • Security Groups
    • Master - ingress
      • SSH
      • Various IP ranges belonging to AWS
    • Slave
  • IAM Roles
    • EMR Role - EMR service role to access AWS on your behalf (e.g. launching instance)
    • EC2 Instance profile - associated with running EMR nodes (i.e. what can be accessed by EC2 instance)
    • Auto Scaling role - allows Auto Scaling to interact with EMR
  • Encryption
    • Tez, Spark, MapReduce
    • At-rest
      • S3
      • LocalDisk
    • In-Transit
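The three IAM roles correspond to parameters of boto3's `run_job_flow`. A minimal sketch, assuming the default role names created by the EMR console (yours may differ):

```python
# The three EMR roles as run_job_flow parameters.
# Role names are the EMR console defaults; treat them as assumptions.
role_params = {
    "ServiceRole": "EMR_DefaultRole",                   # EMR acting on your behalf
    "JobFlowRole": "EMR_EC2_DefaultRole",               # EC2 instance profile for nodes
    "AutoScalingRole": "EMR_AutoScaling_DefaultRole",   # lets Auto Scaling resize cluster
}
```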



Job
  • A workflow that represents a program executed by EMR
  • Consists of a series of steps
    • Step types
      • Streaming program
        • reads standard input
        • runs the mapper
        • runs the reducer
        • writes to standard output
      • Hive program
      • Pig program
      • Spark application
      • Custom JAR
        • Java program
        • Bash script
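The step types above take the same shape when submitted via boto3's `add_job_flow_steps`. A sketch of a streaming step and a custom JAR step (all S3 paths and script names are hypothetical placeholders):

```python
# Two illustrative EMR steps in the shape add_job_flow_steps(Steps=...) expects.
# S3 paths are placeholders; nothing is submitted here.
steps = [
    {
        "Name": "Word count (streaming)",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-mapper", "s3://my-bucket/mapper.py",    # reads stdin
                "-reducer", "s3://my-bucket/reducer.py",  # writes stdout
                "-input", "s3://my-bucket/input/",
                "-output", "s3://my-bucket/output/",
            ],
        },
    },
    {
        "Name": "Custom JAR",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://my-bucket/my-job.jar",  # Java program packaged as a JAR
            "Args": ["arg1"],
        },
    },
]
```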


Cluster lifecycle (Job flow)
  • STARTING: AWS provisions the cluster, installs Hadoop
  • BOOTSTRAPPING: install additional apps
  • RUNNING: runs all the steps
  • After steps are completed
    • WAITING: if long running persistent cluster
      • SHUTTING_DOWN: manually terminated
        • TERMINATED
    • SHUTTING_DOWN
      • COMPLETED
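A toy model of the transitions above, useful for sanity-checking the two terminal paths (manual termination of a long-running cluster vs. all steps completing):

```python
# Minimal model of the EMR cluster lifecycle: which states may follow which.
TRANSITIONS = {
    "STARTING": ["BOOTSTRAPPING"],
    "BOOTSTRAPPING": ["RUNNING"],
    "RUNNING": ["WAITING", "SHUTTING_DOWN"],  # WAITING only for long-running clusters
    "WAITING": ["RUNNING", "SHUTTING_DOWN"],  # more steps arrive, or manual termination
    "SHUTTING_DOWN": ["COMPLETED", "TERMINATED"],
}

def is_valid_path(path):
    """True if every consecutive pair of states is an allowed transition."""
    return all(b in TRANSITIONS.get(a, []) for a, b in zip(path, path[1:]))
```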

Autoscaling
  • Based on AWS Application Auto Scaling
  • Adds/removes cluster nodes
  • For YARN: pending containers is a good metric to watch
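A sketch of an EMR automatic-scaling policy keyed on YARN's pending containers (the `ContainerPending` CloudWatch metric); the capacities, threshold, and cooldown are illustrative assumptions:

```python
# Illustrative EMR automatic-scaling policy (the shape EMR's
# AutoScalingPolicy parameter expects). Values are assumptions.
autoscaling_policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
    "Rules": [
        {
            "Name": "ScaleOutOnPendingContainers",
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": 2,   # add 2 nodes per trigger
                    "CoolDown": 300,          # seconds between scaling actions
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "MetricName": "ContainerPending",  # YARN containers waiting for capacity
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 10.0,
                    "Period": 300,
                }
            },
        }
    ],
}
```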

Cost
  • EMR service
  • S3
    • Data access
    • Logs
  • EC2
    • Spot Instances
      • Hadoop already tolerates node loss, so Spot is a good fit
      • Do not use spot for master node
      • Types
        • Traditional "Spot" (maximum bid price)
        • Instance fleet
          • Similar to "Spot Fleets" but not identical
          • Heterogeneous fleets
          • Spot blocks
            • Specify duration (max 6h)
            • Instance will not be interrupted
          • Provisioning timeout
            • When AWS can't get you spot
              • Terminate ("I can wait")
              • Switch to On-Demand ("I can't wait")
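The Spot-block duration and the provisioning timeout above live together in the instance fleet's Spot specification. A sketch with illustrative values (note the 6h maximum is 360 minutes):

```python
# Illustrative Spot settings for an EMR instance fleet
# (the LaunchSpecifications part of an InstanceFleet config).
spot_spec = {
    "SpotSpecification": {
        "BlockDurationMinutes": 360,             # Spot block: no interruption for 6h (max)
        "TimeoutDurationMinutes": 20,            # how long to wait for Spot capacity
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # "I can't wait"; or TERMINATE_CLUSTER
    }
}
```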


Storage
  • Hadoop HDFS
    • native filesystem (also used by HBase)
    • cannot decouple storage from compute
    • ephemeral (lost when the cluster is terminated)
    • useful for caching intermediate results
    • replicates data between nodes
    • Node Types
      • DataNode - stores files' blocks (128 MB default in Hadoop 2; 64 MB historically)
      • NameNode - master for DataNode
        • Tracks which file block is on which DataNode
  • EMRFS
    • S3 single source of truth (data lake)
    • HDFS-compatible filesystem implementation backed by S3
    • Multiple clusters can work on the same data
    • Consistent View 
      • list, read-after-write
    • DynamoDB-based index tracks S3 changes
      • Very fast index
  • Combination
    • S3 as input/output
    • HDFS intermediate results
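The block size and replication factor together determine what a file costs on HDFS. A small back-of-the-envelope helper (defaults assume Hadoop 2: 128 MB blocks, 3x replication):

```python
import math

def hdfs_usage(file_size_mb, block_size_mb=128, replication=3):
    """Blocks and raw disk a file consumes on HDFS.

    Defaults assume Hadoop 2 (128 MB blocks, 3x replication);
    older Hadoop versions defaulted to 64 MB blocks.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_mb = file_size_mb * replication  # every block is stored `replication` times
    return blocks, raw_mb

# A 1000 MB file: ceil(1000/128) = 8 blocks, 3000 MB of raw disk cluster-wide.
blocks, raw_mb = hdfs_usage(1000)
```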

Hadoop YARN
  • "Yet Another Resource Negotiator"
  • Component for managing resources
    • nodes
    • allocating tasks
  • It can be used to run applications not related to Hadoop MapReduce, e.g.
    • Apache Tez 
    • Apache Spark

Tools
  • Hue - UI for Hadoop
  • Hive
    • Suitable for batch processing
      • We can run 6 Hive queries on a good day
    • Executes stages serially (one-by-one)
      • All the mappers run
      • When done, all the reducers can run
      • Then all the mappers of the next stage, and so on
      • All the intermediate results are stored on disk
        • Designed for fault-tolerance so it can recover when something goes wrong
    • Uses SQL syntax to generate map reduce jobs
    • Code generation
    • Schedule with the engine
    • Quite slow
    • extensible with Java
    • complex user defined types
    • can access DynamoDB, S3
    • can process very large amounts of data
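A toy illustration (not Hive itself) of the stage-by-stage execution described above: every stage materialises its full output to "disk" before the next one starts, which buys fault tolerance at the cost of latency — the contrast with the in-memory engines below:

```python
# Toy model of serial stage execution with persisted intermediate results.
disk = {}  # stands in for HDFS between stages

def run_stage(name, inputs, fn):
    """Run one stage to completion and persist its output before returning."""
    disk[name] = fn(inputs)  # all mappers/reducers finish, result hits "disk"
    return disk[name]

words = ["emr", "hive", "emr", "spark", "emr"]
# Stage 1: count words; stage 2 cannot start until stage 1 is fully written out.
counts = run_stage("stage1", words, lambda ws: {w: ws.count(w) for w in set(ws)})
# Stage 2: keep only words seen more than once.
frequent = run_stage("stage2", counts, lambda c: {w: n for w, n in c.items() if n > 1})
```

A crash between stages can resume from the persisted `stage1` output — the fault-tolerance property the notes mention, and the reason in-memory engines like Impala or Presto trade it away for speed.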
  • Impala
    • SQL-like language
    • In-memory
    • Uses Hive metadata (metastore)
    • Bypasses Hadoop MapReduce
    • Only works with HDFS
  • Facebook Presto
    • In-memory 
    • Can work with Hive tables
    • Very low latency
    • Query directly against S3
    • Bypasses MapReduce
    • Alternative 
      • Athena
