Model
- EMR is more than "map reduce"
- Hadoop
- Moniker for all Open Source big data projectes (ecosystem)
- Extract: Sqoop, MapReduce API
- Transform & Load: Spark, Cascading, Pig, MR
- Data Warehouse (file formats): Parquet, ORC, Seq, Text
- Report Generation: Hive, Spark, Cascading, Pig
- Ad hoc analysis: Presto, Hive, Spark-SQL, Lingual, Impala
- Distributed storage and compute
- EMR manages Hadoop cluster
- Deploying software bits
- Managing nodes lifecycle
- AWS runs customized version of Hadoop - new release every month
- Uses Amazon Linux (A-Linux)
- EMR also supports non-Hadoop distribution MapR
- no-NameNode architecture
- can tolerate multiple failures with automatic failover/failback
Cluster
- Collection of nodes
- Master node - management, coordination of slaves
- Not much processing power required
- Do not use spot instances
- Slave nodes
- Core nodes - run tasks and store data
- Processing Power + Storage
- Task nodes (optional) - run tasks
- Processing Power (no storage)
- Failed slaves are not automatically replaced
- Use cases
- Job flow engine (i.e. schedule jobs)
- Long running cluster (shared EMR cluster that stays up)
- e.g. for Facebook Presto
- can use (blue/green) deployment for new cluster
Security
- Security Groups
- Master - ingress
- SSH
- Various IP ranges belonging to AWS
- Slave
- IAM Roles
- EMR Role - EMR service access on your behalf (i.e. running nodes)
- EC2 Instance profile - associated with running EMR nodes (i.e. what can be accessed by EC2 instance)
- Auto Scaling role - allows Autoscaling interact with EMR
Job
- Workflow that represents program executed by EMR
- Consists of series of steps
- Step types
- Streaming program
- reads standard input
- runs mapper,
- run reducer
- writes to standard output
- Hive program
- Pig program
- Spark application
- Custom JAR
Processing Data
- Submit jobs directly to installed app (e.g. Hive, Pig)
- SSH to master
- Access tools
- Running steps
Cluster lifecycle (Job flow)
- STARTING: AWS provisions clusters, installs Hadoop
- BOOTSTRAPPING: install additional apps
- RUNNING: runs all the steps
- After steps are completed
- WAITING: if long running persistent cluster
- SHUTTING_DOWN: manually terminated
- SHUTTING_DOWN
Cost
- EMR
- S3
- EC2
- Spot Instances
- Hadoop is already interruptible so a good fit
- Do not use spot for master node
Storage
- Hadoop HDFS
- native filesystem (also used for HBase)
- cannot decouple storage from compute
- ehpemeral (lost when cluster terminated)
- useful for caching intermediate results
- replicates data between nodes
- Node Types
- DataNode - stores files' blocks (64MB)
- NameNode - master for DataNode (tracks which block is where)
- EMRFS
- S3 single source of truth: data lake
- Multiple clusters can work on the same data
- Consistent View
- DynamoDB based index
- Very fast index
- Copy
- s3distcp - efficient, parallel copy of data S3 <-> EMR cluster
- Combination
- S3 as input/output
- HDFS intermediate results
Hadoop YARN
- "Yet Another Resource Negotiator"
- Component for managing resources
- It can be used to run application not related to Hadoop MapReduce, e.g.
Tools
- Hue - UI for Hadoop
- Hive
- Uses SQL syntax to generate map reduce jobs
- Code generation
- Schedule with the engine
- Quite slow
- extensible with Java
- complex user defined types
- can access DynamoDB, S3
- can process very large amounts of data
- Impala
- SQL-like language
- In-memory
- Uses hive metadata
- Bypasses Hadoop MapReduce
- Only works with HDFS
- Facebook Presto
- In-memory
- Can work with Hive tables
- Very low latency
- Query directly against S3
- Bypasses MapReducec
References