Overview
- Managed deployment of Big Data solutions (Hadoop ecosystem, Spark, Presto, etc.)
- EMR used to stand for Elastic MapReduce, but covers much more nowadays
- Hadoop
- Moniker for all Open Source big data projects (ecosystem)
- Extract: Sqoop (e.g. accessing data from MySQL), MapReduce API
- Transform & Load: Spark, Cascading, Pig, MR
- Data Warehouse (file formats): Parquet, ORC, Seq, Text
- Report Generation: Hive, Spark, Cascading, Pig
- Ad hoc analysis: Presto, Hive, Spark-SQL, Lingual, Impala
- Distributed storage and compute
- AWS runs customized version of Hadoop - new release every month
- EMR manages cluster
- Deploying software bits
- Managing nodes lifecycle
- Uses Amazon Linux (A-Linux)
Cluster
- Collection of nodes
- Master node - management, coordination of slaves
- Not much processing power required
- Do not use spot instances
- Slave nodes
- Core nodes - run tasks and store data
- Processing Power + Storage
- Do not scale down gracefully (requires data reshuffle), hence Task nodes
- Task nodes (optional) - run tasks
- Processing Power (no storage)
- Good fit for Spot instances
- Failed slaves are not automatically replaced
- Use cases
- Job flow engine (launch mode: "step execution")
- Long running cluster (launch mode: "cluster")
- e.g. for Facebook Presto
- can use (blue/green) deployment for new cluster
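The node layout above can be sketched as a cluster request. This is a minimal sketch of the dict shape accepted by boto3's `run_job_flow`; the cluster name, instance types, counts, and log bucket are hypothetical placeholders, not prescriptions.

```python
# Sketch of an EMR cluster request mirroring the node layout above.
# Names (cluster name, log bucket) are hypothetical; the dict follows the
# shape expected by boto3's emr.run_job_flow().
cluster_request = {
    "Name": "example-long-running-cluster",        # hypothetical
    "ReleaseLabel": "emr-5.36.0",
    "LogUri": "s3://my-example-bucket/emr-logs/",  # hypothetical bucket
    "Instances": {
        "InstanceGroups": [
            # Master: coordination only, modest size, never spot
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            # Core: compute + HDFS storage, on-demand (scale-down needs reshuffle)
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2,
             "Market": "ON_DEMAND"},
            # Task: compute only, safe to run on spot
            {"Name": "Task", "InstanceRole": "TASK",
             "InstanceType": "m5.xlarge", "InstanceCount": 4,
             "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running ("cluster" launch mode)
    },
}

# Only the TASK group uses the spot market:
spot_roles = [g["InstanceRole"]
              for g in cluster_request["Instances"]["InstanceGroups"]
              if g["Market"] == "SPOT"]
print(spot_roles)  # ['TASK']
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` instead would give the "step execution" behaviour: the cluster terminates once all steps finish.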
Security
- Security Groups
- Master - ingress
- SSH
- Various IP ranges belonging to AWS
- Slave
- Ingress from Master (intra-cluster traffic)
- IAM Roles
- EMR Role - EMR service role to access AWS on your behalf (e.g. launching instance)
- EC2 Instance profile - associated with running EMR nodes (i.e. what can be accessed by EC2 instance)
- Auto Scaling role - allows Autoscaling interact with EMR
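For the EC2 instance profile above, the role must be assumable by the EC2 service. This is a hypothetical trust policy sketch; the service role would use `elasticmapreduce.amazonaws.com` as the principal instead.

```python
import json

# Hypothetical trust policy for the EC2 instance profile role:
# it lets the EMR nodes (EC2 instances) assume the role.
ec2_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
print(json.dumps(ec2_trust_policy, indent=2))
```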
- Encryption
- At-rest
- S3 (EMRFS)
- Local disk
- In-Transit
- Shuffle traffic between nodes (Tez, Spark, MapReduce)
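Both directions are declared in an EMR security configuration. The sketch below follows the EMR security-configuration JSON shape as I understand it; the KMS key ARN and certificate location are hypothetical.

```python
# Sketch of an EMR security configuration covering at-rest and in-transit
# encryption; field names are an assumption based on the EMR
# security-configuration JSON format, values are placeholders.
security_configuration = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            # S3 side (EMRFS): server-side encryption with S3-managed keys
            "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
            # Local disks on the nodes: KMS-based encryption
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE",  # hypothetical
            },
        },
        # In-transit: covers shuffle traffic for Tez / Spark / MapReduce
        "EnableInTransitEncryption": True,
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://my-example-bucket/certs.zip",  # hypothetical
            },
        },
    },
}
```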
Job
- Workflow that represents a program executed by EMR
- Consists of a series of steps
- Step types
- Streaming program
- reads standard input
- runs mapper
- runs reducer
- writes to standard output
- Hive program
- Pig program
- Spark application
- Custom JAR
- Java program
- Bash script
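The streaming step above can be simulated in-process. In a real streaming step the mapper reads lines from standard input and emits `word\t1` pairs to standard output, Hadoop sorts the pairs, and the reducer sums counts per key; here the three stages are wired together locally as a minimal word-count sketch.

```python
import itertools

def mapper(lines):
    # Emit (word, 1) for every word, like a streaming mapper writing
    # "word\t1" to stdout.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(sorted_pairs):
    # Hadoop delivers pairs sorted by key; sum counts per word.
    for word, group in itertools.groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

lines = ["to be or not to be"]
counts = dict(reducer(sorted(mapper(lines))))
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```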
Cluster lifecycle (Job flow)
- STARTING: AWS provisions clusters, installs Hadoop
- BOOTSTRAPPING: install additional apps
- RUNNING: runs all the steps
- After steps are completed
- WAITING: if long running persistent cluster
- SHUTTING_DOWN: if transient cluster, or when manually terminated
- TERMINATED (all steps report COMPLETED)
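The lifecycle above can be written down as a small transition table. This is an illustrative sketch of the states listed in these notes, not the EMR API's exact state machine.

```python
# Cluster lifecycle from the notes as a transition table (illustrative).
transitions = {
    "STARTING": ["BOOTSTRAPPING"],
    "BOOTSTRAPPING": ["RUNNING"],
    "RUNNING": ["WAITING", "SHUTTING_DOWN"],  # WAITING only for long-running clusters
    "WAITING": ["RUNNING", "SHUTTING_DOWN"],  # new steps arrive, or manual termination
    "SHUTTING_DOWN": ["TERMINATED"],
    "TERMINATED": [],
}

# A transient (step-execution) cluster skips WAITING entirely:
path = ["STARTING", "BOOTSTRAPPING", "RUNNING", "SHUTTING_DOWN", "TERMINATED"]
assert all(b in transitions[a] for a, b in zip(path, path[1:]))
```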
Autoscaling
- Based on AWS Application Auto Scaling
- Adds/removes cluster nodes
- For YARN: pending containers is a good metric to watch
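A scaling rule keyed on pending YARN containers might look like the sketch below. Field names follow the EMR automatic-scaling policy JSON shape as I understand it; the capacities, threshold, and cooldown are illustrative assumptions.

```python
# Sketch of an EMR automatic-scaling policy watching pending YARN
# containers; shape and thresholds are assumptions, not a tested config.
scaling_policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
    "Rules": [{
        "Name": "ScaleOutOnPendingContainers",
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": 2,  # add two task nodes per trigger
                "CoolDown": 300,
            },
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "MetricName": "ContainerPending",  # YARN containers waiting for resources
                "Namespace": "AWS/ElasticMapReduce",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 75.0,
                "Period": 300,
                "Statistic": "AVERAGE",
            },
        },
    }],
}
```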
Cost
- EMR service
- S3
- Data access
- Logs
- EC2
- Spot Instances
- Hadoop is already interruptible so a good fit
- Do not use spot for master node
- Types
- Traditional "Spot" (maximum bid price)
- Instance fleet
- Similar to "Spot Fleets" but not identical
- Heterogeneous fleets
- Spot blocks
- Specify duration (max 6h)
- Instance will not be interrupted
- Provisioning timeout
- When AWS can't get you spot
- Terminate ("I can wait")
- Switch to On-Demand ("I can't wait")
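An instance fleet combining the ideas above (heterogeneous spot capacity plus a provisioning timeout) might be sketched like this. The shape follows the EMR InstanceFleet API as I understand it; types, weights, and the timeout are illustrative.

```python
# Sketch of a task instance fleet with the provisioning-timeout behaviour
# described above ("I can't wait" -> SWITCH_TO_ON_DEMAND). Field names are
# an assumption based on the EMR InstanceFleet API.
task_fleet = {
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 8,
    "TargetOnDemandCapacity": 0,
    # Heterogeneous fleet: EMR fills capacity from whichever types it can get
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
        {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
    ],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "TimeoutDurationMinutes": 20,
            "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # or TERMINATE_CLUSTER ("I can wait")
        },
    },
}
```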
Storage
- Hadoop HDFS
- native filesystem (also used for HBase)
- cannot decouple storage from compute
- ephemeral (lost when cluster is terminated)
- useful for caching intermediate results
- replicates data between nodes
- Node Types
- DataNode - stores file blocks (64 MB; 128 MB default in Hadoop 2+)
- NameNode - master for DataNode
- Tracks which file block is on which DataNode
- EMRFS
- S3 single source of truth (data lake)
- HDFS-compatible filesystem implementation backed by S3
- Multiple clusters can work on the same data
- Consistent View
- list, read-after-write
- DynamoDB based index tracks S3 changes
- Very fast index
- Combination
- S3 as input/output
- HDFS intermediate results
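A quick back-of-envelope for the HDFS figures above: files are split into fixed-size blocks and each block is replicated across DataNodes. The replication factor and file size below are illustrative assumptions.

```python
import math

block_mb = 64     # block size from the notes (newer Hadoop defaults to 128 MB)
replication = 3   # common HDFS replication factor (assumption)
file_mb = 1000    # a hypothetical ~1 GB input file

# Number of blocks, and raw cluster storage after replication
blocks = math.ceil(file_mb / block_mb)
raw_storage_mb = blocks * block_mb * replication
print(blocks, raw_storage_mb)  # 16 blocks, 3072 MB of raw storage
```

This is one reason the combination pattern above works well: keeping only intermediate results in HDFS limits how much replicated local storage the core nodes need.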
Hadoop YARN
- "Yet Another Resource Negotiator"
- Component for managing resources
- nodes
- allocating tasks
- It can be used to run applications not related to Hadoop MapReduce, e.g.
- Apache Tez
- Apache Spark
Tools
- Hue - UI for Hadoop
- Hive
- Suitable for batch processing
- We can run 6 Hive queries on a good day
- Executes step serially (one-by-one)
- All the mappers run
- When done, all the reducers can run
- All the mappers on the next step, etc.
- All the intermediate steps are stored on disks
- Designed for fault-tolerance so it can recover when something goes wrong
- Uses SQL syntax to generate map reduce jobs
- Generates code for MapReduce jobs
- Schedules them with the execution engine
- Quite slow
- extensible with Java
- complex user defined types
- can access DynamoDB, S3
- can process very large amounts of data
- Impala
- SQL-like language
- In-memory
- Uses hive metadata
- Bypasses Hadoop MapReduce
- Only works with HDFS
- Facebook Presto
- In-memory
- Can work with Hive tables
- Very low latency
- Query directly against S3
- Bypasses MapReduce
- Alternative
- Athena