Wednesday, 27 September 2017

AWS EMR

Model
  • EMR is more than "map reduce"
  • Hadoop
    • Umbrella term for the ecosystem of open-source big data projects
      • Extract: Sqoop, MapReduce API
      • Transform & Load: Spark, Cascading, Pig, MR
      • Data Warehouse (file formats): Parquet, ORC, Seq, Text
      • Report Generation: Hive, Spark, Cascading, Pig
      • Ad hoc analysis: Presto, Hive, Spark-SQL, Lingual, Impala
    • Distributed storage and compute
  • EMR manages Hadoop cluster
    • Deploying software bits
    • Managing the node lifecycle
    • AWS runs customized version of Hadoop - new release every month
    • Uses Amazon Linux (A-Linux)
  • EMR also supports the MapR distribution of Hadoop
    • NameNode-less architecture
    • can tolerate multiple failures with automatic failover/failback

Cluster
  • Collection of nodes
  • Master node - management, coordination of slaves
    • Not much processing power required
    • Do not use spot instances
  • Slave nodes
    • Core nodes - run tasks and store data
      • Processing Power + Storage
    • Task nodes (optional) - run tasks
      • Processing Power (no storage)
    • Failed slaves are not automatically replaced
  • Use cases
    • Job flow engine (i.e. schedule jobs)
    • Long running cluster (shared EMR cluster that stays up)
      • e.g. for Facebook Presto
      • can use blue/green deployment to roll out a new cluster
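
The master/core/task split above can be sketched as an instance-group layout in the shape boto3's `run_job_flow` expects for `InstanceGroups`; instance types, counts, and bid price here are made-up examples, not recommendations:

```python
# Hypothetical EMR instance-group layout matching the notes above:
# on-demand master (never spot), on-demand core (compute + HDFS),
# spot task nodes (compute only, safe to lose).
instance_groups = [
    {
        "Name": "Master",
        "InstanceRole": "MASTER",
        "InstanceType": "m4.large",   # little processing power required
        "InstanceCount": 1,
        "Market": "ON_DEMAND",        # do not use spot for the master
    },
    {
        "Name": "Core",
        "InstanceRole": "CORE",
        "InstanceType": "m4.xlarge",  # processing power + storage
        "InstanceCount": 2,
        "Market": "ON_DEMAND",
    },
    {
        "Name": "Task",
        "InstanceRole": "TASK",
        "InstanceType": "m4.xlarge",  # processing power, no HDFS storage
        "InstanceCount": 4,
        "Market": "SPOT",             # interruptible work is a good spot fit
        "BidPrice": "0.10",           # hypothetical bid
    },
]
```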

Security
  • Security Groups
    • Master - ingress
      • SSH
      • Various IP ranges belonging to AWS
    • Slave
  • IAM Roles
    • EMR Role - EMR service access on your behalf (i.e. running nodes)
    • EC2 Instance profile - associated with running EMR nodes (i.e. what can be accessed by EC2 instance)
    • Auto Scaling role - allows Auto Scaling to interact with EMR


Job
  • Workflow that represents a program executed by EMR
  • Consists of a series of steps
    • Step types
      • Streaming program
        • reads standard input
        • runs mapper
        • runs reducer
        • writes to standard output
      • Hive program
      • Pig program
      • Spark application
      • Custom JAR
        • Java program
        • Bash script
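
A step is submitted as a small JSON document; on release-label clusters the common pattern is to go through `command-runner.jar`. A minimal sketch (the bucket and script path are hypothetical):

```python
# Hypothetical Spark step in the shape EMR's AddJobFlowSteps API expects.
step = {
    "Name": "Nightly aggregation",
    "ActionOnFailure": "CONTINUE",  # or TERMINATE_CLUSTER / CANCEL_AND_WAIT
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit", "--deploy-mode", "cluster",
            "s3://my-bucket/jobs/aggregate.py",  # hypothetical script location
        ],
    },
}
```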

Processing Data
  • Submit jobs directly to installed app (e.g. Hive, Pig)
    • SSH to master
    • Access tools
  • Running steps

Cluster lifecycle (Job flow)
  • STARTING: AWS provisions clusters, installs Hadoop
  • BOOTSTRAPPING: install additional apps
  • RUNNING: runs all the steps
  • After steps are completed
    • WAITING: if long running persistent cluster
      • SHUTTING_DOWN: manually terminated
        • TERMINATED
    • SHUTTING_DOWN
      • COMPLETED
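
The lifecycle above can be written as a transition table: a persistent cluster goes to WAITING after its steps and shuts down only when terminated manually, while a transient cluster shuts itself down once the steps complete. A small reachability check over that table:

```python
# State transitions from the notes above (simplified; no failure states).
transitions = {
    "STARTING": ["BOOTSTRAPPING"],
    "BOOTSTRAPPING": ["RUNNING"],
    "RUNNING": ["WAITING", "SHUTTING_DOWN"],   # persistent vs transient cluster
    "WAITING": ["RUNNING", "SHUTTING_DOWN"],   # new steps, or manual termination
    "SHUTTING_DOWN": ["TERMINATED", "COMPLETED"],
}

def can_reach(src, dst):
    """Depth-first check that dst is reachable from src."""
    seen, frontier = set(), [src]
    while frontier:
        state = frontier.pop()
        if state == dst:
            return True
        if state in seen:
            continue
        seen.add(state)
        frontier.extend(transitions.get(state, []))
    return False
```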

Cost
  • EMR
  • S3
  • EC2
    • Spot Instances
      • Hadoop tasks can be retried on interruption, so spot is a good fit
      • Do not use spot for master node


Storage
  • Hadoop HDFS
    • native filesystem (also used for HBase)
    • cannot decouple storage from compute 
    • ephemeral (lost when the cluster is terminated)
    • useful for caching intermediate results
    • replicates data between nodes 
    • Node Types
      • DataNode - stores file blocks (64 MB default in Hadoop 1; 128 MB in Hadoop 2)
      • NameNode - master for DataNode (tracks which block is where)
  • EMRFS
    • S3 single source of truth: data lake
    • Multiple clusters can work on the same data
    • Consistent View 
      • DynamoDB based index
      • Very fast index
    • Copy
      • s3distcp - efficient, parallel copy of data S3 <-> EMR cluster
  • Combination
    • S3 as input/output
    • HDFS intermediate results
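
The combination pattern, sketched as s3-dist-cp step arguments: copy the input from S3 into HDFS, keep intermediate results in HDFS, then push the final output back to S3 (bucket and paths are hypothetical):

```python
# Hypothetical s3-dist-cp invocations bracketing a job that reads and
# writes HDFS; these would run as command-runner.jar step Args on EMR.
copy_in = ["s3-dist-cp", "--src", "s3://my-bucket/input/", "--dest", "hdfs:///input/"]
copy_out = ["s3-dist-cp", "--src", "hdfs:///output/", "--dest", "s3://my-bucket/output/"]
```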

Hadoop YARN
  • "Yet Another Resource Negotiator"
  • Component for managing resources
    • nodes
    • allocating tasks
  • It can be used to run applications other than Hadoop MapReduce, e.g.
    • Apache Tez 
    • Apache Spark

Tools
  • Hue - UI for Hadoop
  • Hive
    • Uses SQL syntax to generate map reduce jobs
    • Code generation
    • Schedule with the engine
    • Quite slow
    • extensible with Java
    • complex user defined types
    • can access DynamoDB, S3
    • can process very large amounts of data
  • Impala
    • SQL-like language
    • In-memory
    • Uses the Hive metastore
    • Bypasses Hadoop MapReduce
    • Only works with HDFS
  • Facebook Presto
    • In-memory 
    • Can work with Hive tables
    • Very low latency
    • Query directly against S3
    • Bypasses MapReduce
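
Since Hive compiles SQL into MapReduce jobs, it is typically driven on EMR as a step as well; a sketch using the `hive-script` runner (script location is hypothetical):

```python
# Hypothetical Hive step: Hive compiles the .hql script into MapReduce jobs.
hive_step = {
    "Name": "Hive report",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "hive-script", "--run-hive-script",
            "--args", "-f", "s3://my-bucket/queries/report.hql",  # hypothetical
        ],
    },
}
```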