Overview
- Written in Scala (JVM)
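- Minimizes disk I/O by keeping intermediate data in memory
- More partition-aware (knows where the data lives) than MapReduce
- Avoids unnecessary shuffles; wide operations such as joins and grouping still shuffle data, and that remains expensive
- Easy to write a job that works with both static (batch) data and a stream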
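To make the partitioning and shuffle points concrete, here is a toy pure-Python model (not Spark code). It assumes three in-memory "partitions" and a made-up partitioner. All function names are hypothetical and only illustrate why per-partition work is cheap while regrouping by key moves data.

```python
from collections import defaultdict

# Toy model only: each inner list stands for one partition held by one worker.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]


def map_values(parts, fn):
    """Narrow transformation: every partition is processed where it lives."""
    return [[(k, fn(v)) for k, v in part] for part in parts]


def shuffle_by_key(parts, num_partitions):
    """Wide transformation step: move records so equal keys share a partition."""
    out = [[] for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(parts):
        for k, v in part:
            # Deterministic partitioner for this sketch (an assumption, not Spark's).
            dst = sum(k.encode()) % num_partitions
            if dst != src:
                moved += 1  # this record had to leave its original partition
            out[dst].append((k, v))
    return out, moved


def reduce_by_key(parts):
    """After the shuffle, each partition can sum its own keys locally."""
    result = []
    for part in parts:
        totals = defaultdict(int)
        for k, v in part:
            totals[k] += v
        result.append(dict(totals))
    return result


doubled = map_values(partitions, lambda v: v * 2)  # no data movement
shuffled, moved = shuffle_by_key(doubled, 3)       # data movement happens here
totals = reduce_by_key(shuffled)
```

In this sketch `map_values` never touches another partition, while `shuffle_by_key` relocates four of the six records before the per-key sums can be computed.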
Languages
- Scala
- Python
- SQL
- R
- Java
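DataFrame
- Main abstraction (previously the RDD)
- Distributed collection of data organized in columns
- Roughly a distributed table
- Uses Catalyst query planner
- Usually faster than hand-written RDD code, since Catalyst can optimize the query plan
Dataset
- Typed API on top of DataFrames (compile-time type safety; Scala and Java only)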
Spark SQL
- Once data is loaded into a DataFrame, you can query it with SQL; queries on cached data are fast because it stays in memory
Directed Acyclic Graph (DAG)
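- Each action triggers a job, whose chain of transformations forms a DAG
- Structure: the DAG is split into stages at shuffle boundaries; each stage runs as a set of tasks, one per partition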
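The stage and task structure can be sketched with a toy pure-Python model (not Spark's actual scheduler). It assumes a simple linear lineage rather than a general DAG, a hand-picked set of operations treated as shuffle boundaries, and the same partition count in every stage. The helper names are hypothetical.

```python
# Operations that this sketch treats as requiring a shuffle (an assumption).
WIDE_OPS = {"groupByKey", "reduceByKey", "join", "repartition"}


def split_into_stages(ops):
    """Start a new stage at each wide operation (shuffle boundary)."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


def plan_tasks(stages, num_partitions):
    """Create one task per partition for every stage."""
    return [
        [f"stage{i}-task{p}" for p in range(num_partitions)]
        for i, _ in enumerate(stages)
    ]


lineage = ["textFile", "map", "filter", "reduceByKey", "map"]
stages = split_into_stages(lineage)
tasks = plan_tasks(stages, num_partitions=2)
```

Here the narrow operations before `reduceByKey` are grouped into one stage, and the shuffle starts a second one. Each stage then gets one task per partition.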
Cluster management
- Standalone (Spark's built-in cluster manager; can also run on a single machine)
- YARN (supported by EMR)
- Mesos