Monday, 12 March 2018

Spark

Overview
  • Written in Scala (JVM)
  • Minimizes disk I/O by keeping intermediate data in memory
  • More partition-aware (where the data is) than MapReduce
    • Can avoid some expensive shuffles
  • Easy to write a job that works with both static data and a stream
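
Why partition awareness matters can be pictured with a toy model in plain Python. This is only a sketch, not Spark code: `hash_partition` and `sum_by_key` are hypothetical helpers, and the records are made up. The point is that once every record for a key sits in the same partition, a per-key aggregation can run inside each partition with no data moving between partitions.

```python
from collections import defaultdict

def hash_partition(records, num_partitions):
    """Place each (key, value) record in the partition chosen by its key's hash."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def sum_by_key(partition):
    """Aggregate one partition locally, without looking at any other partition."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)

# Made-up sample records.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = hash_partition(records, 3)

# Each partition is aggregated independently (this is the "no shuffle" part).
local_totals = [sum_by_key(p) for p in parts]

# Merging is a plain union because no key appears in two partitions.
merged = {}
for totals in local_totals:
    merged.update(totals)
print(merged)
```

If the records were spread across partitions at random instead, the same aggregation would first have to move records with equal keys together, which is the shuffle.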

Languages
  • Scala
  • Python
  • SQL
  • R
  • Java

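DataFrame
  • Main abstraction (used to be the RDD)
  • Distributed collection of data organized into named columns
    • ~distributed table
  • Uses the Catalyst query planner
    • Catalyst optimizes the query plan, so DataFrame code usually runs faster than equivalent hand-written RDD code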

Dataset
  • Compile-time type safety on top of DataFrames (Scala and Java only)

Spark SQL
  • Once data is loaded into a DataFrame (and cached in memory), you can query it with SQL at low latency

Directed Acyclic Graph (DAG)
  • Each action triggers a job, which Spark plans as a DAG of stages
  • Structure
    • Job
      • Stage
        • Tasks (one per partition)
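
The stage and task structure can be pictured with a toy planner in plain Python. This is not Spark's scheduler: `Stage`, `plan_job`, and the `WIDE_OPS` set are hypothetical, and the sketch assumes a simple linear chain of operations, whereas real jobs can branch and join. It cuts a new stage wherever an operation needs a shuffle and gives each stage one task per partition.

```python
from dataclasses import dataclass
from typing import List

# Assumption for this sketch: these operations need a shuffle ("wide");
# everything else is treated as "narrow" and pipelined into the same stage.
WIDE_OPS = {"groupByKey", "reduceByKey", "join", "repartition"}

@dataclass
class Stage:
    stage_id: int
    ops: List[str]
    num_tasks: int  # one task per partition

def plan_job(ops: List[str], num_partitions: int) -> List[Stage]:
    """Split a linear chain of operations into stages at shuffle boundaries."""
    stages: List[Stage] = []
    current: List[str] = []
    for op in ops:
        if op in WIDE_OPS and current:
            # The shuffle closes the current stage; the wide op starts the next one.
            stages.append(Stage(len(stages), current, num_partitions))
            current = []
        current.append(op)
    if current:
        stages.append(Stage(len(stages), current, num_partitions))
    return stages

# A word-count-like chain; the final action is what triggers the job.
job = plan_job(["textFile", "flatMap", "map", "reduceByKey", "collect"], num_partitions=4)
for stage in job:
    print(stage)
```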

Cluster management
  • Standalone (Spark's built-in cluster manager; a separate local mode runs everything on one machine)
  • YARN (supported by EMR)
  • Mesos
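
The cluster manager is chosen through the master URL passed to `spark-submit --master`. The sketch below only builds the command line as a list of strings. The host names, ports, and the `my_job.py` application are placeholders, and `submit_command` is a hypothetical helper.

```python
from typing import List

# Master URL formats accepted by spark-submit's --master option.
# Host names and ports below are placeholders.
MASTER_URLS = {
    "local": "local[*]",                       # one machine, one worker thread per core
    "standalone": "spark://master-host:7077",  # Spark's built-in cluster manager
    "yarn": "yarn",                            # cluster is located via the Hadoop configuration
    "mesos": "mesos://mesos-host:5050",
}

def submit_command(manager: str, app: str) -> List[str]:
    """Build the spark-submit argument list for the chosen cluster manager."""
    return ["spark-submit", "--master", MASTER_URLS[manager], app]

cmd = submit_command("yarn", "my_job.py")
print(" ".join(cmd))
```

The application code stays the same across managers; only the master URL changes.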
