Notes on AWS, Big Data, Machine Learning and Leadership: Spark

Monday, 12 March 2018

Spark

Overview

Written in Scala (JVM)
Minimizes disk I/O by keeping intermediate data in memory
More partition-aware (knows where the data lives) than MapReduce
- Avoids expensive shuffles where it can
Easy to write a job that can interact both with static data and a stream

Languages

Scala
Python
SQL
R
Java

DataFrame

Main abstraction (previously the RDD)
Distributed collection of data organized in columns
- ~distributed table
Uses the Catalyst query optimizer
- Usually faster than equivalent hand-written RDD code, because the optimizer sees the whole query

Dataset

Typed API on top of DataFrames: types are checked at compile time
Available in Scala and Java only

Spark SQL

Once data is loaded into a DataFrame, you can query it with SQL at low latency

Directed Acyclic Graph (DAG)

Each action triggers a job, which is planned as a DAG of stages
Structure
- Job
  - Stage (split at shuffle boundaries)
    - Tasks (one per partition)
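
A toy model in plain Python of how work breaks down into stages and tasks. It is not Spark code: the Stage class and split_into_stages function are made up for illustration, the job is assumed to be a simple linear chain rather than a general DAG, and a shuffle operation is placed entirely in the stage it starts, which is a simplification.

```python
from dataclasses import dataclass, field


# Toy model only: these names are illustrative and are not Spark APIs.
@dataclass
class Stage:
    ops: list = field(default_factory=list)

    def tasks(self, num_partitions):
        # One task per partition; every task runs all of the stage's operations.
        return [(p, tuple(self.ops)) for p in range(num_partitions)]


def split_into_stages(ops, wide_ops=("groupByKey", "reduceByKey", "join")):
    """Cut a linear chain of operations into stages at each shuffle (wide) operation."""
    stages = [Stage()]
    for op in ops:
        if op in wide_ops and stages[-1].ops:
            # A wide operation needs data from other partitions, so it starts a new stage.
            stages.append(Stage())
        stages[-1].ops.append(op)
    return stages


job = ["map", "filter", "reduceByKey", "map"]
stages = split_into_stages(job)
print([s.ops for s in stages])
print(len(stages[0].tasks(num_partitions=4)))
```

Narrow operations such as map and filter stay in the same stage because each partition can be processed independently.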

Cluster management

Standalone (Spark's built-in cluster manager; can also run on a local machine)
YARN (supported by EMR)
Mesos

References

https://www.youtube.com/watch?v=N7_u1b18yGg
