Notes on AWS, Big Data, Machine Learning and Leadership: Data Lake

Monday, 12 March 2018

Overview

Alternative to Data Warehouse
- Benefits of a data warehouse
  - Self-documenting schema
  - Enforced Data Type
  - Common Security Model
  - Simple to access
  - Transactionality
- Drawback: couples Compute & Storage
Separation of Compute & Storage
- Cost efficiency (store all data without paying for unused cores)
- Independent cost attribution
- Right tool for each job
- Increased durability
Components
- Storage & Streams
- Catalogue & Search

Schema-on-read

Alternative to schema-on-write (predefined schema)
- You create a schema (table with columns) up-front
- You write data to it
- You query it
Apply your own "interpretation" to the data at read time
Benefits
- Different people have different needs (business analysis, diagnostics)
  - You don't need to predict the needs (access patterns). There is always a new "need"
- When consolidating multiple datasets you don't need to create an uber-schema
- You can get value out of it immediately
- Semi-structured data
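
The idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample events, the field names, and the helper `read_with_schema` are all hypothetical, and a real data lake would read files from object storage rather than an in-memory list. The point is that the raw data is written once without a schema, and each consumer applies its own interpretation when reading.

```python
import json

# Hypothetical raw events, stored as-is; no schema was imposed at write time.
RAW_LINES = [
    '{"ts": "2018-03-12T10:00:00", "user": "alice", "amount": "19.90", "status": 200}',
    '{"ts": "2018-03-12T10:00:05", "user": "bob", "status": 500, "error": "timeout"}',
    '{"ts": "2018-03-12T10:00:09", "user": "alice", "amount": "5.10", "status": 200}',
]


def read_with_schema(lines, schema):
    """Yield records shaped by `schema`: field name -> (converter, default).

    Fields missing from a raw record fall back to the default; fields not
    named in the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


# Business-analysis interpretation: revenue per user.
SALES_SCHEMA = {"user": (str, None), "amount": (float, 0.0)}

# Diagnostics interpretation: failed requests only.
DIAG_SCHEMA = {"ts": (str, None), "status": (int, None), "error": (str, "")}

revenue = {}
for row in read_with_schema(RAW_LINES, SALES_SCHEMA):
    revenue[row["user"]] = revenue.get(row["user"], 0.0) + row["amount"]

failures = [
    row for row in read_with_schema(RAW_LINES, DIAG_SCHEMA) if row["status"] >= 500
]
```

Both views come from the same untouched raw lines, so a third "need" that appears later only requires a new schema, not a reload of the data.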

Data Structure

Unstructured
- Logs, dump files
Semi-structured
- JSON, XML
- Consider evolution (Avro)
- Store the schema for the file (metadata/tag)
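
A rough sketch of why schema evolution matters for semi-structured data. It uses plain Python dictionaries and does not call the Avro library; the schema, the records, and the `resolve` helper are hypothetical. It only mimics the general idea that a reader's schema can supply defaults for fields that older records do not contain.

```python
# Hypothetical reader schema: field name -> default used when a record lacks the field.
READER_SCHEMA_V2 = {"user": None, "amount": 0.0, "currency": "USD"}

# Written before the "currency" field existed.
old_record = {"user": "alice", "amount": 19.9}
# Written after the field was added.
new_record = {"user": "bob", "amount": 7.5, "currency": "EUR"}


def resolve(record, reader_schema):
    """Project a record onto the reader schema, filling defaults for missing fields."""
    return {
        field: record.get(field, default)
        for field, default in reader_schema.items()
    }


resolved_old = resolve(old_record, READER_SCHEMA_V2)
resolved_new = resolve(new_record, READER_SCHEMA_V2)
```

Old and new files can then be read side by side without rewriting the old ones, which is the practical reason to keep the schema alongside each file.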
Structured
- Row: CSV, TSV
- Columnar: ORC, Parquet
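
To make the row versus columnar distinction concrete, here is a small sketch using only the Python standard library. The sample CSV is made up, and the `columns` dictionary only imitates a columnar layout in memory; real ORC and Parquet files add encoding, compression, and metadata that are not shown here.

```python
import csv
import io

# Hypothetical sample data in a row-oriented format (CSV).
CSV_TEXT = "user,country,amount\nalice,DE,19.90\nbob,US,7.50\ncarol,DE,5.10\n"

# Row layout: each record keeps all of its fields together.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Columnar layout: one list per column, values of the same field stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Summing one field from rows touches every full record...
total_from_rows = sum(float(row["amount"]) for row in rows)

# ...while the columnar layout only needs the single "amount" column.
total_from_columns = sum(float(value) for value in columns["amount"])
```

Analytical queries usually read a few columns across many records, which is why the columnar layout tends to suit them better than row formats such as CSV or TSV.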
