Overview
- Alternative to Data Warehouse
- Data warehouse benefits
- Self-documenting schema
- Enforced Data Type
- Common Security Model
- Simple to access
- Transactionality
- Data warehouse drawback: couples compute and storage
- Benefits
- Separation of compute and storage
- Cost efficiency (store all data without paying for unused cores)
- Independent cost attribution
- Right tool for the job
- Increased durability
- Components
- Storage & Streams
- Catalogue & Search
Schema-on-read
- Alternative to schema-on-write (predefined schema)
- You create a schema (table with columns) up-front
- You write data to it
- You query it
- With schema-on-read, you apply your own "interpretation" to the data when you read it
- Benefits
- Different people have different needs (business analysis, diagnostics)
- You don't need to predict the needs (access patterns) up front. There is always a new "need"
- When consolidating multiple datasets, you don't need to create an uber-schema
- You can get value out of it immediately
- Semi-structured data
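A minimal sketch of schema-on-read, assuming raw events are stored untouched as one JSON object per line. The field names (`user`, `action`, `amount`, `latency_ms`) and both helper functions are hypothetical and exist only for illustration. The point is that two readers apply two different "interpretations" to the same stored data, and neither had to be planned when the data was written.

```python
import io
import json

# Hypothetical raw events, stored as-is with no schema declared up front.
RAW_EVENTS = io.StringIO(
    '{"ts": "2024-01-01T10:00:00", "user": "alice", "action": "purchase", "amount": "20.50", "latency_ms": 120}\n'
    '{"ts": "2024-01-01T10:00:05", "user": "bob", "action": "view", "latency_ms": 950}\n'
    '{"ts": "2024-01-01T10:00:09", "user": "alice", "action": "purchase", "amount": "4.50", "latency_ms": 80}\n'
)


def read_events(source):
    """Yield raw records without imposing any schema."""
    source.seek(0)
    for line in source:
        if line.strip():
            yield json.loads(line)


def revenue_by_user(source):
    """Business-analysis view: keep purchases only, read amount as a number."""
    totals = {}
    for event in read_events(source):
        if event.get("action") == "purchase":
            user = event["user"]
            totals[user] = totals.get(user, 0.0) + float(event["amount"])
    return totals


def slow_requests(source, threshold_ms=500):
    """Diagnostics view: same data, but only timestamp and latency matter."""
    return [
        (event["ts"], event["latency_ms"])
        for event in read_events(source)
        if event.get("latency_ms", 0) > threshold_ms
    ]


print(revenue_by_user(RAW_EVENTS))
print(slow_requests(RAW_EVENTS))
```

The records without an `amount` field are simply ignored by the revenue view, which is how semi-structured data tends to be handled under this approach: each reader decides which fields it requires.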
Data Structure
- Unstructured
- Logs, dump files
- Semi-structured
- JSON, XML
- Consider schema evolution (e.g. Avro)
- Store the schema with the file (metadata/tag)
- Structured
- Row: CSV, TSV
- Columnar: ORC, Parquet
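The difference between the row and columnar layouts can be sketched in plain Python. This is only an illustration of the layout idea, assuming a tiny hypothetical dataset. Real columnar formats such as ORC and Parquet add encoding, compression, and per-column statistics that are not shown here, and `read_rows` and `to_columns` are made-up helper names.

```python
import csv
import io

# Hypothetical row-oriented data, as it would appear in a CSV file.
CSV_TEXT = "user,country,amount\nalice,DE,20\nbob,US,5\ncarol,DE,7\n"


def read_rows(text):
    """Row layout: all fields of one record are kept together (CSV, TSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def to_columns(rows):
    """Columnar layout: all values of one column are kept together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


rows = read_rows(CSV_TEXT)
columns = to_columns(rows)

# Row layout: summing one field still walks every full record.
total_from_rows = sum(int(row["amount"]) for row in rows)

# Columnar layout: the same sum touches only the "amount" column.
total_from_columns = sum(int(value) for value in columns["amount"])

print(total_from_rows, total_from_columns)
```

Analytical queries that read a few columns out of many tend to favour the columnar layout, while writing or fetching whole records is more natural in the row layout.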