Monday, 12 March 2018

Data Lake

Overview
  • Alternative to Data Warehouse
    • Benefits of a Data Warehouse
      • Self-documenting schema
      • Enforced Data Type
      • Common Security Model
      • Simple to access
      • Transactionality
    • Drawback: couples Compute&Storage
  • Separation of Compute&Storage
    • Cost efficiency (store all data without paying for unused cores)
    • Independent cost attribution
    • Right tool for the job
    • Increased durability
  • Components
    • Storage&Streams
    • Catalogue&Search

Schema-on-read
  • Alternative to schema-on-write (predefined schema)
    • You create a schema (table with columns) up-front
    • You write data to it
    • You query it
  • Apply your own "interpretation" on the data
  • Benefits
    • Different people have different needs (business analysis, diagnostics)
      • You don't need to predict the needs (access patterns). There is always a new "need"
    • When consolidating multiple datasets you don't need to create an uber-schema
    • You can get value out of it immediately
    • Semi-structured data
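
A minimal sketch of schema-on-read, assuming raw events are kept as JSON lines. The sample records, the field names, and the `read_with_schema` helper are hypothetical, for illustration only. The same raw data is read twice, once with a "business analysis" interpretation and once with a "diagnostics" interpretation, and nothing had to be declared before the data was written.

```python
import json

# Raw events stored as-is (hypothetical sample data); no schema was enforced on write.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "status": 200, "latency_ms": 35}',
    '{"user": "bob", "amount": "7.25", "status": 500, "latency_ms": 910}',
    '{"user": "alice", "amount": "3.00", "status": 200}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto `schema`: field -> (cast function, default)."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


# Business analysis cares about who spent how much.
SALES_SCHEMA = {"user": (str, None), "amount": (float, 0.0)}
# Diagnostics cares about failures and latency.
DIAGNOSTICS_SCHEMA = {"status": (int, None), "latency_ms": (int, None)}

sales = list(read_with_schema(RAW_EVENTS, SALES_SCHEMA))
diagnostics = list(read_with_schema(RAW_EVENTS, DIAGNOSTICS_SCHEMA))

# Each reader answers its own question from the same raw data.
total_by_user = {}
for row in sales:
    total_by_user[row["user"]] = total_by_user.get(row["user"], 0.0) + row["amount"]

errors = [row for row in diagnostics if row["status"] >= 500]
```

Note that the third record has no `latency_ms`. The reader decides what a missing field means (here: the default), which is the price of not enforcing a schema on write.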

Data Structure
  • Unstructured
    • Logs, dump files
  • Semi-structured
    • JSON, XML
    • Consider evolution (Avro)
    • Store the schema with the file (metadata/tag)
  • Structured
    • Row: CSV, TSV
    • Columnar: ORC, Parquet
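
A rough illustration of the row vs. columnar distinction, using only the Python standard library. The CSV content is made-up sample data. Real columnar formats such as ORC and Parquet add encoding, compression, and embedded metadata on top of this idea, none of which is shown here.

```python
import csv
import io

# Hypothetical sample data in a row-oriented text format.
CSV_TEXT = "user,amount,country\nalice,12.50,NZ\nbob,7.25,AU\nalice,3.00,NZ\n"

# Row-oriented: all fields of one record are kept together.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Column-oriented: all values of one field are kept together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one field only needs to touch that one column.
total_amount = sum(float(value) for value in columns["amount"])
```

With the row layout, summing `amount` means reading every field of every record. With the columnar layout, only the `amount` values are read, which is why columnar files suit analytical queries that scan a few columns across many rows.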
