Saturday, 10 March 2018

AWS Redshift

Overview
  • Petabyte data warehouse
  • Suitable for OLAP
  • Massively Parallel Processing (MPP) - share nothing architecture
  • SQL Clients / BI Tools can connect to it
    • JDBC/ODBC using PostgreSQL drivers
  • Leader/Compute node architecture
  • Columnar Storage

Data Distribution
  • Style
    • EVEN (default) - when tables do not participate in JOINs
    • KEY - column value decides where physically placed. Can be used to colocate data from multiple tables (JOINS)
      • compare Hash Key in DynamoDB
      • skew = ratio biggest/smallest node (ideal = 1)
    • ALL - copy kept on all nodes, appropriate for relatively slow moving data
      • Redistribution is cheap so small dimension tables do not benefit much

Compression
  • Redshift automatically detects compression algorithm by sampling rows
    • Multiple data encodings available
  • Turn off for staging/temp table
    • Bake the compression in table definition
  • Do not compress sort keys
  • ANALYZE COMPRESSION 
    • Command to find optimal compression
  • Benefits
    • Reduces I/O
    • Allows to store more data (~4x)

Data Block
  • Data stored in 1MB blocks
  • Stores column values for various rows
    • e.g. INT column - you can store 250K/block after compression
  • Immutable

Zone Map
  • Metadata about data blocks
    • Min/Max for each block (all blocks have it)
  • Kept in-memory
  • Use case
    • Eliminates unnecessary I/O (skip entire blocks)
      • No need to scan blocks

Data Sort Key
  • Makes Zone Maps more effective
  • Single Sort Key per table
  • COMPOUND (default)
    • May contain multiple columns
      • Choose low cardinality column first
    • If your query sorts by second column only it gives you nothing
  • INTERLEAVED 
    • Use space filling curve to organize the data in a block such a way that you get performance improvement (see References)
    • Spread data by limited number of blocks
    • Up to 8 columns


Transactions
  • Uses MVCC
  • ACID compliant (serializable)
  • Global commit
  • Expensive - use  "large" explicit transactions to avoid "small" implicit

Temporary table
  • Not mirrored 
  • Do trigger full commit/backups

Snapshots (backup)
  • Very similar to RDS snapshots
    • automated (5GB or 8h)
    • manual
    • can copy automated snapshots cross-region
      • independent retention period
  • Can take manual snapshot from exisitng automated snapshot
  • Can restore entire cluster OR a single table
  • Can be shared with other AWS accounts

Ingestion
  • COPY command (query)
    • Parallel by default
    • Each slice loads one file at a time
    • Do not use single input file
      • Use at least as many input files as you have slices
      • Recommended max 1GB
    • Max load throughput 100MB/s
    • Loads data from S3
  • INSERT INTO SELECT FROM
    • Inserts data from other tables
  • INSERT/UPDATE are slow
    • high fixed cost for a commit
    • must rewrite entire block (immutable)
  • Use compression on individual files
  • Manifest file
    • Defines instruction with files and what to do when Redshift cannot find them (e.g. stop load, skip, etc.)
  • Turn off automatic stats update (does too much work) and update for important columns only (sort/dist) after load is completed

VACUUM
  • Command
  • Clean-up after bulk changes
  • Every update is "delete followed by an insert" (immutable blocks)
  • Every delete is "soft"
  • Space not reclaimed automatically
  • Vacuum does
    • Garbage Collection (reclaim deleted rows aka "ghost/phantom rows")
    • Sorts data in Sort Keys
  • Very expensive - run weekly


ANALYZE
  • Analyzes table statistics
  • Improves query optimizer

Encryption
  • Setup during cluster creation
    • unload data + reload to new cluster
  • Data blocks and system metada encrypted for cluster and its snapshots
  • Data encryption key - encrypts data block 
    • Database key - stored on disk separately from cluster. Kept in memory by Redshift
      • Cluster key - protected by either KMS or (Cloud)HSM
        • Master Key - encrypts cluster key
  • Key rotation
    • Cluster is unavailable (state: ROTATING_KEYS)

Audit
  • All SQL operations are logged and can be downloaded to S3


References:


No comments:

Post a Comment