Notes on AWS, Big Data, Machine Learning and Leadership: AWS Redshift

Overview

Data Distribution

Compression

Redshift automatically detects compression algorithm by sampling rows
- Multiple data encodings available
Turn off for staging/temp table
- Bake the compression in table definition
Do not compress sort keys
ANALYZE COMPRESSION
- Command to find optimal compression
Benefits
- Reduces I/O
- Allows to store more data (~4x)

Data Block

Data stored in 1MB blocks
Stores column values for various rows
- e.g. INT column - you can store 250K/block after compression
Immutable

Zone Map

Metadata about data blocks
- Min/Max for each block (all blocks have it)
Kept in-memory
Use case
- Eliminates unnecessary I/O (skip entire blocks)
  - No need to scan blocks

Data Sort Key

Makes Zone Maps more effective
Single Sort Key per table
COMPOUND (default)
- May contain multiple columns
  - Choose low cardinality column first
- If your query sorts by second column only it gives you nothing
INTERLEAVED
- Use space filling curve to organize the data in a block such a way that you get performance improvement (see References)
- Spread data by limited number of blocks
- Up to 8 columns

Transactions

Temporary table

Snapshots (backup)

Very similar to RDS snapshots
- automated (5GB or 8h)
- manual
- can copy automated snapshots cross-region
  - independent retention period
Can take manual snapshot from exisitng automated snapshot
Can restore entire cluster OR a single table
Can be shared with other AWS accounts

Ingestion

COPY command (query)
- Parallel by default
- Each slice loads one file at a time
- Do not use single input file
  - Use at least as many input files as you have slices
  - Recommended max 1GB
- Max load throughput 100MB/s
- Loads data from S3
INSERT INTO SELECT FROM
- Inserts data from other tables
INSERT/UPDATE are slow
- high fixed cost for a commit
- must rewrite entire block (immutable)
Use compression on individual files
Manifest file
- Defines instruction with files and what to do when Redshift cannot find them (e.g. stop load, skip, etc.)
Turn off automatic stats update (does too much work) and update for important columns only (sort/dist) after load is completed

VACUUM

Command
Clean-up after bulk changes
Every update is "delete followed by an insert" (immutable blocks)
Every delete is "soft"
Space not reclaimed automatically
Vacuum does
- Garbage Collection (reclaim deleted rows aka "ghost/phantom rows")
- Sorts data in Sort Keys
Very expensive - run weekly

ANALYZE

Encryption

Setup during cluster creation
- unload data + reload to new cluster
Data blocks and system metada encrypted for cluster and its snapshots
Data encryption key - encrypts data block
- Database key - stored on disk separately from cluster. Kept in memory by Redshift
  - Cluster key - protected by either KMS or (Cloud)HSM
    - Master Key - encrypts cluster key
Key rotation
- Cluster is unavailable (state: ROTATING_KEYS)

Audit

References:

Notes on AWS, Big Data, Machine Learning and Leadership