Overview
- Petabyte data warehouse
- Suitable for OLAP
- Massively Parallel Processing (MPP) - share nothing architecture
- SQL Clients / BI Tools can connect to it
- JDBC/ODBC using PostgreSQL drivers
- Leader/Compute node architecture
- Columnar Storage
Data Distribution
- Style
- EVEN (default) - when tables do not participate in JOINs
- KEY - column value decides where physically placed. Can be used to colocate data from multiple tables (JOINS)
- compare Hash Key in DynamoDB
- skew = ratio biggest/smallest node (ideal = 1)
- ALL - copy kept on all nodes, appropriate for relatively slow moving data
- Redistribution is cheap so small dimension tables do not benefit much
Compression
- Redshift automatically detects compression algorithm by sampling rows
- Multiple data encodings available
- Turn off for staging/temp table
- Bake the compression in table definition
- Do not compress sort keys
- ANALYZE COMPRESSION
- Command to find optimal compression
- Benefits
- Reduces I/O
- Allows to store more data (~4x)
Data Block
- Data stored in 1MB blocks
- Stores column values for various rows
- e.g. INT column - you can store 250K/block after compression
- Immutable
Zone Map
- Metadata about data blocks
- Min/Max for each block (all blocks have it)
- Kept in-memory
- Use case
- Eliminates unnecessary I/O (skip entire blocks)
- No need to scan blocks
- Eliminates unnecessary I/O (skip entire blocks)
Data Sort Key
- Makes Zone Maps more effective
- Single Sort Key per table
- COMPOUND (default)
- May contain multiple columns
- Choose low cardinality column first
- If your query sorts by second column only it gives you nothing
- May contain multiple columns
- INTERLEAVED
- Use space filling curve to organize the data in a block such a way that you get performance improvement (see References)
- Spread data by limited number of blocks
- Up to 8 columns
Transactions
- Uses MVCC
- ACID compliant (serializable)
- Global commit
- Expensive - use "large" explicit transactions to avoid "small" implicit
Temporary table
- Not mirrored
- Do trigger full commit/backups
Snapshots (backup)
- Very similar to RDS snapshots
- automated (5GB or 8h)
- manual
- can copy automated snapshots cross-region
- independent retention period
- Can take manual snapshot from exisitng automated snapshot
- Can restore entire cluster OR a single table
- Can be shared with other AWS accounts
Ingestion
- COPY command (query)
- Parallel by default
- Each slice loads one file at a time
- Do not use single input file
- Use at least as many input files as you have slices
- Recommended max 1GB
- Max load throughput 100MB/s
- Loads data from S3
- INSERT INTO SELECT FROM
- Inserts data from other tables
- INSERT/UPDATE are slow
- high fixed cost for a commit
- must rewrite entire block (immutable)
- Use compression on individual files
- Manifest file
- Defines instruction with files and what to do when Redshift cannot find them (e.g. stop load, skip, etc.)
- Turn off automatic stats update (does too much work) and update for important columns only (sort/dist) after load is completed
VACUUM
- Command
- Clean-up after bulk changes
- Every update is "delete followed by an insert" (immutable blocks)
- Every delete is "soft"
- Space not reclaimed automatically
- Vacuum does
- Garbage Collection (reclaim deleted rows aka "ghost/phantom rows")
- Sorts data in Sort Keys
- Very expensive - run weekly
ANALYZE
- Analyzes table statistics
- Improves query optimizer
Encryption
- Setup during cluster creation
- unload data + reload to new cluster
- Data blocks and system metada encrypted for cluster and its snapshots
- Data encryption key - encrypts data block
- Database key - stored on disk separately from cluster. Kept in memory by Redshift
- Cluster key - protected by either KMS or (Cloud)HSM
- Master Key - encrypts cluster key
- Cluster key - protected by either KMS or (Cloud)HSM
- Database key - stored on disk separately from cluster. Kept in memory by Redshift
- Key rotation
- Cluster is unavailable (state: ROTATING_KEYS)
Audit
- All SQL operations are logged and can be downloaded to S3
References:
- http://arcanecode.com/2007/07/23/dimensions-versus-facts-in-data-warehousing/
- http://www.programmerinterview.com/index.php/database-sql/cardinality-versus-selectivity/
- http://optimalbi.com/blog/2014/12/08/redshift-zone-maps-as-an-indexing-strategy/
- http://stackoverflow.com/questions/8213235/mysql-covering-vs-composite-vs-column-index
- https://www.youtube.com/watch?v=fmy3jCxUliM
No comments:
Post a Comment