Overview
- Interactive Data Query Service
- Cannot be used to modify data
- Allows to query data on S3
- Based on Presto (Facebook)
- Massively Parallel Processing
- Uses schema-on-read
Model
- Table
- EXTERNAL (data stored in S3)
- S3 data location
- Metadata
- Uses Apache Hive DDL
- Stored in AWS Glue if availalble in Region
- SerDe (how to interpret a row)
Workgroups
- Separate workloads
- Query results (permissions)
- Works well with AWS Glue: Fine Grained Access Control (FGAC) for table access
- Saved Queries
- Enforce limits (cost control)
- Override client settings
CTAS
- Create Table As Select
- Allows to create "materialized views"
- Data can be stored in columnar format (ORC, Parquet)
- Can be used
- partition/bucketing purposes
- subsets of data
Query Results
- Stored in S3
- Can use KMS for encryption
Pricing
- Pay for data scanned in S3
- $5 per TB of data scanned
- Minimum 10 MB per query
- Optimizations
- Compression
- Parititioning
- Columnar data formats (e.g. Parquet)
- Additional costs for KMS (when using SSE-KMS in S3)
Partitioning
- Primary method to increase query performance
- Reduces the amount of data to scan
- Data separated in subfolders
- Partition column is a "virtual" column in the table
- PARTITIONED BY
- Data layout
- Apache Hive Format
- "/partitionkey=partitionvalue"
- Natively supported by Athena
- Data must be layout in the specific way
- Run MSCK REPAIR TABLE for Athena to scan S3 and figure out the partitions
- Example (S3 Inventory folders)
- /dt=2020-04-19-00-00
- /dt=2020-04-26-00-00
- /dt=2020-05-03-00-00
- ADD PARTITION
- Data is layout in partitions but format is other than Hive
- Example (ELB Logs)
- /2020/05/01
- /2020/05/02
- /2020/05/03
- /2020/05/04
- Similar effect can be achieved by creating a table and pointing to specific subfolder
- Non-partitioned
- Lambda to move (copy+delete) file to proper location
- Use CTAS to extract partition
- Kinesis Firehose use custom prefixes
Query Optimizations
- ◣
- JOIN: larger table on the left
- table on the right is distributed (copied) among worker nodes
- table on the left is streamed to worker nodes
- GROUP BY:
- high cardinality column on the left
References