Thursday, 25 June 2020

Athena

Overview
  • Interactive Data Query Service
    • Cannot be used to modify data
  • Allows to query data on S3 
  • Based on Presto (Facebook)
  • Massively Parallel Processing
  • Uses schema-on-read

Model
  • Table
    • EXTERNAL (data stored in S3)
  • S3 data location
  • Metadata
    • Uses Apache Hive DDL
    • Stored in AWS Glue if availalble in Region
  • SerDe (how to interpret a row)

Workgroups
  • Separate workloads
    • Query results (permissions)
      • Works well with AWS Glue: Fine Grained Access Control (FGAC) for table access
    • Saved Queries
    • Enforce limits (cost control)
  • Override client settings


CTAS
  • Create Table As Select
  • Allows to create "materialized views"
  • Data can be stored in columnar format (ORC, Parquet)
  • Can be used
    • partition/bucketing purposes
    • subsets of data

Query Results
  • Stored in S3
    • Can use KMS for encryption

Pricing
  • Pay for data scanned in S3
    • $5 per TB of data scanned
    • Minimum 10 MB per query
  • Optimizations
    • Compression
    • Parititioning
    • Columnar data formats (e.g. Parquet)
  • Additional costs for KMS (when using SSE-KMS in S3)

Partitioning
  • Primary method to increase query performance
    • Reduces the amount of data to scan
  • Data  separated in subfolders
  • Partition column is a "virtual" column in the table
    • PARTITIONED BY
  • Data layout
    • Apache Hive Format
      • "/partitionkey=partitionvalue"
      • Natively supported by Athena
      • Data must be layout in the specific way
      • Run MSCK REPAIR TABLE for Athena to scan S3 and figure out the partitions
      • Example (S3 Inventory folders)
        • /dt=2020-04-19-00-00
        • /dt=2020-04-26-00-00
        • /dt=2020-05-03-00-00
    • ADD PARTITION
      • Data is layout in partitions but format is other than Hive
      • Example (ELB Logs)
        • /2020/05/01
        • /2020/05/02
        • /2020/05/03
        • /2020/05/04
      • Similar effect can be achieved by creating a table and pointing to specific subfolder
    • Non-partitioned
      • Lambda to move (copy+delete) file to proper location
      • Use CTAS to extract partition
      • Kinesis Firehose use custom prefixes

Query Optimizations
    • JOIN: larger table on the left
      • table on the right is distributed (copied) among worker nodes
      • table on the left is streamed to worker nodes
    • GROUP BY:
      • high cardinality column on the left

References