Notes on AWS, Big Data, Machine Learning and Leadership: 2020

Thursday, 25 June 2020

Athena

Overview

Interactive Data Query Service

Cannot be used to modify data

Allows to query data on S3
Based on Presto (Facebook)
Massively Parallel Processing
Uses schema-on-read

Model

Table

EXTERNAL (data stored in S3)

S3 data location
Metadata

Uses Apache Hive DDL
Stored in AWS Glue if availalble in Region

SerDe (how to interpret a row)

Workgroups

Separate workloads

Query results (permissions)

Works well with AWS Glue: Fine Grained Access Control (FGAC) for table access

Saved Queries
Enforce limits (cost control)

Override client settings

CTAS

Create Table As Select
Allows to create "materialized views"
Data can be stored in columnar format (ORC, Parquet)
Can be used

partition/bucketing purposes
subsets of data

Query Results

Stored in S3

Can use KMS for encryption

Pricing

Pay for data scanned in S3

$5 per TB of data scanned
Minimum 10 MB per query

Optimizations

Compression
Parititioning
Columnar data formats (e.g. Parquet)

Additional costs for KMS (when using SSE-KMS in S3)

Partitioning

Primary method to increase query performance

Reduces the amount of data to scan

Data separated in subfolders
Partition column is a "virtual" column in the table

PARTITIONED BY

Data layout

Apache Hive Format

"/partitionkey=partitionvalue"
Natively supported by Athena
Data must be layout in the specific way
Run MSCK REPAIR TABLE for Athena to scan S3 and figure out the partitions
Example (S3 Inventory folders)

/dt=2020-04-19-00-00
/dt=2020-04-26-00-00
/dt=2020-05-03-00-00

ADD PARTITION

Data is layout in partitions but format is other than Hive
Example (ELB Logs)

/2020/05/01
/2020/05/02
/2020/05/03
/2020/05/04

Similar effect can be achieved by creating a table and pointing to specific subfolder

Non-partitioned

Lambda to move (copy+delete) file to proper location
Use CTAS to extract partition
Kinesis Firehose use custom prefixes

Query Optimizations

◣

JOIN: larger table on the left

table on the right is distributed (copied) among worker nodes
table on the left is streamed to worker nodes

GROUP BY:

high cardinality column on the left

References

Subscribe to: Comments (Atom)