Notes on AWS, Big Data, Machine Learning and Leadership: AWS Glue

Friday, 9 March 2018

AWS Glue

Overview

Managed Data Catalog and ETL service
Discovery
- Automatically discovers and categorizes data
Development
- Generates ETL code
Deploy
- Runs jobs (serverless)

Data Catalog

Hive metastore compatible
- AWS Extensions
  - Search over metadata
  - Connection Info - JDBC URLs
  - Classification for identifying types (e.g. Grok expressions)
    - Compare with Amazon Macie
  - Version as metadata evolves
- Insertion
  - Hive DDL
  - Bulk Import
  - AWS Crawlers
    - automatically extract schemas from data in S3 and create tables
    - detect schema changes
    - detect Hive-style partitions
Highly Available
Integrations
- Athena, Redshift Spectrum, EMR (replaces a Hive metastore backed by a local MySQL)
Source
- RDS, S3 (CSV, TSV, JSON, Parquet, ...), Redshift, databases running on EC2
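
To make "Hive-style partitions" concrete: a key such as `logs/year=2018/month=03/day=09/part-0000.json` encodes partition columns as `name=value` path segments. The following is a minimal sketch of how such a key could be split into partition values. The function name `parse_hive_partitions` and the example key are made up for illustration, and this is not how a Glue crawler is implemented.

```python
from typing import Dict, Tuple


def parse_hive_partitions(key: str) -> Tuple[Dict[str, str], str]:
    """Split an object key into Hive-style partition values and the file name.

    Hypothetical helper: only path segments of the form name=value are
    treated as partitions; every other directory segment is ignored.
    """
    *dirs, filename = key.split("/")
    partitions: Dict[str, str] = {}
    for segment in dirs:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions, filename


# Example key (assumed layout, not taken from a real bucket).
parts, filename = parse_hive_partitions(
    "logs/year=2018/month=03/day=09/part-0000.json"
)
print(parts, filename)
```

A key without `name=value` segments simply yields an empty partition mapping, which corresponds to an unpartitioned table.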

Job Authoring

Lets you describe transformations
- Given a source and a target, Glue generates the code
Python code generated (PySpark)
Developer endpoint
- Lets you spin up a Zeppelin notebook and connect to it
Dynamic Frame
- Wraps Data Frame
- Infers schema on-the-fly
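
The idea of inferring a schema on the fly can be illustrated in plain Python: instead of requiring a schema up front, each record contributes the types it actually contains, and a field seen with several types keeps all of them. This is only a sketch under that assumption. `infer_schema` and the sample records are hypothetical, and the real Dynamic Frame lives in the `awsglue` library and works on Spark, not on Python dicts.

```python
from typing import Any, Dict, Iterable, Set


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Collect, per field, every Python type name observed in the records."""
    schema: Dict[str, Set[str]] = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Sample records (made up): "id" is an int in one record and a str in another,
# and "tag" only appears in the second record.
records = [
    {"id": 1, "price": 9.5},
    {"id": "2", "price": 3.0, "tag": "sale"},
]
schema = infer_schema(records)
print(schema)
```

Fields that end up with more than one candidate type are the ones a job would have to resolve explicitly, for example by casting to a single type before writing to the target.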

Job Execution

Execution of the ETL code (serverless)
Triggers
- Schedule based (e.g. time of day)
- Event based (e.g. job completion)
- On-demand (e.g. AWS Lambda on S3 PUT)
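
As a rough illustration of the first two trigger types, the sketch below builds the request arguments for the Glue `create_trigger` API as plain dictionaries, so it runs without AWS credentials. The helper names, trigger names, job names and the cron expression are assumptions for the example.

```python
from typing import Any, Dict


def scheduled_trigger(name: str, job_name: str, cron: str) -> Dict[str, Any]:
    """Arguments for a schedule-based trigger (cron fields without the wrapper)."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
    }


def job_completion_trigger(name: str, upstream_job: str, job_name: str) -> Dict[str, Any]:
    """Arguments for an event-based trigger that fires when upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
    }


# Hypothetical names: run "orders-etl" daily at 02:00 UTC, then "orders-report".
nightly = scheduled_trigger("nightly-etl", "orders-etl", "0 2 * * ? *")
chained = job_completion_trigger("after-etl", "orders-etl", "orders-report")
print(nightly["Schedule"], chained["Type"])
```

With credentials configured, either dictionary could be passed on as `boto3.client("glue").create_trigger(**nightly)`. The on-demand case would instead call `start_job_run` directly, for example from a Lambda function.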
Bookmark
- State
  - what data was processed
  - process only new files
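
The bookmark idea (remember what was processed, handle only new files on the next run) can be sketched in a few lines of plain Python. The `Bookmark` class, `run_job` and the file names are hypothetical, and the state is kept in memory here, whereas Glue persists it between job runs.

```python
from typing import Iterable, List, Set


class Bookmark:
    """Keeps track of which input files a job has already processed."""

    def __init__(self) -> None:
        self.processed: Set[str] = set()

    def new_files(self, listing: Iterable[str]) -> List[str]:
        """Return the files from the listing that were not processed yet."""
        return [key for key in listing if key not in self.processed]

    def commit(self, keys: Iterable[str]) -> None:
        """Record files as processed after a successful run."""
        self.processed.update(keys)


def run_job(listing: Iterable[str], bookmark: Bookmark) -> List[str]:
    """Process only new files and return the ones handled in this run."""
    pending = bookmark.new_files(listing)
    # The actual transformation of `pending` would happen here.
    bookmark.commit(pending)
    return pending


bookmark = Bookmark()
first = run_job(["a.csv", "b.csv"], bookmark)
second = run_job(["a.csv", "b.csv", "c.csv"], bookmark)
print(first, second)
```

Committing only after the transformation step means a failed run leaves the bookmark untouched, so the same files are picked up again on the next attempt.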

References

https://www.youtube.com/watch?v=CMjycQ_3M14&t=329s
