Friday, 9 March 2018

AWS Glue

Overview
  • Managed Data Catalog and ETL service
  • Discovery
    • Automatically discovers and categorizes data
  • Development
    • Generates ETL code
  • Deployment
    • Runs jobs (serverless)

Data Catalog
  • Hive metastore compatible 
    • AWS Extensions
      • Search over metadata
      • Connection Info - JDBC URLs
      • Classification for identifying types (e.g. Grok expressions)
        • Compare with Amazon Macie
      • Version as metadata evolves
    • Insertion
      • Hive DDL
      • Bulk Import
      • AWS Crawlers
        • Automatically infer schemas from data in S3 and create tables
        • Detect schema changes
        • Detect Hive-style partitions
  • Highly Available
  • Integrations
    • Athena, Redshift Spectrum, EMR (replaces a Hive metastore backed by a local MySQL database)
  • Source
    • RDS, S3 (CSV, TSV, JSON, Parquet,...), Redshift, EC2 database
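
To make "Hive-style partitions" concrete: these are directories named name=value inside an S3 key. The following is a minimal sketch of how such partitions could be read off a set of keys. It is not how a Glue crawler is implemented; the function names and the sample keys are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple


def parse_hive_partitions(key: str) -> Tuple[Dict[str, str], str]:
    """Split an object key into Hive-style partition values and the file name."""
    *dirs, filename = key.strip("/").split("/")
    partitions: Dict[str, str] = {}
    for segment in dirs:
        # A Hive-style partition directory looks like "name=value".
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions, filename


def partition_columns(keys: List[str]) -> List[str]:
    """Return the partition column names present in every key, in path order."""
    if not keys:
        return []
    first, _ = parse_hive_partitions(keys[0])
    common = set(first)
    for key in keys[1:]:
        parts, _ = parse_hive_partitions(key)
        common &= set(parts)
    return [name for name in first if name in common]


# Hypothetical S3 keys, for illustration only.
keys = [
    "logs/year=2018/month=03/day=08/part-0000.json",
    "logs/year=2018/month=03/day=09/part-0000.json",
]
columns = partition_columns(keys)
```

Here "logs" is ignored because it has no name=value form, while year, month and day would become partition columns of the resulting table.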

Job Authoring
  • Lets you describe transformations
    • Given a source and a target, Glue generates the code
  • Generates Python code (PySpark)
  • Developer endpoint
    • Lets you spin up a Zeppelin notebook and connect to it
  • Dynamic Frame
    • Wraps Data Frame
    • Infers schema on-the-fly
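
The idea of inferring a schema on the fly can be sketched in plain Python: instead of requiring a schema up front, walk the records and note which types each field actually takes. This is only an illustration of the concept under that assumption, not the Dynamic Frame implementation; infer_schema, ambiguous_fields and the sample records are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Collect, per field, the set of Python type names seen across records."""
    schema: Dict[str, Set[str]] = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def ambiguous_fields(schema: Dict[str, Set[str]]) -> List[str]:
    """Return the fields that were observed with more than one type."""
    return sorted(field for field, types in schema.items() if len(types) > 1)


# Hypothetical records: "id" is an int in one record and a string in the other.
records = [
    {"id": 1, "price": 9.99},
    {"id": "2", "price": 12.5, "note": "gift"},
]
schema = infer_schema(records)
conflicts = ambiguous_fields(schema)
```

Fields that show up with more than one type, such as "id" here, are the ones a job would have to resolve explicitly before writing to a strictly typed target.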

Job Execution
  • Execution of the ETL code (serverless)
  • Triggers
    • Schedule based (e.g. time of day)
    • Event based (e.g. job completion)
    • On-demand (e.g. AWS Lambda on S3 PUT)
  • Bookmark
    • State
      • Tracks which data has already been processed
      • Lets a job process only new files
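
The bookmark idea above can be sketched as a small piece of state that survives between runs. This is a minimal in-memory sketch, assuming files are identified by name; the Bookmark class and run_job function are hypothetical, and a real job would have to persist the state somewhere durable between runs.

```python
from typing import Iterable, List, Set


class Bookmark:
    """Remembers which input files earlier runs have already processed."""

    def __init__(self) -> None:
        self.processed: Set[str] = set()

    def new_files(self, listing: Iterable[str]) -> List[str]:
        """Return only the files that no earlier run has committed."""
        return sorted(f for f in listing if f not in self.processed)

    def commit(self, files: Iterable[str]) -> None:
        """Record files as processed once a run has finished with them."""
        self.processed.update(files)


def run_job(bookmark: Bookmark, listing: Iterable[str]) -> List[str]:
    """Process the unseen files in the listing and advance the bookmark."""
    todo = bookmark.new_files(listing)
    # The actual transformation of each file would happen here.
    bookmark.commit(todo)
    return todo


bookmark = Bookmark()
first_run = run_job(bookmark, ["a.csv", "b.csv"])
second_run = run_job(bookmark, ["a.csv", "b.csv", "c.csv"])
```

The second run sees the same two files plus a new one and picks up only the new one, which is the "process only new files" behaviour from the notes.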

