Overview
- Managed Data Catalog and ETL service
- Discovery
- Automatically discovers and categorizes data
- Development
- Generates ETL code
- Deploy
- Runs jobs (serverless)
Data Catalog
- Hive metastore compatible
- AWS Extensions
- Search over metadata
- Connection Info - JDBC URLs
- Classification for identifying types (e.g. Grok expressions)
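A Grok pattern used by a custom classifier is essentially a regex with named fields. A toy illustration (not the Glue API) of how such a pattern recognizes a record format; the primitive patterns and field names here are simplified assumptions:

```python
import re

# Minimal stand-ins for a few Grok primitives (real Grok ships many more).
GROK_PRIMITIVES = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
}

def grok_to_regex(pattern: str) -> str:
    """Expand %{NAME:field} placeholders into named regex groups."""
    def repl(m):
        name, field = m.group(1), m.group(2)
        return f"(?P<{field}>{GROK_PRIMITIVES[name]})"
    return re.sub(r"%\{(\w+):(\w+)\}", repl, pattern)

line = "10.0.0.1 GET 512"
rx = grok_to_regex("%{IP:client} %{WORD:method} %{NUMBER:bytes}")
match = re.match(rx, line)
print(match.groupdict())
# {'client': '10.0.0.1', 'method': 'GET', 'bytes': '512'}
```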
- Compare with Amazon Macie (focused on discovering and classifying sensitive data in S3)
- Versioned as metadata evolves
- Insertion
- Hive DDL
- Bulk Import
- AWS Crawlers
- automatically scan data in S3, infer schemas, and create catalog tables
- detect schema changes
- detect Hive-style partitions
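A crawler can be defined through the Glue `create_crawler` API call. A sketch of the request payload (crawler name, role ARN, database, and S3 path are hypothetical); the schedule and schema-change policy map to the behaviors listed above:

```python
# Sketch: payload for the Glue create_crawler API call. The crawler
# scans the S3 path, infers schemas, detects Hive-style partitions
# (e.g. .../dt=2023-01-01/), and writes tables into the catalog DB.
create_crawler_request = {
    "Name": "orders-crawler",                                   # hypothetical
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",   # hypothetical role
    "DatabaseName": "sales_db",                                 # catalog DB to populate
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/orders/"}]},
    "Schedule": "cron(0 2 * * ? *)",                            # daily at 02:00 UTC
    "SchemaChangePolicy": {                                     # how schema drift is handled
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
}
# With boto3 this would be passed as:
#   boto3.client("glue").create_crawler(**create_crawler_request)
```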
- Highly available
- Integrations
- Athena, Redshift Spectrum, EMR (replaces a Hive metastore backed by a local MySQL database)
- Source
- RDS, S3 (CSV, TSV, JSON, Parquet, ...), Redshift, databases running on EC2
Job Authoring
- Lets you describe transformations declaratively
- Given a source and a target, Glue generates the ETL code
- Generated code is Python (PySpark)
- Development endpoint
- Lets you spin up a Zeppelin notebook and connect it for interactive development
- Dynamic Frame
- Wraps a Spark DataFrame
- Infers schema on-the-fly
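The "infer schema on the fly" idea can be illustrated with a toy example (this is not the awsglue API): scan the records and union the observed types per field, similar in spirit to how a DynamicFrame tolerates inconsistent types with choice types instead of failing against a fixed schema:

```python
# Toy schema inference: collect every type seen for each field.
def infer_schema(records):
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: sorted(types) for f, types in schema.items()}

records = [
    {"id": 1, "price": 9.99},
    {"id": "2", "price": 10},   # inconsistent types across rows
]
print(infer_schema(records))
# {'id': ['int', 'str'], 'price': ['float', 'int']}
```

A field with more than one observed type corresponds to what a DynamicFrame surfaces as a choice to resolve later, rather than a load-time error.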
Job Execution
- Execution of the ETL code (serverless)
- Triggers
- Schedule based (e.g. time of day)
- Event based (e.g. job completion)
- On-demand (e.g. AWS Lambda on S3 PUT)
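The on-demand case can be sketched as a Lambda handler that starts a Glue job when S3 reports a PUT. The job name is hypothetical, and the Glue client is injected as an extra argument (the standard Lambda signature is `(event, context)`) so the handler can be exercised locally with a stub:

```python
# Sketch: S3 invokes this handler on PUT; the handler starts a Glue job,
# passing the new object's path as a job argument.
def handler(event, context, glue_client):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return glue_client.start_job_run(
        JobName="orders-etl",   # hypothetical Glue job
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )

class FakeGlue:
    """Stand-in for boto3.client('glue'), for local testing only."""
    def start_job_run(self, **kwargs):
        self.last_call = kwargs
        return {"JobRunId": "jr_local_test"}

# Shape of an S3 PUT notification event (trimmed to the used fields).
event = {"Records": [{"s3": {"bucket": {"name": "example-bucket"},
                             "object": {"key": "orders/2023/file.csv"}}}]}
fake = FakeGlue()
resp = handler(event, None, fake)
print(resp["JobRunId"], fake.last_call["Arguments"]["--input_path"])
```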
- Bookmarks
- Persist state about what data was already processed
- Process only new files on the next run
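The bookmark idea reduces to persisting the set of already-processed inputs between runs. A toy illustration (not the Glue implementation):

```python
# Toy job bookmark: each run handles only files not seen before.
def new_files(all_files, bookmark):
    return [f for f in all_files if f not in bookmark]

bookmark = set()                                # state persisted between runs
run1 = new_files(["a.csv", "b.csv"], bookmark)  # first run: everything is new
bookmark.update(run1)

run2 = new_files(["a.csv", "b.csv", "c.csv"], bookmark)  # only c.csv is new
bookmark.update(run2)
print(run1, run2)
# ['a.csv', 'b.csv'] ['c.csv']
```

In Glue the state lives with the job (and survives between runs); here it is just an in-memory set for illustration.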