Overview
- Manages data workflow orchestration: dependencies, scheduling, alerting.
- Can be used for scheduling EMR jobs
- Can be used for ETL
- Compare with AWS Glue (managed ETL service)
Model
- Pipeline - definition of the workflow
- Precondition - rule to check before activity is run (e.g. S3KeyExists, DynamoDBDataExists)
- Schedule - when to run activity (e.g. daily, weekly)
- Task Runner - polls Data Pipeline for tasks and performs them
  - AWS can install it automatically on the resources it launches (e.g. EC2 instances)
  - Install it manually on a long-running EC2 instance or on physical hardware
Data Node
- representation of the business data a pipeline reads or writes
- types
  - DynamoDBDataNode
  - SqlDataNode
  - RedshiftDataNode
  - S3DataNode
Activity
- action initiated by the pipeline
- types
  - CopyActivity
  - HiveCopyActivity
  - RedshiftCopyActivity
  - HiveActivity
  - PigActivity
  - EmrActivity
  - HadoopActivity
  - ShellCommandActivity
  - SqlActivity
Resource
- compute resource that performs the work an activity specifies
- types
  - Ec2Resource
  - EmrCluster
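
To see how the pieces of the model fit together, here is a minimal sketch of a pipeline definition built as a Python dict and dumped to JSON. The object ids (DailySchedule, InputReady, and so on), the bucket paths, and the helper functions ref, build_definition and unresolved_refs are all made up for illustration. The field names (type, schedule, precondition, input, output, runsOn, filePath, s3Key, period, terminateAfter) follow the pipeline definition file syntax as I understand it, so check them against the AWS documentation before relying on them.

```python
import json


def ref(object_id):
    """Return a reference to another pipeline object by its id."""
    return {"ref": object_id}


def build_definition():
    """Assemble a hypothetical daily S3-to-S3 copy pipeline definition."""
    objects = [
        # Schedule: when the activity runs.
        {"id": "DailySchedule", "type": "Schedule",
         "startDateTime": "2024-01-01T00:00:00", "period": "1 day"},
        # Precondition: rule checked before the activity is run.
        {"id": "InputReady", "type": "S3KeyExists",
         "s3Key": "s3://example-bucket/input/data.csv"},
        # Data nodes: where the data comes from and where it goes.
        {"id": "Input", "type": "S3DataNode",
         "schedule": ref("DailySchedule"),
         "filePath": "s3://example-bucket/input/data.csv"},
        {"id": "Output", "type": "S3DataNode",
         "schedule": ref("DailySchedule"),
         "filePath": "s3://example-bucket/output/data.csv"},
        # Resource: the machine that performs the work.
        {"id": "Worker", "type": "Ec2Resource",
         "schedule": ref("DailySchedule"), "terminateAfter": "1 Hour"},
        # Activity: the action itself, wired to everything above.
        {"id": "CopyData", "type": "CopyActivity",
         "schedule": ref("DailySchedule"),
         "precondition": ref("InputReady"),
         "input": ref("Input"), "output": ref("Output"),
         "runsOn": ref("Worker")},
    ]
    return {"objects": objects}


def unresolved_refs(definition):
    """List (object id, field, target id) for refs that point at no object."""
    ids = {obj["id"] for obj in definition["objects"]}
    missing = []
    for obj in definition["objects"]:
        for field, value in obj.items():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                missing.append((obj["id"], field, value["ref"]))
    return missing


if __name__ == "__main__":
    definition = build_definition()
    # Fail early if any object references an id that was never defined.
    assert not unresolved_refs(definition)
    print(json.dumps(definition, indent=2))
```

The printed JSON has the shape of a pipeline definition file: one flat list of objects, with the activity pointing at its schedule, precondition, data nodes and resource by id. The unresolved_refs check is only a local sanity check and is not part of any AWS API.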
Templates
- Export DynamoDB to S3
- Export S3 to DynamoDB
- Copy S3 to RDS
- Copy RDS to S3
- Run Hive Analytics on S3 data
- Copy on-premises MySQL to RDS