Sunday, 11 March 2018

AWS Data Pipeline

Overview
  • Manages data workflow orchestration: dependencies, scheduling, alerting.
  • Can be used for scheduling EMR jobs
  • Can be used for ETL
    • Compare with AWS Glue

Model
  • Pipeline - definition of the workflow
  • Precondition - rule to check before activity is run (e.g. S3KeyExists, DynamoDBDataExists)
  • Schedule - when to run activity (e.g. daily, weekly)
  • Task Runner - polls AWS Data Pipeline for tasks and performs them
    • AWS can automatically install it on the resources it launches (e.g. EC2 instances)
    • Can be installed manually on a long-running EC2 instance or on-premises hardware
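
A minimal sketch of how the model pieces above fit together in a pipeline definition, built here as a Python dict and dumped to JSON. The ids, the S3 path, the command, and the worker group name are hypothetical; the field names (period, s3Key, precondition, workerGroup) are my reading of the pipeline definition file syntax, and a real definition would also need IAM roles (typically on a Default object), which are omitted here.

```python
import json

# All ids, paths and names below are hypothetical placeholders.
schedule = {
    "id": "DailySchedule",
    "type": "Schedule",
    "period": "1 day",
    "startDateTime": "2018-03-12T00:00:00",
}

# Precondition: the activity waits until this S3 key exists.
input_ready = {
    "id": "InputReady",
    "type": "S3KeyExists",
    "s3Key": "s3://example-bucket/input/ready.flag",
}

# workerGroup routes the task to a manually installed Task Runner that
# polls for the same worker group name, instead of a managed resource.
activity = {
    "id": "NightlyJob",
    "type": "ShellCommandActivity",
    "command": "echo hello",
    "schedule": {"ref": "DailySchedule"},
    "precondition": {"ref": "InputReady"},
    "workerGroup": "example-worker-group",
}

definition = {"objects": [schedule, input_ready, activity]}
definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

Objects point at each other through {"ref": "<id>"}, so the schedule and the precondition are declared once and can be shared by several activities.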

Data Node
  • Representation of business data
  • types
    • DynamoDBDataNode
    • SqlDataNode
    • RedshiftDataNode
    • S3DataNode
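
As a rough illustration of the data node types, the sketch below describes one S3 file and one DynamoDB table as pipeline objects. The helper functions, ids, bucket path, and table name are hypothetical; filePath and tableName are the fields I understand these node types to use.

```python
import json


def s3_node(node_id, path):
    """Describe a single S3 file as an S3DataNode object."""
    return {"id": node_id, "type": "S3DataNode", "filePath": path}


def dynamodb_node(node_id, table):
    """Describe a DynamoDB table as a DynamoDBDataNode object."""
    return {"id": node_id, "type": "DynamoDBDataNode", "tableName": table}


# Hypothetical ids, path and table name.
nodes = [
    s3_node("RawEvents", "s3://example-bucket/raw/events.csv"),
    dynamodb_node("EventsTable", "example-events"),
]
print(json.dumps(nodes, indent=2))
```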

Activity
  • Action initiated by the pipeline
  • types
    • CopyActivity
    • HiveCopyActivity
    • RedshiftCopyActivity
    • HiveActivity
    • PigActivity
    • EmrActivity
    • HadoopActivity
    • ShellCommandActivity
    • SqlActivity

Resource
  • Computational resource that performs the work the pipeline specifies
  • types
    • Ec2Resource
    • EmrCluster
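
A small sketch of how data nodes, an activity, and a resource reference one another: a CopyActivity reads one S3DataNode, writes another, and runs on an Ec2Resource. The ids, paths, and instance settings are hypothetical, and unresolved_refs is a helper written for this note, not part of any AWS tooling; it only checks that every {"ref": ...} names an object that exists.

```python
def unresolved_refs(objects):
    """Return (object id, field, target) for each ref that names no object."""
    known = {obj["id"] for obj in objects}
    missing = []
    for obj in objects:
        for field, value in obj.items():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in known:
                missing.append((obj["id"], field, value["ref"]))
    return missing


# Hypothetical ids, paths and instance settings.
objects = [
    {"id": "Worker", "type": "Ec2Resource",
     "instanceType": "m1.medium", "terminateAfter": "2 Hours"},
    {"id": "Source", "type": "S3DataNode",
     "filePath": "s3://example-bucket/in/data.csv"},
    {"id": "Target", "type": "S3DataNode",
     "filePath": "s3://example-bucket/out/data.csv"},
    {"id": "Copy", "type": "CopyActivity",
     "input": {"ref": "Source"},
     "output": {"ref": "Target"},
     "runsOn": {"ref": "Worker"}},
]

print(unresolved_refs(objects))  # → []
```

Using runsOn asks AWS Data Pipeline to launch and manage the resource; an activity handled by a manually installed Task Runner would name a worker group instead.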

Templates
  • Export DynamoDB to S3
  • Export S3 to DynamoDB
  • Copy S3 to RDS
  • Copy RDS to S3
  • Run Hive Analytics on S3 data
  • Copy on-premises MySQL to RDS


