Sunday, 11 March 2018

AWS Kinesis Data Streams


Model
  • Record - individual item put into a stream
  • Multiple consumers have independent Iterator and can process stream concurrently

Stream
  • Ordered sequence of data records distributed into shards
    • Ordering impacts concurrency (1 shard = 1 logical reader)
  • Retention: 1 day (default) - 7 days

Shard 
  • Group of data records
  • Unit of scale and parallelism
  • Limits
    • Operations: 1000 record writes/sec, 5 reads/sec
    • Bandwidth: 1MB/s write, 2MB/s read
  • Provisioned model - reserve capacity upfront (like DynamoDB)
  • Each shard "owns" a range on the "hash ring"
  • Status
    • OPEN - accepts new records
    • CLOSED
      • does not accept any new records (after resharding)
      • has end sequence number
    • EXPIRED 
      • parent shard (after reshading)
      • all data records exceeded retention period
  • Resharding
    • UpdateShard API
      • Performs manual operations below
    • Manual
      • Pairwise (always only 2 shards affected) 
      • Splitting
        • Set new range keys for each child (e.g. 30/70)
      • Merging
        • Must be adjacent ranges
    • Important to read from ancestor first (to preserve data order)
      • KCL does it

Data Record
  • Unit of data stored in Kinesis
  • Structure
    • Sequence Number
    • Partition Key
    • Data blob (max 1MB)

Sequence Number
  • Assigned by Kinesis
  • Specific to shard in a stream (but not across shards)
  • Generally increasing over time
    • Use SequenceNumberForOrdering for strict ordering

Producer
  • Puts data into the stream 
    • PartitionKey
      • Owned by Customer
      • AWS generates MD5 to map to hash key range
      • Used to route to the same shard 
      • Strategy
        • managed buffer - high cardinality of PKs to evenly distribute traffic (avoid hot shard)
        • streaming mapper - business concepts, e.g. userId, gameId go to same shard
    • Every successful PUT acked with Sequence# (monotonically increasing)
  • PutRecords
    • Batch write (aggregated data)
    • Indivitual records in the batch may fail

Consumer
  • Reads records from the specific shard
  • Shard Iterator - determines a position in the stream
  • Application (KCL) on EC2 instance
  • Lambda
    • Pull model (AWS Lambda polls Kinesis streams)
    • Request-Response invocation

Collectors (Third Party)
  • Preaggregate data before PUT
  • Tools

Availability
  • Data replicated across 3 AZ
  • Durable

Performance
  • 100s TB/s (when multiple shards used)
  • 1MB/s ingress - leave some headroom for spiky traffic 
  • 2MB/s egress - leave some headroom for catch-up
  • Monitor
    • CloudWatch - per stream level
    • Custom metrics: log hash key (i.e. derive MD5 yourself) and log shardId

Kinesis Producer Library (KPL)
  • Aggregates data into 1 MB blocks (per-shard)
  • Selective retries on PutRecords
  • Rate limiting
  • High-throughput

Kinesis Client Library (KCL)
  • Uses DynamoDB to track state
    • shardId (Hash Key)
    • checkpoint (sequence number for shard)
    • parentShardId (ensure parent is processed before processing children)
  • Coordinates shard-worker mapping
  • Structure
    • Worker 
      • Invokes Record Processors  (ExecutorService)
        • Each Record Processor responsible for one shard 
        • Examples
          • S3Connector
          • RedshiftConnector (uses S3Connector)
          • DynamoDbConnector
  • Starting Point
    • LATEST - tip of the stream
    • TRIM_HORIZON - beginning of the stream

Encryption
  • SSE-KMS supported

References

No comments:

Post a Comment