Tuesday, 3 May 2016

AWS Kinesis Streams

Streaming of small fast moving data

Model
  • Record - individual item put into a stream
  • Multiple consumers have independent cursor and can process stream concurrently

Stream
  • Ordered sequence of data records distributed into shards
  • Retention: 1 day (default) - 7 days

Shard 
  • group of data records
  • unit of scale and parallelism
  • Limits
    • Operations: 1000 record writes/sec, 5 reads/sec
    • Bandwidth: 1MB/s write, 2MB/s read
  • Provisioned model - reserve capacity upfront (like DynamoDB)
  • Owns a range on the "hash ring"
  • Status
    • OPEN - accepts new records
    • CLOSED
      • does not accept any new records (after resharding)
      • has end sequence number
    • EXPIRED 
      • parent shard (after reshading)
      • all data records exceeded retention period
  • Resharding
    • Pairwise (always only 2 shards affected) 
    • Splitting
      • Set new range keys for each child (e.g. 30/70)
    • Merging
      • Must be adjacent ranges
    • Important to read from ancestor first (to preserve data order)

Data Record
  • Unit of data stored in Kinesis
  • Structure
    • Sequence Number
    • Partition Key
    • Data blob (max 1MB)

Sequence Number
  • Assigned by Kinesis
  • Specific to shard in a stream (not across shards)
  • Generally increasing over time
    • Use SequenceNumberForOrdering for strict ordering

Producer
  • Puts data into the stream 
    • PartitionKey
      • Owned by Customer
      • AWS generates MD5 to map to hash key range
      • Used to route to the same shard 
      • Strategy
        • managed buffer - high cardinality of PKs to evenly distribute traffic (avoid hot shard)
        • streaming mapper - business concepts, e.g. userId, gameId go to same shard
    • Every successful PUT acked with Sequence# (monotonically increasing)
  • PutRecords
    • Batch write (aggregated data)
    • Indivitual records in the batch may fail
Consumer
  • Reads records from the specific shard
  • Shard Iterator - determines a position in the stream
  • Application (KCL) on EC2 instance
  • Lambda
    • Pull model (AWS Lambda polls Kinesis streams)
    • Request-Response invocation

Collectors (Third Party)
  • Preaggregate data before PUT
  • Tools

Availability
  • Data replicated across 3 AZ
  • Durable

Performance
  • 100s TB/s (when multiple shards used)
  • 1MB/s ingress - leave some headroom for spiky traffic 
  • 2MB/s egress - leave some headroom for catch-up
  • Monitor
    • CloudWatch - per stream level
    • Custom metrics: log hash key (i.e. derive MD5 yourself) and log shardId

Kinesis Producer Library (KPL)
  • Aggregates data into 1 MB blocks (per-shard)
  • Selective retries on PutRecords
  • Rate limiting

Kinesis Client Library (KCL)
  • Uses DynamoDB to track state
    • shardId (Hash Key)
    • checkpoint (sequence number for shard)
    • parentShardId (ensure parent is processed before processing children)
  • Structure
    • Worker 
      • Invokes Record Processors  (ExecutorService)
        • Each Record Processor responsible for one shard 
        • Examples
          • S3Connector
          • RedshiftConnector (uses S3Connector)
          • DynamoDbConnector
  • Statring Point
    • LATEST - tip of the stream
    • TRIM_HORIZON - beginning of the stream

Use Cases
  • Advertising data aggregation
  • Logs ingestion
  • Transaction/order data collection
  • Dashboard/metrics - time series, windowed analytics, weighted averages

References

No comments:

Post a Comment