Notes on AWS, Big Data, Machine Learning and Leadership: AWS Kinesis Streams

Streaming of small, fast-moving data

Model

Stream

Shard

group of data records
unit of scale and parallelism
Limits (per shard)
- Operations: 1,000 record writes/sec, 5 read transactions/sec
- Bandwidth: 1MB/s write, 2MB/s read
Provisioned model - reserve capacity upfront (like DynamoDB)
Owns a range on the "hash ring"
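
A minimal sketch of the hash ring idea: Kinesis takes the MD5 of the partition key as a 128-bit integer, and the record goes to the shard whose range contains it. The helper names (`hash_key`, `even_ranges`, `shard_for_key`) and the evenly split ranges are assumptions for illustration, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields 128-bit values


def hash_key(partition_key: str) -> int:
    """Return the 128-bit integer hash key for a partition key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_ranges(shard_count: int):
    """Split the hash space into contiguous, inclusive (start, end) ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the ring is fully covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key: str, ranges) -> int:
    """Return the index of the range that owns this partition key."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("ranges do not cover the hash space")
```

The same partition key always lands on the same shard, which is what preserves per-key ordering.
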
Status
- OPEN - accepts new records
- CLOSED
  - does not accept any new records (after resharding)
  - has end sequence number
- EXPIRED
  - parent shard (after resharding)
  - all data records exceeded retention period
Resharding
- Pairwise (always only 2 shards affected)
- Splitting
  - Set a new hash key range for each child (e.g. a 30/70 split)
- Merging
  - Shards must own adjacent hash key ranges
- Important to read from ancestor first (to preserve data order)
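
The range arithmetic behind splitting and merging can be sketched as below. This only models the hash key ranges. In the real API the split point is passed to SplitShard as NewStartingHashKey and MergeShards takes the two adjacent shards. The function names and the percent-based split are assumptions for illustration.

```python
def split_range(start: int, end: int, percent: int):
    """Split inclusive range [start, end]; the first child gets ~percent% of it."""
    size = end - start + 1
    child_start = start + size * percent // 100  # integer math, safe for 128-bit keys
    if not start < child_start <= end:
        raise ValueError("split point must fall strictly inside the parent range")
    return (start, child_start - 1), (child_start, end)


def merge_ranges(a, b):
    """Merge two inclusive ranges; they must be adjacent on the hash ring."""
    low, high = sorted([a, b])
    if low[1] + 1 != high[0]:
        raise ValueError("only adjacent ranges can be merged")
    return (low[0], high[1])
```

For example, `split_range(0, 99, 30)` gives `(0, 29)` and `(30, 99)`, and merging those two children restores `(0, 99)`.
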

Data Record

Sequence Number

Producer

Consumer

Collectors (Third Party)

Availability

Performance

100s TB/s (when multiple shards used)
1MB/s ingress per shard - leave some headroom for spiky traffic
2MB/s egress per shard - leave some headroom for catch-up
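
A rough sizing sketch based on the per-shard limits above (1MB/s and 1,000 records/sec in, 2MB/s out). The `headroom` fraction is a hypothetical parameter; how much to reserve depends on how spiky the traffic is.

```python
import math

WRITE_MB_PER_SHARD = 1.0       # ingress bandwidth limit per shard
READ_MB_PER_SHARD = 2.0        # egress bandwidth limit per shard
WRITE_RECORDS_PER_SHARD = 1000  # record writes/sec limit per shard


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec, headroom=0.25):
    """Estimate shard count, keeping `headroom` of each limit unused."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = 1.0 - headroom
    required = max(
        write_mb_per_sec / (WRITE_MB_PER_SHARD * usable),
        records_per_sec / (WRITE_RECORDS_PER_SHARD * usable),
        read_mb_per_sec / (READ_MB_PER_SHARD * usable),
    )
    return max(1, math.ceil(required))
```

The tightest of the three limits decides the shard count, so many tiny records can need more shards than the bandwidth alone suggests.
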
Monitor
- CloudWatch - per stream level
- Custom metrics: log hash key (i.e. derive MD5 yourself) and log shardId
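
A small sketch of the custom metric idea: derive the hash key yourself, log it next to the shardId returned by the put call, then aggregate to spot hot shards. The field names and the `shard_share` helper are assumptions for illustration.

```python
import hashlib
from collections import Counter


def metric_fields(partition_key: str, shard_id: str) -> dict:
    """Build the fields to log for one written record."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return {"hashKey": hash_key, "shardId": shard_id}


def shard_share(log_entries) -> dict:
    """Return the fraction of logged records that landed on each shard."""
    counts = Counter(entry["shardId"] for entry in log_entries)
    total = sum(counts.values())
    return {shard_id: n / total for shard_id, n in counts.items()}
```

A shard with a much larger share than its neighbours is a candidate for splitting. The logged hash keys show where in its range the load concentrates.
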

Kinesis Producer Library (KPL)

Kinesis Client Library (KCL)

Uses DynamoDB to track state
- shardId (Hash Key)
- checkpoint (sequence number for shard)
- parentShardId (ensure parent is processed before processing children)
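
How the tracked state enforces parent-before-child ordering can be sketched with an in-memory stand-in for the DynamoDB table. The dict layout and the `ready_shards` helper are assumptions. The sketch treats a checkpoint of `SHARD_END` as "shard fully processed" and handles only a single parent per shard, which is a simplification for merged shards.

```python
SHARD_END = "SHARD_END"  # sentinel checkpoint: shard has been read to its end


def ready_shards(leases: dict) -> list:
    """Return shard ids that may be processed now.

    `leases` maps shardId -> {"checkpoint": str, "parentShardId": str or None}.
    """
    ready = []
    for shard_id, lease in leases.items():
        if lease["checkpoint"] == SHARD_END:
            continue  # nothing left to read from this shard
        parent_id = lease.get("parentShardId")
        parent = leases.get(parent_id) if parent_id else None
        # A parent missing from the table is treated as already finished.
        if parent is None or parent["checkpoint"] == SHARD_END:
            ready.append(shard_id)
    return sorted(ready)
```

While the parent still has unread records only the parent is returned. Once its checkpoint reaches the end, both children become eligible.
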
Structure
- Worker
  - Invokes Record Processors (ExecutorService)
    - Each Record Processor responsible for one shard
    - Examples
      - S3Connector
      - RedshiftConnector (uses S3Connector)
      - DynamoDbConnector
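
The worker structure above, sketched in Python with `ThreadPoolExecutor` standing in for Java's ExecutorService. The `Worker` and `RecordProcessor` classes here are simplified stand-ins, not the KCL's actual interfaces, and the upper-casing is a placeholder for real processing.

```python
from concurrent.futures import ThreadPoolExecutor


class RecordProcessor:
    """Handles the records of exactly one shard."""

    def __init__(self, shard_id: str):
        self.shard_id = shard_id
        self.processed = []

    def process_records(self, records) -> int:
        for record in records:
            self.processed.append(record.upper())  # placeholder for real work
        return len(records)


class Worker:
    """Owns one RecordProcessor per shard and runs them on a thread pool."""

    def __init__(self, shard_ids):
        self.processors = {sid: RecordProcessor(sid) for sid in shard_ids}

    def run_once(self, batches: dict) -> dict:
        """Process one batch per shard; return the record count per shard."""
        with ThreadPoolExecutor(max_workers=max(1, len(self.processors))) as pool:
            futures = {
                sid: pool.submit(proc.process_records, batches.get(sid, []))
                for sid, proc in self.processors.items()
            }
            return {sid: future.result() for sid, future in futures.items()}
```

Because each processor touches only its own shard, records within a shard stay in order while shards proceed in parallel.
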
Starting Point
- LATEST - tip of the stream
- TRIM_HORIZON - beginning of the stream
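
The two starting points can be illustrated with a toy in-memory shard, where a plain list holds the records still retained. `starting_index` is a hypothetical helper. With a real stream the choice is made when requesting a shard iterator.

```python
def starting_index(retained_records: list, position: str) -> int:
    """Return the index of the first record a new consumer would read."""
    if position == "TRIM_HORIZON":
        return 0  # oldest record still retained
    if position == "LATEST":
        return len(retained_records)  # only records written after this point
    raise ValueError(f"unknown starting position: {position}")


def read_from(retained_records: list, start: int) -> list:
    """Read everything from `start` to the current tip of the shard."""
    return retained_records[start:]
```

A consumer that starts at LATEST sees nothing until new records arrive, while one that starts at TRIM_HORIZON replays the whole retained backlog first.
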

Use Cases

References
