Notes on AWS, Big Data, Machine Learning and Leadership: AWS Kinesis Data Streams

Model

Record - individual item put into a stream
Multiple consumers have independent Iterator and can process stream concurrently

Stream

Ordered sequence of data records distributed into shards
- Ordering impacts concurrency (1 shard = 1 logical reader)
Retention: 1 day (default) - 7 days

Shard

Group of data records
Unit of scale and parallelism
Limits
- Operations: 1000 record writes/sec, 5 reads/sec
- Bandwidth: 1MB/s write, 2MB/s read
Provisioned model - reserve capacity upfront (like DynamoDB)
Each shard "owns" a range on the "hash ring"
Status
- OPEN - accepts new records
- CLOSED
  - does not accept any new records (after resharding)
  - has end sequence number
- EXPIRED
  - parent shard (after reshading)
  - all data records exceeded retention period
Resharding
- UpdateShard API
  - Performs manual operations below
- Manual
  - Pairwise (always only 2 shards affected)
  - Splitting
    - Set new range keys for each child (e.g. 30/70)
  - Merging
    - Must be adjacent ranges
- Important to read from ancestor first (to preserve data order)
  - KCL does it

Data Record

Sequence Number

Assigned by Kinesis
Specific to shard in a stream (but not across shards)
Generally increasing over time
- Use SequenceNumberForOrdering for strict ordering

Producer

Puts data into the stream
- PartitionKey
  - Owned by Customer
  - AWS generates MD5 to map to hash key range
  - Used to route to the same shard
  - Strategy
    - managed buffer - high cardinality of PKs to evenly distribute traffic (avoid hot shard)
    - streaming mapper - business concepts, e.g. userId, gameId go to same shard
- Every successful PUT acked with Sequence# (monotonically increasing)
PutRecords
- Batch write (aggregated data)
- Indivitual records in the batch may fail

Consumer

Reads records from the specific shard
Shard Iterator - determines a position in the stream
Application (KCL) on EC2 instance
Lambda
- Pull model (AWS Lambda polls Kinesis streams)
- Request-Response invocation

Collectors (Third Party)

Availability

Performance

100s TB/s (when multiple shards used)
1MB/s ingress - leave some headroom for spiky traffic
2MB/s egress - leave some headroom for catch-up
Monitor
- CloudWatch - per stream level
- Custom metrics: log hash key (i.e. derive MD5 yourself) and log shardId

Kinesis Producer Library (KPL)

Kinesis Client Library (KCL)

Uses DynamoDB to track state
- shardId (Hash Key)
- checkpoint (sequence number for shard)
- parentShardId (ensure parent is processed before processing children)
Coordinates shard-worker mapping
Structure
- Worker
  - Invokes Record Processors (ExecutorService)
    - Each Record Processor responsible for one shard
    - Examples
      - S3Connector
      - RedshiftConnector (uses S3Connector)
      - DynamoDbConnector
Starting Point
- LATEST - tip of the stream
- TRIM_HORIZON - beginning of the stream

Encryption

References

Notes on AWS, Big Data, Machine Learning and Leadership