Model
- Record - individual item put into a stream
- Multiple consumers have independent Iterator and can process stream concurrently
Stream
- Ordered sequence of data records distributed into shards
- Ordering impacts concurrency (1 shard = 1 logical reader)
- Retention: 1 day (default) - 7 days
Shard
- Group of data records
- Unit of scale and parallelism
- Limits
- Operations: 1000 record writes/sec, 5 reads/sec
- Bandwidth: 1MB/s write, 2MB/s read
- Provisioned model - reserve capacity upfront (like DynamoDB)
- Each shard "owns" a range on the "hash ring"
- Status
- OPEN - accepts new records
- CLOSED
- does not accept any new records (after resharding)
- has end sequence number
- EXPIRED
- parent shard (after reshading)
- all data records exceeded retention period
- Resharding
- UpdateShard API
- Performs manual operations below
- Manual
- Pairwise (always only 2 shards affected)
- Splitting
- Set new range keys for each child (e.g. 30/70)
- Merging
- Must be adjacent ranges
- Important to read from ancestor first (to preserve data order)
- KCL does it
- UpdateShard API
Data Record
- Unit of data stored in Kinesis
- Structure
- Sequence Number
- Partition Key
- Data blob (max 1MB)
Sequence Number
- Assigned by Kinesis
- Specific to shard in a stream (but not across shards)
- Generally increasing over time
- Use SequenceNumberForOrdering for strict ordering
Producer
- Puts data into the stream
- PartitionKey
- Owned by Customer
- AWS generates MD5 to map to hash key range
- Used to route to the same shard
- Strategy
- managed buffer - high cardinality of PKs to evenly distribute traffic (avoid hot shard)
- streaming mapper - business concepts, e.g. userId, gameId go to same shard
- Every successful PUT acked with Sequence# (monotonically increasing)
- PartitionKey
- PutRecords
- Batch write (aggregated data)
- Indivitual records in the batch may fail
Consumer
- Reads records from the specific shard
- Shard Iterator - determines a position in the stream
- Application (KCL) on EC2 instance
- Lambda
- Pull model (AWS Lambda polls Kinesis streams)
- Request-Response invocation
Collectors (Third Party)
- Preaggregate data before PUT
- Tools
- Fluentd: http://docs.fluentd.org/articles/quickstart
- input - where to read logs from
- buffer - in-memory if output fails
- output - passes to persistent storage (e.g. S3, SQS, Redis, MongoDB)
- Apache Flume: https://flume.apache.org/
- agent - collects the log and sends to collector
- collector - stores in permanent store
- Fluentd: http://docs.fluentd.org/articles/quickstart
Availability
- Data replicated across 3 AZ
- Durable
Performance
- 100s TB/s (when multiple shards used)
- 1MB/s ingress - leave some headroom for spiky traffic
- 2MB/s egress - leave some headroom for catch-up
- Monitor
- CloudWatch - per stream level
- Custom metrics: log hash key (i.e. derive MD5 yourself) and log shardId
Kinesis Producer Library (KPL)
- Aggregates data into 1 MB blocks (per-shard)
- Selective retries on PutRecords
- Rate limiting
- High-throughput
Kinesis Client Library (KCL)
- Uses DynamoDB to track state
- shardId (Hash Key)
- checkpoint (sequence number for shard)
- parentShardId (ensure parent is processed before processing children)
- Coordinates shard-worker mapping
- Structure
- Worker
- Invokes Record Processors (ExecutorService)
- Each Record Processor responsible for one shard
- Examples
- S3Connector
- RedshiftConnector (uses S3Connector)
- DynamoDbConnector
- Invokes Record Processors (ExecutorService)
- Worker
- Starting Point
- LATEST - tip of the stream
- TRIM_HORIZON - beginning of the stream
Encryption
- SSE-KMS supported
References
- https://www.youtube.com/watch?v=8u9wIC1xNt8&list=WL&index=2
- https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101,
- http://engineering.life360.com/engineering/2017/09/25/streaming-with-kinesis-on-AWS/
- https://aws.amazon.com/blogs/big-data/implementing-efficient-and-reliable-producers-with-the-amazon-kinesis-producer-library/
- https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-kinesis-producer-library
No comments:
Post a Comment