Event Time
- when the event took place
- aka client-side time
- may be unreliable (e.g. an unreliable clock on a mobile device, connectivity issues, queuing)
Ingest time
- when the record was added to the streaming source (received)
- aka server-side time
- available as ApproximateArrivalTimestamp (e.g. on Amazon Kinesis records)
Processing Time
- when the event was processed (observed) by the pipeline
- Progresses uniformly
- Processing Time >= Event Time (time skew)
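The skew between the two clocks can be measured per record. A minimal sketch, assuming each record carries its event time and the pipeline notes the time it observed the record; the function name and the sample timestamps are made up for illustration:

```python
from datetime import datetime, timezone

def time_skew_seconds(event_time: datetime, processing_time: datetime) -> float:
    """Return how far processing time lags behind event time, in seconds."""
    return (processing_time - event_time).total_seconds()

# Hypothetical record: created on the device at 12:00:00, observed by the pipeline at 12:00:07.
event_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
processing_time = datetime(2024, 1, 1, 12, 0, 7, tzinfo=timezone.utc)

print(time_skew_seconds(event_time, processing_time))  # 7.0
```

A negative result would point at an unreliable client clock rather than a real event from the future.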
Windowing
- Defines computation bounds
- Where to start/end computation over a continuous (unbounded) stream
- Tumbling
- Aligned - all windows start at the same time (e.g. 13:00 UTC)
- Unaligned - windows for various keys start at different times
- Non-overlapping
- grouped keys do not overlap
- Each datapoint belongs to only one window
- Result at the end of the window
- Use case
- Periodic reporting
- Leader-boards
- Aggregating data over a minute
- Sliding
- Rolling window
- Overlapping
- Result available immediately
- Use case
- Real-time operational analytics
- Session (dynamic)
- Sequence of events terminated by a period of inactivity
- Length cannot be determined a priori
- Almost always unaligned (every key starts at different times)
- Examples
- Analyzing user behavior (e.g. clickstream logs)
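The three window types differ mainly in how a timestamp is mapped to one or more windows. A rough sketch, assuming integer timestamps in seconds and aligned windows that start at multiples of the window size; all function names are hypothetical and not taken from any streaming framework:

```python
from typing import List, Tuple

Window = Tuple[int, int]  # tumbling/sliding: [start, end); session: (first event, last event)

def tumbling_window(ts: int, size: int) -> Window:
    """Each timestamp belongs to exactly one fixed-size window."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts: int, size: int, slide: int) -> List[Window]:
    """All windows of length `size`, starting every `slide` seconds, that contain ts."""
    windows = []
    start = ts - ts % slide  # latest window start at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Group one key's timestamps into sessions closed by `gap` seconds of inactivity."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1] = (sessions[-1][0], ts)  # extend the open session
        else:
            sessions.append((ts, ts))             # inactivity gap: start a new session
    return sessions

print(tumbling_window(125, size=60))                # (120, 180)
print(sliding_windows(125, size=60, slide=30))      # [(90, 150), (120, 180)]
print(session_windows([1, 3, 4, 20, 22], gap=10))   # [(1, 4), (20, 22)]
```

Note that the session boundaries depend on the data itself, which is why their length cannot be known up front.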
Windowing by Processing Time
- Pros
- Simple (just buffer the data)
- Completeness is straightforward (no concept of "late data" here)
- Good fit if you want to infer information about the source as it is observed
- e.g. Monitoring
- Cons
- If Event Time <> Processing Time the result is not representative
- Mix of old and new data
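The "mix of old and new data" problem shows up as soon as one record is delayed. A small sketch with made-up data, assuming 60-second tumbling windows and events given as (event time, processing time) pairs in seconds:

```python
from collections import defaultdict

# Hypothetical events as (event_time, processing_time) in seconds.
# The second event was delayed (e.g. the device was offline) and arrived two minutes late.
events = [(5, 6), (50, 170), (65, 66), (130, 131)]

def count_per_window(timestamps, size=60):
    """Count timestamps per tumbling window, keyed by window start."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[ts - ts % size] += 1
    return dict(sorted(counts.items()))

by_event_time = count_per_window(e for e, _ in events)
by_processing_time = count_per_window(p for _, p in events)

print(by_event_time)       # {0: 2, 60: 1, 120: 1}
print(by_processing_time)  # {0: 1, 60: 1, 120: 2}
```

Grouped by processing time, the delayed event is counted in the 120-180 window even though it happened in the first minute.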
Windowing by Event Time
- Gold standard
- Pros
- Correct in the face of event time/processing time skew
- Cons
- "Event Time" Windows must live longer than their length
- Buffering
- Disk is cheap
- Many useful aggregations do not require entire data set (e.g. sum, avg)
- Completeness
- You can never be sure that a window is actually complete (late data may still arrive).
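The buffering cost is often smaller than it sounds, because aggregations such as sum and average only need a running total per window. A minimal sketch, assuming 60-second tumbling windows, a fixed allowed lateness, and an externally supplied estimate of event-time progress (what these notes call a watermark further down); the class names and parameters are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class RunningAverage:
    total: float = 0.0
    count: int = 0

    def add(self, value: float) -> None:
        self.total += value
        self.count += 1

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0

class EventTimeWindows:
    def __init__(self, size: int, allowed_lateness: int):
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.state = {}  # window start -> RunningAverage (no raw events are kept)

    def add(self, event_time: int, value: float) -> None:
        start = event_time - event_time % self.size
        self.state.setdefault(start, RunningAverage()).add(value)

    def expire(self, watermark: int) -> dict:
        """Emit and drop windows whose end plus allowed lateness is at or before the watermark."""
        done = [s for s in self.state
                if s + self.size + self.allowed_lateness <= watermark]
        return {s: self.state.pop(s).average for s in sorted(done)}

windows = EventTimeWindows(size=60, allowed_lateness=30)
windows.add(10, 2.0)
windows.add(70, 10.0)
windows.add(50, 4.0)                  # out-of-order event, still lands in window 0
print(windows.expire(watermark=80))   # {}  (window 0 is kept past its end at 60)
print(windows.expire(watermark=95))   # {0: 3.0}
```

The window for 0-60 stays in memory until 90, which is the "must live longer than their length" cost in practice.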
Watermark
- Temporal notion of input completeness in the "event-time domain"
- F(P) -> E
- Based on the current processing time (all input gathered so far), what is the maximum Event Time for which the system believes it has all the data
- No "late data" < E will arrive
- Once the watermark reaches E, nothing earlier than E should ever arrive
- Problems
- Too slow (latency) - output is delayed while waiting for the watermark, even if no late data ever arrives
Perfect Watermark
- Complete knowledge of all input data
- No problem of "late data"
- Rare in practice
Heuristic Watermark
- Estimate the progress
- It can make mistakes
- Problems
- Too fast - makes an incorrect guess about completeness and materializes the window before late data has arrived
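One common heuristic is to trail the largest event time seen so far by a fixed delay. The sketch below assumes that approach; the class and its `max_expected_delay` parameter are hypothetical, and real systems may estimate progress quite differently:

```python
class HeuristicWatermark:
    """Estimate event-time progress as (max event time seen) - (expected delay)."""

    def __init__(self, max_expected_delay: int):
        self.max_expected_delay = max_expected_delay
        self.max_event_time = None

    def current(self) -> float:
        if self.max_event_time is None:
            return float("-inf")  # nothing observed yet, so nothing is complete
        return self.max_event_time - self.max_expected_delay

    def observe(self, event_time: int) -> bool:
        """Record an event; return True if it is late (behind the current watermark)."""
        late = event_time < self.current()
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return late

wm = HeuristicWatermark(max_expected_delay=10)
print(wm.observe(100))  # False, watermark is now 90
print(wm.observe(95))   # False, out of order but not late
print(wm.observe(85))   # True, the guess was too fast for this event
```

A larger delay makes late data rarer but pushes the watermark toward the "too slow" problem.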
Trigger
- When "in processing time" should the results be materialized
- Decision based on
- Watermark progress (after you reached event E)
- Processing Time progress - regular periodic updates
- Element counts - number of elements observed
- Punctuation - record has a special meaning
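The four signals above can be combined into a single firing decision. A simplified sketch, assuming per-window state tracked by the pipeline; `WindowState`, `should_fire` and their fields are invented for illustration and do not mirror any framework's trigger API:

```python
from dataclasses import dataclass

@dataclass
class WindowState:
    end: int                       # event-time end of the window, in seconds
    pending_elements: int = 0      # elements added since the last emission
    last_fired_at: float = 0.0     # processing time of the last emission
    saw_punctuation: bool = False  # a record with special meaning was observed

def should_fire(window: WindowState, watermark: float, now: float,
                period: float = None, max_elements: int = None) -> bool:
    """Decide, in processing time, whether to materialize the window's result."""
    if watermark >= window.end:                                   # watermark progress
        return True
    if period is not None and now - window.last_fired_at >= period:  # processing time progress
        return True
    if max_elements is not None and window.pending_elements >= max_elements:  # element count
        return True
    return window.saw_punctuation                                 # punctuation

state = WindowState(end=60, pending_elements=3, last_fired_at=100.0)
print(should_fire(state, watermark=30, now=105.0))                  # False
print(should_fire(state, watermark=60, now=105.0))                  # True
print(should_fire(state, watermark=30, now=110.0, period=10))       # True
print(should_fire(state, watermark=30, now=105.0, max_elements=3))  # True
```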
References
- https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101