Saturday, 10 March 2018

Stream Processing (Windowing)


Event Time
  • when did the event took place
  • aka client-side time
  • may be unreliable (e.g.  unreliable clock on a mobile devices, connectivity issues, queuing)

Ingest time
  • when was the record added to the streaming source (received)
  • aka server-side time
  • available as ApproximateArrivalTimeStamp

Processing Time
  • the time event was processed (observed) by the pipeline
  • Processes uniformly
  • Processing Time >= Event Time (time skew)


Windowing 
  • Defines computation bounds 
    • Where to start/end computation over continous (unbounded) stream
  • Tumbling
    • Aligned - all windows start at the same time (e.g. 13:00 UTC)
    • Unaligned - windows for various keys start at different times
    • Non-overlapping
      • grouped keys do not overlap
      • Each datapoint belongs to only one window
    • Result at the of the window
    • Use case
      • Periodic reporting
      • Leader-boards
      • Aggregating data over a minute
  • Sliding
    • Rolling window
    • Overlapping
    • result immediately
    • Use case
      • Real-time operational analytics
  • Session (dynamic)
    • Sequenece of events terminated by period of inactivity
    • Length cannot be determined a priori
    • Almost always unaligned (every key starts at different times)
    • Examples
      • Analyzing user behavior (e.g. clickstream logs)

Windowing by Processing Time
  • Pros
    • Simple (just buffer the data)
    • Completeness is straightforward (no concept of "late data" here)
    • Good fit if you want to infer information about the source as it is observed
      • e.g. Monitoring
  • Cons
    • If Event Time <> Processing Time the result is not representative
      • Mix of old and new data

Windowing by Event Time
  • Gold standard
  • Pros
    • Correct in the face of event time/processing time skew
  • Cons
    • "Event Time" Windows must live longer than their length
      • Buffering 
        • Disk is cheap 
        • Many useful aggregations do not require entire data set (e.g. sum, avg)
      • Completenes
        • You can never be sure when the window is actually completed (e.g. late data will never arrive). 

Watermark
  • Temporal notion of input completeness in the "event-time domain"
  • F(P) -> E
    • Based on current processing time (all input gathered so far) what is the maximum Event Time the systems thinks has all the data
      • No "late data" <  E will arrive
    • Once I got to E nothing earlier than E will ever arrive
  • Problems
    • Too slow (latency) - materializing output delayed only due to watermark (no late data ever arrives)

Perfect Watermark
  • Complete knowledge of all input data
  • No problem of "late data"
  • Rare in practice

Heuristic Watermark
  • Estimate the progress
  • It can make mistakes
  • Problems
    • Too fast - make incorrect guess about completeness and materialize the window before late data arrived

Trigger
  • When "in processing time" should the results be materialized
  • Decison based on
    • Watermark progress (after you reached event E)
    • Processing Time progress - regular periodic updates
    • Element counts - number of elements observed
    • Punctuation - record has a special meaning 

References
  • https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

No comments:

Post a Comment