Monday, 12 March 2018

Columnar File Formats

Benefits
  • Contain indexes and summary statistics that allow readers to skip entire blocks
    • Compare with skip lists
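  • Queries typically do not need the entire row, so only the needed columns are read
  • Compress well, since values of the same type are stored together
    • Possible to use column-specific encodings (e.g. run-length, dictionary)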
  • Suitable for Vectorized Query Execution
    • Exploiting CPU Single Instruction Multiple Data (SIMD)
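
The block-skipping benefit above can be sketched in a few lines. This is a minimal illustration rather than how any particular format implements it: the `Block` class, `make_blocks`, and `scan_greater_than` are hypothetical names, and real formats keep richer statistics than a single min/max pair.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One block of a single column plus its summary statistics (hypothetical layout)."""
    values: List[int]
    min_value: int
    max_value: int


def make_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max for each."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_greater_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of blocks skipped via statistics."""
    matches: List[int] = []
    skipped = 0
    for block in blocks:
        # If even the largest value fails the predicate, the block is never read.
        if block.max_value <= threshold:
            skipped += 1
            continue
        matches.extend(v for v in block.values if v > threshold)
    return matches, skipped


column = list(range(100))
blocks = make_blocks(column, block_size=10)
matches, skipped = scan_greater_than(blocks, threshold=84)
print(len(matches), "matching values;", skipped, "of", len(blocks), "blocks skipped")
```

Skipping works best when the column is sorted or clustered, as in this example; on randomly ordered data most blocks span the full value range and few can be skipped.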


ORC
  • Used by Facebook
  • Developed at Hortonworks, in collaboration with Facebook
  • Structure
      • Contains list of stripes
        • Default stripe size = 64MB (same as the older HDFS default block size)
          • Bigger = more efficient reads
          • Smaller = more efficient memory use, more splits
        • Stripe structure
          • Index data
          • Row data
          • Stripe Footer
      • Footer
        • List of stripe locations
        • Type descriptions
        • Stripe statistics
      • Postscript
        • Compression Parameters
        • File format version
  • Schema change
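
The stripes / footer / postscript layout above implies that a reader works backwards from the end of the file. The sketch below mimics that with a toy format; it is not the real ORC encoding. JSON stands in for ORC's Protocol Buffers metadata, and `write_toy_file` and `read_stripe` are hypothetical names. The one detail borrowed from ORC is that the final byte of the file holds the postscript length.

```python
import json
from typing import List


def write_toy_file(stripes: List[List[int]]) -> bytes:
    """Lay out stripes, then a footer, then a postscript, then one length byte."""
    body = b""
    locations = []
    for rows in stripes:
        data = json.dumps(rows).encode()
        # The footer records where each stripe lives so it can be read directly.
        locations.append({"offset": len(body), "length": len(data)})
        body += data
    footer = json.dumps({"stripes": locations, "types": ["int"]}).encode()
    postscript = json.dumps(
        {"footer_length": len(footer), "compression": "none", "version": 1}
    ).encode()
    assert len(postscript) < 256  # its length must fit in the single trailing byte
    return body + footer + postscript + bytes([len(postscript)])


def read_stripe(blob: bytes, index: int) -> List[int]:
    """Read one stripe: last byte -> postscript -> footer -> stripe location."""
    postscript_length = blob[-1]
    postscript_start = len(blob) - 1 - postscript_length
    postscript = json.loads(blob[postscript_start:len(blob) - 1])
    footer_start = postscript_start - postscript["footer_length"]
    footer = json.loads(blob[footer_start:postscript_start])
    location = footer["stripes"][index]
    start = location["offset"]
    return json.loads(blob[start:start + location["length"]])


blob = write_toy_file([[1, 2, 3], [4, 5, 6], [7, 8]])
second = read_stripe(blob, 1)
print(second)
```

Because the footer lists every stripe location, a reader can fetch a single stripe without scanning the ones before it, which is what makes stripes usable as independent splits.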

Parquet
  • Developed at Twitter and Cloudera
