Benefits
- Contains indexes and summary statistics that allow a reader to skip entire blocks
- Comparable to skip lists: the index lets a reader jump past data it does not need
- Queries typically need only a few columns, not the entire row
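- Columns compress well (similar values are stored together)
- Column-specific encodings can be applied (e.g. run-length or dictionary encoding)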
- Suitable for Vectorized Query Execution
- Exploits CPU Single Instruction, Multiple Data (SIMD) instructions
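As a rough illustration of the first benefit, the sketch below keeps a minimum and maximum per block and uses them to skip blocks that cannot match a filter. The `Block` class, `make_blocks`, and `scan_greater_than` are hypothetical names invented for this example and are not part of any real file format library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A hypothetical column block with precomputed summary statistics."""
    values: List[int]
    min_value: int
    max_value: int


def make_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_greater_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        if block.max_value <= threshold:
            continue  # statistics prove no value can match, so skip the block
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


column = list(range(100))          # sorted data makes skipping most effective
blocks = make_blocks(column, 10)   # 10 blocks of 10 values each
matches, blocks_read = scan_greater_than(blocks, 84)
print(len(matches), blocks_read)   # 15 2
```

Only 2 of the 10 blocks are read here. The benefit shrinks when values are scattered randomly, because then most blocks have a wide min/max range and few can be skipped.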
ORC
- Used by Facebook
- Developed at Hortonworks, in collaboration with Facebook
- Structure
- Contains list of stripes
- Default stripe size = 64 MB (just like an HDFS block)
- Bigger stripes = more efficient reads
- Smaller stripes = more efficient memory use, more splits
- Stripe structure
- Index data
- Row data
- Stripe Footer
- Footer
- List of stripe locations
- Type descriptions
- Stripe statistics
- Postscript
- Compression Parameters
- File format version
- Schema change
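The layout above is designed to be read from the end of the file: the final byte gives the postscript length, the postscript leads to the footer, and the footer lists where each stripe lives. The toy sketch below mimics that idea only. It uses JSON in place of the Protocol Buffers encoding that real ORC files use, and `write_toy_file`, `read_stripe`, and the field names are assumptions made up for illustration.

```python
import io
import json
import struct


def write_toy_file(stripes, compression="none", version="0.1"):
    """Write stripes, then a footer, a postscript, and a 1-byte postscript length."""
    buf = io.BytesIO()
    locations = []
    for data in stripes:
        locations.append({"offset": buf.tell(), "length": len(data)})
        buf.write(data)
    # Footer: the list of stripe locations (JSON here, not the real encoding).
    footer = json.dumps({"stripes": locations}).encode()
    buf.write(footer)
    # Postscript: what a reader needs in order to interpret the footer.
    postscript = json.dumps({
        "footer_length": len(footer),
        "compression": compression,
        "version": version,
    }).encode()
    assert len(postscript) < 256  # the length must fit in the final byte
    buf.write(postscript)
    buf.write(struct.pack("B", len(postscript)))
    return buf.getvalue()


def read_stripe(blob, index):
    """Locate one stripe by walking backwards from the end of the file."""
    ps_len = blob[-1]
    ps_start = len(blob) - 1 - ps_len
    postscript = json.loads(blob[ps_start:len(blob) - 1])
    footer_start = ps_start - postscript["footer_length"]
    footer = json.loads(blob[footer_start:ps_start])
    loc = footer["stripes"][index]
    return blob[loc["offset"]:loc["offset"] + loc["length"]]


blob = write_toy_file([b"stripe-zero", b"stripe-one"])
print(read_stripe(blob, 1))  # b'stripe-one'
```

Because stripe locations sit in the footer, a reader can fetch a single stripe without scanning the ones before it, which is what makes stripes usable as independent splits.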
Parquet
- Developed at Twitter and Cloudera