Comparison of Parquet and ORC
2022/02/21 #Spark
I'm using Spark 3.2.1 to convert MongoDB BSON files to Parquet and ORC. To save cost, I want to choose the best combination of file format + compression. Both the file size and the data size scanned by queries should be small, because the files will be stored in AWS S3 and queried with AWS Athena, which charge by data stored and data scanned respectively.
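For context, the write side of the conversion is just the standard DataFrame writer. Here is a minimal sketch, assuming the BSON has already been parsed into a DataFrame (e.g. via the MongoDB Spark connector or a custom parser); the S3 paths below are placeholders, not my real setup:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("bson-to-columnar")
  .getOrCreate()

// `parsedDf` stands in for the DataFrame obtained by parsing the BSON dump;
// how it is parsed is out of scope here.
def writeBoth(parsedDf: DataFrame): Unit = {
  parsedDf.write.mode("overwrite").parquet("s3://my-bucket/events/parquet/")
  parsedDf.write.mode("overwrite").orc("s3://my-bucket/events/orc/")
}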
Since BSON is binary JSON, the data is highly nested. A sample looks like this:
{
  "_id": {
    "a": 1,
    "b": 2,
    // ...
  },
  "value": {
    "c": {}, // Map<String, Long>
    "d": {}, // Map<String, Map<String, Long>>
    // ... hundreds of other columns
  }
}
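In Spark terms, the `value` side of such a document ends up as nested struct and map columns. A rough sketch of the schema for just the fields shown above (everything else omitted):

import org.apache.spark.sql.types._

// Only the fields from the sample are spelled out; the hundreds of
// other columns under "value" are omitted.
val sampleSchema = StructType(Seq(
  StructField("_id", StructType(Seq(
    StructField("a", LongType),
    StructField("b", LongType)
  ))),
  StructField("value", StructType(Seq(
    StructField("c", MapType(StringType, LongType)),                     // Map<String, Long>
    StructField("d", MapType(StringType, MapType(StringType, LongType))) // Map<String, Map<String, Long>>
  )))
))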
Different data can make a big difference in the comparison results. Below are just my own test results; you should run the test with your own data.
Compression
Parquet and ORC support many compression algorithms. Check ParquetOptions and OrcOptions for the full lists.
object ParquetOptions {
  // The parquet compression short names
  private val shortParquetCompressionCodecNames = Map(
    "none" -> CompressionCodecName.UNCOMPRESSED,
    "uncompressed" -> CompressionCodecName.UNCOMPRESSED,
    "snappy" -> CompressionCodecName.SNAPPY,
    "gzip" -> CompressionCodecName.GZIP,
    "lzo" -> CompressionCodecName.LZO,
    "lz4" -> CompressionCodecName.LZ4,
    "brotli" -> CompressionCodecName.BROTLI,
    "zstd" -> CompressionCodecName.ZSTD)
}
object OrcOptions {
  // The ORC compression short names
  private val shortOrcCompressionCodecNames = Map(
    "none" -> "NONE",
    "uncompressed" -> "NONE",
    "snappy" -> "SNAPPY",
    "zlib" -> "ZLIB",
    "lzo" -> "LZO",
    "lz4" -> "LZ4",
    "zstd" -> "ZSTD")
}
The difference is that Parquet offers gzip while ORC offers zlib; there is a good article explaining how gzip and zlib relate.
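These short names are what the writer accepts. They can be set either as a session-level config or per write; a quick sketch, where `spark` is the active SparkSession, `df` is the DataFrame being written, and the paths are placeholders:

// Session-level defaults for each format.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
spark.conf.set("spark.sql.orc.compression.codec", "zlib")

// Or per write; the "compression" option overrides the session config.
df.write.mode("overwrite").option("compression", "zstd").parquet("s3://my-bucket/events/parquet-zstd/")
df.write.mode("overwrite").option("compression", "zstd").orc("s3://my-bucket/events/orc-zstd/")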
LZO needs extra libraries, so I didn't test it. The table below shows file size ratios relative to the smallest combination, ORC + zlib. All compression configs are left at their defaults, such as compression level.
compression | Parquet | ORC |
---|---|---|
none | 2.32 | 1.39 |
lz4 | 1.56 | 1.18 |
snappy | 1.53 | 1.16 |
zstd | 1.18 | 1.02 |
gzip (zlib) | 1.16 | 1.00 |
Processing time
My app's logic is fairly simple: just parse and save. But ORC took 3-4x the time of Parquet. Maybe that's because of the way ORC handles nested columns, or simply because ORC isn't as well optimized as Parquet in Spark.
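I didn't profile this in depth; the numbers are just wall-clock time around the two write jobs, along the lines of this sketch (`df` and the paths are placeholders):

// Crude wall-clock timing of a write job.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

timed("parquet write") {
  df.write.mode("overwrite").parquet("s3://my-bucket/events/parquet/")
}
timed("orc write") {
  df.write.mode("overwrite").orc("s3://my-bucket/events/orc/")
}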
Data scanned in queries
Parquet uses the Dremel encoding to flatten nested columns, while ORC just nests columns within columns. So when reading an inner column, Parquet reads only the columns it needs, while ORC has to decompress the outer columns fully.
A simple query in Athena can reveal this:
The files used in the test: Parquet = 376 MB, ORC = 319 MB.
SQL | Parquet data scanned (MB) | ORC data scanned (MB) |
---|---|---|
select sum(value.a) from table; | 1.6 | 303.94 |
select sum(value.b) from table; | 6.11 | 303.94 |
select sum(element_at(value.c, 'd').e) from table; | 97.14 | 303.94 |
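The same effect can be sanity-checked from Spark: with nested schema pruning enabled (`spark.sql.optimizer.nestedSchemaPruning.enabled`, on by default in Spark 3.x), the ReadSchema of the FileScan node shows which nested fields the Parquet scan actually requests. A sketch, with a placeholder path:

// Read back the Parquet output and select a single nested field; the
// ReadSchema in the printed physical plan shows whether the scan was
// pruned down to just that field.
val parquetDf = spark.read.parquet("s3://my-bucket/events/parquet/")
parquetDf.select("value.c").explain()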
Conclusion
So, Parquet + gzip is good enough for me: the file size is only slightly sub-optimal, processing time is fast, and the data scanned in queries is also small.
Further optimizations like row group size and data page size could also be tested and applied, but for now it's enough to leave them at their defaults.
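If I do revisit that, the Parquet row group and data page sizes are plain Hadoop configuration keys picked up by the writer; something like this sketch (the 128 MB / 1 MB values are only examples, not recommendations):

// Parquet row group ("block") size and data page size, set on the
// Hadoop configuration used by the write job; example values only.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
spark.sparkContext.hadoopConfiguration.setInt("parquet.page.size", 1024 * 1024)

df.write.mode("overwrite").parquet("s3://my-bucket/events/parquet-tuned/")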