The Evolution of Big Data Table Management

2025/07/02

This is just a summary of my understanding of how to organize big data in tables, so the ordering of the levels is somewhat arbitrary.

Level 1

No tables, just a bunch of files in a directory. To process the data, read a single file or the whole directory.

Hive-style partitioned/bucketed directories.

Pros: easy to understand; queries can prune data by partition.
Cons: partition evolution is hard; file listing is slow on S3 and may hit API rate limits.
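
To make partition pruning concrete, here is a minimal sketch in plain Python. The paths, the column names (dt, country) and the helper names are made up for illustration; a real engine does this in its planner, not like this.

```python
from pathlib import PurePosixPath


def parse_partitions(path):
    """Extract key=value directory segments from a Hive-style path."""
    parts = {}
    for segment in PurePosixPath(path).parent.parts:
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths, **wanted):
    """Keep only files whose partition values match every requested filter."""
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing for an "events" dataset.
files = [
    "events/dt=2025-07-01/country=US/part-0.parquet",
    "events/dt=2025-07-01/country=DE/part-0.parquet",
    "events/dt=2025-07-02/country=US/part-0.parquet",
]

# Only one file needs to be opened for this filter.
selected = prune(files, dt="2025-07-01", country="US")
print(selected)
```

Note that the pruning decision only needs the paths, not the file contents. That is also why listing becomes the bottleneck on object stores: the paths have to be enumerated first.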

Use a Hive (or Glue...) metastore.

Z-order.

Pros: data skipping stays efficient across different combinations of filter columns.
Cons: not a table property, so the data has to be organized manually; rewriting files causes write amplification.
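
The idea behind Z-order can be sketched with a Morton code: interleave the bits of two column values and sort by the result. This is only a toy with two small integer columns and a hypothetical interleave_bits helper; real implementations handle arbitrary types and more columns.

```python
def interleave_bits(x, y, bits=16):
    """Morton code: bit i of x goes to position 2*i, bit i of y to 2*i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# A 4x4 grid of (x, y) rows, sorted along the Z-order curve.
rows = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))

# Split the sorted rows into 4 "files" of 4 rows each.
files = [z_sorted[i:i + 4] for i in range(0, 16, 4)]

for f in files:
    xs = [x for x, _ in f]
    ys = [y for _, y in f]
    print(f"x in [{min(xs)}, {max(xs)}], y in [{min(ys)}, {max(ys)}]")
```

Each file ends up covering a 2x2 block, so its min/max statistics are narrow on both columns. Sorting by x alone would give narrow ranges on x but every file would span the full range of y.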

Table formats: Iceberg, Delta Lake, Hudi...

Pros: ACID transactions, schema evolution, data updates, time travel, branches/tags...
Cons: you lose control at the file level; every task needs to go through the table format.
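
A rough way to picture what these formats add: the table is a log of snapshots, and each snapshot lists the data files that belong to it. The ToyTable class below is purely hypothetical and is not how Iceberg, Delta Lake or Hudi store metadata (they use manifests, statistics and atomic commits); it only shows why time travel falls out of the design.

```python
class ToyTable:
    """Toy metadata log: each snapshot is an immutable tuple of data file names."""

    def __init__(self):
        self.snapshots = [()]  # snapshot 0 is the empty table

    def commit(self, add=(), remove=()):
        """Create a new snapshot from the latest one; return its id."""
        current = self.snapshots[-1]
        new = tuple(f for f in current if f not in remove) + tuple(add)
        self.snapshots.append(new)
        return len(self.snapshots) - 1

    def files(self, snapshot_id=None):
        """List files in the latest snapshot, or in an older one (time travel)."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]


table = ToyTable()
s1 = table.commit(add=["a.parquet", "b.parquet"])
s2 = table.commit(add=["c.parquet"], remove=["a.parquet"])  # e.g. an update rewrote a.parquet

print(table.files())    # latest snapshot
print(table.files(s1))  # the table as of snapshot s1
```

Readers never list the directory; they only follow the snapshot. That is where the loss of file-level control comes from: a file dropped into the directory by hand is simply invisible to the table.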

Liquid clustering: automatic, incremental clustering based on query usage.

Auto compaction on tables.
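
Compaction is essentially a bin-packing problem: group small files so that each rewritten file lands near a target size. Below is a minimal sketch using greedy first-fit-decreasing; the plan_compaction function, the file names and the sizes are all made up, and actual table formats have their own planning strategies.

```python
def plan_compaction(file_sizes, target):
    """Group files smaller than `target` into bins of at most `target` bytes.

    Uses greedy first-fit-decreasing. Returns only the groups that contain
    more than one file, since rewriting a single file gains nothing.
    """
    small = sorted(
        ((size, name) for name, size in file_sizes.items() if size < target),
        reverse=True,
    )
    bins = []  # each bin is [total_size, [names]]
    for size, name in small:
        for b in bins:
            if b[0] + size <= target:
                b[0] += size
                b[1].append(name)
                break
        else:
            bins.append([size, [name]])
    return [names for _, names in bins if len(names) > 1]


# Hypothetical file sizes in MB, with a 128 MB target.
sizes = {"a": 10, "b": 60, "c": 30, "d": 128, "e": 50}
plan = plan_compaction(sizes, target=128)
print(plan)
```

Here b, e and a are merged into one 120 MB file, d is already at the target and is left alone, and c stays as it is because it has nothing to be merged with.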