
Managed table locations for Spark & Hive

2023/05/01

When using Spark with a Hive metastore, there are some differences in how the warehouse location is configured, and that location controls where the data for managed tables is saved.

If it is not configured correctly, the results can be confusing. There are a few mechanisms behind the scenes:

- Spark resolves its warehouse path from `spark.sql.warehouse.dir`, which defaults to `spark-warehouse` under the present working directory; Hive resolves its own from `hive.metastore.warehouse.dir`, which defaults to `/user/hive/warehouse`.
- When a database is created, its location (`<warehouse dir>/<db name>.db`) is resolved once and persisted in the metastore.
- A managed table is saved under its database’s persisted location, not under whatever the warehouse setting happens to be when the table is created.
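You can verify the effective values from each shell (a minimal sketch; `SET` with a key name prints its current value):

```sql
-- in spark-sql: prints the warehouse path, which defaults to spark-warehouse
-- under the directory you launched the shell from
SET spark.sql.warehouse.dir;

-- in the hive CLI: prints /user/hive/warehouse by default
SET hive.metastore.warehouse.dir;
```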

Enough background, now let’s walk through some examples:

I set up HDFS 3.3.2 on my local machine, plus a metastore shared by Hive 2.3.7 and Spark 3.3.2, all using the default configs.


Then:

Use spark-sql to create a table without specifying a database: create table spark_text_table ...;
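The column list is elided above; here is a minimal sketch with a hypothetical schema, plus a DESCRIBE to check where the data lands:

```sql
-- hypothetical schema; the original statement's columns are elided
CREATE TABLE spark_text_table (id INT, value STRING) STORED AS TEXTFILE;

-- the Location row shows where the managed table's data is stored
DESCRIBE EXTENDED spark_text_table;
```

With everything left at defaults, the location resolves against Spark’s own warehouse path, so expect something like file:$PWD/spark-warehouse/spark_text_table on the local filesystem rather than HDFS.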


Use spark-sql to create a database db1, then create a table db1.spark_text_table:
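A sketch of the same flow (hypothetical schema again), with DESCRIBE statements to inspect the locations that get recorded:

```sql
CREATE DATABASE db1;
-- the location (<warehouse dir>/db1.db) is persisted in the metastore at this moment
DESCRIBE DATABASE db1;

-- hypothetical schema
CREATE TABLE db1.spark_text_table (id INT, value STRING) STORED AS TEXTFILE;
-- the table sits under the database's persisted location: .../db1.db/spark_text_table
DESCRIBE EXTENDED db1.spark_text_table;
```

Because db1’s location was resolved from $PWD/spark-warehouse at creation time, later tables go under that db1.db directory no matter where spark-sql is launched from afterwards.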


Use Hive to create a database db2, then create a table db2.hive_text_table:
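The Hive side of the sketch (same hypothetical schema):

```sql
CREATE DATABASE db2;
-- Hive resolves this from hive.metastore.warehouse.dir,
-- so the default is <fs.defaultFS>/user/hive/warehouse/db2.db on HDFS
DESCRIBE DATABASE db2;

-- hypothetical schema
CREATE TABLE db2.hive_text_table (id INT, value STRING) STORED AS TEXTFILE;
```

So with default configs the two systems diverge: Hive’s managed tables land on HDFS under /user/hive/warehouse, while Spark’s land on the local filesystem under $PWD/spark-warehouse.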


One caveat about the default Spark warehouse path $PWD/spark-warehouse: since it is derived from the present working directory, calling spark-sql/spark-shell from different directories and creating new databases will scatter tables across different locations, which greatly increases the management overhead.
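One way to sidestep this is to pin the warehouse to a fixed location rather than relying on the default; a sketch, where the HDFS URL is hypothetical:

```shell
# spark.sql.warehouse.dir is a static conf: it must be set before the first
# SparkSession starts (here on the command line, or in spark-defaults.conf)
spark-sql --conf spark.sql.warehouse.dir=hdfs://localhost:9000/user/hive/warehouse
```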

But there is still good news: since the database and table metadata is saved in Hive’s metastore, you can still query all the tables created before, even from a different directory.
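For example, continuing the sketch above, db1’s table stays queryable from a completely different directory, because its absolute location is read back from the metastore:

```shell
# hypothetical: any directory other than where db1 was created
cd /tmp/another-dir
spark-sql -e "SELECT * FROM db1.spark_text_table;"
```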