Hive provides a good way to evaluate your data on HDFS. A common case is that you create your data first and then want to use hive to evaluate it. In that case, creating an external table is the approach that makes sense.
In this post, we are going to discuss a more complicated usage where we need to include more than one partition field in the external table, and where the original data on HDFS is in JSON.
For the difference between a managed table and an external table, please refer to this SO post.
Here is your data
In this post, we assume your data is saved on HDFS as `/user/coolguy/awesome_data/year=2017/month=11/day=01/*.json.snappy`.
The data is well divided into daily chunks, so we definitely want to keep `year`, `month`, and `day` as the partitions in our external hive table.
Make hive able to read JSON
Since every line in our data is a JSON object, we need to tell hive how to interpret it as a set of fields.
To achieve this, we are going to add an external jar. There are two jars that I know of that can do the job:
- `hive-hcatalog-core-X.Y.Z.2.4.2.0-258.jar`, which ships with Hive as part of HCatalog.
- `json-serde-X.Y.Z-jar-with-dependencies.jar`, the OpenX JSON SerDe.
To add the jar you choose to hive, simply run `ADD JAR` in the hive console:
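A minimal sketch, assuming the jar sits in a hypothetical local directory `/home/coolguy/jars/`:

```sql
-- the local path below is a placeholder; point it at wherever your jar lives
ADD JAR /home/coolguy/jars/hive-hcatalog-core-X.Y.Z.2.4.2.0-258.jar;
```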
Note: The path here is the path to your jar on the local machine. You can also point to a jar on HDFS by using the `hdfs://` prefix.
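For example, assuming the jar has been uploaded to a hypothetical HDFS directory:

```sql
-- the HDFS location is an assumed example path
ADD JAR hdfs:///user/coolguy/jars/hive-hcatalog-core-X.Y.Z.2.4.2.0-258.jar;
```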
Create the external table
By now, all the preparation is done. The rest of the work is pretty straightforward:
- Tell hive where to look for the data.
- Tell hive which ones are the fields for partitions.
- Tell hive which library to use for JSON parsing.
So, the HQL to create the external table is something like:
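A sketch of that statement, assuming the table name `awesome_data` and placeholder columns standing in for your actual JSON schema:

```sql
-- the columns below (the <field-list>) are placeholders for your real JSON fields
CREATE EXTERNAL TABLE awesome_data (
  field1 STRING,
  field2 INT
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/coolguy/awesome_data/';
```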
This HQL uses `hive-hcatalog-core-X.Y.Z.2.4.2.0-258.jar` to parse JSON. To use `json-serde-X.Y.Z-jar-with-dependencies.jar` instead, change `ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'` to `ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'`.
There are two things you want to be careful about:
- The fields for partition shouldn't be in the `<field-list>`.
- The `/year=2017/month=11/day=01/` part of the path shouldn't be in the `LOCATION`.
And here you go, you get yourself an external table based on the existing data on HDFS.
However…
Soon enough, though, you will find that this external table doesn't seem to contain any data. That is because we need to manually add the partitions to the table.
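A sketch of how to register the existing day=01 directory, assuming the table name `awesome_data` from the example above:

```sql
-- tell hive that the year=2017/month=11/day=01 directory is a partition of the table
ALTER TABLE awesome_data ADD PARTITION (year='2017', month='11', day='01');
```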
When you finish the ingestion of `/user/coolguy/awesome_data/year=2017/month=11/day=02/`, you should also run:
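Presumably something along these lines, again assuming the table name `awesome_data`:

```sql
-- register the newly ingested day=02 directory so hive can see its data
ALTER TABLE awesome_data ADD PARTITION (year='2017', month='11', day='02');
```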