Monday, March 17, 2014

Hive joins

Hive is a beast when you tune it for your queries, but it can be painfully slow if you do not, especially for join operations.


Read about Hive join optimization here:

https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919



A word about bucket map join

----------------------------------------

Bucket map join is helpful if you know the keys you are going to use in your queries. Basically, Hive hashes on the key (specified for clustering at table creation) and distributes the rows into n buckets (again, n is specified at table creation). You create the two tables that are to be joined, specifying the same key and the same number of buckets. While joining, Hive loads the buckets with the same number (any of 0 to n-1) from both tables and performs an in-memory join.
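As a rough sketch of the setup (table names, columns, and bucket counts below are made-up examples, and hive.optimize.bucketmapjoin plus the MAPJOIN hint is one way of asking Hive for the bucket map join in the 0.x releases):

-- both tables clustered on the same key into the same number of buckets
create table orders (order_id bigint, customer_id bigint, amount double)
clustered by (customer_id) into 32 buckets;

create table customers (customer_id bigint, name string)
clustered by (customer_id) into 32 buckets;

-- allow the bucket map join and hint the smaller table
set hive.optimize.bucketmapjoin=true;
select /*+ MAPJOIN(c) */ o.order_id, c.name
from orders o join customers c on o.customer_id = c.customer_id;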

There are a few issues I find with bucket map joins.

1) The tables have to be Hive managed. Doing a bucket map join on an external table is tedious: if you are not willing to make the table Hive managed (or INTERNAL), then you will have to hash the key and partition the data into n buckets yourself. I have not tried this myself. I guess the easier way is to create a Hive managed table and do an insert overwrite ... select query into it from your external table. Remember to set hive.enforce.bucketing=true. You can do this from the hive prompt.

hive > set hive.enforce.bucketing=true;

That means you are duplicating the data into the Hive managed table. My data pipeline is: data flows into an HDFS location and is then copied into a Hive managed table for joining with the other tables.
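A sketch of that pipeline, with made-up table names and an assumed schema, might look like this:

-- external table over the raw hdfs location (all names here are illustrative)
create external table raw_events (event_id bigint, user_id bigint, item_id bigint, payload string)
location '/data/raw/events';

-- hive managed, bucketed copy used for the joins
create table events_bucketed (event_id bigint, user_id bigint, item_id bigint, payload string)
clustered by (user_id) into 32 buckets;

-- enforce bucketing so the insert actually produces 32 bucket files
set hive.enforce.bucketing=true;
insert overwrite table events_bucketed select * from raw_events;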


2) If you have two keys that are candidates for joins to two separate tables, then you will have to keep two separate Hive managed tables with the same data but with different keys used for clustering. If you specify two keys at the same time for clustering, Hive hashes both keys together, so the hash is a combination of the two. If you are joining on both keys together, well and good. But if you have to join on one key at one time and the other key at another time, this won't work, as the hash was created using both keys.

3) If you do a load data inpath ..., Hive simply copies the directory from the external location to the Hive warehouse (/user/hive/warehouse) in HDFS. It does not do the bucketing and sorting, even if you enforce it. So the way to have your tables bucketed is to do an insert overwrite into the table, as in

hive > insert overwrite table foo select * from bar;

That means creating two Hive tables, each clustered by a different key.
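For example (reusing the made-up raw_events table from the sketch above), the same data gets loaded into two managed tables, each clustered by the key you plan to join on:

-- same data, two different clustering keys
create table events_by_user (event_id bigint, user_id bigint, item_id bigint, payload string)
clustered by (user_id) into 32 buckets;

create table events_by_item (event_id bigint, user_id bigint, item_id bigint, payload string)
clustered by (item_id) into 32 buckets;

set hive.enforce.bucketing=true;
insert overwrite table events_by_user select * from raw_events;
insert overwrite table events_by_item select * from raw_events;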

If anyone knows better kung fu, please post it as a comment.


Yarn memory requirements and tuning them for your Hive joins

If you run into error messages like the one below

Container [pid=19699,containerID=container_1394473937156_0074_01_000122] is running beyond virtual memory limits. Current usage: 34.7 MB of 1 GB physical memory used; 11.1 GB of 2.1 GB virtual memory used. Killing container.

it is an indication that you have not tuned your YARN memory configuration. For more on YARN optimization, go here:
yarn optimization
YARN has the notion of a container, associated with each Hadoop node, and resources are allocated per container. You have to specify the memory allocated to each container and to the JVM running inside it. YARN typically allows virtual memory up to 2.1 times the physical memory (the default ratio). The error message above indicates that the JVM was given more memory in the child JVM options (mapred.child.java.opts, around 9 GB) than the container can handle; the JVM memory allocation should be less than what is allocated to the YARN container. When you do a bucket map join (or even a regular map join), Hive reads the data into memory, so you have to configure your YARN and JVM memory carefully, considering the size of your data.
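As a rough illustration (the property names are the standard MR2 ones; the numbers are placeholders, not recommendations), the child JVM heap is kept below the container size along these lines:

-- illustrative values only; actual sizes depend on your data and cluster
-- container size for each map / reduce task, in MB
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
-- child jvm heap, kept below the container size
set mapreduce.map.java.opts=-Xmx3276m;
set mapreduce.reduce.java.opts=-Xmx3276m;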

For more details on configuring yarn memory read the link mentioned above.

Just in - I figured out that YARN does not respect the configuration yarn.scheduler.minimum-allocation-mb as a minimum memory allocation for the container. Instead it uses this configuration as the base chunk of memory allocation, and a multiple of that chunk is allocated to cover the memory specified in mapreduce.map.memory.mb and mapreduce.reduce.memory.mb. So if you run into physical memory limits for the YARN container, configure the above params accordingly.
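Roughly, and with purely illustrative numbers, that interplay looks like this:

-- yarn.scheduler.minimum-allocation-mb is set cluster-wide in yarn-site.xml, say to 1024;
-- the container granted for a task is a multiple of that base chunk, so a request of
-- 4096 MB fits exactly, while 4500 MB would be rounded up to 5120 MB
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;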



