Hive query with JOIN, GROUP BY and SUM does not return results


On Hive 0.11, and lower versions, if we set:

set hive.optimize.skewjoin=true; 
set hive.auto.convert.join=false;

A query with JOIN, GROUP BY and SUM does not return results.

But if we make the query a little more simple, using JOIN but not GROUP and SUM functions, We will GET RESULTS.

I have found that there are bugs reported recently:

https://issues.apache.org/jira/browse/HIVE-5888

and

https://issues.apache.org/jira/browse/HIVE-6041

This bug is related also with the previous one (https://issues.apache.org/jira/browse/HIVE-6520), already reported:

If we set:

set hive.optimize.skewjoin=true; 
set hive.auto.convert.join=true;

We will no have any output.

The reason is that the skew join in hive relies on a reduce phase to save skewed keys on local disk, but hive.auto.convert.join=true turns a mapreduce task into a mapjoin task in some scenarios.

As a result, there is no skewed keys generated by the mapjoin and the result is empty.

If you set hive.auto.convert.join=false to disable the auto conversion of a mapjoin, the performance is very bad because the reduce phase takes a very long time to process the skew keys.

 

This is expected to be resolved on version hive-0.13.0.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s