Hive query with JOIN, GROUP BY and SUM does not return results


On Hive 0.11 and lower versions, if we set:

set hive.optimize.skewjoin=true; 
set hive.auto.convert.join=false;

A query with JOIN, GROUP BY and SUM does not return results.

But if we make the query a little simpler, using the JOIN but not the GROUP BY and SUM functions, we do get results.
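As a sketch of the difference (the table and column names below are hypothetical, not from the original report), the first query returns nothing under the settings above, while the second returns rows:

-- JOIN + GROUP BY + SUM: returns no results with the settings above
SELECT a.user_id, SUM(b.amount) AS total
FROM users a
JOIN orders b ON (a.user_id = b.user_id)
GROUP BY a.user_id;

-- Same JOIN without GROUP BY or SUM: returns results
SELECT a.user_id, b.amount
FROM users a
JOIN orders b ON (a.user_id = b.user_id);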

I have found that there are related bugs reported recently:

https://issues.apache.org/jira/browse/HIVE-5888

and

https://issues.apache.org/jira/browse/HIVE-6041

This bug is also related to a previously reported one (https://issues.apache.org/jira/browse/HIVE-6520):

If we set:

set hive.optimize.skewjoin=true; 
set hive.auto.convert.join=true;

We will not get any output.

The reason is that the skew join in Hive relies on a reduce phase to save skewed keys to local disk, but hive.auto.convert.join=true turns a map-reduce task into a map-join task in some scenarios.

As a result, no skewed keys are generated by the map join, and the result is empty.

If you set hive.auto.convert.join=false to disable the automatic conversion to a map join, performance is very bad because the reduce phase takes a very long time to process the skewed keys.
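For context, what counts as a "skewed key" in this code path is controlled by the hive.skewjoin.key parameter (a per-key row-count threshold); the value below is only an example alongside the settings already discussed:

set hive.optimize.skewjoin=true;
set hive.auto.convert.join=false;
set hive.skewjoin.key=100000;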

 

This is expected to be resolved in version hive-0.13.0.

Check system variables or environment variables in Hive


In Hive, we can check the values of system variables or environment variables with the command:

hive> set;

If we need to check the value of a specific variable, we can run:

hive> set hive.security.authorization.enabled; 
hive.security.authorization.enabled=false
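The same set command can also override a variable for the current session only (reusing the same variable here purely as an example):

hive> set hive.security.authorization.enabled=true;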

More information:

https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration

Installing Sqoop


Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

To install Sqoop, run the following commands:

wget https://archive.apache.org/dist/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz 

tar -xzf sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz 

mv sqoop-1.4.5.bin__hadoop-2.0.4-alpha /home/hadoop/sqoop 

wget https://s3-eu-west-1.amazonaws.com/vivanih-emr/sqoop/mysql-connector-java-5.1.30.tar.gz 

tar -xzf mysql-connector-java-5.1.30.tar.gz 

cp mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar /home/hadoop/sqoop/lib/ 

export PATH=$PATH:/home/hadoop/sqoop/bin
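As a quick sanity check that the binaries are reachable (assuming the export above was run in the current shell), you can print the Sqoop version:

sqoop version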

A basic command to copy from MySQL to HDFS is:

sqoop import --connect jdbc:mysql://shard3copy.cbmfrobfdt6y.us-east-1.rds.amazonaws.com:3306/shard3copy --username root --password hadoop123 --table yourtable

Now, if you need to make it available to Hive (create the metastore entry), you can use:

sqoop import --connect jdbc:mysql://mysqlhost/mysqldb --username user --password passwd --query "SELECT table1.date_col, table1.big_int_col, table1.int_col, table1.varchar_col, table1.float_col FROM MYSQL_TABLE1 AS table1 WHERE \$CONDITIONS" --split-by table1.date_col --hive-import --hive-table hivedb.hive_table1 --target-dir hive_table1_data
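Once the import finishes, a quick way to verify that the metastore entry was created (using the table name from the example above) is:

hive -e "SELECT * FROM hivedb.hive_table1 LIMIT 10;"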