This is a common error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
This error can occur in many Java environments, but with Hive in particular it is fairly common when large structures or many thousands of objects are kept in memory.
According to Sun, the error is raised when too much time is being spent in garbage collection:
If more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown.
This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small.
If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit.
We can also increase the heap size via the -Xmx option, e.g. -Xmx1024m.
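For a standalone Java application, these options would simply be added to the launch command; the class name below is only a placeholder for illustration:

java -server -Xmx1024m -XX:-UseGCOverheadLimit com.example.MyApp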
Another interesting option is the concurrent collector, enabled with -XX:+UseConcMarkSweepGC: it performs most of its work concurrently (i.e., while the application is still running) to keep garbage collection pauses short.
It is designed for applications with medium to large-sized data sets for which response time is more important than overall throughput, since the techniques used to minimize pauses can reduce application performance.
In Hive, we can set these parameters with a command like this:
SET mapred.child.java.opts="-server -Xmx1g -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit";
An alternative that avoids keeping the whole structure in memory:
Write intermediate results to a temporary table in the database instead of to hashmaps; a database table is not memory bound, so using an indexed table is a solution in many cases.
When the intermediate table is complete, execute the SQL statement(s) against it instead of against the in-memory data.
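As a rough sketch of this approach in Hive (the table and column names below are only placeholders):

-- Materialize the intermediate results in a table instead of keeping them in memory
CREATE TABLE tmp_results AS
SELECT key_col, COUNT(*) AS cnt
FROM source_table
GROUP BY key_col;

-- Run the follow-up query against the intermediate table
SELECT key_col, cnt
FROM tmp_results
WHERE cnt > 100;

Once the final results have been produced, the intermediate table can simply be dropped.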