I came across some scenarios where an application (i.e. Mapreduce) communicating to HBase through YARN could silently fail with a timeout like the following:
2017-01-30 19:42:03,657 DEBUG [main] org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation=, attempt=9 of 35 failed; retrying after sleep of 10095 because: Failed after attempts=36, exceptions: Mon Jan 30 19:42:03 UTC 2017, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68463: row 'test2,#cmrNo acctNo,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ip-172-31-3-246.us-west-2.compute.internal,16000,1485539268192, seqNum=0
The root cause for this behavior here wasn’t related to any missconfiguration at server/networking side, but a library missing in the class path.
When there is a zookeeper issue, depending on the retry parameters the exceptions are not visible.
On this case, In the Mapreduce Java application I’ve added/modified the following parameters that lead into more visibility in the communication layer between Zookeeper and HBase:
conf.set("hbase.client.retries.number", Integer.toString(1)); conf.set("zookeeper.session.timeout", Integer.toString(60000)); conf.set("zookeeper.recovery.retry", Integer.toString(1));
After this, the following exception was visible:
Exception: java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:157) at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: com.google.protobuf.ServiceException: java.lang.NoClassDefFoundError: com/yammer/metrics/core/Gauge
Playing around these parameters will cause the application exit quickly when there is a problem with the cluster. This can be desirable in a production environment.
Reducing the parameters to a more conservative value could yield better recovery times. Setting zookeeper.recovery.retry to 0 will still result in up to two connection attempts made to all zk servers in the quorum and cause and application failure to happen in under a minute should there be a loss of zookeeper connectivity during execution.
As an additional note, if you are receiving timeouts because the application is trying to contact localhost instead of the quorum server, you can set the explicit parameters:
// HBase through MR on Yarn is trying to connect to localhost instead of quorum. conf.set("hbase.zookeeper.quorum","172.31.3.246"); conf.set("hbase.zookeeper.property.clientPort","2181");
I’ve added a couple of examples of Mapreduce applications for HBase here: https://github.com/hvivani/bigdata/tree/master/hbase
Some additional notes on this behavior: https://discuss.pivotal.io/hc/en-us/articles/200933006-Hbase-application-hangs-indefinitely-connecting-to-zookeeper