The lost art of debugging Java threads


Whether you are trying to track down a memory leak in an Atlassian application or a DB engine, or you just want to know why your Java process is ‘hung’ or eating all the CPU, you will end up performing the same debugging steps. Here are a few notes on taking Java thread dumps and checking what’s going on under the hood.

First, find out which Java process is using the most CPU:

 $ ps u -C java
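
If you also need to know which thread inside that process is burning the CPU, a quick sketch (assuming a Linux box with top available) is to list the threads of the process and convert the busiest thread id to hex, so it can be matched against the nid= field in the thread dump:

$ top -H -p PId          # per-thread CPU usage for the chosen Java process
$ printf '%x\n' TId      # hex value to look up as nid=0x... in the dump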

Generate the Java thread dump:

$ jstack -l PId > PId-threads.txt
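
A single dump is only a snapshot, so it often helps to take a few dumps several seconds apart and compare them; a minimal sketch:

$ for i in 1 2 3; do jstack -l PId > PId-threads-$i.txt; sleep 10; done

A thread that stays BLOCKED on the same monitor across dumps is a much stronger signal than one that is blocked in a single snapshot.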

From the thread dump we can count threads by state:

$ awk '/State: / { print }' < PId-threads.txt  | sort | uniq -c
 450    java.lang.Thread.State: BLOCKED (on object monitor)
 240    java.lang.Thread.State: RUNNABLE
  47    java.lang.Thread.State: TIMED_WAITING (on object monitor)
 294    java.lang.Thread.State: TIMED_WAITING (parking)
  31    java.lang.Thread.State: TIMED_WAITING (sleeping)
  42    java.lang.Thread.State: WAITING (on object monitor)
  62    java.lang.Thread.State: WAITING (parking)

From these, we search for the threads that are “waiting to lock”***:

$ awk '/waiting to lock / { print }' < PId-threads.txt  | sort | uniq -c
   1     - waiting to lock <0x0000000600a027d8> (a org.apache.log4j.spi.RootLogger)
 294     - waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)
  19     - waiting to lock <0x0000000600f36fc8> (a java.lang.Object)
   1     - waiting to lock <0x000000072f6e6708> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)

The monitor most threads are blocked on is <0x0000000600f2e770>.

Among the BLOCKED threads, many are waiting in getConnection:

"Thread-132985" prio=10 tid=0x00007fec40784800 nid=0x662d waiting for monitor entry [0x00007fec18cd4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1449)
    - waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)

And many others are waiting in Connection.close:

"IPC Client (738091550) connection to /10.66.2.38:9022 from hadoop" daemon prio=10 tid=0x00007fec41c1f800 nid=0x2dcc waiting for monitor entry [0x00007fec15da6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.ipc.Client$Connection.close(Client.java:1135)
    - waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)

All of them are BLOCKED by <0x0000000600f2e770>.

Searching the rest of the dump, we can find the Java thread that holds the lock on this monitor (a one-liner to locate it quickly follows the stack trace). The culprit will look like:

"Thread-133346" prio=10 tid=0x00007fec40ac7800 nid=0x747e runnable [0x00007fec17cc4000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.Thread.<init>(Thread.java:234)
    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:396)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1452)
    - locked <0x0000000600f2e770> (a java.util.Hashtable)
    at org.apache.hadoop.ipc.Client.call(Client.java:1381)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
    at com.sun.proxy.$Proxy42.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:163)
    at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy43.getApplicationReport(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291)
    at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:294)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:152)
    at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:319)
    - locked <0x0000000733033970> (a org.apache.hadoop.mapred.ClientServiceDelegate)
    at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
    at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:532)
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:314)
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:311)
    - locked <0x0000000733032e20> (a org.apache.hadoop.mapreduce.Job)
    at org.apache.hadoop.mapreduce.Job.getJobState(Job.java:347)
    at org.apache.hadoop.mapred.JobClient$NetworkedJob.getJobState(JobClient.java:295)
    - locked <0x0000000733032e10> (a org.apache.hadoop.mapred.JobClient$NetworkedJob)
    at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:244)
    at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:547)
    at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:426)
    at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
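
To locate that thread without reading the whole dump by hand, a small awk sketch (reusing the lock address found above) prints the header line of whichever thread holds the monitor:

$ awk '/^"/ { hdr = $0 } /- locked <0x0000000600f2e770>/ { print hdr }' PId-threads.txt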

This kind of BLOCKED state is not technically a deadlock, just one thread blocking many others: a single thread holds the lock on a resource while many other threads queue up waiting for the same monitor.
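A true deadlock would be reported explicitly by jstack -l at the end of the dump; a quick check (a sketch, and the exact message may vary slightly between JVM versions) is:

$ grep -A 20 'Found one Java-level deadlock' PId-threads.txt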

*** Instead of “waiting to lock” we can also search for “parking to wait for”: “waiting to lock” appears in the thread dump when intrinsic (synchronized) locks are used, while “parking to wait for” appears when locks from java.util.concurrent are used.
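
The same counting trick works for those threads too:

$ awk '/parking to wait for / { print }' < PId-threads.txt | sort | uniq -c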

Java: change the default version


If you have more than one Java version installed on your Linux server (Red Hat flavor), you can change the default using the ‘alternatives’ command:

[hadoop@ip-172-31-36-252 ~]$ sudo /usr/sbin/alternatives --config java

There are 2 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java
   2           /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java

Enter to keep the current selection[+], or type selection number: 2
[hadoop@ip-172-31-36-252 ~]$ sudo /usr/sbin/alternatives --config java

There are 2 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
*  1           /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java
 + 2           /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java

Enter to keep the current selection[+], or type selection number:
[hadoop@ip-172-31-36-252 ~]$
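
If the version you want does not show up in the list, or you want to switch without the interactive prompt, the same tool can register and select a JVM directly. A sketch, where the path and the priority value (20000) are examples to adjust to your installation:

$ sudo /usr/sbin/alternatives --install /usr/bin/java java /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java 20000
$ sudo /usr/sbin/alternatives --set java /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
$ java -version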


monitoring HTTP requests on the fly


Install httpry:

sudo yum install httpry

or

$ sudo yum install gcc make git libpcap-devel
$ git clone https://github.com/jbittel/httpry.git
$ cd httpry
$ make
$ sudo make install

then run:

sudo httpry -i eth0

The output will look like this:

httpry version 0.1.8 -- HTTP logging and information retrieval tool
Copyright (c) 2005-2014 Jason Bittel <jason.bittel@gmail.com>
Starting capture on eth0 interface
2016-07-27 14:20:59.598    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:20:59.599    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
2016-07-27 14:22:02.034    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:22:02.034    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
2016-07-27 14:23:04.640    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:23:04.640    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
2016-07-27 14:24:07.122    172.31.43.18    169.254.169.254    >    GET    169.254.169.254    /latest/dynamic/instance-identity/document    HTTP/1.1    -    -
2016-07-27 14:24:07.123    169.254.169.254    172.31.43.18    <    -    -    -    HTTP/1.0    200    OK
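
To keep the capture for later analysis or narrow it down to specific verbs, httpry can also log to a file and filter by HTTP method; a sketch (check httpry -h on your build, the -o and -m flags are assumed here):

$ sudo httpry -i eth0 -m GET,POST -o /var/log/httpry.log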


yarn: execute a script on all the nodes in the cluster


This is more of a Linux scripting trick, but sometimes we have a Hadoop (YARN) cluster running and need to run a post-install script or activity on all the nodes in the cluster:

for i in `yarn node --list | cut -f 1 -d ':' | grep "ip"`; do ssh -i your-key.pem hadoop@$i 'hadoop fs -copyToLocal s3://mybucket/myscript.sh /home/hadoop/myscript.sh && chmod +x /home/hadoop/myscript.sh && /home/hadoop/myscript.sh' ; done

Note: the your-key.pem file needs to be present on the master node.
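
The same loop spelled out as a small script, under the same assumptions (your-key.pem, the S3 bucket and the script name are placeholders):

#!/bin/bash
# Run from the master node: fetch a script from S3 and execute it on every YARN node.
for host in $(yarn node --list | cut -f 1 -d ':' | grep "ip"); do
  ssh -i your-key.pem "hadoop@$host" '
    hadoop fs -copyToLocal s3://mybucket/myscript.sh /home/hadoop/myscript.sh &&
    chmod +x /home/hadoop/myscript.sh &&
    /home/hadoop/myscript.sh
  '
done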

Indexing Common Crawl Metadata on Elasticsearch using Cascading


If you want to explore how to parallelize data ingestion into Elasticsearch, please have a look at this post I wrote for Amazon AWS:

http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch

It explains how to index Common Crawl metadata into Elasticsearch using the Cascading connector, reading directly from the S3 data source.

The Cascading source code is available here.

How Ganglia works


What is Ganglia?

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization.

[Figure: Ganglia]

Ganglia has the following main components:

1. Ganglia Monitoring Daemon (gmond)
Gmond stands for ganglia monitoring daemon. It is a lightweight service that is installed on every machine you’d like to monitor.
Gmond has four main responsibilities:

1.1 Monitor changes in host state.
1.2 Announce relevant changes.
1.3 Listen to the state of all other ganglia nodes via a unicast or multicast channel.
1.4 Answer requests for an XML description of the cluster state.

Each gmond transmits information in two different ways:

a. Unicasting or Multicasting host state in external data representation (XDR) format using UDP messages.
b. Sending XML over a TCP connection.

Notes about gmond:

– The main configuration file of gmond is /etc/gmond.conf
– gmond is multithreaded

Test gmond installation:

telnet localhost 8649

You should see XML that conforms to the ganglia XML spec.

Or

gmond -d 5 -c /etc/ganglia/gmond.conf

to see the service in debugging mode.
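
To capture the full XML for inspection instead of reading it inside a telnet session, a sketch assuming netcat is installed:

$ nc localhost 8649 > gmond-state.xml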

2. Ganglia Meta Daemon (gmetad)
The ganglia meta daemon (gmetad) is a service that collects data from other gmetad and gmond sources and stores their state to disk in indexed round-robin (RRD) databases. Gmetad provides a simple query mechanism for collecting specific information about groups of machines.

Notes about gmetad:

– The main configuration file for gmetad is /etc/gmetad.conf
– You need at least one node with the gmetad daemon installed in each cluster.
– The gmetad daemon is the one that collects the data sent by the gmond daemons.
– The other nodes in the cluster do not need gmetad installed.
– If you want the machine running gmetad to be monitored as a node as well, you need to install both gmond and gmetad on it.

Test gmetad installation:

telnet localhost 8651

3. Ganglia PHP Web Front-end
The Ganglia web front-end provides a view of the gathered information via real-time dynamic web pages. It is written in PHP and displays Ganglia data in a meaningful way for system administrators and computer users.

[Figure: how Ganglia works]

In this picture we can see gmond installed on each node, sending data to gmetad on a “gmetad node”. We can have one or more “gmetad nodes” in a cluster.

gmetad collects all the data from the gmond daemons and stores it in RRDtool databases, which are then read by the PHP scripts and rendered as shown in the first picture of this article.

4. Gmetrics
The ganglia metric tool is a commandline application that you can use to inject custom made metrics about hosts that are being monitored by ganglia. It has the ability to spoof messages as coming from a different host in case you want to capture and report metrics from a device where you don’t have gmond running (like a network or other embedded device).
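
For example, a sketch of injecting a custom metric, and of spoofing the source host for a device that cannot run gmond (the metric names and values are made up; see the gmetric man page for the full option list):

$ gmetric --name=disk_queue_length --value=7 --type=int32 --units=requests
$ gmetric --name=temp_celsius --value=41.5 --type=float --units=C --spoof=10.0.0.50:switch01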

5. Gstat
The ganglia stat tool is a commandline application that you can use to query a gmond for metric information directly.
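
A sketch (flag names may differ slightly between Ganglia versions; check gstat --help):

$ gstat --all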

6. RRD tool:

Ganglia uses RRDtool for data storage and visualization.

RRDtool is short for Round Robin Database tool, a very useful open-source database tool. RRD stores data as time series: for example, RRDtool will store the values of CPU load sampled at a fixed interval and then graph that data over time.
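
To peek at what Ganglia has stored for a host, you can query an RRD file directly; a sketch assuming the default /var/lib/ganglia/rrds layout (cluster, host and metric names are placeholders):

$ rrdtool fetch /var/lib/ganglia/rrds/CLUSTERNAME/HOSTNAME/cpu_user.rrd AVERAGE --start -1h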

Elasticsearch and Kibana on EMR Hadoop cluster


If you need to add Elasticsearch and Kibana on EMR, please have a look at this post I wrote for Amazon AWS:

http://blogs.aws.amazon.com/bigdata/post/Tx1E8WC98K4TB7T/Getting-Started-with-Elasticsearch-and-Kibana-on-Amazon-EMR

It contains all the steps to launch a cluster and perform basic testing of both tools.

Additionally, here you will find the source code for the bootstrap actions used to configure Elasticsearch and Kibana on the EMR Hadoop cluster:

https://github.com/awslabs/emr-bootstrap-actions/tree/master/elasticsearch

