Whether you are trying to track down a memory leak in an Atlassian application or a DB engine, or you just want to know why your Java process is ‘hung’ or eating all the CPU, you will end up performing the same debugging steps. Here are a few notes on taking Java thread dumps and checking what’s going on under the hood.
Which Java process is using the most CPU:
$ ps u -C java
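If one java process stands out, it can also help to see which OS thread inside it is burning the CPU; its thread id, printed in hex, matches the nid= field in the thread dump we take next. A quick sketch (TId here is just a placeholder for the hot thread id reported by top):
$ top -H -p PId        # per-thread CPU usage inside that java process
$ printf '0x%x\n' TId  # hex value to grep for as nid= in the thread dump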
Generate the Java thread dump:
$ jstack -l PId > PId-threads.txt
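A single snapshot can be misleading, so it is worth taking a few dumps some seconds apart and checking that the same lock shows up in all of them; something along these lines:
$ for i in 1 2 3; do jstack -l PId > PId-threads-$i.txt; sleep 10; done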
From the dump, we can count the threads in each state:
$ awk '/State: / { print }' < PId-threads.txt | sort | uniq -c
    450    java.lang.Thread.State: BLOCKED (on object monitor)
    240    java.lang.Thread.State: RUNNABLE
     47    java.lang.Thread.State: TIMED_WAITING (on object monitor)
    294    java.lang.Thread.State: TIMED_WAITING (parking)
     31    java.lang.Thread.State: TIMED_WAITING (sleeping)
     42    java.lang.Thread.State: WAITING (on object monitor)
     62    java.lang.Thread.State: WAITING (parking)
From these, we search for the ones that are “waiting to lock”***:
$ awk '/waiting to lock / { print }' < PId-threads.txt | sort | uniq -c
      1    - waiting to lock <0x0000000600a027d8> (a org.apache.log4j.spi.RootLogger)
    294    - waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)
     19    - waiting to lock <0x0000000600f36fc8> (a java.lang.Object)
      1    - waiting to lock <0x000000072f6e6708> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
The main blocking lock here is <0x0000000600f2e770>, the java.util.Hashtable.
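To see where those 294 threads are stuck, we can pull the stack frame just above each “waiting to lock” line and count the distinct frames (a quick sketch, assuming GNU grep for the -B option):
$ grep -B 1 'waiting to lock <0x0000000600f2e770>' PId-threads.txt | grep 'at ' | sort | uniq -c | sort -rn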
Among the BLOCKED threads, many are waiting in Client.getConnection:
"Thread-132985" prio=10 tid=0x00007fec40784800 nid=0x662d waiting for monitor entry [0x00007fec18cd4000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1449)
- waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)
And many others are waiting in Connection.close:
"IPC Client (738091550) connection to /10.66.2.38:9022 from hadoop" daemon prio=10 tid=0x00007fec41c1f800 nid=0x2dcc waiting for monitor entry [0x00007fec15da6000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.ipc.Client$Connection.close(Client.java:1135)
- waiting to lock <0x0000000600f2e770> (a java.util.Hashtable)
All of them are BLOCKED by <0x0000000600f2e770>.
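The holder can be spotted by searching the same dump for the address that is locked, rather than waited on; one way to do it:
$ grep -n -- '- locked <0x0000000600f2e770>' PId-threads.txt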
Scrolling up from the matching “locked” line to its thread header, we find the Java thread that holds the lock on this resource. The culprit looks like this:
"Thread-133346" prio=10 tid=0x00007fec40ac7800 nid=0x747e runnable [0x00007fec17cc4000]
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.<init>(Thread.java:234)
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:396)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1452)
- locked <0x0000000600f2e770> (a java.util.Hashtable)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:215)
at com.sun.proxy.$Proxy42.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:163)
at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy43.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291)
at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:294)
at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:152)
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:319)
- locked <0x0000000733033970> (a org.apache.hadoop.mapred.ClientServiceDelegate)
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:532)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:314)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:311)
- locked <0x0000000733032e20> (a org.apache.hadoop.mapreduce.Job)
at org.apache.hadoop.mapreduce.Job.getJobState(Job.java:347)
at org.apache.hadoop.mapred.JobClient$NetworkedJob.getJobState(JobClient.java:295)
- locked <0x0000000733032e10> (a org.apache.hadoop.mapred.JobClient$NetworkedJob)
at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:244)
at org.apache.hadoop.hive.ql.exec.mr.HadoopJobExecHelper.progress(HadoopJobExecHelper.java:547)
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:426)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
This kind of BLOCKED status is not technically a deadlock; it is just one thread blocking the others: it holds the lock on a resource while many other threads queue up waiting for that same resource.
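A true deadlock would be reported by jstack itself at the end of the dump, so it is worth double-checking that no such section is present; for example:
$ grep -A 20 'Found one Java-level deadlock' PId-threads.txt   # no output means jstack did not detect a deadlock cycle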
*** Instead of “waiting to lock” we can also search for “parking to wait for”: “waiting to lock” appears in the thread dump when intrinsic locks (synchronized blocks) are used, while “parking to wait for” appears for locks from java.util.concurrent.
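The same counting trick from above works for those as well:
$ awk '/parking to wait for / { print }' < PId-threads.txt | sort | uniq -c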