Translations of this page:

Thread Dumps

I work a lot with Java application and web servers such as Weblogic, JBoss and Tomcat. When an application goes bad a Thread Dump can tell you a lot. A Thread Dump is a snapshot of the internal state (effectively a stack trace) of all the Threads running inside a Java Virtual Machine (JVM). If an application running on the JVM is unresponsive because threads are blocked waiting for a resource (lock, database lock, slow pages etc) the Thread Dump can be a useful complement to normal logs for diagnosing the cause of the problem.

This short article shows you how to obtain a thread dump on Windows and Unix and gives a couple of examples of problems I've analyzed.

Thread Dumps on Windows

You can get a thread dump directly from the console window. Simple type <CTL><BREAK> then cut and paste the thread dump.

Thread Dumps on Unix

To obtain a thread dump log in as root and get the Process Id (PID) of the JVM you want to obtain a dump for. This example uses an instance of the Apache Tomcat 5.5 servlet engine.

# ps aux | grep java
…
tomcat   22016  0.4 18.9 1307680 767592 ?    Sl   Feb01  18:34 /usr/local/jdk/bin/java -Xms128m -Xmx512m -server -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/home/sezame/sezame/clients/bto/config/tomcat/conf/logging.properties -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8889 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dsezame.home=/home/sezame/sezame -Dwebserver.home=/usr/local/apache2 -Dsezame.docroot=/var/apache/htdocs -Dsezame.webapps=/home/sezame/webapps -Dsezame.onlysite=gap -Djava.endorsed.dirs=/usr/local/tomcat/common/endorsed -classpath :/usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/commons-logging-api.jar -Dcatalina.base=/home/sezame/sezame/clients/bto/config/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/home/sezame/sezame/clients/bto/config/tomcat/temp org.apache.catalina.startup.Bootstrap start

Here we see that the tomcat PID is 22016 and base directory is: /home/sezame/sezame/clients/bto/config/tomcat. Change to the logs sub-directory under the base directory and run

# kill -SIGQUIT 22016

The thread dump should be output to Catalina.out and will have the following format:

# tail -3000 catalina.out | more
…
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.5.0_09-b03 mixed mode):
 
"TP-Processor24" daemon prio=1 tid=0x0000002acc7c5580 nid=0x198 runnable [0x0000000054494000..0x0000000054494d30]
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        - locked <0x0000002aa6351298> (a java.io.BufferedInputStream)
        at org.apache.jk.common.ChannelSocket.read(ChannelSocket.java:626)
        at org.apache.jk.common.ChannelSocket.receive(ChannelSocket.java:564)
        at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:691)
        at org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:895)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:685)
        at java.lang.Thread.run(Thread.java:595) 
 
"TP-Processor23" daemon prio=1 tid=0x0000002acc5dd310 nid=0x197 runnable [0x0000000054393000..0x0000000054393db0]
    etc

This tells us that the first thread is currently waiting in the SocketRead method – probably reading a page request from a client. There will usually be a large number of such traces.

Database Deadlock or Pool Leak

The client was having a problem with the application freezing on a regular basis. I asked the operations team to take a thread dump when the next outage occurred. I loaded this into IBM's JCA tool. The Thread Dump list showed a high number of total threads, approx 500 (Tomcat was configured for 400 user threads). Very few of these were runnable with the majority waiting on a lock in the database connection pool's getConnection() method.

This implied either a deadlock or a connection pool leak. In the first case we have threads that need more than one connection to complete some work. After acquiring one connection they are locked out by other threads. In the latter case we have a connection which is not returned to the pool, either as a programming error or due to an exception condition not being caught. These kind of errors can be detected by keeping a trace of who is acquiring and releasing connections.

Further analysis of the system showed that deadlocks were possible as each JSP page consumed multiple database connections due to an design error. However a code review also showed numerous places where database connections could be left open as they were not placed in a finally block.

get() throws Exception {
  Connection c = pool.getConnection();
  try {
    // do some work
    ...
  } catch (SQLException e) {
    // log exception
  }
}

All of the methods rethrew Exception - which meant that Exception outside of SQLExceptions were not even logged. This is poor practice.

Here is what the code should look like

Connection c = pool.getConnection();
try {
  // do some work
  ...
} catch (SQLException e) {
  // log exception
} finally {
  pool.returnConnection(c);
}

You might even log exceptions later in the chain. See the Java Exception Patterns for more ideas.

OSCache Problem

We have an application consisting of two main JSP pages, we'll call them Content and Main. A developer recently modified one JSP to add tags for caching part of the page content using OSCache. This change got rolled out to staging then production without anyone noticing there was a problem. The Content page is called quite deep in the site. A while later another developer modified Main to cache some information about the user using oscache (not a good practise, they should have put this in the session.)

When this second page was rolled into production the Tomcat server got progressively slower then stopped. This occurred over the space of 2 hours. Management immediately suspected the changes to the Main page were to blame. Cause and effect, action and reaction but things are rarely so simple. The operations guys thought that the problem was with the mod_jk connector between the front end Apache load balancer and the backend Tomcats. Immediately two task force were set up, one to modify the Main jsp and another to investigate the mod_jk connection.

I got infrastructure to restart the server and take Thread Dumps at periods. I loaded these into TDA and did a comparison. It was obvious that a lot of threads were hanging in OSCache operations.

"TP-Processor21" daemon prio=6 tid=0x18f53d88 nid=0x1e4 in Object.wait() [0x2104f000..0x2104fd18]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x08995c10> (a com.opensymphony.oscache.base.EntryUpdateState)
        at java.lang.Object.wait(Object.java:474)
        at com.opensymphony.oscache.base.Cache.getFromCache(Cache.java:278)
        - locked <0x08995c10> (a com.opensymphony.oscache.base.EntryUpdateState)
        at com.opensymphony.oscache.web.tag.CacheTag.doStartTag(CacheTag.java:514)
        at com.sezame.taglib.cda.html.CacheTag.doStartTag(CacheTag.java:150)
        at org.apache.jsp.gap.display.main.gap_005fcontent_jsp._jspService(org.apache.jsp.gap.display.main.gap_005fcontent_jsp:2122)
 
"TP-Processor20" daemon prio=6 tid=0x18f53d88 nid=0x1e4 in Object.wait() [0x2104f000..0x2104fd18]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x08995c10> (a com.opensymphony.oscache.base.EntryUpdateState)
        at java.lang.Object.wait(Object.java:474)
        at com.opensymphony.oscache.base.Cache.getFromCache(Cache.java:278)
        - locked <0x08995c10> (a com.opensymphony.oscache.base.EntryUpdateState)
        at com.opensymphony.oscache.web.tag.CacheTag.doStartTag(CacheTag.java:514)
        at com.sezame.taglib.cda.html.CacheTag.doStartTag(CacheTag.java:150)
        at org.apache.jsp.gap.display.main.gap_005fcontent_jsp._jspService(org.apache.jsp.gap.display.main.gap_005fcontent_jsp:2122)

Yes there were hangs in Main but also in Content which had not been suspected. I checked the Main code and nothing looked obvious. However a search on the OSCache forums revealed that you could get hangs if cache tags were nested. Content cached a navigation bar that was generated in an included script. A check of the code revealed that this bar was both cached in Content and in the included script. Removing one of the caches fixed the problem.

Meanwhile the task forces beavered on. They fixed Main and rolled out the changes. The hangs still occurred but at a much slower rate. Logic, Content is only rarely called. I suspect that as the usage of the site had ramped up we had seen more and more problems with this page. The mod_jk force also expended a huge amount of energy. Finally the Content fix was rolled out… solving the problem.

Thread Dump Analyzers

To analyze thread dumps it helps to use a tool to sort and present the data.

JCA

JCA requires you to register with Alphaworks. It makes good use of colour and can also detect conditions such as deadlocks.

http://www.alphaworks.ibm.com/tech/jca

TDA

Thread Dump Analyzer. TDA has some nice features. You can load multiple snapshots which enables you to find long-running threads.

http://tda.dev.java.net/servlets/NewsItemView?newsItemID=3608

Samurai

Samurai can be launched by Java Web Start which avoids having to explicitly download and install it and also means you will be using the latest code base.

http://yusuke.homeip.net/samurai/?english

tech/java/thread-dumps.txt · Last modified: 2008/04/15 13:20 by davidof
Recent changes RSS feed