Interesting problem posed to me today: why are we having trouble running a JVM with a 6GB heap? Cutting remarks aside, I approached the problem with the assumption that the 6GB is gainfully used. It’s a vendor product we’re working with anyway, so there’s not much we can do about it.
An interesting twist is that the problem only happens in production, and not in the testing environment (which is smaller in scale), so I had to work with indirect evidence. As I was asked only to provide advice and direction, injecting my own code into the process wasn’t possible. The problem, BTW, is that performance under heavy, concurrent load suddenly drops a lot, and while operations eventually complete, the operative word is “eventually”.
Given the size of the heap and the observation that it only happens under load, I went in with the hypothesis that it was a GC issue. After working through the research they had done, and some of my own, I reported my opinions/findings to the team:
- The OS tools to monitor process memory (in this case, prstat) aren’t accurate for this purpose: they show what the OS has allocated to the process, not how much of the heap is actually in use. Memory leaks in one’s Java program should be found from inside the JVM, e.g. with freeMemory() and totalMemory() on java.lang.Runtime. The OS, if it still has space (in this case several GB), is under no pressure to reduce the memory allocated to a process that was once needy, and thus stands a good chance of being needy again. Artificially using up the free memory would probably induce an OS response that allows for better measurement.
- While the Sun 1.4 JVM has a concurrent garbage collector (so mostly no more stop-the-world-while-I-collect-the-garbage), it still has to do a full, stop-the-world GC in order to compact the memory and/or to catch up if allocations outpace the concurrent GC. Most of my reading was from: http://java.sun.com/docs/hotspot/gc1.4.2/faq.html
- The IBM 1.4 JVM has an incremental compaction algorithm. Their lab results (10 seconds down to 0.5 seconds) were interesting in two ways: 1) a major decrease in compaction overhead, and 2) even a 700MB heap takes 10 seconds for a full GC cycle on a serious MP box.
- The use of a very large object cache with Least-Frequently-Used expiry is perhaps not a good idea for a long-running Java application (weeks) with heavy (albeit unconfirmed) GC overhead. A large cache means an extra GB or two that always has to be scoured by the GC. An LFU policy optimizes for the average long-term case, but if the purpose is to maintain a minimum SLA, an LRU policy might be more effective at reacting to sudden spikes in load. In the face of heterogeneous operations, an LFU cache also needs to be much bigger to be effective (e.g. consider three operations that each access their own non-intersecting data set equally often).
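For the first point, here is a minimal sketch of what measuring from inside the JVM might look like, assuming one is able to run code in the process at all (which I wasn’t, with this vendor product). The class name HeapSnapshot and the output format are made up for illustration; only the java.lang.Runtime calls are standard.

```java
public class HeapSnapshot {

    /** Bytes of heap currently in use, as seen from inside the JVM. */
    public static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        // totalMemory() is the heap currently committed by the JVM;
        // freeMemory() is the unused part of that committed heap.
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One-line summary: used, committed and maximum heap in MB. */
    public static String describe() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        return "used=" + (usedBytes() / mb) + "MB"
             + " committed=" + (rt.totalMemory() / mb) + "MB"
             + " max=" + (rt.maxMemory() / mb) + "MB";
    }

    public static void main(String[] args) throws InterruptedException {
        // Print a few samples; a real monitor would log these periodically
        // and look at the trend of used memory after each full GC.
        for (int i = 0; i < 3; i++) {
            System.out.println(describe());
            Thread.sleep(200);
        }
    }
}
```

The number that matters for leak hunting is how "used" trends over time, not what prstat reports for the process. When code can’t be added, starting the JVM with the standard -verbose:gc flag should at least log each collection.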
My main recommendations to the team:
- Try out the IBM 1.4 JVM with incremental compaction.
- Partition into multiple processes with smaller heaps.
- Decrease the cache size to balance out I/O costs vs. GC costs over large heaps.
- Change the cache algorithm to something that helps out better in busy situations.
- Get more/better diagnostic information somehow.
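To make the cache recommendations a bit more concrete, here is a rough sketch of a small, bounded LRU cache built on java.util.LinkedHashMap. This is not the vendor’s cache, which I haven’t seen; the class name LruCache and the maxEntries parameter are my own inventions, and the sketch assumes a modern Java with generics rather than 1.4.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    // Hypothetical knob: the upper bound on cached entries.
    private final int maxEntries;

    public LruCache(int maxEntries) {
        // accessOrder=true makes iteration order least-recently-accessed
        // first, so the "eldest" entry is the LRU one.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; returning true evicts the LRU entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<String, Integer>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // touch "a", so "b" becomes least recently used
        cache.put("c", 3); // over capacity: "b" is evicted
        System.out.println(cache.keySet());
    }
}
```

A hard bound on entries keeps the amount of long-lived data the GC has to walk predictable, and recency-based eviction adapts quickly when the access pattern shifts under load. Note that LinkedHashMap is not thread-safe, and in access-order mode even get() modifies the map, so under concurrent load it would need external synchronization (e.g. Collections.synchronizedMap).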
I guess I could also have recommended trying Azul’s hardware JVMs, but I don’t think they’d be able to stomach the cost.
I don’t consider myself an expert in garbage collection, cache algorithms, or magically diagnosing problems with insufficient information, but one does one’s best… Hopefully I guessed the right direction, but I’m looking forward to receiving the vendor’s experts’ responses… or suggestions from informed visitors.