Well, after two weeks of slogging through myriad combinations of GC settings, GC logs, application logs, multiple JConsole windows, and cursing various spreadsheet programs' pathetic ability to deal with data sets of any significant size, I finally hit a jackpot of sorts. Looking back, though, I practically have to slap myself for not seeing this from the very beginning; it was right there, so apparent, so obvious.
Long story short: we have a service application that is used by various groups within the company. Lately we had been struggling a bit to maintain our SLAs under much higher load than we used to have. So we started digging into logs, and pretty soon focused on GC tuning. This application runs on Java 5, so we started by replacing the default serial collector with the concurrent mark sweep (CMS) collector. There were some improvements, but not enough. The application still had way too many full GC events, and way too many full GCs caused by concurrent mode failures. Yet there were hardly any regular CMS collections at all.
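For the record, the collector switch boils down to a couple of launch flags. The command below is a sketch, not our actual launch script; the jar name and logging flags are illustrative:

```shell
# Illustrative Java 5 launch command; service.jar is a placeholder name.
# -XX:+UseConcMarkSweepGC : replace the default serial old-generation collector with CMS
# -XX:+UseParNewGC        : collect the young generation in parallel alongside CMS
# GC logging flags        : produce the logs we later imported into a spreadsheet
java -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log \
     -jar service.jar
```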
Looking at JConsole, full GCs started pretty much right after application startup and just kept going and going, and would later be joined by concurrent mode failures when load started to climb. So we increased the total heap and proportionally added more space to both the young generation and the old generation, hoping this would give objects in the young generation more time to die, and give the old generation enough headroom to avoid concurrent mode failures. Well, that did not quite work out as we had hoped. YoungGen collections still happened at about the same frequency, and average YoungGen collection time got dramatically worse, taking about 5x longer. Worse, the full GCs and concurrent mode failures were still there, though there were fewer concurrent mode failures this time; presumably the increased OldGen size helped a little there. The worst part is that the larger heap actually increased GC pause times and caused even more outliers that fell outside our SLA.
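The resize itself was just a matter of flags. The numbers below are invented for illustration (ours were different), but they show the shape of the change, and one plausible reason the bigger young generation hurt minor pause times:

```shell
# Hypothetical sizing, before and after; none of these numbers are from our setup.
# Before: moderate heap, smallish young generation
java -Xms1g -Xmx1g -XX:NewSize=256m -XX:MaxNewSize=256m -jar service.jar

# After: everything scaled up proportionally. A bigger young generation can mean
# more surviving data copied per minor collection, consistent with the ~5x longer
# YoungGen pauses we observed.
java -Xms2g -Xmx2g -XX:NewSize=512m -XX:MaxNewSize=512m -jar service.jar
```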
Now comes the jackpot part. I was staring at the GC logs I had imported into Excel, and at the JConsole windows, watching the count of full CMS GCs go up steadily. Then it hit me, the pattern that is so obvious now: the full GCs were coming in once a minute, almost like clockwork, right from application startup, when there was barely anything in OldGen. There was no way these were triggered by OldGen filling up; something had to be calling System.gc() somewhere. A quick Google search indicated this was probably the case. And the fix is straightforward: disable explicit GC calls with the flag -XX:+DisableExplicitGC. Presto, no more of this one-full-GC-per-minute nonsense.
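If you are hunting the same problem: with -XX:+PrintGCDetails, some HotSpot versions tag explicit collections with "(System)" in the GC log, which makes them easy to spot. The log line below is a fabricated example, not our actual output, and the fix is the one flag:

```shell
# An explicit System.gc() collection may show up tagged "(System)" in the GC log,
# e.g. (fabricated example line, format varies by JVM version):
#   [Full GC (System) [CMS: 20406K->18102K(786432K), 0.3081343 secs] ...]
# -XX:+DisableExplicitGC turns System.gc() into a no-op, so the JVM alone decides
# when to collect:
java -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -jar service.jar
```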
An added bonus: with explicit GC disabled, we now see the expected regular CMS collections. More importantly, we found out our PermGen was getting full. Our load test still produced two concurrent mode failures, and from what we can see in JConsole, both happened right after PermGen filled up; the subsequent full GC cleaned up about half of the PermGen space. So we will up our PermGen size as well. Our application instantiates a whole bunch of objects every few minutes and caches them, so we know we are thrashing PermGen space. The once-per-minute full GCs were masking the fact that our PermGen was too small.
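Upping PermGen is again just flags; the sizes below are illustrative, not our real values. One caveat worth mentioning: on Java 5, CMS does not sweep PermGen by default, so if you want the concurrent collector (rather than a full GC) to clean it up, you may also need the two CMS PermGen flags:

```shell
# Illustrative PermGen sizing; pick real values from JConsole's PermGen usage graph.
# The two CMS flags let the concurrent collector unload classes and sweep PermGen,
# which it does not do by default on Java 5.
java -XX:PermSize=128m -XX:MaxPermSize=256m \
     -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled \
     -jar service.jar
```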
One day I might spend some time finding out which library we use is calling System.gc(). But I am pretty sure I will not go anywhere near it any time soon, not after these last couple of weeks in GC tuning paradise :-)