GC Tuning: Reduce Pause Times in Any Runtime
Garbage collection pauses are unavoidable, but their duration is largely negotiable.
Let’s see it in action. Imagine a Java application processing a steady stream of incoming requests. Each request creates short-lived objects. Without tuning, the garbage collector (GC) might decide to perform a full, stop-the-world collection right in the middle of a critical request, freezing the application for hundreds of milliseconds. This is the pause time we aim to shrink.
Here’s how a tuned GC behaves. We’ll focus on a generational GC, common in Java (HotSpot JVM) but conceptually similar to others. Generational GCs divide the heap into "generations": young (eden, survivor spaces) and old. Most objects are short-lived and die in the young generation, which is collected frequently and quickly. Only objects that survive multiple young-gen collections are promoted to the old generation, which is collected less often but is larger.
The core problem is that the old generation can become large, and collecting it can take a long time. The goal of tuning is to keep short-lived objects out of the old generation so that it fills slowly, or to use GCs that can collect the old generation concurrently with the application.
The primary levers we pull are:
- Heap Size: This is the most fundamental lever. Too small a heap leads to frequent collections. Too large a heap means more memory for the GC to process when it does run.
  - Diagnosis: Monitor heap usage with `jstat -gcutil <pid> 1s`. Look for consistently high old-gen usage (`O`) or frequent full collections (indicated by `FGC` increasing rapidly).
  - Fix: For Java, set the initial and maximum heap size to the same value, for example `-Xms4g -Xmx4g` for a 4 GB heap. This prevents resizing overhead. Treat 4 GB as an illustration and size the heap from your application's observed live data plus headroom.
  - Why it works: A stable, appropriately sized heap gives the GC enough room to work without constant resizing, and it doesn’t have to process an unnecessarily large amount of memory during collections.
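To confirm from inside the process that the heap settings took effect, a small check like the following can help. It is a minimal sketch: the class name `HeapBounds` and the helper `percentUsed` are hypothetical, and the values reported by `Runtime` are approximations that can differ slightly from the `-Xms`/`-Xmx` values depending on the collector.

```java
// Hypothetical helper for illustration; not part of any library.
final class HeapBounds {

    // Percentage of capacity in use; returns 0 for a non-positive capacity.
    static double percentUsed(long used, long capacity) {
        if (capacity <= 0) {
            return 0.0;
        }
        return 100.0 * used / capacity;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();          // roughly the -Xmx limit
        long committed = rt.totalMemory();  // heap currently reserved by the JVM
        long used = committed - rt.freeMemory();

        System.out.printf("max=%d MiB, committed=%d MiB, used=%.1f%% of max%n",
                max >> 20, committed >> 20, percentUsed(used, max));
    }
}
```

Running it with `java -Xms4g -Xmx4g HeapBounds.java` should report committed and max values that are close to each other, whereas a large gap suggests the heap is still free to resize.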
- Young Generation Size: This is where most objects are born and die. Making it too small causes objects to be promoted to the old generation prematurely, increasing the frequency of costly old-gen collections.
  - Diagnosis: Observe the Eden space (`E`) and Survivor spaces (`S0`, `S1`) usage in `jstat -gcutil <pid> 1s`. Eden filling up between collections is normal. The warning signs are a rapidly increasing young collection count (`YGC`), survivor spaces that are frequently full, and old-gen usage (`O`) that climbs after each young collection.
  - Fix: Adjust `-XX:NewRatio`, the ratio of old to young generation size. With `-XX:NewRatio=3` the old generation is 3 times larger than the young generation (total heap size = young gen + old gen), so a 4GB heap gets a 1GB young gen. The HotSpot default is commonly 2, so choose a lower ratio (or set the size directly with `-Xmn`) when the goal is a larger young generation. With G1, prefer leaving the young generation size unset, because fixing it overrides the pause-time goal.
  - Why it works: A larger young generation allows more short-lived objects to be created and collected within it, delaying their promotion to the old generation and reducing the frequency of old-generation collections.
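The `NewRatio` arithmetic is easy to get backwards, so here is a small sketch of it. The class `NewRatioMath` and method `youngGenBytes` are hypothetical names, and the calculation ignores the alignment and rounding a real JVM applies, so treat the results as estimates.

```java
// Hypothetical helper for illustration; real JVMs align generation sizes,
// so actual values can differ slightly from this estimate.
final class NewRatioMath {

    // NewRatio = old / young, therefore young = heap / (NewRatio + 1).
    static long youngGenBytes(long heapBytes, int newRatio) {
        if (heapBytes < 0 || newRatio < 1) {
            throw new IllegalArgumentException("heapBytes must be >= 0 and newRatio >= 1");
        }
        return heapBytes / (newRatio + 1L);
    }

    public static void main(String[] args) {
        long heap = 4L << 30; // 4 GiB, matching the example heap size
        for (int ratio = 1; ratio <= 3; ratio++) {
            System.out.printf("NewRatio=%d -> young gen ~ %d MiB%n",
                    ratio, youngGenBytes(heap, ratio) >> 20);
        }
    }
}
```

Note the direction of the relationship: a higher ratio means a smaller young generation.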
- Garbage Collector Choice: Different GCs have different performance characteristics. For low-pause applications, concurrent collectors are key.
  - Diagnosis: Observe `jstat -gc <pid> 1s`. Look for high `FGC` (full GC count) and `FGCT` (full GC time) values.
  - Fix: Use `-XX:+UseG1GC`. G1 has been the default since JDK 9 on most server-class machines, but explicitly setting it ensures you’re using the Garbage-First collector, designed for predictable pause times.
  - Why it works: G1 divides the heap into regions and collects them selectively, prioritizing regions with the most garbage. It marks live objects in the old generation concurrently with the application and reclaims space in short, bounded evacuation pauses, significantly reducing long stop-the-world pauses.
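One way to verify which collector a process is actually running is to ask the JVM itself through the standard `java.lang.management` API. The sketch below uses a hypothetical class name, `ActiveCollectors`. The bean names it prints are implementation-specific; with G1 they typically include "G1 Young Generation" and "G1 Old Generation", but do not rely on exact strings.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration.
final class ActiveCollectors {

    // Names of the collectors registered in this JVM.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are cumulative since JVM start; -1 means "undefined".
            System.out.printf("%s: collections=%d, total time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```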
- Target Pause Time (G1GC Specific): G1GC allows you to specify a target pause time.
  - Diagnosis: Even with G1, if pauses are still too long, the default target (200 ms) might be too loose for your latency needs, or the heap too small for G1 to meet the target.
  - Fix: Use `-XX:MaxGCPauseMillis=100`. This tells G1 to try to keep pauses below 100 milliseconds. It is a goal, not a guarantee. G1 adjusts its collection strategy (e.g., shrinking the young generation or collecting fewer regions per pause) to meet it.
  - Why it works: This provides a direct signal to the GC algorithm about your application’s latency requirements, allowing it to make trade-offs (e.g., more frequent collections and somewhat lower throughput) to meet the desired pause time.
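A rough in-process check of how observed pauses compare with the target can be sketched with the same management API. The names `PauseCheck`, `meanPauseMillis`, and `exceedsTarget` are hypothetical, and the hard-coded 100 ms target is an assumption mirroring the flag value above. A mean hides outliers, and for some collectors the reported time includes concurrent work, so GC logs remain the better source for worst-case pauses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration; a mean is only a coarse signal.
final class PauseCheck {

    // Mean time per collection in milliseconds; 0 when nothing was collected.
    static double meanPauseMillis(long totalMillis, long collections) {
        if (collections <= 0 || totalMillis < 0) {
            return 0.0;
        }
        return (double) totalMillis / collections;
    }

    static boolean exceedsTarget(double meanMillis, long targetMillis) {
        return meanMillis > targetMillis;
    }

    public static void main(String[] args) {
        long targetMillis = 100; // assumption: same value as -XX:MaxGCPauseMillis=100
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            double mean = meanPauseMillis(gc.getCollectionTime(), gc.getCollectionCount());
            System.out.printf("%s: mean=%.1f ms%s%n", gc.getName(), mean,
                    exceedsTarget(mean, targetMillis) ? " (above target)" : "");
        }
    }
}
```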
- Concurrent Marking (G1GC Specific): G1 has a concurrent marking phase to identify live objects in the old generation. Tuning when it starts, and how much free space it has to work with, can help.
  - Diagnosis: Monitor GC logs for long STW pauses during the "remark" phase of G1, and for evacuation failures ("to-space exhausted") or full GCs, which indicate the heap filled up before a concurrent cycle could finish.
  - Fix: Use `-XX:G1ReservePercent=15`. This tells G1 to keep 15% of the heap free as a reserve (the default is 10%), so that objects being copied or promoted still have somewhere to go when the heap is nearly full. To start the concurrent cycle itself earlier, lower `-XX:InitiatingHeapOccupancyPercent` from its default of 45.
  - Why it works: By ensuring there’s always some free space, and by starting concurrent marking earlier, G1 gets more time to complete its work before a full STW collection becomes necessary. This reduces the likelihood of falling back to a long pause.
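To see which values the running JVM is actually using for `G1ReservePercent` and the related `InitiatingHeapOccupancyPercent` flag, the flags can be read back at runtime. This sketch assumes a HotSpot-based JVM that exposes `com.sun.management.HotSpotDiagnosticMXBean`; the class `G1Flags` and method `flagValue` are hypothetical, and on other JVMs or builds without G1 the lookup simply returns empty.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

// Hypothetical helper for illustration; assumes a HotSpot-based JVM.
final class G1Flags {

    // Current value of a VM flag, or empty if the flag or the bean is unavailable.
    static Optional<String> flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return Optional.empty();
            }
            return Optional.of(bean.getVMOption(name).getValue());
        } catch (RuntimeException e) {
            // Unknown flag names raise IllegalArgumentException.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] flags = {"G1ReservePercent", "InitiatingHeapOccupancyPercent"};
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag).orElse("(not available)"));
        }
    }
}
```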
- Humongous Objects (G1GC Specific): G1 handles very large objects ("humongous objects") specially, allocating them directly to humongous regions. Too many can fragment the heap.
  - Diagnosis: Look in the GC logs for collections triggered by `G1 Humongous Allocation` and for a high or growing `Humongous regions` count. `jstat` does not report humongous regions separately.
  - Fix: This often requires application-level changes to avoid allocating extremely large objects. G1 treats an object as humongous when it is about half a region or larger, e.g., byte arrays over 512KB with 1MB regions. If unavoidable, consider increasing heap size or tuning `-XX:G1HeapRegionSize` (advanced, requires restart).
  - Why it works: Humongous objects bypass the normal allocation and collection mechanisms, and if too many are allocated, they can quickly fill up the heap and lead to fragmentation, forcing more frequent, longer collections.
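The half-a-region rule behind the 512KB figure can be sketched as a simple check. The names `HumongousCheck`, `thresholdBytes`, and `isLikelyHumongous` are hypothetical, the exact boundary behavior is implementation-specific, and the size passed in should include the object header, so treat results near the threshold as approximate.

```java
// Hypothetical helper for illustration; the exact boundary is JVM-specific.
final class HumongousCheck {

    // G1 treats objects of roughly half a region or more as humongous.
    static long thresholdBytes(long regionBytes) {
        return regionBytes / 2;
    }

    static boolean isLikelyHumongous(long objectBytes, long regionBytes) {
        return objectBytes >= thresholdBytes(regionBytes);
    }

    public static void main(String[] args) {
        long regionBytes = 1L << 20; // assumption: 1 MiB regions
        long[] sizes = {100L * 1024, 600L * 1024, 4L << 20};
        for (long size : sizes) {
            System.out.printf("%d KiB with %d KiB regions -> humongous? %b%n",
                    size >> 10, regionBytes >> 10, isLikelyHumongous(size, regionBytes));
        }
    }
}
```

Raising the region size raises the threshold, which is why a larger `G1HeapRegionSize` can turn humongous allocations back into ordinary ones.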
The next problem you’ll hit is likely CPU starvation, if the GC’s concurrent threads compete with your application for cores, or an OutOfMemoryError, if your heap is still too small for your application’s peak load.