背景

接手的服务在启动过程中总会发生young gc 超过100ms的异常告警，以下记录一下分析过程和定位解决的办法

原有的jvm参数配置如下

/usr/local/java/bin/java
-Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djdk.tls.ephemeralDHKeySize=2048
-Djava.protocol.handler.pkgs=org.apache.catalina.webresources
-server
-Djava.awt.headless=true
-Xms5324m
-Xmx5324m
-Xss512k
-XX:PermSize=350m
-XX:MaxPermSize=350m
-XX:MetaspaceSize=256m
-XX:MaxMetaspaceSize=256m
-XX:NewSize=2048m
-XX:MaxNewSize=2048m
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=9
-XX:+UseConcMarkSweepGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+CMSScavengeBeforeRemark
-XX:+ScavengeBeforeFullGC
-XX:+UseCMSCompactAtFullCollection
-XX:+CMSParallelRemarkEnabled
-XX:CMSFullGCsBeforeCompaction=9
-XX:CMSInitiatingOccupancyFraction=80
-XX:+CMSClassUnloadingEnabled
-XX:SoftRefLRUPolicyMSPerMB=0
-XX:-ReduceInitialCardMarks
-XX:+CMSPermGenSweepingEnabled
-XX:CMSInitiatingPermOccupancyFraction=80
-XX:+ExplicitGCInvokesConcurrent
-Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.EPollSelectorProvider
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintHeapAtGC

org.apache.catalina.startup.Bootstrap

从已有的参数只能看出来使用的是CMS gc，分析不出根因，所以还是要看gc的日志

第一次after gc ，from space剩余95%,times: 0.22 secs

{Heap before GC invocations=0 (full 0):
 par new generation   total 1887488K, used 1677824K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
  eden space 1677824K, 100% used [0x00000006a3c00000, 0x000000070a280000, 0x000000070a280000)
  from space 209664K,   0% used [0x000000070a280000, 0x000000070a280000, 0x0000000716f40000)
  to   space 209664K,   0% used [0x0000000716f40000, 0x0000000716f40000, 0x0000000723c00000)
 concurrent mark-sweep generation total 3354624K, used 0K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
 Metaspace       used 14282K, capacity 14502K, committed 14720K, reserved 1062912K
  class space    used 1525K, capacity 1600K, committed 1664K, reserved 1048576K
2023-04-20T18:05:25.023+0800: 6.553: [GC (Allocation Failure) 2023-04-20T18:05:25.023+0800: 6.553: [ParNew: 1677824K->200363K(1887488K), 0.2213936 secs] 1677824K->200363K(5242112K), 0.2214910 secs] [Times: user=0.27 sys=0.19, real=0.22 secs]
Heap after GC invocations=1 (full 0):
 par new generation   total 1887488K, used 200363K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
  eden space 1677824K,   0% used [0x00000006a3c00000, 0x00000006a3c00000, 0x000000070a280000)
  from space 209664K,  95% used [0x0000000716f40000, 0x00000007232eaf48, 0x0000000723c00000)
  to   space 209664K,   0% used [0x000000070a280000, 0x000000070a280000, 0x0000000716f40000)
 concurrent mark-sweep generation total 3354624K, used 0K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
 Metaspace       used 14282K, capacity 14502K, committed 14720K, reserved 1062912K
  class space    used 1525K, capacity 1600K, committed 1664K, reserved 1048576K
}

第二次 after gc, from space 剩余 100%，times: 0.47 sec

{Heap before GC invocations=1 (full 0):
 par new generation   total 1887488K, used 1878187K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
  eden space 1677824K, 100% used [0x00000006a3c00000, 0x000000070a280000, 0x000000070a280000)
  from space 209664K,  95% used [0x0000000716f40000, 0x00000007232eaf48, 0x0000000723c00000)
  to   space 209664K,   0% used [0x000000070a280000, 0x000000070a280000, 0x0000000716f40000)
 concurrent mark-sweep generation total 3354624K, used 0K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
 Metaspace       used 14361K, capacity 14630K, committed 14720K, reserved 1062912K
  class space    used 1525K, capacity 1600K, committed 1664K, reserved 1048576K
2023-04-20T18:05:30.630+0800: 12.160: [GC (Allocation Failure) 2023-04-20T18:05:30.630+0800: 12.160: [ParNew: 1878187K->209664K(1887488K), 0.4742895 secs] 1878187K->312846K(5242112K), 0.4743858 secs] [Times: user=0.96 sys=0.23, real=0.47 secs]
Heap after GC invocations=2 (full 0):
 par new generation   total 1887488K, used 209664K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
  eden space 1677824K,   0% used [0x00000006a3c00000, 0x00000006a3c00000, 0x000000070a280000)
  from space 209664K, 100% used [0x000000070a280000, 0x0000000716f40000, 0x0000000716f40000)
  to   space 209664K,   0% used [0x0000000716f40000, 0x0000000716f40000, 0x0000000723c00000)
 concurrent mark-sweep generation total 3354624K, used 103182K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
 Metaspace       used 14361K, capacity 14630K, committed 14720K, reserved 1062912K
  class space    used 1525K, capacity 1600K, committed 1664K, reserved 1048576K
}

第三次after gc,from space 剩余77%,times：0.17 secs

{Heap before GC invocations=2 (full 0):
 par new generation   total 1887488K, used 1887488K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
  eden space 1677824K, 100% used [0x00000006a3c00000, 0x000000070a280000, 0x000000070a280000)
  from space 209664K, 100% used [0x000000070a280000, 0x0000000716f40000, 0x0000000716f40000)
  to   space 209664K,   0% used [0x0000000716f40000, 0x0000000716f40000, 0x0000000723c00000)
 concurrent mark-sweep generation total 3354624K, used 103182K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
 Metaspace       used 24402K, capacity 24816K, committed 24960K, reserved 1071104K
  class space    used 2775K, capacity 2924K, committed 2944K, reserved 1048576K
2023-04-20T18:05:35.991+0800: 17.521: [GC (Allocation Failure) 2023-04-20T18:05:35.991+0800: 17.521: [ParNew: 1887488K->161805K(1887488K), 0.1737749 secs] 1990670K->295146K(5242112K), 0.1738674 secs] [Times: user=0.24 sys=0.07, real=0.17 secs]
Heap after GC invocations=3 (full 0):
 par new generation   total 1887488K, used 161805K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
  eden space 1677824K,   0% used [0x00000006a3c00000, 0x00000006a3c00000, 0x000000070a280000)
  from space 209664K,  77% used [0x0000000716f40000, 0x0000000720d43548, 0x0000000723c00000)
  to   space 209664K,   0% used [0x000000070a280000, 0x000000070a280000, 0x0000000716f40000)
 concurrent mark-sweep generation total 3354624K, used 133341K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
 Metaspace       used 24402K, capacity 24816K, committed 24960K, reserved 1071104K
  class space    used 2775K, capacity 2924K, committed 2944K, reserved 1048576K
}

第四次after gc,from space剩余25%，times:0.06

{Heap before GC invocations=3 (full 0):
 par new generation   total 1887488K, used 1839629K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
  eden space 1677824K, 100% used [0x00000006a3c00000, 0x000000070a280000, 0x000000070a280000)
  from space 209664K,  77% used [0x0000000716f40000, 0x0000000720d43548, 0x0000000723c00000)
  to   space 209664K,   0% used [0x000000070a280000, 0x000000070a280000, 0x0000000716f40000)
 concurrent mark-sweep generation total 3354624K, used 133341K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
 Metaspace       used 53754K, capacity 55010K, committed 55296K, reserved 1097728K
  class space    used 6536K, capacity 6827K, committed 6912K, reserved 1048576K
2023-04-20T18:05:41.230+0800: 22.760: [GC (Allocation Failure) 2023-04-20T18:05:41.230+0800: 22.760: [ParNew: 1839629K->54127K(1887488K), 0.0663926 secs] 1972970K->215678K(5242112K), 0.0665115 secs] [Times: user=0.18 sys=0.03, real=0.06 secs]
Heap after GC invocations=4 (full 0):
 par new generation   total 1887488K, used 54127K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
  eden space 1677824K,   0% used [0x00000006a3c00000, 0x00000006a3c00000, 0x000000070a280000)
  from space 209664K,  25% used [0x000000070a280000, 0x000000070d75bd80, 0x0000000716f40000)
  to   space 209664K,   0% used [0x0000000716f40000, 0x0000000716f40000, 0x0000000723c00000)
 concurrent mark-sweep generation total 3354624K, used 161550K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
 Metaspace       used 53754K, capacity 55010K, committed 55296K, reserved 1097728K
  class space    used 6536K, capacity 6827K, committed 6912K, reserved 1048576K
}

可以看到，当fromspace要拷贝的更少的时候，所需要的gc时间则越短，所以推理猜测启动过程中是前几次GC时年轻代剩余的对象比较多，故复制算法导致耗时相对较高（具体的定位方法可以更换成G1 gc并增加-XX:+PrintReferenceGC参数）

增加参数后可以看到ref各阶段的耗时，比如单次gc 380ms，其中 object copy 142ms 和 ref proc的195 ms耗时占了大头，那这样就好解决了

[Ref Proc: 195.6 ms] 就是ref proc耗时长，再其中[FinalReference, 88104 refs, 0.1934312 secs] 占了0.19S
[Object Copy (ms): Min: 142.5, Avg: 152.2, Max: 156.6, Diff: 14.1, Sum: 608.7]

2023-04-25T20:10:10.201+0800: 7.862: [GC pause (G1 Evacuation Pause) (young)8.038: [SoftReference, 0 refs, 0.0001785 secs]8.038: [WeakReference, 5246 refs, 0.0013074 secs]8.040: [FinalReference, 88104 refs, 0.1934312 secs]8.233: [PhantomReference, 0 refs, 0 refs, 0.0000145 secs]8.233: [JNI Weak Reference, 0.0003759 secs], 0.3816639 secs]

   [Parallel Time: 163.1 ms, GC Workers: 4]

      [GC Worker Start (ms): Min: 7874.3, Avg: 7875.1, Max: 7876.4, Diff: 2.1]

      [Ext Root Scanning (ms): Min: 0.8, Avg: 6.7, Max: 20.2, Diff: 19.3, Sum: 26.9]

      [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]

         [Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]

      [Scan RS (ms): Min: 0.1, Avg: 0.4, Max: 0.7, Diff: 0.6, Sum: 1.8]

      [Code Root Scanning (ms): Min: 0.0, Avg: 2.6, Max: 4.7, Diff: 4.7, Sum: 10.3]

      [Object Copy (ms): Min: 142.5, Avg: 152.2, Max: 156.6, Diff: 14.1, Sum: 608.7]

      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]

      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.2, Sum: 0.6]

      [GC Worker Total (ms): Min: 160.7, Avg: 162.0, Max: 163.0, Diff: 2.3, Sum: 648.2]

      [GC Worker End (ms): Min: 8037.0, Avg: 8037.1, Max: 8037.3, Diff: 0.2]

   [Code Root Fixup: 0.8 ms]

   [Code Root Purge: 0.2 ms]

   [Clear CT: 0.6 ms]

   [Other: 217.0 ms]

      [Choose CSet: 0.0 ms]

      [Ref Proc: 195.6 ms]

      [Ref Enq: 1.7 ms]

      [Redirty Cards: 0.1 ms]

      [Humongous Reclaim: 0.1 ms]

      [Free CSet: 6.9 ms]

   [Eden: 2048.0M(2048.0M)->0.0B(1800.0M) Survivors: 0.0B->248.0M Heap: 2049.0M(5324.0M)->247.2M(5324.0M)]

Heap after GC invocations=1 (full 0):

 garbage-first heap   total 5451776K, used 253091K [0x0000000673400000, 0x0000000673605330, 0x00000007c0000000)

  region size 2048K, 124 young (253952K), 124 survivors (253952K)

 Metaspace       used 14747K, capacity 15010K, committed 15232K, reserved 1062912K

  class space    used 1565K, capacity 1632K, committed 1664K, reserved 1048576K

}

 [Times: user=0.44 sys=0.30, real=0.38 secs]

经验

启动过程object copy时间较久，是因为新生代使用的复制算法，拷贝的数据多，自然时间久。一种解决方法是更换gc处理器为G1，并且取消年轻代大小的控制，因为G1可以根据用户设置的暂停时间目标自动调整年轻代和总堆大小，暂停目标越短年轻代空间越小（年轻代小，需要回收的空间可能就会小，但是回收次数也就会更多）；整个年轻代内存会在初始空间-XX:G1NewSizePercent(默认整堆5%)与最大空间-XX:G1MaxNewSizePercent(默认60%)之间动态变化。年轻代空控制在合适的区间，自然可以缩短young gc的时间。
finalizer ref 回收时间久则可以通过-XX:+ParallelRefProcEnabled解决。

结果

通过优化两个参数，启动从报young gc 100-300 ms 到50ms以下，符合优化预期。

Vilden 's blog

young gc耗时长的问题

背景

经验

结果