0%

young gc耗时长的问题

背景

接手的服务在启动过程中总会发生young gc 超过100ms的异常告警,以下记录一下分析过程和定位解决的办法

原有的jvm参数配置如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
/usr/local/java/bin/java
-Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djdk.tls.ephemeralDHKeySize=2048
-Djava.protocol.handler.pkgs=org.apache.catalina.webresources
-server
-Djava.awt.headless=true
-Xms5324m
-Xmx5324m
-Xss512k
-XX:PermSize=350m
-XX:MaxPermSize=350m
-XX:MetaspaceSize=256m
-XX:MaxMetaspaceSize=256m
-XX:NewSize=2048m
-XX:MaxNewSize=2048m
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=9
-XX:+UseConcMarkSweepGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+CMSScavengeBeforeRemark
-XX:+ScavengeBeforeFullGC
-XX:+UseCMSCompactAtFullCollection
-XX:+CMSParallelRemarkEnabled
-XX:CMSFullGCsBeforeCompaction=9
-XX:CMSInitiatingOccupancyFraction=80
-XX:+CMSClassUnloadingEnabled
-XX:SoftRefLRUPolicyMSPerMB=0
-XX:-ReduceInitialCardMarks
-XX:+CMSPermGenSweepingEnabled
-XX:CMSInitiatingPermOccupancyFraction=80
-XX:+ExplicitGCInvokesConcurrent
-Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.EPollSelectorProvider
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintHeapAtGC

org.apache.catalina.startup.Bootstrap

从已有的参数只能看出来使用的是CMS gc,分析不出根因,所以还是要看gc的日志

第一次after gc ,from space剩余95%,times: 0.22 secs

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
{Heap before GC invocations=0 (full 0):
par new generation total 1887488K, used 1677824K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
eden space 1677824K, 100% used [0x00000006a3c00000, 0x000000070a280000, 0x000000070a280000)
from space 209664K, 0% used [0x000000070a280000, 0x000000070a280000, 0x0000000716f40000)
to space 209664K, 0% used [0x0000000716f40000, 0x0000000716f40000, 0x0000000723c00000)
concurrent mark-sweep generation total 3354624K, used 0K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
Metaspace used 14282K, capacity 14502K, committed 14720K, reserved 1062912K
class space used 1525K, capacity 1600K, committed 1664K, reserved 1048576K
2023-04-20T18:05:25.023+0800: 6.553: [GC (Allocation Failure) 2023-04-20T18:05:25.023+0800: 6.553: [ParNew: 1677824K->200363K(1887488K), 0.2213936 secs] 1677824K->200363K(5242112K), 0.2214910 secs] [Times: user=0.27 sys=0.19, real=0.22 secs]
Heap after GC invocations=1 (full 0):
par new generation total 1887488K, used 200363K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
eden space 1677824K, 0% used [0x00000006a3c00000, 0x00000006a3c00000, 0x000000070a280000)
from space 209664K, 95% used [0x0000000716f40000, 0x00000007232eaf48, 0x0000000723c00000)
to space 209664K, 0% used [0x000000070a280000, 0x000000070a280000, 0x0000000716f40000)
concurrent mark-sweep generation total 3354624K, used 0K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
Metaspace used 14282K, capacity 14502K, committed 14720K, reserved 1062912K
class space used 1525K, capacity 1600K, committed 1664K, reserved 1048576K
}

第二次 after gc, from space 剩余 100%,times: 0.47 sec

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
{Heap before GC invocations=1 (full 0):
par new generation total 1887488K, used 1878187K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
eden space 1677824K, 100% used [0x00000006a3c00000, 0x000000070a280000, 0x000000070a280000)
from space 209664K, 95% used [0x0000000716f40000, 0x00000007232eaf48, 0x0000000723c00000)
to space 209664K, 0% used [0x000000070a280000, 0x000000070a280000, 0x0000000716f40000)
concurrent mark-sweep generation total 3354624K, used 0K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
Metaspace used 14361K, capacity 14630K, committed 14720K, reserved 1062912K
class space used 1525K, capacity 1600K, committed 1664K, reserved 1048576K
2023-04-20T18:05:30.630+0800: 12.160: [GC (Allocation Failure) 2023-04-20T18:05:30.630+0800: 12.160: [ParNew: 1878187K->209664K(1887488K), 0.4742895 secs] 1878187K->312846K(5242112K), 0.4743858 secs] [Times: user=0.96 sys=0.23, real=0.47 secs]
Heap after GC invocations=2 (full 0):
par new generation total 1887488K, used 209664K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
eden space 1677824K, 0% used [0x00000006a3c00000, 0x00000006a3c00000, 0x000000070a280000)
from space 209664K, 100% used [0x000000070a280000, 0x0000000716f40000, 0x0000000716f40000)
to space 209664K, 0% used [0x0000000716f40000, 0x0000000716f40000, 0x0000000723c00000)
concurrent mark-sweep generation total 3354624K, used 103182K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
Metaspace used 14361K, capacity 14630K, committed 14720K, reserved 1062912K
class space used 1525K, capacity 1600K, committed 1664K, reserved 1048576K
}

第三次after gc,from space 剩余77%,times:0.17 secs

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
{Heap before GC invocations=2 (full 0):
par new generation total 1887488K, used 1887488K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
eden space 1677824K, 100% used [0x00000006a3c00000, 0x000000070a280000, 0x000000070a280000)
from space 209664K, 100% used [0x000000070a280000, 0x0000000716f40000, 0x0000000716f40000)
to space 209664K, 0% used [0x0000000716f40000, 0x0000000716f40000, 0x0000000723c00000)
concurrent mark-sweep generation total 3354624K, used 103182K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
Metaspace used 24402K, capacity 24816K, committed 24960K, reserved 1071104K
class space used 2775K, capacity 2924K, committed 2944K, reserved 1048576K
2023-04-20T18:05:35.991+0800: 17.521: [GC (Allocation Failure) 2023-04-20T18:05:35.991+0800: 17.521: [ParNew: 1887488K->161805K(1887488K), 0.1737749 secs] 1990670K->295146K(5242112K), 0.1738674 secs] [Times: user=0.24 sys=0.07, real=0.17 secs]
Heap after GC invocations=3 (full 0):
par new generation total 1887488K, used 161805K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
eden space 1677824K, 0% used [0x00000006a3c00000, 0x00000006a3c00000, 0x000000070a280000)
from space 209664K, 77% used [0x0000000716f40000, 0x0000000720d43548, 0x0000000723c00000)
to space 209664K, 0% used [0x000000070a280000, 0x000000070a280000, 0x0000000716f40000)
concurrent mark-sweep generation total 3354624K, used 133341K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
Metaspace used 24402K, capacity 24816K, committed 24960K, reserved 1071104K
class space used 2775K, capacity 2924K, committed 2944K, reserved 1048576K
}

第四次after gc,from space剩余25%,times:0.06

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
{Heap before GC invocations=3 (full 0):
par new generation total 1887488K, used 1839629K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
eden space 1677824K, 100% used [0x00000006a3c00000, 0x000000070a280000, 0x000000070a280000)
from space 209664K, 77% used [0x0000000716f40000, 0x0000000720d43548, 0x0000000723c00000)
to space 209664K, 0% used [0x000000070a280000, 0x000000070a280000, 0x0000000716f40000)
concurrent mark-sweep generation total 3354624K, used 133341K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
Metaspace used 53754K, capacity 55010K, committed 55296K, reserved 1097728K
class space used 6536K, capacity 6827K, committed 6912K, reserved 1048576K
2023-04-20T18:05:41.230+0800: 22.760: [GC (Allocation Failure) 2023-04-20T18:05:41.230+0800: 22.760: [ParNew: 1839629K->54127K(1887488K), 0.0663926 secs] 1972970K->215678K(5242112K), 0.0665115 secs] [Times: user=0.18 sys=0.03, real=0.06 secs]
Heap after GC invocations=4 (full 0):
par new generation total 1887488K, used 54127K [0x00000006a3c00000, 0x0000000723c00000, 0x0000000723c00000)
eden space 1677824K, 0% used [0x00000006a3c00000, 0x00000006a3c00000, 0x000000070a280000)
from space 209664K, 25% used [0x000000070a280000, 0x000000070d75bd80, 0x0000000716f40000)
to space 209664K, 0% used [0x0000000716f40000, 0x0000000716f40000, 0x0000000723c00000)
concurrent mark-sweep generation total 3354624K, used 161550K [0x0000000723c00000, 0x00000007f0800000, 0x00000007f0800000)
Metaspace used 53754K, capacity 55010K, committed 55296K, reserved 1097728K
class space used 6536K, capacity 6827K, committed 6912K, reserved 1048576K
}

可以看到,当fromspace要拷贝的更少的时候,所需要的gc时间则越短,所以推理猜测启动过程中是前几次GC时年轻代剩余的对象比较多,故复制算法导致耗时相对较高(具体的定位方法可以更换成G1 gc并增加-XX:+PrintReferenceGC参数)

增加参数后可以看到ref各阶段的耗时,比如单次gc 380ms,其中 object copy 142ms 和 ref proc的195 ms耗时占了大头,那这样就好解决了

[Ref Proc: 195.6 ms] 就是ref proc耗时长 ,再其中[FinalReference, 88104 refs, 0.1934312 secs] 占了0.19S
[Object Copy (ms): Min: 142.5, Avg: 152.2, Max: 156.6, Diff: 14.1, Sum: 608.7]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
2023-04-25T20:10:10.201+0800: 7.862: [GC pause (G1 Evacuation Pause) (young)8.038: [SoftReference, 0 refs, 0.0001785 secs]8.038: [WeakReference, 5246 refs, 0.0013074 secs]8.040: [FinalReference, 88104 refs, 0.1934312 secs]8.233: [PhantomReference, 0 refs, 0 refs, 0.0000145 secs]8.233: [JNI Weak Reference, 0.0003759 secs], 0.3816639 secs]

[Parallel Time: 163.1 ms, GC Workers: 4]

[GC Worker Start (ms): Min: 7874.3, Avg: 7875.1, Max: 7876.4, Diff: 2.1]

[Ext Root Scanning (ms): Min: 0.8, Avg: 6.7, Max: 20.2, Diff: 19.3, Sum: 26.9]

[Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]

[Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]

[Scan RS (ms): Min: 0.1, Avg: 0.4, Max: 0.7, Diff: 0.6, Sum: 1.8]

[Code Root Scanning (ms): Min: 0.0, Avg: 2.6, Max: 4.7, Diff: 4.7, Sum: 10.3]

[Object Copy (ms): Min: 142.5, Avg: 152.2, Max: 156.6, Diff: 14.1, Sum: 608.7]

[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]

[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.2, Sum: 0.6]

[GC Worker Total (ms): Min: 160.7, Avg: 162.0, Max: 163.0, Diff: 2.3, Sum: 648.2]

[GC Worker End (ms): Min: 8037.0, Avg: 8037.1, Max: 8037.3, Diff: 0.2]

[Code Root Fixup: 0.8 ms]

[Code Root Purge: 0.2 ms]

[Clear CT: 0.6 ms]

[Other: 217.0 ms]

[Choose CSet: 0.0 ms]

[Ref Proc: 195.6 ms]

[Ref Enq: 1.7 ms]

[Redirty Cards: 0.1 ms]

[Humongous Reclaim: 0.1 ms]

[Free CSet: 6.9 ms]

[Eden: 2048.0M(2048.0M)->0.0B(1800.0M) Survivors: 0.0B->248.0M Heap: 2049.0M(5324.0M)->247.2M(5324.0M)]

Heap after GC invocations=1 (full 0):

garbage-first heap total 5451776K, used 253091K [0x0000000673400000, 0x0000000673605330, 0x00000007c0000000)

region size 2048K, 124 young (253952K), 124 survivors (253952K)

Metaspace used 14747K, capacity 15010K, committed 15232K, reserved 1062912K

class space used 1565K, capacity 1632K, committed 1664K, reserved 1048576K

}

[Times: user=0.44 sys=0.30, real=0.38 secs]

经验

  1. 启动过程object copy时间较久,是因为新生代使用的复制算法,拷贝的数据多,自然时间久。一种解决方法是更换gc处理器为G1,并且取消年轻代大小的控制,因为G1可以根据用户设置的暂停时间目标自动调整年轻代和总堆大小, 暂停目标越短年轻代空间越小(年轻代小,需要回收的空间可能就会小,但是回收次数也就会更多);整个年轻代内存会在初始空间-XX:G1NewSizePercent(默认整堆5%)与最大空间-XX:G1MaxNewSizePercent(默认60%)之间动态变化。年轻代空控制在合适的区间,自然可以缩短young gc的时间。
  2. finalizer ref 回收时间久则可以通过-XX:+ParallelRefProcEnabled解决。

结果

通过优化两个参数,启动从报young gc 100-300 ms 到50ms以下,符合优化预期。