ORACLE数据库实例报错ORA-00470: LGWR process terminated with error宕机问题分析报告
- 服务概述
10月21号03:22分,BOSS数据库实例发生异常宕机;工程师及时响应此问题并对此故障原因进行分析及相关建议,详细的故障情况及相关日志、TRACE文件的分析及总结、建议,请参阅本文档。
- hzboss数据库实例宕机分析
4.1 数据库层面日志的分析
4.1.1故障数据库实例alert日志
数据库实例的Alert日志
对故障时间点附近的ALERT日志分析如下:
Wed Oct 21 03:06:22 2024
Thread 1 advanced to log sequence 254206 (LGWR switch)
Current log# 9 seq# 254206 mem# 0: /boss_sysdata/oradata/dtvboss/redo09.log
Wed Oct 21 03:17:10 2024
WARNING: aiowait timed out 1 times --该时间段系统出现aio等待的告警,该告警由_iowait_timeouts隐含参数控制,超过该参数设置的值后会导致系统宕机,这里并未超过系统设置值。但是提示我们当时系统在io调度上出现了问题。
Wed Oct 21 03:22:11 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367' --22分17秒系统抛错,进程367持有controlfile的enqueue lock时间被其他进程(arc0)等待超时,该enqueue lock持有时间由隐含参数_controlfile_enqueue_timeout控制,默认为900秒
Wed Oct 21 03:22:17 2024
System State dumped to trace file /oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc
Wed Oct 21 03:22:50 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_lgwr_365.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367' --22分50秒,进程367持有的controlfile的enqueue lock时间被其他进程(lgwr)等待超时抛错
Wed Oct 21 03:22:55 2024
Killing enqueue blocker (pid=367) on resource CF-00000000-00000000 --LGWR进程发现异常查杀持有controlfile equeue lock的进程367
by killing session 1648.1
Wed Oct 21 03:22:55 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_lgwr_365.trc:
ORA-07445: exception encountered: core dump [ksuklms()+672] [SIGSEGV] [Address not mapped to object] [0x000000062] [] []
Wed Oct 21 03:23:18 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_pmon_343.trc: --后续一系列报错宕机开始
ORA-00470: LGWR process terminated with error
Wed Oct 21 03:23:18 2024
PMON: terminating instance due to error 470
Wed Oct 21 03:23:18 2024
ORA-470 encountered when generating server alert SMG-3503
Wed Oct 21 03:23:18 2024
Errors in file /oracle/admin/dtvboss/bdump/dtvboss_j006_22335.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00470: LGWR process terminated with error
Wed Oct 21 03:23:22 2024
Termination issued to instance processes. Waiting for the processes to exit
Wed Oct 21 03:23:28 2024
Instance termination failed to kill one or more processes
4.1.2 故障时trace文件分析
故障时,节点数据库实例的alert日志中提到了如下TRACE文件:dtvboss_arc0_404.trc
分析如下:
/oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/product/10.2.0/db_1
System name: SunOS
Node name: hzboss-ora1
Release: 5.10
Version: Generic_144488-12
Machine: sun4u
Instance name: dtvboss
Redo thread mounted by this instance: 1
Oracle process number: 23
Unix process pid: 404, image: oracle@hzboss-ora1 (ARC0)
*** SERVICE NAME:(SYS$BACKGROUND) 2024-10-21 03:22:11.264
*** SESSION ID:(1646.3) 2024-10-21 03:22:11.264
*** 2024-10-21 03:22:11.264
Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)
Possible local blocker ospid=367 sid=1648 sser=1 time_held=37 secs (ges mode req=6 held=4)
DUMP LOCAL BLOCKER: initiate state dump for KILL BLOCKER
possible owner[17.367] on resource CF-00000000-00000000
Dumping process info of pid[17.367] requested by pid[23.404]
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 367'
-------------------------------------------------------------------------------
ENQUEUE [CF] HELD FOR TOO LONG
enqueue holder: 'inst 1, osid 367'
Process 'inst 1, osid 367' is holding an enqueue for maximum allowed time.
The process will be terminated.
Oracle Support Services triaging information: to find the root-cause, look
at the call stack of process 'inst 1, osid 367' located below. Ask the
developer that owns the first NON-service layer in the stack to investigate.
Common service layers are enqueues (ksq), latches (ksl), library cache
pins and locks (kgl), and row cache locks (kqr).
Dumping process 17.367 info:
*** 2024-10-21 03:22:11.300
---------------------------------------------------------------
从这一段信息可以看到,此TRACE文件是ARC0进程的TRACE,ARC0进程的ORACLE SID是23,OS PID是404;
ARC0进程无法获取资源:CF-00000000-00000000 (controlfile enqueue 队列锁),此资源的持有者是pid[17.367]。
ARC0进程的TRACE分析
PROCESS 23:
----------------------------------------
SO: 8a9004f88, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=23, calls cur/top: 8a95c6650/8a95c6650, flag: (2) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 24
last post received-location: ksasnd
last process to post me: b66002018 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: b66005770 1 2
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 5eb3388f0
O/S info: user: oracle, term: UNKNOWN, ospid: 404
OSD pid info: Unix process pid: 404, image: oracle@hzboss-ora1 (ARC0)
Dump of memory from 0x00000005F033B1F0 to 0x00000005F033B3F8
5F033B1F0 0000000C 00000000 00000008 AF739C70 [.............s.p]
………………………………
5F033B3E0 00000005 000313A7 00000008 AF739C70 [.............s.p]
5F033B3F0 00000006 000313A7 [........]
(FOB) flags=2 fib=8af713780 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control03.ctl
fno=2 lblksz=16384 fsiz=1308
(FOB) flags=2 fib=8af7133e0 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control02.ctl
fno=1 lblksz=16384 fsiz=1308
(FOB) flags=2 fib=8af713040 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control01.ctl
fno=0 lblksz=16384 fsiz=1308
----------------------------------------
SO: 8a9778500, type: 11, owner: 8a9004f88, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 8a9004f88,
event: 21, last message event: 21,
last message waited event: 21, messages read: 0
channel: (5eb52cc20) scumnt mount lock
scope: 1, event: 22, last mesage event: 20,
publishers/subscribers: 0/20,
messages published: 1
----------------------------------------
SO: 8a95b7328, type: 4, owner: 8a9004f88, flag: INIT/-/-/0x00
(session) sid: 1646 trans: 0, creator: 8a9004f88, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0017-0000000D, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
last wait for 'enq: CF - contention' blocking sess=0x0 seq=43322 wait_time=2929723 seconds since wait started=36
name|mode=43460004, 0=0, operation=0
Dumping Session Wait History
for 'enq: CF - contention' count=1 wait_time=2929723
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929725
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929726
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929726
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929725
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929723
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929725
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929726
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929724
name|mode=43460004, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929750
name|mode=43460004, 0=0, operation=0
temporary object counter: 0
----------------------------------------
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log 5eaffc848
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 1646 efd 0 whr 0 slp 0
KGX Atomic Operation Log 5eaffc890
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 1646 efd 0 whr 0 slp 0
KGX Atomic Operation Log 5eaffc8d8
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 1646 efd 0 whr 0 slp 0
----------------------------------------
SO: 8afaa4330, type: 41, owner: 8a95b7328, flag: INIT/-/-/0x00
(dummy) nxc=0, nlb=0
----------------------------------------
SO: 8a97783d8, type: 11, owner: 8a9004f88, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 8a9004f88,
event: 24, last message event: 27,
last message waited event: 27, messages read: 1
channel: (5eb528a48) system events broadcast channel
scope: 2, event: 2923059, last mesage event: 27,
publishers/subscribers: 0/421,
messages published: 1
----------------------------------------
SO: 8a95c6650, type: 3, owner: 8a9004f88, flag: INIT/-/-/0x00
(call) sess: cur 8a95b7328, rec 8a9591bc8, usr 8a95b7328; depth: 0
----------------------------------------
SO: 8a95d1fb8, type: 5, owner: 8a95c6650, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-0017-0000000D
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, lock_flag: 0x0
own: 0x8a95b7328, sess: 0x8a95b7328, prv: 0x8a95d1fe8
----------------------------------------
SO: 8a9591bc8, type: 4, owner: 8a95c6650, flag: INIT/-/-/0x00
(session) sid: 1618 trans: 0, creator: 0, flag: (2) -/REC -/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
temporary object counter: 0
----------------------------------------
SO: 8a97ad460, type: 16, owner: 8a9004f88, flag: INIT/-/-/0x00
(osp req holder)
此段信息可以看到,ARC0进程等待controlfile enqueue 队列锁。等待事件为enq: CF - contention
SO: 8a95d1fb8, type: 5, owner: 8a95c6650, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-0017-0000000D
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, lock_flag: 0x0
own: 0x8a95b7328, sess: 0x8a95b7328, prv: 0x8a95d1fe8
这里说明arc0在申请的资源是:res:0x5eb40f608。
而在process 17号的ckpt中可以看到0x5eb40f608资源被ckpt以mode: SSX方式持有持有。
CKPT进程的TRACE分析
PROCESS 17:
----------------------------------------
SO: 8a90037d0, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=17, calls cur/top: 8a95c5df8/8a95c5df8, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 114
last post received-location: kcbbza
last process to post me: 8a9001830 1 6
last post sent: 0 0 9
last post sent-location: ksqrcl
last process posted by me: b66011d18 91 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 5eb3388f0
O/S info: user: oracle, term: UNKNOWN, ospid: 367
OSD pid info: Unix process pid: 367, image: oracle@hzboss-ora1 (CKPT)
Dump of memory from 0x00000005EF308A18 to 0x00000005EF308C20
5EF308A10 0000000C 00000000 [........]
…………………………………………
----------------------------------------
SO: 8a95b9df8, type: 4, owner: 8a90037d0, flag: INIT/-/-/0x00
(session) sid: 1648 trans: 0, creator: 8a90037d0, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0011-00000005, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'direct path write' blocking sess=0x0 seq=39369 wait_time=0 seconds since wait started=37
file number=72, first dba=1, block cnt=1
……………………………………
----------------------------------------
SO: 8a9777a30, type: 11, owner: 8a90037d0, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 8a90037d0,
event: 14, last message event: 27,
last message waited event: 27, messages read: 1
channel: (5eb528a48) system events broadcast channel
scope: 2, event: 2923059, last mesage event: 27,
publishers/subscribers: 0/421,
messages published: 1
----------------------------------------
SO: 8a95c5df8, type: 3, owner: 8a90037d0, flag: INIT/-/-/0x00
(call) sess: cur 8a95b9df8, rec 8a95b5dc0, usr 8a95b9df8; depth: 0
----------------------------------------
SO: 8a95d1f20, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000004 DID: 0001-0011-00000005
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb507858, mode: S, lock_flag: 0x0
own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x5eb507868
----------------------------------------
SO: 8a95d1cc0, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-0011-00000005
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, mode: SSX, lock_flag: 0x0
own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x5eb40f618
----------------------------------------
SO: 8a95b5dc0, type: 4, owner: 8a95c5df8, flag: INIT/-/-/0x00
(session) sid: 1645 trans: 0, creator: 0, flag: (2) -/REC -/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
temporary object counter: 0
----------------------------------------
SO: 8a95cb468, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00
(enqueue) RS-00000019-00000001 DID: 0001-0011-00000005
lv: 20 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f8c8, mode: SS, lock_flag: 0x0
own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x5eb40f8d8
----------------------------------------
SO: 8a95cb338, type: 5, owner: 8a95c5df8, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0000-0011-00000004
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, mode: SS, lock_flag: 0x0
own: 0x8a95b9df8, sess: 0x8a95b9df8, proc: 0x8a90037d0, prv: 0x8a95d1cf0
----------------------------------------
SO: 8a97ad2c8, type: 16, owner: 8a90037d0, flag: INIT/-/-/0x00
(osp req holder)
此处可以看到CKPT进程持有(enqueue) CF-00000000-00000000,持有模式是:res: 0x5eb40f608, mode: SSX, lock_flag: 0x0,可以对应到ARC0进程请求的资源。
LGWR进程的TRACE分析
PROCESS 10:
----------------------------------------
SO: b66002018, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=10, calls cur/top: b665c6de0/b665c6de0, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 24
last post received-location: ksasnd
last process to post me: 8a9000860 1 6
last post sent: 109 0 4
last post sent-location: kslpsr
last process posted by me: 8a902ef90 105 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 5eb3388f0
O/S info: user: oracle, term: UNKNOWN, ospid: 365
OSD pid info: Unix process pid: 365, image: oracle@hzboss-ora1 (LGWR)
Dump of memory from 0x00000005EF308810 to 0x00000005EF308A18
5EF308810 00000015 00000000 0000000B 6C1C1158 [............l..X]
………………
----------------------------------------
SO: b665cc938, type: 5, owner: b66002018, flag: INIT/-/-/0x00
(enqueue) RT-00000001-00000000 DID: 0000-000A-00000005
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f828, mode: X, lock_flag: 0x0
own: 0xb665bf398, sess: 0xb665bf398, proc: 0xb66002018, prv: 0x5eb40f838
(FOB) flags=2 fib=8af713780 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control03.ctl
fno=2 lblksz=16384 fsiz=1308
(FOB) flags=2 fib=8af7133e0 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control02.ctl
fno=1 lblksz=16384 fsiz=1308
(FOB) flags=2 fib=8af713040 incno=0 pending i/o cnt=0
fname=/boss_sysdata/oradata/dtvboss/control01.ctl
fno=0 lblksz=16384 fsiz=1308
----------------------------------------
SO: b66778b58, type: 11, owner: b66002018, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: b66002018,
event: 12, last message event: 20,
last message waited event: 20, messages read: 1
channel: (5eb52cc20) scumnt mount lock
scope: 1, event: 22, last mesage event: 20,
publishers/subscribers: 0/20,
messages published: 1
----------------------------------------
SO: b665bf398, type: 4, owner: b66002018, flag: INIT/-/-/0x00
(session) sid: 824 trans: 0, creator: b66002018, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-000A-00000006, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'enq: CF - contention' blocking sess=0x8a95b9df8 seq=61266 wait_time=0 seconds since wait started=0
name|mode=43460005, 0=0, operation=0
Dumping Session Wait History
for 'enq: CF - contention' count=1 wait_time=2929719
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929716
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929716
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929718
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929719
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929719
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929713
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929715
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929722
name|mode=43460005, 0=0, operation=0
for 'enq: CF - contention' count=1 wait_time=2929736
name|mode=43460005, 0=0, operation=0
temporary object counter: 0
----------------------------------------
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log 5ebff7cd0
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 824 efd 0 whr 0 slp 0
KGX Atomic Operation Log 5ebff7d18
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 824 efd 0 whr 0 slp 0
KGX Atomic Operation Log 5ebff7d60
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 824 efd 0 whr 0 slp 0
----------------------------------------
SO: b6caa57e0, type: 41, owner: b665bf398, flag: INIT/-/-/0x00
(dummy) nxc=0, nlb=0
----------------------------------------
SO: b66778a48, type: 11, owner: b66002018, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: b66002018,
event: 13, last message event: 27,
last message waited event: 27, messages read: 1
channel: (5eb528a48) system events broadcast channel
scope: 2, event: 2923059, last mesage event: 27,
publishers/subscribers: 0/421,
messages published: 1
----------------------------------------
SO: b665c6de0, type: 3, owner: b66002018, flag: INIT/-/-/0x00
(call) sess: cur b665bf398, rec 0, usr b665bf398; depth: 0
----------------------------------------
SO: b665cce90, type: 5, owner: b665c6de0, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-000A-00000006
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, req: SSX, lock_flag: 0x0
own: 0xb665bf398, sess: 0xb665bf398, proc: 0xb66002018, prv: 0x5eb40f628
----------------------------------------
SO: b667ae720, type: 16, owner: b66002018, flag: INIT/-/-/0x00
(osp req holder)
从LGWR进程的TRACE中可以看到:waiting for 'enq: CF - contention' blocking sess=0x8a95b9df8 seq=61266 wait_time=0 seconds since wait started=0
说明LGWR进程也在等待controlfile enqueue 队列锁,等待事件为enq: CF – contention。
SO: b665cce90, type: 5, owner: b665c6de0, flag: INIT/-/-/0x00
(enqueue) CF-00000000-00000000 DID: 0001-000A-00000006
lv: 00 79 4a fb 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2
res: 0x5eb40f608, req: SSX, lock_flag: 0x0
own: 0xb665bf398, sess: 0xb665bf398, proc: 0xb66002018, prv: 0x5eb40f628
从此段可以看到,LGWR进程请求的CF锁资源是0x5eb40f608,并且是以SSX模式请求;通过CKPT进程的TRACE可以看到其已经own: 0x8a95b9df8 持有此资源res: 0x5eb40f608, mode: SSX, lock_flag: 0x0 SSX模式持有此锁;因此LGWR进程也处于等待此资源的状态。
4.2 主机负载情况分析
从ALERT日志中提到的故障发生时的TRACE文件:
/oracle/admin/dtvboss/bdump/dtvboss_pmon_343.trc
/oracle/admin/dtvboss/bdump/dtvboss_arc0_404.trc
中可以找到故障发生时的主机负载情况:
*** 2024-10-21 03:22:11.300
Dumping diagnostic information for ospid 367:
OS pid = 367
loadavg : 0.44 0.62 1.82
swap info: free_mem = 1435.35M rsv = 59461.24M
alloc = 55941.14M avail = 55124.91 swap_free = 58645.01M
Dumping diagnostic information for oracle@hzboss-ora1:
OS pid = 24887
loadavg : 1.91 1.01 1.84
swap info: free_mem = 7661.14M rsv = 53520.42M
alloc = 50905.74M avail = 62991.89 swap_free = 65606.57M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 24887 1 0 39 20 ? 4115427 ? Sep 24 ? 29:24 oracledtvboss (LOCAL=NO)
pstack: cannot examine 24887: no such process
*** 2024-10-21 03:23:53.644
ksuitm_check: OS PID=8690 is still alive
*** 2024-10-21 03:23:53.644
Dumping diagnostic information for oracle@hzboss-ora1 (J002):
OS pid = 8690
loadavg : 1.93 1.02 1.84
swap info: free_mem = 7660.87M rsv = 53524.68M
alloc = 50905.95M avail = 62987.56 swap_free = 65606.30M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 O oracle 8690 1 2 39 20 ? 4121237 00:57:24 ? 29:43 ora_j002_dtvboss
pstack: cannot examine 8690: no such process
*** 2024-10-21 03:23:54.751
ksuitm_check: OS PID=21430 is still alive
*** 2024-10-21 03:23:54.751
Dumping diagnostic information for oracle@hzboss-ora1 (J003):
OS pid = 21430
loadavg : 1.94 1.02 1.84
swap info: free_mem = 7660.27M rsv = 53521.18M
alloc = 50906.49M avail = 62991.00 swap_free = 65605.69M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 O oracle 21430 1 2 39 20 ? 4119652 02:59:44 ? 0:27 ora_j003_dtvboss
pstack: cannot examine 21430: unanticipated system error
*** 2024-10-21 03:23:57.839
ksuitm_check: OS PID=367 is still alive
*** 2024-10-21 03:23:57.840
Dumping diagnostic information for oracle@hzboss-ora1 (CKPT):
OS pid = 367
loadavg : 1.98 1.04 1.84
swap info: free_mem = 7660.46M rsv = 53520.98M
alloc = 50906.02M avail = 62990.88 swap_free = 65605.85M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 S oracle 367 1 0 39 20 ? 4116160 ? Mar 30 ? 498:14 ora_ckpt_dtvboss
pstack: cannot examine 367: no such process
*** 2024-10-21 03:23:58.926
从这些负载信息可以看到,整个系统当时的空闲内存和swap还留有余地,足够当时系统的使用,所以内存不足的原因可以排除。
4.3 数据库版本情况
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/product/10.2.0/db_1
System name: SunOS
Node name: hzboss-ora1
Release: 5.10
Version: Generic_144488-12
Machine: sun4u
Instance name: dtvboss
Redo thread mounted by this instance: 1
Oracle process number: 2
COMP_ID COMP_NAME VERSION STATUS MODIFIED
---------- ----------------------------------- --------------- -------- -----------------------
SDO Spatial 10.2.0.4.0 VALID 18/FEB/2011:21:05:30
ORDIM Oracle.interMedia 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
AMD OLAP.Catalog 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
EM Oracle.Enterprise.Manager 10.2.0.4.0 VALID 18/FEB/2011:21:04:33
XDB Oracle.XML.Database 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
CONTEXT Oracle.Text 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
EXF Oracle.Expression.Filter 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
RUL Oracle.Rule.Manager 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
OWM Oracle.Workspace.Manager 10.2.0.4.3 VALID 18/FEB/2011:21:05:29
ODM Oracle.Data.Mining 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
CATALOG Oracle.Database.Catalog.Views 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
CATPROC Oracle.Database.Packages.and.Types 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
JAVAVM JServer.JAVA.Virtual.Machine 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
XML Oracle.XDK 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
CATJAVA Oracle.Database.Java.Packages 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
APS OLAP.Analytic.Workspace 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
XOQ Oracle.OLAP.API 10.2.0.4.0 VALID 18/FEB/2011:21:05:29
- 总结与建议
针对此次数据库实例宕机问题,总结及建议如下:
5.1 问题分析汇总
5.1.1 宕机问题总结
综上可以得出结论,由于arc0进程要写日志需要等待控制文件CF锁,lgwr估计要写日志或者切换也在等待控制文件CF锁,而ckpt进程持有controlfile enqueue lock时间过长,arch0等待超过900秒,导致oracle数据库系统判断异常。
此时oracle数据库系统会让lgwr进程kill持有controlfile enqueue lock的进程,相当于lgwr进程杀了ckpt进程,而ckpt进程又为oracle关键进程,连锁反应,引起宕机。
5.1.2 宕机的问题根源判断
在宕机前系统发出告警aiowait timed out 1 times,初步判断系统在当时的io存在一些问题。而当时系统的内存和swap资源都充足没有出现严重问题,trace中的loadavg : 0.44 0.62 1.82则说明了当时的cpu负载低,这里排除了内存和cpu的问题.
初步建议检测当时的redo日志的切换和产生量,当时的数据库服务器的io调用频繁度,因此可以参考下数据库的2点到3点间的awr信息(如果有数据库监控系统那么则可以分析到故障发生时间点的io信息,但是当前环境未部署)。
controlfile equeue lock的超时管理机制中,受_kill_controlfile_enqueue_blocker=false,_controlfile_enqueue_timeout,_kill_enqueue_blocker这3个隐含参数的影响,当前环境中我不建议增加_controlfile_enqueue_timeout,而是建议在通过控制_kill_controlfile_enqueue_blocker,_kill_enqueue_blocker参数来实现该机制的管理,以避免宕机的风险。
控制队列锁管理机制的bug以及避免:
控制文件队列锁有一个相关的匹配Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4',这里可以通过以下2个参数规避。
_kill_controlfile_enqueue_blocker值分为true和false,默认为true,如果设置为false则不会去kill超时的进程,此时_kill_controlfile_enqueue_blocker=false参数失效。
_kill_enqueue_blocker值分为别0,1,2,3. 当设置为1的时候,保证后台进程不会被kill,而只会kill前台进程,这样可以避免宕机。
5.2 建议
5.2.1在数据库层上的调整建议
经过沟通当时正在跑批,所以当时的日志切换会比较频繁,因此我们建议对此时的redo log的大小匹配处理进行调整,适当增加日志组以及日志文件大小。控制日志切换保持在20-30分钟一次。
同时建议使用数据库监控工具,以便能在故障时候能及时获取到故障前的监控数据。
在参数调整角度上我们可以将_kill_enqueue_blocker设置为1或者将_kill_controlfile_enqueue_blocker设置为false。
5.2.2在系统层上的调整建议
同时检查系统io是否存在问题,
同时建议部署io层面的监控工具。