RAC database recovery after a failure caused by redo logs stored locally on each node --- 惜分飞

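Recently a RAC running on Windows lost power, after which neither of the two cluster nodes could start normally. The customer made a series of attempts and ended up stuck at ORA-600 kclchkblk_4, unable to go any further.
Analyzing the database alert logs let me reconstruct the approximate cause of the failure. The errors at startup were:
Node 1 startup error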

Sun Aug 03 15:21:22 2025
alter database open
This instance was first to open
Beginning crash recovery of 2 threads
 parallel recovery started with 32 processes
Started redo scan
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_7108.trc:
ORA-00314: log 11 of thread 2, expected sequence# 147717 doesn't match 147541
ORA-00312: online log 11 thread 2: 'D:\REDOLOG\REDO011.LOG'
Abort recovery for domain 0
Aborting crash recovery due to error 314
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_7108.trc:
ORA-00314: log 11 of thread 2, expected sequence# 147717 doesn't match 147541
ORA-00312: online log 11 thread 2: 'D:\REDOLOG\REDO011.LOG'
Abort recovery for domain 0
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_7108.trc:
ORA-00314: log 11 of thread 2, expected sequence# 147717 doesn't match 147541
ORA-00312: online log 11 thread 2: 'D:\REDOLOG\REDO011.LOG'
ORA-314 signalled during: alter database open...

Node 2 startup error

Sat Aug 02 15:45:43 2025
Successful mount of redo thread 2, with mount id 1735887907
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
Completed: ALTER DATABASE MOUNT /* db agent *//* {1:47460:124} */
ALTER DATABASE OPEN /* db agent *//* {1:47460:124} */
This instance was first to open
Beginning crash recovery of 2 threads
Sat Aug 02 15:45:49 2025
 parallel recovery started with 32 processes
Started redo scan
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_ora_3444.trc:
ORA-00314: log 1 of thread 1, expected sequence# 67782 doesn't match 60818
ORA-00312: online log 1 thread 1: 'D:\REDOLOG\REDO01.LOG'
Abort recovery for domain 0
Aborting crash recovery due to error 314
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_ora_3444.trc:
ORA-00314: log 1 of thread 1, expected sequence# 67782 doesn't match 60818
ORA-00312: online log 1 thread 1: 'D:\REDOLOG\REDO01.LOG'
Abort recovery for domain 0
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_ora_3444.trc:
ORA-00314: log 1 of thread 1, expected sequence# 67782 doesn't match 60818
ORA-00312: online log 1 thread 1: 'D:\REDOLOG\REDO01.LOG'
ORA-314 signalled during: ALTER DATABASE OPEN /* db agent *//* {1:47460:124} */...

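These two sets of errors lead to two conclusions:
1) Node 1 clearly needs thread 2, group 11 at sequence 147717, but the group 11 file it actually reads is at sequence 147541. Node 2 needs thread 1, group 1 at sequence 67782, but the group 1 file it actually reads is at sequence 60818. Both gaps are large and abnormal, so the files themselves are very likely the problem.
2) This is a RAC on Windows, so in theory the redo logs should sit on shared storage (usually ASM). My first impression was that they were very likely on each node's local file system instead.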
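The size of each mismatch can be pulled straight out of the alert log text. The following is a minimal sketch, not part of the original recovery: the function name `sequence_gaps` and the regular expression are my own assumptions, and the pattern keys only on the numbers in an ORA-00314 line so that it does not depend on the language the message was written in.

```python
import re

# Numbers appear in a fixed order in ORA-00314: group, thread, expected, found.
# The pattern skips the surrounding words, whatever language they are in.
ORA_00314 = re.compile(
    r"ORA-00314:\D*(\d+)\D+(\d+)\D+sequence# (\d+)\D+(\d+)"
)

def sequence_gaps(alert_text):
    """Return one dict per distinct (thread, group) mismatch in the text."""
    seen = {}
    for m in ORA_00314.finditer(alert_text):
        group, thread, expected, found = map(int, m.groups())
        seen[(thread, group)] = {
            "thread": thread,
            "group": group,
            "expected": expected,
            "found": found,
            "gap": expected - found,
        }
    return list(seen.values())

sample = """
ORA-00314: log 11 of thread 2, expected sequence# 147717 doesn't match 147541
ORA-00314: log 1 of thread 1, expected sequence# 67782 doesn't match 60818
"""
for row in sequence_gaps(sample):
    print(row)
```

For the two messages quoted above this reports a gap of 176 sequences for thread 2 and 6964 for thread 1, far more than a file that is merely one switch behind.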
Screenshot of what the customer queried during their own recovery attempt:

[Screenshot: ORA-00314-ORA-00312]


I then checked the last redo log switch recorded on each node:

-- Node 1
Sat Aug 02 10:49:31 2025
Thread 1 advanced to log sequence 67782 (LGWR switch)
  Current log# 1 seq# 67782 mem# 0: D:\REDOLOG\REDO01.LOG

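-- Node 2 (each redo group is 2 GB; node 2 had carried no business workload for a long time and was only used for database exports, so its last switch is much older)
Sat Jul 26 16:56:42 2025
Thread 2 advanced to log sequence 147717 (LGWR switch)
  Current log# 11 seq# 147717 mem# 0: D:\REDOLOG\REDO011.LOG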
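Finding the last switch per thread by eye is easy for two nodes, but the same check can be scripted. This is a small sketch under my own assumptions: the helper name `last_switch_per_thread` is hypothetical, and it relies only on the "Thread N advanced to log sequence S" wording visible in the excerpts above.

```python
import re

SWITCH = re.compile(r"Thread (\d+) advanced to log sequence (\d+)")

def last_switch_per_thread(alert_lines):
    """Map thread number -> highest log sequence the alert log switched to."""
    latest = {}
    for line in alert_lines:
        m = SWITCH.search(line)
        if m:
            thread, seq = int(m.group(1)), int(m.group(2))
            latest[thread] = max(seq, latest.get(thread, 0))
    return latest

# Lines taken from the two excerpts above.
node1 = ["Thread 1 advanced to log sequence 67782 (LGWR switch)"]
node2 = ["Thread 2 advanced to log sequence 147717 (LGWR switch)"]
print(last_switch_per_thread(node1 + node2))  # {1: 67782, 2: 147717}
```

These are exactly the sequences crash recovery asked for in the ORA-00314 messages, which suggests each node's own thread was current in its own local files.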

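I also looked at D:\REDOLOG on both machines (captured after the customer's own resetlogs, so not the original scene, but it confirms that each node holds its own copy of the redo files):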

[Screenshot: redo]


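This should have been a fairly minor failure. Copying thread 2's redo from node 2 to node 1 (or thread 1's redo from node 1 to node 2) and then opening the database normally would have been enough. The people on site were not very familiar with RAC, though, and followed a procedure found on the internet: they set _allow_resetlogs_corruption and forced the database open. Unluckily, the forced open failed with ORA-600 kclchkblk_4.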
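Which copy is the live one can be read off the sequence numbers already quoted from the alert logs: for each thread, the node whose local file carries the higher sequence holds the current redo. A rough sketch of that reasoning follows. The function `plan_redo_copies` and the node labels are hypothetical, and it only compares sequence numbers; it does not inspect or copy any file.

```python
def plan_redo_copies(local_seq):
    """Return (thread, source_node, target_node) tuples.

    local_seq maps node -> {thread: sequence found in that node's local file}.
    For each thread, the node holding the highest sequence is treated as the
    live copy, and every node holding an older sequence is a copy target.
    """
    plan = []
    threads = sorted({t for seqs in local_seq.values() for t in seqs})
    for thread in threads:
        holders = {n: s[thread] for n, s in local_seq.items() if thread in s}
        source = max(holders, key=holders.get)
        for node in sorted(holders):
            if holders[node] < holders[source]:
                plan.append((thread, source, node))
    return plan

# Sequences from the alert logs: each node is current for its own thread
# and stale for the other node's thread.
observed = {
    "node1": {1: 67782, 2: 147541},
    "node2": {1: 60818, 2: 147717},
}
for thread, source, target in plan_redo_copies(observed):
    print(f"thread {thread}: copy redo from {source} to {target}")
```

Either one of the two copies would give a single node a consistent set of redo for both threads, which is all a normal open on that node needs.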

[Screenshot: kclchkblk_4]


Sun Aug 03 18:59:24 2025
alter database open resetlogs
RESETLOGS is being done without consistancy checks. This may result
in a corrupted database. The database should be recreated.
RESETLOGS after incomplete recovery UNTIL CHANGE 21497084214
Resetting resetlogs activation ID 1543012633 (0x5bf88119)
Sun Aug 03 18:59:46 2025
Setting recovery target incarnation to 3
Sun Aug 03 18:59:46 2025
This instance was first to open
Picked broadcast on commit scheme to generate SCNs
Sun Aug 03 18:59:48 2025
Assigning activation ID 1735960667 (0x6778a85b)
Thread 1 opened at log sequence 1
  Current log# 1 seq# 1 mem# 0: D:\REDOLOG\REDO01.LOG
Successful open of redo thread 1
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Sun Aug 03 18:59:49 2025
SMON: enabling cache recovery
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_8508.trc  (incident=728324):
ORA-00600: internal error code, arguments: [kclchkblk_4], [5], [200595988], [5], [22247740], [], [], [], [], [], [], []
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\incident\incdir_728324\orcl1_ora_8508_i728324.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sun Aug 03 18:59:53 2025
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_8508.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [kclchkblk_4], [5], [200595988], [5], [22247740], [], [], [], [], [], [], []
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_ora_8508.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [kclchkblk_4], [5], [200595988], [5], [22247740], [], [], [], [], [], [], []
Error 704 happened during db open, shutting down database
USER (ospid: 8508): terminating the instance due to error 704
Sun Aug 03 18:59:54 2025
opiodr aborting process unknown ospid (9480) as a result of ORA-1092
Sun Aug 03 19:00:09 2025
Instance terminated by USER, pid = 8508
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (8508) as a result of ORA-1092

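After this failure the customer could not recover the database on their own and asked me to step in. I have handled this error many times before. It is generally an SCN problem, and the Patch SCN utility resolved it quickly.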

[Screenshot: QQ20250808-185918]
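The numbers in the ORA-600 arguments make the SCN problem visible. The sketch below rests on an assumption that is not stated in the log itself: that the four numeric arguments of kclchkblk_4 are two (wrap, base) pairs, the first for the SCN stamped on a block and the second for the SCN the instance has reached, with the full SCN being wrap * 2^32 + base. The helper name `scn_from_wrap_base` is mine.

```python
def scn_from_wrap_base(wrap, base):
    """Combine an SCN wrap and a 32-bit SCN base into one SCN value."""
    return wrap * 2**32 + base

# Argument pairs taken from the ORA-600 [kclchkblk_4] line in the alert log.
# Assumed meaning: first pair = block SCN, second pair = database SCN.
block_scn = scn_from_wrap_base(5, 200595988)
db_scn = scn_from_wrap_base(5, 22247740)

print(block_scn)           # 21675432468
print(db_scn)              # 21497084220
print(block_scn - db_scn)  # 178348248
```

Under that reading the second pair lands at 21497084220, just past the "UNTIL CHANGE 21497084214" reported by the forced resetlogs, while the block is roughly 178 million SCNs ahead of it. That would be consistent with a database SCN that has to be pushed forward before the bootstrap blocks can be read.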


Once the database opened successfully, it mainly reported errors such as ORA-600 4137 and ORA-600 6006:

Database Characterset is ZHS16GBK
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_smon_8300.trc  (incident=1176205):
ORA-00600: internal error code, arguments: [4137], [1.14.2713957], [0], [0], [], [], [], [], [], [], [], []
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\incident\incdir_1176205\orcl1_smon_8300_i1176205.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Fri Aug 08 19:03:14 2025
ORACLE Instance orcl1 (pid = 25) - Error 600 encountered while recovering transaction (1, 14).
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_smon_8300.trc:
ORA-00600: internal error code, arguments: [4137], [1.14.2713957], [0], [0], [], [], [], [], [], [], [], []
Fri Aug 08 19:03:15 2025
ORACLE Instance orcl1 (pid = 25) - Error 600 encountered while recovering transaction (5, 19).
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_smon_8300.trc:
ORA-00600: internal error code, arguments: [4137], [5.19.2318502], [0], [0], [], [], [], [], [], [], [], []
Starting background process MMON
Fri Aug 08 19:03:18 2025
MMON started with pid=29, OS id=4624
Fri Aug 08 19:03:19 2025
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_smon_8300.trc  (incident=1176207):
ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], [], [], [], [], []
Incident details in: D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\incident\incdir_1176207\orcl1_smon_8300_i1176207.trc
Starting background process MMNL
Fri Aug 08 19:03:19 2025
MMNL started with pid=30, OS id=8344
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ORACLE Instance orcl1 (pid = 25) - Error 600 encountered while recovering transaction (46, 28) on object 197344.
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_smon_8300.trc:
ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], [], [], [], [], []

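Rebuilding the undo tablespace resolved these errors. The database then ran stably with no further crashes or obvious errors, the core data was exported, and the recovery was complete.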
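Before rebuilding undo it helps to know which undo segments SMON keeps tripping over. The sketch below collects them from the "recovering transaction (x, y)" messages. It assumes the first number in each pair is the undo segment number and the second the slot, which matches the usn.slot.wrap shape of the [4137] argument (for example 1.14.2713957); the helper name `failing_undo_segments` is hypothetical.

```python
import re

RECOVERING = re.compile(r"while recovering transaction \((\d+), (\d+)\)")

def failing_undo_segments(alert_lines):
    """Return the sorted undo segment numbers named in SMON recovery errors."""
    usns = set()
    for line in alert_lines:
        m = RECOVERING.search(line)
        if m:
            usns.add(int(m.group(1)))  # assumed: first number is the segment
    return sorted(usns)

# Messages taken from the alert log excerpt above.
lines = [
    "ORACLE Instance orcl1 (pid = 25) - Error 600 encountered while recovering transaction (1, 14).",
    "ORACLE Instance orcl1 (pid = 25) - Error 600 encountered while recovering transaction (5, 19).",
    "ORACLE Instance orcl1 (pid = 25) - Error 600 encountered while recovering transaction (46, 28) on object 197344.",
]
print(failing_undo_segments(lines))  # [1, 5, 46]
```

With several distinct segments failing, replacing the whole undo tablespace is a simpler path than dealing with each segment individually.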
