
GaussDB cluster failure: cm_ctl: can't connect to cm_server

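1. Problem description

A GaussDB cluster with a 3-AZ, 3-replica architecture could no longer reach cm_server after the node servers were rebooted. The reported error was: cm_ctl: can't connect to cm_server.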

[omm@gaussdb03 ~]$ cm_ctl query -Cvpid
[  CMServer State   ]

node             node_ip         instance                             state
-----------------------------------------------------------------------------
1  172.16.60.226 172.16.60.226   1    /data/cluster/data/cm/cm_server Down
2  172.16.60.227 172.16.60.227   2    /data/cluster/data/cm/cm_server Down
3  172.16.60.228 172.16.60.228   3    /data/cluster/data/cm/cm_server Standby

[    ETCD State     ]

node             node_ip         instance                     state
---------------------------------------------------------------------------
1  172.16.60.226 172.16.60.226   7001 /data/cluster/data/etcd Down
2  172.16.60.227 172.16.60.227   7002 /data/cluster/data/etcd Down
3  172.16.60.228 172.16.60.228   7003 /data/cluster/data/etcd Down

cm_ctl: can't connect to cm_server.
Maybe cm_server is not running, or timeout expired. Please try again.

2. Problem analysis

  • Check every machine: the processes of the cluster components CM, ETCD, GTM, CN and DN are all still running
[root@gaussdb03 ~]# ps -ef |grep cluster
omm         5198       1  0 13:43 ?        00:00:06 /data/cluster/core/app/bin/om_monitor -L /data/cluster/logs/gaussdb/omm/cm/om_monitor
omm         5202    5198  9 13:43 ?        00:01:32 /data/cluster/core/app/bin/cm_agent
omm         5214       1  0 13:43 ?        00:00:03 /data/cluster/core/app/bin/etcd -name etcd_7003 --data-dir /data/cluster/data/etcd --client-cert-auth --trusted-ca-file /data/cluster/core/app/share/sslcert/etcd/etcdca.crt --cert-file /data/cluster/data/etcd/etcd.crt --key-file /data/cluster/data/etcd/etcd.key --peer-client-cert-auth --peer-trusted-ca-file /data/cluster/core/app/share/sslcert/etcd/etcdca.crt --peer-cert-file /data/cluster/data/etcd/etcd.crt --peer-key-file /data/cluster/data/etcd/etcd.key -initial-advertise-peer-urls https://172.16.60.228:30320 -listen-peer-urls https://172.16.60.228:30320 -listen-client-urls https://172.16.60.228:30300 -advertise-client-urls https://172.16.60.228:30300 --election-timeout 5000 --heartbeat-interval 1000 --log-outputs stdout --quota-backend-bytes 8589934592 --auto-compaction-mode periodic --auto-compaction-retention 1h -initial-cluster-token etcd-cluster-omm --enable-v2=false -initial-cluster etcd_7001=https://172.16.60.226:30320,etcd_7002=https://172.16.60.227:30320,etcd_7003=https://172.16.60.228:30320 -initial-cluster-state new
omm         5362       1  0 13:43 ?        00:00:00 /data/cluster/core/app/bin/gs_gtm -D /data/cluster/data/gtm -M pending
omm         5369       1 41 13:43 ?        00:06:57 /data/cluster/core/app/bin/gaussdb --coordinator -D /data/cluster/data/cn
omm         5385       1  2 13:43 ?        00:00:29 /data/cluster/core/app/bin/cm_server
omm         5576       1 23 13:43 ?        00:03:56 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6003 -M pending
omm         6225       1 23 13:43 ?        00:03:54 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6006 -M pending
omm         6482       1 23 13:43 ?        00:03:57 /data/cluster/core/app/bin/gaussdb --datanode -D /data/cluster/data/dn/dn_6007 -M pending
root       23084   23031  0 13:59 pts/0    00:00:00 grep cluster
  • CM and ETCD both show Down. According to the official documentation, ETCD has to be healthy first, because CM relies on ETCD to elect its primary
  • Check the ETCD log
[omm@gaussdb01 etcd]$ pwd
/data/cluster/logs/gaussdb/omm/cm/etcd
[omm@gaussdb01 etcd]$ view etcd_7001-current.log
{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to 82a123c2037aba1a at term 5"}
{"level":"info","ts":"2025-09-01T14:11:57.175+0800","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"6c461eeb977a77bb [logterm: 5, index: 16182] sent MsgPreVote request to d354b9b181618c10 at term 5"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: i/o timeout"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"82a123c2037aba1a","rtt":"0s","error":"dial tcp 172.16.60.228:30320: connect: no route to host"}
{"level":"warn","ts":"2025-09-01T14:11:57.489+0800","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d354b9b181618c10","rtt":"0s","error":"dial tcp 172.16.60.227:30320: connect: no route to host"}
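The log shows every dial to port 30320 on the other two nodes failing with `no route to host` or `i/o timeout`. Between hosts on the same subnet, `no route to host` is often what a firewall reject rule produces, although that is an inference and not something the log states. To see at a glance which peers a node cannot reach, the dial errors can be counted per address. The following is a minimal sketch; the function name `unreachable_peers` is hypothetical, and it only assumes the `dial tcp <ip>:<port>` wording seen in the log lines above.

```shell
# unreachable_peers: read etcd log lines on stdin and print every peer
# address that appears in a "dial tcp <ip>:<port>" error, followed by
# the number of times it failed.
unreachable_peers() {
    grep -o 'dial tcp [0-9.]*:[0-9]*' \
        | awk '{print $3}' \
        | sort \
        | uniq -c \
        | awk '{print $2, $1}'
}
```

It could be run against the log file opened above, for example `unreachable_peers < /data/cluster/logs/gaussdb/omm/cm/etcd/etcd_7001-current.log`.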
  • Check the firewall configuration: the firewall is still running, so stop and disable it
[root@gaussdb03 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2025-09-01 13:42:30 CST; 29min ago
     Docs: man:firewalld(1)
 Main PID: 1334 (firewalld)
    Tasks: 2
   Memory: 34.6M
   CGroup: /system.slice/firewalld.service
           └─1334 /usr/bin/python3 /usr/sbin/firewalld --nofork --nopid

Sep 01 13:42:29 gaussdb03 systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 01 13:42:30 gaussdb03 systemd[1]: Started firewalld - dynamic firewall daemon.
[root@gaussdb03 ~]# systemctl stop firewalld.service
[root@gaussdb03 ~]# systemctl disable firewalld.service
Removed /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
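Where policy does not allow the firewall to be switched off, one alternative is to leave firewalld running and open the required ports. The sketch below only prints the `firewall-cmd` calls and does not execute them, so they can be reviewed first. The helper name `print_open_port_cmds` is hypothetical. The only ports known from this article are the ETCD peer port 30320 and client port 30300 seen in the process list; the ports used by CM, GTM, CN and DN are not shown here and would have to be added, so treat this as an illustration and not a complete rule set.

```shell
# print_open_port_cmds PORT...: print (do not run) the firewall-cmd calls
# that would permanently open each TCP port, followed by a reload.
print_open_port_cmds() {
    local port
    for port in "$@"; do
        printf 'firewall-cmd --permanent --add-port=%s/tcp\n' "$port"
    done
    printf 'firewall-cmd --reload\n'
}
```

For the two ETCD ports this would be `print_open_port_cmds 30320 30300`, with the printed commands then run as root on each node.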
  • Check the cluster status again: it is back to normal
[omm@gaussdb01 ~]$ cm_ctl query -Cv
[  CMServer State   ]

node             instance state
---------------------------------
1  172.16.60.226 1        Standby
2  172.16.60.227 2        Standby
3  172.16.60.228 3        Primary

[    ETCD State     ]

node             instance state
---------------------------------------
1  172.16.60.226 7001     StateFollower
2  172.16.60.227 7002     StateLeader
3  172.16.60.228 7003     StateFollower

[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
balanced        : Yes
current_az      : AZ_ALL

[ Coordinator State ]

node             instance state
---------------------------------
1  172.16.60.226 5001     Normal
2  172.16.60.227 5002     Normal
3  172.16.60.228 5003     Normal

[ Central Coordinator State ]

node             instance state
---------------------------------
2  172.16.60.227 5002     Normal

[     GTM State     ]

node             instance state                    sync_state
-----------------------------------------------------------------
1  172.16.60.226 1001     P Primary Connection ok  Sync
2  172.16.60.227 1002     S Standby Connection ok  Sync
3  172.16.60.228 1003     S Standby Connection ok  Sync

[  Datanode State   ]

node             instance state            | node             instance state            | node             instance state
---------------------------------------------------------------------------------------------------------------------------------------
1  172.16.60.226 6001     P Primary Normal | 2  172.16.60.227 6002     S Standby Normal | 3  172.16.60.228 6003     S Standby Normal
2  172.16.60.227 6004     P Primary Normal | 1  172.16.60.226 6005     S Standby Normal | 3  172.16.60.228 6006     S Standby Normal
3  172.16.60.228 6007     P Primary Normal | 2  172.16.60.227 6008     S Standby Normal | 1  172.16.60.226 6009     S Standby Normal

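3. Summary

The operating system firewall had not been disabled, so it was running again after the reboot. ETCD could not connect to the other nodes and stayed in an abnormal state, which left CMS abnormal as well and made it impossible to connect to the instances.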
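As a quick check after any future reboot, the ETCD peer port can be probed from each node before looking at cm_ctl. This is a minimal sketch that assumes bash (for `/dev/tcp`) and the coreutils `timeout` command are available; the function names `check_port` and `check_peers` are hypothetical. Note that "unreachable" covers refused, rejected and timed-out connections alike, so it also appears when the ETCD process itself is not listening.

```shell
# check_port HOST PORT: print "reachable" if a TCP connection to HOST:PORT
# succeeds within 3 seconds, otherwise "unreachable".
check_port() {
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "reachable"
    else
        echo "unreachable"
    fi
}

# check_peers PORT HOST...: probe the same port on every listed host
# and print one "host:port status" line per host.
check_peers() {
    local port=$1 host
    shift
    for host in "$@"; do
        printf '%s:%s %s\n' "$host" "$port" "$(check_port "$host" "$port")"
    done
}
```

With the addresses from this cluster, the call would be `check_peers 30320 172.16.60.226 172.16.60.227 172.16.60.228`, run once on each of the three nodes.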

