
Repairing a Failed repmgr Cluster

news 2025/7/13 15:32:57

Contents

Environment
Symptoms
Root cause
Solution

Environment

Platform: Linux x86-64, Red Hat Enterprise Linux 7
Version: 5.6.5

Symptoms

Clients cannot connect to the repmgr cluster.

Root cause

The database could not allocate memory and crashed.

Solution

1. Check the cluster status to identify the primary and standby nodes and whether each node is running

[root@slave ~]# repmgr cluster show
 ID | Name      | Role    | Status        | Upstream  | Location | Priority | Replication lag | Last replayed LSN
----+-----------+---------+---------------+-----------+----------+----------+-----------------+-------------------
 1  | x.x.0.121 | primary | - failed      |           | default  | 100      | n/a             | none
 2  | x.x.0.122 | standby | ? unreachable | x.x.0.121 | default  | 100      | 5232GB          | 0/0

The cluster status shows that both the primary and the standby node have failed.

[root@slave ~]# ps -ef | grep post
root      1091     1  0 Apr15 ?        00:00:00 /usr/libexec/postfix/master -w
postfix   1094  1091  0 Apr15 ?        00:00:00 qmgr -l -t unix -u
postfix   4743  1091  0 08:37 ?        00:00:00 pickup -l -t unix -u
root      4953  4821  0 09:41 pts/0    00:00:00 grep --color=auto post

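Filtering the process list for "post" shows no database processes on either node (only postfix and the grep command itself match), which means the database is down on both nodes.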
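Because `grep post` also matches postfix and the grep process itself, a stricter check can be easier to read. The following is a minimal sketch, assuming a procps `ps` that accepts `-eo comm=` and that the server binary is named `postgres`; the helper name `has_postgres` is hypothetical.

```shell
# Succeeds if stdin (one command name per line) contains an exact "postgres" entry.
# has_postgres is a hypothetical helper name, not part of repmgr or PostgreSQL.
has_postgres() {
    grep -qx 'postgres'
}

# Usage on a node (prints a one-line verdict):
#   ps -eo comm= | has_postgres && echo "postgres is running" || echo "postgres is NOT running"
```

Matching the whole command name avoids false positives from postfix, which is what made the raw `grep post` output above need manual reading.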

2. Check the database timeline on both nodes

[root@slave ~]# pg_controldata |grep TimeLineID
Latest checkpoint's TimeLineID:       11
Latest checkpoint's PrevTimeLineID:   11

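Both database nodes are on timeline 11, which shows that no primary/standby switchover took place before the failure. We therefore restore the cluster with the original primary as the primary, and take a backup of the standby node's data at the same time.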
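The timeline comparison can be scripted so that the two values are not compared by eye. This is a sketch only: the helper names `timeline_of` and `same_timeline` are hypothetical, and it assumes `pg_controldata` prints the "Latest checkpoint's TimeLineID" line in the format shown above.

```shell
# Print the value of "Latest checkpoint's TimeLineID" from pg_controldata output on stdin.
# The "." in the pattern stands in for the apostrophe; PrevTimeLineID lines do not match.
timeline_of() {
    awk -F: '/^Latest checkpoint.s TimeLineID/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Succeeds only if both arguments are non-empty and identical.
same_timeline() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage (run pg_controldata on each node and compare the two results):
#   primary_tl=$(pg_controldata | timeline_of)      # on the primary
#   standby_tl=$(pg_controldata | timeline_of)      # on the standby
#   same_timeline "$primary_tl" "$standby_tl" && echo "no switchover before the failure"
```

If the timelines differ, the reasoning in this step no longer holds and the node with the higher timeline needs a closer look before choosing the primary.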

3. Check the database log to determine why the nodes failed

find / -iname hgdb_log -print
cat highgodb_11.log

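The log reports that memory could not be allocated. Running free -h shows that the server has 14 GB of memory in total and that memory usage is too high. The database parameters look normal (shared_buffers=4GB), so the customer was advised to add memory.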
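To put a number on "memory usage is too high", the used share can be computed from `/proc/meminfo`. This is a rough sketch: `mem_used_percent` is a hypothetical helper, and it assumes the kernel exposes a `MemAvailable` line (present on Red Hat Enterprise Linux 7).

```shell
# Read /proc/meminfo-style text on stdin and print the share of memory in use,
# as an integer percentage, computed as (MemTotal - MemAvailable) / MemTotal.
mem_used_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

# Usage:
#   mem_used_percent < /proc/meminfo
```

What counts as too high depends on the workload; the figure is mainly useful for comparing before and after the memory upgrade.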

4. Start the primary node and register it as the primary

pg_ctl start
repmgr primary register -F

5. Check whether the VIP has come up on the primary

ip a
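Instead of scanning the `ip a` output by eye, the VIP can be looked up explicitly. The sketch below assumes the one-line-per-address format of `ip -o -4 addr show`, where the fourth field is `address/prefix`; the helper name `has_vip` and the address `192.0.2.10` are placeholders, not values from this environment.

```shell
# Succeeds if the IPv4 address given as $1 appears in `ip -o -4 addr show` output on stdin.
# The prefix length (for example /24) is stripped before comparing.
has_vip() {
    awk -v vip="$1" '{ split($4, a, "/"); if (a[1] == vip) found = 1 }
                     END { if (!found) exit 1 }'
}

# Usage (replace the placeholder address with the cluster's VIP):
#   ip -o -4 addr show | has_vip 192.0.2.10 && echo "VIP is bound on this node"
```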

6. Back up the standby node's data directory, rebuild the standby, and register it as a standby

cd /highgo
mv data data_bak20240323
pg_basebackup -h <primary IP> -U highgo -D /highgo/data -Fp -P -Xs -R -v
pg_ctl start
repmgr standby register -F

7. Check the cluster status

[root@slave ~]# repmgr cluster show
 ID | Name      | Role    | Status    | Upstream  | Location | Priority | Replication lag | Last replayed LSN
----+-----------+---------+-----------+-----------+----------+----------+-----------------+-------------------
 1  | x.x.0.121 | primary | * running |           | default  | 100      | n/a             | none
 2  | x.x.0.122 | standby |   running | x.x.0.121 | default  | 100      | 0 bytes         | 0/70007D0

Both the primary and the standby now report a status of running, and the replication lag is 0 bytes.
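The same check can be automated by parsing the table. The following is a sketch under the assumption that the output keeps the pipe-separated layout shown in this article, with the status in the fourth column; `all_nodes_running` is a hypothetical helper, not a repmgr command.

```shell
# Read `repmgr cluster show` output on stdin and succeed only if at least one
# node row exists and every node row has the status "running".
all_nodes_running() {
    awk -F'|' '
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {                  # data rows start with a numeric ID
            rows++
            status = $4
            gsub(/^[ \t*?!-]+|[ \t]+$/, "", status)    # drop the marker (* ? ! -) and padding
            if (status != "running") bad++
        }
        END { if (rows == 0 || bad > 0) exit 1 }'
}

# Usage:
#   repmgr cluster show | all_nodes_running && echo "all nodes running"
```

Treating an empty table as a failure is deliberate: if the command prints no node rows, the cluster state is unknown rather than healthy.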

[root@hs02 ~]# ps -ef|grep postg
root      20568      1  0 17:37 ?        00:00:00 /highgo/database/4.5/bin/postgres -D /highgo/database/4.5/data
root      20569  20568  0 17:37 ?        00:00:00 postgres: logger process
root      20570  20568  0 17:37 ?        00:00:00 postgres: startup process   recovering 000000010000000000000007
root      20571  20568  0 17:37 ?        00:00:00 postgres: checkpointer process
root      20572  20568  0 17:37 ?        00:00:00 postgres: writer process
root      20573  20568  0 17:37 ?        00:00:00 postgres: stats collector process
root      20574  20568  0 17:37 ?        00:00:00 postgres: wal receiver process   streaming 0/70006F0
root      20585  20568  0 17:37 ?        00:00:00 postgres: sysdba highgo x.x.0.122(13382) idle