Problems encountered while using KubeSphere
kubectl commands fail after a master node is deleted
Base environment: three master nodes installed with the KubeSphere platform, with node IPs 10.10.101.35, 10.10.101.36, and 10.10.101.37.
Error logs and symptoms:
Running kubectl get nodes ends up going to 10.10.101.93. That node was added as an extra master earlier and was removed after running for a while; the 10.10.101.93 node no longer exists, yet kubectl still tries to connect to the master at 10.10.101.93.
[root@k8s-010010101035 admin]# kubectl get nodes
E0604 15:58:30.507327 20935 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://lb.kubesphere.local:6443/api?timeout=32s\": dial tcp 10.10.101.93:6443: connect: no route to host"
E0604 15:58:33.578903 20935 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://lb.kubesphere.local:6443/api?timeout=32s\": dial tcp 10.10.101.93:6443: connect: no route to host"
E0604 15:58:36.650818 20935 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://lb.kubesphere.local:6443/api?timeout=32s\": dial tcp 10.10.101.93:6443: connect: no route to host"
E0604 15:58:39.722894 20935 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://lb.kubesphere.local:6443/api?timeout=32s\": dial tcp 10.10.101.93:6443: connect: no route to host"
E0604 15:58:42.794855 20935 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://lb.kubesphere.local:6443/api?timeout=32s\": dial tcp 10.10.101.93:6443: connect: no route to host"
Unable to connect to the server: dial tcp 10.10.101.93:6443: connect: no route to host
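To see where the stale 10.10.101.93 address comes from, check what lb.kubesphere.local resolves to on the node, for example:
grep lb.kubesphere.local /etc/hosts
getent hosts lb.kubesphere.local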
This happens because KubeKey writes the resolution for lb.kubesphere.local into /etc/hosts on every node.
# current configuration
10.10.101.93 lb.kubesphere.local
# modified configuration
127.0.0.1 lb.kubesphere.local
After changing the entry to 127.0.0.1, the errors are gone and kubectl can query node information again.
admin]# kubectl get nodes
NAME                STATUS   ROLES           AGE   VERSION
k8s-010010101035    Ready    control-plane   7d    v1.31.8
k8s-010010101036    Ready    control-plane   7d    v1.31.8
k8s-010010101037    Ready    control-plane   7d    v1.31.8
test-010010101025   Ready    worker          7d    v1.31.8
test-010010101026   Ready    worker          7d    v1.31.8
test-010010101027   Ready    worker          7d    v1.31.8
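The same stale entry may still exist on other nodes that KubeKey set up, so the fix should be applied wherever it appears. A rough sketch of doing this over SSH, assuming root access and GNU sed (the host list below only covers the three masters; add any other nodes that still carry the stale entry):
# replace the stale lb.kubesphere.local entry on each node
for host in 10.10.101.35 10.10.101.36 10.10.101.37; do
  ssh root@"$host" \
    "sed -i 's/^10\.10\.101\.93[[:space:]]\+lb\.kubesphere\.local/127.0.0.1 lb.kubesphere.local/' /etc/hosts"
done
# re-run kubectl get nodes afterwards to confirm the errors are gone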
Error when removing a node with KubeKey
kk version v3.1.9, deployed Kubernetes version v1.31.8.
Environment: three master nodes, with an additional node 10.10.101.101 added. The node is now deleted directly with the kk command, without first removing its entry from the config-sample.yaml configuration file.
kube]# ./kk delete node test-10.10.101.93 -f config-sample.yaml
# the error output is as follows
10:37:40 CST message: [k8s-010010101035]
1. check the node name in the config-sample.yaml
2. check the node name in the Kubernetes cluster
3. check the node name is the first master and etcd node name
10:37:40 CST failed: [k8s-010010101035]
10:37:40 CST skipped: [k8s-010010101036]
10:37:40 CST skipped: [k8s-010010101037]
error: Pipeline[DeleteNodePipeline] execute failed: Module[CompareConfigAndClusterInfoModule] exec failed:
failed: [k8s-010010101035] [FindNode] exec failed after 3 retries: 1. check the node name in the config-sample.yaml
2. check the node name in the Kubernetes cluster
3. check the node name is the first master and etcd node name
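Before applying the workaround, it is worth double-checking that the node name passed to kk exactly matches both the name in the cluster and the name defined in config-sample.yaml, since kk matches on the literal node name; a quick comparison (the grep pattern is only an example and assumes the usual kk config layout):
kubectl get nodes -o name
grep -n 'name:' config-sample.yaml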
This turned out to be a known bug; the extra node can be removed manually with the following commands.
# Maybe it's a bug for kk v1.1.1. You can manually delete the node:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
Querying node information now shows the node is gone, but note that the haproxy on each worker node still carries this entry in its configuration file; the entry has to be removed from the configuration file on every worker node and the haproxy service restarted.
After deleting the extra node, its information still shows up on the worker nodes:
kubectl get nodes
# each worker node has a haproxy configuration file
haproxy]# pwd
/etc/kubekey/haproxy
# If it cannot be removed automatically, manually delete the master node that has already been kicked out of the cluster here, then restart haproxy.
haproxy]# cat haproxy.cfg
server k8s-010010101035 10.10.101.35:6443 check check-ssl verify none
server k8s-010010101036 10.10.101.36:6443 check check-ssl verify none
server k8s-010010101037 10.10.101.37:6443 check check-ssl verify none
server test-010010101093 10.10.101.93:6443 check check-ssl verify none
# manually edit the configuration file and delete the last line, the 10.10.101.93 node; once it is cleaned up, restart the haproxy pod
haproxy]# cat haproxy.cfg
server k8s-010010101035 10.10.101.35:6443 check check-ssl verify none
server k8s-010010101036 10.10.101.36:6443 check check-ssl verify none
server k8s-010010101037 10.10.101.37:6443 check check-ssl verify none
# stop the haproxy container pod
haproxy]# crictl stop 02bab10a64666
02bab10a64666
# the haproxy container pod restarts automatically and reloads the configuration file
haproxy]# crictl exec -it 23c68910bc666 sh
~ $ cat -n /usr/local/etc/haproxy/haproxy.cfg
    40  server k8s-010010101035 10.10.101.35:6443 check check-ssl verify none
    41  server k8s-010010101036 10.10.101.36:6443 check check-ssl verify none
    42  server k8s-010010101037 10.10.101.37:6443 check check-ssl verify none
~ $ exit
# at this point the worker nodes no longer try to connect to the master node that has been taken offline
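If there are several worker nodes, the same edit-and-restart can be scripted instead of being done by hand. A rough sketch, assuming SSH root access to the workers, GNU sed, and that crictl can find the haproxy container by name (the worker IPs below are only an example inferred from the hostnames above):
for host in 10.10.101.25 10.10.101.26 10.10.101.27; do
  ssh root@"$host" '
    # drop the removed master from the local haproxy backend list
    sed -i "/10\.10\.101\.93:6443/d" /etc/kubekey/haproxy/haproxy.cfg
    # stop the haproxy container; the static pod restarts it with the new configuration
    crictl stop $(crictl ps --name haproxy -q)
  '
done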
The content references GitHub.
A record of the same issue: https://github.com/kubesphere/kubekey/issues/1533
Others in the community have run into the same problem:
"Error deleting a cluster node, ./kk delete node node2 -f config-sample.yaml" - KubeSphere Developer Community