Problems encountered while using KubeSphere
kubectl commands fail after a master node is deleted
Base environment: three master nodes installed with the KubeSphere platform, with node IPs 10.10.101.35, 10.10.101.36, and 10.10.101.37.
Error logs and symptoms:
Running kubectl get nodes ends up going to 10.10.101.93. That node was added as an extra master earlier and was removed after running for a while; the 10.10.101.93 node no longer exists, yet kubectl still tries to connect to the master at 10.10.101.93.
[root@k8s-010010101035 admin]# kubectl get nodes
E0604 15:58:30.507327 20935 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://lb.kubesphere.local:6443/api?timeout=32s\": dial tcp 10.10.101.93:6443: connect: no route to host"
E0604 15:58:33.578903 20935 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://lb.kubesphere.local:6443/api?timeout=32s\": dial tcp 10.10.101.93:6443: connect: no route to host"
E0604 15:58:36.650818 20935 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://lb.kubesphere.local:6443/api?timeout=32s\": dial tcp 10.10.101.93:6443: connect: no route to host"
E0604 15:58:39.722894 20935 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://lb.kubesphere.local:6443/api?timeout=32s\": dial tcp 10.10.101.93:6443: connect: no route to host"
E0604 15:58:42.794855 20935 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://lb.kubesphere.local:6443/api?timeout=32s\": dial tcp 10.10.101.93:6443: connect: no route to host"
Unable to connect to the server: dial tcp 10.10.101.93:6443: connect: no route to host
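To see where the stale 10.10.101.93 address comes from, check what lb.kubesphere.local resolves to on the node, for example:
grep lb.kubesphere.local /etc/hosts
getent hosts lb.kubesphere.local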
This happens because KubeKey writes the resolution for lb.kubesphere.local into /etc/hosts on every node.
# current configuration
10.10.101.93 lb.kubesphere.local
# modified configuration
127.0.0.1 lb.kubesphere.local
After changing the entry to 127.0.0.1, the errors are gone and kubectl can query node information again.
admin]# kubectl get nodes
NAME                STATUS   ROLES           AGE   VERSION
k8s-010010101035    Ready    control-plane   7d    v1.31.8
k8s-010010101036    Ready    control-plane   7d    v1.31.8
k8s-010010101037    Ready    control-plane   7d    v1.31.8
test-010010101025   Ready    worker          7d    v1.31.8
test-010010101026   Ready    worker          7d    v1.31.8
test-010010101027   Ready    worker          7d    v1.31.8
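The same stale entry may still exist on other nodes that KubeKey set up, so the fix should be applied wherever it appears. A rough sketch of doing this over SSH, assuming root access and GNU sed (the host list below only covers the three masters; add any other nodes that still carry the stale entry):
# replace the stale lb.kubesphere.local entry on each node
for host in 10.10.101.35 10.10.101.36 10.10.101.37; do
  ssh root@"$host" \
    "sed -i 's/^10\.10\.101\.93[[:space:]]\+lb\.kubesphere\.local/127.0.0.1 lb.kubesphere.local/' /etc/hosts"
done
# re-run kubectl get nodes afterwards to confirm the errors are gone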
Error when removing a node with KubeKey
kk version v3.1.9, deployed Kubernetes version v1.31.8.
Environment: three master nodes, with an additional node 10.10.101.101 added. The node is now deleted directly with the kk command, without first removing its entry from the config-sample.yaml configuration file.
kube]# ./kk delete node test-10.10.101.93 -f config-sample.yaml
# the error output is as follows
10:37:40 CST message: [k8s-010010101035]
1. check the node name in the config-sample.yaml
2. check the node name in the Kubernetes cluster
3. check the node name is the first master and etcd node name
10:37:40 CST failed: [k8s-010010101035]
10:37:40 CST skipped: [k8s-010010101036]
10:37:40 CST skipped: [k8s-010010101037]
error: Pipeline[DeleteNodePipeline] execute failed: Module[CompareConfigAndClusterInfoModule] exec failed:
failed: [k8s-010010101035] [FindNode] exec failed after 3 retries: 1. check the node name in the config-sample.yaml
2. check the node name in the Kubernetes cluster
3. check the node name is the first master and etcd node name
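Before applying the workaround, it is worth double-checking that the node name passed to kk exactly matches both the name in the cluster and the name defined in config-sample.yaml, since kk matches on the literal node name; a quick comparison (the grep pattern is only an example and assumes the usual kk config layout):
kubectl get nodes -o name
grep -n 'name:' config-sample.yaml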
This turned out to be a known bug; the extra node can be removed manually with the following commands.
# Maybe it's a bug for kk v1.1.1. You can manually delete the node:
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
Querying node information now shows the node is gone, but note that the haproxy on each worker node still carries this entry in its configuration file; the entry has to be removed from the configuration file on every worker node and the haproxy service restarted.
After deleting the extra node, its information still shows up on the worker nodes:
kubectl get nodes
# each worker node has a haproxy configuration file
haproxy]# pwd
/etc/kubekey/haproxy
# If it cannot be removed automatically, manually delete the master node that has already been kicked out of the cluster here, then restart haproxy.
haproxy]# cat haproxy.cfg
server k8s-010010101035 10.10.101.35:6443 check check-ssl verify none
server k8s-010010101036 10.10.101.36:6443 check check-ssl verify none
server k8s-010010101037 10.10.101.37:6443 check check-ssl verify none
server test-010010101093 10.10.101.93:6443 check check-ssl verify none
# manually edit the configuration file and delete the last line, the 10.10.101.93 node; once it is cleaned up, restart the haproxy pod
haproxy]# cat haproxy.cfg
server k8s-010010101035 10.10.101.35:6443 check check-ssl verify none
server k8s-010010101036 10.10.101.36:6443 check check-ssl verify none
server k8s-010010101037 10.10.101.37:6443 check check-ssl verify none
# stop the haproxy container pod
haproxy]# crictl stop 02bab10a64666
02bab10a64666
# the haproxy container pod restarts automatically and reloads the configuration file
haproxy]# crictl exec -it 23c68910bc666 sh
~ $ cat -n /usr/local/etc/haproxy/haproxy.cfg
    40  server k8s-010010101035 10.10.101.35:6443 check check-ssl verify none
    41  server k8s-010010101036 10.10.101.36:6443 check check-ssl verify none
    42  server k8s-010010101037 10.10.101.37:6443 check check-ssl verify none
~ $ exit
# at this point the worker nodes no longer try to connect to the master node that has been taken offline
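If there are several worker nodes, the same edit-and-restart can be scripted instead of being done by hand. A rough sketch, assuming SSH root access to the workers, GNU sed, and that crictl can find the haproxy container by name (the worker IPs below are only an example inferred from the hostnames above):
for host in 10.10.101.25 10.10.101.26 10.10.101.27; do
  ssh root@"$host" '
    # drop the removed master from the local haproxy backend list
    sed -i "/10\.10\.101\.93:6443/d" /etc/kubekey/haproxy/haproxy.cfg
    # stop the haproxy container; the static pod restarts it with the new configuration
    crictl stop $(crictl ps --name haproxy -q)
  '
done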
The content references GitHub.
A record of the same issue: https://github.com/kubesphere/kubekey/issues/1533
Others in the community have run into the same problem:
"Error deleting a cluster node, ./kk delete node node2 -f config-sample.yaml" - KubeSphere Developer Community