When the servers running our K3S cluster were about to expire, the master node had to be replaced. Searching the web for "master migration" turned up almost nothing relevant; most results were about backup and restore. Before applying anything to production I wanted to confirm that every node can actually be replaced, which matters all the more for someone at my half-baked skill level. This post records the process of replacing (draining) the master.
Process
Assume the current K3S cluster has two nodes, one master and one worker, as shown below:
NAME STATUS ROLES AGE VERSION
node1 Ready <none> 16m v1.23.9+k3s1
server1 Ready control-plane,etcd,master 19m v1.23.9+k3s1
Installation commands for server1:
curl -fsSL https://get.k3s.io | INSTALL_K3S_VERSION=v1.23.9+k3s1 \
sh -s - server --cluster-init --node-name server1 --docker \
--tls-san <Server1 IP> --node-external-ip <Server1 IP> --flannel-backend none \
--kube-proxy-arg metrics-bind-address=0.0.0.0
mkdir ~/.kube -p && ln /etc/rancher/k3s/k3s.yaml ~/.kube/config && chmod 600 ~/.kube/config
k3s kubectl annotate node server1 kilo.squat.ai/location=server1
k3s kubectl annotate node server1 kilo.squat.ai/force-endpoint=<Server1 IP>:51820
k3s kubectl annotate node server1 kilo.squat.ai/persistent-keepalive=20
k3s kubectl apply -f https://raw.githubusercontent.com/squat/kilo/main/manifests/crds.yaml
k3s kubectl apply -f https://raw.githubusercontent.com/squat/kilo/main/manifests/kilo-k3s.yaml
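The K3S_TOKEN value used in the join commands below can be read from server1; with the script-based install used here it should be stored at /var/lib/rancher/k3s/server/node-token:
# On server1: print the cluster join token that agents and servers use as K3S_TOKEN
cat /var/lib/rancher/k3s/server/node-token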
Installation commands for node1:
curl -fsSL https://get.k3s.io | INSTALL_K3S_VERSION=v1.23.9+k3s1 \
K3S_URL=https://<Server1 IP>:6443 K3S_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \
sh -s - agent --node-name node1 --docker --node-external-ip <Node1 IP> \
--kube-proxy-arg metrics-bind-address=0.0.0.0
# Annotate node1 (run on the server1 machine)
k3s kubectl annotate node node1 kilo.squat.ai/location=node1
k3s kubectl annotate node node1 kilo.squat.ai/force-endpoint=<Node1 IP>:51820
k3s kubectl annotate node node1 kilo.squat.ai/persistent-keepalive=20
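To confirm the annotations took effect, and optionally that the WireGuard mesh is up, a quick check on server1 might look like the following; the wg command assumes wireguard-tools is installed and that Kilo is using its default interface name kilo0:
# Show the kilo.squat.ai annotations that were just applied to node1
k3s kubectl describe node node1 | grep kilo.squat.ai
# Optional: inspect the WireGuard peers (assumes the default kilo0 interface)
wg show kilo0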
Next, add server2. Its role is the same as server1's, and the parameters passed when starting K3S are the same as those used to install server1:
curl -fsSL https://get.k3s.io | INSTALL_K3S_VERSION=v1.23.9+k3s1 \
K3S_URL=https://<Server1 IP>:6443 K3S_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \
sh -s - server --cluster-init --node-name server2 --docker --tls-san <Server2 IP> \
--flannel-backend none --node-external-ip <Server2 IP> \
--kube-proxy-arg metrics-bind-address=0.0.0.0
mkdir ~/.kube -p && ln /etc/rancher/k3s/k3s.yaml ~/.kube/config && chmod 600 ~/.kube/config
# Annotate server2 (run on the server1 machine)
k3s kubectl annotate node server2 kilo.squat.ai/location=server2
k3s kubectl annotate node server2 kilo.squat.ai/force-endpoint=<Server2 IP>:51820
k3s kubectl annotate node server2 kilo.squat.ai/persistent-keepalive=20
After the installation finishes, run k3s kubectl get nodes. The output should look like the following; note that etcd wants an odd number of members (2n+1) to keep quorum:
NAME STATUS ROLES AGE VERSION
node1 Ready <none> 30s v1.23.9+k3s1
server1 Ready control-plane,etcd,master 5m30s v1.23.9+k3s1
server2 Ready control-plane,etcd,master 78s v1.23.9+k3s1
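Before relying on server2 as a master, it may be worth checking that the API server also answers on server2's own address rather than only on 127.0.0.1; this assumes --tls-san <Server2 IP> was passed during the install above, so the serving certificate is valid for that IP:
# On server2: query the API through its external address
k3s kubectl --server https://<Server2 IP>:6443 get nodes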
Run k3s kubectl get pods -A -o wide to check the pod distribution:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-d76bd69b-wsrh4 1/1 Running 0 9m57s 10.42.0.5 server1 <none> <none>
kube-system helm-install-traefik-crd-rzfr6 0/1 Completed 0 9m57s 10.42.0.2 server1 <none> <none>
kube-system helm-install-traefik-q7jq9 0/1 Completed 1 9m57s 10.42.0.3 server1 <none> <none>
kube-system kilo-fj5cv 1/1 Running 0 9m57s 172.19.15.217 server1 <none> <none>
kube-system kilo-gbpsf 1/1 Running 0 5m59s 172.19.18.49 server2 <none> <none>
kube-system kilo-h2db7 1/1 Running 0 5m11s 172.19.0.3 node1 <none> <none>
kube-system local-path-provisioner-6c79684f77-pxmsd 1/1 Running 0 9m56s 10.42.0.6 server1 <none> <none>
kube-system metrics-server-7cd5fcb6b7-tkdvr 1/1 Running 0 9m57s 10.42.0.4 server1 <none> <none>
kube-system svclb-traefik-070d5809-2nq6t 2/2 Running 0 9m49s 10.42.0.7 server1 <none> <none>
kube-system svclb-traefik-070d5809-4dbsb 2/2 Running 0 5m58s 10.42.1.2 server2 <none> <none>
kube-system svclb-traefik-070d5809-t9rvc 2/2 Running 0 5m1s 10.42.2.2 node1 <none> <none>
kube-system traefik-df4ff85d6-ctlfq 1/1 Running 0 9m49s 10.42.0.8 server1 <none> <none>
As you can see, server1, server2 and node1 are all healthy. Next, on the server2 machine, use kubectl drain <node_name> to safely evict server1:
kubectl drain server1 --delete-emptydir-data --ignore-daemonsets --force
How long this takes depends on the number of pods. If no errors appear, the node can then be upgraded, maintained or removed. Next, check the node and pod status:
NAME STATUS ROLES AGE VERSION
node1 Ready <none> 10m v1.23.9+k3s1
server1 Ready,SchedulingDisabled control-plane,etcd,master 15m v1.23.9+k3s1
server2 Ready control-plane,etcd,master 11m v1.23.9+k3s1
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-d76bd69b-vwkpn 1/1 Running 0 3m23s 10.42.1.4 server2 <none> <none>
kube-system kilo-fj5cv 1/1 Running 0 16m 172.19.15.217 server1 <none> <none>
kube-system kilo-gbpsf 1/1 Running 0 12m 172.19.18.49 server2 <none> <none>
kube-system kilo-h2db7 1/1 Running 0 11m 172.19.0.3 node1 <none> <none>
kube-system local-path-provisioner-6c79684f77-c9sd2 1/1 Running 0 3m23s 10.42.1.5 server2 <none> <none>
kube-system metrics-server-7cd5fcb6b7-hpqpr 1/1 Running 0 3m23s 10.42.2.3 node1 <none> <none>
kube-system svclb-traefik-070d5809-2nq6t 2/2 Running 0 16m 10.42.0.7 server1 <none> <none>
kube-system svclb-traefik-070d5809-4dbsb 2/2 Running 0 12m 10.42.1.2 server2 <none> <none>
kube-system svclb-traefik-070d5809-t9rvc 2/2 Running 0 11m 10.42.2.2 node1 <none> <none>
kube-system traefik-df4ff85d6-sp7r4 1/1 Running 0 3m23s 10.42.1.3 server2 <none> <none>
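To see exactly which pods are still bound to server1 after the drain, they can be filtered by node name; a check like the one below (run from server2) should list only the leftovers:
# List the pods still scheduled on server1
k3s kubectl get pods -A --field-selector spec.nodeName=server1 -o wide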
As you can see, a few pods still remain on server1; these are DaemonSet-managed pods (kilo and svclb-traefik), which drain does not evict. Next, delete the server1 node, still from the server2 machine; this step takes a while:
kubectl delete node server1
Once the command finishes, check the node and pod status again:
NAME STATUS ROLES AGE VERSION
node1 Ready <none> 15m v1.23.9+k3s1
server2 Ready control-plane,etcd,master 16m v1.23.9+k3s1
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system coredns-d76bd69b-vwkpn 1/1 Running 0 7m16s 10.42.1.4 server2 <none> <none>
kube-system kilo-gbpsf 1/1 Running 0 16m 172.19.18.49 server2 <none> <none>
kube-system kilo-h2db7 1/1 Running 0 15m 172.19.0.3 node1 <none> <none>
kube-system local-path-provisioner-6c79684f77-c9sd2 1/1 Running 0 7m16s 10.42.1.5 server2 <none> <none>
kube-system metrics-server-7cd5fcb6b7-hpqpr 1/1 Running 0 7m16s 10.42.2.3 node1 <none> <none>
kube-system svclb-traefik-070d5809-4dbsb 2/2 Running 0 16m 10.42.1.2 server2 <none> <none>
kube-system svclb-traefik-070d5809-t9rvc 2/2 Running 0 15m 10.42.2.2 node1 <none> <none>
kube-system traefik-df4ff85d6-sp7r4 1/1 Running 0 7m16s 10.42.1.3 server2 <none> <none>
As you can see, the pods that were on server1 are all gone. However, K3S_URL still points to <Server1 IP>, and that address is no longer in use, so it has to be switched to <Server2 IP>. Next, update the environment variables on server2 and node1 and restart k3s and k3s-agent.
server2 is now the new master node, so it is enough to delete the environment file of the k3s service and restart it:
rm -f /etc/systemd/system/k3s.service.env
systemctl restart k3s
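After the restart, a quick sanity check that k3s on server2 came back and still sees the remaining nodes might look like this:
# Confirm the k3s service is active and the cluster is reachable
systemctl is-active k3s
k3s kubectl get nodes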
Next, modify node1's environment variables. First inspect them (cat /etc/systemd/system/k3s-agent.service.env); they look like this:
K3S_TOKEN="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
K3S_URL="https://<Server1 IP>:6443"
Now edit /etc/systemd/system/k3s-agent.service.env, change <Server1 IP> to <Server2 IP>, save with :wq, and restart the k3s-agent service:
vi /etc/systemd/system/k3s-agent.service.env
systemctl restart k3s-agent
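If you prefer a non-interactive edit, for example when several agents point at the old address, a sed one-liner achieves the same thing; substitute the real IPs for the placeholders:
# Swap the old master IP for the new one in the agent's environment file, then restart
sed -i 's/<Server1 IP>/<Server2 IP>/' /etc/systemd/system/k3s-agent.service.env
systemctl restart k3s-agent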
Summary
The core of the whole procedure is kubectl's drain command. For an ordinary worker node, the later steps of changing environment variables and restarting services are not needed:
kubectl drain <node_name> --delete-emptydir-data --ignore-daemonsets --force
kubectl delete node <node_name>
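Finally, once server1 has been removed from the cluster, the machine itself can be cleaned up before it is handed back; the script-based install used here ships an uninstall script for server nodes (agents get k3s-agent-uninstall.sh instead):
# On the old server1 machine: remove k3s and its data
/usr/local/bin/k3s-uninstall.sh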