Replacing the K3S Master Node


When the servers running our K3S cluster were about to expire, the master node had to be replaced. Searching the web for "master migration" turned up almost nothing relevant; most results only cover backup and restore. Being able to confirm that every node can be replaced before anything hits production matters a lot to a half-skilled operator like me, so this post records the process of replacing (draining) the master.

Process

Assume the current K3S cluster has two nodes, one master and one worker, as shown below:

bash
NAME      STATUS   ROLES                       AGE   VERSION
node1     Ready    <none>                      16m   v1.23.9+k3s1
server1   Ready    control-plane,etcd,master   19m   v1.23.9+k3s1

Server 1 installation commands

bash
# Install the k3s server (embedded etcd via --cluster-init, Docker runtime,
# flannel disabled because Kilo provides the pod network)
curl -fsSL https://get.k3s.io | INSTALL_K3S_VERSION=v1.23.9+k3s1 \
sh -s - server --cluster-init --node-name server1 --docker \
--tls-san <Server1 IP> --node-external-ip <Server1 IP> --flannel-backend none \
--kube-proxy-arg metrics-bind-address=0.0.0.0

# Make the kubeconfig available to kubectl
mkdir ~/.kube -p && ln /etc/rancher/k3s/k3s.yaml ~/.kube/config && chmod 600 ~/.kube/config

# Annotate the node for Kilo (WireGuard mesh)
k3s kubectl annotate node server1 kilo.squat.ai/location=server1
k3s kubectl annotate node server1 kilo.squat.ai/force-endpoint=<Server1 IP>:51820
k3s kubectl annotate node server1 kilo.squat.ai/persistent-keepalive=20

# Install the Kilo CRDs and the k3s manifest
k3s kubectl apply -f https://raw.githubusercontent.com/squat/kilo/main/manifests/crds.yaml
k3s kubectl apply -f https://raw.githubusercontent.com/squat/kilo/main/manifests/kilo-k3s.yaml
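
Before adding any other nodes, you need the cluster join token (K3S_TOKEN) from server1. Assuming the default k3s data directory, it can be read like this:

bash
sudo cat /var/lib/rancher/k3s/server/node-token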

Node 1 installation commands

bash
# Join node1 to the cluster as an agent, pointing at server1
curl -fsSL https://get.k3s.io | INSTALL_K3S_VERSION=v1.23.9+k3s1 \
K3S_URL=https://<Server1 IP>:6443 K3S_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \
sh -s - agent --node-name node1 --docker --node-external-ip <Node1 IP> \
--kube-proxy-arg metrics-bind-address=0.0.0.0

# Annotate node1 for Kilo (run on the server1 machine)
k3s kubectl annotate node node1 kilo.squat.ai/location=node1
k3s kubectl annotate node node1 kilo.squat.ai/force-endpoint=<Node1 IP>:51820
k3s kubectl annotate node node1 kilo.squat.ai/persistent-keepalive=20
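
Once the agent is up, a quick sanity check looks like this (the first command on node1, the second on server1):

bash
# on node1: confirm the agent service is running
systemctl status k3s-agent --no-pager
# on server1: confirm the node has registered and is Ready
k3s kubectl get nodes -o wide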

Next, add server2. Its role is the same as server1's, and the k3s startup arguments are the same as those used when installing server1, except that K3S_URL and K3S_TOKEN point it at the existing cluster:

bash
# Join server2 as an additional control-plane/etcd node
curl -fsSL https://get.k3s.io | INSTALL_K3S_VERSION=v1.23.9+k3s1 \
K3S_URL=https://<Server1 IP>:6443 K3S_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \
sh -s - server --cluster-init --node-name server2 --docker --tls-san <Server2 IP> \
--flannel-backend none --node-external-ip <Server2 IP> \
--kube-proxy-arg metrics-bind-address=0.0.0.0

# Make the kubeconfig available to kubectl
mkdir ~/.kube -p && ln /etc/rancher/k3s/k3s.yaml ~/.kube/config && chmod 600 ~/.kube/config

# Annotate server2 for Kilo (run on the server1 machine)
k3s kubectl annotate node server2 kilo.squat.ai/location=server2
k3s kubectl annotate node server2 kilo.squat.ai/force-endpoint=<Server2 IP>:51820
k3s kubectl annotate node server2 kilo.squat.ai/persistent-keepalive=20
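
To confirm that server2 really joined as an etcd member rather than a plain worker, filtering nodes by the etcd role label (which k3s sets on embedded-etcd members) should work:

bash
k3s kubectl get nodes -l node-role.kubernetes.io/etcd=true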

After installation, running k3s kubectl get nodes should show the following. Note that etcd wants an odd number of members (2n+1): a cluster of 2n+1 members tolerates n failures, and adding one more member gains no extra fault tolerance, so two servers is only a temporary state during the migration.

bash
NAME      STATUS   ROLES                       AGE     VERSION
node1     Ready    <none>                      30s     v1.23.9+k3s1
server1   Ready    control-plane,etcd,master   5m30s   v1.23.9+k3s1
server2   Ready    control-plane,etcd,master   78s     v1.23.9+k3s1
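
If you want to double-check that etcd itself is healthy before touching server1, the apiserver's standard etcd health probe is a quick way (a sketch, assuming the endpoint is enabled as on a stock kube-apiserver):

bash
k3s kubectl get --raw /healthz/etcd
# expected output: ok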

Run k3s kubectl get pods -A -o wide to see how the pods are distributed across the nodes:

NAMESPACE     NAME                                      READY   STATUS      RESTARTS   AGE     IP              NODE      NOMINATED NODE   READINESS GATES
kube-system   coredns-d76bd69b-wsrh4                    1/1     Running     0          9m57s   10.42.0.5       server1   <none>           <none>
kube-system   helm-install-traefik-crd-rzfr6            0/1     Completed   0          9m57s   10.42.0.2       server1   <none>           <none>
kube-system   helm-install-traefik-q7jq9                0/1     Completed   1          9m57s   10.42.0.3       server1   <none>           <none>
kube-system   kilo-fj5cv                                1/1     Running     0          9m57s   172.19.15.217   server1   <none>           <none>
kube-system   kilo-gbpsf                                1/1     Running     0          5m59s   172.19.18.49    server2   <none>           <none>
kube-system   kilo-h2db7                                1/1     Running     0          5m11s   172.19.0.3      node1     <none>           <none>
kube-system   local-path-provisioner-6c79684f77-pxmsd   1/1     Running     0          9m56s   10.42.0.6       server1   <none>           <none>
kube-system   metrics-server-7cd5fcb6b7-tkdvr           1/1     Running     0          9m57s   10.42.0.4       server1   <none>           <none>
kube-system   svclb-traefik-070d5809-2nq6t              2/2     Running     0          9m49s   10.42.0.7       server1   <none>           <none>
kube-system   svclb-traefik-070d5809-4dbsb              2/2     Running     0          5m58s   10.42.1.2       server2   <none>           <none>
kube-system   svclb-traefik-070d5809-t9rvc              2/2     Running     0          5m1s    10.42.2.2       node1     <none>           <none>
kube-system   traefik-df4ff85d6-ctlfq                   1/1     Running     0          9m49s   10.42.0.8       server1   <none>           <none>

server1, server2, and node1 are all healthy, so the next step is to safely evict server1 by running kubectl drain <node_name> on the server2 machine:

bash
kubectl drain server1 --delete-emptydir-data --ignore-daemonsets --force
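
For reference: --ignore-daemonsets is required because DaemonSet pods (kilo, svclb-traefik) cannot be evicted, --delete-emptydir-data allows pods using emptyDir volumes to be evicted (their local data is lost), and --force covers pods that have no controller. If you only want to stop new pods from landing on the node without evicting anything yet, cordon works on its own:

bash
kubectl cordon server1     # mark the node unschedulable only
kubectl uncordon server1   # undo it if you change your mind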

How long this takes depends on the number of pods. If it finishes without errors, the node is ready for upgrade, maintenance, or deletion. Check the node and pod status again:

bash
NAME      STATUS                     ROLES                       AGE   VERSION
node1     Ready                      <none>                      10m   v1.23.9+k3s1
server1   Ready,SchedulingDisabled   control-plane,etcd,master   15m   v1.23.9+k3s1
server2   Ready                      control-plane,etcd,master   11m   v1.23.9+k3s1
bash
NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE     IP              NODE      NOMINATED NODE   READINESS GATES
kube-system   coredns-d76bd69b-vwkpn                    1/1     Running   0          3m23s   10.42.1.4       server2   <none>           <none>
kube-system   kilo-fj5cv                                1/1     Running   0          16m     172.19.15.217   server1   <none>           <none>
kube-system   kilo-gbpsf                                1/1     Running   0          12m     172.19.18.49    server2   <none>           <none>
kube-system   kilo-h2db7                                1/1     Running   0          11m     172.19.0.3      node1     <none>           <none>
kube-system   local-path-provisioner-6c79684f77-c9sd2   1/1     Running   0          3m23s   10.42.1.5       server2   <none>           <none>
kube-system   metrics-server-7cd5fcb6b7-hpqpr           1/1     Running   0          3m23s   10.42.2.3       node1     <none>           <none>
kube-system   svclb-traefik-070d5809-2nq6t              2/2     Running   0          16m     10.42.0.7       server1   <none>           <none>
kube-system   svclb-traefik-070d5809-4dbsb              2/2     Running   0          12m     10.42.1.2       server2   <none>           <none>
kube-system   svclb-traefik-070d5809-t9rvc              2/2     Running   0          11m     10.42.2.2       node1     <none>           <none>
kube-system   traefik-df4ff85d6-sp7r4                   1/1     Running   0          3m23s   10.42.1.3       server2   <none>           <none>
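
To list exactly which pods are still bound to server1 after the drain, a field selector on spec.nodeName does the trick:

bash
kubectl get pods -A -o wide --field-selector spec.nodeName=server1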

Some pods still remain on server1; these are DaemonSet pods (kilo, svclb-traefik), which drain cannot evict. Next, delete the server1 node, again from the server2 machine. This step takes a while (with embedded etcd, deleting the node also removes server1 as an etcd member):

bash
kubectl delete node server1

After the command finishes, check the node and pod status once more:

NAME      STATUS   ROLES                       AGE   VERSION
node1     Ready    <none>                      15m   v1.23.9+k3s1
server2   Ready    control-plane,etcd,master   16m   v1.23.9+k3s1
bash
NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE     IP             NODE      NOMINATED NODE   READINESS GATES
kube-system   coredns-d76bd69b-vwkpn                    1/1     Running   0          7m16s   10.42.1.4      server2   <none>           <none>
kube-system   kilo-gbpsf                                1/1     Running   0          16m     172.19.18.49   server2   <none>           <none>
kube-system   kilo-h2db7                                1/1     Running   0          15m     172.19.0.3     node1     <none>           <none>
kube-system   local-path-provisioner-6c79684f77-c9sd2   1/1     Running   0          7m16s   10.42.1.5      server2   <none>           <none>
kube-system   metrics-server-7cd5fcb6b7-hpqpr           1/1     Running   0          7m16s   10.42.2.3      node1     <none>           <none>
kube-system   svclb-traefik-070d5809-4dbsb              2/2     Running   0          16m     10.42.1.2      server2   <none>           <none>
kube-system   svclb-traefik-070d5809-t9rvc              2/2     Running   0          15m     10.42.2.2      node1     <none>           <none>
kube-system   traefik-df4ff85d6-sp7r4                   1/1     Running   0          7m16s   10.42.1.3      server2   <none>           <none>

All of the pods that were on server1 are now gone. However, K3S_URL still points to <Server1 IP>, and that address is no longer in use, so it has to be replaced with <Server2 IP>. Next, update the environment variables on server2 and node1 and restart k3s and k3s-agent respectively.

server2 is now the (only) master node, so it is enough to delete the environment file of its k3s service:

bash
# Drop K3S_URL/K3S_TOKEN so k3s no longer points at the old server1 address
rm /etc/systemd/system/k3s.service.env -rf
systemctl restart k3s
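
After the restart, it is worth confirming that the API server on server2 still answers and the remaining nodes stay Ready, for example:

bash
systemctl status k3s --no-pager
k3s kubectl get nodes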

Next, update node1's environment variables. Inspect the current values first (cat /etc/systemd/system/k3s-agent.service.env); they look like this:

bash
K3S_TOKEN="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
K3S_URL="https://<Server1 IP>:6443"

Now edit /etc/systemd/system/k3s-agent.service.env, change <Server1 IP> to <Server2 IP>, save with :wq, and restart the k3s-agent service:

bash
vi /etc/systemd/system/k3s-agent.service.env
systemctl restart k3s-agent
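
If you prefer a non-interactive edit, a sed one-liner (substitute the real addresses; the -i.bak keeps a backup of the original file) achieves the same thing:

bash
sed -i.bak 's/<Server1 IP>/<Server2 IP>/' /etc/systemd/system/k3s-agent.service.env
systemctl restart k3s-agent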

Summary

The core of the whole process is kubectl's drain and delete commands. For an ordinary worker node, the later steps of changing environment variables and restarting services are not needed:

bash
kubectl drain <node_name> --delete-emptydir-data --ignore-daemonsets --force
kubectl delete node <node_name>