
K3s Control Node Replacement Process

tip

Starting from v3.11.8, ocboot deploys clusters on K3s by default. This document applies to K3s environments; if you are running a K8s deployment, refer to: K8s Control Node Replacement Process.

Even in a highly available 3-node Cloudpods cluster on K3s, any one node can go down in production. Common failures, such as replacing memory or a CPU, can be handled by temporarily shutting the node down and restarting it after the repair.

But if a failure such as a dead hard disk makes the data unrecoverable, you need to delete the node and join a new one in its place. The following describes the steps and precautions.

Test Environment​

  • k3s_vip: 192.168.0.66
    • A keepalived static pod runs on each of the 3 master nodes
    • Started directly by k3s, path: /var/lib/rancher/k3s/agent/pod-manifests/keepalived.yaml
  • primary_master_node, the first initialized control node: ip 192.168.0.254
  • master_node_1, the second joined control node: ip 192.168.0.244
  • master_node_2, the third joined control node: ip 192.168.0.243
  • Database: deployed outside the cluster, not on the 3 nodes
  • CSI: uses local-path
    • The local-path CSI strictly binds a pod to the node that holds its volume, so pay special attention here
    • If the downed node has local-path pvcs bound to it, the pods using those pvcs cannot move to other Ready nodes; we call these stateful pods. You can list all local-path pvcs with the command kubectl get pvc -A | grep local-path
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
lzx-ha-env Ready control-plane,etcd,master 7d5h v1.28.5+k3s1 192.168.0.254 <none> CentOS Linux 7 (Core) 5.4.130-1.yn20230805.el7.x86_64 containerd://1.7.11-k3s2
lzx-ha-env-2 Ready control-plane,etcd,master 7d5h v1.28.5+k3s1 192.168.0.244 <none> CentOS Linux 7 (Core) 5.4.130-1.yn20230805.el7.x86_64 containerd://1.7.11-k3s2
lzx-ha-env-3 Ready control-plane,etcd,master 7d5h v1.28.5+k3s1 192.168.0.243 <none> CentOS Linux 7 (Core) 5.4.130-1.yn20230805.el7.x86_64 containerd://1.7.11-k3s2
  • minio:
    • In a high availability deployment, minio is used as the glance backend storage
    • Managed as a statefulset
    • Uses the local-path CSI as backend storage
$ kubectl get pods -n onecloud-minio -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
minio-0 1/1 Running 0 5d22h 10.40.47.23 lzx-ha-env-2 <none> <none>
minio-1 1/1 Running 0 14h 10.40.139.221 lzx-ha-env <none> <none>
minio-2 1/1 Running 0 2d20h 10.40.147.34 lzx-ha-env-3 <none> <none>
minio-3 1/1 Running 0 2d20h 10.40.147.35 lzx-ha-env-3 <none> <none>

$ kubectl get pvc -n onecloud-minio
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
export-minio-0 Bound pvc-9c052647-73c2-4eb8-a5f0-69abf6cfc898 1Ti RWO local-path 7d3h
export-minio-1 Bound pvc-aaded012-d6ca-42b9-bc3a-c03da7c66ec8 1Ti RWO local-path 7d3h
export-minio-2 Bound pvc-aae12617-3e12-4a27-9cd8-fef84e609b12 1Ti RWO local-path 7d3h
export-minio-3 Bound pvc-721329f0-c961-43f4-a7cc-ccd0f650dbd0 1Ti RWO local-path 7d3h
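
Because local-path pins each volume to a single node, you can check which node a given volume lives on before planning the replacement: the local-path provisioner records the binding in the PV's nodeAffinity. For example, for export-minio-1's volume:

$ kubectl get pv pvc-aaded012-d6ca-42b9-bc3a-c03da7c66ec8 -o jsonpath='{.spec.nodeAffinity}'
# The output contains a kubernetes.io/hostname match for the owning node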

Test​

Goal​

Take down the primary_master_node (192.168.0.254) and join a new node to replace it.

Steps​

1. Confirm Which Stateful Pods and PVCs Are Running on the Failed Node

# Find the node name in the k8s cluster based on IP
$ kubectl get nodes -o wide | grep 192.168.0.254
lzx-ha-env Ready control-plane,etcd,master 7d5h v1.28.5+k3s1 192.168.0.254 <none> CentOS Linux 7 (Core) 5.4.130-1.yn20230805.el7.x86_64 containerd://1.7.11-k3s2

# View all local-path pvcs
$ kubectl get pvc -A | grep local-path
# export-minio-x in the onecloud-minio namespace stores glance images and is a key component
onecloud-minio export-minio-0 Bound pvc-9c052647-73c2-4eb8-a5f0-69abf6cfc898 1Ti RWO local-path 7d3h
onecloud-minio export-minio-1 Bound pvc-aaded012-d6ca-42b9-bc3a-c03da7c66ec8 1Ti RWO local-path 7d3h
onecloud-minio export-minio-2 Bound pvc-aae12617-3e12-4a27-9cd8-fef84e609b12 1Ti RWO local-path 7d3h
onecloud-minio export-minio-3 Bound pvc-721329f0-c961-43f4-a7cc-ccd0f650dbd0 1Ti RWO local-path 7d3h

# export-monitor-minio-x in the onecloud-monitoring namespace stores service logs and is not a key component
onecloud-monitoring export-monitor-minio-0 Bound pvc-9b2c4065-ea17-415c-bbec-840db89cf88e 1Ti RWO local-path 7d3h
onecloud-monitoring export-monitor-minio-1 Bound pvc-9cd740c2-0044-49df-8cfb-1ead09127b4a 1Ti RWO local-path 7d3h
onecloud-monitoring export-monitor-minio-2 Bound pvc-e5fc90f7-8457-43a6-9b50-b4244e53eff2 1Ti RWO local-path 7d3h
onecloud-monitoring export-monitor-minio-3 Bound pvc-7b2563db-7904-4297-9bed-bf6e0cb8f626 1Ti RWO local-path 7d3h

# The pvcs below, under the onecloud namespace, are all system service dependencies
# default-baremetal-agent stores bare metal management data; if baremetal-agent is not enabled it can be ignored, and by default it stays Pending (unbound)
onecloud default-baremetal-agent Pending local-path 3h35m
# default-glance stores glance service images; in a high availability deployment the default-glance deployment does not mount this pvc and instead stores images in the minio s3 storage in onecloud-minio, so it can be ignored
onecloud default-glance Pending
# default-victoria-metrics stores platform monitoring data; monitoring data can tolerate loss, so if its node goes down the pvc can be deleted and recreated
onecloud default-victoria-metrics Bound pvc-b0d78630-b4fd-4521-898d-e76b4b1f8e1b 20G RWO local-path 7d5h

# View which pods are on this node
$ kubectl get pods -A -o wide | grep onecloud | grep 'lzx-ha-env '
onecloud-minio minio-1 1/1 Running 0 14h 10.40.139.221 lzx-ha-env <none> <none>
onecloud-monitoring monitor-minio-1 1/1 Running 0 14h 10.40.139.218 lzx-ha-env <none> <none>
onecloud default-etcd-2lnzlkt4fc 1/1 Running 0 14h 10.40.139.227 lzx-ha-env <none> <none>
onecloud default-host-deployer-9fcck 1/1 Running 0 31h 192.168.0.254 lzx-ha-env <none> <none>
onecloud default-host-health-bxgbk 1/1 Running 0 7d5h 192.168.0.254 lzx-ha-env <none> <none>
onecloud default-host-image-h4m8v 1/1 Running 0 7d5h 192.168.0.254 lzx-ha-env <none> <none>
onecloud default-host-ntjrv 3/3 Running 1 (14h ago) 31h 192.168.0.254 lzx-ha-env <none> <none>
onecloud default-region-dns-8sdhs 1/1 Running 0 31h 192.168.0.254 lzx-ha-env <none> <none>
onecloud default-telegraf-p7zc4 1/1 Running 0 7d5h 192.168.0.254 lzx-ha-env <none> <none>

From the output of the commands above, we can identify the stateful pods on the primary_master_node, such as onecloud-minio/minio-1.
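
Instead of grep, you can also ask the API server directly for every pod scheduled on the node, which avoids accidental matches on similar node names:

$ kubectl get pods -A -o wide --field-selector spec.nodeName=lzx-ha-env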

2. Shut Down primary_master_node and Remove It from the Cluster

# Log in to the other two master_node nodes, for example: 192.168.0.244
$ ssh root@192.168.0.244

# View node status; primary_master_node has become NotReady
[root@lzx-ha-env-2 ~]# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION
lzx-ha-env NotReady control-plane,etcd,master 7d6h v1.28.5+k3s1
lzx-ha-env-2 Ready control-plane,etcd,master 7d6h v1.28.5+k3s1
lzx-ha-env-3 Ready control-plane,etcd,master 7d6h v1.28.5+k3s1

# Drain the primary_master_node node: lzx-ha-env
# (on newer kubectl versions the --delete-local-data flag is named --delete-emptydir-data)
[root@lzx-ha-env-2 ~]$ kubectl drain --delete-local-data --ignore-daemonsets lzx-ha-env
Warning: ignoring DaemonSet-managed Pods: kube-system/calico-node-dmcnq, kube-system/traefik-sj9gw, onecloud/default-host-deployer-9fcck, onecloud/default-host-health-bxgbk, onecloud/default-host-image-h4m8v, onecloud/default-host-ntjrv, onecloud/default-region-dns-8sdhs, onecloud/default-telegraf-p7zc4
evicting pod onecloud/default-etcd-2lnzlkt4fc
evicting pod onecloud-minio/minio-1
evicting pod onecloud-monitoring/monitor-minio-1
# This command hangs because primary_master_node is already shut down, so its pods cannot be deleted. Press Ctrl-C to cancel the command
^C
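
# Alternatively, bound the wait up front instead of cancelling by hand
# (--timeout is a standard kubectl drain flag)
$ kubectl drain --delete-local-data --ignore-daemonsets --timeout=60s lzx-ha-env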

# Use kubectl delete node to directly delete primary_master_node node
$ kubectl delete node lzx-ha-env

# The pods now in Pending status are exactly the stateful pods that previously ran on primary_master_node
# They use local-path pvcs, which are strongly bound to the deleted node and still exist in the cluster
[root@lzx-ha-env-2 ~]$ kubectl get pods -A | grep Pending
onecloud-minio minio-1 0/1 Pending 0 11s
onecloud-monitoring monitor-minio-1 0/1 Pending 0 12s
onecloud default-etcd-4pvj4fxc8g 0/1 Pending 0 18m
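
# To confirm why a pod is Pending, check its scheduling events; for these pods
# the scheduler typically reports "volume node affinity conflict"
$ kubectl describe pod -n onecloud-minio minio-1 | tail -n 5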

3. Remove the Old primary_master_node's etcd Member

Download etcdctl:

$ wget https://github.com/etcd-io/etcd/releases/download/v3.5.5/etcd-v3.5.5-linux-amd64.tar.gz
$ tar xf etcd-v3.5.5-linux-amd64.tar.gz
$ cp etcd-v3.5.5-linux-amd64/etcdctl /usr/local/bin/
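
# Verify the binary works
$ etcdctl version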

View etcd members

# Use etcdctl to view member list
$ etcdctl member list \
--endpoints https://127.0.0.1:2379 \
--cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert /var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key /var/lib/rancher/k3s/server/tls/etcd/client.key

You will find the old primary_master_node member lzx-ha-env is no longer in the etcd cluster.

ce0d368c9b2596df, started, lzx-ha-env-3-f8946fea, https://192.168.0.243:2380, https://192.168.0.243:2379, false
eed2e0188ca5085c, started, lzx-ha-env-2-110bc617, https://192.168.0.244:2380, https://192.168.0.244:2379, false
tip

If the lzx-ha-env member is still in the etcd cluster, remove it with the following command. Normally k3s removes the member automatically.

$ etcdctl member remove xxxxx-id \
--endpoints https://127.0.0.1:2379 \
--cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert /var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key /var/lib/rancher/k3s/server/tls/etcd/client.key
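
After any removal, you can verify that the remaining members are healthy (etcdctl's endpoint health --cluster checks every member):

$ etcdctl endpoint health --cluster \
--endpoints https://127.0.0.1:2379 \
--cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert /var/lib/rancher/k3s/server/tls/etcd/client.crt \
--key /var/lib/rancher/k3s/server/tls/etcd/client.key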

4. Replace the Old primary_master_node and Join a New master_node

The old primary_master_node node has been deleted, and keepalived's vip has drifted to the lzx-ha-env-2 node (the vip may drift to any surviving control node, so check manually; see the loop after the output below):

# View vip is 192.168.0.66
# If this node runs cloud platform host service, ip will be bound to br0
[root@lzx-ha-env-2 ~]$ ip addr show br0
18: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 00:22:5f:cf:fa:66 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.244/24 brd 192.168.0.255 scope global br0
valid_lft forever preferred_lft forever
inet 192.168.0.66/32 scope global br0
valid_lft forever preferred_lft forever
inet6 fe80::222:5fff:fecf:fa66/64 scope link
valid_lft forever preferred_lft forever
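
If you are not sure which node currently holds the vip, a quick loop over the surviving control nodes will find it (a sketch; substitute your own node IPs):

$ for h in 192.168.0.244 192.168.0.243; do
    ssh root@$h 'ip -4 addr | grep -q "192.168.0.66/" && echo "$(hostname) holds the VIP"'
  done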

# View /var/lib/rancher/k3s/agent/pod-manifests/keepalived.yaml env configuration
[root@lzx-ha-env-2 ~]$ cat /var/lib/rancher/k3s/agent/pod-manifests/keepalived.yaml | grep -A 17 env
env:
# The keepalived priority; primary_master_node's keepalived is set to 100, the others on master_nodes to 90
- name: KEEPALIVED_PRIORITY
value: "90"
# Set VIP
- name: KEEPALIVED_VIRTUAL_IPS
value: "#PYTHON2BASH:['192.168.0.66']"
# Is BACKUP role
- name: KEEPALIVED_STATE
value: BACKUP
# Password
- name: KEEPALIVED_PASSWORD
value: "MTkyLjE2"
# router id
- name: KEEPALIVED_ROUTER_ID
value: "66"
# This node's network interface actual ip
- name: KEEPALIVED_NODE_IP
value: "192.168.0.244"
# keepalived bound network interface
- name: KEEPALIVED_INTERFACE
value: "eth0"
# keepalived health check command
- name: CHECK_KUBE_CMD
value: "curl -k -XGET https://192.168.0.244:6443/healthz --cert /var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --key /var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --cacert /var/lib/rancher/k3s/server/tls/client-ca.crt"

# The cluster currently has only 2 nodes
[root@lzx-ha-env-2 ocboot]$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
lzx-ha-env-2 Ready control-plane,etcd,master 7d6h v1.28.5+k3s1 192.168.0.244 <none> CentOS Linux 7 (Core) 5.4.130-1.yn20230805.el7.x86_64 containerd://1.7.11-k3s2
lzx-ha-env-3 Ready control-plane,etcd,master 7d6h v1.28.5+k3s1 192.168.0.243 <none> CentOS Linux 7 (Core) 5.4.130-1.yn20230805.el7.x86_64 containerd://1.7.11-k3s2

Next, download the ocboot deployment tool code.

Download Deployment Tool​

The deployment tool code is at https://github.com/yunionio/ocboot/releases; select the corresponding version and download the tar.gz source package.

$ wget https://github.com/yunionio/ocboot/archive/refs/tags/master-v3.11.12-6.tar.gz
$ tar xf master-v3.11.12-6.tar.gz
$ cd ocboot-master-v3.11.12-6

Now use ocboot to join a new node. Edit the ocboot yaml configuration:

  • Treat the current lzx-ha-env-2 node as primary_master_node
  • Add the new master_node's information:
    • IP: 192.168.0.239
    • Name: lzx-test-add-node

Because the old primary_master_node has been deleted and the current master_node_1 now serves as the cluster's primary_master_node, the configuration becomes:

tip
  • Before joining a new node, make sure master_node_1 can log in to the new node over ssh without a password (see the example below)
  • The MySQL password can be queried with the following command:
[root@lzx-ha-env-2 ~]$ kubectl get oc -n onecloud -o yaml | grep mysql: -A 5
    mysql:
      host: 192.168.0.252
      password: 0neC1oudDB#
      port: 3306
      username: root
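
For the passwordless-ssh requirement above, a typical setup from lzx-ha-env-2 looks like this (assuming password authentication is still enabled on the new node):

$ ssh-keygen -t rsa                 # skip if a key pair already exists
$ ssh-copy-id root@192.168.0.239
$ ssh root@192.168.0.239 hostname   # should succeed without a password prompt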
$ vim config-new-k8s-ha.yaml
primary_master_node:
  # Treat the previous master_node_1 192.168.0.244 as primary_master_node
  hostname: 192.168.0.244
  use_local: false
  user: root
  onecloud_version: "v3.11.9"
  # Database connection information; fill in according to your environment
  db_host: 192.168.0.252
  db_user: "root"
  db_password: "0neC1oudDB#"
  db_port: "3306"
  image_repository: registry.cn-beijing.aliyuncs.com/yunionio
  ha_using_local_registry: false
  node_ip: "192.168.0.244"
  # vip exposed by keepalived
  controlplane_host: 192.168.0.66
  controlplane_port: "6443"
  as_host: true
  # Enable ha; keepalived is deployed by default
  high_availability: true
  use_ee: false
  # High availability uses minio
  enable_minio: true
  host_networks: "eth0/br0/192.168.0.244"

master_nodes:
  # Join the k8s cluster via the 192.168.0.66 vip
  controlplane_host: 192.168.0.66
  controlplane_port: "6443"
  # Run the cloud platform control components
  as_controller: true
  # Act as a private cloud host compute node
  as_host: true
  # Enable keepalived
  high_availability: true
  hosts:
    - user: root
      hostname: "192.168.0.239"
      host_networks: "eth0/br0/192.168.0.239"
    - user: root
      hostname: "192.168.0.243"
      host_networks: "eth0/br0/192.168.0.243"

After writing the configuration, use ocboot to join the new node.

$ ./ocboot.sh run.py virt config-new-k8s-ha.yaml

After ocboot's ./run.py finishes, view the node information again; the new node lzx-test-add-node (192.168.0.239) has joined:

$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
lzx-ha-env-2 Ready control-plane,etcd,master 8d v1.28.5+k3s1 192.168.0.244 <none> CentOS Linux 7 (Core) 5.4.130-1.yn20230805.el7.x86_64 containerd://1.7.11-k3s2
lzx-ha-env-3 Ready control-plane,etcd,master 8d v1.28.5+k3s1 192.168.0.243 <none> CentOS Linux 7 (Core) 5.4.130-1.yn20230805.el7.x86_64 containerd://1.7.11-k3s2
lzx-test-add-node Ready control-plane,etcd,master 69m v1.28.5+k3s1 192.168.0.239 <none> openEuler 22.03 (LTS-SP3) 5.10.0-182.0.0.95.oe2203sp3.x86_64 containerd://1.7.11-k3s2

5. Restore Stateful Pods​

After the new master node joins, you will find the original stateful pods still Pending. Next we need to delete their old pvcs so they can start on the new master node.

# View pods in Pending status
[root@lzx-ha-env-2 ~]$ kubectl get pods -A | grep Pending
onecloud-minio minio-1 0/1 Pending 0 39m
onecloud-monitoring monitor-minio-1 0/1 Pending 0 35m

# First cordon the current primary_master_node and the remaining old master_node, so the stateful pods we recreate are scheduled on the new master node
# This also keeps the minio replicas spread across different master nodes
[root@lzx-ha-env-2 ~]$ kubectl cordon lzx-ha-env-2 lzx-ha-env-3
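
# After cordoning, both nodes report SchedulingDisabled
$ kubectl get nodes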

First restore the minio statefulset in onecloud-minio, because it stores the images glance depends on. The previous commands showed that minio-1 in the onecloud-minio namespace is Pending. Delete the pvc it depends on, then delete the pod; the statefulset restarts it on the new master_node, as follows:

# Find the corresponding pvc
$ kubectl get pvc -n onecloud-minio | egrep 'minio-1'
export-minio-1 Bound pvc-aaded012-d6ca-42b9-bc3a-c03da7c66ec8 1Ti RWO local-path 8d

# Delete pvc
$ kubectl delete pvc -n onecloud-minio export-minio-1

# Delete pod
$ kubectl delete pods -n onecloud-minio minio-1

# View the newly created minio-1; it has started on the new node
$ kubectl get pods -n onecloud-minio -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
minio-0 1/1 Running 0 6d19h 10.40.47.23 lzx-ha-env-2 <none> <none>
minio-1 0/1 ContainerCreating 0 34s <none> lzx-test-add-node <none> <none>
minio-2 1/1 Running 0 43m 10.40.147.13 lzx-ha-env-3 <none> <none>
minio-3 1/1 Running 0 43m 10.40.147.23 lzx-ha-env-3 <none> <none>

# View minio-1 logs; it has self-healed
$ kubectl logs -n onecloud-minio minio-1
....
Healing disk '/export' on 1st pool
Healing disk '/export' on 1st pool complete
Summary:
{
"ID": "ab42567e-de12-4b4c-b8d1-9c0d85fa2196",
"PoolIndex": 0,
"SetIndex": 0,
"DiskIndex": 1,
"Path": "/export",
"Endpoint": "http://minio-1.minio-svc.onecloud-minio.svc.cluster.local:9000/export",
"Started": "2024-12-27T06:45:21.563513912Z",
"LastUpdate": "2024-12-27T06:45:49.066891239Z",
"ObjectsHealed": 10,
"ObjectsFailed": 0,
"BytesDone": 612353850,
"BytesFailed": 0,
"QueuedBuckets": [],
"HealedBuckets": [
".minio.sys/config",
".minio.sys/buckets",
"onecloud-images"
]
}
...

# You can also log in to lzx-test-add-node to check that the corresponding image parts exist in the minio onecloud-images bucket on the local-path csi volume
$ ssh root@192.168.0.239

# Enter the corresponding pvc directory, directory name can be obtained using kubectl get pvc -n onecloud-minio | grep minio-1
[root@lzx-test-add-node ~]$ cd /opt/k3s/storage/pvc-42778b3d-aa41-4d9e-a6b6-443ca0b09b3a_onecloud-minio_export-minio-1/
[root@lzx-test-add-node pvc-42778b3d-aa41-4d9e-a6b6-443ca0b09b3a_onecloud-minio_export-minio-1]$ du -smh onecloud-images/
293M onecloud-images/

Then restore monitor-minio using the same method as for onecloud-minio; reference commands:

$ kubectl  delete pvc -n onecloud-monitoring export-monitor-minio-1
persistentvolumeclaim "export-monitor-minio-1" deleted

$ kubectl delete pods -n onecloud-monitoring monitor-minio-1
pod "monitor-minio-1" deleted

$ kubectl get pods -n onecloud-monitoring | grep minio
monitor-minio-0 1/1 Running 0 5h17m
monitor-minio-1 1/1 Running 0 24s
monitor-minio-2 1/1 Running 0 5h17m
monitor-minio-3 1/1 Running 0 5h17m

Restore the victoria-metrics deployment. victoria-metrics differs from minio: minio is managed by a statefulset, so after you delete its pod and pvc, k8s automatically recreates the correspondingly numbered pod and pvc, whereas a deployment will not. So the steps to restore victoria-metrics are: delete the pvc, then also delete the default-victoria-metrics deployment; our onecloud-operator component will recreate the corresponding resources. Steps as follows:

# Restore victoria-metrics
$ kubectl delete pvc -n onecloud default-victoria-metrics
$ kubectl delete deployment -n onecloud default-victoria-metrics
$ kubectl get pods -n onecloud -o wide | grep victoria-metrics
default-victoria-metrics-774fc8499b-4rjzx 0/1 ContainerCreating 0 7s <none> lzx-test-add-node <none> <none>

Now all components from the old primary_master_node are restored. Finally, re-enable scheduling on the cordoned nodes:

$ kubectl uncordon lzx-ha-env-2 lzx-ha-env-3
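
As a final check, confirm no pod is left stuck in a non-Running state and review the pvc bindings (default-baremetal-agent and default-glance stay Pending by design):

$ kubectl get pods -A | egrep -v 'Running|Completed'
$ kubectl get pvc -A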