
K8s Control Node Replacement Process

Even in a highly available 3-node Cloudpods-over-K8s deployment, any one node can go down in production. Common hardware failures, such as replacing memory or a CPU, can be handled by temporarily shutting the node down and restarting it once the repair is done.

But for failures such as a dead hard disk, where the data cannot be recovered, you need to delete the node and join a new one in its place. The following describes the steps and the precautions to take.

Test Environment

  • k8s_vip: 10.127.100.102
    • keepalived runs as static pods on the 3 master nodes
    • Started directly by kubelet, manifest path: /etc/kubernetes/manifests/keepalived.yaml
  • primary_master_node, the first control node initialized: ip: 10.127.100.234
  • master_node_1, the second control node to join: ip: 10.127.100.229
  • master_node_2, the third control node to join: ip: 10.127.100.226
  • Database: deployed outside the cluster, not on the 3 nodes
  • CSI: Uses local-path
    • The local-path CSI strongly binds a pod to the node that holds its volume, so special attention is needed here
    • If the downed node has local-path pvcs bound to it, the pods using those pvcs cannot drift to other Ready nodes; these are the stateful pods. You can list all local-path pvcs with kubectl get pvc -A | grep local-path (a quick way to check which node a given volume is pinned to is sketched after the listings below)
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
lzx-ocboot-ha-test Ready master 100m v1.15.12 10.127.100.234 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
lzx-ocboot-ha-test-2 Ready master 61m v1.15.12 10.127.100.229 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
lzx-ocboot-ha-test-3 Ready master 60m v1.15.12 10.127.100.226 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
  • minio:
    • minio is used as the glance backend storage in the high availability deployment
    • Deployed as a statefulset
    • Uses the local-path CSI as its backend storage
$ kubectl get pods -n onecloud-minio -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
minio-0 1/1 Running 0 46m 10.40.99.205 lzx-ocboot-ha-test <none> <none>
minio-1 1/1 Running 0 46m 10.40.158.215 lzx-ocboot-ha-test-3 <none> <none>
minio-2 1/1 Running 0 46m 10.40.159.22 lzx-ocboot-ha-test-2 <none> <none>
minio-3 1/1 Running 0 46m 10.40.99.206 lzx-ocboot-ha-test <none> <none>
$ kubectl get pvc -n onecloud-minio
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
export-minio-0 Bound pvc-297ed5e5-66c8-4855-8031-c65a0ccfa4d0 1Ti RWO local-path 46m
export-minio-1 Bound pvc-4e8fe486-5b23-44a0-876c-df36d134957f 1Ti RWO local-path 46m
export-minio-2 Bound pvc-389b3c61-6000-4757-9949-db53e4e53776 1Ti RWO local-path 46m
export-minio-3 Bound pvc-3dd54509-7745-47dd-84ea-fbacfe1e2f5b 1Ti RWO local-path 46m
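
Since local-path volumes are pinned to a single node, it can be useful to confirm which node a given volume belongs to before doing anything destructive. A minimal sketch, assuming the local-path provisioner records the node in the PV's nodeAffinity (the PV name comes from the VOLUME column of the pvc listing above; substitute your own):

# Print the node that export-minio-0's volume is pinned to
$ kubectl get pv pvc-297ed5e5-66c8-4855-8031-c65a0ccfa4d0 \
    -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}'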

Test

Goal

Take the primary_master_node (10.127.100.234) offline and join a new node to replace it.

Steps

1. Confirm Which Stateful Pods and PVCs Are Running on This Node

# Find the node name in the k8s cluster based on IP
$ kubectl get nodes -o wide | grep 10.127.100.234
lzx-ocboot-ha-test Ready master 4h15m v1.15.12 10.127.100.234 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5

# View all local-path pvcs
$ kubectl get pvc -A | grep local-path
# export-minio-x in the onecloud-minio namespace stores glance images and is a critical component
onecloud-minio export-minio-0 Bound pvc-297ed5e5-66c8-4855-8031-c65a0ccfa4d0 1Ti RWO local-path 3h11m
onecloud-minio export-minio-1 Bound pvc-4e8fe486-5b23-44a0-876c-df36d134957f 1Ti RWO local-path 3h11m
onecloud-minio export-minio-2 Bound pvc-389b3c61-6000-4757-9949-db53e4e53776 1Ti RWO local-path 3h11m
onecloud-minio export-minio-3 Bound pvc-3dd54509-7745-47dd-84ea-fbacfe1e2f5b 1Ti RWO local-path 3h11m
# export-monitor-minio-x in the onecloud-monitoring namespace stores service logs and is not a critical component
onecloud-monitoring export-monitor-minio-0 Bound pvc-b885605f-b5ca-40ff-b968-4d95b03e8bb8 1Ti RWO local-path 3h8m
onecloud-monitoring export-monitor-minio-1 Bound pvc-520a8262-5dad-48aa-9a0e-0e25f850faad 1Ti RWO local-path 3h8m
onecloud-monitoring export-monitor-minio-2 Bound pvc-6de1ff0f-3465-4a51-8124-880f1b3c6d7a 1Ti RWO local-path 3h8m
onecloud-monitoring export-monitor-minio-3 Bound pvc-364652ca-496e-4b29-82ea-ec7e768aa8f5 1Ti RWO local-path 3h8m
# The pvcs under the onecloud namespace are all system service dependencies
# default-baremetal-agent stores bare metal management data; if baremetal-agent is not enabled it can be ignored, and by default it stays Pending (waiting to be bound)
onecloud default-baremetal-agent Pending local-path 3h35m
# default-esxi-agent stores local data for the esxi-agent service
onecloud default-esxi-agent Bound pvc-b32adfcc-96e7-45e4-b8bd-b5318c954dca 30G RWO local-path 3h35m
# default-glance stores glance service images; in a high availability deployment the default-glance deployment does not mount this pvc and instead stores images in the minio s3 storage in onecloud-minio, so it can be ignored
onecloud default-glance Bound pvc-e6ee398e-2d84-46cf-9401-e94f438d87cd 100G RWO local-path 3h36m
# default-influxdb stores the platform's monitoring data; this data can tolerate loss, so if its node goes down the pvc can simply be deleted and recreated
onecloud default-influxdb Bound pvc-871b9441-c56f-4bb4-8b56-868e1df1a438 20G RWO local-path 3h35m


# View which pods are on this node
$ kubectl get pods -A -o wide | grep onecloud | grep 'lzx-ocboot-ha-test '
onecloud-minio minio-0 1/1 Running 0 3h10m 10.40.99.205 lzx-ocboot-ha-test <none> <none>
onecloud-minio minio-3 1/1 Running 0 3h10m 10.40.99.206 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-kube-state-metrics-6c97499758-w69tz 1/1 Running 0 3h6m 10.40.99.214 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-loki-0 1/1 Running 0 3h6m 10.40.99.213 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-minio-0 1/1 Running 0 3h7m 10.40.99.211 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-minio-3 1/1 Running 0 3h7m 10.40.99.212 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-monitor-stack-operator-54d8c46577-qknws 1/1 Running 0 3h6m 10.40.99.216 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-promtail-4mx2s 1/1 Running 0 3h6m 10.40.99.215 lzx-ocboot-ha-test <none> <none>
onecloud default-etcd-7brtldv78z 1/1 Running 0 3h10m 10.40.99.207 lzx-ocboot-ha-test <none> <none>
onecloud default-glance-6fd697b7b9-nbk9t 1/1 Running 0 3h7m 10.40.99.208 lzx-ocboot-ha-test <none> <none>
onecloud default-host-5rmg8 3/3 Running 7 3h34m 10.127.100.234 lzx-ocboot-ha-test <none> <none>
onecloud default-host-deployer-sf494 1/1 Running 7 3h34m 10.40.99.202 lzx-ocboot-ha-test <none> <none>
onecloud default-host-image-s6pwq 1/1 Running 2 3h34m 10.127.100.234 lzx-ocboot-ha-test <none> <none>
onecloud default-region-dns-2hcpv 1/1 Running 1 3h34m 10.127.100.234 lzx-ocboot-ha-test <none> <none>
onecloud default-telegraf-5jn4x 2/2 Running 0 3h34m 10.127.100.234 lzx-ocboot-ha-test <none> <none>
onecloud default-influxdb-6bqgq 1/1 Running 0 3h34m 10.127.99.218 lzx-ocboot-ha-test <none> <none>

From the output of the commands above, we can identify the stateful pods sitting on the primary_master_node: onecloud-minio/minio-0, onecloud-minio/minio-3, onecloud/default-influxdb, onecloud-monitoring/monitor-minio-0 and onecloud-monitoring/monitor-minio-3.
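
If you prefer not to grep the wide pod listing, a field selector on spec.nodeName is an equivalent way (a small sketch, not part of the original steps) to list every pod scheduled on the failed node across all namespaces:

# List all pods on the node, regardless of namespace
$ kubectl get pods -A -o wide --field-selector spec.nodeName=lzx-ocboot-ha-test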

2. Shut Down primary_master_node and Remove It from the Cluster

# Log in to one of the other two master_node nodes, for example 10.127.100.229
$ ssh root@10.127.100.229

# Set KUBECONFIG configuration
[root@lzx-ocboot-ha-test-2 ~]$ export KUBECONFIG=/etc/kubernetes/admin.conf

# View the node status: primary_master_node, which has been shut down, now shows NotReady
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lzx-ocboot-ha-test NotReady master 4h37m v1.15.12
lzx-ocboot-ha-test-2 Ready master 3h58m v1.15.12
lzx-ocboot-ha-test-3 Ready master 3h57m v1.15.12

# Drain the primary_master_node node: lzx-ocboot-ha-test
[root@lzx-ocboot-ha-test-2 ~]$ kubectl drain --delete-local-data --ignore-daemonsets lzx-ocboot-ha-test
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-fdzql, kube-system/kube-proxy-nfxvd, kube-system/traefik-ingress-controller-jms9v, onecloud-monitoring/monitor-promtail-4mx2s, onecloud/default-host-5rmg8, onecloud/default-host-deployer-sf494, onecloud/default-host-image-s6pwq, onecloud/default-region-dns-2hcpv, onecloud/default-telegraf-5jn4x
evicting pod "minio-0"
evicting pod "monitor-minio-3"
evicting pod "default-etcd-7brtldv78z"
evicting pod "monitor-kube-state-metrics-6c97499758-w69tz"
evicting pod "default-influxdb-85945647d5-6bqgq"
evicting pod "default-glance-6fd697b7b9-nbk9t"
evicting pod "minio-3"
evicting pod "monitor-monitor-stack-operator-54d8c46577-qknws"
evicting pod "monitor-loki-0"
evicting pod "monitor-minio-0"
# This command hangs because primary_master_node is already shut down, so its pods cannot be evicted. Press 'Ctrl-c' to cancel it (a variant that times out on its own is sketched below)
^C
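
If you would rather not interrupt drain by hand, a possible variant (assuming the same kubectl v1.15 drain flags used above) is to give it a timeout, so it stops retrying the evictions on its own and exits with an error, after which you continue with kubectl delete node just as below:

# Same drain, but give up after 60 seconds instead of hanging
$ kubectl drain --delete-local-data --ignore-daemonsets --timeout=60s lzx-ocboot-ha-test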

# Use kubectl delete node to directly delete primary_master_node node
$ kubectl delete node lzx-ocboot-ha-test

# Then list the pods stuck in Pending: they are exactly the stateful pods that were on primary_master_node before
# They stay Pending because their local-path pvcs are strongly bound to the deleted node and still exist in the cluster
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -A | grep Pending
onecloud-minio minio-0 0/1 Pending 0 61s
onecloud-minio minio-3 0/1 Pending 0 61s
onecloud-monitoring monitor-minio-0 0/1 Pending 0 61s
onecloud-monitoring monitor-minio-3 0/1 Pending 0 61s
onecloud default-influxdb-85945647d5-x5sv5 0/1 Pending 0 10m

3. Remove the Old primary_master_node's etcd Member

# View etcd pods under kube-system
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -n kube-system | grep etcd
etcd-lzx-ocboot-ha-test-2 1/1 Running 1 4h52m
etcd-lzx-ocboot-ha-test-3 1/1 Running 1 4h51m

# Enter etcd-lzx-ocboot-ha-test-2 etcd pod
[root@lzx-ocboot-ha-test-2 ~]# kubectl exec -ti -n kube-system etcd-lzx-ocboot-ha-test-2 sh

# Use etcdctl to view member list
$ etcdctl --endpoints https://127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
# You will find the old primary_master_node member lzx-ocboot-ha-test is still in the etcd cluster
14da7b338b44eee0, started, lzx-ocboot-ha-test, https://10.127.100.234:2380, https://10.127.100.234:2379, false
454ae6f931376261, started, lzx-ocboot-ha-test-2, https://10.127.100.229:2380, https://10.127.100.229:2379, false
5afd19948b9009f6, started, lzx-ocboot-ha-test-3, https://10.127.100.226:2380, https://10.127.100.226:2379, false

# Delete lzx-ocboot-ha-test member
$ etcdctl --endpoints https://127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 14da7b338b44eee0
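
To double-check that the removal took effect, you can run the same member list command again from inside the etcd pod; only the two surviving members should remain:

# Verify: only lzx-ocboot-ha-test-2 and lzx-ocboot-ha-test-3 should be listed now
$ etcdctl --endpoints https://127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list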

4. Replace the Old primary_master_node and Join a New master_node

The old primary_master_node has now been deleted, and you will find that keepalived's vip has drifted to master_node_1:

# Check that the vip 10.127.100.102 is now on this node
# If the node runs the cloud platform host service, the vip is bound to br0
[root@lzx-ocboot-ha-test-2 ~]$ ip addr show br0
32: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 00:22:96:6f:6e:f1 brd ff:ff:ff:ff:ff:ff
inet 10.127.100.229/24 brd 10.127.100.255 scope global br0
valid_lft forever preferred_lft forever
inet 10.127.100.102/32 scope global br0
valid_lft forever preferred_lft forever
inet6 fe80::222:96ff:fe6f:6ef1/64 scope link
valid_lft forever preferred_lft forever
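
The check above uses br0 because this node runs the host service. On a master node without the host service there is presumably no br0, and the vip would sit on the interface configured in KEEPALIVED_INTERFACE (eth0 in this environment), so the equivalent check would be:

# Hypothetical check for a node without br0: look for the vip on eth0
$ ip addr show eth0 | grep 10.127.100.102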

# View /etc/kubernetes/manifests/keepalived.yaml env configuration
[root@lzx-ocboot-ha-test-2 ~]$ cat /etc/kubernetes/manifests/keepalived.yaml | grep -A 15 env
    env:
    # keepalived priority: only primary_master_node's keepalived is set to 100, the other master_node nodes use 90
    - name: KEEPALIVED_PRIORITY
      value: "90"
    # Set the VIP
    - name: KEEPALIVED_VIRTUAL_IPS
      value: '#PYTHON2BASH:[''10.127.100.102'']'
    # Role is BACKUP
    - name: KEEPALIVED_STATE
      value: BACKUP
    # Password
    - name: KEEPALIVED_PASSWORD
      value: de17f785
    # router id
    - name: KEEPALIVED_ROUTER_ID
      value: "12"
    # The actual ip of this node's network interface
    - name: KEEPALIVED_NODE_IP
      value: 10.127.100.229
    # Network interface keepalived binds to
    - name: KEEPALIVED_INTERFACE
      value: eth0
    image: registry.cn-beijing.aliyuncs.com/yunionio/keepalived:v2.0.25

# The cluster currently has only 2 nodes
[root@lzx-ocboot-ha-test-2 ocboot]$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
lzx-ocboot-ha-test-2 Ready master 4h29m v1.15.12 10.127.100.229 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
lzx-ocboot-ha-test-3 Ready master 4h28m v1.15.12 10.127.100.226 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5

Now use ocboot to join a new node. Edit the ocboot yaml configuration:

  • Treat current lzx-ocboot-ha-test-2 node as primary_master_node
  • Add the new master_node node's information:
    • IP: 10.127.100.224
    • Name: lzx-ocboot-ha-test-4

Because the old primary_master_node has been deleted and the current master_node_1 now acts as the new cluster's primary_master_node, the configuration becomes:

$ cat config-new-k8s-ha.yaml
primary_master_node:
  # Here the previous master_node_1 10.127.100.229 is treated as the primary_master_node
  hostname: 10.127.100.229
  use_local: false
  user: root
  onecloud_version: "v3.8.8"
  # Database connection information, fill in according to your environment
  db_host: 10.127.100.101
  db_user: "root"
  db_password: "0neC1oudDB#"
  db_port: "3306"
  image_repository: registry.cn-beijing.aliyuncs.com/yunionio
  ha_using_local_registry: false
  node_ip: "10.127.100.229"
  # vip exposed by keepalived
  controlplane_host: 10.127.100.102
  controlplane_port: "6443"
  as_host: true
  # Enable ha; keepalived is deployed by default
  high_availability: true
  use_ee: false
  # The high availability deployment uses minio
  enable_minio: true
  host_networks: "eth0/br0/10.127.100.229"

master_nodes:
  # Join the k8s cluster through the 10.127.100.102 vip
  controlplane_host: 10.127.100.102
  controlplane_port: "6443"
  # Run the cloud platform control plane components
  as_controller: true
  # Also act as a private cloud compute (host) node
  as_host: true
  # Enable keepalived
  high_availability: true
  hosts:
  - user: root
    hostname: "10.127.100.224"
    host_networks: "eth0/br0/10.127.100.224"

After writing the configuration, use ocboot to join the new node.

Download the ocboot deployment tool code.

# Clone the ocboot deployment tool locally
$ git clone -b release/3.11 https://github.com/yunionio/ocboot && cd ./ocboot

Then join the node.

$ ./run.py config-new-k8s-ha.yaml

After ocboot's ./run.py finishes, view the node information again: the new node lzx-ocboot-ha-test-4 (10.127.100.224) has joined:

[root@lzx-ocboot-ha-test-2 ~]$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
lzx-ocboot-ha-test-2 Ready master 5h13m v1.15.12 10.127.100.229 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
lzx-ocboot-ha-test-3 Ready master 5h12m v1.15.12 10.127.100.226 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
lzx-ocboot-ha-test-4 Ready master 10m v1.15.12 10.127.100.224 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5

Then modify the keepalived priority and role on the new primary_master_node:

[root@lzx-ocboot-ha-test-2 ~]$ vim /etc/kubernetes/manifests/keepalived.yaml
...
    # Change the priority to 100
    - name: KEEPALIVED_PRIORITY
      value: "100"
...
    # Change the role to MASTER
    - name: KEEPALIVED_STATE
      value: MASTER
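
kubelet watches /etc/kubernetes/manifests, so saving the edited manifest is enough for it to recreate the keepalived static pod. A quick sanity check, assuming the mirror pod's name contains "keepalived" and that the vip stays on this node (which already held it):

# The keepalived static pod on this node should show a fresh AGE after the edit
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -A -o wide | grep keepalived
# The vip should still be bound here
[root@lzx-ocboot-ha-test-2 ~]$ ip addr show br0 | grep 10.127.100.102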

5. Restore Stateful Pods

After the new master node joins, you will find the original stateful pods are still Pending. Next, delete their old pvcs so the pods can be recreated on the new master node.

# View pods in Pending status
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -A | grep Pending
onecloud-minio minio-0 0/1 Pending 0 74m
onecloud-minio minio-3 0/1 Pending 0 74m
onecloud-monitoring monitor-minio-0 0/1 Pending 0 74m
onecloud-monitoring monitor-minio-3 0/1 Pending 0 74m
onecloud default-influxdb-85945647d5-x5sv5 0/1 Pending 0 84m

# First cordon the current primary_master_node and the remaining old master_node, to make sure the stateful pods recreated next land on the new master node
# This also keeps the minio replicas spread across different master nodes
[root@lzx-ocboot-ha-test-2 ~]$ kubectl cordon lzx-ocboot-ha-test-2 lzx-ocboot-ha-test-3
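
A quick check that the cordon took effect: both old master nodes should now report SchedulingDisabled, leaving lzx-ocboot-ha-test-4 as the only schedulable master:

# Both cordoned nodes should show 'Ready,SchedulingDisabled' in the STATUS column
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get nodes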

First restore the minio statefulset in onecloud-minio, because it stores the images that glance depends on. The commands above showed that minio-0 and minio-3 in the onecloud-minio namespace are Pending. Delete the pvcs they depend on, then delete the pods; they will be recreated on the new master_node, as follows:

# Find the corresponding pvcs
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pvc -n onecloud-minio | egrep 'minio-0|minio-3'
export-minio-0 Bound pvc-297ed5e5-66c8-4855-8031-c65a0ccfa4d0 1Ti RWO local-path 4h55m
export-minio-3 Bound pvc-3dd54509-7745-47dd-84ea-fbacfe1e2f5b 1Ti RWO local-path 4h55m

# Delete pvcs
[root@lzx-ocboot-ha-test-2 ~]$ kubectl delete pvc -n onecloud-minio export-minio-0 export-minio-3

# Delete pods
[root@lzx-ocboot-ha-test-2 ~]$ kubectl delete pods -n onecloud-minio minio-0 minio-3

# The recreated minio-0 and minio-3 are now running on the new node lzx-ocboot-ha-test-4
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -n onecloud-minio -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
minio-0 1/1 Running 0 7s 10.40.103.200 lzx-ocboot-ha-test-4 <none> <none>
minio-1 1/1 Running 0 5h1m 10.40.158.215 lzx-ocboot-ha-test-3 <none> <none>
minio-2 1/1 Running 0 5h1m 10.40.159.22 lzx-ocboot-ha-test-2 <none> <none>
minio-3 1/1 Running 0 14s 10.40.103.199 lzx-ocboot-ha-test-4 <none> <none>

# View minio-3's logs: it has healed its data from the other replicas
[root@lzx-ocboot-ha-test-2 ~]$ kubectl logs -n onecloud-minio minio-3
....
Healing disk '/export' on 1st pool
Healing disk '/export' on 1st pool complete
Summary:
{
"ID": "0e5c1947-44f0-4f8a-b7f0-e3a55f441d6f",
"PoolIndex": 0,
"SetIndex": 0,
"DiskIndex": 3,
"Path": "/export",
"Endpoint": "http://minio-3.minio-svc.onecloud-minio.svc.cluster.local:9000/export",
"Started": "2022-04-13T06:49:29.882069559Z",
"LastUpdate": "2022-04-13T06:49:59.564158167Z",
"ObjectsHealed": 10,
"ObjectsFailed": 0,
"BytesDone": 1756978429,
"BytesFailed": 0,
"QueuedBuckets": [],
"HealedBuckets": [
".minio.sys/config",
".minio.sys/buckets",
"onecloud-images"
]
}
...

# You can also log in to lzx-ocboot-ha-test-4 and check that the corresponding image data exists in minio's onecloud-images bucket on the local-path csi volume
$ ssh root@10.127.100.224

# Enter the corresponding pvc directory; the directory name can be obtained with kubectl get pvc -n onecloud-minio | grep minio-3
[root@lzx-ocboot-ha-test-4 ~]$ cd /opt/local-path-provisioner/pvc-352277cb-e69d-41bf-b58a-d65cb1e4e6f8/
[root@lzx-ocboot-ha-test-4 pvc-352277cb-e69d-41bf-b58a-d65cb1e4e6f8]$ du -smh onecloud-images/
838M onecloud-images/

Then restore monitor-minio in onecloud-monitoring with the same method used for onecloud-minio; reference commands as follows:

[root@lzx-ocboot-ha-test-2 ~]$ kubectl delete pvc -n onecloud-monitoring export-monitor-minio-0 export-monitor-minio-3
persistentvolumeclaim "export-monitor-minio-0" deleted
persistentvolumeclaim "export-monitor-minio-3" deleted

[root@lzx-ocboot-ha-test-2 ~]$ kubectl delete pods -n onecloud-monitoring monitor-minio-0 monitor-minio-3
pod "monitor-minio-0" deleted
pod "monitor-minio-3" deleted

[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -n onecloud-monitoring | grep minio
monitor-minio-0 1/1 Running 0 24s
monitor-minio-1 1/1 Running 0 5h17m
monitor-minio-2 1/1 Running 0 5h17m
monitor-minio-3 1/1 Running 0 24s

Restore the influxdb deployment. influxdb differs from minio: minio is managed by a statefulset, so after its pod and pvc are deleted, k8s automatically recreates a pod and pvc with the same ordinal, but a deployment will not do that. So the steps to restore influxdb are: delete the pvc and, at the same time, delete the default-influxdb deployment; the onecloud-operator component will then recreate the corresponding resources. Steps as follows:

# Restore influxdb
[root@lzx-ocboot-ha-test-2 ~]# kubectl delete pvc -n onecloud default-influxdb
[root@lzx-ocboot-ha-test-2 ~]# kubectl delete deployment -n onecloud default-influxdb
[root@lzx-ocboot-ha-test-2 ~]# kubectl get pods -n onecloud | grep influxdb
default-influxdb-85945647d5-mdd2z 1/1 Running 0 7m44s
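
Optionally confirm that the recreated influxdb pod and its pvc landed on the new node (the pod name suffix will differ in your environment):

# The NODE column should show lzx-ocboot-ha-test-4, and the pvc should be Bound again
[root@lzx-ocboot-ha-test-2 ~]# kubectl get pods -n onecloud -o wide | grep influxdb
[root@lzx-ocboot-ha-test-2 ~]# kubectl get pvc -n onecloud default-influxdb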

All components that were on the old primary_master_node are now restored. Finally, re-enable scheduling on the cordoned nodes:

[root@lzx-ocboot-ha-test-2 ~]$ kubectl uncordon lzx-ocboot-ha-test-2 lzx-ocboot-ha-test-3
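
As a final sanity check, all three masters should be Ready and schedulable again, and no pods should be left in Pending:

[root@lzx-ocboot-ha-test-2 ~]$ kubectl get nodes
# This should print nothing once every stateful pod has been restored
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -A | grep Pending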