跳到主要内容

控制节点替换流程

背景

即使是高可用部署的 3 节点 cloudpods over k8s 集群,在生产环境可能会出现任意 1 个节点挂掉的情况。 一些普通的故障,比如更换内存,CPU 之类可以解决的问题,可以临时关机,恢复后再重启节点解决。

但如果发生硬盘之类的故障,比如数据无法恢复的情况,需要把该节点删除再重新加入新的节点,接下来描述该步骤以及注意事项。

测试环境

  • k8s_vip: 10.127.100.102
    • 使用 staticpods keepalived 运行在 3 个 master 节点上
    • 由 kubelet 直接启动,路径:/etc/kubernetes/manifests/keepalived.yaml
  • primary_master_node 第1个初始化的控制节点: ip: 10.127.100.234
  • master_node_1 第2个加入控制节点: ip: 10.127.100.229
  • master_node_2 第3个加入的控制节点: ip: 10.127.100.226
  • 数据库: 数据库部署在集群之外,不在 3 个节点之上
  • CSI: 使用 local-path
    • local-path 的 CSI 会强绑定 pod 到指定的 node,这里需要特别注意
    • 如果挂掉的节点有对应 local-path 的 pvc 绑定在上面,使用该 pvc 的 pod 就没办法漂移到其它 Ready 的 node ,这种 pod 叫有状态的 pod,可以通过命令 kubectl get pvc -A | grep local-path 就可以看到所有的 local-path pvc
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
lzx-ocboot-ha-test Ready master 100m v1.15.12 10.127.100.234 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
lzx-ocboot-ha-test-2 Ready master 61m v1.15.12 10.127.100.229 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
lzx-ocboot-ha-test-3 Ready master 60m v1.15.12 10.127.100.226 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
  • minio:
    • 高可用部署下使用 minio 作为 glance 的后端存储
    • statefulset
    • 使用 local-path CSI 作为后端存储
$ kubectl get pods -n onecloud-minio -o wide
kubectl get pods -n onecloud-minio -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
minio-0 1/1 Running 0 46m 10.40.99.205 lzx-ocboot-ha-test <none> <none>
minio-1 1/1 Running 0 46m 10.40.158.215 lzx-ocboot-ha-test-3 <none> <none>
minio-2 1/1 Running 0 46m 10.40.159.22 lzx-ocboot-ha-test-2 <none> <none>
minio-3 1/1 Running 0 46m 10.40.99.206 lzx-ocboot-ha-test <none> <none>
$ kubectl get pvc -n onecloud-minio
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
export-minio-0 Bound pvc-297ed5e5-66c8-4855-8031-c65a0ccfa4d0 1Ti RWO local-path 46m
export-minio-1 Bound pvc-4e8fe486-5b23-44a0-876c-df36d134957f 1Ti RWO local-path 46m
export-minio-2 Bound pvc-389b3c61-6000-4757-9949-db53e4e53776 1Ti RWO local-path 46m
export-minio-3 Bound pvc-3dd54509-7745-47dd-84ea-fbacfe1e2f5b 1Ti RWO local-path 46m

测试

目标

下线 primary_master_node 10.127.100.234 节点,加入新的节点替换该节点

步骤

1. 需要确认有哪些有状态的 pod 和 pvc 运行在该节点

# 根据 IP 找到节点在 k8s 集群中的名称
$ kubectl get nodes -o wide | grep 10.127.100.234
lzx-ocboot-ha-test Ready master 4h15m v1.15.12 10.127.100.234 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5

# 查看所有为 local-path 的 pvc
$ kubectl get pvc -A | grep local-path
# onecloud-minio namespace 的 export-minio-x 负责存储 glance 的镜像,是关键组件
onecloud-minio export-minio-0 Bound pvc-297ed5e5-66c8-4855-8031-c65a0ccfa4d0 1Ti RWO local-path 3h11m
onecloud-minio export-minio-1 Bound pvc-4e8fe486-5b23-44a0-876c-df36d134957f 1Ti RWO local-path 3h11m
onecloud-minio export-minio-2 Bound pvc-389b3c61-6000-4757-9949-db53e4e53776 1Ti RWO local-path 3h11m
onecloud-minio export-minio-3 Bound pvc-3dd54509-7745-47dd-84ea-fbacfe1e2f5b 1Ti RWO local-path 3h11m
# onecloud-monitoring namespace 的 export-monitor-minio-x 负责存储服务的日志,不是关键组件
onecloud-monitoring export-monitor-minio-0 Bound pvc-b885605f-b5ca-40ff-b968-4d95b03e8bb8 1Ti RWO local-path 3h8m
onecloud-monitoring export-monitor-minio-1 Bound pvc-520a8262-5dad-48aa-9a0e-0e25f850faad 1Ti RWO local-path 3h8m
onecloud-monitoring export-monitor-minio-2 Bound pvc-6de1ff0f-3465-4a51-8124-880f1b3c6d7a 1Ti RWO local-path 3h8m
onecloud-monitoring export-monitor-minio-3 Bound pvc-364652ca-496e-4b29-82ea-ec7e768aa8f5 1Ti RWO local-path 3h8m
# onecloud namespace 下面的这些 pvc 都是系统服务依赖的
# default-baremetal-agent 是存储物理机管理的存储,不开启 baremetal-agent 可以不用管,默认也是 Pending 待绑定状态
onecloud default-baremetal-agent Pending local-path 3h35m
# default-esxi-agent 负责存 esxi-agent 服务相关的本地数据
onecloud default-esxi-agent Bound pvc-b32adfcc-96e7-45e4-b8bd-b5318c954dca 30G RWO local-path 3h35m
# default-glance 负责存 glance 服务的镜像,在高可用部署环境下,deployment default-glance 不会挂载这个 pvc ,会使用 onecloud-minio 里面的 minio s3 存储镜像,可以不用管
onecloud default-glance Bound pvc-e6ee398e-2d84-46cf-9401-e94f438d87cd 100G RWO local-path 3h36m
# default-influxdb 负责存平台的监控数据,监控数据可以容忍丢失,如果所在节点挂了,可以删掉重建
onecloud default-influxdb Bound pvc-871b9441-c56f-4bb4-8b56-868e1df1a438 20G RWO local-path 3h35m


# 查看该节点上有哪些 pod
$ kubectl get pods -A -o wide | grep onecloud | grep 'lzx-ocboot-ha-test '
onecloud-minio minio-0 1/1 Running 0 3h10m 10.40.99.205 lzx-ocboot-ha-test <none> <none>
onecloud-minio minio-3 1/1 Running 0 3h10m 10.40.99.206 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-kube-state-metrics-6c97499758-w69tz 1/1 Running 0 3h6m 10.40.99.214 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-loki-0 1/1 Running 0 3h6m 10.40.99.213 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-minio-0 1/1 Running 0 3h7m 10.40.99.211 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-minio-3 1/1 Running 0 3h7m 10.40.99.212 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-monitor-stack-operator-54d8c46577-qknws 1/1 Running 0 3h6m 10.40.99.216 lzx-ocboot-ha-test <none> <none>
onecloud-monitoring monitor-promtail-4mx2s 1/1 Running 0 3h6m 10.40.99.215 lzx-ocboot-ha-test <none> <none>
onecloud default-etcd-7brtldv78z 1/1 Running 0 3h10m 10.40.99.207 lzx-ocboot-ha-test <none> <none>
onecloud default-glance-6fd697b7b9-nbk9t 1/1 Running 0 3h7m 10.40.99.208 lzx-ocboot-ha-test <none> <none>
onecloud default-host-5rmg8 3/3 Running 7 3h34m 10.127.100.234 lzx-ocboot-ha-test <none> <none>
onecloud default-host-deployer-sf494 1/1 Running 7 3h34m 10.40.99.202 lzx-ocboot-ha-test <none> <none>
onecloud default-host-image-s6pwq 1/1 Running 2 3h34m 10.127.100.234 lzx-ocboot-ha-test <none> <none>
onecloud default-region-dns-2hcpv 1/1 Running 1 3h34m 10.127.100.234 lzx-ocboot-ha-test <none> <none>
onecloud default-telegraf-5jn4x 2/2 Running 0 3h34m 10.127.100.234 lzx-ocboot-ha-test <none> <none>
onecloud default-influxdb-6bqgq 1/1 Running 0 3h34m 10.127.99.218 lzx-ocboot-ha-test <none> <none>

通过上面命令的结果,可以筛选出 onecloud-minio/minio-0 onecloud-minio/minio-3,onecloud/default-influxdb,onecloud-monitoring/monitor-minio-0,onecloud-minio/monitor-minio-3 这些有状态的 pod 在 primary_master_node 上。

2. 接下来将 primary_master_node 关机踢出集群

# 登录其它两个 master_node 节点,比如:10.127.100.229
$ ssh root@10.127.100.229

# 设置 KUBECONFIG 配置
[root@lzx-ocboot-ha-test-2 ~]$ export KUBECONFIG=/etc/kubernetes/admin.conf

# 查看节点状态,发现 primary_master_node 已经变成 NotReady
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
lzx-ocboot-ha-test NotReady master 4h37m v1.15.12
lzx-ocboot-ha-test-2 Ready master 3h58m v1.15.12
lzx-ocboot-ha-test-3 Ready master 3h57m v1.15.12

# 删除 primary_master_node 节点:lzx-ocboot-ha-test
[root@lzx-ocboot-ha-test-2 ~]$ kubectl drain --delete-local-data --ignore-daemonsets lzx-ocboot-ha-test
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-fdzql, kube-system/kube-proxy-nfxvd, kube-system/traefik-ingress-controller-jms9v, onecloud-monitoring/monitor-promtail-4mx2s, onecloud/default-host-5rmg8, onecloud/default-host-deployer-sf494, onecloud/default-host-image-s6pwq, onecloud/default-region-dns-2hcpv, onecloud/default-telegraf-5jn4x
evicting pod "minio-0"
evicting pod "monitor-minio-3"
evicting pod "default-etcd-7brtldv78z"
evicting pod "monitor-kube-state-metrics-6c97499758-w69tz"
evicting pod "default-influxdb-85945647d5-6bqgq"
evicting pod "default-glance-6fd697b7b9-nbk9t"
evicting pod "minio-3"
evicting pod "monitor-monitor-stack-operator-54d8c46577-qknws"
evicting pod "monitor-loki-0"
evicting pod "monitor-minio-0"
# 该命令会卡停住,因为 primary_master_node 已经关机了,无法删除 pod ,这个时候 'Ctrl-c' 取消命令
^C

# 使用 kubectl delete node 直接删除 primary_master_node 节点
$ kubectl delete node lzx-ocboot-ha-test

# 然后再查看处于 Pending 状态的 pod 全部都是之前 primary_master_node 上面有状态的 pod
# 因为这些 pod 使用了 local-path 的 pvc ,这些 pvc 是和节点强绑定的,还存在集群中
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -A | grep Pending
onecloud-minio minio-0 0/1 Pending 0 61s
onecloud-minio minio-3 0/1 Pending 0 61s
onecloud-monitoring monitor-minio-0 0/1 Pending 0 61s
onecloud-monitoring monitor-minio-3 0/1 Pending 0 61s
onecloud default-influxdb-85945647d5-x5sv5 0/1 Pending 0 10m

3. 删除旧的 primary_master_node 的 etcd endpoint

# 查看 kube-system 系统下的 etcd pod
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -n kube-system | grep etcd
etcd-lzx-ocboot-ha-test-2 1/1 Running 1 4h52m
etcd-lzx-ocboot-ha-test-3 1/1 Running 1 4h51m

# 进入 etcd-lzx-ocboot-ha-test-2 etcd pod
[root@lzx-ocboot-ha-test-2 ~]# kubectl exec -ti -n kube-system etcd-lzx-ocboot-ha-test-2 sh

# 使用 etcdctl 查看 member list
$ etcdctl --endpoints https://127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member list
# 会发现旧的 primary_master_node member lzx-ocboot-ha-test 还在 etcd 集群
14da7b338b44eee0, started, lzx-ocboot-ha-test, https://10.127.100.234:2380, https://10.127.100.234:2379, false
454ae6f931376261, started, lzx-ocboot-ha-test-2, https://10.127.100.229:2380, https://10.127.100.229:2379, false
5afd19948b9009f6, started, lzx-ocboot-ha-test-3, https://10.127.100.226:2380, https://10.127.100.226:2379, false

# 删除 lzx-ocboot-ha-test member
$ etcdctl --endpoints https://127.0.0.1:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key member remove 14da7b338b44eee0

4. 替换旧的 primary_master_node 节点,加入新的 master_node

旧的 primary_master_node 节点已经被删掉,会发现 keepalived 的 vip 也漂移到了 master_node_1 上:

# 查看 vip 为 10.127.100.102
# 如果该节点运行了云平台 host 服务,ip 会绑定到 br0 上
[root@lzx-ocboot-ha-test-2 ~]$ ip addr show br0
32: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 00:22:96:6f:6e:f1 brd ff:ff:ff:ff:ff:ff
inet 10.127.100.229/24 brd 10.127.100.255 scope global br0
valid_lft forever preferred_lft forever
inet 10.127.100.102/32 scope global br0
valid_lft forever preferred_lft forever
inet6 fe80::222:96ff:fe6f:6ef1/64 scope link
valid_lft forever preferred_lft forever

# 查看 /etc/kubernetes/manifests/keepalived.yaml env 配置
[root@lzx-ocboot-ha-test-2 ~]$ cat /etc/kubernetes/manifests/keepalived.yaml | | grep -A 15 env
env:
# 对应 keepalived 的权重,除了 primary_master_node 的 keepalived 会设置 100,其余 master_node 上面的都是 90
- name: KEEPALIVED_PRIORITY
value: "90"
# 设置的 VIP
- name: KEEPALIVED_VIRTUAL_IPS
value: '#PYTHON2BASH:[''10.127.100.102'']'
# 是 BACKUP 角色
- name: KEEPALIVED_STATE
value: BACKUP
# 密码
- name: KEEPALIVED_PASSWORD
value: de17f785
# router id
- name: KEEPALIVED_ROUTER_ID
value: "12"
# 改节点网卡实际 ip
- name: KEEPALIVED_NODE_IP
value: 10.127.100.229
# keepalived 绑定的网卡
- name: KEEPALIVED_INTERFACE
value: eth0
image: registry.cn-beijing.aliyuncs.com/yunionio/keepalived:v2.0.25

# 查看集群当前只有 2 个节点
[root@lzx-ocboot-ha-test-2 ocboot]$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
lzx-ocboot-ha-test-2 Ready master 4h29m v1.15.12 10.127.100.229 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
lzx-ocboot-ha-test-3 Ready master 4h28m v1.15.12 10.127.100.226 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5

现在使用 ocboot 加入新的节点,编辑 ocboot yaml 配置:

  • 把当前的 lzx-ocboot-ha-test-2 这个节点当做 primary_master_node
  • 需要新加入 master_node 节点信息:
    • IP: 10.127.100.224
    • Name: lzx-ocboot-ha-test-4

因为旧的 primary_master_node 已经被删除了,又把当前的 master_node_1 当做新集群的 primary_master_node,现在的配置变成下面这样:

$ cat config-new-k8s-ha.yaml
primary_master_node:
# 这里把之前的 master_node_1 10.127.100.229 当做 primary_master_node
hostname: 10.127.100.229
use_local: false
user: root
onecloud_version: "v3.8.8"
# 数据库连接信息,根据自己的环境填写
db_host: 10.127.100.101
db_user: "root"
db_password: "0neC1oudDB#"
db_port: "3306"
image_repository: registry.cn-beijing.aliyuncs.com/yunionio
ha_using_local_registry: false
node_ip: "10.127.100.229"
# keepalived 暴露出去的 vip
controlplane_host: 10.127.100.102
controlplane_port: "6443"
as_host: true
# 启用 ha ,默认部署 keepavlied
high_availability: true
use_ee: false
# 高可用使用 minio
enable_minio: true
host_networks: "eth0/br0/10.127.100.229"

master_nodes:
# 加入到 10.127.100.102 vip 的 k8s 集群
controlplane_host: 10.127.100.102
controlplane_port: "6443"
# 运行云平台控制相关组件
as_controller: true
# 作为云平台私有云宿主机计算节点
as_host: true
# 启用 keepavlied
high_availability: true
hosts:
- user: root
hostname: "10.127.100.224"
host_networks: "eth0/br0/10.127.100.224"

写好配置后,使用 ocboot 加入新节点。

下载 ocboot 部署工具代码。

# 下载 ocboot 工具到本地
$ git clone -b release/3.11 https://github.com/yunionio/ocboot && cd ./ocboot

然后加入节点。

$ ./run.py config-new-k8s-ha.yaml

等到 ocboot 的 ./run.py 运行完成后,再查看 node 信息,发现新的节点 lzx-ocboot-ha-test-4(10.127.100.224) 已经加入进来:

[root@lzx-ocboot-ha-test-2 ~]$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
lzx-ocboot-ha-test-2 Ready master 5h13m v1.15.12 10.127.100.229 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
lzx-ocboot-ha-test-3 Ready master 5h12m v1.15.12 10.127.100.226 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5
lzx-ocboot-ha-test-4 Ready master 10m v1.15.12 10.127.100.224 <none> CentOS Linux 7 (Core) 3.10.0-1160.6.1.el7.yn20201125.x86_64 docker://20.10.5

修改新的 primary_master_node keepalived 的权重和角色

[root@lzx-ocboot-ha-test-2 ~]$ vim /etc/kubernetes/manifests/keepalived.yaml
...
# 把权重改成 100
- name: KEEPALIVED_PRIORITY
value: "100"
...
# 把角色改成 MASTER
- name: KEEPALIVED_STATE
value: MASTER

5. 恢复有状态的 pod

新的 master 节点加入后,会发现原来的有状态 pod 还是 Pending,接下来就需要删除旧的 pvc ,把它们启动到新的 master 节点上。

# 查看 Pending 状态的 pod
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -A | grep Pending
onecloud-minio minio-0 0/1 Pending 0 74m
onecloud-minio minio-3 0/1 Pending 0 74m
onecloud-monitoring monitor-minio-0 0/1 Pending 0 74m
onecloud-monitoring monitor-minio-3 0/1 Pending 0 74m
onecloud default-influxdb-85945647d5-x5sv5 0/1 Pending 0 84m

# 把现在的 primary_master_node 和旧的 master_node 先禁止调度,这样可以保证后面的有状态 pod 创建到新的 master 节点上
# 保证 minio 多副本打散到不同的 master 节点
[root@lzx-ocboot-ha-test-2 ~]$ kubectl cordon lzx-ocboot-ha-test-2 lzx-ocboot-ha-test-3

首先恢复 onecloud-minio 里面的 minio statefulset 组件,因为里面存储了 glance 依赖的镜像,通过之前的命令发现 onecloud-minio namespace 里面的 minio-0 和 minio-3 是 Pending 状态,接下来删除它们依赖的 pvc ,然后 pod 就会重启到新的 master_node 节点上,操作如下:

# 找到对应的 pvc
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pvc -n onecloud-minio | egrep 'minio-0|minio-3'
export-minio-0 Bound pvc-297ed5e5-66c8-4855-8031-c65a0ccfa4d0 1Ti RWO local-path 4h55m
export-minio-3 Bound pvc-3dd54509-7745-47dd-84ea-fbacfe1e2f5b 1Ti RWO local-path 4h55m

# 删除 pvc
[root@lzx-ocboot-ha-test-2 ~]$ kubectl delete pvc -n onecloud-minio export-minio-0 export-minio-3

# 删除 pod
[root@lzx-ocboot-ha-test-2 ~]$ kubectl delete pods -n onecloud-minio minio-0 minio-3

# 查看新创建启动 minio-0 和 minio-3 ,已经启动到新的 node lzx-ocboot-ha-test-4 上
[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -n onecloud-minio -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
minio-0 1/1 Running 0 7s 10.40.103.200 lzx-ocboot-ha-test-4 <none> <none>
minio-1 1/1 Running 0 5h1m 10.40.158.215 lzx-ocboot-ha-test-3 <none> <none>
minio-2 1/1 Running 0 5h1m 10.40.159.22 lzx-ocboot-ha-test-2 <none> <none>
minio-3 1/1 Running 0 14s 10.40.103.199 lzx-ocboot-ha-test-4 <none> <none>

# 查看 minio-3 的日志,发现已经自愈完毕
[root@lzx-ocboot-ha-test-2 ~]$ kubectl logs -n onecloud-minio minio-3
....
Healing disk '/export' on 1st pool
Healing disk '/export' on 1st pool complete
Summary:
{
"ID": "0e5c1947-44f0-4f8a-b7f0-e3a55f441d6f",
"PoolIndex": 0,
"SetIndex": 0,
"DiskIndex": 3,
"Path": "/export",
"Endpoint": "http://minio-3.minio-svc.onecloud-minio.svc.cluster.local:9000/export",
"Started": "2022-04-13T06:49:29.882069559Z",
"LastUpdate": "2022-04-13T06:49:59.564158167Z",
"ObjectsHealed": 10,
"ObjectsFailed": 0,
"BytesDone": 1756978429,
"BytesFailed": 0,
"QueuedBuckets": [],
"HealedBuckets": [
".minio.sys/config",
".minio.sys/buckets",
"onecloud-images"
]
}
...

# 也可以登录到 lzx-ocboot-ha-test-4 上,查看 local-path csi 里面的 minio onecloud-images bucket 里面有没有对应的镜像 parts
$ ssh root@10.127.100.224

# 进入对应的 pvc 目录,目录名可以使用 kubectl get pvc -n onecloud-minio | grep minio-3 获得
[root@lzx-ocboot-ha-test-4 ~]$ cd /opt/local-path-provisioner/pvc-352277cb-e69d-41bf-b58a-d65cb1e4e6f8/
[root@lzx-ocboot-ha-test-4 pvc-352277cb-e69d-41bf-b58a-d65cb1e4e6f8]$ du -smh onecloud-images/
838M onecloud-images/

然后使用恢复 onecloud-minio 相同的方法,恢复 monitor-minio 就可以,参考命令如下:

[root@lzx-ocboot-ha-test-2 ~]$ kubectl  delete pvc -n onecloud-monitoring export-monitor-minio-0 export-monitor-minio-3
persistentvolumeclaim "export-monitor-minio-0" deleted
persistentvolumeclaim "export-monitor-minio-3" deleted

[root@lzx-ocboot-ha-test-2 ~]$ kubectl delete pods -n onecloud-monitoring monitor-minio-0 monitor-minio-3
pod "monitor-minio-0" deleted
pod "monitor-minio-3" deleted

[root@lzx-ocboot-ha-test-2 ~]$ kubectl get pods -n onecloud-monitoring | grep minio
monitor-minio-0 1/1 Running 0 24s
monitor-minio-1 1/1 Running 0 5h17m
monitor-minio-2 1/1 Running 0 5h17m
monitor-minio-3 1/1 Running 0 24s

恢复 influxdb deployment,influxdb 和 minio 不太一样,minio 使用的 statefulset 管理,删掉 pod 和 pvc 后,k8s 会自动创建对应序号的 pod 和 pvc ,但是 deployment 不会,所以恢复 influxdb 的步骤是删掉 pvc,然后同时删掉 default-influxdb 这个 deployment ,我们的 onecloud-operator 组件就会新建对应的资源,步骤如下:

# 恢复 influxdb
[root@lzx-ocboot-ha-test-2 ~]# kubectl delete pvc -n onecloud default-influxdb
[root@lzx-ocboot-ha-test-2 ~]# kubectl delete deployment -n onecloud default-influxdb
[root@lzx-ocboot-ha-test-2 ~]# kubectl get pods -n onecloud | grep influxdb
default-influxdb-85945647d5-mdd2z 1/1 Running 0 7m44s

现在所有在旧的 primary_master_node 上的组件都恢复,接下来把禁止调度的节点启用调度:

[root@lzx-ocboot-ha-test-2 ~]$ kubectl uncordon lzx-ocboot-ha-test-2 lzx-ocboot-ha-test-3