kubernetes-master : 3 (1 machine per instance)
kube-api-loadbalancer + easyrsa : 1 of each (colocated on 1 machine; I don't separate these two charms)
etcd : 5 (1 machine per instance)
kubernetes-worker : 8 (5 physical machines, 3 on AWS)
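For reference, a rough sketch of how this topology can be laid out with Juju (not our exact bundle or commands; I'm assuming the `kubeapi-load-balancer` and `easyrsa` charms from Charmed Kubernetes, relations and placement details omitted):

```bash
# Rough equivalent of the topology above (sketch, not our exact deployment)
juju deploy kubeapi-load-balancer          # gets its own machine, e.g. machine 0
juju deploy easyrsa --to 0                 # colocated on the same machine
juju deploy kubernetes-master -n 3         # 1 machine per unit
juju deploy etcd -n 5                      # 1 machine per unit
juju deploy kubernetes-worker -n 8         # 5 physical + 3 AWS machines
```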
1) Everything was fine: all the resilience tests (rebooting various parts: masters, etcd, workers...) passed.
2) Ran `apt update && apt upgrade` everywhere (i.e. on all the units; see the sketch after this list).
3) I noticed that the upgrade included an etcd package on the etcd machines, which worried me a bit, but `etcdctl member list && etcdctl cluster-health` still reported everything healthy (health-check commands below).
4) I rebooted all the machines of the cluster one by one, with ~5 minutes between each.
5) I saw my pods being evicted from each rebooted node and re-created on the already-rebooted nodes, but they kept flapping between Running and CrashLoopBackOff, especially the Ingress controller (inspection commands below).
6) After ~300 automatic restarts it finally settled into the Running state; that was the end of our Friday-night debugging session.
7) I rebooted only one node (the plan was to reboot all of them to re-test resilience and HA before going to prod, since we had fixed the previous day's problem): all of its pods switched to Unknown and no node eviction happened.
8) Flannel reports something like "cannot connect to etcd cluster: it's misconfigured or unavailable" (flannel checks below).
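Step 2 was just package upgrades on every unit; roughly something like this (assuming a Juju 2.x controller, the exact invocation may have differed):

```bash
# Upgrade packages on every machine in the model (step 2), then see
# which machines are asking for a reboot
juju run --all 'apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y upgrade'
juju run --all 'ls /var/run/reboot-required 2>/dev/null || true'
```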
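For steps 3 and 8, these are the kind of etcd health checks I mean, run on an etcd unit. On a TLS-enabled deployment you also need the client certificate flags (`--ca-file/--cert-file/--key-file` for the v2 commands, `--cacert/--cert/--key` with `ETCDCTL_API=3`); the paths depend on the charm and are not shown here:

```bash
# etcd v2 API view (what flannel talks to)
etcdctl member list
etcdctl cluster-health

# If the package upgrade moved the cluster to etcd 3.x, check the v3 view too
ETCDCTL_API=3 etcdctl endpoint health
ETCDCTL_API=3 etcdctl endpoint status -w table
```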
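For steps 5 and 7, standard kubectl inspection is enough to see the eviction / CrashLoopBackOff / Unknown behaviour (placeholders to fill in):

```bash
# Node conditions: Ready / NotReady / Unknown after each reboot
kubectl get nodes -o wide

# Everything that is not Running, cluster-wide
kubectl get pods --all-namespaces -o wide | grep -v Running

# Why a pod is in CrashLoopBackOff: events + logs of the previous attempt
kubectl -n <namespace> describe pod <pod-name>
kubectl -n <namespace> logs <pod-name> --previous

# Why a node's pods went Unknown: node conditions and kubelet heartbeat
kubectl describe node <node-name>
```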
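For step 8, this is where the flannel error shows up and how to check that the worker can actually reach etcd. The systemd unit name and certificate paths are assumptions based on the flannel subordinate charm; adjust them to your machines:

```bash
# On the rebooted worker: flannel daemon state and recent logs
systemctl status flannel
journalctl -u flannel --no-pager | tail -n 50

# Raw reachability of the etcd client port from the worker
# (use the CA/cert/key that the flannel charm wrote out on this machine)
curl --cacert <ca.pem> --cert <cert.pem> --key <key.pem> \
     https://<etcd-ip>:2379/health
```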
Conclusion: I think some sh*t from Friday's disaster stayed in etcd, and that's what is causing all this mess.
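To check whether something stale really is sitting in etcd: flannel keeps its network config and subnet leases under the `/coreos.com/network` prefix in the etcd v2 keyspace by default (the prefix is configurable, so this is an assumption about our deployment; the same TLS flags as above apply):

```bash
# What flannel has stored in etcd: network config and per-node subnet leases
etcdctl ls --recursive /coreos.com/network
etcdctl get /coreos.com/network/config
```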