Stop keepalived first to reduce K8s API endpoint downtime during upgrades

During the master node upgrade that is part of a cluster upgrade, customers currently lose access to the K8s API endpoint for up to 15-30 seconds because the VIP changes hands. The VIP fails over immediately if the node or its network goes down. A stopped apiserver, however, is only noticed by the keepalived health check, which is configured to run every 10 seconds. Thus, if the K8s apiserver goes down right after a health check, it takes 9-10 s until the next check, plus the election time, plus the time for the upstream switch to update its cache.
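To make the arithmetic above concrete, here is a minimal sketch of how the delay adds up. Only the 10-second check interval comes from the description; the function names are made up for illustration, and the election and switch cache times are placeholder parameters, not measured values.

```python
def detection_delay(failure_offset_s: float, check_interval_s: float = 10.0) -> float:
    """Seconds until the next health check runs, for a failure that happens
    failure_offset_s seconds after the previous check."""
    if not 0 <= failure_offset_s < check_interval_s:
        raise ValueError("failure_offset_s must be within one check interval")
    return check_interval_s - failure_offset_s


def switchover_time(
    failure_offset_s: float,
    election_s: float,        # placeholder: time for the VRRP election
    switch_cache_s: float,    # placeholder: time for the upstream switch cache update
    check_interval_s: float = 10.0,
) -> float:
    """Total time the VIP is unreachable: detection + election + cache update."""
    return detection_delay(failure_offset_s, check_interval_s) + election_s + switch_cache_s
```

With the apiserver stopping half a second after a check, `detection_delay(0.5)` gives 9.5 seconds of waiting before the failover even starts, which is the part this request aims to eliminate.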

What can be done: Bring down keepalived first, forcing a VIP failover, before bringing down the K8s apiserver as part of the pf9-kube stop process. This avoids waiting for the next health check and should significantly reduce the switchover time during an upgrade (as a percentage of the total time it takes now).
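The proposed ordering could be sketched roughly as follows. This is not the actual pf9-kube stop logic: the function name and the three callables are hypothetical hooks, so the sketch only shows the sequence (stop keepalived, wait for the VIP to leave the node, then stop the apiserver).

```python
import time
from typing import Callable


def stop_with_vip_handoff(
    stop_keepalived: Callable[[], None],   # hypothetical hook: stops keepalived on this node
    vip_is_local: Callable[[], bool],      # hypothetical hook: True while the VIP is still on this node
    stop_apiserver: Callable[[], None],    # hypothetical hook: stops the K8s apiserver
    timeout_s: float = 10.0,
    poll_s: float = 0.2,
) -> bool:
    """Stop keepalived first so the VIP moves away, then stop the apiserver.

    Returns True if the VIP left this node before the timeout.
    """
    stop_keepalived()

    # Poll until the VIP is gone from this node or the timeout expires.
    deadline = time.monotonic() + timeout_s
    released = not vip_is_local()
    while not released and time.monotonic() < deadline:
        time.sleep(poll_s)
        released = not vip_is_local()

    # Proceed with the apiserver stop either way so the upgrade is not blocked;
    # the return value lets the caller log a slow or failed handoff.
    stop_apiserver()
    return released
```

On a systemd host, `stop_keepalived` might wrap something like `subprocess.run(["systemctl", "stop", "keepalived"], check=True)`, assuming keepalived runs as a unit with that name; how the VIP check and apiserver stop are wired in depends on the pf9-kube internals, which are not described here.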

  • Guest
  • Oct 6 2020