
Configurable worker node drain timeout during cluster upgrade

Today during a cluster upgrade, worker nodes are drained and upgraded one at a time to prevent workload downtime. When `kubectl drain` is executed, it uses a 5-minute timeout. Any workload that takes longer than 5 minutes to terminate is forcefully terminated, potentially disrupting the workload.


I have an example workload that may take more than 5 minutes to terminate, so we set its `terminationGracePeriodSeconds` to 2400 seconds (probably longer than it needs to be, but certainly more than 5 minutes). When I upgrade my cluster, this workload regularly gets interrupted and force-killed.
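For reference, a minimal sketch of a pod manifest with a long grace period like the one described above. The pod name and image are hypothetical; the 2400 matches the value mentioned:

```shell
# Hypothetical manifest sketch: a pod that may need up to 40 minutes
# (2400s) to shut down cleanly. Requires kubectl and cluster access.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: slow-shutdown-example        # hypothetical name
spec:
  terminationGracePeriodSeconds: 2400
  containers:
  - name: worker
    image: example.com/slow-worker:latest   # hypothetical image
EOF
```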


We should have a user-configurable drain timeout, or the upgrade should detect the largest `terminationGracePeriodSeconds` across the pods on the node being drained and use that as the timeout, so workloads are not interrupted during a cluster upgrade.
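As an interim workaround, the same idea can be sketched manually: compute the longest grace period among pods on the node and pass it to `kubectl drain --timeout` before triggering the upgrade. This assumes a recent kubectl (the `--delete-emptydir-data` flag) and that you drain the node yourself rather than letting the upgrade do it; the node name is a placeholder:

```shell
# Sketch: drain a node with a timeout derived from the longest
# terminationGracePeriodSeconds of any pod scheduled on it.
NODE=worker-1   # placeholder node name

MAX=$(kubectl get pods --all-namespaces \
  --field-selector spec.nodeName="$NODE" \
  -o jsonpath='{range .items[*]}{.spec.terminationGracePeriodSeconds}{"\n"}{end}' \
  | sort -n | tail -1)

# Give the drain at least the longest grace period on the node.
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data \
  --timeout="${MAX}s"
```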

  • Guest
  • Sep 26 2019