Improved PMO Host-side monitoring to detect issues more quickly

  1. Unplug IPMI network cable - Whistle was slow to raise this alert. It should not take more than ~30 minutes to be notified that a host is down.

  2. Power off host - we eventually received alerts for the OpenStack host, but the K8s cluster notified us of the issue much more quickly. This would have been a problem if the host had not also been running K8s.

  3. Disk fill on host - we filled up the root volume on the host. Because the neutron services never stopped, the host still showed as connected even though it was unusable. pf9-muster died, and the VMs on the host went into a paused state. Again, the only reason we got an alert is that the K8s cluster died. We would not have been alerted if the host had not been running K8s.
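
To illustrate the kind of host-side checks being requested, here is a minimal sketch in Python. It is not how PMO or Whistle works internally. The function names, the 90% disk threshold, and the 5-minute heartbeat timeout are all assumptions chosen for illustration. The first check covers scenario 3 (a full root volume on a host that still looks connected), the second covers scenarios 1 and 2 (a host that has gone silent).

```python
import shutil
import time

# Hypothetical thresholds for illustration; not taken from PMO or Whistle.
DISK_USED_ALERT_FRACTION = 0.90
HEARTBEAT_TIMEOUT_SECONDS = 5 * 60


def disk_alert(path="/", threshold=DISK_USED_ALERT_FRACTION, usage=None):
    """Return an alert message if the volume holding `path` is at or above
    `threshold` (fraction used), otherwise None.

    `usage` may be any object with `total` and `used` attributes; when it is
    omitted, the real values are read with shutil.disk_usage().
    """
    if usage is None:
        usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction >= threshold:
        return f"disk {path} is {used_fraction:.0%} full"
    return None


def heartbeat_alert(last_seen, now=None, timeout=HEARTBEAT_TIMEOUT_SECONDS):
    """Return an alert message if the host has been silent for longer than
    `timeout` seconds, otherwise None. Times are Unix timestamps."""
    if now is None:
        now = time.time()
    silent_for = now - last_seen
    if silent_for > timeout:
        return f"no heartbeat for {silent_for:.0f}s"
    return None


if __name__ == "__main__":
    # Check the local root volume and a pretend last-heartbeat timestamp.
    print(disk_alert("/"))
    print(heartbeat_alert(last_seen=time.time() - 10))
```

The point of checking disk usage directly on the host is that it does not depend on a service such as neutron stopping, or on a K8s cluster happening to run on the same host.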

  • Guest
  • Oct 31 2019
  • Planned
  • Guest commented
    May 04, 2020 21:10

    Being considered for 4.5 or 4.6 release of PMO