We repeatedly run out of resources on certain nodes but only discover it late. We lose time and efficiency triaging the issue, only to realize it's a capacity problem that will take a few weeks to address.
We would like configurable dashboards and alerts for monitoring:
1. CPU
2. Memory
3. Storage
4. Network (possibly)
An example alert: if CPU utilization on a host exceeds 95% for 10 minutes, generate an alert and send out an email.
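
To make the example alert concrete, here is a minimal Python sketch of a "threshold sustained for a duration" rule. It is only an illustration of the requested behavior, not a proposed implementation: the class name `SustainedThresholdRule`, its parameters, and the 60-second sampling interval are all assumptions, and metric collection and email delivery are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SustainedThresholdRule:
    """Hypothetical rule: fire when a metric stays above a threshold for a set time."""

    threshold_pct: float = 95.0   # from the example: 95% utilization
    duration_s: float = 600.0     # from the example: 10 minutes
    _breach_start: Optional[float] = field(default=None, init=False)
    _fired: bool = field(default=False, init=False)

    def observe(self, timestamp_s: float, value_pct: float) -> bool:
        """Record one sample; return True once per sustained breach."""
        if value_pct <= self.threshold_pct:
            # Back at or under the threshold: reset so a later breach can fire again.
            self._breach_start = None
            self._fired = False
            return False
        if self._breach_start is None:
            self._breach_start = timestamp_s
        sustained = timestamp_s - self._breach_start >= self.duration_s
        if sustained and not self._fired:
            self._fired = True
            return True  # the caller would send the email here
        return False


# Assumed sampling: one CPU reading every 60 seconds, all at 97%.
rule = SustainedThresholdRule()
fired = [rule.observe(t, 97.0) for t in range(0, 661, 60)]
```

With these samples the rule fires once, at the 600-second reading, and stays silent afterwards until utilization drops to or below the threshold and a new breach begins. Keeping the threshold and duration as parameters is what would make such alerts configurable per host and per resource (CPU, memory, storage, network).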