Sometimes that leaves a server in a state where remote reboot does not work. A remote reboot keeps standby power to the chipset and other components, so the power is never physically hard cycled off and on. On many built-in remote management systems it is closer to just resetting the CPU and RAM. It is an interesting condition, but nothing we have not seen before. The resolution is simple: remove the power cable and reinsert it, so that all power is fully cut.
In total 25 servers were down, and most of them are not coming back via regular remote reboot.
It is also the midsummer holiday in Finland, so we have been running on a skeleton staff. This is a 4-day period when most people leave the cities to enjoy some quiet time.
For some reason, no alerts reached the person responsible for emergency on-call duty. We will look into making sure this kind of alert gets through in future.
We are now scheduling physical on-site intervention for the servers that are still down.
Downtime began on the 18th of June at ~04:35 GMT+2. Full recovery is expected within 3 hours, by 17:30 GMT+2.
UPDATE 1: According to the UPS logs, the brownout happened at ~05:00 GMT+2. One Rittal power distribution module also failed and has been replaced. One switch was down, but it has mostly been used for the management network only, with very few servers attached to it.
UPDATE 2: All servers are back online and the failed PDU has been replaced.
UPDATE 3: We have added a new server monitoring solution from NodePing to get faster response times and avoid this kind of issue in future, along with a new hire for remote hands / holiday emergency on-site contact. His job is to remain in the local area during holidays, ready to handle any alerts swiftly. The infrastructure status page is available at https://nodeping.com/reports/status/RIP4WW2JRY
UPDATE 4: As a result, we hired a new staff member and added another third-party monitoring solution, so there will not be such long lapses in response time in future. Alerts will go out by multiple methods to multiple people. The public status page is viewable and we will integrate it better soon. We now have multiple layers of monitoring and multiple alerting methods, and the new staff member's #1 responsibility will be to stay in the local area on alert during holidays, weekends and similar periods. Essentially, our monitoring failed to reach the right person swiftly, and the new monitoring should fix this. A lot of work remains, however, as we still need to automate which servers are monitored and add more monitors.
Another new layer of monitoring is still being planned and will take a little development work as well. It will add SMS alerts sent directly from our automation server for kinds of data other than simple uptime. Environmentals and infrastructure are already monitored in a similar way (i.e. we have about a dozen temperature monitors and several dozen wattage monitors).
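To illustrate the idea of multiple alerting methods reaching multiple people, here is a minimal Python sketch. It is not our actual tooling: the function names, contact names and stub channels are all hypothetical, and a real setup would wrap whatever SMS, email or third-party monitoring API is in use. The point it shows is that every contact is tried on every channel, and that one failing channel does not stop the others.

```python
from typing import Callable, Dict, List

# Hypothetical channel type: takes (recipient, message) and returns True
# if the message was handed off successfully.
Channel = Callable[[str, str], bool]


def dispatch_alert(
    message: str,
    contacts: Dict[str, List[str]],
    channels: Dict[str, Channel],
) -> Dict[str, List[str]]:
    """Send `message` to every contact over every channel listed for them.

    Returns a map of contact name -> names of channels that succeeded.
    """
    delivered: Dict[str, List[str]] = {name: [] for name in contacts}
    for name, channel_names in contacts.items():
        for channel_name in channel_names:
            send = channels.get(channel_name)
            if send is None:
                continue  # unknown channel: skip it rather than abort
            try:
                if send(name, message):
                    delivered[name].append(channel_name)
            except Exception:
                # One broken channel must not block the remaining ones.
                continue
    return delivered


def unreached(delivered: Dict[str, List[str]]) -> List[str]:
    """Contacts for whom no channel succeeded; candidates for escalation."""
    return [name for name, ok_channels in delivered.items() if not ok_channels]


# Stub channels for illustration only (assumptions, not real integrations).
def fake_sms(recipient: str, message: str) -> bool:
    return True


def fake_email(recipient: str, message: str) -> bool:
    raise ConnectionError("mail relay unreachable")


report = dispatch_alert(
    "25 servers down after brownout",
    {"on_call": ["sms", "email"], "backup": ["email"]},
    {"sms": fake_sms, "email": fake_email},
)
```

In this sketch the on-call contact is still reached by SMS even though the email channel fails, while `unreached(report)` lists the backup contact, who had no working channel. That list is the signal to escalate further instead of assuming somebody was notified.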
Fortunately only 25 servers were down. The impact on remote management interfaces was bigger, as the failed Rittal PDU took down one switch (which is used almost solely for the management network). A few more servers were rebooted while fixing this.
We are really sorry to everyone affected, and we are applying SLA compensation swiftly. Total downtime for the affected servers was roughly 2 days and 14 hours.
We will strive to do better in future. It is not easy to build, maintain and monitor your own datacenter: it requires a lot of dedicated people, and this time, during the midsummer holiday, our monitoring failed.
UPDATE 5: As part of this, the MDS standard SLA level has been upgraded to Silver, and Power and Power+ now have Gold.
Saturday, June 20, 2020
