The issue was with one of the power distribution modules (PDMs). These have built-in heat-based "fuses" (thermal fuses). The PDM being situated very near the server exhaust does not exactly help, as the heat significantly lowers the fuse's effective rating, so the thermal fuse triggered despite having functioned for nearly 2½ years at this load level. That load level is also nowhere near the thermal fuse rating, and our stats confirm this, with plentiful headroom.
It seems this was just one of those cases. We use high-quality and very expensive DC PDMs (Rittal), so they should be able to handle 24/7 operation up to their rating, but maybe this was just one of those odd glitches that happen every now and then. Just to be certain, we moved several servers from this PDM to another PDM.
Wednesday, June 20, 2018