It has been a mad mad mad week over here at Pulsed Media!
One of our server models, which we have in large numbers, is having serious issues. These servers are now crashing "left and right" with no obvious reason, and remote management is extremely flaky on this model (American Megatrends MegaRAC), so doing hard reboots and debugging this has been a nightmare.
This began some time ago, but it was a very random, very rare occurrence. Since it could take 4-5 months for a node to crash again, and there have never been any error messages or solid leads, never mind a way to reproduce it, the issue has been really hard to debug. The only common factors are: it is this specific server model, the server is not idling and has some load on it, and the server sometimes comes back just enough to process some automation commands. But even the load could be as low as 50Mbps on average. And that coming back just enough to process some commands means we don't always catch them immediately, and a half-down server could linger on ... We monitor actual responsiveness, not just "ping" to the server; while that ensures the server is actually up and not fake-up (ping responds, nothing else), it also means our downtime threshold is rather high to avoid false alerts.
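As a rough illustration of what we mean by monitoring actual responsiveness rather than ping, here is a minimal sketch in Python. The hostname, port and threshold are made up for the example; the only point is that a node counts as "up" when a real service answers, not when ICMP does.

```python
# Minimal sketch: treat a node as "up" only if a real service answers,
# not just ICMP. Hostname, port and threshold below are hypothetical.
import socket

def service_responds(host, port=22, timeout=5.0):
    """Return True only if a TCP service actually sends us some bytes."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return len(sock.recv(1)) > 0   # e.g. SSH sends its banner first
    except OSError:
        return False

if __name__ == "__main__":
    host = "node123.example.net"           # hypothetical node name
    failures_before_alert = 6              # high threshold to avoid false alerts
    print("responsive" if service_responds(host) else "no real response")
```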
We have tried a myriad of kernel settings as well, and both older and newer software versions. Complete hardware replacements too, of course, since everything pointed to a hardware issue, but generally that achieved exactly nothing. The only thing remaining common on those setups is the drives, but even the drive models, sizes etc. vary between crashy nodes. Our main lead at the time was that it must be capacitors failing on the motherboard, due to the sheer weirdness of the crashes and issues; failing caps can cause all kinds of weird problems.
Crashes were rare, but have been becoming more common in the past 6 months. There was no pattern until about a month ago, when the same servers started crashing frequently. Roughly a week ago all this changed: we were facing large numbers of crashes, and despite knowing remote management was set up, tested and functioning correctly during the HW swaps, many of the controllers failed to respond. We had a lot of crashes, but "fortunately" there were some nodes which crashed every single day, or even within hours. Finally we could do proper testing!
The motherboard is a very obscure proprietary form factor; the manufacturer doesn't even list any BIOS updates for it. The "OEM" vendor has a couple of BIOS versions, but even those were not a common factor.
Finally we found that some people with similar setups had issues with system chipset cooling. Finally, a lead! We started looking into this and added direct chipset cooling on the crashiest nodes. This seemed to help; the worst offenders are still up many days later! Unfortunately, yesterday we did this for more nodes, and one of them crashed soon after. This is still statistically insignificant but worrisome; that node could actually have a bad motherboard, bad capacitors etc. But enter the typical Finnish problem: no one has any stock, and even common parts are hard to find! We have to wait for more 40mm fans to arrive all the way from Germany, because in Finland they are hard to get hold of without unknown delivery times or stupidly high prices. Can you imagine paying 15€ for the cheapest possible 40mm fan (this is Akasa's small fan, which I recall normally going for more like 1.50€!), and they don't even tell you how many they have for sale?! And we need something like 60 of them, by yesterday, thank you!
Fortunately we can use 120mm fans in this server model with some modding, and albeit at stupidly high prices once again, there are at least some available just a 1-hour drive away. Only about 4-5x the normal market price each.
How can a chipset overheat?! Never heard of this?!
Fairly simple, actually. Our normal maintenance and server build schedule includes swapping CPU thermal compound; we use the best on the market that we can buy in larger tubes from our usual vendors (Arctic MX-4 is available in 20g tubes). We also remove the 2nd CPU, because seedbox: no need for that much CPU power. However, the chassis fans set their speed according to CPU temps only, and even a CPU temp reading as high as 55C (100% load) does not increase the fan speeds. The fan curves cannot be modified. Meanwhile, the chipset has undersized heatsinks and thermal compound you cannot even call a compound: it has solidified completely and become so hard it is actually difficult to even remove.
Alternatively, it could be that the fan speeds are controlled by the chassis electronics, and thus CPU temp does not affect them. Since we run quite a low ambient (21-27C everywhere in the DC, averaging around 23C), the fans mostly "idle" (4800rpm) with the cold/hot aisle setup we have.
Thus the chipset overheats, and since remote management has to go via the chipset, which under thermal protection is unable to route signals ...
Solutions, Solutions!
Replacing the chipset compound is either a long process or a bit of a janky solution. Just adding 40mm fans on the northbridge and southbridge gets temps down by 30C; they screw right into place and in fact look quite professional, almost like they could be factory fitted. Using 120mm fans is a bit janky but cools a lot of other smaller chips too. However, each fan has to be modded a little to fit, and they sit ever so slightly higher, so they barely get any intake air and partially block a lot of the airflow through the case.
The janky solution is just to cut the plastic studs holding the heatsink on the motherboard, clean everything up, put a dab of MX-4 on the chip, then thermal glue all around, and finally hot glue to replace the studs. Very janky, a little bit like "arts & crafts", but this we can do fairly quickly. The proper solution would be to remove the motherboard completely from the chassis, use pliers to open the now very brittle studs, clean up, put in proper thermal compound, and hope the studs did not break and still hold; or alternatively replace the heatsinks with larger copper ones, but chipset heatsinks taller than 11mm with the right mount are a bit hard to find, even from China. We found some 40x40x11mm COPPER heatsinks from China, but they are a rather expensive solution and will take way too many months to arrive.
Replacing the compound plus adding 40mm fans is a bit of a difficult combination without waiting hours for the thermal glue to solidify, and as we know, hot glue is not exactly stiff; thus we would prefer the thermal glue to solidify fully before mounting the 40mm fans and closing the lid. The issue is the cabling: without the glue set, the fan cables are stiff enough that they could potentially unseat the heatsink, as we need to stretch the cabling to reach the fan header connectors.
Fortunately this motherboard ships with 2 unused fan headers :)
Why did we not notice chipset temps earlier?
We had noticed years ago that the chipsets run at a bit of a high temperature, but we had never seen a chipset overheat, had no issues, and trusted the motherboard manufacturer to know what they are doing and the chipset to have a very high thermal limit etc.
On top of that, when servers crash remote management is not always available, and there were no alerts for chipset temperature! The few readings we did get remained below the alert threshold of 98C.
However, we did also notice during this past week that the temperature readings are about 10-15C off, for the CPU at the very least.
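To illustrate the kind of alert we were missing, below is a rough Python sketch that reads lm-sensors output and warns well below the 98C limit, with an offset for optimistic readings. The sensor label matching, the 80C alert level and the 15C correction are our own assumptions for the example; exact sensor names vary per board, and some boards do not expose the chipset sensor to the OS at all.

```python
# Rough sketch of a chipset-temperature alert. The 98C vendor threshold and
# the 10-15C reading offset come from this post; the label matching, the 80C
# alert level and the parsing of `sensors -u` output are assumptions.
import re
import subprocess

READING_OFFSET_C = 15.0   # assume readings may be ~10-15C optimistic
ALERT_AT_C = 80.0         # alert well below the 98C limit, not at it

def chipset_temps():
    """Parse `sensors -u` and return temps whose chip or label mentions a PCH/chipset."""
    out = subprocess.run(["sensors", "-u"], capture_output=True, text=True).stdout
    temps, chip, label = {}, "", ""
    for line in out.splitlines():
        if line and not line.startswith((" ", "\t")):
            if line.endswith(":"):
                label = line[:-1]              # feature label, e.g. "PCH Temp"
            elif not line.startswith("Adapter:"):
                chip = line                    # chip name, e.g. "pch_*-virtual-0"
        m = re.match(r"\s+temp\d+_input:\s+([\d.]+)", line)
        if m and re.search(r"pch|chipset", chip + " " + label, re.I):
            temps[chip + "/" + label] = float(m.group(1))
    return temps

if __name__ == "__main__":
    for name, temp in chipset_temps().items():
        corrected = temp + READING_OFFSET_C
        if corrected >= ALERT_AT_C:
            print("ALERT: %s at ~%.0fC (raw %.0fC)" % (name, corrected, temp))
```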
Early test results?
Let's hope that one server was an anomaly! On the others we even saw hints of increased performance after adding the active cooling. All but one test server have remained stable so far; it could be something silly like forgetting to plug in the fan power wire, as mistakes do happen during long days.
Alternative solution?
We are already testing a new server model which looks quite promising, but there are a lot of steps to qualification. Unfortunately, we have already found a design issue with that chassis which will take time to solve, plus some configuration issues. Once we have figured it all out, it will be easy and fast to start rolling them out.
Want a replacement server?
If you know your server is affected by this and you want a replacement ASAP, contact support and we will arrange it for you promptly. If you do not need data migration, there are extra service days to be had.
Since we are swamped with all these fixes right now, be prepared to wait as long as 24 hours for a response. Most tickets still get sorted out in less than 12 hours though.
EDIT: A couple more announcements relating to this:
0 servers down
1 server down
Saturday, August 11, 2018