We have recently been experiencing issues on some of our servers. As many of you may know, we recently moved to a new software and hardware paradigm due to the ongoing energy crisis. This move has brought many benefits in terms of performance and energy efficiency, but it has also brought some new issues that we are currently working to resolve.
One of these issues is the occasional, unexpected halting of some of our servers. The system freezes completely, with no hardware faults or errors reported. A reset typically restores service, but we are still working to find the root cause.
In some cases, the affected servers show high CPU usage, but not always. The common characteristic is that the console accepts no input of any kind, and network and disk I/O throughput drops to zero. I/O metrics sometimes show high values just before the halt, but this is not consistent either. There is no discernible pattern in the usage levels of the affected servers: some were under high load when they halted, and others under low load.
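For anyone who wants to check their own metrics for this signature, the following is a minimal sketch in Python, not our actual monitoring code. It assumes throughput samples are collected from outside the affected server (for example by a hypervisor or a network switch), since a frozen system cannot report its own metrics. The names `Sample`, `find_stall_start`, and `min_consecutive` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    """One externally collected measurement (hypothetical layout)."""
    timestamp: float         # seconds since some reference point
    net_bytes_per_s: float   # network throughput
    disk_bytes_per_s: float  # disk I/O throughput


def find_stall_start(samples: Sequence[Sample],
                     min_consecutive: int = 3) -> Optional[float]:
    """Return the timestamp at which the first run of at least
    `min_consecutive` samples with zero network AND zero disk
    throughput begins, or None if there is no such run."""
    run_start: Optional[float] = None
    run_len = 0
    for s in samples:
        if s.net_bytes_per_s == 0 and s.disk_bytes_per_s == 0:
            if run_len == 0:
                run_start = s.timestamp
            run_len += 1
            if run_len >= min_consecutive:
                return run_start
        else:
            # Any activity resets the run: a brief idle moment is not a halt.
            run_len = 0
    return None


if __name__ == "__main__":
    history = [
        Sample(0, 100.0, 50.0),
        Sample(1, 900.0, 800.0),  # activity just before the halt
        Sample(2, 0.0, 0.0),
        Sample(3, 0.0, 0.0),
        Sample(4, 0.0, 0.0),
    ]
    print(find_stall_start(history))
```

Requiring several consecutive zero samples is a design choice to avoid flagging a server that is merely idle for a moment. The right threshold depends on the sampling interval and on how quiet the server normally gets.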
We are monitoring the issue and would welcome information from anyone who has seen similar behavior on a similar configuration. We understand that this is frustrating for our clients, and we want to assure you that we are working diligently to resolve it as quickly as possible.
We want to stress that it is much better to solve this issue by finding and fixing the root cause than by relying on a watchdog to reboot the server. Each forced reboot carries a small risk of data corruption and other problems.
We apologize for any inconvenience this may cause, and we will keep you updated on the progress of our investigation. If you have any further questions or concerns, please don't hesitate to reach out to our support team.
Thank you for your continued support and patience.
Tuesday, January 24, 2023
