We are going to add several new arrays and several new storage servers over the next few days. We are now targeting smaller individual arrays, each with its own SSD cache, for more stable performance and less impact on the rest of the nodes.
Solo1:
5x3TB 7200RPM with small SSD cache drive, RAID5
Jabba2:
5x3TB 7200RPM with small SSD cache drive, RAID5
Solo2:
5x2TB 7200RPM, RAID5
Jabba1:
Will import the array from Solo1 to migrate the remainder of the data off the initial ZFS pool, which is underperforming.
This will vacate a number of 3TB drives, which will be used to create a RAID6 array on Jabba1 with a 256GB SSD cache in writeback mode. For Jabba1 we are likely going with 12x3TB in RAID6 or 2x 5x3TB in RAID5 a couple of weeks after that.
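To put the two Jabba1 options in perspective, here is a quick back-of-the-envelope sketch (plain Python, purely illustrative, not part of our tooling) of the usable capacity each would give with the drive counts and sizes mentioned above:

    # Back-of-the-envelope usable capacity for the Jabba1 options above.
    # RAID5 loses one drive's worth of capacity to parity, RAID6 loses two.

    def raid5_usable(drives: int, size_tb: float) -> float:
        return (drives - 1) * size_tb

    def raid6_usable(drives: int, size_tb: float) -> float:
        return (drives - 2) * size_tb

    # Option A: a single 12x3TB RAID6 array
    option_a = raid6_usable(12, 3)        # 30 TB usable, survives any 2 drive failures

    # Option B: two separate 5x3TB RAID5 arrays
    option_b = 2 * raid5_usable(5, 3)     # 24 TB usable, each array survives 1 failure

    print(f"12x3TB RAID6   : {option_a:.0f} TB usable")
    print(f"2x 5x3TB RAID5 : {option_b:.0f} TB usable")

The RAID6 option gives more usable space and survives two simultaneous drive failures, at the cost of a single, larger rebuild domain; the RAID5 split keeps each rebuild smaller.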
We have reserved 6 of the small "Solo" type storage servers. Each has 4GB of RAM for cache and 2x Gbit links, and can sustain ~220MB/s disk speeds (roughly triple what an array like that can sustain under heavy random I/O). Each has 6 drive bays, 1 of which will be used for an SSD boot + cache drive and the remainder for a RAID5 array.
Each of these "Solo" arrays will be used for max of 8 nodes, depending upon the setup and target, most likely for 5-6.
The initial Jabba1 pool has now suffered a 38% disk failure rate. All of them were new drives, so we got a very bad batch for the first array; this, combined with ZFS basically being falsely marketed as outperforming RAID5, has been the culprit for most of the issues.
istgt as the iSCSI target will be dropped. We will soon be testing whether the Risingtide Systems iSCSI target can support the special characters in the passwords currently in use, or whether we should manually update all the vnode OS images with new passwords in order to migrate to it.
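Until that testing is done we don't know which characters the new target will accept, but a hypothetical pre-check along these lines could tell us in advance how many vnode images would need new passwords. The "safe" character set below is an assumption for illustration only, not the documented rule of any particular target:

    import string

    # Hypothetical pre-check: flag CHAP secrets that contain characters outside a
    # conservative set, so we know up front which vnode images might need new
    # passwords before switching iSCSI targets. The allowed set is an assumption.
    SAFE_CHARS = set(string.ascii_letters + string.digits + "._-")

    def needs_review(secret: str) -> bool:
        """Return True if the secret uses characters outside the assumed safe set."""
        return any(ch not in SAFE_CHARS for ch in secret)

    secrets = {
        "vnode01": "s3cr3t-pass",    # only safe characters
        "vnode02": "p@ss/word!",     # '@', '/', '!' would be flagged
    }

    for node, secret in secrets.items():
        status = "review" if needs_review(secret) else "ok"
        print(f"{node}: {status}")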
The past month's performance and stability have been completely unacceptable, unsustainable and incomprehensible. Murphy's law at play here: when it rains, it pours. No one on our team has ever experienced such a degree of weird issues as we have had to endure during the past month, starting from failing new fans, power Y-cables randomly losing connection, SATA cables giving issues, a 38% first-month disk failure rate from a single batch (while other batches have 0%), almost none of the software used working as expected and advertised, causing random issues, and logging so sparse that it is useless for debugging the issues in question.
But we have learned a lot during the past month and are heading in the right direction; several key portions of the infrastructure are now very stable and we can build on that.
As a backup plan, we are setting up several systems with local disks, only on the high end and only a handful.
I'm fully confident that we will reach full stability and mass production during August.
Wednesday, August 7, 2013